How to deal with redirects to a bookmark within a page in Scrapy (911 error)

I am very new to programming, so apologies if this is a rookie issue. I am a researcher, and I've been building spiders to allow me to crawl specific search results of IGN, the gaming forum. The first spider collects each entry in the search results, along with URLs, and then the second spider crawls each of those URLs for the content.



The problem is that IGN redirects URLs associated with a specific post to a new URL that incorporates a #bookmark at the end of the address. This allows the visitor to the page to jump directly down to the post in question, but I want my spider to crawl over the entire thread. In addition, my spider ends up with a (911) error after the redirect and returns no data. The only data retrieved is from any search results that linked directly to a thread rather than a post.



I am absolutely stumped and confused, so any help would be amazing! Both spiders are attached below.



Spider 1:



import scrapy

# Build the URLs for search-result pages 1 through 4.
baselineURL = "https://www.ign.com/boards/search/186716896/?q=broforce&o=date&page="
myURLs = []
for counter in range(1, 5):
    myURLs.append(baselineURL + str(counter))

class BroforceIGNScraper(scrapy.Spider):
    name = "foundation"
    start_urls = myURLs

    def parse(self, response):
        # One item per search result: title, author and (relative) URL.
        for post in response.css("div.main"):
            yield {
                'title': post.css("h3.title a::text").extract_first(),
                'author': post.css("div.meta a.username::text").extract_first(),
                'URL': post.css('h3 a').xpath('@href').extract_first(),
            }


Spider 2:



import csv

import scrapy

baseURL = "https://www.ign.com/boards/"

# Read the relative URLs collected by Spider 1 and make them absolute.
URLlist = []
with open('BroforceIGNbase.csv', 'r', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        URLlist.append(baseURL + row['URL'])

class BroforceIGNScraper(scrapy.Spider):
    name = "posts2"
    start_urls = URLlist

    # handle_httpstatus_list = [301]

    def parse(self, response):
        for post in response.css(".messageList"):
            yield {
                'URL': response.url,
                'content': post.css(".messageContent article").extract_first(),
                'commentauthor': post.css("div.messageMeta a::text").extract_first(),
                'commentDateTime': post.css('div.messageMeta a span.DateTime').xpath('@title').extract_first(),
            }









  • I can't reproduce either of the issues you describe, neither the redirect nor the 911 error. Can you share a particular URL where the problems occur? Also, the URL fragment (the # part) should be ignored when making the request, so it is unlikely to play any part in this.

    – stranac
    Nov 23 '18 at 19:39
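
To illustrate the point about the # part: the fragment is only meaningful to the browser and is not sent to the server, so a thread URL with or without it refers to the same page. The following is a minimal sketch, assuming you want to normalize the collected links so each thread is requested only once. The helper name `thread_urls` and the sample links are hypothetical and not taken from the question.

```python
from urllib.parse import urldefrag, urljoin

BASE_URL = "https://www.ign.com/boards/"


def thread_urls(relative_links):
    """Resolve links against BASE_URL, drop any #fragment, de-duplicate in order."""
    seen = set()
    result = []
    for link in relative_links:
        # urljoin builds the absolute URL; urldefrag splits off the "#..." part.
        url, _fragment = urldefrag(urljoin(BASE_URL, link))
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Hypothetical links, for illustration only.
links = [
    "threads/example.1/#post-10",
    "threads/example.1/#post-12",
    "threads/other.2/",
]
print(thread_urls(links))
```

Using `urljoin` instead of string concatenation also copes with links that are already absolute, which may or may not matter for the CSV produced by Spider 1.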











  • I just reran the spider and for some reason it is working now without me having made any change to the code. Sorry for sending you on a rabbit trail. But it's only pulling the first post on a page - might you know why that's the case? It should run through everything on the page.

    – theresearchant
    Nov 24 '18 at 2:10


















python scrapy






edited Nov 23 '18 at 19:34 by stranac

asked Nov 23 '18 at 17:21 by theresearchant
