How to deal with redirects to a bookmark within a page in Scrapy (911 error)
I am very new to programming, so apologies if this is a rookie issue. I am a researcher, and I've been building spiders to allow me to crawl specific search results of IGN, the gaming forum. The first spider collects each entry in the search results, along with URLs, and then the second spider crawls each of those URLs for the content.
The problem is that IGN redirects URLs associated with a specific post to a new URL that incorporates a #bookmark at the end of the address. This allows the visitor to the page to jump directly down to the post in question, but I want my spider to crawl over the entire thread. In addition, my spider ends up with a (911) error after the redirect and returns no data. The only data retrieved is from any search results that linked directly to a thread rather than a post.
I am absolutely stumped and confused, so any help would be amazing! Both spiders are attached below.
Spider 1:
    import scrapy

    # Build the list of search-result pages (pages 1 through 4).
    baselineURL = "https://www.ign.com/boards/search/186716896/?q=broforce&o=date&page="
    myURLs = []
    for counter in range(1, 5):
        myURLs.append(baselineURL + str(counter))

    class BroforceIGNScraper(scrapy.Spider):
        name = "foundation"
        start_urls = myURLs

        def parse(self, response):
            # One item per search result: title, author, and link.
            for post in response.css("div.main"):
                yield {
                    'title': post.css("h3.title a::text").extract_first(),
                    'author': post.css("div.meta a.username::text").extract_first(),
                    'URL': post.css('h3 a').xpath('@href').extract_first(),
                }
Spider 2:
    import csv

    import scrapy

    baseURL = "https://www.ign.com/boards/"

    # Read the URLs collected by the first spider and prepend the board's base URL.
    URLlist = []
    with open('BroforceIGNbase.csv', 'r', newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            URLlist.append(baseURL + row['URL'])

    class BroforceIGNScraper(scrapy.Spider):
        name = "posts2"
        start_urls = URLlist
        # handle_httpstatus_list = [301]

        def parse(self, response):
            for post in response.css(".messageList"):
                yield {
                    'URL': response.url,
                    'content': post.css(".messageContent article").extract_first(),
                    'commentauthor': post.css("div.messageMeta a::text").extract_first(),
                    'commentDateTime': post.css('div.messageMeta a span.DateTime').xpath('@title').extract_first(),
                }
python scrapy
I can't reproduce either of the issues you describe, neither the redirect nor the 911 error. Can you share a particular URL where the problems occur? Also, the URL fragment (the # part) should be ignored when making the request, so it is unlikely to play any part in this.
– stranac
Nov 23 '18 at 19:39
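To illustrate the point about the # part: a minimal sketch, using only the standard library, of stripping the fragment from each URL and collapsing post links that point at the same page. The example URLs and the helper names `strip_fragment` and `unique_thread_urls` are hypothetical, made up for illustration; they are not taken from IGN or from the spiders above.

```python
from urllib.parse import urldefrag


def strip_fragment(url):
    """Return the URL without its '#...' fragment."""
    return urldefrag(url).url


def unique_thread_urls(urls):
    """Drop fragments and keep the first occurrence of each resulting URL."""
    seen = set()
    result = []
    for url in urls:
        bare = strip_fragment(url)
        if bare not in seen:
            seen.add(bare)
            result.append(bare)
    return result


# Hypothetical URLs, for illustration only.
example_urls = [
    "https://www.ign.com/boards/threads/example-thread.123/#post-1",
    "https://www.ign.com/boards/threads/example-thread.123/#post-2",
    "https://www.ign.com/boards/threads/example-thread.123/",
]
print(unique_thread_urls(example_urls))
```

All three example links reduce to the same page address, so a spider fed the de-duplicated list would request that page once instead of three times.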
I just reran the spider and for some reason it is working now without me having made any change to the code. Sorry for sending you on a rabbit trail. But it's only pulling the first post on a page - might you know why that's the case? It should run through everything on the page.
– theresearchant
Nov 24 '18 at 2:10
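One possible explanation for getting only the first post, offered as a guess rather than a diagnosis: if `.messageList` matches a single container per page, the loop in Spider 2 runs once, and `extract_first()` then returns only the first match inside that container. The sketch below mimics that logic with the standard library on hypothetical, simplified markup; the `li.message` structure is an assumption and is not copied from IGN's pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: one container holding several posts.
PAGE = """
<ol class="messageList">
  <li class="message"><div class="messageContent">first post</div></li>
  <li class="message"><div class="messageContent">second post</div></li>
  <li class="message"><div class="messageContent">third post</div></li>
</ol>
"""

root = ET.fromstring(PAGE)


def first_content(element):
    """Mimic extract_first(): text of the first messageContent match, or None."""
    match = element.find(".//div[@class='messageContent']")
    return match.text if match is not None else None


# Looping over the container: it matches once, so only one item comes out.
containers = [root]  # the single element with class "messageList"
per_container = [first_content(c) for c in containers]

# Looping over each message: one item per post.
per_message = [first_content(m) for m in root.findall("./li[@class='message']")]

print(per_container)  # ['first post']
print(per_message)    # ['first post', 'second post', 'third post']
```

In Scrapy terms, this would mean iterating over a selector that matches each individual post (for example something like `response.css("li.message")`, which is an assumed selector to be checked against the real page source) and keeping the per-field `extract_first()` calls inside that loop.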
edited Nov 23 '18 at 19:34
stranac
asked Nov 23 '18 at 17:21
theresearchant