Scrape multiple JavaScript-based websites in R
These are my first steps in programming, and I try to learn as much as I can on my own before asking here. Right now, though, I'm stuck after trying several approaches (some my own, some found online).
What I'm trying to do is save multiple complete JavaScript-rendered pages so I can work with them further in R. As far as I understand, this is only possible with PhantomJS. I've managed to write code that loads a single page, but I'm struggling with the loop:
writeLines("var page = new WebPage();
var fs = require('fs');

for (i = 101; i <= 150; i++) {
  page.open('http://understat.com/match/' + i, function (status) {
    just_wait();
  });

  function just_wait() {
    setTimeout(function() {
      fs.write('match' + i + '.html', page.content, 'w');
      phantom.exit();
    }, 2500);
  }
}
", con = "scrape.js")

js_scrape <- function(
    js_path = "scrape.js",
    phantompath = "/Users/Marek/Documents/Programmierung/Startversuche/phantomjs-2.1.1/bin/phantomjs") {
  lines <- readLines(js_path)
  command <- paste(phantompath, js_path, sep = " ")
  system(command)
}

js_scrape()
It's only saving the last page of the loop. From reading other threads, I understand that the problem is that PhantomJS is asynchronous and is essentially closing the pages before they have finished loading. But I could not find a workaround that saves each page to its own file.
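One possible way around this, offered as a minimal sketch rather than a definitive fix: the `for` loop finishes before any `page.open` callback runs, so `i` no longer holds the value it had when the page was requested, and the first `phantom.exit()` ends the whole run. Opening the pages one after another, and exiting only after the last one, avoids both problems. The helper name `forEachSequentially` is made up for this illustration; the URL, file names, and the 2500 ms wait are taken from the code above. It is written in ES5 because PhantomJS 2.1.1 does not support newer syntax.

```javascript
// Calls visit(id, next) for each id in turn. The following id is only
// started once the current visit calls next(); done() runs after the last one.
// (forEachSequentially is a hypothetical helper, not part of PhantomJS.)
function forEachSequentially(ids, visit, done) {
  var index = 0;
  function next() {
    if (index >= ids.length) {
      done();
      return;
    }
    var id = ids[index];
    index += 1;
    visit(id, next);
  }
  next();
}

// Only wire up the scraping when running inside PhantomJS.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var fs = require('fs');

  var ids = [];
  for (var i = 101; i <= 150; i++) {
    ids.push(i);
  }

  forEachSequentially(ids, function (id, next) {
    page.open('http://understat.com/match/' + id, function (status) {
      if (status !== 'success') {
        console.log('Failed to load match ' + id);
        next();
        return;
      }
      // Give the page's own scripts time to render (2500 ms, as in the question).
      setTimeout(function () {
        fs.write('match' + id + '.html', page.content, 'w');
        next();
      }, 2500);
    });
  }, function () {
    // Exit once, after every page has been handled.
    phantom.exit();
  });
}
```

Written to `scrape.js` in place of the original script, this should be runnable through the same `js_scrape()` call. Because each `id` is passed in as a function argument, every callback keeps its own value instead of sharing the loop variable.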
javascript r loops web-scraping phantomjs
edited Nov 24 '18 at 13:03 by Flimzy
asked Nov 23 '18 at 9:51 by MBruceKee