How to limit Apify web crawler scope to first three list pages?
I have written the following web scraper in Apify (jQuery), but I am struggling to limit it to only look at certain list pages.
The crawler scrapes articles I have published at https://www.beet.tv/author/randrews, an author archive spread across 102 paginated index pages, each containing 20 article links. The crawler works fine when executed manually and in full; it gets everything, 2,000+ articles.
However, I wish to use Apify's scheduler to trigger an occasional crawl that only scrapes articles from the first three of those index (LIST) pages (i.e. 60 articles).
The scheduler uses cron and allows settings to be passed via input JSON. As advised, I am using "customData"...
{
"customData": 3
}
... and then the below to take that value and use it to limit...
var maxListDepth = parseInt(context.customData); // Jakub's suggestion, Nov 20 2018
if(!maxListDepth || (maxListDepth && pageNumber <= maxListDepth)) {
context.enqueuePage({
This should allow the script to limit the scope when executed via the scheduler, but to carry on as normal and get everything in full when executed manually.
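To spell out that intent, here is a minimal standalone sketch of how I expect the guard to behave, assuming customData is simply absent on manual runs; it only mirrors the check used in the crawler script further down, nothing Apify-specific:
function shouldEnqueueList(pageNumber, customData) {
    // parseInt returns NaN when customData is missing or non-numeric; NaN is falsy, so no cap applies
    var maxListDepth = parseInt(customData);
    return !maxListDepth || pageNumber <= maxListDepth;
}
console.log(shouldEnqueueList(2, 3));           // true  (scheduled run, within the cap)
console.log(shouldEnqueueList(101, 3));         // false (scheduled run, beyond the cap)
console.log(shouldEnqueueList(101, undefined)); // true  (manual run, no cap)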
However, whilst the scheduler successfully fires the crawler, it still runs right through the whole set; it doesn't cap out at /page/3.
How can I ensure I only get the first three pages up to /page/3?
Have I malformed something?
In the code below, you can see my previous version of the above addition, now commented out.
Those LIST pages should only be...
- The STARTing one, with an implied "/page/1" URL (https://www.beet.tv/author/randrews)
- https://www.beet.tv/author/randrews/page/2
- https://www.beet.tv/author/randrews/page/3
... and not the likes of /page/101 or /page/102, which may surface.
Here are the key terms...
START https://www.beet.tv/author/randrews
LIST https://www.beet.tv/author/randrews/page/[\d+]
DETAIL https://www.beet.tv/*
Clickable elements a.page-numbers
And here is the crawler script...
function pageFunction(context) {
    // Called on every page the crawler visits; use it to extract data from it
    var $ = context.jQuery;
    var result;

    // If page is START or a LIST,
    if (context.request.label === 'START' || context.request.label === 'LIST') {
        context.skipOutput();

        // First, gather LIST pages
        $('a.page-numbers').each(function() {
            // lines added to accept number of pages via customData in Scheduler...
            var pageNumber = parseInt($(this).text());
            // var maxListDepth = context.customData;
            var maxListDepth = parseInt(context.customData); // Jakub's suggestion, Nov 20 2018
            if (!maxListDepth || (maxListDepth && pageNumber <= maxListDepth)) {
                context.enqueuePage({
                    url: /*window.location.origin +*/ $(this).attr('href'),
                    label: 'LIST'
                });
            }
        });

        // Then, gather every DETAIL page
        $('h3>a').each(function() {
            context.enqueuePage({
                url: /*window.location.origin +*/ $(this).attr('href'),
                label: 'DETAIL'
            });
        });

    // If page is actually a DETAIL target page
    } else if (context.request.label === 'DETAIL') {
        /* context.skipLinks(); */
        var categories = [];
        $('span.cat-links a').each(function() {
            categories.push($(this).text());
        });

        var tags = [];
        $('span.tags-links a').each(function() {
            tags.push($(this).text());
        });

        result = {
            "title": $('h1').text(),
            "entry": $('div.entry-content').html().trim(),
            "datestamp": $('time').attr('datetime'),
            "photo": $('meta[name="twitter:image"]').attr("content"),
            categories: categories,
            tags: tags
        };
    }
    return result;
}
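For illustration only, and not a confirmed fix: the same depth guard could key off each link's href instead of its visible text, since pagination widgets often include non-numeric anchors (e.g. "Next") that make parseInt return NaN. This is a sketch of a drop-in replacement for the first $('a.page-numbers').each(...) loop above, assuming the same legacy Apify Crawler context API:
$('a.page-numbers').each(function() {
    var href = $(this).attr('href');
    // Pull the page number out of URLs like .../author/randrews/page/7
    var match = /\/page\/(\d+)\/?$/.exec(href);
    var pageNumber = match ? parseInt(match[1]) : 1; // an unnumbered URL means page 1
    var maxListDepth = parseInt(context.customData);
    if (!maxListDepth || pageNumber <= maxListDepth) {
        context.enqueuePage({ url: href, label: 'LIST' });
    }
});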
javascript jquery web-crawler apify
asked Nov 23 '18 at 17:05 by Robert Andrews
1 Answer
There are two options in the advanced settings which can help: Max pages per crawl and Max result records. In your case, I would set Max result records to 60, and the crawler then stops after outputting 60 pages (the DETAIL pages from the first 3 LIST pages).
answered Nov 23 '18 at 21:14 by Jakub Balada
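For reference, those limits can also be passed per run through the scheduler's input JSON rather than changed globally in the crawler settings, as the comments below confirm. A rough sketch using the property names from Jakub's follow-up comment; the crawled-page budget of 63 (3 LIST pages plus 60 DETAIL pages) is my own assumption about how the skipped LIST pages are counted:
{
    "maxCrawledPages": 63,
    "maxOutputPages": 60
}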
Hi. Can I leave this switched off as standard but pass something only for the scheduled crawl (i.e. manual execution gets everything, but scheduled execution gets only the first 60)? If I can pass {maxCrawledPages: Number} or {maxOutputPages: Number} via the input JSON in the Scheduler, a) can I delete my current maxListDepth code, and b) do I need some code to handle that in the crawler code as well?
– Robert Andrews
Nov 23 '18 at 21:58
Yes, when starting the crawler via the scheduler (or API) you can override any crawler setting. So you can use something like { "maxCrawledPages": 60, "maxOutputPages": 60 }.
And yes, you can delete your maxListDepth code; you don't need to handle it in the pageFunction.
– Jakub Balada
Nov 24 '18 at 0:19
I think the solution in my case was to remove "Clickable element" from the GUI. You have previously advised this, but it seems I let it creep back in. However, the Max-results-records route also seems like a better solution than the LIST pages route. That works, too, and allows me to remove some code. Thanks!
– Robert Andrews
Nov 24 '18 at 8:09