Description
I have two crawlers, one in TypeScript and one in Python.
In the Python version, I have noticed that the crawl limit is only reached when requests_finished >= max_requests_per_crawl. For example, I set max_requests_per_crawl to 1000 and finished a crawl with these stats:
[crawlee.crawlers._playwright._playwright_crawler] INFO Final request statistics:
┌───────────────────────────────┬───────────────────┐
│ requests_finished │ 310 │
│ requests_failed │ 6030 │
│ retry_histogram │ [312, 0, 0, 6028] │
│ request_avg_failed_duration │ 231.3ms │
│ request_avg_finished_duration │ 2.01s │
│ requests_finished_per_minute │ 32 │
│ requests_failed_per_minute │ 625 │
│ request_total_duration │ 33min 39.3s │
│ requests_total │ 6340 │
│ crawler_runtime │ 9min 38.6s │
└───────────────────────────────┴───────────────────┘
However, in the JavaScript one, the crawl limit is triggered when requestsFinished + requestsFailed >= maxRequestsPerCrawl:
INFO SessionAwareCrawler: Crawler reached the maxRequestsPerCrawl limit of 1000 requests and will shut down soon. Requests that are in progress will be allowed to finish.
INFO SessionAwareCrawler: Earlier, the crawler reached the maxRequestsPerCrawl limit of 1000 requests and all requests that were in progress at that time have now finished. In total, the crawler processed 1000 requests and will shut down.
INFO SessionAwareCrawler: Final request statistics: {"requestsFinished":298,"requestsFailed":702,"retryHistogram":[1000],"requestAvgFailedDurationMillis":362,"requestAvgFinishedDurationMillis":1804,"requestsFinishedPerMinute":154,"requestsFailedPerMinute":363,"requestTotalDurationMillis":791397,"requestsTotal":1000,"crawlerRuntimeMillis":115970}
Looking at the code base, that seems to be the case. Is this intentional?
Python:
crawlee-python/src/crawlee/crawlers/_basic/_basic_crawler.py
Lines 564 to 572 in 142e4ef
def _stop_if_max_requests_count_exceeded(self) -> None:
    """Call `stop` when the maximum number of requests to crawl has been reached."""
    if self._max_requests_per_crawl is None:
        return
    if self._statistics.state.requests_finished >= self._max_requests_per_crawl:
        self.stop(
            reason=f'The crawler has reached its limit of {self._max_requests_per_crawl} requests per crawl. '
        )
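To illustrate the difference, here is a minimal, self-contained sketch of what the check would look like if the Python side also counted failed requests toward the limit, as the JavaScript crawler does. All class and attribute names here are hypothetical stand-ins, not the actual crawlee-python API:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StatisticsState:
    """Hypothetical stand-in for the crawler's statistics state."""
    requests_finished: int = 0
    requests_failed: int = 0


class CrawlerSketch:
    """Hypothetical stand-in for BasicCrawler, showing only the limit check."""

    def __init__(self, max_requests_per_crawl: Optional[int]) -> None:
        self._max_requests_per_crawl = max_requests_per_crawl
        self._statistics_state = StatisticsState()
        self.stopped = False

    def stop(self, reason: str) -> None:
        self.stopped = True

    def _stop_if_max_requests_count_exceeded(self) -> None:
        # JS-style behavior: finished AND failed requests both count
        # toward maxRequestsPerCrawl.
        if self._max_requests_per_crawl is None:
            return
        state = self._statistics_state
        if state.requests_finished + state.requests_failed >= self._max_requests_per_crawl:
            self.stop(
                reason=f'Reached the limit of {self._max_requests_per_crawl} requests per crawl.'
            )


# Numbers from the JavaScript run above: 298 finished + 702 failed = 1000.
crawler = CrawlerSketch(max_requests_per_crawl=1000)
crawler._statistics_state.requests_finished = 298
crawler._statistics_state.requests_failed = 702
crawler._stop_if_max_requests_count_exceeded()
print(crawler.stopped)  # True: 298 + 702 >= 1000, so the crawler shuts down
```

With the current Python check (requests_finished only), the same numbers would not trigger a stop, which matches the 6340-requests-total run shown above.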
Javascript:
https://github.com/apify/crawlee/blob/c3a4b3b0d5be63f1f7a779ff43560ab2b426f3bb/packages/basic-crawler/src/internals/basic-crawler.ts#L812
I noticed the docs use the same verbiage for both libraries as well:
https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions#maxRequestsPerCrawl
https://crawlee.dev/python/api/class/PlaywrightCrawlerOptions#max_requests_per_crawl