max_crawl_limit behavior different than javascript version #1765

@jaemingo-hhh

Description

I have two crawlers, one in TypeScript and one in Python.

In the Python version, I have noticed that the crawl limit is only reached when requests_finished >= max_requests_per_crawl. For example, I set max_requests_per_crawl to 1000, and the crawl finished with these stats:

[crawlee.crawlers._playwright._playwright_crawler] INFO  Final request statistics:
┌───────────────────────────────┬───────────────────┐
│ requests_finished             │ 310               │
│ requests_failed               │ 6030              │
│ retry_histogram               │ [312, 0, 0, 6028] │
│ request_avg_failed_duration   │ 231.3ms           │
│ request_avg_finished_duration │ 2.01s             │
│ requests_finished_per_minute  │ 32                │
│ requests_failed_per_minute    │ 625               │
│ request_total_duration        │ 33min 39.3s       │
│ requests_total                │ 6340              │
│ crawler_runtime               │ 9min 38.6s        │
└───────────────────────────────┴───────────────────┘

However, in the JavaScript one, the crawl limit is triggered when requestsFinished + requestsFailed >= maxRequestsPerCrawl:

INFO  SessionAwareCrawler: Crawler reached the maxRequestsPerCrawl limit of 1000 requests and will shut down soon. Requests that are in progress will be allowed to finish.
INFO  SessionAwareCrawler: Earlier, the crawler reached the maxRequestsPerCrawl limit of 1000 requests and all requests that were in progress at that time have now finished. In total, the crawler processed 1000 requests and will shut down.
INFO  SessionAwareCrawler: Final request statistics: {"requestsFinished":298,"requestsFailed":702,"retryHistogram":[1000],"requestAvgFailedDurationMillis":362,"requestAvgFinishedDurationMillis":1804,"requestsFinishedPerMinute":154,"requestsFailedPerMinute":363,"requestTotalDurationMillis":791397,"requestsTotal":1000,"crawlerRuntimeMillis":115970}

Looking at the code base, that seems to be the case. Is this intentional?
Python:

def _stop_if_max_requests_count_exceeded(self) -> None:
    """Call `stop` when the maximum number of requests to crawl has been reached."""
    if self._max_requests_per_crawl is None:
        return
    if self._statistics.state.requests_finished >= self._max_requests_per_crawl:
        self.stop(
            reason=f'The crawler has reached its limit of {self._max_requests_per_crawl} requests per crawl. '
        )

Javascript:
https://github.com/apify/crawlee/blob/c3a4b3b0d5be63f1f7a779ff43560ab2b426f3bb/packages/basic-crawler/src/internals/basic-crawler.ts#L812

I noticed the docs use the same wording as well:
https://crawlee.dev/js/api/playwright-crawler/interface/PlaywrightCrawlerOptions#maxRequestsPerCrawl
https://crawlee.dev/python/api/class/PlaywrightCrawlerOptions#max_requests_per_crawl

Labels: t-tooling (issues with this label are in the ownership of the tooling team)