Conversation

@janbuchar (Collaborator)

I stumbled upon this when working on ContextPipeline for the JS version. I'm eager to hear your thoughts 🙂

@janbuchar janbuchar added t-tooling Issues with this label are in the ownership of the tooling team. adhoc Ad-hoc unplanned task added during the sprint. labels Oct 10, 2025
@janbuchar janbuchar requested review from Pijukatel and vdusek October 10, 2025 15:59
@github-actions github-actions bot added this to the 125th sprint - Tooling team milestone Oct 10, 2025
@vdusek (Collaborator) left a comment

Okay, so the requestHandlerTimeout is now applied only to the request handler (router(final_context)) and not the whole context pipeline.

It makes sense when I'm reading this.

Do we have any examples of where the previous behavior caused any troubles?

@janbuchar (Collaborator, Author)

> Okay, so the requestHandlerTimeout is now applied only to the request handler (router(final_context)) and not the whole context pipeline.

This is correct.

> Do we have any examples of where the previous behavior caused any troubles?

In the JS version, the browser crawler (the playwright/puppeteer ancestor) has two kinds of timeouts: navigationTimeout and requestHandlerTimeout. It then does this: https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L395

This is super awkward and not really what the interface promises. However, the Python version doesn't have a navigation timeout; I believe I should add one as part of this PR.
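To make the new behavior concrete, here is a minimal sketch using plain asyncio rather than Crawlee's actual classes (all names here are invented for illustration): the timeout wraps only the user-supplied handler, while the rest of the pipeline runs without it.

```python
import asyncio

# Hypothetical names for illustration; the real Crawlee internals differ.
REQUEST_HANDLER_TIMEOUT = 0.05  # seconds; deliberately tiny for the demo

async def context_pipeline(handler):
    # Pipeline steps (e.g. navigation middleware) run without the handler timeout...
    await asyncio.sleep(0)  # stand-in for middleware work
    # ...and only the user-supplied handler is wrapped in the timeout.
    await asyncio.wait_for(handler(), timeout=REQUEST_HANDLER_TIMEOUT)

async def slow_handler():
    await asyncio.sleep(1)  # exceeds the handler timeout

async def main():
    try:
        await context_pipeline(slow_handler)
        return "completed"
    except asyncio.TimeoutError:
        return "handler timed out"

print(asyncio.run(main()))  # → handler timed out
```

The point of the sketch is only where the `wait_for` sits: a slow middleware step would no longer eat into the handler's budget.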

@Pijukatel (Collaborator) left a comment

Could you please add some tests for the correct application of the timeout?

use_incognito_pages: By default pages share the same browser context. If set to True each page uses its
own context that is destroyed once the page is closed or crashes.
This option should not be used if `browser_pool` is provided.
navigation_timeout: Timeout for navigation (the process between opening a Playwright page and calling
@janbuchar (Collaborator, Author)

Open question: should the navigation_timeout also apply to pre-navigation hooks? Should they have their own limit? Or should they share a limit with the request handler?

@B4nan, your opinion on this is also welcome. This is due to change in Crawlee JS v4, so we should stay aligned.

(Member)

I would include the hooks; otherwise we would need another two options for timeouts of the pre- and post-navigation hooks. They need to be part of some timeout handler, and having three options just for navigation feels like too much.
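A minimal sketch of that option, again with plain asyncio.wait_for rather than Crawlee's internals (the hook lists and the navigation call are invented placeholders): pre-navigation hooks, the navigation itself, and post-navigation hooks all draw from one shared budget.

```python
import asyncio

# Sketch only: names and numbers are placeholders, not Crawlee's API.
NAVIGATION_TIMEOUT = 0.05  # seconds; one budget shared by hooks and navigation

async def navigate_with_hooks(pre_nav_hooks, do_navigation, post_nav_hooks):
    async def _run():
        for hook in pre_nav_hooks:
            await hook()
        await do_navigation()
        for hook in post_nav_hooks:
            await hook()
    # A single timeout covers the hooks and the navigation together.
    await asyncio.wait_for(_run(), timeout=NAVIGATION_TIMEOUT)

async def stuck_hook():
    await asyncio.sleep(1)  # a misbehaving pre-navigation hook

async def fast_navigation():
    await asyncio.sleep(0)

async def main():
    try:
        await navigate_with_hooks([stuck_hook], fast_navigation, [])
        return "ok"
    except asyncio.TimeoutError:
        return "navigation timed out"

print(asyncio.run(main()))  # → navigation timed out
```

With this shape, a stuck hook is caught by the same limit as a stuck navigation, so no extra options are needed.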

@janbuchar (Collaborator, Author)

@B4nan include them in navigation_timeout, you mean?

(Member)

Yes, either that or separate timeouts for the pre- and post-navigation hooks. User code needs to be wrapped in timeout handlers.

(Collaborator)

I would say we can start with one shared timeout for all, and adjust later if needed.

@janbuchar (Collaborator, Author)

Well, that's how we got here 😁 but yeah, including the hooks in the navigation timeout is fine with me.

@github-actions github-actions bot added the tested Temporary label used only programmatically for some analytics. label Nov 28, 2025
        raise

    async def _run_request_handler(self, context: BasicCrawlingContext) -> None:
        await wait_for(
@Pijukatel (Collaborator) commented Nov 28, 2025

I am just wondering: can the context pipeline now get stuck?

navigation_timeout(some parts of context pipeline) -> time-unlimited parts of context pipeline -> request_handler_timeout(request_handler)

@janbuchar (Collaborator, Author) commented Nov 28, 2025

Realistically, that has always been the case, but yeah, this increases the odds by a bit. The hooks (discussed in #1474 (comment)) are probably the biggest risk.

Perhaps we could add some comically large timeout to the context pipeline execution as a whole, too.
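A rough sketch of that idea (plain asyncio; the step names and numbers are invented for illustration): specific timeouts guard navigation and the handler, while a generous outer cap keeps an otherwise time-unlimited middle step from hanging forever.

```python
import asyncio

# Illustrative values only; real ones would come from crawler options.
NAVIGATION_TIMEOUT = 0.05
REQUEST_HANDLER_TIMEOUT = 0.05
PIPELINE_HARD_CAP = 10 * (NAVIGATION_TIMEOUT + REQUEST_HANDLER_TIMEOUT)  # "comically large"

async def run_pipeline(navigate, unbounded_step, handler):
    async def _pipeline():
        await asyncio.wait_for(navigate(), timeout=NAVIGATION_TIMEOUT)
        await unbounded_step()  # no per-step timeout; only the outer cap covers it
        await asyncio.wait_for(handler(), timeout=REQUEST_HANDLER_TIMEOUT)
    # A generous outer limit on the whole pipeline as a safety net.
    await asyncio.wait_for(_pipeline(), timeout=PIPELINE_HARD_CAP)

async def quick():
    await asyncio.sleep(0)

async def hangs():
    await asyncio.sleep(60)  # simulates a stuck middleware step

async def main():
    try:
        await run_pipeline(quick, hangs, quick)
        return "finished"
    except asyncio.TimeoutError:
        return "pipeline capped"

print(asyncio.run(main()))  # → pipeline capped
```

The outer cap never fires in healthy runs, but it turns a permanently stuck pipeline into an ordinary timeout error.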

(Collaborator)

In that case, it makes even more sense to include the hooks in the timeout.

Other steps do not appear to be a problem in our pipelines for now.

@vdusek (Collaborator) left a comment

Docstring suggestions. Otherwise LGTM.

payload: The data to be sent as the request body.
session: The session associated with the request.
proxy_info: The information about the proxy to be used.
timeout: Request timeout
(Collaborator)

Something more descriptive? Although I'm not sure whether "process" is the right word, I cannot find a better one.

Suggested change
timeout: Request timeout
timeout: Maximum time allowed to process the request.

session: The session associated with the request.
proxy_info: The information about the proxy to be used.
statistics: The statistics object to register status codes.
timeout: Request timeout
(Collaborator)

Something more descriptive? Although I'm not sure whether "process" is the right word, I cannot find a better one.

Suggested change
timeout: Request timeout
timeout: Maximum time allowed to process the request.

6 participants