Conversation

@janbuchar (Collaborator)

I stumbled upon this when working on ContextPipeline for the JS version. I'm eager to hear your thoughts 🙂

@janbuchar janbuchar added t-tooling Issues with this label are in the ownership of the tooling team. adhoc Ad-hoc unplanned task added during the sprint. labels Oct 10, 2025
@janbuchar janbuchar requested review from Pijukatel and vdusek October 10, 2025 15:59
@github-actions github-actions bot added this to the 125th sprint - Tooling team milestone Oct 10, 2025
@vdusek (Collaborator) left a comment

Okay, so the requestHandlerTimeout is now applied only to the request handler (router(final_context)) and not the whole context pipeline.

It makes sense when I'm reading this.

Do we have any examples of where the previous behavior caused any troubles?

@janbuchar (Collaborator, Author)

> Okay, so the requestHandlerTimeout is now applied only to the request handler (router(final_context)) and not the whole context pipeline.

This is correct.

> Do we have any examples of where the previous behavior caused any troubles?

In the JS version, the browser crawler (the playwright/puppeteer ancestor) has two kinds of timeouts: navigationTimeout and requestHandlerTimeout. It then does this: https://github.com/apify/crawlee/blob/master/packages/browser-crawler/src/internals/browser-crawler.ts#L395

This is super awkward and not really what the interface promises. However, the Python version doesn't have a navigation timeout; I believe I should add one as part of this PR.
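To make the new behavior concrete, here is a minimal sketch using plain asyncio rather than Crawlee's actual classes (all names here are invented for illustration): the timeout wraps only the user-supplied handler, while the rest of the pipeline runs without it.

```python
import asyncio

# Hypothetical names for illustration; the real Crawlee internals differ.
REQUEST_HANDLER_TIMEOUT = 0.05  # seconds; deliberately tiny for the demo

async def context_pipeline(handler):
    # Pipeline steps (e.g. navigation middleware) run without the handler timeout...
    await asyncio.sleep(0)  # stand-in for middleware work
    # ...and only the user-supplied handler is wrapped in the timeout.
    await asyncio.wait_for(handler(), timeout=REQUEST_HANDLER_TIMEOUT)

async def slow_handler():
    await asyncio.sleep(1)  # exceeds the handler timeout

async def main():
    try:
        await context_pipeline(slow_handler)
        return "completed"
    except asyncio.TimeoutError:
        return "handler timed out"

print(asyncio.run(main()))  # → handler timed out
```

The point of the sketch is only where the `wait_for` sits: a slow middleware step would no longer eat into the handler's budget.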

@Pijukatel (Collaborator) left a comment

Could you please add some tests for the correct application of the timeout?

use_incognito_pages: By default pages share the same browser context. If set to True each page uses its
own context that is destroyed once the page is closed or crashes.
This option should not be used if `browser_pool` is provided.
navigation_timeout: Timeout for navigation (the process between opening a Playwright page and calling
@janbuchar (Collaborator, Author)

Open question: should the navigation_timeout also apply to pre-navigation hooks? Should they have their own limit? Or should they share a limit with the request handler?

@B4nan, your opinion on this is also welcome. This is due to change in Crawlee JS v4, so we should stay aligned.

(Member)

I would include the hooks; otherwise we would need another two options for timeouts of the pre- and post-navigation hooks. They need to be part of some timeout handler, and having three options just for navigation feels like too much.
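A minimal sketch of that option, again with plain asyncio.wait_for rather than Crawlee's internals (the hook lists and the navigation call are invented placeholders): pre-navigation hooks, the navigation itself, and post-navigation hooks all draw from one shared budget.

```python
import asyncio

# Sketch only: names and numbers are placeholders, not Crawlee's API.
NAVIGATION_TIMEOUT = 0.05  # seconds; one budget shared by hooks and navigation

async def navigate_with_hooks(pre_nav_hooks, do_navigation, post_nav_hooks):
    async def _run():
        for hook in pre_nav_hooks:
            await hook()
        await do_navigation()
        for hook in post_nav_hooks:
            await hook()
    # A single timeout covers the hooks and the navigation together.
    await asyncio.wait_for(_run(), timeout=NAVIGATION_TIMEOUT)

async def stuck_hook():
    await asyncio.sleep(1)  # a misbehaving pre-navigation hook

async def fast_navigation():
    await asyncio.sleep(0)

async def main():
    try:
        await navigate_with_hooks([stuck_hook], fast_navigation, [])
        return "ok"
    except asyncio.TimeoutError:
        return "navigation timed out"

print(asyncio.run(main()))  # → navigation timed out
```

With this shape, a stuck hook is caught by the same limit as a stuck navigation, so no extra options are needed.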

@janbuchar (Collaborator, Author)

@B4nan include them in navigation_timeout, you mean?

(Member)

Yes, either that or separate timeouts for the pre- and post-navigation hooks. User code needs to be wrapped in timeout handlers.

(Collaborator)

I would say we can start with one shared timeout for all, and adjust later if needed.

@janbuchar (Collaborator, Author)

Well, that's how we got here 😁 but yeah, including the hooks in the navigation timeout is fine with me.

@github-actions github-actions bot added the tested Temporary label used only programmatically for some analytics. label Nov 28, 2025
        raise

    async def _run_request_handler(self, context: BasicCrawlingContext) -> None:
        await wait_for(
@Pijukatel (Collaborator) commented Nov 28, 2025

I am just wondering: can the context pipeline now get stuck?

navigation_timeout(some parts of context pipeline) -> time-unlimited parts of context pipeline -> request_handler_timeout(request_handler)

@janbuchar (Collaborator, Author) commented Nov 28, 2025

Realistically, that has always been the case, but yeah, this increases the odds by a bit. The hooks (discussed in #1474 (comment)) are probably the biggest risk.

Perhaps we could add some comically large timeout to the context pipeline execution as a whole, too.
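A rough sketch of that idea (plain asyncio; the step names and numbers are invented for illustration): specific timeouts guard navigation and the handler, while a generous outer cap keeps an otherwise time-unlimited middle step from hanging forever.

```python
import asyncio

# Illustrative values only; real ones would come from crawler options.
NAVIGATION_TIMEOUT = 0.05
REQUEST_HANDLER_TIMEOUT = 0.05
PIPELINE_HARD_CAP = 10 * (NAVIGATION_TIMEOUT + REQUEST_HANDLER_TIMEOUT)  # "comically large"

async def run_pipeline(navigate, unbounded_step, handler):
    async def _pipeline():
        await asyncio.wait_for(navigate(), timeout=NAVIGATION_TIMEOUT)
        await unbounded_step()  # no per-step timeout; only the outer cap covers it
        await asyncio.wait_for(handler(), timeout=REQUEST_HANDLER_TIMEOUT)
    # A generous outer limit on the whole pipeline as a safety net.
    await asyncio.wait_for(_pipeline(), timeout=PIPELINE_HARD_CAP)

async def quick():
    await asyncio.sleep(0)

async def hangs():
    await asyncio.sleep(60)  # simulates a stuck middleware step

async def main():
    try:
        await run_pipeline(quick, hangs, quick)
        return "finished"
    except asyncio.TimeoutError:
        return "pipeline capped"

print(asyncio.run(main()))  # → pipeline capped
```

The outer cap never fires in healthy runs, but it turns a permanently stuck pipeline into an ordinary timeout error.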

(Collaborator)

In that case, it makes even more sense to include the hooks in the timeout.

Other steps do not appear to be a problem in our pipelines for now.

@vdusek (Collaborator) left a comment

Docstring suggestions. Otherwise LGTM.

payload: The data to be sent as the request body.
session: The session associated with the request.
proxy_info: The information about the proxy to be used.
timeout: Request timeout
(Collaborator)

Something more descriptive? Although I'm not sure whether "process" is the right word, I cannot find a better one.

Suggested change
timeout: Request timeout
timeout: Maximum time allowed to process the request.

session: The session associated with the request.
proxy_info: The information about the proxy to be used.
statistics: The statistics object to register status codes.
timeout: Request timeout
(Collaborator)

Something more descriptive? Although I'm not sure whether "process" is the right word, I cannot find a better one.

Suggested change
timeout: Request timeout
timeout: Maximum time allowed to process the request.

6 participants