-
Notifications
You must be signed in to change notification settings - Fork 523
fix: Only apply requestHandlerTimeout to request handler #1474
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
vdusek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so the requestHandlerTimeout is now applied only to the request handler (router(final_context)) and not the whole context pipeline.
It makes sense when I'm reading this.
Do we have any examples of where the previous behavior caused any troubles?
This is correct.
In the JS version, the browser crawler (playwright/puppeteer ancestor) has two kinds of timeout - This is super awkward and not really what the interface promises. However, the Python version doesn't have a navigation timeout. I believe I should add that as part of this PR. |
Pijukatel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add some test for the correct application of the timeout.
| use_incognito_pages: By default pages share the same browser context. If set to True each page uses its | ||
| own context that is destroyed once the page is closed or crashes. | ||
| This option should not be used if `browser_pool` is provided. | ||
| navigation_timeout: Timeout for navigation (the process between opening a Playwright page and calling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open question - should the navigation_timeout also apply to pre-navigation hooks? Should they have their own limit? Should they share a limit with the request handler?
@B4nan your opinion on this is also welcome. This is due to change in crawlee js v4, so we should be aligned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would include the hooks, otherwise we would need another two options for timeouts of pre and post nav hooks. They need to be part of some timeout handler, and having 3 options just for the navigation feels too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@B4nan include them in navigation_timeout, you mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, either that, or separate timeouts over pre and post nav hooks. User code needs to be wrapped in timeout handlers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say we can start with one shared timeout for all, and adjust later if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that's how we got here 😁 but yeah, including hook in navigation timeout is fine with me.
| raise | ||
|
|
||
| async def _run_request_handler(self, context: BasicCrawlingContext) -> None: | ||
| await wait_for( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just wondering. Can the context pipeline now get stuck?
navigation_timeout(some parts of context pipeline) -> time unlimited parts of context pipeline -> request_handler_timeout(request_hanlder)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Realistically, that has always been the case, but yeah, this increases the odds by a bit. The hooks (discussed in #1474 (comment)) are probably the biggest risk.
Perhaps we could add some comically large timeout to the context pipeline execution as a whole, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, it makes even more sense to include the hooks in the timeout.
Other steps do not appear to be a problem in our pipelines for now.
vdusek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstring suggestions. Otherwise LGTM.
| payload: The data to be sent as the request body. | ||
| session: The session associated with the request. | ||
| proxy_info: The information about the proxy to be used. | ||
| timeout: Request timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something more descriptive? Although not sure whether process is the correct word, I cannot find a better one.
| timeout: Request timeout | |
| timeout: Maximum time allowed to process the request. |
| session: The session associated with the request. | ||
| proxy_info: The information about the proxy to be used. | ||
| statistics: The statistics object to register status codes. | ||
| timeout: Request timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something more descriptive? Although not sure whether process is the correct word, I cannot find a better one.
| timeout: Request timeout | |
| timeout: Maximum time allowed to process the request. |
I stumbled upon this when working on ContextPipeline for the JS version. I'm eager to hear your thoughts 🙂