A simplified page object:

    import logging

    import attrs
    from web_poet import HttpClient, field, handle_urls
    from zyte_common_items import ProductPage

    logger = logging.getLogger(__name__)


    @handle_urls("toscrape.com", instead_of=ProductPage)
    @attrs.define
    class ToScrapeComProductPage(ProductPage):
        downloader: HttpClient

        @field
        async def name(self):
            page_data = await self.get_page_data()
            return page_data.get("name")

        async def get_page_data(self):
            logger.debug(">>>>>>>>>>>>>>>>>>>>>>>>>>> Get data from graphql <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<")
            response = await self.downloader.post(
                url="https://toscrape.com/graphql",
                # ...
            )
            return {"name": "test name"}
When I crawl, everything works as expected: the session is created first, and then I send the GraphQL query within that session:
2025-01-28 00:23:54 [majornetlocs.page_objects.toscrape.com.sessions] DEBUG: >>>>>>>>>>>>>>>>>>>>>>>>>>> Create session <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2025-01-28 00:23:54 [zyte_api._retry] DEBUG: Starting call to 'zyte_api._async.AsyncZyteAPI.get.<locals>.request', this is the 1st time calling it.
2025-01-28 00:23:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://toscrape.com> (referer: None) ['zyte-api']
2025-01-28 00:23:58 [zyte_api._retry] DEBUG: Starting call to 'zyte_api._async.AsyncZyteAPI.get.<locals>.request', this is the 1st time calling it.
2025-01-28 00:24:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://toscrape.com/p/316979607> (referer: None) ['zyte-api']
2025-01-28 00:24:01 [majornetlocs.page_objects.toscrape.com.products] DEBUG: >>>>>>>>>>>>>>>>>>>>>>>>>>> Get data from graphql <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
So far so good. The problem is that my ToScrapeComProductPage class doesn't actually inherit from ProductPage but from BaseProductPage. This is because we don't want to retrieve the HTML at all, since we get everything we need from the GraphQL request.
With BaseProductPage, the input product URL is never actually requested, and therefore my session is not created up front. Only sending the GraphQL request triggers session creation. As a result, the ordering is incorrect:
2025-01-28 00:26:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://toscrape.com/p/12345> (referer: None)
2025-01-28 00:26:43 [majornetlocs.page_objects.toscrape.com.products] DEBUG: >>>>>>>>>>>>>>>>>>>>>>>>>>> Get data from graphql <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2025-01-28 00:26:44 [majornetlocs.page_objects.toscrape.com.sessions] DEBUG: >>>>>>>>>>>>>>>>>>>>>>>>>>> Create session <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
This is problematic because inside get_page_data the session must already be initialized in order to send a proper request.
Therefore I'd like to request that the session be created earlier when BaseProductPage is used.
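To make the desired ordering concrete, here is a minimal plain-asyncio sketch of "create the session before the first GraphQL request". It does not use or describe scrapy-zyte-api's session management; SessionGuard, create_session, and the events list are hypothetical names made up for illustration, and the network calls are replaced by stand-ins.

```python
import asyncio


class SessionGuard:
    """Hypothetical helper: runs a session-creating coroutine once, on first use."""

    def __init__(self, create_session):
        self._create_session = create_session  # coroutine function (assumed)
        self._lock = asyncio.Lock()
        self._session = None

    async def ensure(self):
        # The lock keeps concurrent callers from creating two sessions.
        async with self._lock:
            if self._session is None:
                self._session = await self._create_session()
        return self._session


events = []  # records the order of operations for this demo


async def create_session():
    events.append("create session")
    await asyncio.sleep(0)  # stand-in for the request that initializes the session
    return {"session_id": "demo"}


async def get_page_data(guard):
    session = await guard.ensure()  # the session exists before the GraphQL call
    events.append("get data from graphql")
    return {"name": "test name", "session_id": session["session_id"]}


async def main():
    guard = SessionGuard(create_session)
    # Two concurrent page objects share one guard and therefore one session.
    return await asyncio.gather(get_page_data(guard), get_page_data(guard))


results = asyncio.run(main())
```

This only illustrates the ordering I would expect ("Create session" logged before "Get data from graphql"); ideally the session would be initialized for me when BaseProductPage is used, without a guard like this in each page object.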