Sessions are not created in advance if BaseProductPage is used. #244

Open
Nykakin opened this issue Jan 27, 2025 · 0 comments · May be fixed by #246

Nykakin (Contributor) commented Jan 27, 2025

Consider a simplified session config:

import logging

from scrapy_zyte_api import SessionConfig, session_config

logger = logging.getLogger(__name__)


@session_config("toscrape.com")
class ToScrapeComLocationSessionConfig(SessionConfig):
    def params(self, request):
        logger.debug(">>>>>>>>>>>>>>>>>>>>>>>>>>> Create session <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<")
        return {
            "url": "https://toscrape.com",
            "browserHtml": True,
            # ...
        }

    def check(self, response, request):
        return True
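
Session management itself is enabled globally in settings.py (minimal sketch, assuming only the standard scrapy-zyte-api toggle; everything else stays at its default):

# settings.py (sketch): turn on scrapy-zyte-api session management so that
# the session config above is applied to toscrape.com requests.
ZYTE_API_SESSION_ENABLED = True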

And a simplified page object:

import logging

import attrs
from web_poet import HttpClient, field, handle_urls
from zyte_common_items import ProductPage

logger = logging.getLogger(__name__)


@handle_urls("toscrape.com", instead_of=ProductPage)
@attrs.define
class ToScrapeComProductPage(ProductPage):
    downloader: HttpClient

    @field
    async def name(self):
        page_data = await self.get_page_data()
        return page_data.get("name")

    async def get_page_data(self):
        logger.debug(">>>>>>>>>>>>>>>>>>>>>>>>>>> Get data from graphql <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<")
        response = await self.downloader.post(
            url="https://toscrape.com/graphql",
            # ...
        )
        # (simplified: the real code builds the result from the GraphQL response)
        return {"name": "test name"}

When I crawl, everything works as expected: first the session is created, and then the GraphQL query is made within that session:

2025-01-28 00:23:54 [majornetlocs.page_objects.toscrape.com.sessions] DEBUG: >>>>>>>>>>>>>>>>>>>>>>>>>>> Create session <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2025-01-28 00:23:54 [zyte_api._retry] DEBUG: Starting call to 'zyte_api._async.AsyncZyteAPI.get.<locals>.request', this is the 1st time calling it.
2025-01-28 00:23:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://toscrape.com> (referer: None) ['zyte-api']
2025-01-28 00:23:58 [zyte_api._retry] DEBUG: Starting call to 'zyte_api._async.AsyncZyteAPI.get.<locals>.request', this is the 1st time calling it.
2025-01-28 00:24:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://toscrape.com/p/316979607> (referer: None) ['zyte-api']
2025-01-28 00:24:01 [majornetlocs.page_objects.toscrape.com.products] DEBUG: >>>>>>>>>>>>>>>>>>>>>>>>>>> Get data from graphql <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

So far so good. The problem is that my ToScrapeComProductPage class doesn't actually inherit from ProductPage but from BaseProductPage, because we don't want to retrieve any HTML at all; everything we need comes from the GraphQL request:

from zyte_common_items import BaseProductPage, ProductPage


@handle_urls("toscrape.com", instead_of=ProductPage)
@attrs.define
class ToScrapeComProductPage(BaseProductPage):
    downloader: HttpClient

The page object no longer takes the product URL response as an input, so my session is not created in advance. Only sending the GraphQL request causes a session to be created, and as a result the ordering is incorrect:

2025-01-28 00:26:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://toscrape.com/p/12345> (referer: None)
2025-01-28 00:26:43 [majornetlocs.page_objects.toscrape.com.products] DEBUG: >>>>>>>>>>>>>>>>>>>>>>>>>>> Get data from graphql <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2025-01-28 00:26:44 [majornetlocs.page_objects.toscrape.com.sessions] DEBUG: >>>>>>>>>>>>>>>>>>>>>>>>>>> Create session <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

This is problematic because, inside get_page_data, the session already needs to be initialized in order to send a proper request.

Therefore I'd like to request that the session be created earlier when BaseProductPage is used.
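
A possible stopgap is to force an up-front Zyte API request anyway. The sketch below assumes that declaring an explicit BrowserResponse dependency makes the product URL go through Zyte API first (as in the working ProductPage case above), which should create the session before get_page_data runs; the downloaded HTML is then simply thrown away, which is exactly the waste BaseProductPage is meant to avoid:

import attrs
from web_poet import BrowserResponse, HttpClient, handle_urls
from zyte_common_items import BaseProductPage, ProductPage


@handle_urls("toscrape.com", instead_of=ProductPage)
@attrs.define
class ToScrapeComProductPage(BaseProductPage):
    # Workaround sketch (assumption): this response is never parsed, it only
    # forces an up-front Zyte API request so the session gets created first.
    response: BrowserResponse
    downloader: HttpClient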

Gallaecio linked a pull request (#246) on Jan 29, 2025 that will close this issue.