Add crawler for Google's CrUX top 1M per country #160

Merged: 5 commits, Dec 11, 2024

romain-fontugne

This adds Google's CrUX top 1M as discussed here: #153

The modeling is slightly different: I simplified the properties of the RANK relationship, which now includes only the rank and the origin.

How Has This Been Tested?

Tested locally. A full run took over 1.5 hours, which is expected for a dataset this large: it contains 200+ top-1M lists.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

- Create everything in one batch
- Fix: reference time now points to the first day of the month at
  midnight
- Fix: reference time needs to be a datetime object, not Arrow
- Fix: fetch the newest data, not the oldest
- Make data URL more precise
- Add country code to RANK relationship
- Use itertuples instead of iterrows since it is faster
- Add COUNTRY relationship check to unit test
- Change log level to INFO
- Fix typos and consistency in README
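
The two reference-time fixes in the list above (first day of the month at midnight, stored as a plain datetime object rather than an Arrow object) can be sketched as follows. This is a minimal illustration, not the crawler's actual code: the function name `first_of_month_midnight` and the use of UTC are assumptions made for the example.

```python
from datetime import datetime, timezone


def first_of_month_midnight(now: datetime) -> datetime:
    """Return midnight on the first day of the month containing `now`.

    Hypothetical helper: `now` must be timezone-aware; the result is
    a standard-library datetime expressed in UTC (an assumption here).
    """
    if now.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Normalize to UTC first, then truncate to the start of the month.
    utc_now = now.astimezone(timezone.utc)
    return utc_now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


if __name__ == "__main__":
    reference_time = first_of_month_midnight(datetime.now(timezone.utc))
    print(reference_time.isoformat())
```

Returning a standard-library `datetime` keeps the value usable by code that does not know about Arrow types.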

@m-appel m-appel left a comment

I did some refactoring and fixes, and also added the country code to the RANK relationship for easier queries. Merging now.

@m-appel m-appel merged commit 72358a4 into InternetHealthReport:main Dec 11, 2024
1 check passed