Skip to content
This repository has been archived by the owner on Aug 4, 2023. It is now read-only.

Use Python to group items by license to speed up the query #1045

Merged
merged 4 commits into from
Mar 16, 2023

Conversation

obulat
Copy link
Contributor

@obulat obulat commented Mar 13, 2023

Fixes

Fixes WordPress/openverse#1270

Description

This PR updates the second step in the DAG to replace many SELECT queries with SELECTing all items with NULL in meta_data, then grouping of the items to update using Python function, and then running one UPDATE query per license pair (if necessary). Hopefully, this should be much faster than the previous run.

This PR also sets the trigger rule for the last step, because otherwise it was skipped if the second step was skipped.

Testing instruction

Same as #1005.

@obulat obulat requested a review from a team as a code owner March 13, 2023 14:17
@obulat obulat requested review from krysal and stacimc March 13, 2023 14:17
@openverse-bot openverse-bot added 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work and removed 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository labels Mar 13, 2023
@obulat obulat added 🟨 priority: medium Not blocking but should be addressed soon 🛠 goal: fix Bug fix 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🧱 stack: catalog Related to the catalog and Airflow DAGs 💻 aspect: code Concerns the software code in the repository and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work 🛠 goal: fix Bug fix 🧰 goal: internal improvement Improvement that benefits maintainers, not users labels Mar 13, 2023
@obulat obulat self-assigned this Mar 13, 2023
Copy link
Contributor

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got a few comments for format improvements, but I was able to test this locally and it ran great!

openverse_catalog/dags/maintenance/add_license_url.py Outdated Show resolved Hide resolved
updated_by_license = {}

for license_pair, identifiers in records_to_update.items():
license_, license_version = license_pair.split(",")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're saving this as a string only to split it out again, we could probably use the license & license version pair in tuple form as the key itself, e.g. records_to_update[license_, license_version]. Then we can unpack it directly here too, e.g. for (license_, license_version), identifiers in records_to_update.items().

openverse_catalog/dags/maintenance/add_license_url.py Outdated Show resolved Hide resolved
openverse_catalog/dags/maintenance/add_license_url.py Outdated Show resolved Hide resolved
@obulat obulat requested a review from AetherUnbound March 15, 2023 13:38
Copy link
Contributor

@AetherUnbound AetherUnbound left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks and works great!

Copy link
Contributor

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the late review -- this looks excellent!

I have been doing Flickr testing locally and happened to have ~1.55 million records locally 😅 I set the meta_data to null for all of them and the DAG ran successfully in just over six minutes :)

Screen Shot 2023-03-15 at 11 49 43 AM

@AetherUnbound
Copy link
Contributor

Heh, what a great test set to have! 😄

@obulat
Copy link
Contributor Author

obulat commented Mar 16, 2023

Apologies for the late review -- this looks excellent!

I have been doing Flickr testing locally and happened to have ~1.55 million records locally 😅 I set the meta_data to null for all of them and the DAG ran successfully in just over six minutes :)

Screen Shot 2023-03-15 at 11 49 43 AM

Thank you for your testing, it's more than what I could hope for, @stacimc 😆 I would not have guessed that the most popular license would be by-nd-nc/2.0/jp

@obulat obulat merged commit f6de0e3 into main Mar 16, 2023
@obulat obulat deleted the fix/add_license_url_dag branch March 16, 2023 01:54
@obulat
Copy link
Contributor Author

obulat commented Mar 16, 2023

I ran the DAG in prod, and this is the log output:

[2023-03-16, 03:45:26 UTC] {slack.py:321} INFO - 
`add_license_url` DAG run completed.
Update statistics:
https://creativecommons.org/licenses/by-nc-sa/3.0/: 6318 rows
https://creativecommons.org/licenses/by-nd/3.0/: 201 rows
https://creativecommons.org/licenses/by-sa/3.0/: 1545 rows
https://creativecommons.org/licenses/by-nc/3.0/: 2607 rows
https://creativecommons.org/publicdomain/zero/1.0/: 2919 rows
https://creativecommons.org/publicdomain/mark/1.0/: 374 rows
https://creativecommons.org/licenses/by-nc-sa/1.0/: 166 rows
https://creativecommons.org/licenses/by-nc-sa/2.0/: 2165 rows
https://creativecommons.org/licenses/by-nc-sa/2.5/: 14 rows
https://creativecommons.org/licenses/by-sa/2.5/: 25 rows
https://creativecommons.org/licenses/by-sa/2.0/: 213 rows
https://creativecommons.org/licenses/by-nc/4.0/: 26 rows
https://creativecommons.org/licenses/by-sa/4.0/: 7 rows
https://creativecommons.org/licenses/by-nd-nc/2.0/jp/: 98 rows
https://creativecommons.org/licenses/by-nd/2.0/: 16 rows
https://creativecommons.org/licenses/by-sa/1.0/: 1 rows

Now, there are 9382 records with NULL meta_data left.

From the logs, here are some problems with these 9382 records:

  1. Incorrectly capitalized license name: "CC0" instead of "cc0", "PDM" instead of "pdm"
[2023-03-16, 03:06:40 UTC] {add_license_url.py:102} INFO - No license pair (CC0, 1.0) in the license map.
[2023-03-16, 03:06:47 UTC] {add_license_url.py:102} INFO - No license pair (PDM, 1.0) in the license map.
  1. Incorrect version number: CC0 and PDM only have version 1.0
[2023-03-16, 03:06:43 UTC] {add_license_url.py:102} INFO - No license pair (cc0, 3.0) in the license map.
[2023-03-16, 03:06:43 UTC] {add_license_url.py:102} INFO - No license pair (pdm, 3.0) in the license map.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add_license_url DAG is inefficient and fails due to timeout
4 participants