-
Notifications
You must be signed in to change notification settings - Fork 216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modify add_license_url
DAG for more specific null check
#4124
Conversation
a30e6e9
to
5e32387
Compare
5e32387
to
377c8f1
Compare
4418080
to
1f27849
Compare
Full-stack documentation: https://docs.openverse.org/_preview/4124 Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again. You can check the GitHub pages deployment action list to see the current status of the deployments. Changed files 🔄: |
I am going to test this PR locally now, but wanted to leave a couple of notes about the licenses in the catalog. I think we still have some items with There's also an invalid pair here: PDM does not have a 4.0 version, only 1.0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DAG runs well locally. I was surprised that the SELECT query in prod was so quick!
I would love to see more detailed reporting of the results: the list of license pairs and updated item count for each, and the same for the invalid license pair. But I don't want to block on this.
select_query = dedent(""" | ||
SELECT license, license_version, count(identifier) | ||
FROM image WHERE meta_data->>'license_url' IS NULL | ||
GROUP BY license, license_version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice use of GROUP BY
SQL power :)
Co-authored-by: Olga Bulat <[email protected]>
I queried before and this is not the case anymore! openledger> SELECT count(*) FROM image WHERE meta_data IS NULL;
+-------+
| count |
|-------|
| 0 |
+-------+
SELECT 1
Time: 2154.197s (35 minutes 54 seconds), executed in: 2154.180s (35 minutes 54 seconds)
Interesting. Currently, that pair will be skipped, but certainly, we should look to fix it 🤔 I'll create an issue for that.
Same! That was very useful to get more information on the problem.
Admittedly, I was lazy on the final report, given that the distribution to change is specified at the beginning of the DAG, and was expecting it to return 0 as the pending after finishing 😆 Let me see if it's easy to add. |
Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR: @dhruvkb Excluding weekend1 days, this PR was ready for review 4 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s)2. @krysal, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worked locally for me!
Fixes
Fixes #3885 by @krysal
Description
This PR modifies the existing DAG submitted by @obulat to correct all the rows without a
license_url
in theirmeta_data
field. Yesterday, I queried upstream DB and the distribution of rows to fix was the following:This version does not interact with AWS S3; given the low number of license-version combinations, it's not necessary. Instead, if some failed pairs are found, it's notified at the end.
Testing Instructions
Spin up the catalog stack,
just catalog/up
, and remove the targeted field from some rows. You could try the next update statement on the sample data.Then, run the dag and observe the original
meta_data
is preserved with the addition of the correspondinglicense_url
.To observe the failed pairs alert task being run, you can simply modify the
license_version
of one of the rows to a non-existing one with a high number. The individual mapped task for it will be skipped.Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin