Reharvest collections to pick up date/decade enrichments: non-Nuxeo sources + Nuxeo sources (check w/ campuses) #1107

Open
christinklez opened this issue Sep 3, 2024 · 1 comment

christinklez commented Sep 3, 2024

==

@christinklez christinklez changed the title Apply decade enrichments to harvested collections Apply decade enrichments to harvested collections - reharvest collections Sep 9, 2024
@aturner aturner changed the title Apply decade enrichments to harvested collections - reharvest collections Reharvest collections to pick up date/decade enrichments: non-Nuxeo sources + Nuxeo sources (check w/ campuses) Sep 9, 2024

amywieliczka commented Sep 9, 2024

| | rikolti-prd | rikolti-stg |
|---|---|---|
| total | 2,137,124 records | 2,146,265 records |
| ETLed w/ date data | 1,603,517 records (75%) | 1,610,558 records (75%) |
| Harvested w/out date data | 533,607 records (25%) | 535,707 records (25%) |
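
The shares above can be re-derived from the raw counts. A minimal Python sketch, using only the figures reported in this comment; the helper name `share` and the dictionary keys are illustrative assumptions, not part of Rikolti:

```python
def share(part: int, total: int) -> int:
    """Return part/total as a whole-number percentage, rounded to nearest."""
    return round(100 * part / total)


# Record counts copied from the table above.
counts = {
    "rikolti-prd": {"total": 2_137_124, "with_date": 1_603_517, "without_date": 533_607},
    "rikolti-stg": {"total": 2_146_265, "with_date": 1_610_558, "without_date": 535_707},
}

for index_name, c in counts.items():
    # The two buckets should partition the total.
    assert c["with_date"] + c["without_date"] == c["total"]
    print(
        index_name,
        f"{share(c['with_date'], c['total'])}% with date data,",
        f"{share(c['without_date'], c['total'])}% without",
    )
```

Both indexes come out at 75% with date data and 25% without, matching the table.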

Homing in on the 25% of records without date data:

| | rikolti-prd | rikolti-stg |
|---|---|---|
| w/out date data | 533,607 records | 535,707 records |
| w/ version_path | 194,201 records (36%) | 160,437 records (30%) |
| w/out version_path | 339,406 records (64%) | 375,270 records (70%) |

Described in collections:

| | rikolti-prd | rikolti-stg |
|---|---|---|
| w/out date data | 1168 collections | 1172 collections |
| w/ version_path | 439 collections (37%) | 412 collections (35%) |
| w/out version_path | 729 collections (62%) | 760 collections (65%) |

So, among collections missing date data, we know which vernacular version was run through the pipeline and published for 37% of the published (rikolti-prd) collections and 35% of the staged (rikolti-stg) collections.

Homing in on the 62-65% of collections without version paths:

| | rikolti-prd | rikolti-stg |
|---|---|---|
| w/out version_path | 729 collections | 760 collections |
| w/ one vernacular version in s3 | 3 collections | 30 collections |
| w/ many vernacular versions in s3 | 726 collections | 720 collections |

So we can infer which vernacular version was run through the pipeline and published for 3 more published collections and 30 more staged collections because there is only one vernacular version stored in s3.
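
The inference rule described above (a single vernacular version in s3 means that version must be the one that was published) can be sketched as follows. This is only an illustration under stated assumptions: the function name `infer_version_path`, the collection ids, and the version strings are all hypothetical, and the real s3 layout may differ.

```python
from typing import Optional


def infer_version_path(vernacular_versions: list[str]) -> Optional[str]:
    """Return the sole vernacular version if exactly one exists, else None.

    With zero or several versions there is no way to tell which one was
    run through the pipeline, so nothing is inferred.
    """
    if len(vernacular_versions) == 1:
        return vernacular_versions[0]
    return None


# Hypothetical listing: collection id -> vernacular versions found in s3.
versions_in_s3 = {
    "collection-a": ["version-1"],
    "collection-b": ["version-1", "version-2"],
}

inferred = {
    collection_id: infer_version_path(versions)
    for collection_id, versions in versions_in_s3.items()
}
```

Here `inferred["collection-a"]` is `"version-1"`, while `inferred["collection-b"]` is `None` because the choice between two versions is ambiguous.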

Nuxeo Analysis

| | rikolti-prd | rikolti-stg |
|---|---|---|
| total | 2,137,124 records | 2,146,265 records |
| Nuxeo w/out date data | 124,089 records (6%) | 124,213 records (6%) |

Homing in on the 6% of records that come from Nuxeo and lack date data:

| | rikolti-prd | rikolti-stg |
|---|---|---|
| Nuxeo w/out date data | 124,089 records | 124,213 records |
| Nuxeo w/ version_path | 2,335 records | 1,921 records |
| Nuxeo w/out version_path | 121,754 records | 122,292 records |

Described in collections:

| | rikolti-prd | rikolti-stg |
|---|---|---|
| Nuxeo w/out date data | 303 collections | 306 collections |
| Nuxeo w/ version_path | 13 collections | 14 collections |
| Nuxeo w/out version_path | 290 collections | 292 collections |

Homing in on the Nuxeo collections without version paths (290 in rikolti-prd, 292 in rikolti-stg):

| | rikolti-prd | rikolti-stg |
|---|---|---|
| Nuxeo w/out version_path | 290 collections | 292 collections |
| Nuxeo w/ one vernacular version in s3 | 0 collections | |
| Nuxeo w/ many vernacular versions in s3 | 290 collections | |

So we can't infer a version path for any of the 290 collections without version paths.
