Ghost records #162

Open
kkdavis14 opened this issue Nov 25, 2024 · 4 comments
Labels: bug (The code does not behave as expected / designed), challenge (This issue is hard!)

Comments

@kkdavis14
Contributor

Pipeline is losing some Agent records, which are being reidentified but not linked together properly.

Example:
This object:
https://lux.collections.yale.edu/view/object/ccca43ea-1fd7-4449-9f3f-fb026edf7b07

was published by Martinus van den Enden:
(ycba rec vended)
https://ycba-lux.s3.amazonaws.com/v3/person/a4/a4d1963c-d3cc-4f57-bb49-0204574106ca.json
(lux rec, which returns a 404):
https://lux.collections.yale.edu/data/person/0133a1e2-998e-447b-bd33-657d36941876

There's a live Martinus van den Enden in LUX:
https://lux.collections.yale.edu/view/person/e2990454-a285-4b92-bb4f-dcd8b62a344b

which doesn't have the YCBA as a contributor.

Brent to attach a list of 65 unique missing agents with this issue.

kkdavis14 added the "bug" label on Nov 25, 2024
@brent-hartwig

brent-hartwig commented Nov 25, 2024

dt-162-ghost-agents-report.xlsx contains three tabs:

  1. Unique Producers (item producers and work creators): The "Unique: Combined" column contains the unique values of the other two visible columns. The other two visible columns are the unique producers/creators from the other two tabs.
  2. Started with Items Report: provides the unique producer, item, set, curator, and unit combinations. The same producer may appear in multiple rows.
  3. Started with Works Report: same as above but also identifies the work.

Due to the amount of data in play, dt-162-ghost-agents-query.js.txt had to be run in three modes. The list numbers do not correlate to the above list numbers.

  1. Set startWithItems to true.
  2. Set startWithItems to false, worksOffset to 0, and worksLimit to 10000000.
  3. Set startWithItems to false, worksOffset to 10000000, and worksLimit to 11000000. There were about 20.7m rows.
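As a rough sanity check on the three modes above, the following sketch confirms that the two works windows together span the reported row count. It assumes `worksLimit` is a row count (as in an offset/limit pair) rather than an end index; the `runModes` array and the `worksWindowsCover` helper are hypothetical and are not part of the attached query.

```javascript
// Hypothetical run configurations mirroring the three modes listed above.
// Assumption: worksLimit is a row count, not an end index.
const runModes = [
  { startWithItems: true },
  { startWithItems: false, worksOffset: 0, worksLimit: 10000000 },
  { startWithItems: false, worksOffset: 10000000, worksLimit: 11000000 },
];

// Returns true when the works windows, taken in offset order,
// cover rows [0, totalRows) without leaving a gap.
function worksWindowsCover(modes, totalRows) {
  const windows = modes
    .filter((m) => !m.startWithItems)
    .sort((a, b) => a.worksOffset - b.worksOffset);
  let covered = 0;
  for (const w of windows) {
    if (w.worksOffset > covered) return false; // gap before this window
    covered = Math.max(covered, w.worksOffset + w.worksLimit);
  }
  return covered >= totalRows;
}

// About 20.7m rows were reported for the works-based runs.
console.log(worksWindowsCover(runModes, 20700000));
```

Under that assumption the two windows reach row 21,000,000, which is enough for roughly 20.7m rows.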

@clarkepeterf and @azaroth42, below is the technique that was used to find producer IRIs that appear in the triple store but do not match the URI of any document. The starter plan included a producer column holding either the item's agent of production or the work's agent of creation.

starterPlan
  .notExistsJoin(
    op.fromLexicons({ iri: cts.iriReference() }),
    op.on(producer, op.col('iri'))
  )

Because the above does not also incorporate the URI lexicon, I'm led to believe the IRI lexicon is populated by the IRIs of the documents in the database, as opposed to all IRIs in the triple store.
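For readers less familiar with Optic, the anti-join idea behind the snippet above can be illustrated in plain JavaScript. This is only a sketch of the set logic, not the attached query: `findGhostProducers`, the row shape, and the example.org IRIs are all made up for illustration.

```javascript
// Plain-JavaScript illustration of an anti-join:
// keep producers that have no match in the set of known IRIs.
// Every IRI below is a made-up placeholder, not real LUX data.
function findGhostProducers(rows, knownIris) {
  const known = new Set(knownIris);
  const ghosts = new Set(); // a Set de-duplicates repeated producers
  for (const row of rows) {
    if (!known.has(row.producer)) ghosts.add(row.producer);
  }
  return [...ghosts];
}

const rows = [
  { item: "item-1", producer: "https://example.org/person/a" },
  { item: "item-2", producer: "https://example.org/person/b" },
  { item: "item-3", producer: "https://example.org/person/b" },
];
const knownIris = ["https://example.org/person/a"];

// Only person/b lacks a matching IRI, so it is the one "ghost".
console.log(findGhostProducers(rows, knownIris));
```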

See the attached query for additional context/details.

@kkdavis14
Contributor Author

kkdavis14 commented Jan 15, 2025

Apologies for coming back to this two months later. I blame the holidays.
The Martinus van den Enden example is failing to merge with the "real" one because of collector issues. The dates YCBA gives for his birth and death fail the comparison against the birth and death dates of the equivalent URIs (the collector allows a 10-year difference, and these differ by more than that). I don't know why the LUX record for Martinus on his own then fails to get built and returns the 404, but reconciliation is the root cause of why the two Martinuses are not merging.
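To make the tolerance rule concrete, here is a minimal sketch of a 10-year comparison. Only the 10-year figure comes from this thread; the function names, the record shape, the treatment of missing years, and the sample years are assumptions for illustration and do not reflect the actual collector code or the real dates involved.

```javascript
// Hypothetical tolerance check. The 10-year figure is from this thread;
// everything else (names, record shape) is assumed for illustration.
const MAX_YEAR_DIFFERENCE = 10;

function yearsAgree(a, b, tolerance = MAX_YEAR_DIFFERENCE) {
  // Assumption: a missing year on either side cannot contradict the other.
  if (a == null || b == null) return true;
  return Math.abs(a - b) <= tolerance;
}

// Two records are treated as compatible only if both birth and death agree.
function datesCompatible(local, external) {
  return (
    yearsAgree(local.birthYear, external.birthYear) &&
    yearsAgree(local.deathYear, external.deathYear)
  );
}

// Made-up years: the death years differ by 46, so the match is rejected.
const unitRecord = { birthYear: 1600, deathYear: 1700 };
const externalRecord = { birthYear: 1605, deathYear: 1654 };
console.log(datesCompatible(unitRecord, externalRecord));
```

A rejected comparison like this one would leave the two records unmerged, which matches the symptom described above.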

I'm going to check the rest of the missing People to see if it's the same issue. Again, haven't deduced the 404 issue yet.

kkdavis14 added the "challenge" label on Jan 15, 2025
@kkdavis14
Contributor Author

kkdavis14 commented Jan 15, 2025

update after some research: the majority of the 65 ghost People records are date issues (57 records, 56 of which are YCBA, one is YUL). The rest are a mixture of "I see no reason why these are not merging" and "there's no contributing records and this shouldn't exist" and "one supremely bizarre mystery". None of this research solves the problem of why these are returning 404s, because they should still be creating records, even with the wonky data issues.

Breaking them down below:

No contributing records

https://lux.collections.yale.edu/data/person/32be3d65-95e7-4d2a-a58d-a5f047d24498
https://lux.collections.yale.edu/data/person/9a037cc4-5528-4bed-83b3-e6b8bd3cac73

These two only have timestamps in their idmaps (both from December runs, not the most recent January run) and no contributing records. I can only assume they are not meant to exist and will go away.

Why aren't these merging?
https://lux.collections.yale.edu/data/person/48b2b15f-e642-4918-b81c-8ace45a5eb53
https://lux.collections.yale.edu/data/person/78501394-cd2e-4a98-9971-f405936479ad
https://lux.collections.yale.edu/data/person/b16b546b-8ba3-4c49-a2a4-37c21296ea64
https://lux.collections.yale.edu/data/person/8ae8f9bb-f7ad-43b6-9737-0258b13dd14b
https://lux.collections.yale.edu/data/person/f519bdaa-2480-49c4-9095-6bd6636bf373

These should be merging. They're all YUL contributing records, with LC equivalents in the YUL URI data. The "real" LUX records for these people ALSO have that LC. There are no timespans that could cause the reconciliation/merging to be tossed out. So, I have no idea what's happening here.

One supremely bizarre mystery
https://lux.collections.yale.edu/data/person/9328d8f2-f2e7-4f77-a48f-f7074cb9bf27

This looks like a great record in the idmap: tons of equivalents and two contributing YUL URIs. One of those YUL URIs actually also belongs to the "real" LUX record. The other seems to be contributing to an overmerge issue in this record, because it has the wrong VIAF equivalent. I created a unit-data ticket for that fix. However, I don't know if that will solve the problem.

Timespan mismatches causing the collector to throw out the equivalencies during reconciliation
This is the remainder of the URIs, all of which are YCBA People except one YUL. I asked YUL to fix their timespan, as it's clearly a typo. The YCBA ones, I am not sure what to do about. They seem to be 100-year timespans which disagree enough with the external equivalents to get thrown out. Obviously the YCBA timespans are not accurate, but they probably aren't meant to be; this is just their "estimation". This is a case where the pipeline assumes that the information given to us is "correct" :/ Ideally they wouldn't send us timespans if they weren't sure of the dates, but instead they generally send us everything they have, with notes about ambiguity (which is a truly art historical/museum-y practice). This has come up many times before and is not a hill worth dying on.

What is truly weird, pipeline-wise, is that in some cases (if not all, did not check them all), information from the suppressed YCBA records is still ending up in the "real" LUX records.
example:
the real LUX record for William Eliot:
https://lux.collections.yale.edu/data/person/ec7486f8-243d-4c15-b187-44145f4fcc98

has the death date from the contributing Wikidata record, but the birth from the YCBA:
https://ycba-lux.s3.amazonaws.com/v3/person/94/94f41d04-60e9-4079-992e-2c486c5f69e6.json

That YCBA record is NOT attached to the real record, and a casual observer would have no way of knowing it exists.
It is attached to the "bad" 404 LUX record:
https://lux.collections.yale.edu/data/person/04dd8871-0eea-4b8b-b3c2-8e1be2e91d67

So that seems like a real bug, but a challenge.
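One way to hunt for this kind of leak is to compare where each value in a merged record came from against the list of contributors actually attached to it. The sketch below assumes a hypothetical record shape in which every merged value carries a `source` URI; real LUX records may not track this, and `findUnattachedSources` and the example.org URIs are invented for illustration.

```javascript
// Hypothetical record shape: each merged value remembers which source
// URI supplied it. Real LUX records may not carry this information.
function findUnattachedSources(mergedRecord) {
  const attached = new Set(mergedRecord.contributors);
  return mergedRecord.values
    .filter((v) => !attached.has(v.source))
    .map((v) => ({ field: v.field, source: v.source }));
}

// Placeholder data: the birth value comes from a source that is not
// listed among the record's contributors.
const merged = {
  contributors: ["https://example.org/wikidata/q1"],
  values: [
    { field: "death", source: "https://example.org/wikidata/q1" },
    { field: "birth", source: "https://example.org/unit/person-1" },
  ],
};

// Reports the birth field, whose source is not an attached contributor.
console.log(findUnattachedSources(merged));
```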

csv with my research
missing.csv

@kkdavis14
Contributor Author

kkdavis14 commented Jan 15, 2025

tldr: This is a challenge. Some possible to-dos are below, but a longer-term plan to refactor all this code may help as well.

Unit-data:

Pipeline tasks:

  • Investigate the "why aren't they merging" cases to try to solve that riddle.

  • Investigate why info from YCBA records is ending up in LUX records without the YCBA rec attached. This, at least, should not be happening.
