Draft proposal for connecting indexing status to SDR workflows #47

Open · wants to merge 1 commit into main from index-reporting-proposal
Conversation

thatbudakguy

No description provided.

@thatbudakguy force-pushed the index-reporting-proposal branch from 857ca8e to 224925d on April 10, 2024 at 21:30
@jcoyne (Contributor) left a comment

I feel pretty strongly that indexing is not part of the workflow. We may index an object many times, and not all of those times are because it's been pushed through a workflow. Furthermore, coupling to workflows means that further accessioning/preservation work would block until the indexing is complete.

Have you considered just exposing the SDR indexing event in a better interface?

@thatbudakguy (Author) commented Apr 10, 2024

I might not have a good enough grasp of SDR yet to know whether to characterize it as a workflow thing... my understanding is that both accessionWF and releaseWF call DSA to do the "publishing" of data, which is what results in indexing. When the PublishJob runs, it seems to report success based on whether it successfully notified purl-fetcher of the intention to index something.

Right now, accessioneers need a way to group and find all of the items that don't actually get indexed, even though their state in Argo indicates that they have been. The most logical way to do that seemed to be to prevent the workflows that do publishing from completing unless the publishing actually resulted in the item getting indexed. Then the "released to..." facet would be an indicator of whether an item actually made it to that platform.

In option 2 of the proposal, we could stop short of connecting the indexing back to the workflow: we would call back to purl-fetcher from the indexer to update the ReleaseTag with the most recent indexing status, but what we do after that is kind of up to us. Do we dispatch some other kind of event if the indexing failed? Do we allow accessioneers to just query purl-fetcher directly to get a list of things that failed to index?
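
Purely as a sketch, that callback could look something like the following — the endpoint path, payload shape, and PURL_FETCHER_URL variable are all assumptions for illustration, not purl-fetcher's actual API:

    # Hypothetical sketch: after indexing a druid, the indexer reports the
    # outcome back to purl-fetcher so it can update the ReleaseTag with the
    # most recent indexing status. Endpoint and payload are made up.
    require 'faraday'
    require 'time'

    def report_indexing_status(druid:, target:, success:, error: nil)
      conn = Faraday.new(url: ENV.fetch('PURL_FETCHER_URL'))
      conn.patch("/v1/released/#{druid}") do |req|
        req.headers['Content-Type'] = 'application/json'
        req.body = {
          target: target,                         # e.g. 'Searchworks'
          status: success ? 'indexed' : 'failed',
          indexed_at: Time.now.utc.iso8601,
          error: error
        }.to_json
      end
    end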

@thatbudakguy (Author)

> Have you considered just exposing the SDR indexing event in a better interface?

I think this is kind of making the SDR events service load-bearing in a way that it wasn't intended to be, although that could be ok. The issue is that I think that interface needs to support faceted search, so that accessioneers can drill down into the collections they care about before faceting on things that failed to index. And...that's what Argo is already good at doing!

@justinlittman (Contributor)

If purl-fetcher exposed an API for querying the indexing status of an item, we could display it on the Argo item display page.

Or, if purl-fetcher exposed an API for querying the indexing status of an item and published a RabbitMQ message for indexing success/failure, we could display it on the Argo item display page and provide it as an Argo facet.
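
A minimal sketch of the RabbitMQ half of that idea, using the Bunny gem (the exchange name, routing key, and payload shape are assumptions for illustration):

    # Hypothetical sketch: purl-fetcher announces indexing outcomes on a topic
    # exchange so Argo (or any other consumer) can react to them.
    require 'bunny'
    require 'json'

    connection = Bunny.new(ENV.fetch('RABBITMQ_URL'))
    connection.start
    channel = connection.create_channel
    exchange = channel.topic('sdr.indexing', durable: true)

    payload = { druid: 'druid:bc123df4567', target: 'Searchworks', status: 'failed' }
    exchange.publish(payload.to_json, routing_key: "indexing.#{payload[:status]}")
    connection.close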

@jcoyne (Contributor) commented Apr 11, 2024

> If purl-fetcher exposed an API for querying the indexing status of an item,

This would require a much more robust deployment for purl-fetcher as the number of requests handled would go up by several orders of magnitude.

@thatbudakguy (Author)

Could we instead reach into Argo's index and update it directly from purl-fetcher? I'm thinking of GeoMonitor, which sort of works that way (a rough sketch of the pattern follows the list):

  • it runs as a background job, checking what items can be successfully visited in a blacklight app
  • it keeps track of how often a page can be successfully loaded and creates an "availability score"
  • it stores and updates the score as a single numeric field on the item's solr document
  • the availability score field can be used to facet the items in the blacklight app
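
Roughly, that pattern could look like this, assuming an RSolr connection to the Blacklight app's index (the class, field names, and record_check helper are illustrative, not GeoMonitor's actual code):

    # Hypothetical sketch: a background job probes each item, tracks a rolling
    # availability score, and writes it into the item's existing Solr document
    # via an atomic update that touches only the score field.
    require 'net/http'
    require 'json'
    require 'rsolr'

    class AvailabilityCheckJob
      SOLR = RSolr.connect(url: ENV.fetch('SOLR_URL'))

      def perform(item)
        ok = Net::HTTP.get_response(URI(item.public_url)).is_a?(Net::HTTPSuccess)
        score = item.record_check(ok) # e.g. percentage of recent successful loads

        SOLR.update data: [{ id: item.id, availability_score_f: { set: score } }].to_json,
                    headers: { 'Content-Type' => 'application/json' }
      end
    end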

@jcoyne (Contributor) commented Apr 11, 2024

@thatbudakguy We could do atomic updates, but I think that requires all fields to be "stored" in Solr. I am not sure if that is the case for the Argo index. If that is done, I don't think that code belongs in purl-fetcher, but a RabbitMQ message could help some other process to do that.

@thatbudakguy (Author) commented Apr 11, 2024

From what I can tell, Earthworks (which uses GeoMonitor) doesn't have all fields stored, but is instead doing:

    # Atomic update: sets only the availability score field, matching the
    # existing document by its slug.
    data = [{
      layer_availability_score_f: { set: layer.availability_score },
      layer_slug_s: layer.slug
    }]
    Indexer.new.solr_update(data)

Which looks like in-place updates, where only the fields you're updating have to be stored. So that might make it easier?

edit: maybe it does use only stored fields??

@jcoyne (Contributor) commented Apr 11, 2024

@thatbudakguy my understanding is that "in-place" updates are for non-indexed fields, which wouldn't allow driving a facet.
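
For reference, Solr only supports in-place updates on fields that are docValues-only (single-valued, indexed="false", stored="false", docValues="true", and not the target of a copyField). A sketch of defining such a field through the Schema API — the core name and field name here are placeholders:

    # Hypothetical sketch: adding a docValues-only numeric field that would
    # qualify for in-place updates. Core name and field name are made up.
    require 'net/http'
    require 'json'

    uri = URI('http://localhost:8983/solr/argo/schema')
    body = {
      'add-field' => {
        'name' => 'indexing_status_score',
        'type' => 'pint',
        'indexed' => false,   # required for in-place updates
        'stored' => false,    # required for in-place updates
        'docValues' => true
      }
    }
    Net::HTTP.post(uri, body.to_json, 'Content-Type' => 'application/json')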
