Draft proposal for connecting indexing status to SDR workflows #47

Open · wants to merge 1 commit into main from index-reporting-proposal
Conversation

thatbudakguy

No description provided.

@thatbudakguy force-pushed the index-reporting-proposal branch from 857ca8e to 224925d on April 10, 2024 at 21:30
@jcoyne (Contributor) left a comment

I feel pretty strongly that indexing is not part of the workflow. We may index an object many times, and not all of those times are because it's been pushed through a workflow. Furthermore, coupling to workflows means that further accessioning/preservation work would block until the indexing is complete.

Have you considered just exposing the SDR indexing event in a better interface?

@thatbudakguy (Author) commented Apr 10, 2024

I might not have a good enough grasp of SDR yet to know whether to characterize it as a workflow thing... my understanding is that both accessionWF and releaseWF call DSA to do the "publishing" of data, which is what results in indexing. When the PublishJob runs, it seems to report success based on whether it successfully notified purl-fetcher of the intention to index something.

Right now, accessioneers need a way to group and find all of the items that don't actually get indexed, even though their state in Argo indicates that they have been. The most logical way to do that seemed to be to prevent the workflows that do publishing from completing unless the publishing actually resulted in the item getting indexed. Then the "released to..." facet would be an indicator of whether an item actually made it to that platform.

In option 2 of the proposal, we could stop short of connecting the indexing back to the workflow: we would call back to purl-fetcher from the indexer to update the ReleaseTag with the most recent indexing status, but what we do after that is kind of up to us. Do we dispatch some other kind of event if the indexing failed? Do we allow accessioneers to just query purl-fetcher directly to get a list of things that failed to index?
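
Purely as a sketch, that callback could look something like the following — the endpoint path, payload shape, and PURL_FETCHER_URL variable are all assumptions for illustration, not purl-fetcher's actual API:

    # Hypothetical sketch: after indexing a druid, the indexer reports the
    # outcome back to purl-fetcher so it can update the ReleaseTag with the
    # most recent indexing status. Endpoint and payload are made up.
    require 'faraday'
    require 'time'

    def report_indexing_status(druid:, target:, success:, error: nil)
      conn = Faraday.new(url: ENV.fetch('PURL_FETCHER_URL'))
      conn.patch("/v1/released/#{druid}") do |req|
        req.headers['Content-Type'] = 'application/json'
        req.body = {
          target: target,                         # e.g. 'Searchworks'
          status: success ? 'indexed' : 'failed',
          indexed_at: Time.now.utc.iso8601,
          error: error
        }.to_json
      end
    end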

@thatbudakguy (Author)

> Have you considered just exposing the SDR indexing event in a better interface?

I think this is kind of making the SDR events service load-bearing in a way that it wasn't intended to be, although that could be ok. The issue is that I think that interface needs to support faceted search, so that accessioneers can drill down into the collections they care about before faceting on things that failed to index. And...that's what Argo is already good at doing!

@justinlittman (Contributor)

If purl-fetcher exposed an API for querying the indexing status of an item, we could display it on the Argo item display page.

Or, if purl-fetcher exposed an API for querying the indexing status of an item and published a RabbitMQ message for indexing success/failure, we could display it on the Argo item display page and provide it as an Argo facet.
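
A minimal sketch of the RabbitMQ half of that idea, using the Bunny gem (the exchange name, routing key, and payload shape are assumptions for illustration):

    # Hypothetical sketch: purl-fetcher announces indexing outcomes on a topic
    # exchange so Argo (or any other consumer) can react to them.
    require 'bunny'
    require 'json'

    connection = Bunny.new(ENV.fetch('RABBITMQ_URL'))
    connection.start
    channel = connection.create_channel
    exchange = channel.topic('sdr.indexing', durable: true)

    payload = { druid: 'druid:bc123df4567', target: 'Searchworks', status: 'failed' }
    exchange.publish(payload.to_json, routing_key: "indexing.#{payload[:status]}")
    connection.close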

@jcoyne (Contributor) commented Apr 11, 2024

> If purl-fetcher exposed an API for querying the indexing status of an item,

This would require a much more robust deployment for purl-fetcher as the number of requests handled would go up by several orders of magnitude.

@thatbudakguy (Author)

Could we instead reach into Argo's index and update it directly from purl-fetcher? I'm thinking of GeoMonitor, which sort of works that way (a rough sketch of the pattern follows the list):

  • it runs as a background job, checking what items can be successfully visited in a blacklight app
  • it keeps track of how often a page can be successfully loaded and creates an "availability score"
  • it stores and updates the score as a single numeric field on the item's solr document
  • the availability score field can be used to facet the items in the blacklight app
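
Roughly, that pattern could look like this, assuming an RSolr connection to the Blacklight app's index (the class, field names, and record_check helper are illustrative, not GeoMonitor's actual code):

    # Hypothetical sketch: a background job probes each item, tracks a rolling
    # availability score, and writes it into the item's existing Solr document
    # via an atomic update that touches only the score field.
    require 'net/http'
    require 'json'
    require 'rsolr'

    class AvailabilityCheckJob
      SOLR = RSolr.connect(url: ENV.fetch('SOLR_URL'))

      def perform(item)
        ok = Net::HTTP.get_response(URI(item.public_url)).is_a?(Net::HTTPSuccess)
        score = item.record_check(ok) # e.g. percentage of recent successful loads

        SOLR.update data: [{ id: item.id, availability_score_f: { set: score } }].to_json,
                    headers: { 'Content-Type' => 'application/json' }
      end
    end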

@jcoyne (Contributor) commented Apr 11, 2024

@thatbudakguy We could do atomic updates, but I think that requires all fields to be "stored" in Solr. I am not sure if that is the case for the Argo index. If that is done, I don't think that code belongs in purl-fetcher, but a RabbitMQ message could help some other process to do that.

@thatbudakguy (Author) commented Apr 11, 2024

From what I can tell, Earthworks (which uses GeoMonitor) doesn't have all fields stored, but is instead doing:

    # Atomic update: sets only the availability score field, matching the
    # existing document by its slug.
    data = [{
      layer_availability_score_f: { set: layer.availability_score },
      layer_slug_s: layer.slug
    }]
    Indexer.new.solr_update(data)

Which looks like in-place updates, where only the fields you're updating have to be stored. So that might make it easier?

edit: maybe it does use only stored fields??

@jcoyne (Contributor) commented Apr 11, 2024

@thatbudakguy my understanding is that "in-place" updates are for non-indexed fields, which wouldn't allow driving a facet.
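
For reference, Solr only supports in-place updates on fields that are docValues-only (single-valued, indexed="false", stored="false", docValues="true", and not the target of a copyField). A sketch of defining such a field through the Schema API — the core name and field name here are placeholders:

    # Hypothetical sketch: adding a docValues-only numeric field that would
    # qualify for in-place updates. Core name and field name are made up.
    require 'net/http'
    require 'json'

    uri = URI('http://localhost:8983/solr/argo/schema')
    body = {
      'add-field' => {
        'name' => 'indexing_status_score',
        'type' => 'pint',
        'indexed' => false,   # required for in-place updates
        'stored' => false,    # required for in-place updates
        'docValues' => true
      }
    }
    Net::HTTP.post(uri, body.to_json, 'Content-Type' => 'application/json')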
