
Use shorter interval for updates #350

Closed
TobiasNx opened this issue Sep 12, 2023 · 15 comments

@TobiasNx

A request came from W.G. of UB Münster to use a shorter interval for the updates. Daily updates would not be sufficient for them, and the DNB provides updates every 10 minutes.

Perhaps we should shorten our update interval, even if we do not reach 10 minutes.

@TobiasNx
Author

@acka47 says we have to check whether the DNB offers 10-minute updates as RDF.

@dr0i
Member

dr0i commented Sep 12, 2023

As it turns out, the problem is: the DNB doesn't update its RDF data this frequently, only the MARC-XML, so we would have to write some ETL (where is the DNB's morph? Could we reuse it?). The last comment from @TobiasNx was not yet shown when I made this claim. So go on and check! 👍

@TobiasNx
Author

"

The query period should not reach too far back, to avoid result sets of more than 100,000 records. For non-time-critical processes, the recommended query period/frequency is 30 minutes. For small sets (e.g. online dissertations), harvesting once a day or once a week is sufficient: a record that was changed several times within that period is then fetched only once, and the result set still does not become too large.
We also recommend using the time given in the "responseDate" element, e.g. 2017-08-30T08:12:54Z, as the resumption point ("from"), since this time best reflects the current availability of the data in our repository. In addition, we recommend harvesting with a small temporal overlap ("responseDate" minus one minute = "from").
"

From: https://www.dnb.de/DE/Professionell/Metadatendienste/Datenbezug/OAI/oai_node.html
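The DNB's overlap recommendation quoted above can be sketched in Python (a minimal sketch, assuming timestamps in the Z-suffixed ISO format shown in the quote; the function name `next_from` is hypothetical):

```python
from datetime import datetime, timedelta

def next_from(response_date: str, overlap_minutes: int = 1) -> str:
    """Compute the next OAI-PMH 'from' value: responseDate minus a small overlap."""
    ts = datetime.strptime(response_date, "%Y-%m-%dT%H:%M:%SZ")
    return (ts - timedelta(minutes=overlap_minutes)).strftime("%Y-%m-%dT%H:%M:%SZ")

print(next_from("2017-08-30T08:12:54Z"))  # 2017-08-30T08:11:54Z
```

Harvesting with this small overlap means a few records may be fetched twice, which is harmless since indexing is idempotent per record ID.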

@acka47
Contributor

acka47 commented Sep 13, 2023

Also, Der Linked-Data-Service der Deutschen Nationalbibliothek: Auslieferung der Metadaten reads:

The RDF data can be obtained via the DNB interfaces OAI, SRU, and the Datenshop.
The metadata delivered via these channels is up to date.

So we should just try out shorter update intervals, I guess.

dr0i added a commit that referenced this issue Nov 10, 2023
This follows the DNB's recommendation. It enables a bit of overlap to ensure we get all the data.

See #350 (comment).
dr0i added a commit that referenced this issue Nov 10, 2023
@dr0i dr0i self-assigned this Nov 10, 2023
@dr0i
Member

dr0i commented Nov 10, 2023

If #355 is merged, the cron scheduler can be adjusted to fetch the data, e.g., every 10 minutes.

@dr0i
Member

dr0i commented Nov 10, 2023

As getting data every 10 minutes often results in an empty data set, we are disabling the emails that warn about empty data sets for now. We may want to discuss this further, e.g. implement a daily report, or inspect the OAI-PMH server's response and act on it (i.e. ignore the case where the server reports <error code="noRecordsMatch"/>).
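Detecting the "no records" case mentioned above could be sketched like this (a minimal Python sketch; the function name and the idea of checking the response before warning are illustrative, not the project's actual code):

```python
import xml.etree.ElementTree as ET

# OAI-PMH 2.0 namespace, as defined by the protocol specification
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def is_no_records_match(oai_response_xml: str) -> bool:
    """Return True if the OAI-PMH response reports the 'noRecordsMatch' error."""
    root = ET.fromstring(oai_response_xml)
    error = root.find(f"{OAI_NS}error")
    return error is not None and error.get("code") == "noRecordsMatch"

empty_response = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <error code="noRecordsMatch"/>
</OAI-PMH>"""
print(is_no_records_match(empty_response))  # True
```

With such a check, an empty harvest in a 10-minute window can be silently skipped instead of triggering a warning email.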

dr0i added a commit that referenced this issue Nov 10, 2023
dr0i added a commit that referenced this issue Nov 10, 2023
@dr0i
Member

dr0i commented Nov 10, 2023

Scheduled to get data every 10 minutes.
(Note that every call triggers a build via sbt (which takes 400% CPU for around 10 seconds), and this even happens 3 times: 1. ConvertUpdates, 2. index updates to gnd-test, 3. index updates to gnd production. It would be nice to have a running webhook listener that just starts the process from an already running instance, as in hbz/lobid-resources#1159.)
For the moment we test whether GND-updates.jsonl is empty, and if so, skip the 2nd and 3rd sbt builds (indexing).
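The empty-file guard described above could be sketched like this (a minimal Python sketch; the file name comes from the comment, the surrounding control flow is hypothetical):

```python
import os

def should_index(updates_file: str = "GND-updates.jsonl") -> bool:
    """Skip the indexing builds when the updates file is missing or empty."""
    return os.path.exists(updates_file) and os.path.getsize(updates_file) > 0

if should_index():
    print("run indexing builds (gnd-test, gnd production)")
else:
    print("no updates - skip indexing")
```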

@dr0i
Member

dr0i commented Nov 10, 2023

We checked this and it seems to work: we got 7 new resources in the last 20 minutes!
We should blog about it and inform users.

@dr0i
Member

dr0i commented Nov 13, 2023

We should also update http://lobid.org/gnd/dataset:

The data basis is the RDF version of the GND (updated daily)

(Hm, I wonder why the data seems to be updated only every hour, not every 10 minutes, even though we try to fetch data every 10 minutes. Maybe the RDF dumps are not provided as often as the PICA data? If that's the case, we should fetch the data less often, @acka47.)

@dr0i dr0i unassigned fsteeg Nov 16, 2023
@acka47
Contributor

acka47 commented Nov 17, 2023

As we have just discussed in the review, we will schedule hourly updates.

@acka47 acka47 assigned dr0i and unassigned acka47 and TobiasNx Nov 17, 2023
@dr0i dr0i assigned dr0i and unassigned dr0i Nov 17, 2023
@dr0i
Member

dr0i commented Nov 17, 2023

Done scheduling hourly: updates run at minute 40 of every hour. 👍
Note: blog about it when the new full indexing is done.
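In classic crontab syntax, a run at minute 40 of every hour would look like this (a sketch; the script path is hypothetical):

```
# fetch GND updates at minute 40 of every hour
# (the script path below is hypothetical)
40 * * * * /opt/lobid-gnd/update.sh
```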

@dr0i
Member

dr0i commented Apr 22, 2024

No one is willing to write the blog post. As this is not mandatory, I am closing this issue.

@dr0i dr0i closed this as completed Apr 22, 2024
@TobiasNx
Author

I think we still should do this.

We could keep it short:

Title: Hourly update intervals for lobid-gnd

For lobid-gnd, we fetch the GND as RDF/XML via OAI-PMH from the DNB and transform it to JSON-LD.
At SWIB 2023, a colleague from UB Münster suggested shortening our interval for ingesting OAI-PMH updates,
which we had provided on a daily basis.

Now we are glad to announce that lobid-gnd provides hourly updates, so you no longer have to wait until the next day to get current GND data.

Have fun with it.

@dr0i
Member

dr0i commented Apr 22, 2024

I've deployed it, see https://blog.lobid.org/.
