support for OCR text as annotations (v3/v2) #48
Comments
Q: Where should the @context live in the manifest, now that we've embedded the annotation pages within the manifest?
A: At the top of the manifest, along with the IIIF Presentation API context. NB: the IIIF Presentation API @context statement should be last, since it overrides other values.
Q: Should these plaintext representations of a page of text be in a rendering element within the canvas, or in an annotation targeting the full canvas?
A: Philosophically speaking, rendering is probably better. However, from a UI perspective, viewers are likely to present a rendering to the user as a downloadable link (as they would a PDF file). That behavior is probably not desired for some OCR, for example when a user cannot read Fraktur typefaces and wants to read the text of the page alongside the facsimile. The current plan is to implement it in one direction and test in viewers.
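A minimal sketch of the @context ordering described in the answer above. The two context URLs are the published IIIF Presentation 3 and text granularity extension contexts; the function name `manifest_contexts` is hypothetical and not part of this project's code.

```python
# Sketch only: build a manifest "@context" list with the Presentation API
# context last, as the answer above recommends.
PRESENTATION_CONTEXT = "http://iiif.io/api/presentation/3/context.json"
TEXT_GRANULARITY_CONTEXT = "http://iiif.io/api/extension/text-granularity/context.json"


def manifest_contexts(extension_contexts):
    """Return the extension contexts followed by the Presentation API context."""
    ordered = []
    for ctx in extension_contexts:
        # Skip duplicates and hold the Presentation context back for the end.
        if ctx != PRESENTATION_CONTEXT and ctx not in ordered:
            ordered.append(ctx)
    ordered.append(PRESENTATION_CONTEXT)
    return ordered


manifest_stub = {
    "@context": manifest_contexts([TEXT_GRANULARITY_CONTEXT]),
    "type": "Manifest",
}
```

Keeping the ordering in one helper means any later extension context can be added without accidentally placing it after the Presentation context.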
Note from 9/7/2023:
This is Johannes' OCR viewer: https://mirador-textoverlay.netlify.app/
When we drop this manifest: Here's how the OCR text annotation is modeled in the manifest:
Any idea why not?
The gist is a v2 manifest; therefore the annotations need to be "seeAlso" or "rendering" (rendering is more correct, but seeAlso is likely better supported). The annotation seems right for v3; at least it matches the recipe. Where can we test it? Maybe Johannes' Mirador, which takes hOCR or ALTO, would show it? Or perhaps it won't, because it's just text?
As mentioned, I think this is syntactically correct and matches the recipe: https://iiif.io/api/cookbook/recipe/0068-newspaper/ but in v2 it's probably seeAlso or rendering.
Ben to look at creating a mock-up for a v3 manifest.
The problem @benwbrum ran into is a limitation of the viewers: the OCR text can be in AnnotationPages which are external to the manifest and linked from it. To get around viewers' problems with making two hops (Manifest->AnnotationPage->OCR URI), we will try these strategies (Glen is adding them below).
Two options:
May be able to use the service mentioned in #21.
Can also just copy and paste the code from: which uses the archive infrastructure. Example with DjVu: https://archive.org/details/journalofexpedit00ford
The fulltext service is not so great, as it requires a search term.
Action: find out what parameters are available for the BookReaderGetTextWrapper.php service.
Since all of our handy helper functions rely on the archivelabs services, it looks like the best option is to produce annotations from the DjVu XML file itself. This can be done (probably most easily) at
The next step is to pseudocode the conversion from a DjVu XML file representing multiple canvases into a set of annotations per canvas.
FromThePage code that converts IA DjVu into canvas-specific text:
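For illustration only (this is not the FromThePage code), here is a rough Python sketch of splitting one DjVu XML document into per-page text. It assumes the usual Internet Archive layout, where each page is an `OBJECT` element containing `LINE` and `WORD` elements, and that pages appear in leaf order. The function name `page_texts` is hypothetical.

```python
import xml.etree.ElementTree as ET


def page_texts(djvu_xml):
    """Return one plain-text string per page, in document order.

    Assumption: each <OBJECT> is one page (one canvas), and words are
    grouped as <LINE><WORD>...</WORD></LINE> somewhere beneath it.
    """
    root = ET.fromstring(djvu_xml)
    pages = []
    for obj in root.iter("OBJECT"):
        lines = []
        for line in obj.iter("LINE"):
            words = [
                word.text.strip()
                for word in line.iter("WORD")
                if word.text and word.text.strip()
            ]
            if words:
                lines.append(" ".join(words))
        # A page without OCR text still yields an entry, so list
        # positions stay aligned with the page order.
        pages.append("\n".join(lines))
    return pages
```

If the list index lines up with the leaf number used in the canvas URI, `page_texts(xml)[4]` would be the text for the `$4` canvas, but that correspondence is an assumption to verify against a real item.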
To produce the leaf-level annotations for canvas https://iiif.archive.org/iiif/journalofexpedit00ford$4/canvas of https://iiif.archive.org/iiif/3/journalofexpedit00ford/manifest.json (with the page number/canvas label of 5),
To produce page-level annotations without coordinate values,
To produce paragraph-level or line-level annotations, follow the page-level annotation strategy for the appropriate
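As a sketch of what one such annotation could look like in a v3 manifest, the following builds a `supplementing` annotation with a `TextualBody`, loosely following the newspaper cookbook recipe linked earlier. The helper names and the `example.org` identifiers are hypothetical; the `textGranularity` property assumes the text granularity extension context is declared in the manifest.

```python
def text_annotation(anno_id, canvas_id, text, granularity="page", xywh=None):
    """Build one OCR text annotation targeting a canvas or a region of it."""
    target = canvas_id
    if xywh is not None:
        # Region-level target: append a media fragment to the canvas URI.
        x, y, w, h = xywh
        target = f"{canvas_id}#xywh={x},{y},{w},{h}"
    return {
        "id": anno_id,
        "type": "Annotation",
        "motivation": "supplementing",
        "textGranularity": granularity,
        "body": {"type": "TextualBody", "format": "text/plain", "value": text},
        "target": target,
    }


def annotation_page(page_id, annotations):
    """Wrap annotations in an AnnotationPage that can be embedded in a canvas."""
    return {"id": page_id, "type": "AnnotationPage", "items": list(annotations)}


canvas = "https://iiif.archive.org/iiif/journalofexpedit00ford$4/canvas"
page_level = text_annotation(
    "https://example.org/anno/page-4",  # hypothetical identifier
    canvas,
    "Full OCR text of the page goes here.",
)
embedded = annotation_page("https://example.org/annopage/4", [page_level])
```

A page-level annotation omits `xywh` and targets the whole canvas; paragraph-level or line-level annotations would pass a region and a matching `granularity` value.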
It looks like DjVu coordinates are lower-left x,y; upper-right x,y!
Converting these into IIIF-style upper-left x,y; w,h will take some calculations:
x = lx
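One possible way to finish that calculation, as a sketch under stated assumptions. It assumes the `coords` attribute holds `lx,ly,rx,uy` as its first four comma-separated integers, and that y values are already measured from the top of the page. If they turn out to be measured from the bottom, pass the page height to flip the axis. The function name `djvu_to_xywh` is hypothetical.

```python
def djvu_to_xywh(coords, page_height=None):
    """Convert 'lx,ly,rx,uy' corner coordinates into (x, y, w, h).

    page_height=None assumes y grows downward from the top of the page.
    Supplying page_height assumes y grows upward from the bottom instead.
    """
    # Only the first four values are used; any extra values are ignored.
    lx, ly, rx, uy = (int(value) for value in coords.split(",")[:4])
    x = min(lx, rx)
    w = abs(rx - lx)
    h = abs(ly - uy)
    if page_height is None:
        # Top-origin: the upper edge is the smaller y value.
        y = min(ly, uy)
    else:
        # Bottom-origin: the upper edge is the larger y value, flipped.
        y = page_height - max(ly, uy)
    return x, y, w, h
```

For example, `djvu_to_xywh("100,250,300,200")` gives `(100, 200, 200, 50)`, which can be appended to a canvas URI as `#xywh=100,200,200,50`. Which origin the Archive's DjVu XML actually uses should be confirmed by overlaying a few boxes on a real page image.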
<Moved from ArchiveLabs/iiif.archivelab.org#80. I'd recommend reading that one for the whole discussion, but I pulled most of the comments in here.>
OCR text created by the derivation process can be exposed as annotations for books and image-based media, enabling presentation and consumption of the text by IIIF clients.
v2 mock-up (see the second page)
Manifest
Annotation List
Notes from: 2023-08-17
@mekarpeles proposal: may want to port over the OCR archive lab into this manifest service / app
Code is contained here: https://github.com/ArchiveLabs/api.archivelab.org/blob/master/server/api/archive.py#L240-L268
In the annotation list mockup, I think the type should be dctypes:Text rather than cnt:ContentAsText
Context needed for textgranularity extension use of "page"
Decision to embed annotation lists to reduce potential number of requests on clients
Mek adds these resources to the AL APIs which should probably be incorporated into production:
https://github.com/archiveLabs/api.archivelab.org
https://github.com/ArchiveLabs/api.archivelab.org/blob/master/server/api/archive.py#L240-L268