
support for OCR text as annotations (v3/v2) #48

Closed
saracarl opened this issue Jan 17, 2024 · 19 comments · Fixed by #73
Labels: enhancement, High Priority

Comments

@saracarl
Collaborator

<Moved from ArchiveLabs/iiif.archivelab.org#80. I'd recommend reading that one for the whole discussion, but I pulled most of the comments in here.>

OCR text created by the derivation process can be exposed as annotations for books and image-based media, enabling presentation and consumption of the text by IIIF clients.

@saracarl
Collaborator Author

Q: Where should the @context live in the manifest, now that we've embedded the annotation pages within the manifest?

A: At the top of the manifest, along with the IIIF presentation API context. NB: the IIIF presentation API @context statement should be last, since it overrides other values.

Q: Should these plaintext representations of a page of text be in a rendering element within the canvas, or in an annotation targeting the full canvas?

A: Philosophically speaking, rendering is probably better. However, from a UI perspective, viewers are likely to present it to the user as a downloadable link (as one would with a PDF file). That behavior is probably not desired for some OCR -- for example, when a user cannot read Fraktur typefaces and wants to read the text of the page alongside the facsimile.

Current plan is to implement it in one direction and test in viewers.

@saracarl
Collaborator Author

Note from 9/7/2023:
We should place the text granularity context after the IIIF presentation context in the v2 manifest, but before it in the v3 manifest.
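For illustration, assuming the published IIIF presentation and text-granularity context URIs, that ordering gives roughly:

v2:

"@context": [
   "http://iiif.io/api/presentation/2/context.json",
   "http://iiif.io/api/extension/text-granularity/context.json"
]

v3:

"@context": [
   "http://iiif.io/api/extension/text-granularity/context.json",
   "http://iiif.io/api/presentation/3/context.json"
]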

@saracarl
Collaborator Author

This is Johannes' OCR viewer: https://mirador-textoverlay.netlify.app/
(To test with)

@saracarl
Collaborator Author

When we drop this manifest:
https://gist.githubusercontent.com/benwbrum/e7e2fb9962a6aaba2cc0d6ae8f1b6d98/raw/df2598dd8a67ef941c2e03fa07dbe9485f736a9c/ia_ocr_annotation_mockup_v2.json
into Mirador, the annotations don't show up.

Here's how the OCR text annotation is modeled in the manifest:

"otherContent": [
   {
      "@id": "https://iiif.archivelab.org/iiif/rbmsbk_ap2-v4_2001_V55N4$9/ocr",
      "@type": "sc:AnnotationList",
      "label": "OCR Text",
      "resources": [
         {
            "@type": "oa:Annotation",
            "motivation": "sc:painting",
            "textGranularity": "page",
            "on": "https://iiif.archivelab.org/iiif/books/rbmsbk_ap2-v4_2001_V55N4$9/canvas",
            "resource": {
               "@id": "https://api.archivelab.org/books/rbmsbk_ap2-v4_2001_V55N4/pages/9/plaintext",
               "@type": "dctypes:Text",
               "format": "text/plain"
            }
         }
      ]
   }
]

Any idea why not?

@saracarl
Collaborator Author

The gist is a v2 manifest; therefore the annotations need to be "seeAlso" or "rendering" (more correctly rendering, but seeAlso is likely better supported).

The annotation seems right for v3; at least it matches the recipe. Where to test? Maybe Johannes' Mirador that takes hOCR or ALTO would show it? Or perhaps it won't, because it's just plain text?

@saracarl
Collaborator Author

As mentioned, I think this is syntactically correct and matches the recipe:

https://iiif.io/api/cookbook/recipe/0068-newspaper/

but in v2 it's probably seeAlso or rendering.
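
For example, the v2 canvas could expose the same plaintext resource as a rendering, roughly like this (whether viewers surface it inline or as a download link is exactly the thing to test):

"rendering": {
   "@id": "https://api.archivelab.org/books/rbmsbk_ap2-v4_2001_V55N4/pages/9/plaintext",
   "label": "OCR Text",
   "format": "text/plain"
}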

@benwbrum benwbrum added this to the Spring 2024 sprint milestone Jan 18, 2024
@glenrobson glenrobson added the enhancement label Jan 18, 2024
@glenrobson
Collaborator

Ben to look at creating a mock-up for a v3 manifest.

@saracarl
Collaborator Author

saracarl commented Mar 7, 2024

The problem @benwbrum ran into is a limitation of the viewers: the OCR text can live in AnnotationPages that are external to the manifest, linked via the id property of the Annotation within the AnnotationPage. The motivation should be supplementing, following the second use case in https://iiif.io/api/cookbook/recipe/0231-transcript-meta-recipe/

To get around viewer problems with making two hops (Manifest -> AnnotationPage -> OCR URI), we will try these strategies (Glen is adding them below).
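
Following that recipe, the v3 canvas would reference the external AnnotationPage roughly like this (reusing the /ocr URI pattern from the gist above; the Annotations inside the referenced page then carry the supplementing motivation):

"annotations": [
   {
      "id": "https://iiif.archivelab.org/iiif/rbmsbk_ap2-v4_2001_V55N4$9/ocr",
      "type": "AnnotationPage"
   }
]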

@glenrobson
Collaborator

Two options:

  1. Bring the external annotation into the Manifest
  2. Mike can write an annotation page endpoint which will generate an AnnotationPage with the text content retrieved from the file (see the sketch below).
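
A rough sketch of option 2 in Python/Flask (the route and the fetch_plaintext helper are hypothetical stand-ins; the AnnotationPage shape follows the v3 presentation model):

from flask import Flask, jsonify, request

app = Flask(__name__)

def fetch_plaintext(identifier, page):
    """Hypothetical helper: retrieve the OCR plaintext for one leaf."""
    raise NotImplementedError

@app.route("/iiif/<identifier>/<int:page>/annotations")
def annotation_page(identifier, page):
    # Embedding the text here means viewers make one hop
    # (Manifest -> AnnotationPage) instead of two.
    text = fetch_plaintext(identifier, page)
    return jsonify({
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": request.url,
        "type": "AnnotationPage",
        "items": [{
            "id": request.url + "/1",
            "type": "Annotation",
            "motivation": "supplementing",
            "body": {"type": "TextualBody", "value": text, "format": "text/plain"},
            # canvas id pattern taken from earlier comments in this thread
            "target": f"https://iiif.archive.org/iiif/{identifier}${page}/canvas",
        }],
    })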

@glenrobson
Collaborator

We may be able to use the service mentioned in #21.

@glenrobson
Collaborator

Can also just copy and paste the code from:

https://github.com/ArchiveLabs/api.archivelab.org/blob/082c29bace2149d9c02d6b490d006fa27de0b447/server/api/archive.py#L240

This uses the archive infrastructure.

Example with djvu: https://archive.org/details/journalofexpedit00ford

@glenrobson
Collaborator

The fulltext service is not so great, as it requires a search term.

@glenrobson
Collaborator

Action: find out what parameters are available for the BookReaderGetTextWrapper.php service.

@benwbrum
Collaborator

Since all of our handy helper functions rely on the archivelabs services, it looks like the best option is to produce annotations from the DjVu XML file itself. This can be done (probably most easily) at word-level granularity.

Next step is to pseudocode the conversion from DjVu XML file representing multiple canvases into a set of annotations per canvas.

FromThePage code that converts IA DjVu into canvas-specific text:

@benwbrum
Collaborator

benwbrum commented Apr 26, 2024

To produce the leaf-level annotations for canvas https://iiif.archive.org/iiif/journalofexpedit00ford$4/canvas of https://iiif.archive.org/iiif/3/journalofexpedit00ford/manifest.json (with the page number/canvas label of 5),

  • Fetch the DjVu XML file
  • Calculate the map name value corresponding to the canvas:
    • The map name value for this canvas is journalofexpedit00ford_0005.djvu, which corresponds to the string we use for the journalofexpedit00ford_0005 painting annotation id and image service id
  • Find the OBJECT element with the usemap attribute value matching the map name value
  • For each WORD child element of the OBJECT,
    • Convert the body of the element to a textual annotation of plaintext format
    • Convert the coords into a fragment, transforming upper-left/lower-right into xywh. It is not clear to me that the coords values are correctly generated, since the values do not seem to match the expected min(x),min(y),max(x),max(y).

To produce page-level annotations without coordinate values,

  • Find the OBJECT element corresponding to the canvas as above
  • For each PARAGRAPH
    • For each LINE
      • For each WORD
        • Extract the text body from the XML element
      • Join word text (no whitespace padding should be needed, as spaces are included in the DjVu elements)
    • Join line text using a newline
  • Join paragraph text using two newlines

To produce paragraph-level or line-level annotations, follow the page-level annotation strategy for the appropriate PARAGRAPH or LINE element, but find the minimum/maximum coordinates from the WORD elements to generate a line/paragraph region fragment.
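
A minimal Python sketch of the lookup and the page-level joining described above, using only the standard library. The element names (OBJECT, PARAGRAPH, LINE, WORD) and the usemap matching come from the steps above; the zero-padding and off-by-one in the map name are assumptions to verify against a real DjVu XML file:

import xml.etree.ElementTree as ET

def object_for_canvas(djvu_xml_path, identifier, canvas_index):
    """Find the OBJECT whose usemap matches the canvas's map name,
    e.g. canvas $4 of journalofexpedit00ford -> journalofexpedit00ford_0005.djvu."""
    map_name = f"{identifier}_{canvas_index + 1:04d}.djvu"
    for obj in ET.parse(djvu_xml_path).iter("OBJECT"):
        if obj.get("usemap") == map_name:
            return obj
    return None

def page_text(obj):
    """Page-level plaintext: WORD bodies joined as-is (the DjVu text keeps
    trailing spaces), LINEs joined with newlines, PARAGRAPHs with blank lines."""
    paragraphs = []
    for para in obj.iter("PARAGRAPH"):
        lines = ["".join(w.text or "" for w in line.iter("WORD")).rstrip()
                 for line in para.iter("LINE")]
        paragraphs.append("\n".join(lines))
    return "\n\n".join(paragraphs)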

@glenrobson glenrobson linked a pull request May 23, 2024 that will close this issue
@benwbrum
Collaborator

It looks like DjVu coordinates are lower-left x,y; upper-right x,y!

@benwbrum
Collaborator

<LINE>
<WORD coords="444,1353,635,1294" x-confidence="10">[David </WORD>
<WORD coords="635,1336,782,1294" x-confidence="7">Ford </WORD>
<WORD coords="782,1335,894,1305" x-confidence="2">was </WORD>
<WORD coords="894,1335,941,1305" x-confidence="10">a </WORD>
<WORD coords="941,1335,1112,1292" x-confidence="31">native </WORD>

Converting these into IIIF-style upper-left x,y; w,h will take some calculations.

@glenrobson
Collaborator

<WORD coords="444,1353,635,1294" x-confidence="10">[David </WORD>
<WORD coords="lx,by,rx,ty" x-confidence="10">[David </WORD>

x = lx
y = ty
w = rx - lx
h = by - ty
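
As a Python helper (the function name is ours; the coords order lx,by,rx,ty is as worked out above):

def djvu_coords_to_xywh(coords):
    """DjVu WORD coords are lower-left x/y then upper-right x/y;
    IIIF fragments want upper-left x/y plus width and height."""
    lx, by, rx, ty = (int(v) for v in coords.split(","))
    return f"xywh={lx},{ty},{rx - lx},{by - ty}"

# djvu_coords_to_xywh("444,1353,635,1294") -> "xywh=444,1294,191,59"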
