
Grobid consistently drops characters, e.g., "fi", "ff" #892

Open
mfeblowitz opened this issue Feb 16, 2022 · 9 comments
Labels: pdfalto (Issue related to pdfalto), wontfix

Comments

@mfeblowitz

mfeblowitz commented Feb 16, 2022

I am using Grobid (0.6.1 and 0.7.0, Ubuntu 18.04) to extract the content of PDF files into HTML format. (Separately from Grobid, I further extract paragraph content for question answering.)

I have noticed several cases where a pair of characters is replaced by a space. The images below show the PDF document and the resulting extracted HTML content (direct output from Grobid) where the changes have occurred.

For example, character pairs such as the "fi" in "financial" are replaced by a space in the extracted paragraphs. One example is https://www.americanprogress.org/article/economic-impact-coronavirus-united-states-possible-economic-policy-responses/ (verified using the Grobid web app TEI tab, so independent of any code I've written).

See, for example, what happens to the original:

[Image: grobid_droppage_orig (the original PDF text)]

As reflected in the extracted html:

[Image: grobid_droppage (the extracted HTML with characters dropped)]

This is doing a number on our NLP/NLU processing of these documents.

Any suggested adjustments?

@kermitt2
Owner

Hello @mfeblowitz !

I guess in this case you are producing a PDF from the HTML page, correct?
With which tool are you generating this PDF?

Apparently with Firefox/Linux, the generated PDF uses embedded fonts for the ligatures (ff, fi), so the unicode for these glyphs is not the correct unicode of these characters, but an index to the glyph in the local fonts.

For instance if you cut and paste the text from this PDF with a basic PDF viewer, or with pdftotext command line:

```
can have disruptive e�ects on the economy.
... making it harder for U.S. �rms to �ll orders...
```
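This kind of damage can be quantified mechanically. A minimal sketch (assuming Python 3, with the extracted text passed in as a string): count U+FFFD replacement characters, which text extractors emit wherever a glyph has no valid Unicode mapping.

```python
def count_unmapped_glyphs(text: str) -> int:
    """Count U+FFFD replacement characters, which PDF text
    extractors emit when a glyph has no valid Unicode mapping."""
    return text.count("\ufffd")

print(count_unmapped_glyphs("disruptive e\ufffdects on the economy"))  # 1
```

Note this only catches extractors that emit U+FFFD; Grobid's output drops the glyph entirely, which needs the heuristics discussed below.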

This is unfortunately very frequent in PDF, particularly for scientific text with a lot of special characters not associated with the right unicode but just to an embedded glyph.
There's nothing magic in Grobid for the moment to "guess" the valid Unicode of embedded fonts. We started to work on a custom OCR solution in pdfalto to recover the right Unicode, but it's a lot of work.

What you could do is try to mitigate this issue at the level of the HTML-to-PDF tool, or try to change the font in the HTML before generating the PDF, so that the PDF contains a more standard font.

@mfeblowitz
Author

mfeblowitz commented Feb 17, 2022

Um, no. Sorry to have not been clear; updating the description. I'm pulling the PDFs from the web and extracting from them, so I have no control over the production of the PDFs.

@kermitt2
Owner

Do you have an example of such a PDF? Where does it come from? Because this article seems to be originally in HTML.

The problem applies similarly to native PDF using embedded fonts for ligatures, but it's somehow worse because there is no solution upstream, except using an OCR, which might degrade other aspects of the document.

@mfeblowitz
Author

Interesting...
The origin of the PDF document (linked above) was the product of saving that web page to a PDF file. The contents are (mostly) binary, and pdftotext indeed revealed the same behavior. On a hunch, I tried print-to-PDF using Firefox rather than the "export to pdf" or "print... save as pdf" in Safari. Firefox did the right thing.
So I do have control over which source (of pdf documents) to use!

@mfeblowitz
Author

Now, if only there were a way to be alerted when a ligature substitution might have occurred, so that excruciating manual examination of all processed documents would not be required...

@kermitt2
Copy link
Owner

Hmm, check whether "fi" / "ff" occurs in the text or not?
At least it would cover the ligature case, but the embedded font issue can happen for many characters in general.
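A sketch of that heuristic (assuming Python 3 and plain-text output; the word-count threshold is an arbitrary assumption): a sufficiently long English text containing none of the common ligature letter pairs has very likely lost them during extraction.

```python
# Letter sequences that are typeset as ligatures in most fonts.
LIGATURE_SEQS = ("ffi", "ffl", "ff", "fi", "fl")

def looks_ligature_stripped(text: str, min_words: int = 200) -> bool:
    """Heuristic: a long English text with no occurrence of any
    common ligature pair has probably lost those glyphs."""
    if len(text.split()) < min_words:
        return False  # too short to judge either way
    return not any(seq in text for seq in LIGATURE_SEQS)
```

The word-count guard reduces false positives on short snippets, where the absence of "fi"/"ff" can be perfectly normal.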

@mfeblowitz
Author

That's the rub. To know whether it has the characters, you'd need a good extraction to compare against.

Or you'd need a comprehensive (huge) set of patterns to look for in the bad text: "e ect" for effect, " nance" or " nancial" for finance or financial, ...
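One way to keep that pattern set manageable (a sketch, assuming Python 3; the tiny lexicon here is a placeholder for a real wordlist) is to generate it on the fly: for each adjacent token pair, test whether re-inserting a ligature yields a known word.

```python
# Ligatures that PDF extraction commonly drops.
LIGATURES = ("ffi", "ffl", "ff", "fi", "fl")

# Tiny demo lexicon; in practice, load a real wordlist here.
LEXICON = {"effect", "effects", "financial", "firms", "fill", "office"}

def find_broken_ligatures(text: str) -> list[str]:
    """Reconstruct words where re-inserting a ligature between adjacent
    tokens (or at a token's start) yields a known word,
    e.g. 'e ects' -> 'effects'."""
    tokens = text.lower().split()
    hits = []
    # Mid-word drop: "e ects" -> "effects"
    for left, right in zip(tokens, tokens[1:]):
        for lig in LIGATURES:
            if left + lig + right in LEXICON:
                hits.append(left + lig + right)
                break
    # Word-initial drop: " rms" -> "firms"
    for tok in tokens:
        for lig in LIGATURES:
            if lig + tok in LEXICON:
                hits.append(lig + tok)
                break
    return hits

print(find_broken_ligatures("disruptive e ects on rms trying to ll orders"))
# ['effects', 'firms', 'fill']
```

Any nonempty result flags the document for review; with a full dictionary this covers the ligature case without hand-maintaining patterns, though it won't catch the arbitrary embedded-font substitutions mentioned above.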

Maybe use some machine learning to learn the patterns.

Or maybe some NLP to detect nonsense sentences...

@kermitt2
Owner

For info, I've worked on an OCR quality scorer to detect documents with noise coming from terrible OCR (like OCR from the nineties), so that the document can be filtered out or re-OCRed with a modern OCR engine. It might be possible to apply it to your use case, as the nonsense text due to the destructive HTML-to-PDF conversion might be expected to lower the quality score of the converted document. It's based on a DL language model applied to chunks of an input document, then normalized with an XGBoost model.

https://github.com/science-miner/ocr_scorer

@lfoppiano
Collaborator

lfoppiano commented Jan 6, 2025

IMHO this should be fixed in #1216 by adding some mapping for ligatures loaded from outside pdfalto.

@lfoppiano lfoppiano added the pdfalto Issue related to pdfalto label Jan 6, 2025