
Grobid consistently drops characters, e.g., "fi", "ff" #892

Open
mfeblowitz opened this issue Feb 16, 2022 · 9 comments
Labels: pdfalto (Issue related to pdfalto), wontfix

Comments

@mfeblowitz

mfeblowitz commented Feb 16, 2022

I am using Grobid (0.6.1 and 0.7.0, Ubuntu 18.04) to extract the content of PDF files into HTML format. (Separately from Grobid, I further extract paragraph content for question answering.)

I have noticed several cases where a pair of characters is replaced by a space. The images below show the PDF document and the resulting extracted HTML content (direct output from Grobid) where the changes have occurred.

For example, character pairs such as the "fi" in "financial" are replaced by a space in the extracted paragraphs. One example is https://www.americanprogress.org/article/economic-impact-coronavirus-united-states-possible-economic-policy-responses/ (verified using the Grobid web app TEI tab, so independent of any code I've written).

See, for example, what happens to the original:

[Image: grobid_droppage_orig (the original PDF text)]

As reflected in the extracted html:

[Image: grobid_droppage (the extracted HTML with characters dropped)]

This is doing a number on our NLP/NLU processing of these documents.

Any suggested adjustments?

@kermitt2
Owner

Hello @mfeblowitz !

I guess in this case you are producing a PDF from the HTML page, correct?
With which tool are you generating this PDF?

Apparently with Firefox/Linux, the generated PDF uses embedded fonts for the ligatures (ff, fi), so the unicode for these glyphs is not the correct unicode of these characters, but an index to the glyph in the local fonts.

For instance if you cut and paste the text from this PDF with a basic PDF viewer, or with pdftotext command line:

```
can have disruptive e�ects on the economy.
... making it harder for U.S. �rms to �ll orders...
```
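This kind of damage can be quantified mechanically. A minimal sketch (assuming Python 3, with the extracted text passed in as a string): count U+FFFD replacement characters, which text extractors emit wherever a glyph has no valid Unicode mapping.

```python
def count_unmapped_glyphs(text: str) -> int:
    """Count U+FFFD replacement characters, which PDF text
    extractors emit when a glyph has no valid Unicode mapping."""
    return text.count("\ufffd")

print(count_unmapped_glyphs("disruptive e\ufffdects on the economy"))  # 1
```

Note this only catches extractors that emit U+FFFD; Grobid's output drops the glyph entirely, which needs the heuristics discussed below.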

This is unfortunately very frequent in PDF, particularly for scientific text with a lot of special characters not associated with the right unicode but just to an embedded glyph.
There's nothing magic in Grobid for the moment to "guess" the valid Unicode of embedded fonts. We started to work on a custom OCR solution in pdfalto to recover the right Unicode, but it's a lot of work.

What you could do is try to mitigate this issue at the level of the HTML-to-PDF tool, or try to change the font in the HTML before generating the PDF, so that the PDF contains a more standard font.

@mfeblowitz
Author

mfeblowitz commented Feb 17, 2022

Um, no. Sorry to have not been clear; updating the description. I'm pulling the PDFs from the web and extracting from them, so I have no control over the production of the PDFs.

@kermitt2
Owner

Do you have an example of such a PDF? Where does it come from? Because this article seems to be originally in HTML.

The problem applies similarly to native PDF using embedded fonts for ligatures, but it's somehow worse because there is no solution upstream, except using an OCR, which might degrade other aspects of the document.

@mfeblowitz
Author

Interesting...
The origin of the PDF document (linked above) was the product of saving that web page to a PDF file. The contents are (mostly) binary, and pdftotext indeed revealed the same behavior. On a hunch, I tried print-to-PDF using Firefox rather than the "export to pdf" or "print... save as pdf" in Safari. Firefox did the right thing.
So I do have control over which source (of pdf documents) to use!

@mfeblowitz
Author

Now, if only there were a way to be alerted when a ligature substitution might have occurred, so that excruciating manual examination of all processed documents would not be required...

@kermitt2
Copy link
Owner

Hmm, check whether "fi" / "ff" occurs in the text or not?
At least it would cover the ligature case, but the embedded font issue can happen for many characters in general.
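A sketch of that heuristic (assuming Python 3 and plain-text output; the word-count threshold is an arbitrary assumption): a sufficiently long English text containing none of the common ligature letter pairs has very likely lost them during extraction.

```python
# Letter sequences that are typeset as ligatures in most fonts.
LIGATURE_SEQS = ("ffi", "ffl", "ff", "fi", "fl")

def looks_ligature_stripped(text: str, min_words: int = 200) -> bool:
    """Heuristic: a long English text with no occurrence of any
    common ligature pair has probably lost those glyphs."""
    if len(text.split()) < min_words:
        return False  # too short to judge either way
    return not any(seq in text for seq in LIGATURE_SEQS)
```

The word-count guard reduces false positives on short snippets, where the absence of "fi"/"ff" can be perfectly normal.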

@mfeblowitz
Author

That's the rub. To know whether it has the characters, you'd need a good extraction to compare against.

Or you'd need a comprehensive (huge) set of patterns to look for in the bad text: "e ect" for effect, " nance" or " nancial" for finance or financial, ...
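One way to keep that pattern set manageable (a sketch, assuming Python 3; the tiny lexicon here is a placeholder for a real wordlist) is to generate it on the fly: for each adjacent token pair, test whether re-inserting a ligature yields a known word.

```python
# Ligatures that PDF extraction commonly drops.
LIGATURES = ("ffi", "ffl", "ff", "fi", "fl")

# Tiny demo lexicon; in practice, load a real wordlist here.
LEXICON = {"effect", "effects", "financial", "firms", "fill", "office"}

def find_broken_ligatures(text: str) -> list[str]:
    """Reconstruct words where re-inserting a ligature between adjacent
    tokens (or at a token's start) yields a known word,
    e.g. 'e ects' -> 'effects'."""
    tokens = text.lower().split()
    hits = []
    # Mid-word drop: "e ects" -> "effects"
    for left, right in zip(tokens, tokens[1:]):
        for lig in LIGATURES:
            if left + lig + right in LEXICON:
                hits.append(left + lig + right)
                break
    # Word-initial drop: " rms" -> "firms"
    for tok in tokens:
        for lig in LIGATURES:
            if lig + tok in LEXICON:
                hits.append(lig + tok)
                break
    return hits

print(find_broken_ligatures("disruptive e ects on rms trying to ll orders"))
# ['effects', 'firms', 'fill']
```

Any nonempty result flags the document for review; with a full dictionary this covers the ligature case without hand-maintaining patterns, though it won't catch the arbitrary embedded-font substitutions mentioned above.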

Maybe use some machine learning to learn the patterns.

Or maybe some NLP to detect nonsense sentences...

@kermitt2
Owner

For info, I've worked on an OCR quality scorer to detect documents with noise coming from terrible OCR (like OCR from the nineties), so that the document can be filtered out or re-OCRed with a modern OCR engine. It might be possible to apply it to your use case, as the nonsense text due to the destructive HTML-to-PDF conversion might be expected to lower the quality score of the converted document. It's based on a DL language model applied to chunks of an input document, then normalized with an XGBoost model.

https://github.com/science-miner/ocr_scorer

@lfoppiano
Collaborator

lfoppiano commented Jan 6, 2025

IMHO this should be fixed in #1216 by adding some mapping for ligatures loaded from outside pdfalto.

@lfoppiano lfoppiano added the pdfalto Issue related to pdfalto label Jan 6, 2025