Grobid consistently drops characters, e.g., "fi", "ff" #892
Comments
Hello @mfeblowitz! I guess in this case you are producing a PDF from the HTML page, correct? Apparently with Firefox/Linux, the generated PDF uses embedded fonts for the ligatures (ff, fi), so the Unicode stored for these glyphs is not the correct Unicode of the characters, but an index pointing to the glyph in the embedded font. For instance, if you cut and paste the text from this PDF with a basic PDF viewer, or with ...
This is unfortunately very frequent in PDFs, particularly for scientific text, where many special characters are associated not with the right Unicode but only with an embedded glyph. What you could do is try to mitigate the issue at the level of the HTML-to-PDF tool, or change the font in the HTML before generating the PDF, so that the PDF embeds a more standard font.
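If the HTML-to-PDF step is under your control, one concrete variant of this mitigation is to disable typographic ligatures in the page before rendering, so the converter never substitutes combined glyphs. A minimal sketch, assuming the HTML is available as a file (the CSS injection below is illustrative, not a Grobid feature):

```python
# Inject a CSS rule that asks the rendering engine not to form "fi"/"ff"
# ligature glyphs when the HTML page is later converted to PDF.
LIGATURE_OFF_CSS = "<style>* { font-variant-ligatures: none; }</style>"

def disable_ligatures(html: str) -> str:
    """Return the HTML with a ligature-disabling style injected into <head>."""
    if "</head>" in html:
        return html.replace("</head>", LIGATURE_OFF_CSS + "</head>", 1)
    # Fall back to prepending the rule when there is no <head> element.
    return LIGATURE_OFF_CSS + html

if __name__ == "__main__":
    # "article.html" is a placeholder path for the page to be converted.
    with open("article.html", encoding="utf-8") as f:
        patched = disable_ligatures(f.read())
    with open("article-no-ligatures.html", "w", encoding="utf-8") as f:
        f.write(patched)
```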
Um, no. Sorry for not being clear; I'm updating the description. I'm pulling the PDFs from the web and extracting from them, so I have no control over how the PDFs are produced.
Do you have an example of such a PDF? Where does it come from? Because this article seems to have been originally in HTML. The problem applies similarly to native PDFs that use embedded fonts for ligatures, but it's somewhat worse because there is no upstream solution, except using OCR, which might degrade other aspects of the document.
Interesting...
Now, if only there were a way to be alerted when the ligature substitution might have occurred, so that excruciating manual examination of all processed documents would not be required...
Mmm, check whether "fi" or "ff" occurs in the text or not?
That's the rub. To know whether the text should contain those characters, you'd need a good extraction to compare against. Or you'd need a comprehensive (huge) set of patterns to look for in the bad text: "e ect" for effect, " nance" or " nancial" for finance or financial, ... Maybe use some machine learning to learn the patterns, or some NLP to detect nonsense sentences...
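One cheap heuristic that needs no reference extraction is to scan the output for fragments that only make sense if a ligature was dropped, e.g. a stray space where restoring "fi"/"ff"/"ffi" would yield a dictionary word. A rough sketch under that assumption (the fragment list, patterns, and scoring below are illustrative and far from exhaustive):

```python
import re

# Orphan tokens left when a word-initial "fi" is dropped
# ("financial" -> "nancial", "figure" -> "gure", "first" -> "rst", ...).
ORPHAN_TOKENS = {"nancial", "nance", "nding", "ndings", "gure", "gures", "rst", "eld"}

# Split words left when a mid-word ligature is dropped
# ("effect" -> "e ect", "significant" -> "signi cant", ...).
SPLIT_PATTERNS = [
    re.compile(r"\be\s+ect\w*\b", re.IGNORECASE),          # effect, effective
    re.compile(r"\bdi\s+erent\w*\b", re.IGNORECASE),        # different, difference
    re.compile(r"\bsigni\s+cant\w*\b", re.IGNORECASE),      # significant(ly)
    re.compile(r"\b(?:e|su)\s+cient\w*\b", re.IGNORECASE),  # efficient, sufficient
    re.compile(r"\bspeci\s+c\w*\b", re.IGNORECASE),         # specific(ally)
]

def ligature_damage_score(text: str) -> float:
    """Suspected dropped-ligature hits per 1000 words (0 means no evidence found)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = sum(1 for t in tokens if t in ORPHAN_TOKENS)
    hits += sum(len(p.findall(text)) for p in SPLIT_PATTERNS)
    return 1000.0 * hits / max(len(tokens), 1)

if __name__ == "__main__":
    sample = "The economic e ect on nancial markets was signi cant."
    print(ligature_damage_score(sample))  # non-zero, so flag the document for review
```

Anything above a small threshold could be routed to manual review or re-extracted via OCR.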
For info, I've worked on an OCR quality scorer to detect documents with noise coming from terrible OCR (like OCR from the nineties), so that the document can be filtered out or re-OCRed with a modern engine. It might be possible to apply it to your use case, as the "nonsense" text produced by the destructive HTML-to-PDF conversion should lower the quality score of the converted document. It's based on a DL language model applied to chunks of an input document, then normalized with an XGBoost model.
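As a very rough approximation of that idea (not the scorer described above), one can chunk the text, score each chunk with any small causal language model, and treat unusually high perplexity as a noise signal. A minimal sketch, assuming GPT-2 via the transformers library and skipping the XGBoost normalization step:

```python
# Simplified stand-in: per-chunk perplexity from a small causal LM as a
# noisiness signal. Model choice (gpt2) and chunk size are arbitrary.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def chunk_perplexities(text: str, chunk_tokens: int = 256) -> list[float]:
    """Perplexity of each chunk; mangled text tends to score higher."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    scores = []
    for start in range(0, len(ids), chunk_tokens):
        chunk = ids[start:start + chunk_tokens].unsqueeze(0)
        if chunk.shape[1] < 2:
            continue  # too short to score
        with torch.no_grad():
            loss = model(chunk, labels=chunk).loss
        scores.append(float(torch.exp(loss)))
    return scores

if __name__ == "__main__":
    clean = "The economic effect on financial markets was significant."
    broken = "The economic e ect on nancial markets was signi cant."
    print(max(chunk_perplexities(clean)), max(chunk_perplexities(broken)))
```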
IMHO this should be fixed in #1216 by adding a mapping for ligatures, loaded from outside pdfalto.
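Until something like that is in place, a downstream workaround is to normalize whatever ligature code points do survive extraction. A small sketch (note that this only helps when the ligature character itself appears in the output, not when it has already been replaced by a space):

```python
# Map Unicode ligature code points to their plain-letter expansions.
LIGATURE_MAP = {
    "\ufb00": "ff",   # ﬀ
    "\ufb01": "fi",   # ﬁ
    "\ufb02": "fl",   # ﬂ
    "\ufb03": "ffi",  # ﬃ
    "\ufb04": "ffl",  # ﬄ
    "\ufb05": "st",   # ﬅ (long s + t)
    "\ufb06": "st",   # ﬆ
}

def expand_ligatures(text: str) -> str:
    """Replace ligature code points with their plain-letter equivalents."""
    return text.translate(str.maketrans(LIGATURE_MAP))

assert expand_ligatures("\ufb01nancial e\ufb00ect") == "financial effect"
```

Unicode NFKC normalization (unicodedata.normalize("NFKC", text)) covers the same code points, at the cost of also rewriting other compatibility characters.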
I am using Grobid (0.6.1 and 0.7.0, Ubuntu 18.04) to extract the content of PDF files into HTML format. (Separately from Grobid, I further extract paragraph content for question answering.)
I have noticed several cases where a pair of characters is replaced by a space. The images below show the PDF document and the resulting extracted HTML content (direct output from Grobid) where the changes have occurred.
For example, character pairs in the extracted paragraphs, e.g., the "fi" in "financial", are replaced by a space. One example is https://www.americanprogress.org/article/economic-impact-coronavirus-united-states-possible-economic-policy-responses/, verified using the Grobid web app TEI tab (so, independent of any code I've written).
See, for example, what happens to the original:
As reflected in the extracted HTML:
This is doing a number on our NLP/NLU processing of these documents.
Any suggestions or adjustments?
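For reference, the same extraction can be reproduced outside the web app by posting a PDF to a running Grobid service and scanning the returned TEI for telltale gaps. A minimal sketch, assuming a local Grobid server on the default port 8070 and a placeholder file name:

```python
import requests

# Default fulltext endpoint of a locally running Grobid service.
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

with open("article.pdf", "rb") as pdf:  # placeholder PDF path
    resp = requests.post(GROBID_URL, files={"input": pdf}, timeout=120)
resp.raise_for_status()

tei = resp.text  # TEI XML; look for suspicious gaps such as " nancial"
print("possible dropped ligatures" if " nanc" in tei else "no obvious damage")
```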