Many ���� after converting when using pymupdf4llm #213

Open
cyc00518 opened this issue Jan 2, 2025 · 5 comments


@cyc00518

cyc00518 commented Jan 2, 2025

First, thank you for your excellent work!

I have a PDF file that shows ������� after conversion using pymupdf4llm.

import pymupdf4llm

# file_path points to the attached PDF (Fail.pdf)
content = pymupdf4llm.to_markdown(file_path, pages=[0])
print(content)

Output:
**��������������**


���������������������������������������������
�������������������������������������������������
�������������������������������������������������
������������������������������������������������
����������������������

# 4-9 HUMIDITY MEASUREMENT

��������������������������������

However, when simply opening it with fitz, there are no issues.

import fitz

# Same file, plain PyMuPDF text extraction of the first page
doc = fitz.open(file_path)
first_page = doc[0]
text = first_page.get_text()
print(text[:300])

Output:
frequency measured at the generator terminals may be
used instead of shaft speed to correct gas turbine perfor-
mance since the shaft speed is directly coupled to the line
frequency. The chosen method shall meet the uncertainty
requirement in this Code.
4-9 HUMIDITY MEASUREMENT
The moisture content 

Here is the document:
Fail.pdf

@yumingmin88

The file itself shows no garbled text; could it be something like missing fonts?

@cyc00518
Author

cyc00518 commented Jan 2, 2025

@yumingmin88
pymupdf4llm should be implemented based on fitz,
I don't think it has much to do with the system fonts?

@yumingmin88

> @yumingmin88 pymupdf4llm should be implemented based on fitz; I don't think it has much to do with the system fonts?

I'm not quite sure. When I ran into something similar with this tool, the PDF itself was garbled, but this file is not garbled.

@JorjMcKie
Contributor

Interesting situation!
The reason is that standard text extraction in PyMuPDF uses extraction flags with the special bit pymupdf.TEXT_CID_FOR_UNKNOWN_UNICODE switched on.
With this bit set, the so-called "CID" is used whenever the font has no back-translation info for a glyph. Sometimes, but not always, this helps make text readable. In many (!) other cases the text still does not become readable: instead of �, arbitrary weird symbols appear, which makes the situation even more confusing.
But in your case, it helps!
In PyMuPDF4LLM I have set that flag to off, so we do see the invalid Unicode symbols.
Unfortunately, there is no way to predict whether the bit will help or not. Only after extraction can we tell whether the text is readable.
BTW, if you look at the full page under PyMuPDF, other problems occur ... look closer!
I am hesitant to add a new parameter for this bit: it is hard to explain what it does, and the user will need trial and error if anything goes wrong ...
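
To check what the bit does for a particular file, one could extract the same page twice with plain PyMuPDF, once with the bit set and once with it cleared. This is only a sketch, not how PyMuPDF4LLM is implemented: `set_cid_bit` and `compare_extractions` are hypothetical helper names, and it assumes a PyMuPDF version that exposes `pymupdf.TEXT_CID_FOR_UNKNOWN_UNICODE` and `pymupdf.TEXTFLAGS_TEXT`.

```python
def set_cid_bit(flags: int, cid_bit: int, enabled: bool) -> int:
    """Return flags with the given bit switched on or off."""
    return flags | cid_bit if enabled else flags & ~cid_bit


def compare_extractions(path: str, page_number: int = 0) -> dict:
    """Extract one page with the CID bit on and off (hypothetical helper)."""
    import pymupdf  # imported lazily; requires PyMuPDF to be installed

    cid_bit = pymupdf.TEXT_CID_FOR_UNKNOWN_UNICODE
    base = pymupdf.TEXTFLAGS_TEXT  # flag set used for plain text extraction
    with pymupdf.open(path) as doc:
        page = doc[page_number]
        return {
            "cid_on": page.get_text("text", flags=set_cid_bit(base, cid_bit, True)),
            "cid_off": page.get_text("text", flags=set_cid_bit(base, cid_bit, False)),
        }
```

Comparing the two strings by eye, or counting "\ufffd" in each, shows whether the bit helps for that file; as noted above, this can only be judged after extraction.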

@cyc00518
Author

cyc00518 commented Jan 2, 2025

@JorjMcKie

Thank you for your explanation!
Currently, my approach is to reprocess the page with fitz whenever the pymupdf4llm output contains a lot of � characters.

However, the fallback result is just the plain-text output from PyMuPDF; it is not converted to Markdown.

If pymupdf4llm could implement similar retry logic in the backend, that would be fantastic!
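
The workaround described here could look roughly like the sketch below. It is a user-side approximation under stated assumptions, not an existing pymupdf4llm feature: `replacement_ratio` and `page_text_with_fallback` are hypothetical names, and the 0.3 threshold is an arbitrary value chosen for illustration.

```python
REPLACEMENT_CHAR = "\ufffd"  # the Unicode replacement character, shown as �


def replacement_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT_CHAR for ch in visible) / len(visible)


def page_text_with_fallback(path: str, page_number: int,
                            threshold: float = 0.3) -> str:
    """Try Markdown first; fall back to plain text if it looks unreadable."""
    import pymupdf      # lazy imports: both packages must be installed
    import pymupdf4llm

    markdown = pymupdf4llm.to_markdown(path, pages=[page_number])
    if replacement_ratio(markdown) <= threshold:
        return markdown
    # The fallback is plain text, so Markdown structure is lost here.
    with pymupdf.open(path) as doc:
        return doc[page_number].get_text()
```

The fallback still returns unformatted text, so it has the same limitation as the manual approach; it only automates the decision of when to retry.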
