Many �� after converting when using `pymupdf4llm` #213

cyc00518 · 2025-01-02T01:19:18Z

First, thank you for your excellent work!

I have a PDF file that shows �� after conversion using pymupdf4llm.

import pymupdf4llm
file_path = "Fail.pdf"  # the document attached below
content = pymupdf4llm.to_markdown(file_path, pages=[0])  # convert only the first page
print(content)

**��������������**


���������������������������������������������
�������������������������������������������������
�������������������������������������������������
������������������������������������������������
����������������������

# 4-9 HUMIDITY MEASUREMENT

��������������������������������

However, when simply opening it with fitz, there are no issues.

import fitz
file_path = "Fail.pdf"  # the document attached below
doc = fitz.open(file_path)
first_page = doc[0]
text = first_page.get_text()  # plain-text extraction with default flags
print(text[:300])

frequency measured at the generator terminals may be
used instead of shaft speed to correct gas turbine perfor-
mance since the shaft speed is directly coupled to the line
frequency. The chosen method shall meet the uncertainty
requirement in this Code.
4-9 HUMIDITY MEASUREMENT
The moisture content

Here is the document:
Fail.pdf


yumingmin88 · 2025-01-02T03:14:05Z

The file itself has no garbled text. Could it be something like missing fonts?

cyc00518 · 2025-01-02T06:40:15Z

@yumingmin88
pymupdf4llm should be built on top of fitz,
so I don't think this has much to do with system fonts?

yumingmin88 · 2025-01-02T06:44:42Z

@yumingmin88 pymupdf4llm should be built on top of fitz, so I don't think this has much to do with system fonts?

I'm not sure. When I ran into something similar with this tool, the PDF itself contained garbled text, but this file has none.

JorjMcKie · 2025-01-02T10:20:23Z

Interesting situation!
The reason is that standard text extraction in PyMuPDF uses extraction flags with the special flag bit `pymupdf.TEXT_CID_FOR_UNKNOWN_UNICODE` set to on.
With this bit set, the so-called "CID" is used if the font has no back-translation info for a glyph. Sometimes, not always, this helps make text readable. In many (!) other cases the text still does not become readable: instead of �, arbitrary weird symbols appear, which makes the situation even more confusing.
But in your case, it helps!
In PyMuPDF4LLM I have set that flag to off, so we do see the invalid Unicode symbols.
Unfortunately, there is no way to predict whether the bit option will help or not. Only after extraction can we tell whether the text is readable.
BTW, if you look at the full page under PyMuPDF, other problems occur ... look closer!
I am hesitant to add a new parameter for this bit: it is hard to explain what it does, and the user will need trial and error if anything goes wrong ...
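
To see the effect of the flag bit on a given page, one option is to extract the text twice, once with the bit set and once with it cleared, and compare how many U+FFFD characters each result contains. The following is only a sketch: the helper names `count_invalid` and `extract_with_and_without_cid` are made up for illustration, and it assumes a PyMuPDF version recent enough to expose `pymupdf.TEXT_CID_FOR_UNKNOWN_UNICODE`.

```python
REPLACEMENT_CHAR = "\ufffd"  # the "invalid Unicode" symbol shown as � in the output


def count_invalid(text: str) -> int:
    """Return how many U+FFFD replacement characters the text contains."""
    return text.count(REPLACEMENT_CHAR)


def extract_with_and_without_cid(path: str, page_number: int = 0) -> tuple[str, str]:
    """Extract one page twice: with the CID bit set, and with it cleared.

    Hypothetical helper, not part of PyMuPDF or PyMuPDF4LLM.
    """
    import pymupdf  # imported here so count_invalid works without PyMuPDF installed

    base = pymupdf.TEXTFLAGS_TEXT  # default flags for plain-text extraction
    cid_bit = pymupdf.TEXT_CID_FOR_UNKNOWN_UNICODE
    with pymupdf.open(path) as doc:
        page = doc[page_number]
        with_cid = page.get_text("text", flags=base | cid_bit)
        without_cid = page.get_text("text", flags=base & ~cid_bit)
    return with_cid, without_cid


# Example usage (requires PyMuPDF and a local copy of the PDF):
# with_cid, without_cid = extract_with_and_without_cid("Fail.pdf")
# print(count_invalid(with_cid), count_invalid(without_cid))
```

A lower U+FFFD count with the bit set does not prove the text is readable, since the CID fallback can also produce arbitrary symbols, so the result still needs a look by eye.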

cyc00518 · 2025-01-02T13:25:36Z

@JorjMcKie

Thank you for your explanation!
Currently, my approach is to reprocess the file with fitz if the pymupdf4llm output contains a lot of � characters.

However, the fallback result is just the plain-text output from pymupdf; it is not converted to Markdown.

If pymupdf4llm could implement similar retry logic in the backend, that would be fantastic!
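
The user-side workaround described in this comment could look roughly like the sketch below. It is not a pymupdf4llm feature: `replacement_ratio`, `to_markdown_with_fallback`, and the 10% threshold are assumptions chosen for illustration, and the fallback returns plain text rather than Markdown.

```python
REPLACEMENT_CHAR = "\ufffd"


def replacement_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT_CHAR for ch in visible) / len(visible)


def to_markdown_with_fallback(path, pages=None, threshold=0.1):
    """Try pymupdf4llm first; fall back to plain PyMuPDF text if too much is unreadable.

    Hypothetical helper. The threshold of 0.1 is an arbitrary assumption.
    """
    import pymupdf       # imported lazily so replacement_ratio has no dependencies
    import pymupdf4llm

    markdown = pymupdf4llm.to_markdown(path, pages=pages)
    if replacement_ratio(markdown) <= threshold:
        return markdown

    # Fallback: default PyMuPDF extraction, which yields plain text, not Markdown.
    with pymupdf.open(path) as doc:
        page_numbers = pages if pages is not None else range(doc.page_count)
        return "\n".join(doc[n].get_text() for n in page_numbers)


# Example usage (requires PyMuPDF, PyMuPDF4LLM and a local copy of the PDF):
# print(to_markdown_with_fallback("Fail.pdf", pages=[0]))
```

Measuring the ratio over non-whitespace characters keeps long runs of blank lines from diluting the count. Headings and other Markdown structure are lost whenever the fallback branch is taken.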

JorjMcKie added the not a bug label Jan 2, 2025
