bug/Partition-PDF-empty-elements #3885

MackBlackburn · 2025-01-23T17:36:45Z

Describe the bug
partition_pdf with strategy='fast' returns an empty list of elements for a PDF that has embedded text, so OCR should not be needed. Other libraries such as PyMuPDF return the text from the same file instantly.

Reproduction

from unstructured.partition.pdf import partition_pdf
import pymupdf

fname = 'file.PDF'

elements = partition_pdf(filename=fname, strategy='fast')
elements
Out[18]: []

# PyMuPDF reads the embedded text directly; pages are joined with a form feed (chr(12))
with pymupdf.open(fname) as doc:
    text = chr(12).join([page.get_text() for page in doc])
text
Out: ...many pages of text

Expected behavior
partition_pdf should return chunks of text without running OCR when the PDF has embedded text.
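
To confirm that the file really carries embedded text on its pages (and so should not need OCR), a minimal sketch along these lines may help. The helper name pages_with_text and the min_chars threshold are hypothetical and not part of unstructured or PyMuPDF; the function only inspects strings that have already been extracted.

```python
from typing import Iterable, List


def pages_with_text(page_texts: Iterable[str], min_chars: int = 1) -> List[int]:
    """Return 1-based page numbers whose text has at least `min_chars`
    non-whitespace characters.

    Hypothetical helper for illustration only: it takes already extracted
    page strings and does not open or parse any PDF itself.
    """
    found = []
    for number, page_text in enumerate(page_texts, start=1):
        # Drop all whitespace so pages holding only newlines or spaces
        # are not counted as having text.
        visible = "".join(page_text.split())
        if len(visible) >= min_chars:
            found.append(number)
    return found
```

With the reproduction above, calling pages_with_text(page.get_text() for page in doc) inside the pymupdf.open block lists the pages that have an embedded text layer. If that list covers the document while partition_pdf still returns [], the problem is unlikely to be a missing text layer.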

Environment Info
Please run python scripts/collect_env.py and paste the output here.

OS version:  Linux-5.14.0-427.26.1.el9_4.x86_64-x86_64-with-glibc2.34
Python version:  3.12.8
unstructured version:  0.16.15
unstructured-inference version:  0.8.1
pytesseract is not installed
Torch version:  2.5.1
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.39
magic file from /etc/magic:/usr/share/misc/magic
Traceback (most recent call last):
...
FileNotFoundError: [Errno 2] No such file or directory: 'libreoffice'

MackBlackburn added the bug (Something isn't working) label on Jan 23, 2025
