bug/Partition-PDF-empty-elements #3885

Open
MackBlackburn opened this issue Jan 23, 2025 · 0 comments
Labels
bug: Something isn't working

Describe the bug
partition_pdf with strategy='fast' returns an empty list of elements for a PDF that has embedded text and therefore needs no OCR. Other libraries such as PyMuPDF extract the text from the same file instantly.

Reproduction

from unstructured.partition.pdf import partition_pdf
import pymupdf

fname = 'file.PDF'

# 'fast' strategy: expected to extract embedded text without OCR
elements = partition_pdf(filename=fname, strategy='fast')
elements
Out[18]: []

# PyMuPDF reads the embedded text from the same file
with pymupdf.open(fname) as doc:
    text = chr(12).join([page.get_text() for page in doc])
text
Out: ...many pages of text

Expected behavior
partition_pdf should return chunks of text without running OCR when the PDF has embedded text.
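To check the premise that the file really has an embedded text layer, something like the sketch below could be run. The helper names `has_embedded_text` and `extract_page_texts` and the `min_chars` threshold are hypothetical and not part of unstructured or PyMuPDF. Only `pymupdf.open` and `page.get_text()` are real library calls, the same ones used in the reproduction.

```python
from typing import Iterable


def has_embedded_text(page_texts: Iterable[str], min_chars: int = 1) -> bool:
    """Return True if any page has at least `min_chars` non-whitespace characters.

    Hypothetical helper: `min_chars` is an illustrative threshold, not a
    setting of any library.
    """
    # str.split() with no arguments drops all whitespace, including form feeds.
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def extract_page_texts(path: str) -> list[str]:
    """Read the embedded text of each page with PyMuPDF, without OCR."""
    import pymupdf  # imported lazily so has_embedded_text works without it

    with pymupdf.open(path) as doc:
        return [page.get_text() for page in doc]
```

If `has_embedded_text(extract_page_texts(fname))` is true while `partition_pdf(filename=fname, strategy='fast')` still returns `[]`, the empty result is not explained by a missing text layer.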

Environment Info
Please run python scripts/collect_env.py and paste the output here.

OS version:  Linux-5.14.0-427.26.1.el9_4.x86_64-x86_64-with-glibc2.34
Python version:  3.12.8
unstructured version:  0.16.15
unstructured-inference version:  0.8.1
pytesseract is not installed
Torch version:  2.5.1
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.39
magic file from /etc/magic:/usr/share/misc/magic
Traceback (most recent call last):
...
FileNotFoundError: [Errno 2] No such file or directory: 'libreoffice'

MackBlackburn added the bug label Jan 23, 2025