Describe the bug
partition_pdf with strategy='fast' returns an empty list of elements for a PDF that has embedded text and does not need OCR. Other libraries such as PyMuPDF return the text from the same file instantly.
Reproduction
from unstructured.partition.pdf import partition_pdf
import pymupdf

fname = 'file.PDF'

# The 'fast' strategy returns no elements for this file
elements = partition_pdf(filename=fname, strategy='fast')
elements
Out[18]: []

# PyMuPDF extracts the embedded text from the same file
with pymupdf.open(fname) as doc:
    text = chr(12).join([page.get_text() for page in doc])
text
Out: ...many pages of text
Expected behavior
partition_pdf should return chunks of text without running OCR when the PDF has embedded text.
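As a rough way to confirm that a file really carries embedded text (and so should not need OCR), the per-page strings returned by an extractor can be checked for non-whitespace content. The sketch below is only an illustration: the helper names pages_with_embedded_text and needs_ocr and the min_chars threshold are hypothetical, not part of unstructured or PyMuPDF, and it makes no claim about how the 'fast' strategy decides internally.

```python
from typing import Iterable


def pages_with_embedded_text(page_texts: Iterable[str], min_chars: int = 1) -> list[int]:
    """Return 0-based indices of pages with at least min_chars non-whitespace characters.

    Hypothetical helper: page_texts is any iterable of per-page strings,
    for example the output of a text extractor.
    """
    hits = []
    for index, text in enumerate(page_texts):
        # Drop all whitespace so blank or form-feed-only pages do not count
        visible = "".join(text.split())
        if len(visible) >= min_chars:
            hits.append(index)
    return hits


def needs_ocr(page_texts: Iterable[str], min_chars: int = 1) -> bool:
    """Return True when no page carries extractable text (assumed meaning of 'OCR needed')."""
    return not pages_with_embedded_text(page_texts, min_chars)
```

With the reproduction above, passing (page.get_text() for page in doc) inside the pymupdf.open block would show which pages have embedded text; for the file in this report, needs_ocr would be expected to return False, since PyMuPDF returns many pages of text.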
Environment Info
Please run python scripts/collect_env.py and paste the output here.
OS version: Linux-5.14.0-427.26.1.el9_4.x86_64-x86_64-with-glibc2.34
Python version: 3.12.8
unstructured version: 0.16.15
unstructured-inference version: 0.8.1
pytesseract is not installed
Torch version: 2.5.1
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.39
magic file from /etc/magic:/usr/share/misc/magic
Traceback (most recent call last):
...
FileNotFoundError: [Errno 2] No such file or directory: 'libreoffice'