In pymupdf4llm, if a page has multiple images, only 1 image per-page is extracted #212

Open
dhavalsingh opened this issue Dec 26, 2024 · 2 comments

@dhavalsingh

Description of the bug

test_with_images (5).pdf

For example, with this PDF, which has multiple images per page, calling

pymupdf4llm.to_markdown(
    updated_save_path, write_images=True, image_path=image_path
)

we get only 3 images: just the last image on each page.

How to reproduce the bug

  1. Use the PDF above, or any PDF with multiple images per page.
  2. Run the code:
pymupdf4llm.to_markdown(
    updated_save_path, write_images=True, image_path=image_path
)
  3. Check the output folder.
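
One way to check the output folder is to count the image files it contains and compare that number with the number of images in the PDF. The sketch below is only an illustration: the helper name `count_image_files` and the set of file extensions are assumptions, not part of pymupdf4llm.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever format was written.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def count_image_files(folder):
    """Return how many image files sit directly inside `folder` (hypothetical helper)."""
    return sum(
        1
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )
```

Calling `count_image_files(image_path)` after `to_markdown` gives the number of extracted images to compare against the expected count.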

PyMuPDF version

1.24.13

Operating system

MacOS

Python version

3.12

@JorjMcKie JorjMcKie transferred this issue from pymupdf/PyMuPDF Dec 27, 2024
@JorjMcKie
Contributor

This has nothing to do with PyMuPDF itself; it concerns this package.
You did not mention a version number for this package.
Your finding looks like a known bug in v0.0.17. Please try version 0.0.16 instead.

@Amraneze

Amraneze commented Jan 20, 2025

@JorjMcKie You are correct, the issue is in pymupdf4llm, specifically in this line:

del parms.img_rects[i] # do not touch this image twice

The problem is that deleting an element from a list while iterating over it shifts the indices of all later elements. Take the list [1, 2, 3, 4, 5]: in the first iteration (i = 0) we remove the first element, which leaves [2, 3, 4, 5]. Because enumerate then advances to index i = 1, it lands on element 3 and skips 2. Instead of processing all 5 elements, the loop processes only 3.
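
The effect can be reproduced in isolation with a plain list. This is a minimal sketch, not the actual pymupdf4llm code: the function names are hypothetical, and iterating in reverse is just one possible way to avoid the skipping, not necessarily the fix the maintainers would choose.

```python
def process_while_deleting(items):
    """Mimic the problematic pattern: delete the current element inside enumerate()."""
    items = list(items)  # work on a copy so the caller's list is untouched
    processed = []
    for i, item in enumerate(items):
        processed.append(item)
        del items[i]  # every later element shifts one position to the left
    return processed, items


def process_in_reverse(items):
    """One possible alternative: walk the indices backwards so deletions
    never shift an element that has not been visited yet."""
    items = list(items)
    processed = []
    for i in range(len(items) - 1, -1, -1):
        processed.append(items[i])
        del items[i]
    processed.reverse()  # restore the original order
    return processed, items


print(process_while_deleting([1, 2, 3, 4, 5]))  # ([1, 3, 5], [2, 4])
print(process_in_reverse([1, 2, 3, 4, 5]))      # ([1, 2, 3, 4, 5], [])
```

In the first function, elements 2 and 4 are never processed and remain in the list, which matches the 3-out-of-5 behaviour described above.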
