Feat/pdf2markdown #406

Draft · wants to merge 67 commits into base: staging
Conversation

@Greenheart (Collaborator) commented Dec 8, 2024

Goals:

  • Get PDF parsing working again by replacing nlm-ingestor with docling
  • Refactor nlmParsePDF and nlmExtractTables to run in a separate container, isolating their dependencies and making it easier to scale both vertically and horizontally.

Several key tasks have been completed in the past 36 hours. We're getting closer to a working pdf2markdown service.

Next steps:

  • Remove embedded PNGs from the JSON output file. Or perhaps we'd like to use these base64-encoded images directly, instead of writing them as PNG files to the tmp directory? It might be worth reducing file I/O by using one large JSON DoclingDocument that includes the relevant images.
    • Determine if the base64-encoded images are the same as the images stored to disk.
    • Only keep images for pages where tables were found.
  • See if we can remove the $refs from the Docling JSON format. Otherwise, in Node.js, we could wrap the parsed DoclingDocument in a Proxy to replace specific properties, or just add a parsing function that replaces all $refs manually. We need to be careful with circular $refs, though.
  • Refactor JSON parsing to use DoclingDocument and pass relevant page images to the Vision API for extraction.
    • Improve the context passed when extracting tables using the Vision API. See the comment in parse_pdf.py about parsing the table/page markdown and adding it to each page for easy use later.
    • Maybe just start with extracting table data without any additional text context. In that case, we could ignore the $ref parsing.
    • We could also extract all page markdown from the JSON structure by filtering on json.texts.prov[0].page_no. Extracting all markdown for a specific page should be possible in Python too: for example, we could build a document representing only that page and render it as markdown. We would then add the result to a property like json.pages.markdown, which can easily be used later in the pipeline. This is only needed for the pages where we need the page markdown as context for table extraction.
    • Maybe only save the JSON data we actually care about: json.pages with page_no, markdown and image. We can always generate more JSON context if we need it in the future, for example if we want to replace the low-quality tables in the originally extracted markdown with the (hopefully) higher-quality markdown extracted via the Vision API.
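
The manual $ref replacement and per-page filtering described above could be sketched in Python roughly like this. This is a sketch, not the actual implementation: the JSON-pointer-style "#/texts/0" refs and the texts/prov[0].page_no shape are taken from the notes above, while the function names and the exact DoclingDocument schema details are assumptions.

```python
import json


def resolve_refs(node, root, seen=None):
    """Recursively replace {"$ref": "#/path/0"} objects with the referenced value.

    Circular $refs are left as-is to avoid infinite recursion (see the note
    about being careful with circular $refs)."""
    if seen is None:
        seen = set()
    if isinstance(node, dict):
        if set(node) == {"$ref"}:
            pointer = node["$ref"]
            if pointer in seen:
                return node  # circular reference: keep the raw $ref
            target = root
            for part in pointer.lstrip("#/").split("/"):
                target = target[int(part)] if isinstance(target, list) else target[part]
            return resolve_refs(target, root, seen | {pointer})
        return {key: resolve_refs(value, root, seen) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_refs(value, root, seen) for value in node]
    return node


def texts_for_page(doc, page_no):
    """Collect text items whose first provenance entry is on the given page,
    assuming the json.texts[].prov[0].page_no shape mentioned above."""
    return [
        item["text"]
        for item in doc.get("texts", [])
        if item.get("prov") and item["prov"][0].get("page_no") == page_no
    ]
```

If we only extract table data without additional text context (as suggested above), the $ref handling could be skipped entirely.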

Preparing pdf2markdown for deployment

  • Verify that tesserocr has been correctly installed. Perhaps it should be added to pdf2markdown/pyproject.toml?
  • Remove the NLM-ingestor code, config and dependencies in both garbo and pdf2markdown.
    • Remove config in Dockerfile
    • Remove env variables
  • Remove pdf2pic, its related code in both garbo and pdf2markdown, and its native dependencies. Docling already handles screenshots, so there's no need for another stack for converting PDFs into images.
    • Remove config in the Dockerfile (native libs we no longer need)
  • Complete the Docker container for pdf2markdown and test that it works as expected by running it locally. Verify all dependencies, especially the native libs.
  • Refactor nlmParsePDF to make use of the new pdf2markdown service, and start a flow for indexMarkdown -> precheck with the result returned by pdf2markdown
  • Investigate if we can configure GPU acceleration for Docling / PyTorch / Tesseract
  • Set up k8s config for the pdf2markdown service. Ideally 5 replicas in the coming days, which can be decreased after the release.
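
The k8s item above might start from a Deployment manifest roughly like this. Only replicas: 5 comes from the notes; the labels, image name, and resource figures are placeholders to be replaced with the project's real values:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pdf2markdown
spec:
  replicas: 5  # can be decreased after the release
  selector:
    matchLabels:
      app: pdf2markdown
  template:
    metadata:
      labels:
        app: pdf2markdown
    spec:
      containers:
        - name: pdf2markdown
          image: example.registry/pdf2markdown:latest  # hypothetical image name
          resources:
            requests:
              memory: "2Gi"  # placeholder figures
            limits:
              memory: "4Gi"
```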

Medium prio

  • Evaluate JSON references to get the correct strings, and allow filtering tables based on search terms. This would let us keep only the relevant tables and skip a lot of unwanted processing, both speeding up pdf2markdown and saving money.
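
The search-term filtering above could be as simple as the following sketch. The table shape (a cells list of objects with a text field) is an assumption; the real Docling table schema may differ, and the function name is hypothetical:

```python
def filter_tables(tables, search_terms):
    """Keep only tables whose cell text mentions any of the search terms.

    Assumes each table is a dict with a "cells" list of {"text": ...} objects;
    matching is case-insensitive."""
    terms = [term.lower() for term in search_terms]
    kept = []
    for table in tables:
        text = " ".join(cell.get("text", "") for cell in table.get("cells", [])).lower()
        if any(term in text for term in terms):
            kept.append(table)
    return kept
```

Tables that don't match any term would then be skipped before the (expensive) Vision API extraction step.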

Nice to haves (after release):

  • See if we can replace the input.pdf file and communicate between processes using stdio instead of writing and reading tmp files.

  • TODO: Figure out how to get both the stdout and stderr from the child process when running python parse_pdf.py

  • Instead of writing a tmp file, we might be able to stream the file via stdin directly to the Python code. See a working example: https://github.com/microsoft/markitdown/blob/main/src/markitdown/__main__.py
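
The Python side of that stdin approach might look like this minimal sketch. The function name and the %PDF magic-bytes sanity check are assumptions, not part of the current code:

```python
import io
import sys


def read_pdf_bytes(stream=None):
    """Read the whole PDF payload from a binary stream.

    Defaults to sys.stdin.buffer, so the Node.js parent process could pipe the
    file directly instead of writing input.pdf to the tmp directory."""
    stream = stream if stream is not None else sys.stdin.buffer
    data = stream.read()
    if not data.startswith(b"%PDF"):
        raise ValueError("stdin did not contain a PDF")
    return data
```

On the Node.js side this would pair with writing the file buffer to the child process's stdin when spawning python parse_pdf.py.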

irony and others added 30 commits, November 28, 2024 16:31. One commit note: "This greatly simplifies the following steps, and reduces the tmp storage used for each report."