Feat/pdf2markdown #406

Draft · wants to merge 67 commits into base: staging
Conversation

@Greenheart (Collaborator) commented Dec 8, 2024

Goals:

  • Get PDF parsing working again by replacing nlm-ingestor with docling
  • Refactor nlmParsePDF and nlmExtractTables to run in a separate container, isolating their dependencies and making it easier to scale both vertically and horizontally.

Several key tasks have been completed in the past 36 hours. We're getting closer to a working pdf2markdown service.

Next steps:

  • Remove embedded PNGs from the JSON output file. Or perhaps we'd like to use these base64-encoded images directly, instead of writing them as PNG files to the tmp directory? It might be worth reducing file I/O by using one large JSON DoclingDocument that includes the relevant images.
    • Determine if the base64-encoded images are the same as the images stored to disk.
    • Only keep images for pages where tables were found.
  • See if we can remove the $refs from the Docling JSON format. Otherwise, in Node.js, we could wrap the parsed DoclingDocument in a Proxy to replace specific properties, or just add a parsing function that replaces all $refs manually. We need to be careful with circular $refs, though.
  • Refactor JSON parsing to use DoclingDocument and pass relevant page images to the Vision API for extraction.
    • Improve the context passed when extracting tables using the Vision API. See the comment in parse_pdf.py about parsing the table/page markdown and adding it to each page for easy use later.
    • Maybe just start with extracting table data without any additional text context. In that case, we could ignore the $ref parsing.
    • We could also extract all page markdown from the JSON structure by filtering on json.texts.prov[0].page_no. Extracting all markdown for a specific page should be possible in Python too: for example, we could build a document representing only that page and render it as markdown. We would then add the result to a property like json.pages.markdown, which can easily be used later in the pipeline. This is only needed for the pages where we need the page markdown as context for table extraction.
    • Maybe only save the JSON data we actually care about: json.pages with page_no, markdown and image. We can always generate more JSON context if we need it in the future, for example if we want to replace the low-quality tables in the originally extracted markdown with the (hopefully) higher-quality markdown extracted via the Vision API.
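
The manual $ref replacement and per-page filtering described above could be sketched in Python roughly like this. This is a sketch, not the actual implementation: the JSON-pointer-style "#/texts/0" refs and the texts/prov[0].page_no shape are taken from the notes above, while the function names and the exact DoclingDocument schema details are assumptions.

```python
import json


def resolve_refs(node, root, seen=None):
    """Recursively replace {"$ref": "#/path/0"} objects with the referenced value.

    Circular $refs are left as-is to avoid infinite recursion (see the note
    about being careful with circular $refs)."""
    if seen is None:
        seen = set()
    if isinstance(node, dict):
        if set(node) == {"$ref"}:
            pointer = node["$ref"]
            if pointer in seen:
                return node  # circular reference: keep the raw $ref
            target = root
            for part in pointer.lstrip("#/").split("/"):
                target = target[int(part)] if isinstance(target, list) else target[part]
            return resolve_refs(target, root, seen | {pointer})
        return {key: resolve_refs(value, root, seen) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_refs(value, root, seen) for value in node]
    return node


def texts_for_page(doc, page_no):
    """Collect text items whose first provenance entry is on the given page,
    assuming the json.texts[].prov[0].page_no shape mentioned above."""
    return [
        item["text"]
        for item in doc.get("texts", [])
        if item.get("prov") and item["prov"][0].get("page_no") == page_no
    ]
```

If we only extract table data without additional text context (as suggested above), the $ref handling could be skipped entirely.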

Preparing pdf2markdown for deployment

  • Verify that tesserocr has been correctly installed. Perhaps it should be added to pdf2markdown/pyproject.toml?
  • Remove the NLM-ingestor code, config and dependencies in both garbo and pdf2markdown.
    • Remove config in Dockerfile
    • Remove env variables
  • Remove pdf2pic, its related code in both garbo and pdf2markdown, and its native dependencies. Docling already handles screenshots, so there's no need for another stack for converting PDFs into images.
    • Remove config in the Dockerfile (native libs we no longer need)
  • Complete the Docker container for pdf2markdown and test that it works as expected by running it locally. Verify all dependencies, especially the native libs.
  • Refactor nlmParsePDF to make use of the new pdf2markdown service, and start a flow for indexMarkdown -> precheck with the result returned by pdf2markdown
  • Investigate if we can configure GPU acceleration for Docling / PyTorch / Tesseract
  • Set up k8s config for the pdf2markdown service. Ideally 5 replicas in the coming days, which can be decreased after the release.
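
The k8s item above might start from a Deployment manifest roughly like this. Only replicas: 5 comes from the notes; the labels, image name, and resource figures are placeholders to be replaced with the project's real values:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pdf2markdown
spec:
  replicas: 5  # can be decreased after the release
  selector:
    matchLabels:
      app: pdf2markdown
  template:
    metadata:
      labels:
        app: pdf2markdown
    spec:
      containers:
        - name: pdf2markdown
          image: example.registry/pdf2markdown:latest  # hypothetical image name
          resources:
            requests:
              memory: "2Gi"  # placeholder figures
            limits:
              memory: "4Gi"
```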

Medium prio

  • Evaluate JSON references to get the correct strings, and allow filtering tables based on search terms. This would let us keep only the relevant tables and skip a lot of unwanted processing, both speeding up pdf2markdown and saving money.
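
The search-term filtering above could be as simple as the following sketch. The table shape (a cells list of objects with a text field) is an assumption; the real Docling table schema may differ, and the function name is hypothetical:

```python
def filter_tables(tables, search_terms):
    """Keep only tables whose cell text mentions any of the search terms.

    Assumes each table is a dict with a "cells" list of {"text": ...} objects;
    matching is case-insensitive."""
    terms = [term.lower() for term in search_terms]
    kept = []
    for table in tables:
        text = " ".join(cell.get("text", "") for cell in table.get("cells", [])).lower()
        if any(term in text for term in terms):
            kept.append(table)
    return kept
```

Tables that don't match any term would then be skipped before the (expensive) Vision API extraction step.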

Nice to haves (after release):

  • See if we can replace the input.pdf file and communicate between processes using stdio instead of writing and reading tmp files.

  • TODO: Figure out how to get both the stdout and stderr from the child process when running python parse_pdf.py

  • Instead of writing a tmp file, we might be able to stream the file via stdin directly to the Python code. See a working example: https://github.com/microsoft/markitdown/blob/main/src/markitdown/__main__.py
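
The Python side of that stdin approach might look like this minimal sketch. The function name and the %PDF magic-bytes sanity check are assumptions, not part of the current code:

```python
import io
import sys


def read_pdf_bytes(stream=None):
    """Read the whole PDF payload from a binary stream.

    Defaults to sys.stdin.buffer, so the Node.js parent process could pipe the
    file directly instead of writing input.pdf to the tmp directory."""
    stream = stream if stream is not None else sys.stdin.buffer
    data = stream.read()
    if not data.startswith(b"%PDF"):
        raise ValueError("stdin did not contain a PDF")
    return data
```

On the Node.js side this would pair with writing the file buffer to the child process's stdin when spawning python parse_pdf.py.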

irony and others added 30 commits, November 28, 2024 16:31. One commit note: "This greatly simplifies the following steps, and reduces the tmp storage used for each report."