Feat/pdf2markdown #406
Draft: Greenheart wants to merge 67 commits into staging from feat/pdf2markdown
…aders, and paragraphs
Co-authored-by: jomiq <[email protected]>
…ned to catch them all
This greatly simplifies the following steps, and reduces the tmp storage used for each report
Goals:
- … `nlm-ingestor` with `docling`
- … `nlmParsePDF` and `nlmExtractTables` to run in a separate container, separating dependencies and allowing easier scaling, both vertically and horizontally.

Several key tasks have been completed in the past 36 hours. We're getting closer to a working `pdf2markdown` service.

Next steps:
- … `$ref`s from the Docling JSON format. Otherwise, in Node.js, we could wrap the parsed `DoclingDocument` with a `Proxy` to replace specific properties, or just add a parsing function which replaces all `$refs` manually. Need to be careful with circular `$refs` though.
- … `parse_pdf.py` about parsing the table/page markdown and add it to each page for easy usage later.
- … `$ref` parsing.
- … `json.texts.prov[0].page_no`. And extracting all markdown for a specific page should be possible in Python too. We could for example get a document only representing the specific page, and then render that as markdown. Then we add that to a property like `json.pages.markdown`, which can easily be used later in the pipeline. We only need to do this for the pages where we need the page markdown as context for the table extraction.
- … `json.pages` with `page_no`, `markdown` and `image`. We can always generate more JSON context if we need it in the future, for example if we want to replace the low quality tables in the original extracted markdown with the (hopefully) better quality markdown extracted via the Vision API.

Preparing `pdf2markdown` for deployment
- … `tesserocr` has been correctly installed. Perhaps it should be added to `pdf2markdown/pyproject.toml`?
- … `garbo` and in `pdf2markdown`.
- … `pdf2pic`, related code in both `garbo` and `pdf2markdown`, and its native dependencies. `Docling` handles screenshots already, so there is no need for another stack for processing PDFs into images.
- … `pdf2markdown` and test that it works as expected by running it locally. Verify all dependencies and especially native libs.
- … `nlmParsePDF` to make use of the new `pdf2markdown` service, and start a flow for `indexMarkdown -> precheck` with the result returned by `pdf2markdown`
- … `pdf2markdown` service. Ideally 5 replicas in the coming days, which can be decreased after the release.

Medium prio
- … `pdf2markdown`, and saving money.

Nice to haves (after release):
See if we can replace the `input.pdf` file and communicate between processes using `stdio` instead of writing and reading tmp files.

TODO: Figure out how to get both the `stdout` and `stderr` from the child process when running `python parse_pdf.py`
Instead of writing a tmp file, we might be able to stream the file using stdin directly to the Python code. See this example: https://github.com/microsoft/markitdown/blob/main/src/markitdown/__main__.py
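
Returning to the `$ref` and per-page items under Next steps, here is a rough Python sketch of how they might be handled. It assumes refs are JSON-pointer style strings such as `#/texts/0`, and that each text item carries `prov[0].page_no`. The helper names `resolve_ref`, `children_of` and `texts_by_page` are hypothetical, and the sample document is hand-written, not real Docling output. Refs are looked up on demand instead of being inlined, which is one way to stay safe with circular `$refs` (a child pointing at its parent and back).

```python
from collections import defaultdict
from typing import Any


def resolve_ref(doc: dict, ref: str) -> Any:
    """Look up a JSON-pointer style ref such as '#/texts/0' inside doc."""
    if not ref.startswith("#/"):
        raise ValueError(f"unsupported $ref: {ref!r}")
    node: Any = doc
    for part in ref[2:].split("/"):
        # List segments are numeric indices, dict segments are keys.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def children_of(doc: dict, item: dict) -> list:
    """Resolve one level of children lazily, so parent/child cycles never expand."""
    return [resolve_ref(doc, child["$ref"]) for child in item.get("children", [])]


def texts_by_page(doc: dict) -> dict:
    """Group text items by the page number of their first provenance entry."""
    pages = defaultdict(list)
    for item in doc.get("texts", []):
        prov = item.get("prov") or []
        if prov:
            pages[prov[0]["page_no"]].append(item["text"])
    return dict(pages)


# Tiny hand-written document for illustration only.
doc = {
    "body": {"children": [{"$ref": "#/texts/0"}, {"$ref": "#/texts/1"}]},
    "texts": [
        {"parent": {"$ref": "#/body"}, "text": "Title", "prov": [{"page_no": 1}]},
        {"parent": {"$ref": "#/body"}, "text": "Intro", "prov": [{"page_no": 2}]},
    ],
}

first_child = children_of(doc, doc["body"])[0]
print(first_child["text"])
print(resolve_ref(doc, first_child["parent"]["$ref"]) is doc["body"])
print(texts_by_page(doc))
```

The per-page grouping could then feed a `markdown` property on each entry of `json.pages`, limited to the pages that actually need it as context for table extraction. This sketch does not handle escaped characters in pointer segments.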