Feat/pdf2markdown #406

Draft
wants to merge 67 commits into
base: staging
Commits
6850654
feat: Create PDF to Markdown microservice with NLM ingestor and OpenA…
irony Nov 28, 2024
cfd22b3
feat: Return markdown text directly instead of JSON response
irony Nov 28, 2024
bbaa8a2
refactor: Optimize table extraction by directly using Vision API in j…
irony Nov 28, 2024
4338766
refactor: Remove duplicate code, add error handling, and improve temp…
irony Nov 28, 2024
757a199
refactor: Remove unused imports and vision API extraction function
irony Nov 28, 2024
18168b3
refactor: Optimize table extraction by using buffer instead of tempor…
irony Nov 28, 2024
7e52a8d
fix: bugfixes
irony Nov 28, 2024
5b7f950
refactor: Add NLM ingestor health check and improve error logging
irony Nov 28, 2024
2a13539
refactor: Remove unnecessary lastTableMarkdown assignment
irony Nov 28, 2024
68937db
fix: Update NLM ingestor schema to match new response structure
irony Nov 28, 2024
02d4273
fix: Handle undefined block content in PDF conversion
irony Nov 29, 2024
bc75430
refactor: Improve JSON to Markdown conversion with better block handl…
irony Nov 29, 2024
44c6fc4
refactor: Update NLM ingestor health check and remove schema validation
irony Nov 29, 2024
43d5fcc
refactor: Add schema validation and error handling for PDF conversion
irony Nov 29, 2024
8f7b761
refactor: Add detailed schema validation debugging for NLM Ingestor r…
irony Nov 29, 2024
f5d1b73
fix: Make NLM ingestor schema more flexible for optional content
irony Nov 29, 2024
2f0df9d
fix: Import ParsedDocumentSchema in jsonExtraction.ts
irony Nov 29, 2024
5d56097
fix: Add detailed logging to diagnose PDF content extraction issue
irony Nov 29, 2024
5d1b49c
refactor: Simplify imports and improve block type detection in PDF ex…
irony Nov 29, 2024
8d91c6f
feat: Improve PDF content extraction with enhanced validation and err…
irony Nov 29, 2024
7316d25
refactor: Enhance block parsing with modular functions for tables, he…
irony Nov 29, 2024
44191c4
fix: remove temp files - we are using buffer
irony Dec 4, 2024
f63089a
Merge branch 'staging' into feat/pdf2markdown
Greenheart Dec 7, 2024
b53c0e2
Add Docling docker config
Greenheart Dec 7, 2024
fbcc701
Parse PDF to JSON with Docling
Greenheart Dec 7, 2024
9000b54
Try page and table screenshots
Greenheart Dec 7, 2024
2a4b1b8
Use DocumentStream as input
Greenheart Dec 7, 2024
d36b0d4
Cleanup
Greenheart Dec 7, 2024
1ff9883
Parse document from Node
Greenheart Dec 7, 2024
23c0ad8
Pass docId from Node to Python and ensure it exists
Greenheart Dec 7, 2024
ca49503
Remove duplicate time logging (docling already does this)
Greenheart Dec 7, 2024
ae45036
Improve structure
Greenheart Dec 7, 2024
e8ecdb5
Output both JSON and Markdown to the temporary docId directory
Greenheart Dec 7, 2024
c37c626
Add env for openai and keep .env.example
Greenheart Dec 7, 2024
efffe10
Remove old script
Greenheart Dec 7, 2024
dfe7231
Safe parse
Greenheart Dec 7, 2024
4ca156e
Improve args for parsing script
Greenheart Dec 7, 2024
55ee960
WIP: POST /convert to upload a PDF and convert to markdown
Greenheart Dec 8, 2024
97854f0
Improve logging
Greenheart Dec 8, 2024
a99c9c4
Use PythonShell to get better error logging
Greenheart Dec 8, 2024
1907cc5
Clarify usage
Greenheart Dec 8, 2024
feff2e7
Make regular logs show up
Greenheart Dec 8, 2024
56e010b
Log completion time for the request handler
Greenheart Dec 8, 2024
3e6300f
Clarify types
Greenheart Dec 8, 2024
7ff5e6a
Parse DoclingDocument and improve how unique page numbers are determi…
Greenheart Dec 8, 2024
3f09a44
Save page images with Docling
Greenheart Dec 8, 2024
da2083d
Extract images for pages with tables using Docling
Greenheart Dec 8, 2024
0002730
Improve type
Greenheart Dec 8, 2024
85a6046
Rename
Greenheart Dec 8, 2024
fe43fd2
Experiment with preserving both stdout and stderr from the child process
Greenheart Dec 8, 2024
5a3ea87
Start refactoring JSON processing to use Docling format
Greenheart Dec 8, 2024
f20e23e
Group all tables based on pages until we can filter by relevant searc…
Greenheart Dec 8, 2024
01f2767
Automatically load a specific python version in the project
Greenheart Dec 8, 2024
816c21b
Parse images exported by Docling
Greenheart Dec 8, 2024
eb648e2
Verify that all images match the base64 encoded representations
Greenheart Dec 8, 2024
a7c126b
Increase image quality
Greenheart Dec 8, 2024
cb3d076
Only keep images for wanted pages
Greenheart Dec 8, 2024
1f9e972
Save Base64-encoded images only for pages with tables.
Greenheart Dec 8, 2024
fb48ff6
Format Python code using the Black opinionated formatter
Greenheart Dec 8, 2024
106fdcb
WIP: Refactor table extraction
Greenheart Dec 8, 2024
0fc46c7
Simplify
Greenheart Dec 8, 2024
4409aff
WIP: Experiment with dereferencing in Node.js. Not working
Greenheart Dec 8, 2024
d8951a8
WIP: Experiment with dereferencing json in Python. Not working
Greenheart Dec 8, 2024
2d10ff3
Add ideas for providing more relevant context to the Vision API
Greenheart Dec 8, 2024
bca466d
WIP: Try different args
Greenheart Dec 8, 2024
2ad225f
Add notes
Greenheart Dec 9, 2024
d4b0e3f
Make parsing work again and add idea for image extraction
Greenheart Dec 9, 2024
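
Several of the commits above ("improve how unique page numbers are determi…", "Only keep images for wanted pages", "Save Base64-encoded images only for pages with tables.") describe one filtering step. The following is a minimal sketch of that idea, not the code from this PR. The input shape (a `prov` list with `page_no` entries on each table) and both function names are assumptions chosen for illustration.

```python
import base64
from typing import Iterable


def pages_with_tables(tables: Iterable[dict]) -> list[int]:
    """Return the sorted, de-duplicated page numbers that contain a table."""
    pages = set()
    for table in tables:
        # Assumed shape: each table lists the pages it appears on under "prov".
        for prov in table.get("prov", []):
            pages.add(prov["page_no"])
    return sorted(pages)


def encode_wanted_pages(
    page_images: dict[int, bytes], wanted: Iterable[int]
) -> dict[int, str]:
    """Base64-encode images only for the wanted pages.

    Pages that are wanted but have no image are skipped rather than raising.
    """
    return {
        page: base64.b64encode(page_images[page]).decode("ascii")
        for page in wanted
        if page in page_images
    }


if __name__ == "__main__":
    tables = [
        {"prov": [{"page_no": 3}]},
        {"prov": [{"page_no": 3}, {"page_no": 4}]},  # table spanning two pages
        {"prov": [{"page_no": 1}]},
    ]
    images = {1: b"page-1", 2: b"page-2", 3: b"page-3"}
    wanted = pages_with_tables(tables)
    print(wanted)
    print(sorted(encode_wanted_pages(images, wanted)))
```

Collecting page numbers into a set first means a page holding several tables is encoded once, which keeps the payload sent to the Vision API smaller.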
4 changes: 2 additions & 2 deletions .gitignore
@@ -1,10 +1,10 @@
 node_modules
 dist
 .env
-.aider*
 .env*
+!.env.example
+.aider*
 .DS_Store
-!env.example
 prisma/generated
 *.xlsx

6 changes: 6 additions & 0 deletions pdf2markdown/.env.example
@@ -0,0 +1,6 @@
# This file contains the environment variables that are used in the project.
# Copy this file to .env and fill in the values for the environment variables.

NODE_ENV=development

OPENAI_API_KEY=
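
A service reading these variables would typically fail fast when `OPENAI_API_KEY` is left empty. The sketch below shows one way to do that on the Python side. It is not taken from this PR: `require_env`, `load_settings`, and `MissingEnvError` are hypothetical names, the `development` fallback is an assumption, and the variables are assumed to be already exported (for example by Docker or a dotenv loader), since the code does not read the `.env` file itself.

```python
import os
from typing import Mapping, Optional


class MissingEnvError(RuntimeError):
    """Raised when a required environment variable is unset or empty."""


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the value of a required variable, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise MissingEnvError(
            f"{name} is not set; copy .env.example to .env and fill it in"
        )
    return value


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the two variables listed in .env.example into one dict."""
    env = os.environ if env is None else env
    return {
        # Assumed default: fall back to "development" when NODE_ENV is unset.
        "node_env": env.get("NODE_ENV", "development"),
        "openai_api_key": require_env("OPENAI_API_KEY", env),
    }
```

Passing the environment in as a mapping keeps the check easy to exercise without touching the real process environment.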
10 changes: 10 additions & 0 deletions pdf2markdown/.gitignore
@@ -0,0 +1,10 @@
node_modules
dist
.env
.env*
!.env.example
.aider*
.DS_Store

*.pdf
*.code-workspace
1 change: 1 addition & 0 deletions pdf2markdown/.python-version
@@ -0,0 +1 @@
3.12
47 changes: 47 additions & 0 deletions pdf2markdown/Dockerfile
@@ -0,0 +1,47 @@
FROM python:3.12-slim-bookworm

ARG CPU_ONLY=false

WORKDIR /app

RUN apt-get update \
    && apt-get install -y libgl1 libglib2.0-0 curl wget git \
    && apt-get clean

# Install Poetry and configure it
RUN pip install poetry \
    && poetry config virtualenvs.create false

COPY pyproject.toml poetry.lock ./

# Install dependencies before torch
RUN poetry install --no-interaction --no-root

# Install PyTorch separately based on CPU_ONLY flag
# TODO: Use correct GPU build - see https://pytorch.org/ for details
RUN if [ "$CPU_ONLY" = "true" ]; then \
    pip install --no-cache-dir torch torchvision --extra-index-url https://download.pytorch.org/whl/cpu; \
    else \
    pip install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121; \
    fi

ENV HF_HOME=/tmp/
ENV TORCH_HOME=/tmp/
ENV OMP_NUM_THREADS=4

RUN python -c 'from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline; artifacts_path = StandardPdfPipeline.download_models_hf(force=True);'

# Install Node.js 22
RUN curl -fsSL https://deb.nodesource.com/setup_22.x | bash - \
    && apt-get install -y nodejs

# Install app dependencies
COPY package*.json ./
RUN npm ci --omit=dev

# Copy app source
COPY . .

EXPOSE 3000

CMD npm start
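
For readers tracing the `CPU_ONLY` build argument, the conditional PyTorch install in this Dockerfile can be restated as the small selection function below. This is only a sketch for reasoning about the two branches; `parse_cpu_only` and `torch_install_command` are hypothetical helpers and are not part of the PR. The package lists and index URLs are copied from the Dockerfile.

```python
CPU_INDEX = "https://download.pytorch.org/whl/cpu"
CUDA_INDEX = "https://download.pytorch.org/whl/cu121"


def parse_cpu_only(raw: str) -> bool:
    """Mirror the shell test `[ "$CPU_ONLY" = "true" ]`: only the exact string matches."""
    return raw == "true"


def torch_install_command(cpu_only: bool) -> list[str]:
    """Return the pip command the Dockerfile would run for the given flag."""
    base = ["pip", "install", "--no-cache-dir"]
    if cpu_only:
        # CPU branch: CPU wheels are added as an extra index; torchaudio is not installed.
        return base + ["torch", "torchvision", "--extra-index-url", CPU_INDEX]
    # Default branch: CUDA 12.1 wheels replace the default index.
    return base + ["torch", "torchvision", "torchaudio", "--index-url", CUDA_INDEX]


if __name__ == "__main__":
    for raw in ("true", "false", "TRUE"):
        print(raw, "->", " ".join(torch_install_command(parse_cpu_only(raw))))
```

Because the comparison is an exact string match, a value such as `TRUE` or `1` falls through to the CUDA branch. The CPU branch is selected at build time with `docker build --build-arg CPU_ONLY=true`.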