The goal of this project is to create a prototype for parsing the structural sections of PDF textbooks 📖.
Metadata from the PDF's Table of Contents (TOC) is used to match headings in the full text with various strategies, such as regular expressions or the structural tags in the XML file that the PDF is converted into. Each matched heading is then assigned a hierarchy level (1-7). The output can be used for knowledge engineering (e.g., finding entities that share a section and analyzing their relationships).
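To illustrate the regex-based matching idea, here is a minimal sketch in Python. It is not the code from the notebooks: the function names `heading_pattern` and `match_toc`, the `(level, title)` input shape, and the tolerance for an optional leading section number are all assumptions made for this example.

```python
import re


def heading_pattern(title: str) -> re.Pattern:
    """Build a pattern that matches a TOC title on a line of its own.

    Hypothetical helper: tolerates an optional section number such as
    "2.1" in front of the title and flexible whitespace between words.
    """
    words = r"\s+".join(re.escape(word) for word in title.split())
    return re.compile(
        r"^[ \t]*(?:\d+(?:\.\d+)*\.?\s+)?" + words + r"[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )


def match_toc(toc, fulltext):
    """Locate each TOC entry in the full text, in reading order.

    toc: list of (level, title) tuples taken from the PDF metadata.
    Returns a list of (level, title, offset); offset is None when the
    title was not found after the previous match.
    """
    results = []
    position = 0
    for level, title in toc:
        match = heading_pattern(title).search(fulltext, position)
        if match:
            results.append((level, title, match.start()))
            position = match.end()  # later headings must come after this one
        else:
            results.append((level, title, None))
    return results


# Small made-up example, only to show the call shape.
sample_text = (
    "Preface\n"
    "1 Introduction\n"
    "Some text about Introduction here.\n"
    "1.1  Motivation\n"
    "More text.\n"
)
sample_toc = [(1, "Introduction"), (2, "Motivation"), (2, "Missing Section")]
matches = match_toc(sample_toc, sample_text)
```

Anchoring the pattern to whole lines keeps a title that is merely mentioned inside a sentence from being mistaken for a heading; searching forward from the previous match keeps the results in TOC order.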
Here, you will find:
- 📚 The code for reproducing our experiments
- 📚 The ground truth dataset consisting of textbooks full of headings
- 📚 An issue tracker for the dataset
You can find the dataset here:
- The original PDF files (see attribution) and license information
- The Ground Truth Dataset, with the following format:
  - Level, Heading, Page
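As a rough sketch of how such records could be read, the Python snippet below assumes the ground truth is stored as comma-separated text with the three columns in the order Level, Heading, Page. The delimiter, the presence of a header row, and the names `Heading` and `read_ground_truth` are assumptions for illustration; check the actual files before relying on them.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Heading:
    level: int  # hierarchy level of the heading
    title: str  # heading text
    page: int   # page on which the heading appears


def read_ground_truth(handle):
    """Parse rows of the form Level, Heading, Page from an open text file.

    Assumption: comma-separated values. Rows whose level or page is not
    an integer (for example a header row) are skipped.
    """
    headings = []
    for row in csv.reader(handle):
        if len(row) != 3:
            continue
        level, title, page = (cell.strip() for cell in row)
        if not (level.isdigit() and page.isdigit()):
            continue
        headings.append(Heading(int(level), title, int(page)))
    return headings


# Made-up sample data, only to show the call shape.
sample = 'Level,Heading,Page\n1,Introduction,3\n2,"Scope, Goals",4\n'
rows = read_ground_truth(io.StringIO(sample))
```

Using the `csv` module rather than splitting on commas keeps headings that themselves contain commas intact, provided they are quoted.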
- Make sure you have poppler-utils installed, including pdftohtml
- pdftoxml.py
- toc_processing_segmentation.ipynb
- Get the repo from here: https://github.com/ChrizH/pdfstructure
- Replicate the folder structure of this repository and place the current pdfstructure code in the folder: ./pdfstructure-master/pdfstructure/
- Fix the bug that may still be present in pdfminer (otherwise almost no PDF will be processed). Follow the instructions in ./pdfstructure-master/bugfixing_modification_in_pdfminer.txt.
- extract_structure.ipynb
- evaluate_hierarchies.ipynb (it is important to run this evaluation before evaluate_toc.ipynb)
- evaluate_toc.ipynb
- evaluate_segments.ipynb
- Please report any inconsistencies or doubtful ground truth entries as a regular repository issue.
This code was tested with Python 3.11 on Xubuntu 22.04.4 LTS x86_64.
The content of this project itself is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license, and the underlying source code is licensed under the MIT license.