📕 HiPS: Hierarchical PDF Segmentation 📕

The goal of this project is to build a prototype for structural section parsing of PDF textbooks 📖.

PDF metadata from the Table of Contents (TOC) is used to match headings in the full text using various strategies, such as regular expressions or the structural tags in the XML file that the PDF was converted into. Each matched heading is then assigned a hierarchy level (1-7). The output can be used for knowledge engineering (e.g., finding entities that share a section and analyzing their relationships).
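The regex strategy mentioned above could look roughly like the following sketch. It is not the project's actual implementation: the `TocEntry` type, the helper names, and the sample page text are all hypothetical, and it assumes the full text is available as one string per page.

```python
import re
from typing import NamedTuple, Optional


class TocEntry(NamedTuple):
    level: int    # hierarchy level, 1 (top) to 7 (deepest)
    heading: str  # heading text as listed in the TOC
    page: int     # page on which the heading is expected


def heading_pattern(heading: str) -> re.Pattern:
    """Build a regex that tolerates whitespace and case differences."""
    words = [re.escape(word) for word in heading.split()]
    return re.compile(r"\s+".join(words), re.IGNORECASE)


def find_heading(entry: TocEntry, pages: dict[int, str]) -> Optional[int]:
    """Return the character offset of the heading on its page, or None."""
    text = pages.get(entry.page, "")
    match = heading_pattern(entry.heading).search(text)
    return match.start() if match else None


# Made-up example data: two TOC entries and the text of page 3.
toc = [TocEntry(1, "1 Introduction", 3), TocEntry(2, "1.1 Motivation", 3)]
pages = {3: "1  Introduction\nSome text.\n1.1 Motivation\nMore text."}
offsets = [find_heading(entry, pages) for entry in toc]
```

Escaping each word separately and joining with `\s+` lets a heading match even when PDF extraction has changed the spacing or inserted a line break inside it.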

Here, you will find:

📚 The code for reproducing our experiments
📚 The ground truth dataset consisting of textbooks full of headings
📚 An issue tracker for the dataset

Dataset

You can find the dataset here:

The original PDF files (see attribution) and license information
The Ground Truth Dataset, with the following format:
- Level, Heading, Page
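As an illustration of the Level, Heading, Page format, the sketch below reads such records into tuples. It assumes the ground truth is stored as comma-separated text with a header row, which this README does not state; the sample rows and the `read_ground_truth` helper are made up for the example.

```python
import csv
import io

# Made-up sample in the assumed CSV layout (not taken from the dataset).
SAMPLE = """Level,Heading,Page
1,Introduction,1
2,"Motivation, Scope",2
"""


def read_ground_truth(handle):
    """Parse rows into (level, heading, page) tuples with integer fields."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append((int(row["Level"]), row["Heading"].strip(), int(row["Page"])))
    return rows


entries = read_ground_truth(io.StringIO(SAMPLE))
```

For a real file, an opened file handle can be passed in place of the `io.StringIO` object.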

Steps for the Reproduction of the Experiments

TOC-based PageParser

Make sure you have poppler-utils installed, including pdftohtml.
Run pdftoxml.py.
Run toc_processing_segmentation.ipynb.
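Before running the steps above, it can help to verify that pdftohtml is reachable. The sketch below checks for the binary and builds a PDF-to-XML conversion command. It is an illustration only: what pdftoxml.py does internally is not described here, and the function names and the `book.pdf` path are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def pdftohtml_command(pdf_path: Path, out_prefix: Path) -> list[str]:
    """Build the command line for converting a PDF to XML."""
    # -xml: emit XML instead of HTML; -i: ignore images
    return ["pdftohtml", "-xml", "-i", str(pdf_path), str(out_prefix)]


def convert(pdf_path: Path, out_prefix: Path) -> None:
    """Run the conversion, failing early if poppler-utils is missing."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml not found; install poppler-utils first")
    subprocess.run(pdftohtml_command(pdf_path, out_prefix), check=True)


# Hypothetical input file; only the command is built here, nothing is run.
command = pdftohtml_command(Path("book.pdf"), Path("book"))
```

Calling `convert(Path("book.pdf"), Path("book"))` would then run the tool and raise an error if the conversion fails.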

Pdfstructure

Get the repo from here: https://github.com/ChrizH/pdfstructure
Replicate the folder structure from this repository, and insert the current code of pdfstructure into the folder: ./pdfstructure-master/pdfstructure/
Fix the bug that may still be present in pdfminer (otherwise almost no PDF will be processed). Follow the instructions in ./pdfstructure-master/bugfixing_modification_in_pdfminer.txt.
Run extract_structure.ipynb.

Evaluation

evaluate_hierarchies.ipynb (it is important to run this evaluation before evaluate_toc.ipynb)
evaluate_toc.ipynb
evaluate_segments.ipynb
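If you prefer to run the evaluation notebooks from the command line, the required order can be enforced with a small script such as the sketch below. It assumes Jupyter with nbconvert is installed, which this README does not list as a requirement, and the helper names are hypothetical. Only the position of evaluate_hierarchies.ipynb before evaluate_toc.ipynb is required; the rest follows the listing above.

```python
import subprocess

# evaluate_hierarchies.ipynb must come before evaluate_toc.ipynb.
NOTEBOOKS = [
    "evaluate_hierarchies.ipynb",
    "evaluate_toc.ipynb",
    "evaluate_segments.ipynb",
]


def nbconvert_commands(notebooks: list[str]) -> list[list[str]]:
    """Build one in-place 'jupyter nbconvert --execute' command per notebook."""
    return [
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", nb]
        for nb in notebooks
    ]


def run_all(notebooks: list[str]) -> None:
    """Execute the notebooks in order, stopping at the first failure."""
    for command in nbconvert_commands(notebooks):
        subprocess.run(command, check=True)


commands = nbconvert_commands(NOTEBOOKS)
```

Calling `run_all(NOTEBOOKS)` executes the three notebooks in sequence and overwrites each with its executed version.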

Issue Tracker

Please report any inconsistencies or doubtful ground truth entries as a regular repository issue.

Environment Information

This code was tested with Python 3.11 on Xubuntu 22.04.4 LTS (x86_64).

License

The content of this project itself is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license, and the underlying source code is licensed under the MIT license.
