Open Data Toolkit

The Open Data Toolkit is a library of open-source data processing tools for creating language model training datasets. It will integrate PleIAs's existing tools into a variety of pipelines and make them more readily available to a broader range of users.

Existing Tools

OCRonos: an OCR correction model, which removes artifacts from digitization, drastically improving text quality
OCRerrcr: an OCR error detection model for estimating OCR quality and highlighting errors
Segmentext: a segmentation tool that restructures text. It identifies and reformats headings and reconstitutes paragraphs that were split apart during digitization.
Celadon: a multilingual toxicity classifier that can be used to filter out toxic and harmful content from pretraining datasets
Topical: a topic classification tool that can be used to curate specialized datasets
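
To illustrate how tools like these might be chained in a dataset pipeline, here is a minimal sketch in Python. It does not use the toolkit's actual API, which this README does not describe. Every name in it (Document, run_pipeline, and the toy_* stages) is a hypothetical stand-in, and the toy stages are simple heuristics standing in for the real models. The only idea it shows is the general pattern: each stage receives a document and either returns it, possibly modified, or drops it.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, Optional, Sequence


@dataclass
class Document:
    """Hypothetical container for one text and its annotations."""
    text: str
    meta: dict = field(default_factory=dict)


# A stage returns the (possibly modified) document, or None to drop it.
Stage = Callable[[Document], Optional[Document]]


def run_pipeline(docs: Iterable[Document], stages: Sequence[Stage]) -> Iterator[Document]:
    """Apply each stage in order; stop early if a stage drops the document."""
    for doc in docs:
        current: Optional[Document] = doc
        for stage in stages:
            current = stage(current)
            if current is None:
                break
        if current is not None:
            yield current


def toy_ocr_quality(doc: Document) -> Document:
    """Stand-in for an OCR quality estimator: share of letters and whitespace."""
    clean = sum(c.isalpha() or c.isspace() for c in doc.text)
    doc.meta["ocr_quality"] = clean / max(len(doc.text), 1)
    return doc


def toy_rejoin_lines(doc: Document) -> Document:
    """Stand-in for segmentation: rejoin words hyphenated across line breaks."""
    doc.text = doc.text.replace("-\n", "").replace("\n", " ")
    return doc


def toy_toxicity_filter(doc: Document) -> Optional[Document]:
    """Stand-in for a toxicity classifier: drop documents with blocklisted words."""
    blocklist = {"badword"}  # placeholder list, for illustration only
    if any(word in doc.text.lower() for word in blocklist):
        return None
    return doc


if __name__ == "__main__":
    docs = [Document("Hello wor-\nld"), Document("some BADWORD text")]
    for kept in run_pipeline(docs, [toy_ocr_quality, toy_rejoin_lines, toy_toxicity_filter]):
        print(kept.text, kept.meta)
```

Letting a stage return None keeps filtering (toxicity, topic selection) and transformation (correction, segmentation) behind one interface, so stages can be reordered or swapped without changing the runner.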

Under Development

Vision-language models for PDF segmentation for higher quality text extraction
Models for generating synthetic data
Updated OCR correction and topic classification models, which are currently being trained
