A library for open-source data processing tools to create language model training datasets

Pleias/open_data_toolkit

Open Data Toolkit

The Open Data Toolkit is a library of open-source data processing tools for creating language model training datasets. It will integrate PleIAs's existing tools into a variety of pipelines and make them more readily available to a broader range of users.

Existing Tools

  • OCRonos: an OCR correction model, which removes artifacts from digitization, drastically improving text quality
  • OCRerrcr: an OCR error detection model for estimating OCR quality and highlighting errors
  • Segmentext: a segmentation tool that restructures text. It identifies and reformats headings and reconstitutes paragraphs that were split apart during digitization.
  • Celadon: a multilingual toxicity classifier that can be used to filter out toxic and harmful content from pretraining datasets
  • Topical: a topic classification tool that can be used to curate specialized datasets
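
The tools above are meant to be chained into data pipelines. The toolkit's actual API is not documented here, so the following is only a minimal sketch of how such a chain could be composed, under stated assumptions. `Document`, `Stage`, `run_pipeline`, and the three stage functions are hypothetical names, and the stages are simple rule-based placeholders. They do not call OCRerrcr, Segmentext, or any other PleIAs model.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Document:
    """Hypothetical container for a text and its annotations."""
    text: str
    meta: dict = field(default_factory=dict)


# A stage returns the (possibly modified) document, or None to drop it.
Stage = Callable[[Document], Optional[Document]]


def score_ocr_quality(doc: Document) -> Document:
    """Placeholder for an OCR quality estimator (NOT the OCRerrcr model).

    Scores the share of non-space characters that are alphanumeric
    or common punctuation.
    """
    chars = [c for c in doc.text if not c.isspace()]
    clean = sum(c.isalnum() or c in ".,;:!?'\"()-" for c in chars)
    doc.meta["ocr_quality"] = clean / len(chars) if chars else 0.0
    return doc


def rejoin_paragraphs(doc: Document) -> Document:
    """Placeholder for a segmentation step (NOT the Segmentext tool)."""
    # Undo end-of-line hyphenation: "digi-\ntized" -> "digitized".
    text = re.sub(r"-\n(?=\w)", "", doc.text)
    # Turn single line breaks into spaces; keep blank-line paragraph breaks.
    doc.text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return doc


def make_quality_filter(min_quality: float) -> Stage:
    """Build a stage that drops documents scored below min_quality."""
    def keep(doc: Document) -> Optional[Document]:
        return doc if doc.meta.get("ocr_quality", 0.0) >= min_quality else None
    return keep


def run_pipeline(docs: Iterable[Document], stages: List[Stage]) -> Iterator[Document]:
    """Apply each stage in order; skip documents that a stage drops."""
    for doc in docs:
        current: Optional[Document] = doc
        for stage in stages:
            current = stage(current)
            if current is None:
                break
        if current is not None:
            yield current
```

In this sketch, a real model would replace a placeholder by being wrapped in a function with the same `Stage` signature. A filter threshold such as `make_quality_filter(0.8)` is an arbitrary illustration, not a recommended value.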

Under Development

  • Vision-language models for PDF segmentation for higher quality text extraction
  • Models for generating synthetic data
  • Updated OCR correction and topic classification models, which are currently being trained
