The Open Data Toolkit is an open-source library of data processing tools for creating language model training datasets. It will integrate PleIAs's existing tools into a variety of pipelines and make them more readily available to a broader range of users.
- OCRonos: an OCR correction model, which removes artifacts from digitization, drastically improving text quality
- OCRerrcr: an OCR error detection model for estimating OCR quality and highlighting errors
- Segmentext: a segmentation tool that restructures text. It identifies and reformats headings and reconstitutes paragraphs that were split apart during digitization.
- Celadon: a multilingual toxicity classifier that can be used to filter out toxic and harmful content from pretraining datasets
- Topical: a topic classification tool that can be used to curate specialized datasets
- Vision-language models for PDF segmentation for higher quality text extraction
- Models for generating synthetic data
- Updated OCR correction and topic classification models, which are currently being trained
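
To illustrate how tools like these might be chained into a single pipeline, here is a minimal sketch. It is not the toolkit's API: `Document`, `run_pipeline`, and every stage function are hypothetical names, and the stages are toy stand-ins (a character-ratio heuristic in place of an OCR quality model such as OCRerrcr, a keyword blocklist in place of a classifier such as Celadon). The only idea it is meant to show is that each tool can be treated as a stage that either returns a document or drops it.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, Optional, Sequence


@dataclass
class Document:
    """Hypothetical container for one text and its annotations."""
    text: str
    metadata: dict = field(default_factory=dict)


# A stage returns the (possibly annotated) document, or None to drop it.
Stage = Callable[[Document], Optional[Document]]


def run_pipeline(docs: Iterable[Document], stages: Sequence[Stage]) -> Iterator[Document]:
    """Apply each stage in order; stop early when a stage drops the document."""
    for doc in docs:
        current: Optional[Document] = doc
        for stage in stages:
            current = stage(current)
            if current is None:
                break
        if current is not None:
            yield current


def estimate_ocr_quality(doc: Document) -> Document:
    """Toy stand-in for an OCR quality model: share of letters and whitespace."""
    clean = sum(ch.isalpha() or ch.isspace() for ch in doc.text)
    doc.metadata["ocr_quality"] = clean / len(doc.text) if doc.text else 0.0
    return doc


def drop_low_quality(threshold: float) -> Stage:
    """Build a stage that drops documents scoring below the threshold."""
    def stage(doc: Document) -> Optional[Document]:
        return doc if doc.metadata.get("ocr_quality", 0.0) >= threshold else None
    return stage


def drop_blocklisted(blocklist: Sequence[str]) -> Stage:
    """Toy stand-in for a toxicity classifier: drop texts containing listed words."""
    lowered = [word.lower() for word in blocklist]

    def stage(doc: Document) -> Optional[Document]:
        text = doc.text.lower()
        return None if any(word in text for word in lowered) else doc
    return stage


docs = [
    Document("A clean paragraph of text."),
    Document("@@## 1~~ %%^^ 0|0"),
    Document("A paragraph containing a badword here."),
]
stages = [estimate_ocr_quality, drop_low_quality(0.8), drop_blocklisted(["badword"])]
kept = list(run_pipeline(docs, stages))
print([doc.text for doc in kept])
```

In a real pipeline each stand-in would be replaced by a call to the corresponding model, and the thresholds would need to be tuned against the data at hand.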