CLOCR-C: Context Leveraging OCR Correction using Language Models

CLOCR-C (Clock-Er-Sea) is a post-OCR correction approach that utilizes the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. This repository contains the academic work used to demonstrate the concept.

Key Notebooks

This work is arranged by notebooks. The key notebooks are numbered as follows:

01_prompt_testing.ipynb: Tests the range of sub-prompt combinations on the three datasets across all models, using the dev set.
02_llm_comparison.ipynb: Uses the best-performing prompt to run tests across the test set.
03_downstream_ner.ipynb: Measures how much the corrected texts improve Named Entity Recognition (NER).
04_sociocultural_context.ipynb: Explores the importance of socio-cultural context in the prompt and of Task-Induced In-Context Learning.
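
As a rough illustration of what testing sub-prompt combinations can involve, the sketch below builds one full prompt for every combination of optional prompt fragments. The fragment texts, the `SUB_PROMPTS` dictionary, and the `build_prompts` helper are hypothetical and are not taken from the notebooks.

```python
from itertools import product

# Hypothetical sub-prompt fragments, invented for illustration only.
# An empty string means "leave this fragment out".
SUB_PROMPTS = {
    "task": ["Please correct the OCR errors in the text below."],
    "context": ["", "The text is from a 19th-century English newspaper."],
    "format": ["", "Return only the corrected text."],
}


def build_prompts(sub_prompts, ocr_text):
    """Return one full prompt per combination of sub-prompt options."""
    prompts = []
    for combo in product(*sub_prompts.values()):
        # Drop empty options so absent fragments leave no blank gaps.
        parts = [part for part in combo if part]
        prompts.append(" ".join(parts) + "\n\n" + ocr_text)
    return prompts


if __name__ == "__main__":
    for prompt in build_prompts(SUB_PROMPTS, "Tbe qnick brown fox."):
        print(repr(prompt))
```

Each generated prompt could then be sent to every model and scored on the dev set, which is one plausible way to find the best-performing combination.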

Dataset

This study uses three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The NCSE dataset, consisting of 91 transcribed articles with a total of 40 thousand words, is available in a separate data repository at [link to repo].

Requirements

To run the notebooks, you will need the following:

Python 3.11
Jupyter Notebook
The Python packages listed in the requirements.txt file

Models

The experiments in this study were conducted using seven language models:

GPT-4
GPT-3.5
Llama 3
Gemma
Mixtral 8x7b
Claude 3 (Opus)
Claude 3 (Haiku)

API keys for the models are required to reproduce this work.
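
The repository does not state how the keys should be supplied. One common approach is to read them from environment variables, and the sketch below checks that they are set before any notebook is run. The variable names and the `missing_api_keys` helper are assumptions, and models accessed through other providers may need additional keys.

```python
import os

# Assumed environment variable names; adjust to match your own setup.
REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]


def missing_api_keys(required=REQUIRED_KEYS, env=None):
    """Return the names of required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All assumed API keys are set.")
```

Keeping keys in the environment rather than in the notebooks avoids committing them to version control by accident.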

License

This project is licensed under the MIT License. See the LICENSE file for more information.

Abstract

The digitisation of historical print media archives is crucial for increasing accessibility to contemporary records. However, the process of Optical Character Recognition (OCR) used to convert physical records to digital text is prone to errors, particularly in the case of newspapers and periodicals due to their complex layouts. This paper introduces Context Leveraging OCR Correction (CLOCR-C), which utilises the infilling and context-adaptive abilities of transformer-based language models (LMs) to improve OCR quality. The study aims to determine whether LMs can perform post-OCR correction and improve downstream NLP tasks, and to assess the value of providing the socio-cultural context as part of the correction process. Experiments were conducted using seven LMs (GPT-4, GPT-3.5, Llama 3, Gemma, Mixtral 8x7b, and Claude 3 Opus and Haiku) on three datasets: the 19th Century Serials Edition (NCSE) and two datasets from the Overproof collection. The results demonstrate that some LMs can significantly reduce error rates, with the top-performing model achieving over a 60% reduction in character error rate on the NCSE dataset. Furthermore, the study shows that providing socio-cultural context in the prompts improves performance, while misleading prompts lower performance. The OCR improvements extend to downstream tasks, such as Named Entity Recognition, with increased Cosine Named Entity Similarity and F1 scores. The findings suggest that CLOCR-C is a promising approach for enhancing the quality of existing digital archives by leveraging the socio-cultural information embedded in the LMs and the text requiring correction.
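
For readers unfamiliar with the metric, the sketch below shows one standard way to compute character error rate (CER) and the relative reduction between an OCR text and its corrected version. It is a minimal illustration that assumes CER is defined as Levenshtein distance divided by reference length; it is not the repository's evaluation code, and the function names and example strings are invented.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def error_reduction(cer_before: float, cer_after: float) -> float:
    """Relative reduction in error rate, as a fraction of the original."""
    return (cer_before - cer_after) / cer_before


if __name__ == "__main__":
    # Made-up strings, not drawn from any of the datasets.
    truth = "the quick brown fox"
    ocr = "tbe qnick brovvn fox"
    corrected = "the quick brown fox."
    before, after = cer(truth, ocr), cer(truth, corrected)
    print(f"CER before: {before:.3f}, after: {after:.3f}")
    print(f"Relative reduction: {error_reduction(before, after):.1%}")
```

Under this definition, a reduction of over 60% means the corrected text has less than 40% of the character errors of the original OCR output.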

Citing

Citation information not yet available
