This project demonstrates how language models can leak sensitive training data. It provides a modular implementation of an LSTM-based text generation model, complete with preprocessing, training, and generation capabilities.
leaky_model/
├── data/
│   ├── raw/                      # Raw PDF files
│   ├── processed/                # Preprocessed markdown files
│   └── tmp/                      # Temporary files during processing
├── model/                        # Trained models and processors
│   ├── text_generation_model.keras
│   └── text_processor.pkl
├── src/
│   ├── preprocessing/            # PDF and image processing
│   │   ├── pdf_processor.py
│   │   ├── image_enhancer.py
│   │   └── text_cleaner.py
│   ├── training/                 # Model training components
│   │   ├── model_builder.py
│   │   └── text_processor.py
│   ├── utils/                    # Utility functions
│   │   ├── graceful_killer.py
│   │   ├── progress_tracker.py
│   │   └── text_file_reader.py
│   ├── config.py                 # Configuration settings
│   └── prompts.txt               # Example prompts for generation
├── preprocess.py                 # PDF preprocessing script
├── train.py                      # Model training script
└── generate.py                   # Text generation script
This project uses Poetry for dependency management. To get started:
# Install Poetry (if not already installed)
curl -sSL https://install.python-poetry.org | python3 -
# Install dependencies
poetry install
# Activate the virtual environment
poetry shell
Alternatively, you can use pip with the provided requirements.txt:
pip install -r requirements.txt
- Python 3.12
- Tesseract OCR
- OpenCV
The project is divided into three main steps:
Convert PDF files to preprocessed markdown format:
python preprocess.py --input-dir data/raw --output-dir data/processed
# Additional options:
# --temp-dir data/tmp # Directory for temporary files
Features:
- Multi-threaded PDF processing
- OCR for scanned documents
- Image enhancement for better text extraction
- Progress tracking and resumable processing
- Graceful shutdown handling
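The text-cleaning step, for example, might look roughly like this (a hypothetical sketch of what text_cleaner.py could do with raw OCR output, not its actual code):

```python
import re

def clean_ocr_text(text: str) -> str:
    """Normalize raw OCR output before writing markdown."""
    # Re-join words hyphenated across line breaks: "preproc-\nessing"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces/tabs left over from column layouts
    text = re.sub(r"[ \t]+", " ", text)
    # Limit blank-line runs to a single paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```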
Train the LSTM model on preprocessed data:
python train.py --data-dir data/processed --model-dir model
# Additional options:
# --sequence-length 50 # Length of input sequences
# --embedding-dim 100 # Dimension of word embeddings
# --batch-size 128 # Training batch size
The training process:
- Processes markdown files
- Fits tokenizer to vocabulary
- Creates training sequences
- Trains LSTM model with progress tracking
- Saves model and processor files
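The sequence-creation step above can be sketched as follows (a minimal illustration, not the project's actual TextProcessor implementation; `sequence_length` mirrors the `--sequence-length` option):

```python
def make_sequences(token_ids, sequence_length):
    """Slide a window over the token stream: each window of
    `sequence_length` tokens predicts the token that follows it."""
    inputs, targets = [], []
    for i in range(len(token_ids) - sequence_length):
        inputs.append(token_ids[i : i + sequence_length])
        targets.append(token_ids[i + sequence_length])
    return inputs, targets

# Example: a toy stream of 6 token ids with window length 3
X, y = make_sequences([4, 8, 15, 16, 23, 42], 3)
# X = [[4, 8, 15], [8, 15, 16], [15, 16, 23]]
# y = [16, 23, 42]
```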
Generate text using the trained model:
python generate.py --prompts-file src/prompts.txt
# Additional options:
# --model-dir model # Directory containing model files
# --num-words 50 # Number of words to generate
# --temperature 1.0 # Sampling temperature (higher = more random)
# --top-k 0 # Top-k sampling parameter
# --top-p 0.0 # Nucleus sampling parameter
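The temperature, top-k, and nucleus (top-p) options combine roughly as follows (a hedged NumPy sketch of the standard sampling recipe, not the project's actual generate.py code):

```python
import numpy as np

def sample_next(probs, temperature=1.0, top_k=0, top_p=0.0, rng=None):
    """Pick the next token id from a probability vector.

    temperature rescales the distribution (higher = more random),
    top_k keeps only the k most likely tokens (0 disables),
    top_p keeps the smallest set of tokens whose cumulative
    probability covers p (0.0 disables).
    """
    rng = rng or np.random.default_rng()
    # Re-apply temperature in log space, then renormalize
    logits = np.log(np.asarray(probs, dtype=float) + 1e-9) / temperature
    p = np.exp(logits - logits.max())
    p /= p.sum()
    if top_k > 0:
        # Zero out everything below the k most likely tokens
        p[np.argsort(p)[:-top_k]] = 0.0
    if top_p > 0.0:
        # Nucleus: keep the smallest high-probability prefix
        order = np.argsort(p)[::-1]
        keep = np.cumsum(p[order]) <= top_p
        keep[0] = True  # always keep the most likely token
        mask = np.ones_like(p, dtype=bool)
        mask[order[keep]] = False
        p[mask] = 0.0
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```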
- PDFProcessor: Handles PDF reading and text extraction
- ImageEnhancer: Improves image quality for OCR
- TextCleaner: Normalizes and cleans extracted text
- ModelBuilder: Creates and configures the LSTM model
- TextProcessor: Handles text tokenization and sequence creation
- GracefulKiller: Manages graceful shutdown of long-running processes
- ProgressTracker: Tracks and saves processing progress
- TextFileReader: Efficient reading of text/markdown files
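GracefulKiller follows a common signal-handling pattern; a minimal sketch (an assumed shape, not the project's exact implementation) flips a flag on SIGINT/SIGTERM so long-running loops can stop cleanly:

```python
import signal

class GracefulKiller:
    """Record SIGINT/SIGTERM so a long-running loop can finish the
    current item and save progress instead of dying mid-write."""
    def __init__(self):
        self.kill_now = False
        signal.signal(signal.SIGINT, self._handle)
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        self.kill_now = True

# Typical usage in a processing loop:
# killer = GracefulKiller()
# for pdf in pdf_files:
#     if killer.kill_now:
#         break  # progress is saved, so the run can resume later
#     process(pdf)
```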
Global settings are managed in src/config.py:
- Path configurations
- Model parameters
- Processing settings
- Default values
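A config module of this kind might be structured as a frozen dataclass (a hypothetical sketch; the field names are assumptions, with defaults taken from the CLI options shown above):

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class Config:
    # Path configurations (mirroring the directory layout above)
    raw_dir: Path = Path("data/raw")
    processed_dir: Path = Path("data/processed")
    model_dir: Path = Path("model")
    # Model parameters (defaults match the train.py options)
    sequence_length: int = 50
    embedding_dim: int = 100
    batch_size: int = 128
    # Generation defaults (match the generate.py options)
    temperature: float = 1.0
    top_k: int = 0
    top_p: float = 0.0
```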
Here are some examples of model outputs that illustrate potential data leakage:
This model is designed to demonstrate how training data can be leaked through language models. It should be used responsibly and only with data you have permission to use.
Apache 2.0 License