SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models
This is the official implementation of NeurIPS 2024 paper "SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models".
Authors: Yu Yang, Siddhartha Mishra, Jeffrey N Chiang, Baharan Mirzasoleiman
SmallToLarge (S2L) is a scalable and efficient data selection algorithm designed to optimize the supervised fine-tuning (SFT) of large language models (LLMs) for specialized domains. The method bridges the gap between data efficiency and performance, enabling fine-tuning on significantly reduced datasets while achieving comparable or superior results to training on full datasets.
S2L leverages the training trajectories of smaller proxy models to identify and cluster data points with similar learning dynamics. By selecting representative samples from these clusters, the method ensures comprehensive coverage of the domain's knowledge while dramatically reducing computational and data requirements. This approach is grounded in rigorous theoretical analysis, demonstrating that examples with similar trajectories impact model gradients in comparable ways, thus supporting the efficacy of S2L.
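At a high level, S2L turns each training example into a small feature vector of losses recorded at successive proxy-model checkpoints, clusters these trajectories, and samples from every cluster. The sketch below is only a conceptual illustration of that selection step, not the code in this repository; the array shapes, cluster count, and function name are placeholder assumptions.

```python
# Conceptual sketch of S2L-style selection (illustrative; not this repo's implementation).
# Assumes `trajectories` is an (n_examples, n_checkpoints) array whose rows hold each
# example's loss at successive proxy-model checkpoints.
import numpy as np
import faiss  # installed below via conda (faiss-gpu)

def s2l_select(trajectories: np.ndarray, n_clusters: int, budget: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    x = np.ascontiguousarray(trajectories, dtype=np.float32)

    # Cluster the loss trajectories with k-means: examples with similar learning
    # dynamics end up in the same cluster.
    kmeans = faiss.Kmeans(d=x.shape[1], k=n_clusters, niter=20, seed=seed)
    kmeans.train(x)
    _, assignments = kmeans.index.search(x, 1)
    assignments = assignments.ravel()

    # Draw (roughly) the same number of examples from every cluster so that all
    # learning-dynamics groups stay represented in the reduced dataset.
    per_cluster = budget // n_clusters
    selected = []
    for c in range(n_clusters):
        members = np.where(assignments == c)[0]
        if len(members) == 0:
            continue
        take = min(per_cluster, len(members))
        selected.extend(rng.choice(members, size=take, replace=False).tolist())
    return np.asarray(selected)
```

Sampling across all clusters, rather than only from the largest ones, is what gives the selected subset broad coverage of the domain's knowledge while keeping it small.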
- Clone the repository.
- Follow the installation guide to create a new conda environment and install vllm. Make sure to check the CUDA version and install the corresponding version of vllm.
- Install the following required packages:
pip install accelerate wandb
conda install -c pytorch -c nvidia faiss-gpu=1.9.0
- If you have your own training script, please go ahead and use it to train the small proxy model; just save the desired number of model checkpoints during training (see the checkpointing sketch after this list).
- If you want to reproduce our experiments, run the following command with the example configuration file configs/pythia-70m-deduped_checkpoints.yml to train a Pythia-70M model on the MathInstruct dataset:
CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES nohup torchrun --nproc_per_node=$NPROC_PER_NODE --master_port=$MASTER_PORT train.py --config_file configs/pythia-70m-deduped_checkpoints.yml --wandb_key $WANDB_KEY
The training script will save the model checkpoints to the res/full_mathinstruct_pythia-70m-deduped_3epochs_512_checkpoints directory.
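If you are training the proxy model with your own script (first bullet above), the only requirement is that it writes out intermediate checkpoints that the trajectory scripts can load later. Here is a minimal checkpointing sketch using the Hugging Face Trainer; the dataset, output directory, and save interval are placeholder assumptions, not values this repo prescribes.

```python
# Minimal checkpointing sketch for a custom proxy-model training script
# (illustrative; the dataset and hyperparameters below are placeholders).
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m-deduped")

args = TrainingArguments(
    output_dir="res/my_proxy_checkpoints",  # checkpoint-* folders are written here
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_strategy="steps",                  # save intermediate checkpoints ...
    save_steps=1000,                        # ... every 1000 optimizer steps
    save_total_limit=None,                  # keep every checkpoint for trajectory collection
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=my_tokenized_dataset,     # your tokenized SFT data (placeholder)
)
trainer.train()
```

Any training framework works, as long as each saved checkpoint can later be loaded back to score the training examples.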
If you used the training script provided in this repo, you can collect the training trajectories of the small model. We provide two methods:
# Process all checkpoints found in the model directory:
python run_distributed_trajectories.py --model_path res/full_mathinstruct_pythia-70m-deduped_3epochs_512_checkpoints --config_file configs/pythia-70m-deduped_checkpoints.yml --checkpoints all
# Or specify specific checkpoints:
# Comma-separated list:
python run_distributed_trajectories.py --model_path /path/to/model --config_file config.yaml --checkpoints 1000,2000,3000,4000
# Using a range (start:end:step):
python run_distributed_trajectories.py --model_path /path/to/model --config_file config.yaml --checkpoints 1000:5000:1000
The distributed script will automatically detect available GPUs and distribute the checkpoint processing across them for faster computation.
This script requires the configuration file used for training the small model.
If you want to compute the loss for a specific checkpoint only, you can run the following command, specifying the checkpoint with --ckpt:
python get_trajectories.py --config_file configs/pythia-70m-deduped_checkpoints.yml --ckpt 1000
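Conceptually, what these scripts record for each checkpoint is the loss of every training example under that checkpoint; stacking the per-checkpoint loss vectors gives the trajectories that S2L clusters. The sketch below illustrates the idea only (it is not get_trajectories.py), and the checkpoint path and example texts are placeholders.

```python
# Conceptual sketch: per-example loss under one proxy-model checkpoint
# (illustrative only; the checkpoint path and `texts` are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "res/full_mathinstruct_pythia-70m-deduped_3epochs_512_checkpoints/checkpoint-1000"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt).to(device).eval()

texts = ["<formatted training example 1>", "<formatted training example 2>"]

losses = []
with torch.no_grad():
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
        out = model(**enc, labels=enc["input_ids"])  # mean causal-LM loss over the sequence
        losses.append(out.loss.item())

# Repeating this for every saved checkpoint and stacking the resulting loss vectors
# yields one loss trajectory per training example.
```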
To fine-tune the target model on the S2L-selected data, run, for example:
CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES nohup torchrun --nproc_per_node=$NPROC_PER_NODE --master_port=$MASTER_PORT train.py --config_file configs/s2l/full-70m_100_phi-3-mini-4k-instruct_130k_3epochs.yml --wandb_key $WANDB_KEY
If you find this work useful for your research, please consider citing our paper:
@inproceedings{yang2024smalltolarge,
  title={SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models},
  author={Yu Yang and Siddhartha Mishra and Jeffrey N Chiang and Baharan Mirzasoleiman},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
  url={https://openreview.net/forum?id=K9IGlMQpif}
}
This project is licensed under the MIT License. See the LICENSE file for details.
For any questions or suggestions, please contact Yu Yang. 🤗