From 027a43e0f50ebcf322939a49fe5c509df40170ae Mon Sep 17 00:00:00 2001
From: Christopher
Date: Mon, 3 Jun 2024 22:49:50 +0000
Subject: [PATCH 1/4] Adds simplified way to download the already tokenized tinyshakespeare dataset and gpt2 weights

---
 README.md                                     | 12 +--
 ...gpt2weights_and_tinyshakespeare_dataset.sh | 75 +++++++++++++++++++
 2 files changed, 79 insertions(+), 8 deletions(-)
 create mode 100755 scripts/download_gpt2weights_and_tinyshakespeare_dataset.sh

diff --git a/README.md b/README.md
index b6c6128dd..a72da78ea 100644
--- a/README.md
+++ b/README.md
@@ -11,28 +11,24 @@ The best introduction to the llm.c repo today is reproducing the GPT-2 (124M) mo
 If you won't be training on multiple nodes, aren't interested in mixed precision, and are interested in learning CUDA, the fp32 (legacy) files might be of interest to you. These are files that were "checkpointed" early in the history of llm.c and frozen in time. They are simpler, more portable, and possibly easier to understand. Run the 1 GPU, fp32 code like this:
 
 ```bash
-pip install -r requirements.txt
-python dev/data/tinyshakespeare.py
-python train_gpt2.py
+./download_gpt2weights_and_tinyshakespeare_dataset.sh
 make train_gpt2fp32cu
 ./train_gpt2fp32cu
 ```
 
-The above lines (1) download the [tinyshakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) dataset, tokenize it with the GPT-2 Tokenizer, (2) download and save the GPT-2 (124M) weights, (3) init from them in C/CUDA and train for one epoch on tineshakespeare with AdamW (using batch size 4, context length 1024, total of 74 steps), evaluate validation loss, and sample some text.
+The above lines (1) download an already tokenized [tinyshakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) dataset, (2) download the GPT-2 (124M) weights, (3) init from them in C/CUDA and train for one epoch on tinyshakespeare with AdamW (using batch size 4, context length 1024, total of 74 steps), evaluate validation loss, and sample some text.
 
 ## quick start (CPU)
 
 The "I am so GPU poor that I don't even have one GPU" section. You can still enjoy seeing llm.c train! But you won't go too far. Just like the fp32 version above, the CPU version is an even earlier checkpoint in the history of llm.c, back when it was just a simple reference implementation in C. For example, instead of training from scratch, you can finetune a GPT-2 small (124M) to output Shakespeare-like text, as an example:
 
 ```bash
-pip install -r requirements.txt
-python dev/data/tinyshakespeare.py
-python train_gpt2.py
+./download_gpt2weights_and_tinyshakespeare_dataset.sh
 make train_gpt2
 OMP_NUM_THREADS=8 ./train_gpt2
 ```
 
-The above lines (1) download the [tinyshakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) dataset, tokenize it with the GPT-2 Tokenizer, (2) download and save the GPT-2 (124M) weights, (3) init from them in C and train for 40 steps on tineshakespeare with AdamW (using batch size 4, context length only 64), evaluate validation loss, and sample some text. Honestly, unless you have a beefy CPU (and can crank up the number of OMP threads in the launch command), you're not going to get that far on CPU training LLMs, but it might be a good demo/reference. The output looks like this on my MacBook Pro (Apple Silicon M3 Max):
+The above lines (1) download an already tokenized [tinyshakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) dataset, (2) download the GPT-2 (124M) weights, (3) init from them in C and train for 40 steps on tinyshakespeare with AdamW (using batch size 4, context length only 64), evaluate validation loss, and sample some text. Honestly, unless you have a beefy CPU (and can crank up the number of OMP threads in the launch command), you're not going to get that far on CPU training LLMs, but it might be a good demo/reference. The output looks like this on my MacBook Pro (Apple Silicon M3 Max):
 
 ```
 [GPT-2]
diff --git a/scripts/download_gpt2weights_and_tinyshakespeare_dataset.sh b/scripts/download_gpt2weights_and_tinyshakespeare_dataset.sh
new file mode 100755
index 000000000..fa7f992c1
--- /dev/null
+++ b/scripts/download_gpt2weights_and_tinyshakespeare_dataset.sh
@@ -0,0 +1,75 @@
+#!/bin/bash
+
+# Get the directory of the script
+SCRIPT_DIR=$(dirname "$(realpath "$0")")
+
+# Base URL
+BASE_URL="https://huggingface.co/datasets/chrisdryden/llmcDatasets/resolve/main/"
+
+# Directory paths based on script location
+SAVE_DIR_PARENT="$SCRIPT_DIR/.."
+SAVE_DIR_TINY="$SCRIPT_DIR/../dev/data/tinyshakespeare"
+
+# Create the directories if they don't exist
+mkdir -p "$SAVE_DIR_TINY"
+
+# Files to download
+FILES=(
+    "gpt2_124M.bin"
+    "gpt2_124M_bf16.bin"
+    "gpt2_124M_debug_state.bin"
+    "gpt2_tokenizer.bin"
+    "tiny_shakespeare_train.bin"
+    "tiny_shakespeare_val.bin"
+)
+
+# Function to download files to the appropriate directory
+download_file() {
+    local FILE_NAME=$1
+    local FILE_URL="${BASE_URL}${FILE_NAME}?download=true"
+    local FILE_PATH
+
+    # Determine the save directory based on the file name
+    if [[ "$FILE_NAME" == tiny_shakespeare* ]]; then
+        FILE_PATH="${SAVE_DIR_TINY}/${FILE_NAME}"
+    else
+        FILE_PATH="${SAVE_DIR_PARENT}/${FILE_NAME}"
+    fi
+
+    # Download the file
+    curl -s -L -o "$FILE_PATH" "$FILE_URL"
+    echo "Downloaded $FILE_NAME to $FILE_PATH"
+}
+
+# Export the function so it's available in subshells
+export -f download_file
+
+# Generate download commands
+download_commands=()
+for FILE in "${FILES[@]}"; do
+    download_commands+=("download_file \"$FILE\"")
+done
+
+# Function to manage parallel jobs in increments of a given size
+run_in_parallel() {
+    local batch_size=$1
+    shift
+    local i=0
+    local command
+
+    for command; do
+        eval "$command" &
+        ((i = (i + 1) % batch_size))
+        if [ "$i" -eq 0 ]; then
+            wait
+        fi
+    done
+
+    # Wait for any remaining jobs to finish
+    wait
+}
+
+# Run the download commands in parallel in batches of 2
+run_in_parallel 2 "${download_commands[@]}"
+
+echo "All files downloaded and saved in their respective directories"
\ No newline at end of file

From 3936492620dda5402da6ae4ffd52d3a152e8fe14 Mon Sep 17 00:00:00 2001
From: Christopher
Date: Mon, 3 Jun 2024 22:51:57 +0000
Subject: [PATCH 2/4] Forgot the folder name in the readme

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index a72da78ea..5daa4359d 100644
--- a/README.md
+++ b/README.md
@@ -11,7 +11,7 @@ The best introduction to the llm.c repo today is reproducing the GPT-2 (124M) mo
 If you won't be training on multiple nodes, aren't interested in mixed precision, and are interested in learning CUDA, the fp32 (legacy) files might be of interest to you. These are files that were "checkpointed" early in the history of llm.c and frozen in time. They are simpler, more portable, and possibly easier to understand. Run the 1 GPU, fp32 code like this:
 
 ```bash
-./download_gpt2weights_and_tinyshakespeare_dataset.sh
+./scripts/download_gpt2weights_and_tinyshakespeare_dataset.sh
 make train_gpt2fp32cu
 ./train_gpt2fp32cu
 ```
@@ -23,7 +23,7 @@ The above lines (1) downloads a tokenized [tinyshakespeare](https://raw.githubus
 ## quick start (CPU)
 
 The "I am so GPU poor that I don't even have one GPU" section. You can still enjoy seeing llm.c train! But you won't go too far. Just like the fp32 version above, the CPU version is an even earlier checkpoint in the history of llm.c, back when it was just a simple reference implementation in C. For example, instead of training from scratch, you can finetune a GPT-2 small (124M) to output Shakespeare-like text, as an example:
 
 ```bash
-./download_gpt2weights_and_tinyshakespeare_dataset.sh
+./scripts/download_gpt2weights_and_tinyshakespeare_dataset.sh
 make train_gpt2
 OMP_NUM_THREADS=8 ./train_gpt2
 ```

From ec3921047c04b17db3b25689852e6214bf37584b Mon Sep 17 00:00:00 2001
From: Christopher
Date: Mon, 3 Jun 2024 22:54:25 +0000
Subject: [PATCH 3/4] updated the script to run the same number of threads as files

---
 scripts/download_gpt2weights_and_tinyshakespeare_dataset.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/download_gpt2weights_and_tinyshakespeare_dataset.sh b/scripts/download_gpt2weights_and_tinyshakespeare_dataset.sh
index fa7f992c1..91850287c 100755
--- a/scripts/download_gpt2weights_and_tinyshakespeare_dataset.sh
+++ b/scripts/download_gpt2weights_and_tinyshakespeare_dataset.sh
@@ -70,6 +70,6 @@ run_in_parallel() {
 }
 
 # Run the download commands in parallel in batches of 2
-run_in_parallel 2 "${download_commands[@]}"
+run_in_parallel 6 "${download_commands[@]}"
 
 echo "All files downloaded and saved in their respective directories"
\ No newline at end of file

From b29478e20b08efc32ec14fbd603a8980e9f9c421 Mon Sep 17 00:00:00 2001
From: Andrej Karpathy
Date: Wed, 19 Jun 2024 01:44:36 +0000
Subject: [PATCH 4/4] add starter pack .sh script for faster quickstart

---
 README.md                    | 16 +++++++++++++---
 .../download_starter_pack.sh |  2 +-
 2 files changed, 14 insertions(+), 4 deletions(-)
 rename scripts/download_gpt2weights_and_tinyshakespeare_dataset.sh => dev/download_starter_pack.sh (96%)

diff --git a/README.md b/README.md
index b1ad92f94..fa5dfcf7d 100644
--- a/README.md
+++ b/README.md
@@ -13,23 +13,33 @@ debugging tip: when you run the `make` command to build the binary, modify it by
 If you won't be training on multiple nodes, aren't interested in mixed precision, and are interested in learning CUDA, the fp32 (legacy) files might be of interest to you. These are files that were "checkpointed" early in the history of llm.c and frozen in time. They are simpler, more portable, and possibly easier to understand. Run the 1 GPU, fp32 code like this:
 
 ```bash
-./scripts/download_gpt2weights_and_tinyshakespeare_dataset.sh
+chmod u+x ./dev/download_starter_pack.sh
+./dev/download_starter_pack.sh
 make train_gpt2fp32cu
 ./train_gpt2fp32cu
 ```
 
-The above lines (1) download an already tokenized [tinyshakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) dataset, (2) download the GPT-2 (124M) weights, (3) init from them in C/CUDA and train for one epoch on tinyshakespeare with AdamW (using batch size 4, context length 1024, total of 74 steps), evaluate validation loss, and sample some text.
+The download_starter_pack.sh script is a quick & easy way to get started: it downloads a bunch of .bin files that help get you off the ground. These contain: 1) the GPT-2 124M model saved in fp32 and in bfloat16, 2) a "debug state" used in unit testing (a small batch of data, and target activations and gradients), 3) the GPT-2 tokenizer, and 4) the tokenized [tinyshakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) dataset. Alternatively, instead of running the .sh script, you can re-create these artifacts manually as follows:
+
+```bash
+pip install -r requirements.txt
+python dev/data/tinyshakespeare.py
+python train_gpt2.py
+```
 
 ## quick start (CPU)
 
 The "I am so GPU poor that I don't even have one GPU" section. You can still enjoy seeing llm.c train! But you won't go too far. Just like the fp32 version above, the CPU version is an even earlier checkpoint in the history of llm.c, back when it was just a simple reference implementation in C. For example, instead of training from scratch, you can finetune a GPT-2 small (124M) to output Shakespeare-like text, as an example:
 
 ```bash
-./scripts/download_gpt2weights_and_tinyshakespeare_dataset.sh
+chmod u+x ./dev/download_starter_pack.sh
+./dev/download_starter_pack.sh
 make train_gpt2
 OMP_NUM_THREADS=8 ./train_gpt2
 ```
 
+If you'd prefer to avoid running the starter pack script then, as mentioned in the previous section, you can reproduce the exact same .bin files and artifacts by running `python dev/data/tinyshakespeare.py` and then `python train_gpt2.py`.
+
 The above lines (1) download an already tokenized [tinyshakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) dataset, (2) download the GPT-2 (124M) weights, (3) init from them in C and train for 40 steps on tinyshakespeare with AdamW (using batch size 4, context length only 64), evaluate validation loss, and sample some text. Honestly, unless you have a beefy CPU (and can crank up the number of OMP threads in the launch command), you're not going to get that far on CPU training LLMs, but it might be a good demo/reference. The output looks like this on my MacBook Pro (Apple Silicon M3 Max):
 
 ```

diff --git a/scripts/download_gpt2weights_and_tinyshakespeare_dataset.sh b/dev/download_starter_pack.sh
similarity index 96%
rename from scripts/download_gpt2weights_and_tinyshakespeare_dataset.sh
rename to dev/download_starter_pack.sh
index 91850287c..4034a1c81 100755
--- a/scripts/download_gpt2weights_and_tinyshakespeare_dataset.sh
+++ b/dev/download_starter_pack.sh
@@ -8,7 +8,7 @@ BASE_URL="https://huggingface.co/datasets/chrisdryden/llmcDatasets/resolve/main/
 
 # Directory paths based on script location
 SAVE_DIR_PARENT="$SCRIPT_DIR/.."
-SAVE_DIR_TINY="$SCRIPT_DIR/../dev/data/tinyshakespeare"
+SAVE_DIR_TINY="$SCRIPT_DIR/data/tinyshakespeare"
 
 # Create the directories if they don't exist
 mkdir -p "$SAVE_DIR_TINY"
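
For completeness, below is a minimal sketch (not part of the patches above) of a post-download sanity check, run from the repo root, assuming the file layout produced by the final patch: the model, tokenizer, and debug-state .bin files land in the repo root, and the tokenized tinyshakespeare shards land in dev/data/tinyshakespeare. The file names come from the FILES array in the script; the check itself is illustrative only.

```bash
#!/bin/bash
# Sketch: confirm the starter pack files landed where the scripts above place them
# (repo root for the model/tokenizer/debug-state .bin files,
#  dev/data/tinyshakespeare for the dataset shards).
ROOT_FILES=(gpt2_124M.bin gpt2_124M_bf16.bin gpt2_124M_debug_state.bin gpt2_tokenizer.bin)
DATA_FILES=(dev/data/tinyshakespeare/tiny_shakespeare_train.bin dev/data/tinyshakespeare/tiny_shakespeare_val.bin)

missing=0
for f in "${ROOT_FILES[@]}" "${DATA_FILES[@]}"; do
    # -s is true only if the file exists and is non-empty
    if [ ! -s "$f" ]; then
        echo "missing or empty: $f"
        missing=1
    fi
done

if [ "$missing" -eq 0 ]; then
    echo "starter pack looks complete"
fi
exit "$missing"
```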