Merge branch 'ChrisDryden-script_to_download_tokenized_dataset'
karpathy committed Jun 19, 2024
2 parents 483f675 + b29478e commit 3bce68b
Showing 2 changed files with 89 additions and 8 deletions.
22 changes: 14 additions & 8 deletions README.md
@@ -13,28 +13,34 @@ debugging tip: when you run the `make` command to build the binary, modify it by
If you won't be training on multiple nodes, aren't interested in mixed precision, and are interested in learning CUDA, the fp32 (legacy) files might be of interest to you. These are files that were "checkpointed" early in the history of llm.c and frozen in time. They are simpler, more portable, and possibly easier to understand. Run the 1 GPU, fp32 code like this:

```bash
pip install -r requirements.txt
python dev/data/tinyshakespeare.py
python train_gpt2.py
chmod u+x ./dev/download_starter_pack.sh
./dev/download_starter_pack.sh
make train_gpt2fp32cu
./train_gpt2fp32cu
```

The above lines (1) download the [tinyshakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) dataset and tokenize it with the GPT-2 Tokenizer, (2) download and save the GPT-2 (124M) weights, (3) init from them in C/CUDA and train for one epoch on tinyshakespeare with AdamW (using batch size 4, context length 1024, a total of 74 steps), evaluate validation loss, and sample some text.
The download_starter_pack.sh script is a quick & easy way to get started and it downloads a bunch of .bin files that help get you off the ground. These contain: 1) the GPT-2 124M model, saved in fp32 and in bfloat16, 2) a "debug state" used in unit testing (a small batch of data, and target activations and gradients), 3) the GPT-2 tokenizer, and 4) the tokenized [tinyshakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) dataset. Alternatively, instead of running the .sh script, you can re-create these artifacts manually as follows:

```bash
pip install -r requirements.txt
python dev/data/tinyshakespeare.py
python train_gpt2.py
```
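
As a small optional sanity check (not part of the repo; the file names and locations mirror the `dev/download_starter_pack.sh` script shown further down), a loop like this, run from the repo root, reports which of the starter-pack artifacts are present:

```bash
# check that the starter-pack artifacts exist (paths relative to the repo root)
for f in gpt2_124M.bin gpt2_124M_bf16.bin gpt2_124M_debug_state.bin gpt2_tokenizer.bin \
         dev/data/tinyshakespeare/tiny_shakespeare_train.bin \
         dev/data/tinyshakespeare/tiny_shakespeare_val.bin; do
  if [ -f "$f" ]; then echo "ok      $f"; else echo "MISSING $f"; fi
done
```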

## quick start (CPU)

The "I am so GPU poor that I don't even have one GPU" section. You can still enjoy seeing llm.c train! But you won't go too far. Just like the fp32 version above, the CPU version is an even earlier checkpoint in the history of llm.c, back when it was just a simple reference implementation in C. For example, instead of training from scratch, you can finetune a GPT-2 small (124M) to output Shakespeare-like text, as an example:

```bash
pip install -r requirements.txt
python dev/data/tinyshakespeare.py
python train_gpt2.py
chmod u+x ./dev/download_starter_pack.sh
./dev/download_starter_pack.sh
make train_gpt2
OMP_NUM_THREADS=8 ./train_gpt2
```

The above lines (1) download the [tinyshakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) dataset and tokenize it with the GPT-2 Tokenizer, (2) download and save the GPT-2 (124M) weights, (3) init from them in C and train for 40 steps on tinyshakespeare with AdamW (using batch size 4, context length only 64), evaluate validation loss, and sample some text. Honestly, unless you have a beefy CPU (and can crank up the number of OMP threads in the launch command), you're not going to get that far on CPU training LLMs, but it might be a good demo/reference. The output looks like this on my MacBook Pro (Apple Silicon M3 Max):
If you'd prefer to avoid running the starter pack script, then as mentioned in the previous section you can reproduce the exact same .bin files and artifacts by running `python dev/data/tinyshakespeare.py` and then `python train_gpt2.py`.

The above lines (1) download an already tokenized [tinyshakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt) dataset and download the GPT-2 (124M) weights, (2) init from them in C and train for 40 steps on tinyshakespeare with AdamW (using batch size 4, context length only 64), evaluate validation loss, and sample some text. Honestly, unless you have a beefy CPU (and can crank up the number of OMP threads in the launch command; see the short note after the sample output below), you're not going to get that far on CPU training LLMs, but it might be a good demo/reference. The output looks like this on my MacBook Pro (Apple Silicon M3 Max):

```
[GPT-2]
...
```
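
A quick note on the OMP thread count mentioned above: a minimal variation of the launch command (assuming a Linux machine, where `nproc` reports the number of available cores; the `8` used earlier is just an example value) is to let the core count pick the number of threads:

```bash
# use one OpenMP thread per available core instead of a hard-coded 8
OMP_NUM_THREADS=$(nproc) ./train_gpt2
```

On macOS, `sysctl -n hw.physicalcpu` plays the same role as `nproc`.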
75 changes: 75 additions & 0 deletions dev/download_starter_pack.sh
@@ -0,0 +1,75 @@
#!/bin/bash

# Get the directory of the script
SCRIPT_DIR=$(dirname "$(realpath "$0")")

# Base URL
BASE_URL="https://huggingface.co/datasets/chrisdryden/llmcDatasets/resolve/main/"

# Directory paths based on script location
SAVE_DIR_PARENT="$SCRIPT_DIR/.."
SAVE_DIR_TINY="$SCRIPT_DIR/data/tinyshakespeare"

# Create the directories if they don't exist
mkdir -p "$SAVE_DIR_TINY"

# Files to download
FILES=(
    "gpt2_124M.bin"
    "gpt2_124M_bf16.bin"
    "gpt2_124M_debug_state.bin"
    "gpt2_tokenizer.bin"
    "tiny_shakespeare_train.bin"
    "tiny_shakespeare_val.bin"
)

# Function to download files to the appropriate directory
download_file() {
    local FILE_NAME=$1
    local FILE_URL="${BASE_URL}${FILE_NAME}?download=true"
    local FILE_PATH

    # Determine the save directory based on the file name
    if [[ "$FILE_NAME" == tiny_shakespeare* ]]; then
        FILE_PATH="${SAVE_DIR_TINY}/${FILE_NAME}"
    else
        FILE_PATH="${SAVE_DIR_PARENT}/${FILE_NAME}"
    fi

    # Download the file
    curl -s -L -o "$FILE_PATH" "$FILE_URL"
    echo "Downloaded $FILE_NAME to $FILE_PATH"
}

# Export the function so it's available in subshells
export -f download_file

# Generate download commands
download_commands=()
for FILE in "${FILES[@]}"; do
    download_commands+=("download_file \"$FILE\"")
done

# Function to manage parallel jobs in increments of a given size
run_in_parallel() {
    local batch_size=$1
    shift
    local i=0
    local command

    for command; do
        eval "$command" &
        ((i = (i + 1) % batch_size))
        # After every batch_size background jobs, wait for the batch to finish
        if [ "$i" -eq 0 ]; then
            wait
        fi
    done

    # Wait for any remaining jobs to finish
    wait
}

# Run the download commands in parallel in batches of 6
run_in_parallel 6 "${download_commands[@]}"

echo "All files downloaded and saved in their respective directories"
