HF Transformers has awesome finetuning examples https://github.com/huggingface/transformers/tree/main/examples/pytorch, that cover pretty much any modality and these examples work out of box.
But what if you wanted to re-train from scratch rather than finetune.
Here is a simple hack to accomplish that.
We will use facebook/opt-1.3b
and we will plan to use bf16 training regime as an example here:
cat << EOT > prep-bf16.py
from transformers import AutoConfig, AutoModel, AutoTokenizer
import torch
mname = "facebook/opt-1.3b"
config = AutoConfig.from_pretrained(mname)
model = AutoModel.from_config(config, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(mname)
path = "opt-1.3b-bf16"
model.save_pretrained(path)
tokenizer.save_pretrained(path)
EOT
now run:
python prep-bf16.py
This will create a folder: opt-1.3b-bf16
with everything you need to train the model from scratch. In other words you have a pretrained-like model, except it only had its initializations done and none of the training yet.
Adjust to script above to use torch.float16
or torch.float32
if that's what you plan to use instead.
Now you can proceed with finetuning this saved model as normal:
python -m torch.distributed.run \
--nproc_per_node=1 --nnode=1 --node_rank=0 \
--master_addr=127.0.0.1 --master_port=9901 \
examples/pytorch/language-modeling/run_clm.py --bf16 \
--seed 42 --model_name_or_path opt-1.3b-bf16 \
--dataset_name wikitext --dataset_config_name wikitext-103-raw-v1 \
--per_device_train_batch_size 12 --per_device_eval_batch_size 12 \
--gradient_accumulation_steps 1 --do_train --do_eval --logging_steps 10 \
--save_steps 1000 --eval_steps 100 --weight_decay 0.1 --num_train_epochs 1 \
--adam_beta1 0.9 --adam_beta2 0.95 --learning_rate 0.0002 --lr_scheduler_type \
linear --warmup_steps 500 --report_to tensorboard --output_dir save_dir
The key entry being:
--model_name_or_path opt-1.3b-bf16
where opt-1.3b-bf16
is your local directory you have just generated in the previous step.
Sometimes it's possible to find the same dataset that the original model was trained on, sometimes you have to use an alternative dataset.
The rest of the hyper-parameters can often be found in the paper or documentation that came with the model.
To summarize, this recipe allows you to use finetuning examples to re-train whatever model you can find on the hub.