Replies: 2 comments
-
#253 — asking there will get answers from more people.
-
```
  File ".../lib/python3.10/site-packages/transformers/trainer.py", line 2383, in _save_checkpoint
    os.rename(staging_output_dir, output_dir)
FileExistsError: [Errno 17] File exists: xxxx
```

Multi-GPU fine-tuning raises this error every time, but single-GPU fine-tuning does not, and checkpoints save normally.
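For context, the traceback points at the checkpoint staging step: the Trainer saves into a temporary staging directory and then renames it to the final `checkpoint-*` name. With multiple `torchrun` ranks, whichever rank renames second finds the destination already created and fails; with one GPU there is only one rename, so it never collides. A minimal sketch of that race (illustrative paths, not the actual Trainer code):

```python
import os

# Illustrative names; the real paths come from the Trainer's save logic.
staging_output_dir = "output/tmp-checkpoint-500"
output_dir = "output/checkpoint-500"

os.makedirs(staging_output_dir, exist_ok=True)
# ... checkpoint files are written into staging_output_dir here ...

# Rank A's rename succeeds; a second rank performing the same rename
# then fails with FileExistsError, as in the traceback above.
os.rename(staging_output_dir, output_dir)
```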
Here is my training script:
```bash
set -ex

PRE_SEQ_LEN=128
LR=2e-2
NUM_GPUS=4
MAX_SEQ_LEN=2048
DEV_BATCH_SIZE=1
GRAD_ACCUMULARION_STEPS=16
MAX_STEP=1000
SAVE_INTERVAL=500

DATESTR=`date +%Y%m%d-%H%M%S`
RUN_NAME=test1

BASE_MODEL_PATH=/data/resources/chatglm3_6B
DATASET_PATH=medical_prompt.json
OUTPUT_DIR=output/${RUN_NAME}-${DATESTR}-${PRE_SEQ_LEN}-${LR}

mkdir -p $OUTPUT_DIR

torchrun --standalone --nnodes=1 --nproc_per_node=$NUM_GPUS finetune.py \
    --train_format multi-turn \
    --train_file $DATASET_PATH \
    --max_seq_length $MAX_SEQ_LEN \
    --preprocessing_num_workers 1 \
    --model_name_or_path $BASE_MODEL_PATH \
    --output_dir $OUTPUT_DIR \
    --per_device_train_batch_size $DEV_BATCH_SIZE \
    --gradient_accumulation_steps $GRAD_ACCUMULARION_STEPS \
    --max_steps $MAX_STEP \
    --logging_steps 1 \
    --save_steps $SAVE_INTERVAL \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN 2>&1 | tee ${OUTPUT_DIR}/train.log
```
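This matches a staging-directory race reported against `transformers` releases around 4.36, so upgrading `transformers` is the first thing to try. If you must stay on the current version, a tolerant rename along these lines avoids crashing the rank that loses the race — a sketch only, not the library's actual fix, and the helper name is hypothetical:

```python
import os
import shutil

def rename_checkpoint_tolerant(staging_output_dir: str, output_dir: str) -> None:
    """Promote a staging checkpoint dir, tolerating a rank that lost the race.

    Sketch only: in practice this logic would have to live inside (or patch)
    Trainer._save_checkpoint; it is shown standalone for clarity.
    """
    try:
        os.rename(staging_output_dir, output_dir)
    except FileExistsError:
        # Another rank already promoted its staging dir to the final name,
        # so this rank's copy is redundant; drop it instead of crashing.
        shutil.rmtree(staging_output_dir, ignore_errors=True)
```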