Hunyuan LoRA fine-tuning NCCL timeout #162
Comments
Can you check whether inference runs correctly? If it does, I think the NCCL part is OK.
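Inference runs correctly with bash scripts/inference/inference_hunyuan_hf_quantization.sh. Is something wrong somewhere? Thank you for your answer.
It may be that FSDP full sharding increases the communication volume. If I change the FSDP sharding strategy, I run out of GPU memory (OOM). How long does Hunyuan LoRA training take for you?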
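As a rough check on the communication-volume hypothesis, the size of the all-gather that timed out can be estimated from the element counts in the rank 0 timeout line of the log (NumelIn=180357152, NumelOut=360714304). This is a minimal sketch: the dtype of the gathered buffer is not stated in the log, so both bf16 and fp32 are shown as assumptions, and `payload_gib` is a hypothetical helper name.

```python
# Element counts copied from the rank 0 timeout line (OpType=_ALLGATHER_BASE).
NUMEL_IN = 180_357_152
NUMEL_OUT = 360_714_304
WORLD_SIZE = 2  # from --nproc_per_node 2

# Assumption: the log does not state the dtype of the gathered buffer,
# so both candidates from the run (bf16 compute, fp32 master weights) are tried.
BYTES_PER_ELEMENT = {"bf16": 2, "fp32": 4}


def payload_gib(numel: int, bytes_per_element: int) -> float:
    """Size in GiB of a buffer holding `numel` elements."""
    return numel * bytes_per_element / 2**30


# An all-gather over 2 ranks should produce exactly 2x the per-rank input.
assert NUMEL_OUT == NUMEL_IN * WORLD_SIZE

for name, size in BYTES_PER_ELEMENT.items():
    print(f"{name}: gathered buffer = {payload_gib(NUMEL_OUT, size):.2f} GiB")
```

Under either assumption the gathered buffer is below 1.5 GiB, so the payload size by itself may not account for a 600-second timeout. It is worth also checking whether both ranks reached the same collective.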
We use A100 80GB GPUs for LoRA fine-tuning. In my experience, the first wandb log appears quickly. @BrianChen1129 Have you tested the code on A6000 or other GPUs with 48GB?
I haven't tried it on A6000. For 720P input, we fine-tuned for roughly 24 hours.
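I am a beginner, and I ran into something that confuses me. When I preprocess the data with bash scripts/preprocess/preprocess_hunyuan_data.sh, the model I use can only generate vae_latents, not text_embeddings. I traced this to FastVideo/fastvideo/models/hunyuan/text_encoder/init.py, line 130. So I used the FastHunyuan model to generate the text_encoder outputs instead. I don't know whether the two models produce the same text_encoder output for the same text, or whether this affects the shape of my training data. Thank you for your help.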
Environment
torch 2.5.0+cu121
fastvideo 1.2.0 /storage/ytm/FastVideo
python 3.10.0
Describe the bug
(fastvideo) root@ubuntu-server:/storage/ytm/FastVideo# torchrun --nnodes 1 --nproc_per_node 2 --master_port 29903 fastvideo/train.py --seed 1024 --pretrained_model_name_or_path /storage/ytm/FastVideo/data/FastHunyuan-diffusers --model_type hunyuan_hf --cache_dir data/.cache --data_json_path /storage/ytm/FastVideo/data/Image-Vid-Finetune-HunYuan/videos2caption.json --validation_prompt_dir data/Black-Myth-Wukong/validation --gradient_checkpointing --train_batch_size 1 --num_latent_t 32 --sp_size 2 --train_sp_batch_size 1 --dataloader_num_workers 4 --gradient_accumulation_steps 4 --max_train_steps 1000 --learning_rate 8e-5 --mixed_precision bf16 --checkpointing_steps 500 --validation_steps 100 --validation_sampling_steps 50 --checkpoints_total_limit 3 --allow_tf32 --ema_start_step 0 --cfg 0.0 --ema_decay 0.999 --log_validation --output_dir data/outputs/Hunyuan-lora-finetuning-Black-Myth-Wukong --tracker_project_name Hunyuan-lora-finetuning-Black-Myth-Wukong --num_frames 125 --validation_guidance_scale "1.0" --shift 7 --use_lora --lora_rank 32 --lora_alpha 32
W0121 09:42:04.934275 1764804 site-packages/torch/distributed/run.py:793]
W0121 09:42:04.934275 1764804 site-packages/torch/distributed/run.py:793] *****************************************
W0121 09:42:04.934275 1764804 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0121 09:42:04.934275 1764804 site-packages/torch/distributed/run.py:793] *****************************************
[W121 09:42:10.077291564 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
--> loading model from /storage/ytm/FastVideo/data/FastHunyuan-diffusers
[W121 09:42:10.098619396 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
Total training parameters = 40.894464 M
--> Initializing FSDP with sharding strategy: full
--> applying fdsp activation checkpointing...
--> model loaded
--> applying fdsp activation checkpointing...
optimizer: AdamW (
Parameter Group 0
amsgrad: False
betas: (0.9, 0.999)
capturable: False
differentiable: False
eps: 1e-08
foreach: None
fused: None
lr: 8e-05
maximize: False
weight_decay: 0.01
)
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.18.5
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
***** Running training *****
Num examples = 84
Dataloader size = 42
Num Epochs = 48
Resume training from step 0
Instantaneous batch size per device = 1
Total train batch size (w. data & sequence parallel, accumulation) = 4.0
Gradient Accumulation steps = 4
Total optimization steps = 1000
Total training parameters per FSDP shard = 0.020447232 B
Master weight dtype: torch.float32
Steps: 0%| | 0/1000 [00:00<?, ?it/s]
/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/autograd/graph.py:825: UserWarning: cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at ../aten/src/ATen/native/cudnn/MHA.cpp:674.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/autograd/graph.py:825: UserWarning: cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at ../aten/src/ATen/native/cudnn/MHA.cpp:674.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]:[E121 09:55:38.949066081 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4142, OpType=_ALLGATHER_BASE, NumelIn=180357152, NumelOut=360714304, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
[rank0]:[E121 09:55:38.949544360 ProcessGroupNCCL.cpp:679] [Rank 0] Work WorkNCCL(SeqNum=4142, OpType=_ALLGATHER_BASE, NumelIn=180357152, NumelOut=360714304, Timeout(ms)=600000) timed out in blocking wait (TORCH_NCCL_BLOCKING_WAIT=1).
[rank1]:[E121 09:55:38.957012505 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1687, OpType=ALLTOALL, NumelIn=2442240, NumelOut=2442240, Timeout(ms)=600000) ran for 600007 milliseconds before timing out.
[rank1]:[E121 09:55:38.957737010 ProcessGroupNCCL.cpp:679] [Rank 1] Work WorkNCCL(SeqNum=1687, OpType=ALLTOALL, NumelIn=2442240, NumelOut=2442240, Timeout(ms)=600000) timed out in blocking wait (TORCH_NCCL_BLOCKING_WAIT=1).
[rank1]:[E121 09:55:38.624584541 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E121 09:55:38.624604711 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E121 09:55:38.624732603 ProcessGroupNCCL.cpp:542] [Rank 1] Collective WorkNCCL(SeqNum=1687, OpType=ALLTOALL, NumelIn=2442240, NumelOut=2442240, Timeout(ms)=600000) raised the following async exception: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2027 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f4e2b948446 in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7f4e2cc8f290 in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7f4e2cc8f4dc in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7f4e2cc96ea3 in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f4e2cc9892d in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x145c0 (0x7f4e75c745c0 in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #6: <unknown function> + 0x8609 (0x7f4e7820e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x43 (0x7f4e78133353 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[E121 09:55:38.630887654 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1687, last enqueued NCCL work: 1687, last completed NCCL work: 1686.
[rank1]:[E121 09:55:38.630907984 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 1] Timeout at NCCL work: 1687, last enqueued NCCL work: 1687, last completed NCCL work: 1686.
[rank1]:[E121 09:55:38.630927810 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]: Traceback (most recent call last):
[rank1]: File "/storage/ytm/FastVideo/fastvideo/train.py", line 759, in <module>
[rank1]: main(args)
[rank1]: File "/storage/ytm/FastVideo/fastvideo/train.py", line 410, in main
[rank1]: loss, grad_norm = train_one_step(
[rank1]: File "/storage/ytm/FastVideo/fastvideo/train.py", line 122, in train_one_step
[rank1]: ) = next(loader)
[rank1]: File "/storage/ytm/FastVideo/fastvideo/utils/communications.py", line 317, in sp_parallel_dataloader_wrapper
[rank1]: latents, cond, attn_mask, cond_mask = prepare_sequence_parallel_data(
[rank1]: File "/storage/ytm/FastVideo/fastvideo/utils/communications.py", line 294, in prepare_sequence_parallel_data
[rank1]: ) = prepare(
[rank1]: File "/storage/ytm/FastVideo/fastvideo/utils/communications.py", line 268, in prepare
[rank1]: hidden_states = all_to_all(hidden_states, scatter_dim=2, gather_dim=0)
[rank1]: File "/storage/ytm/FastVideo/fastvideo/utils/communications.py", line 201, in all_to_all
[rank1]: return AllToAll.apply(input, nccl_info.group, scatter_dim, gather_dim)
[rank1]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
[rank1]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank1]: File "/storage/ytm/FastVideo/fastvideo/utils/communications.py", line 175, in forward
[rank1]: output = all_to_all(input, ctx.world_size, process_group,
[rank1]: File "/storage/ytm/FastVideo/fastvideo/utils/communications.py", line 155, in _all_to_all
[rank1]: dist.all_to_all(output_list, input_list, group=group)
[rank1]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4119, in all_to_all
[rank1]: work.wait()
[rank1]: torch.distributed.DistBackendError: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1687, OpType=ALLTOALL, NumelIn=2442240, NumelOut=2442240, Timeout(ms)=600000) ran for 600007 milliseconds before timing out.
[rank0]:[E121 09:55:39.925951012 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E121 09:55:39.925967416 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E121 09:55:39.926106476 ProcessGroupNCCL.cpp:542] [Rank 0] Collective WorkNCCL(SeqNum=4142, OpType=_ALLGATHER_BASE, NumelIn=180357152, NumelOut=360714304, Timeout(ms)=600000) raised the following async exception: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2027 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f43ee4de446 in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7f43ef825290 in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7f43ef8254dc in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7f43ef82cea3 in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f43ef82e92d in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x145c0 (0x7f443880a5c0 in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #6: <unknown function> + 0x8609 (0x7f443ada4609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x43 (0x7f443acc9353 in /lib/x86_64-linux-gnu/libc.so.6)
[rank0]:[E121 09:55:39.932326608 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 4142, last enqueued NCCL work: 4142, last completed NCCL work: 4141.
[rank0]:[E121 09:55:39.932348580 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 0] Timeout at NCCL work: 4142, last enqueued NCCL work: 4142, last completed NCCL work: 4141.
[rank0]:[E121 09:55:39.932365400 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
Traceback (most recent call last):
File "/storage/ytm/FastVideo/fastvideo/train.py", line 759, in <module>
main(args)
File "/storage/ytm/FastVideo/fastvideo/train.py", line 410, in main
loss, grad_norm = train_one_step(
File "/storage/ytm/FastVideo/fastvideo/train.py", line 161, in train_one_step
model_pred = transformer(**input_kwargs)[0]
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 850, in forward
args, kwargs = _pre_forward(
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 382, in _pre_forward
unshard_fn(state, handle)
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 417, in _pre_forward_unshard
_unshard(state, handle, state._unshard_stream, state._pre_unshard_stream)
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 301, in _unshard
handle.unshard()
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1312, in unshard
padded_unsharded_flat_param = self._all_gather_flat_param(unsharded_flat_param)
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1403, in _all_gather_flat_param
dist.all_gather_into_tensor(
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3440, in all_gather_into_tensor
work.wait()
torch.distributed.DistBackendError: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4142, OpType=_ALLGATHER_BASE, NumelIn=180357152, NumelOut=360714304, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
[rank0]: Traceback (most recent call last):
[rank0]: File "/storage/ytm/FastVideo/fastvideo/train.py", line 759, in <module>
[rank0]: main(args)
[rank0]: File "/storage/ytm/FastVideo/fastvideo/train.py", line 410, in main
[rank0]: loss, grad_norm = train_one_step(
[rank0]: File "/storage/ytm/FastVideo/fastvideo/train.py", line 161, in train_one_step
[rank0]: model_pred = transformer(**input_kwargs)[0]
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 850, in forward
[rank0]: args, kwargs = _pre_forward(
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 382, in _pre_forward
[rank0]: unshard_fn(state, handle)
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 417, in _pre_forward_unshard
[rank0]: _unshard(state, handle, state._unshard_stream, state._pre_unshard_stream)
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 301, in _unshard
[rank0]: handle.unshard()
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1312, in unshard
[rank0]: padded_unsharded_flat_param = self._all_gather_flat_param(unsharded_flat_param)
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1403, in _all_gather_flat_param
[rank0]: dist.all_gather_into_tensor(
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3440, in all_gather_into_tensor
[rank0]: work.wait()
[rank0]: torch.distributed.DistBackendError: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4142, OpType=_ALLGATHER_BASE, NumelIn=180357152, NumelOut=360714304, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
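In the log above, the two ranks time out in different collectives. Rank 0 is waiting in an `_ALLGATHER_BASE` issued by the FSDP forward pass (train.py line 161), while rank 1 is waiting in an `ALLTOALL` issued from the sequence-parallel dataloader (train.py line 122). The following sketch extracts that information from the watchdog lines so the mismatch is easy to spot. The helper name `stuck_collectives` and the regular expression are illustrative assumptions, not part of FastVideo or PyTorch.

```python
import re

# Two watchdog lines copied verbatim from the log above.
LOG = """\
[rank0]:[E121 09:55:38.949066081 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4142, OpType=_ALLGATHER_BASE, NumelIn=180357152, NumelOut=360714304, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
[rank1]:[E121 09:55:38.957012505 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1687, OpType=ALLTOALL, NumelIn=2442240, NumelOut=2442240, Timeout(ms)=600000) ran for 600007 milliseconds before timing out.
"""

# Hypothetical pattern written against the line format shown in this log only.
PATTERN = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\)"
)


def stuck_collectives(log_text: str) -> dict[int, dict]:
    """Map each rank to the first collective its watchdog reported as timed out."""
    result: dict[int, dict] = {}
    for m in PATTERN.finditer(log_text):
        info = {
            "op": m["op"],
            "seq": int(m["seq"]),
            "numel_in": int(m["numel_in"]),
            "numel_out": int(m["numel_out"]),
            "timeout_ms": int(m["timeout_ms"]),
        }
        # Keep only the first report per rank; later lines repeat it.
        result.setdefault(int(m["rank"]), info)
    return result


ops = stuck_collectives(LOG)
for rank, info in sorted(ops.items()):
    print(f"rank {rank}: {info['op']} (SeqNum={info['seq']})")

if len({info["op"] for info in ops.values()}) > 1:
    print("ranks timed out in different collectives")
```

Ranks blocked in different collectives are consistent with the processes having fallen out of step with each other, although the log alone does not establish the root cause.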
Reproduction
torchrun --nnodes 1 --nproc_per_node 2 --master_port 29903 fastvideo/train.py --seed 1024 --pretrained_model_name_or_path /storage/ytm/FastVideo/data/FastHunyuan-diffusers --model_type hunyuan_hf --cache_dir data/.cache --data_json_path /storage/ytm/FastVideo/data/Image-Vid-Finetune-HunYuan/videos2caption.json --validation_prompt_dir data/Black-Myth-Wukong/validation --gradient_checkpointing --train_batch_size 1 --num_latent_t 32 --sp_size 2 --train_sp_batch_size 1 --dataloader_num_workers 4 --gradient_accumulation_steps 4 --max_train_steps 1000 --learning_rate 8e-5 --mixed_precision bf16 --checkpointing_steps 500 --validation_steps 100 --validation_sampling_steps 50 --checkpoints_total_limit 3 --allow_tf32 --ema_start_step 0 --cfg 0.0 --ema_decay 0.999 --log_validation --output_dir data/outputs/Hunyuan-lora-finetuning-Black-Myth-Wukong --tracker_project_name Hunyuan-lora-finetuning-Black-Myth-Wukong --num_frames 125 --validation_guidance_scale "1.0" --shift 7 --use_lora --lora_rank 32 --lora_alpha 32
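The log itself points to two environment changes for a more informative re-run: it asks to "run with NCCL_DEBUG=INFO for details", and it warns that NCCL_BLOCKING_WAIT is deprecated in favor of TORCH_NCCL_BLOCKING_WAIT. A minimal sketch of preparing such an environment follows. The helper name `debug_env` is hypothetical, and the script only builds the environment; it does not launch training.

```python
import os


def debug_env(base: dict[str, str]) -> dict[str, str]:
    """Return a copy of `base` adjusted as the log messages suggest."""
    env = dict(base)
    # Requested by the error message: "run with NCCL_DEBUG=INFO for details".
    env["NCCL_DEBUG"] = "INFO"
    # The log warns that NCCL_BLOCKING_WAIT is deprecated; carry its value
    # over to TORCH_NCCL_BLOCKING_WAIT unless that is already set.
    old_value = env.pop("NCCL_BLOCKING_WAIT", None)
    if old_value is not None:
        env.setdefault("TORCH_NCCL_BLOCKING_WAIT", old_value)
    return env


env = debug_env(dict(os.environ))
# To re-run, pass env=env to subprocess.run together with the torchrun
# command shown above.
print({key: value for key, value in env.items() if "NCCL" in key})
```

With NCCL_DEBUG=INFO set, NCCL prints its initialization and transport details, which may help narrow down the `ncclSystemError` reported by both ranks.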
Please help me figure out what the problem is.