
Hunyuan LoRA fine-tuning: NCCL timeout #162

Open

ytm-01 opened this issue Jan 21, 2025 · 6 comments

ytm-01 commented Jan 21, 2025

Environment

torch 2.5.0+cu121
fastvideo 1.2.0 /storage/ytm/FastVideo
python 3.10.0

Describe the bug

(fastvideo) root@ubuntu-server:/storage/ytm/FastVideo# torchrun --nnodes 1 --nproc_per_node 2 --master_port 29903 fastvideo/train.py --seed 1024 --pretrained_model_name_or_path /storage/ytm/FastVideo/data/FastHunyuan-diffusers --model_type hunyuan_hf --cache_dir data/.cache --data_json_path /storage/ytm/FastVideo/data/Image-Vid-Finetune-HunYuan/videos2caption.json --validation_prompt_dir data/Black-Myth-Wukong/validation --gradient_checkpointing --train_batch_size 1 --num_latent_t 32 --sp_size 2 --train_sp_batch_size 1 --dataloader_num_workers 4 --gradient_accumulation_steps 4 --max_train_steps 1000 --learning_rate 8e-5 --mixed_precision bf16 --checkpointing_steps 500 --validation_steps 100 --validation_sampling_steps 50 --checkpoints_total_limit 3 --allow_tf32 --ema_start_step 0 --cfg 0.0 --ema_decay 0.999 --log_validation --output_dir data/outputs/Hunyuan-lora-finetuning-Black-Myth-Wukong --tracker_project_name Hunyuan-lora-finetuning-Black-Myth-Wukong --num_frames 125 --validation_guidance_scale "1.0" --shift 7 --use_lora --lora_rank 32 --lora_alpha 32
W0121 09:42:04.934275 1764804 site-packages/torch/distributed/run.py:793]
W0121 09:42:04.934275 1764804 site-packages/torch/distributed/run.py:793] *****************************************
W0121 09:42:04.934275 1764804 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0121 09:42:04.934275 1764804 site-packages/torch/distributed/run.py:793] *****************************************
[W121 09:42:10.077291564 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
--> loading model from /storage/ytm/FastVideo/data/FastHunyuan-diffusers
[W121 09:42:10.098619396 Utils.hpp:164] Warning: Environment variable NCCL_BLOCKING_WAIT is deprecated; use TORCH_NCCL_BLOCKING_WAIT instead (function operator())
Total training parameters = 40.894464 M
--> Initializing FSDP with sharding strategy: full
--> applying fdsp activation checkpointing...
--> model loaded
--> applying fdsp activation checkpointing...
optimizer: AdamW (
Parameter Group 0
amsgrad: False
betas: (0.9, 0.999)
capturable: False
differentiable: False
eps: 1e-08
foreach: None
fused: None
lr: 8e-05
maximize: False
weight_decay: 0.01
)
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.18.5
wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
***** Running training *****
Num examples = 84
Dataloader size = 42
Num Epochs = 48
Resume training from step 0
Instantaneous batch size per device = 1
Total train batch size (w. data & sequence parallel, accumulation) = 4.0
Gradient Accumulation steps = 4
Total optimization steps = 1000
Total training parameters per FSDP shard = 0.020447232 B
Master weight dtype: torch.float32
Steps: 0%| | 0/1000 [00:00<?, ?it/s]/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/autograd/graph.py:825: UserWarning: cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at ../aten/src/ATen/native/cudnn/MHA.cpp:674.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/autograd/graph.py:825: UserWarning: cuDNN SDPA backward got grad_output.strides() != output.strides(), attempting to materialize a grad_output with matching strides... (Triggered internally at ../aten/src/ATen/native/cudnn/MHA.cpp:674.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]:[E121 09:55:38.949066081 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4142, OpType=_ALLGATHER_BASE, NumelIn=180357152, NumelOut=360714304, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
[rank0]:[E121 09:55:38.949544360 ProcessGroupNCCL.cpp:679] [Rank 0] Work WorkNCCL(SeqNum=4142, OpType=_ALLGATHER_BASE, NumelIn=180357152, NumelOut=360714304, Timeout(ms)=600000) timed out in blocking wait (TORCH_NCCL_BLOCKING_WAIT=1).
[rank1]:[E121 09:55:38.957012505 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1687, OpType=ALLTOALL, NumelIn=2442240, NumelOut=2442240, Timeout(ms)=600000) ran for 600007 milliseconds before timing out.
[rank1]:[E121 09:55:38.957737010 ProcessGroupNCCL.cpp:679] [Rank 1] Work WorkNCCL(SeqNum=1687, OpType=ALLTOALL, NumelIn=2442240, NumelOut=2442240, Timeout(ms)=600000) timed out in blocking wait (TORCH_NCCL_BLOCKING_WAIT=1).
[rank1]:[E121 09:55:38.624584541 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E121 09:55:38.624604711 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E121 09:55:38.624732603 ProcessGroupNCCL.cpp:542] [Rank 1] Collective WorkNCCL(SeqNum=1687, OpType=ALLTOALL, NumelIn=2442240, NumelOut=2442240, Timeout(ms)=600000) raised the following async exception: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:

Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2027 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f4e2b948446 in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7f4e2cc8f290 in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7f4e2cc8f4dc in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7f4e2cc96ea3 in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f4e2cc9892d in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0x145c0 (0x7f4e75c745c0 in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #6: + 0x8609 (0x7f4e7820e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x43 (0x7f4e78133353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E121 09:55:38.630887654 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1687, last enqueued NCCL work: 1687, last completed NCCL work: 1686.
[rank1]:[E121 09:55:38.630907984 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 1] Timeout at NCCL work: 1687, last enqueued NCCL work: 1687, last completed NCCL work: 1686.
[rank1]:[E121 09:55:38.630927810 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]: Traceback (most recent call last):
[rank1]: File "/storage/ytm/FastVideo/fastvideo/train.py", line 759, in
[rank1]: main(args)
[rank1]: File "/storage/ytm/FastVideo/fastvideo/train.py", line 410, in main
[rank1]: loss, grad_norm = train_one_step(
[rank1]: File "/storage/ytm/FastVideo/fastvideo/train.py", line 122, in train_one_step
[rank1]: ) = next(loader)
[rank1]: File "/storage/ytm/FastVideo/fastvideo/utils/communications.py", line 317, in sp_parallel_dataloader_wrapper
[rank1]: latents, cond, attn_mask, cond_mask = prepare_sequence_parallel_data(
[rank1]: File "/storage/ytm/FastVideo/fastvideo/utils/communications.py", line 294, in prepare_sequence_parallel_data
[rank1]: ) = prepare(
[rank1]: File "/storage/ytm/FastVideo/fastvideo/utils/communications.py", line 268, in prepare
[rank1]: hidden_states = all_to_all(hidden_states, scatter_dim=2, gather_dim=0)
[rank1]: File "/storage/ytm/FastVideo/fastvideo/utils/communications.py", line 201, in all_to_all
[rank1]: return AllToAll.apply(input, nccl_info.group, scatter_dim, gather_dim)
[rank1]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
[rank1]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank1]: File "/storage/ytm/FastVideo/fastvideo/utils/communications.py", line 175, in forward
[rank1]: output = all_to_all(input, ctx.world_size, process_group,
[rank1]: File "/storage/ytm/FastVideo/fastvideo/utils/communications.py", line 155, in _all_to_all
[rank1]: dist.all_to_all(output_list, input_list, group=group)
[rank1]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4119, in all_to_all
[rank1]: work.wait()
[rank1]: torch.distributed.DistBackendError: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1687, OpType=ALLTOALL, NumelIn=2442240, NumelOut=2442240, Timeout(ms)=600000) ran for 600007 milliseconds before timing out.
[rank0]:[E121 09:55:39.925951012 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E121 09:55:39.925967416 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E121 09:55:39.926106476 ProcessGroupNCCL.cpp:542] [Rank 0] Collective WorkNCCL(SeqNum=4142, OpType=_ALLGATHER_BASE, NumelIn=180357152, NumelOut=360714304, Timeout(ms)=600000) raised the following async exception: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:

Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2027 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f43ee4de446 in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7f43ef825290 in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7f43ef8254dc in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x213 (0x7f43ef82cea3 in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f43ef82e92d in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0x145c0 (0x7f443880a5c0 in /data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #6: + 0x8609 (0x7f443ada4609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x43 (0x7f443acc9353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank0]:[E121 09:55:39.932326608 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 4142, last enqueued NCCL work: 4142, last completed NCCL work: 4141.
[rank0]:[E121 09:55:39.932348580 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 0] Timeout at NCCL work: 4142, last enqueued NCCL work: 4142, last completed NCCL work: 4141.
[rank0]:[E121 09:55:39.932365400 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
Traceback (most recent call last):
File "/storage/ytm/FastVideo/fastvideo/train.py", line 759, in
main(args)
File "/storage/ytm/FastVideo/fastvideo/train.py", line 410, in main
loss, grad_norm = train_one_step(
File "/storage/ytm/FastVideo/fastvideo/train.py", line 161, in train_one_step
model_pred = transformer(**input_kwargs)[0]
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 850, in forward
args, kwargs = _pre_forward(
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 382, in _pre_forward
unshard_fn(state, handle)
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 417, in _pre_forward_unshard
_unshard(state, handle, state._unshard_stream, state._pre_unshard_stream)
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 301, in _unshard
handle.unshard()
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1312, in unshard
padded_unsharded_flat_param = self._all_gather_flat_param(unsharded_flat_param)
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1403, in _all_gather_flat_param
dist.all_gather_into_tensor(
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3440, in all_gather_into_tensor
work.wait()
torch.distributed.DistBackendError: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4142, OpType=_ALLGATHER_BASE, NumelIn=180357152, NumelOut=360714304, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
[rank0]: Traceback (most recent call last):
[rank0]: File "/storage/ytm/FastVideo/fastvideo/train.py", line 759, in
[rank0]: main(args)
[rank0]: File "/storage/ytm/FastVideo/fastvideo/train.py", line 410, in main
[rank0]: loss, grad_norm = train_one_step(
[rank0]: File "/storage/ytm/FastVideo/fastvideo/train.py", line 161, in train_one_step
[rank0]: model_pred = transformer(**input_kwargs)[0]
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 850, in forward
[rank0]: args, kwargs = _pre_forward(
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 382, in _pre_forward
[rank0]: unshard_fn(state, handle)
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 417, in _pre_forward_unshard
[rank0]: _unshard(state, handle, state._unshard_stream, state._pre_unshard_stream)
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 301, in _unshard
[rank0]: handle.unshard()
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1312, in unshard
[rank0]: padded_unsharded_flat_param = self._all_gather_flat_param(unsharded_flat_param)
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 1403, in _all_gather_flat_param
[rank0]: dist.all_gather_into_tensor(
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/data/anaconda3/envs/fastvideo/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3440, in all_gather_into_tensor
[rank0]: work.wait()
[rank0]: torch.distributed.DistBackendError: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4142, OpType=_ALLGATHER_BASE, NumelIn=180357152, NumelOut=360714304, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.

Reproduction

torchrun --nnodes 1 --nproc_per_node 2 --master_port 29903 fastvideo/train.py --seed 1024 --pretrained_model_name_or_path /storage/ytm/FastVideo/data/FastHunyuan-diffusers --model_type hunyuan_hf --cache_dir data/.cache --data_json_path /storage/ytm/FastVideo/data/Image-Vid-Finetune-HunYuan/videos2caption.json --validation_prompt_dir data/Black-Myth-Wukong/validation --gradient_checkpointing --train_batch_size 1 --num_latent_t 32 --sp_size 2 --train_sp_batch_size 1 --dataloader_num_workers 4 --gradient_accumulation_steps 4 --max_train_steps 1000 --learning_rate 8e-5 --mixed_precision bf16 --checkpointing_steps 500 --validation_steps 100 --validation_sampling_steps 50 --checkpoints_total_limit 3 --allow_tf32 --ema_start_step 0 --cfg 0.0 --ema_decay 0.999 --log_validation --output_dir data/outputs/Hunyuan-lora-finetuning-Black-Myth-Wukong --tracker_project_name Hunyuan-lora-finetuning-Black-Myth-Wukong --num_frames 125 --validation_guidance_scale "1.0" --shift 7 --use_lora --lora_rank 32 --lora_alpha 32
Please help me figure out what the problem is.
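
For reference, a common first debugging step for a watchdog timeout like the one above is to surface NCCL's own diagnostics (the error message itself suggests NCCL_DEBUG=INFO) and to give the collectives more headroom than the default 600000 ms. The snippet below is a minimal sketch, not FastVideo code; it assumes the training script creates the process group itself and can pass a longer timeout.

import os
from datetime import timedelta
import torch.distributed as dist

# Hypothetical debugging tweaks; both must be set before the process group
# and the NCCL communicators are created.
os.environ.setdefault("NCCL_DEBUG", "INFO")             # print NCCL's own error details
os.environ.setdefault("TORCH_NCCL_BLOCKING_WAIT", "1")  # replaces the deprecated NCCL_BLOCKING_WAIT

def init_distributed(timeout_minutes: int = 30):
    # Raise the collective timeout above the 10-minute default seen in the log,
    # so a genuinely slow step is not killed early by the watchdog.
    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=timeout_minutes))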

foreverpiano (Collaborator) commented Jan 21, 2025

Can you check whether inference runs correctly? If it does, I think the NCCL part is OK.

ytm-01 (Author) commented Jan 21, 2025

Inference works correctly with bash scripts/inference/inference_hunyuan_hf_quantization.sh.
My steps:
1. Set up the environment:
./env_setup.sh fastvideo
Python 3.10.0, CUDA 12.1, and an A6000
2. Prepare the data.
Download the data:
python scripts/huggingface/download_hf.py --repo_id=FastVideo/Image-Vid-Finetune-Src --local_dir=data/Image-Vid-Finetune-Src --repo_type=dataset
Preprocess the data:
bash scripts/preprocess/preprocess_hunyuan_data.sh
3. LoRA fine-tuning:
bash scripts/finetune/finetune_hunyuan_hf_lora.sh
export WANDB_BASE_URL="https://api.wandb.ai"
export WANDB_MODE=online
torchrun --nnodes 1 --nproc_per_node 2 --master_port 29903
fastvideo/train.py
--seed 1024
--pretrained_model_name_or_path /storage/ytm/FastVideo/data/FastHunyuan-diffusers
--model_type hunyuan_hf
--cache_dir data/.cache
--data_json_path /storage/ytm/FastVideo/data/Image-Vid-Finetune-HunYuan/videos2caption.json
--validation_prompt_dir data/Black-Myth-Wukong/validation
--gradient_checkpointing
--train_batch_size 1
--num_latent_t 32
--sp_size 2
--train_sp_batch_size 1
--dataloader_num_workers 4
--gradient_accumulation_steps 4
--max_train_steps 1000
--learning_rate 8e-5
--mixed_precision bf16
--checkpointing_steps 500
--validation_steps 100
--validation_sampling_steps 50
--checkpoints_total_limit 3
--allow_tf32
--ema_start_step 0
--cfg 0.0
--ema_decay 0.999
--log_validation
--output_dir data/outputs/Hunyuan-lora-finetuning-Black-Myth-Wukong
--tracker_project_name Hunyuan-lora-finetuning-Black-Myth-Wukong
--num_frames 125
--validation_guidance_scale "1.0"
--shift 7
--use_lora
--lora_rank 32
--lora_alpha 32

Is there anything wrong with these steps? Thank you for your answer.

ytm-01 (Author) commented Jan 21, 2025

It might be that FSDP full sharding increases the communication volume. When I change the FSDP settings I run out of GPU memory (OOM). May I ask how long Hunyuan LoRA training takes on your side?
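
For what it's worth, if the suspicion is that FULL_SHARD forces an all-gather of the parameters on every forward pass, one experiment (at the cost of more memory per GPU) is a ZeRO-2-style strategy. The sketch below uses only the public PyTorch FSDP API and is not FastVideo's actual wrapping code; the function name and the choice of SHARD_GRAD_OP are assumptions for illustration.

import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

def wrap_transformer(transformer):
    # Assumes dist.init_process_group("nccl") has already been called.
    # SHARD_GRAD_OP shards gradients and optimizer state but keeps the full
    # parameters resident between forward and backward, so it issues fewer
    # all-gathers than FULL_SHARD at the cost of extra GPU memory.
    assert dist.is_initialized()
    return FSDP(
        transformer,
        sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
        use_orig_params=True,
    )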

foreverpiano (Collaborator) commented Jan 21, 2025

We use A100 80GB GPUs for LoRA fine-tuning. In my experience, the first wandb log shows up quickly.

@BrianChen1129 Have you tested the code on an A6000 or other GPUs with 48GB?

BrianChen1129 (Collaborator)

We haven't tried it on an A6000. For 720p input, fine-tuning took roughly 24 hours.

ytm-01 (Author) commented Jan 22, 2025

I am a beginner, and I ran into something that confuses me.

When I preprocess the data, i.e. bash scripts/preprocess/preprocess_hunyuan_data.sh,

I use
MODEL_PATH="/storage/ytm/FastVideo/data/FastHunyuan-diffusers"
MODEL_TYPE="hunyuan_hf"

This only generates the vae_latents; it cannot generate the text_embeddings.

I traced the cause to FastVideo/fastvideo/models/hunyuan/text_encoder/__init__.py, line 130:
self.tokenizer_path = (tokenizer_path if tokenizer_path is not None else text_encoder_path)
However, in the FastHunyuan-diffusers model the text_encoder and tokenizer are stored separately, at different paths.

So I used the FastHunyuan model to generate the text embeddings instead.

I don't know whether these two models produce the same text embeddings for the same text, or whether this affects the shape of my training data.

Thank you for your help.
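
One way to settle the question is to encode the same prompt with both checkpoints' text encoders and compare the outputs directly. The sketch below is hypothetical and not FastVideo code; the paths, the tokenizer location, and loading via AutoModel are assumptions about the local checkpoint layout, and it assumes the encoder exposes last_hidden_state.

import torch
from transformers import AutoTokenizer, AutoModel

# Paths are assumptions; point them at the two checkpoints being compared.
TOKENIZER_DIR = "/storage/ytm/FastVideo/data/FastHunyuan-diffusers/tokenizer"
ENCODER_A = "/storage/ytm/FastVideo/data/FastHunyuan-diffusers/text_encoder"
ENCODER_B = "/storage/ytm/FastVideo/data/FastHunyuan/text_encoder"

@torch.no_grad()
def embed(encoder_dir, tokenizer_dir, prompt):
    tok = AutoTokenizer.from_pretrained(tokenizer_dir)
    enc = AutoModel.from_pretrained(encoder_dir, torch_dtype=torch.float32).eval()
    ids = tok(prompt, return_tensors="pt")
    return enc(**ids).last_hidden_state

a = embed(ENCODER_A, TOKENIZER_DIR, "a cat playing guitar")
b = embed(ENCODER_B, TOKENIZER_DIR, "a cat playing guitar")
print("shapes:", a.shape, b.shape)  # the embedding shapes must match for the training data
if a.shape == b.shape:
    print("max abs diff:", (a - b).abs().max().item())  # near zero => interchangeable embeddings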
