overlapping issue about backward of LayerNormLinear #1353

Closed
cos120 opened this issue Dec 3, 2024 · 5 comments

Comments

@cos120

cos120 commented Dec 3, 2024

Hi, folks.

I am using 4 A100-SXM4 GPUs with PyTorch 2.4.0, mcore 0.9.0, and Transformer Engine (0.11.0+fc03478), running with tp2/pp2 and sequence parallelism (sp).

I found that if I set TORCH_NCCL_ENABLE_TIMING=1 to time all NCCL operations, the all-gather/reduce-scatter (ag/rs) of sp in LayerNormLinear and LayerNormMLP no longer overlaps with dgrad/wgrad.
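For anyone trying to compare the two configurations, a minimal launcher sketch is shown here. It assumes the variable is read from the process environment when the NCCL process group is created, so it has to be set before the training processes start. The helper names `build_env` and `run_training` are hypothetical and are not part of PyTorch, mcore, or Transformer Engine.

```python
import os
import subprocess
import sys


def build_env(enable_timing, base_env=None):
    """Return a copy of the environment with TORCH_NCCL_ENABLE_TIMING set or removed."""
    env = dict(os.environ if base_env is None else base_env)
    if enable_timing:
        env["TORCH_NCCL_ENABLE_TIMING"] = "1"
    else:
        # Remove the variable entirely so the run uses PyTorch's default behaviour.
        env.pop("TORCH_NCCL_ENABLE_TIMING", None)
    return env


def run_training(cmd, enable_timing):
    """Launch `cmd` (a list of arguments) with the chosen setting and return its exit code."""
    return subprocess.run(cmd, env=build_env(enable_timing), check=False).returncode


if __name__ == "__main__":
    # Placeholder command: replace with the real training launch command.
    placeholder_cmd = [sys.executable, "-c", "pass"]
    for flag in (False, True):
        print("TORCH_NCCL_ENABLE_TIMING enabled:", flag, "exit code:", run_training(placeholder_cmd, flag))
```

Running the same command once with each setting keeps everything else identical, which makes the two timelines directly comparable.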

Timeline without TORCH_NCCL_ENABLE_TIMING=1:

There are 4 cudaEventRecord calls, but those events should have been created without the timing flag.

Image

Image

Timeline with TORCH_NCCL_ENABLE_TIMING=1:

Here there are 5 cudaEventRecord calls, and PyTorch creates two of those events with the timing flag.
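To make the distinction concrete, the sketch below creates events with and without timing through the public `torch.cuda.Event` API. It only illustrates the two kinds of events; it is not the code path that ProcessGroupNCCL uses internally, and the helper name `record_events` is hypothetical. The import guard is there only so the snippet can be loaded on a machine without PyTorch or a GPU.

```python
try:
    import torch
except ImportError:  # keep the sketch importable on machines without PyTorch
    torch = None


def record_events(enable_timing):
    """Record a start/end event pair around a small matmul on the current stream.

    Returns the elapsed time in milliseconds when enable_timing is True,
    otherwise None (events created without timing cannot report elapsed time).
    """
    if torch is None or not torch.cuda.is_available():
        return None  # nothing to demonstrate without a CUDA device

    # enable_timing=False is the default for torch.cuda.Event.
    start = torch.cuda.Event(enable_timing=enable_timing)
    end = torch.cuda.Event(enable_timing=enable_timing)

    x = torch.randn(1024, 1024, device="cuda")
    start.record()        # recorded on the current stream
    _ = x @ x
    end.record()
    end.synchronize()     # wait until the end event has completed

    if enable_timing:
        return start.elapsed_time(end)
    return None


if __name__ == "__main__":
    print("without timing:", record_events(False))
    print("with timing:", record_events(True))
```

Both variants show up as cudaEventRecord calls in a timeline; only the flag used at event creation differs.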

Image

Image

Why does recording events with timing break the overlap? The NCCL kernels use 24 SMs, so the matmul should have enough room to launch.

Image

Here are the timeline files: torch_record.json was captured with TORCH_NCCL_ENABLE_TIMING=1, and no_record.json was captured without setting TORCH_NCCL_ENABLE_TIMING.
timeline.tar.gz
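One way to quantify the difference between the two timelines, rather than judging it by eye, is to measure how much of the communication time coincides with GEMM time. The sketch below assumes the JSON files use the Chrome trace event format (a "traceEvents" list of complete events with "ph": "X", "ts" and "dur" in microseconds) and that kernel names contain "nccl" and "gemm"; both are assumptions about these particular files, and all function names are hypothetical. It also assumes the events passed in belong to a single GPU.

```python
import json


def load_events(path):
    """Load trace events from a Chrome-trace-style JSON file."""
    with open(path) as f:
        trace = json.load(f)
    return trace["traceEvents"] if isinstance(trace, dict) else trace


def kernel_spans(events, keyword, category=None):
    """Collect (start, end) spans of complete events whose name contains keyword."""
    spans = []
    for ev in events:
        if ev.get("ph") != "X":
            continue
        if category is not None and ev.get("cat") != category:
            continue
        if keyword in str(ev.get("name", "")).lower():
            spans.append((ev["ts"], ev["ts"] + ev.get("dur", 0)))
    return spans


def merge(spans):
    """Merge overlapping spans into a sorted list of disjoint spans."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap(a, b):
    """Total time covered by both lists of disjoint, sorted spans."""
    i = j = 0
    total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever span finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def comm_overlap_fraction(events, comm_kw="nccl", compute_kw="gemm", category=None):
    """Fraction of communication time that coincides with compute time."""
    comm = merge(kernel_spans(events, comm_kw, category))
    compute = merge(kernel_spans(events, compute_kw, category))
    comm_total = sum(end - start for start, end in comm)
    if comm_total == 0:
        return 0.0
    return overlap(comm, compute) / comm_total
```

Calling `comm_overlap_fraction(load_events("no_record.json"))` and the same for torch_record.json would give two numbers to compare. If CPU-side ranges also match the keywords, passing a `category` value that selects only GPU kernel events in the trace would narrow the result.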

@timmoon10
Collaborator

It's strange that TORCH_NCCL_ENABLE_TIMING=1 has this effect. I haven't been able to fully dig into what it does in PyTorch, but as far as I can tell the only extra thing is that it records a CUDA event before each NCCL collective: https://github.com/pytorch/pytorch/blob/e499b46465bc6e5f1a95f158e44bbf0f8356a220/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L2997
However, it's possible the profiler could have more complicated interactions.

Can you provide a timeline of the LayerNormLinear backward with TORCH_NCCL_ENABLE_TIMING=1? The provided timeline shows Linear, which does not overlap its tensor-parallel communication.

@cos120
Author

cos120 commented Dec 3, 2024

Thanks for your reply. I have corrected the images and uploaded two timeline files: torch_record.json, captured with TORCH_NCCL_ENABLE_TIMING=1, and no_record.json, captured without setting TORCH_NCCL_ENABLE_TIMING.

@cos120
Author

cos120 commented Dec 12, 2024

@timmoon10 any update?😭

@timmoon10
Collaborator

I reproduced this bug in a pure PyTorch script (see pytorch/pytorch#143890 (comment)), so it doesn't seem specific to TE.

@cos120
Author

cos120 commented Jan 8, 2025

Thanks for the reply. Let's continue the discussion in the PyTorch issue. I will close this one.

@cos120 cos120 closed this as completed Jan 8, 2025