update broadcast algo #447

Draft
wants to merge 27 commits into
base: main

Binyang2014
Contributor

No description provided.

chhwang and others added 27 commits November 9, 2024 01:40
…#417)

Encountered a hang in rccl-test's ncclBcast runs. In rccl-test with
ncclBcast, the buffer changes between test runs only on the root GPU.
The channel key therefore changes only on the root GPU and stays the
same on the other GPUs, so the root GPU starts to create a new sm
channel while the other GPUs do not. Hence, a hang occurs.
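
The mismatch described above can be modelled in a few lines. This is a minimal pure-Python sketch, not MSCCLPP code: `channel_key`, `Rank`, and `run_bcast` are hypothetical names, and the key is assumed to be derived from the buffer address and size. It only shows which ranks would enter channel setup on each call.

```python
# Hypothetical model: each rank caches channels under a key derived from
# its own buffer address and size (an assumption made for illustration).
def channel_key(buffer_addr, nbytes):
    return (buffer_addr, nbytes)


class Rank:
    def __init__(self, rank_id):
        self.rank_id = rank_id
        self.channels = {}  # key -> placeholder channel handle

    def needs_setup(self, buffer_addr, nbytes):
        """Return True if this call would trigger channel setup on this rank."""
        key = channel_key(buffer_addr, nbytes)
        if key in self.channels:
            return False
        self.channels[key] = object()
        return True


def run_bcast(ranks, buffers, nbytes):
    """Return the ids of the ranks that would enter channel setup."""
    return {r.rank_id for r, buf in zip(ranks, buffers) if r.needs_setup(buf, nbytes)}


ranks = [Rank(i) for i in range(4)]
# First run: every rank sees a new key, so all ranks set up channels together.
first = run_bcast(ranks, [0x1000, 0x2000, 0x3000, 0x4000], 1024)
# Second run: only the root (rank 0) was handed a different buffer.
second = run_bcast(ranks, [0x1800, 0x2000, 0x3000, 0x4000], 1024)
print(first, second)
```

In the second run only rank 0 enters setup. If setup needs the other ranks to take part, as the hang suggests, rank 0 waits for peers that never arrive.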

To fix this, we have to use a copy-based implementation. This is the
simplest copy-based implementation: it uses a static channel key, with
sm channels created using the scratch buffer as the remote memory.
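
A rough model of the copy-based approach, under stated assumptions: this is plain Python, not the MSCCLPP implementation, and `STATIC_KEY`, `Rank`, `ensure_channels`, and `bcast` are hypothetical names. Each rank registers its scratch buffer once under a fixed key. Data goes from the root's user buffer into each peer's scratch buffer, and each peer then copies it into its own user buffer, so a changing user buffer never changes the key.

```python
STATIC_KEY = "bcast"  # hypothetical fixed key, independent of user buffer addresses


class Rank:
    def __init__(self, rank_id, scratch_size):
        self.rank_id = rank_id
        self.scratch = bytearray(scratch_size)  # stands in for the registered remote memory
        self.channels = {}
        self.setups = 0  # how many times this rank performed channel setup

    def ensure_channels(self):
        # The key is the same on every call, so setup happens at most once per rank.
        if STATIC_KEY not in self.channels:
            self.channels[STATIC_KEY] = object()
            self.setups += 1


def bcast(ranks, user_buffers, root):
    for r in ranks:
        r.ensure_channels()
    data = bytes(user_buffers[root])
    for r, buf in zip(ranks, user_buffers):
        if r.rank_id == root:
            continue
        r.scratch[:len(data)] = data             # root writes into the peer's scratch buffer
        buf[:len(data)] = r.scratch[:len(data)]  # peer copies scratch into its user buffer


ranks = [Rank(i, scratch_size=16) for i in range(4)]
buffers = [bytearray(b"aaaa")] + [bytearray(4) for _ in range(3)]
bcast(ranks, buffers, root=0)

# Second run: the root uses a different buffer, the peers reuse theirs.
buffers[0] = bytearray(b"bbbb")
bcast(ranks, buffers, root=0)
```

The extra copy through scratch is the cost of this design. In exchange, all ranks agree on when channels are created regardless of which buffers the caller passes.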

4 participants