
train_ddp, process_group: fixes so CUDA works e2e #5

Merged 1 commit into main on Nov 3, 2024
Conversation

d4l3k (Member) commented Nov 3, 2024

ProcessGroupBaby

This adds get_future() support to BabyWork, which is required for using ProcessGroupBaby with the torchft DDP integration.

This uses a thread with an extra queue to handle future completions.

Notably this is partial support:

  • the future currently returned is a None future rather than one propagating the tensors; this works fine since these are in-place operations
  • only one of wait() and get_future() can be called; calling both throws an error
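The thread-plus-queue pattern described above can be sketched with stdlib primitives. This is an illustrative stand-in, not torchft's actual code: the BabyWork name mirrors the PR, but the queue, sentinel shutdown, and mutual-exclusion guard are assumptions about how such a design could look.

```python
# Hypothetical sketch: a dedicated thread drains a completion queue and
# resolves futures for in-flight work. Collectives are in place, so each
# future is completed with None rather than the output tensors.
import queue
import threading
from concurrent.futures import Future


class BabyWork:
    """Illustrative stand-in for an async collective handle."""

    def __init__(self, future_queue: queue.Queue) -> None:
        self._future_queue = future_queue
        self._future = None
        self._waited = False

    def wait(self) -> None:
        # wait() and get_future() are mutually exclusive in this sketch,
        # matching the PR's stated limitation.
        if self._future is not None:
            raise RuntimeError("get_future already called")
        self._waited = True

    def get_future(self) -> Future:
        if self._waited:
            raise RuntimeError("wait already called")
        if self._future is None:
            self._future = Future()
            # Hand the future to the completion thread, which resolves it.
            self._future_queue.put(self._future)
        return self._future


def _future_thread(q: queue.Queue) -> None:
    # Drain the queue, completing each future with None; a None item is
    # the shutdown sentinel.
    while True:
        fut = q.get()
        if fut is None:
            return
        fut.set_result(None)


q = queue.Queue()
t = threading.Thread(target=_future_thread, args=(q,), daemon=True)
t.start()

work = BabyWork(q)
fut = work.get_future()
fut.result(timeout=5)  # completes with None once the thread processes it
q.put(None)  # stop the completion thread
t.join()
```

The extra queue decouples the caller from the completion thread, so get_future() never blocks on the collective itself.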

ProcessGroupBabyNCCL

Fixes it to actually use NCCL instead of Gloo, and updates the unit test to pass.

Manager

Fixes the Manager so the state_dict is always applied from the main thread at a safe point, avoiding version counter errors.
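The deferral described above can be sketched as follows. The Manager here is a hypothetical minimal version invented for illustration: a recovery thread only stashes the incoming state_dict, and the main thread applies it at the next step boundary, where nothing else is mutating the parameters.

```python
# Hypothetical sketch: defer state_dict application from a background
# (recovery) thread to the main thread's next step.
import threading


class Manager:
    """Illustrative only; not torchft's actual Manager."""

    def __init__(self, load_state_dict) -> None:
        self._load = load_state_dict
        self._lock = threading.Lock()
        self._pending = None

    def _on_recovery(self, state_dict) -> None:
        # Called from a background thread: just stash the state_dict,
        # never apply it here.
        with self._lock:
            self._pending = state_dict

    def step(self) -> None:
        # Called from the main thread between training steps -- a safe
        # point to mutate parameters without version counter conflicts.
        with self._lock:
            pending, self._pending = self._pending, None
        if pending is not None:
            self._load(pending)


applied = []
m = Manager(applied.append)

t = threading.Thread(target=m._on_recovery, args=({"w": 1},))
t.start()
t.join()

m.step()  # the stashed state_dict is applied here, on the main thread
```

Holding the lock only around the handoff keeps the (potentially slow) load itself off the recovery thread's critical path.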

Test plan:

  • pytest
  • run train_ddp.py on two GPUs

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) Nov 3, 2024
@d4l3k d4l3k merged commit c7e231e into main Nov 3, 2024
3 of 4 checks passed
@d4l3k d4l3k deleted the d4l3k/ddp_cuda branch November 3, 2024 01:59