train_ddp, process_group: fixes so CUDA works e2e #5

d4l3k · 2024-11-03T01:13:38Z

ProcessGroupBaby

This adds get_future() support to BabyWork, which is required to use ProcessGroupBaby with the torchft DDP integration.

This uses a thread with an extra queue to handle future completions.

Notably, this is only partial support:

- The future returned is currently a None future rather than one that propagates the tensors. This works fine because these are in-place operations.
- Only one of wait() and get_future() can be called; calling both throws an error.
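The thread-plus-queue approach described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `CompletionRouter` and `SketchWork` are hypothetical names, and the standard library's `concurrent.futures.Future` stands in for `torch.futures.Future` so the sketch runs without PyTorch. It mirrors the two stated limitations: the future resolves to None, and mixing `wait()` with `get_future()` raises an error.

```python
import queue
import threading
from concurrent.futures import Future
from typing import Dict, Optional


class CompletionRouter:
    """Hypothetical helper: a background thread that resolves futures
    as completion messages arrive on an extra queue."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._futures: Dict[int, Future] = {}
        # The "extra queue": (op_id, error) tuples, or None to shut down.
        self.completions: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def register(self, op_id: int) -> Future:
        fut: Future = Future()
        with self._lock:
            self._futures[op_id] = fut
        return fut

    def _run(self) -> None:
        while True:
            item = self.completions.get()
            if item is None:  # shutdown sentinel
                return
            op_id, error = item
            with self._lock:
                fut = self._futures.pop(op_id)
            if error is not None:
                fut.set_exception(error)
            else:
                # A None future: the collective wrote its result in place,
                # so there is nothing to propagate.
                fut.set_result(None)

    def shutdown(self) -> None:
        self.completions.put(None)
        self._thread.join()


class SketchWork:
    """Hypothetical work handle that allows wait() or get_future(), not both."""

    def __init__(self, router: CompletionRouter, op_id: int) -> None:
        # Register at creation so a completion can never arrive unregistered.
        self._future = router.register(op_id)
        self._claimed_by: Optional[str] = None

    def _claim(self, method: str) -> None:
        if self._claimed_by not in (None, method):
            raise RuntimeError(
                f"cannot call {method}() after {self._claimed_by}()"
            )
        self._claimed_by = method

    def wait(self, timeout: Optional[float] = None) -> bool:
        self._claim("wait")
        self._future.result(timeout=timeout)
        return True

    def get_future(self) -> Future:
        self._claim("get_future")
        return self._future
```

In this sketch the producer of completion messages (for example, a reader of results from a subprocess) would call `router.completions.put((op_id, None))` when an operation finishes, or pass an exception in place of None to fail the future.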

Fixes the setup so it actually uses NCCL instead of Gloo, and updates the unit test so that it passes.

Fixes state_dict handling so that we always apply the state_dict from the main thread at a safe point, which avoids version counter errors.
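One way to picture the "apply from the main thread at a safe spot" pattern is the sketch below. It is an assumption-laden illustration rather than the code in this PR: `DeferredStateLoader`, `submit`, and `apply_pending` are hypothetical names, and a plain dict stands in for the model and optimizer state. The idea is that a background thread only stashes the incoming state, and the main thread applies it between training steps, when no forward or backward pass is in flight.

```python
import threading
from typing import Any, Dict, Optional


class DeferredStateLoader:
    """Hypothetical helper: background threads stash state, the main thread applies it."""

    def __init__(self, state: Dict[str, Any]) -> None:
        self.state = state  # stand-in for model/optimizer state
        self._lock = threading.Lock()
        self._pending: Optional[Dict[str, Any]] = None

    def submit(self, new_state: Dict[str, Any]) -> None:
        """Safe to call from any thread: only records the state, never applies it."""
        with self._lock:
            self._pending = dict(new_state)

    def apply_pending(self) -> bool:
        """Call from the main thread at a step boundary.

        Returns True if a pending state was applied.
        """
        with self._lock:
            pending, self._pending = self._pending, None
        if pending is None:
            return False
        # In a real training loop this is where load_state_dict would run.
        self.state.update(pending)
        return True
```

A training loop using this sketch would call `apply_pending()` at the top of each step, before the forward pass, so state is never swapped while a step is in progress.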

pytest

run train_ddp.py on two GPUs

facebook-github-bot added the CLA Signed label Nov 3, 2024

d4l3k force-pushed the d4l3k/ddp_cuda branch from 808e6fb to 1c51f51 November 3, 2024 01:26

train_ddp, process_group: fixes so CUDA works e2e

1efa49d

d4l3k force-pushed the d4l3k/ddp_cuda branch from 1c51f51 to 1efa49d November 3, 2024 01:52

d4l3k merged commit c7e231e into main Nov 3, 2024
3 of 4 checks passed

d4l3k deleted the d4l3k/ddp_cuda branch November 3, 2024 01:59