feat: add support for tensor parallel training workflow with accelerate #34194
base: main
Conversation
Such timing! I have a similar thought here. Shall we collaborate?
@kwen2501 Absolutely, please let me know how you want to take this forward. Thank you.
@ArthurZucker @muellerzr since the accelerate PR (huggingface/accelerate#3173) is merged, requesting review and merge of this PR, which would allow for a complete e2e training workflow using tensor parallelism. Thank you.
SGTM! Could you also update the docs (Efficient Training on Multiple GPUs doc / Trainer doc)?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
What does this PR do?
1. Add an `apply_tensor_parallel` API to apply a TP plan to Llama and Granite models (already merged into HF/transformers).
2. Introduce a `tp_size` user-facing argument to be further consumed by accelerate (see huggingface/accelerate#3173).
3. Allow for e2e TP training (a usage sketch follows below).
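Below is a minimal sketch of what the intended e2e TP training flow could look like from the user's side. It assumes the `tp_size` argument lands on `TrainingArguments` as described above and is consumed by accelerate (huggingface/accelerate#3173); the checkpoint name, dataset, and hyperparameters are placeholders, not values from this PR.

```python
# Hypothetical end-to-end TP training sketch; `tp_size` is the user-facing
# argument introduced by this PR, everything else is a placeholder.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "ibm-granite/granite-3.0-8b-base"  # placeholder Granite/Llama checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

args = TrainingArguments(
    output_dir="tp-out",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    tp_size=8,  # shard each layer across 8 devices; picked up by accelerate
)

raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=4096),
    batched=True,
    remove_columns=raw.column_names,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # adds causal-LM labels
)
trainer.train()  # launched with e.g. `torchrun --nproc_per_node 8 train_tp.py`
```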
Please review in conjunction with huggingface/accelerate#3173
Fixes #32470
Results
We see significant improvements in both memory and throughput compared against single-GPU training and against FSDP, across different settings (checkpointing on/off) and context lengths.
Note: the effective TPS for FSDP is multiplicative in the parallel factor (the number of GPUs/devices engaged in distributed training), whereas that is not the case with TP. Therefore, in terms of effective throughput, FSDP can come out ahead of TP. However, that may be compensated for by increasing the batch size to use the memory gains from TP.
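For concreteness, here is the arithmetic with made-up numbers (not taken from the results below):

```python
# Illustrative only; the throughput numbers are hypothetical, not from this PR's benchmarks.
num_devices = 8
fsdp_tps = 500   # FSDP throughput as typically reported per device/rank
tp_tps = 900     # TP throughput is already the throughput of the whole TP group

fsdp_effective_tps = fsdp_tps * num_devices  # 4000 tokens/s across all devices
tp_effective_tps = tp_tps                    # 900 tokens/s; no multiplication by the parallel factor
```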
Benchmarks were run on two models. The tables below show the max CUDA memory and throughput for various configurations, demonstrating the potential of the TP support contributed in this PR. There are gains in both memory and throughput.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
I have cycles to bring in more improvements over this PR and to bring PyTorch TP support to HF. Looking forward. Thank you.