feat: add support for tensor parallel training workflow with accelerate #34194
base: main
Conversation
Such timing! I have a similar thought here. Shall we collaborate?
@kwen2501 Absolutely, please let me know how you want to take this forward. Thank you.
@ArthurZucker @muellerzr since the accelerate PR (huggingface/accelerate#3173) is merged, requesting review and merge of this PR, which would allow for a complete e2e training workflow using tensor parallelism. Thank you.
SGTM! Could you also update the docs (Efficient Training on Multiple GPUs doc / Trainer doc)?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
What does this PR do?
1. Add an `apply_tensor_parallel` API to apply a TP plan to Llama and Granite models (already merged into HF/transformers).
2. Introduce a `tp_size` user-facing argument to be further consumed by accelerate (see huggingface/accelerate#3173).
3. Allow for e2e TP training (a usage sketch follows below).
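Below is a minimal sketch of what the intended e2e TP training flow could look like from the user's side. It assumes the `tp_size` argument lands on `TrainingArguments` as described above and is consumed by accelerate (huggingface/accelerate#3173); the checkpoint name, dataset, and hyperparameters are placeholders, not values from this PR.

```python
# Hypothetical end-to-end TP training sketch; `tp_size` is the user-facing
# argument introduced by this PR, everything else is a placeholder.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "ibm-granite/granite-3.0-8b-base"  # placeholder Granite/Llama checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

args = TrainingArguments(
    output_dir="tp-out",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    tp_size=8,  # shard each layer across 8 devices; picked up by accelerate
)

raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=4096),
    batched=True,
    remove_columns=raw.column_names,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # adds causal-LM labels
)
trainer.train()  # launched with e.g. `torchrun --nproc_per_node 8 train_tp.py`
```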
Please review in conjunction with huggingface/accelerate#3173
Fixes #32470
Results
We see significant improvements in both memory and throughput compared against single-GPU training and against FSDP, across different settings (checkpointing on/off) and context lengths.
Note: the effective TPS for FSDP is multiplicative in the parallel factor (the number of GPUs/devices engaged in distributed training), whereas that is not the case with TP. Therefore, in terms of effective throughput, FSDP can come out ahead of TP. However, that may be compensated for by increasing the batch size to use the memory gains from TP.
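For concreteness, here is the arithmetic with made-up numbers (not taken from the results below):

```python
# Illustrative only; the throughput numbers are hypothetical, not from this PR's benchmarks.
num_devices = 8
fsdp_tps = 500   # FSDP throughput as typically reported per device/rank
tp_tps = 900     # TP throughput is already the throughput of the whole TP group

fsdp_effective_tps = fsdp_tps * num_devices  # 4000 tokens/s across all devices
tp_effective_tps = tp_tps                    # 900 tokens/s; no multiplication by the parallel factor
```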
Benchmarks were run on two models. The tables below show the max CUDA memory and throughput for various configurations, demonstrating the potential of the TP support contributed in this PR. There are gains in both memory and throughput.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
I have cycles to bring in more improvements over this PR and to bring PyTorch TP support to HF. Looking forward. Thank you.