FSDP OOM error #35636

Open
blurmemo opened this issue Jan 12, 2025 · 1 comment

Comments

@blurmemo

I am fine-tuning my model with LoRA and FSDP (ShardingStrategy is FULL_SHARD) on two 40 GB A100 GPUs and one 80 GB GPU. I launch the job with this command:

CUDA_VISIBLE_DEVICES=5,3,4 torchrun --standalone --nnodes=1 --nproc-per-node=3 finetuning.py

The two 40 GB A100s still run out of memory. Watching the GPUs, I see that every GPU loads the full model weights while FullyShardedDataParallel initializes the model. I don't understand why this happens or how to fix it.
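For context, here is a minimal sketch of how a LoRA model is often wrapped for FULL_SHARD so that no rank has to hold the whole model on its GPU at once. The actual finetuning.py is not shown in this issue, so everything here is an assumption rather than the author's code: the helper name `wrap_with_fsdp`, the `MIN_PARAMS_PER_UNIT` threshold, and the choice of `size_based_auto_wrap_policy` are all illustrative. The idea is to leave the model on the CPU, pass `device_id`, and give FSDP an auto-wrap policy so parameters are moved, flattened, and sharded one unit at a time instead of as a single flat parameter.

```python
import functools

import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Hypothetical threshold: a submodule becomes its own FSDP unit once its
# not-yet-wrapped parameter count reaches this value.
MIN_PARAMS_PER_UNIT = 100_000_000

auto_wrap_policy = functools.partial(
    size_based_auto_wrap_policy, min_num_params=MIN_PARAMS_PER_UNIT
)


def wrap_with_fsdp(model: nn.Module) -> FSDP:
    """Wrap a CPU-resident model with FULL_SHARD.

    Assumes torchrun launched this process and that
    torch.distributed.init_process_group() and torch.cuda.set_device()
    have already been called for the local rank.
    """
    return FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        # Split the model into several flat parameters instead of one.
        auto_wrap_policy=auto_wrap_policy,
        # Let FSDP move each unit to this GPU as it is wrapped.
        device_id=torch.cuda.current_device(),
        # Allows frozen base weights and trainable LoRA weights to share a unit.
        use_orig_params=True,
    )
```

A transformer-layer-based policy may be a better fit than a size threshold for a real model. Whether this resolves the OOM here depends on details of the script that the issue does not include.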

Bug logs

[rank2]: Traceback (most recent call last):
[rank2]:   File "/data0/home/ening/NICA/cogmllm/src/cogmllm/tools/finetuning.py", line 438, in <module>
[rank2]:     fire.Fire(main)
[rank2]:   File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
[rank2]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank2]:   File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
[rank2]:     component, remaining_args = _CallAndUpdateTrace(
[rank2]:   File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank2]:     component = fn(*varargs, **kwargs)
[rank2]:   File "/data0/home/ening/NICA/cogmllm/src/cogmllm/tools/finetuning.py", line 281, in main
[rank2]:     model = FSDP(
[rank2]:   File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 509, in __init__
[rank2]:     _init_param_handle_from_module(
[rank2]:   File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 636, in _init_param_handle_from_module
[rank2]:     _init_param_handle_from_params(state, managed_params, fully_sharded_module)
[rank2]:   File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 648, in _init_param_handle_from_params
[rank2]:     handle = FlatParamHandle(
[rank2]:   File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 584, in __init__
[rank2]:     self._init_flat_param_and_metadata(
[rank2]:   File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 739, in _init_flat_param_and_metadata
[rank2]:     self.flat_param: FlatParameter = self.flatten_tensors_into_flat_param(
[rank2]:   File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 852, in flatten_tensors_into_flat_param
[rank2]:     flat_param_data = self.flatten_tensors(tensors, aligned_numel)
[rank2]:   File "/data0/home/ening/software/miniconda3/envs/cogmllm/lib/python3.10/site-packages/torch/distributed/fsdp/_flat_param.py", line 844, in flatten_tensors
[rank2]:     return torch.cat(flat_tensors, dim=0)
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 19.88 GiB. GPU 2 has a total capacity of 39.38 GiB of which 18.80 GiB is free. Including non-PyTorch memory, this process has 20.57 GiB memory in use. Of the allocated memory 19.89 GiB is allocated by PyTorch, and 208.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
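
The figures in the last log line can be checked with a little arithmetic. The short sketch below only restates numbers from the rank 2 message. The interpretation is an assumption, not something the log proves: the request (19.88 GiB) is almost exactly the size of what PyTorch has already allocated (19.89 GiB), which is consistent with `torch.cat` in `flatten_tensors` building a second, flattened copy of parameters that are already on the GPU.

```python
# Figures copied from the rank 2 OOM message above (all in GiB).
capacity = 39.38
free = 18.80
allocated_by_pytorch = 19.89
requested = 19.88

# Memory needed if the existing allocation and the new flat copy coexist.
needed = allocated_by_pytorch + requested
fits = requested <= free

print(f"needed during flatten: {needed:.2f} GiB of {capacity:.2f} GiB")
print(f"request exceeds free memory by {requested - free:.2f} GiB")
print(f"fits: {fits}")
```

If that reading is right, the peak occurs before any sharding takes place, which would explain why FULL_SHARD does not help at this point in initialization.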
@Rocketknight1
Member

cc @muellerzr @SunMarc
