Optimizer state offload to CPU #204
Conversation
I have been summoned 😮 I would recommend:
There might be a public API now, but this is what I usually do:
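(The snippet that originally followed this comment isn't preserved in this excerpt. As a rough sketch of the public API being alluded to, presumably FSDP2's fully_shard with CPUOffloadPolicy, which comes up later in this thread; MyModel and model.layers are placeholders, and the import path can differ across PyTorch versions:)

```python
import torch
# Recent PyTorch releases expose these under torch.distributed.fsdp; older
# versions expose them from torch.distributed._composable.fsdp instead.
from torch.distributed.fsdp import fully_shard, CPUOffloadPolicy

model = MyModel()  # placeholder model constructor

# Shard each block, then the root module, with CPU offload enabled. FSDP then
# keeps sharded parameters, gradients, and hence optimizer state on CPU,
# moving data to GPU only around compute.
for block in model.layers:
    fully_shard(block, offload_policy=CPUOffloadPolicy(pin_memory=True))
fully_shard(model, offload_policy=CPUOffloadPolicy(pin_memory=True))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```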
thanks @awgu !!
@awgu I just spent a couple of hours poking at it. Actually, as far as I can tell, the parameters are not on CPU until the first time [...]. That's a different question though. The answer to the original question turned out to be that the optimizer actually DID get offloaded, contrary to my belief. The max memory usage happens right at the beginning, when the model is materialized before sharding. So now I'm trying to figure out how to not materialize the full model before sharding.
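(A hypothetical probe, not code from this PR, for pinning down where the spike happens; build_model and shard_model are placeholders for the repo's own setup steps:)

```python
import torch

# Reset the allocator's peak counter, then read it back after each stage to
# see whether the spike comes from materialization or from later steps.
torch.cuda.reset_peak_memory_stats()

model = build_model()   # placeholder: constructs the full model
print(f"peak after init:  {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")

shard_model(model)      # placeholder: FSDP wrapping / sharding step
print(f"peak after shard: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```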
I think the unsharded model passed to FSDP is on GPU, which aligns with your observation that the peak memory occurs before this call (Line 125 in adfbb5c).
At this point (Line 175 in adfbb5c), I would expect all model parameters to be on CPU:
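(The snippet that followed isn't preserved in this excerpt; a minimal check along these lines is the kind of thing being asked for:)

```python
# Print every parameter's device and assert they are all on CPU.
for name, param in model.named_parameters():
    print(name, param.device)
assert all(p.device.type == "cpu" for p in model.parameters())
```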
Could you check if that is the case?
@awgu Yep, the parameters are all on CPU. I now believe the issue was that I had uncommented torch.set_default_device("cuda") so the model would load faster (in about 5 seconds rather than 40), but it also materializes the model. I should learn how that works. When I comment it out, I see the memory savings.
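(For context on why torch.set_default_device("cuda") undoes the savings, a small sketch; it assumes a CUDA-capable machine, and the meta-device part is one common alternative rather than necessarily the approach this PR takes:)

```python
import torch
import torch.nn as nn

# With a CUDA default device, module construction allocates real GPU tensors,
# so the full unsharded model is materialized before FSDP ever wraps it.
torch.set_default_device("cuda")
layer = nn.Linear(4096, 4096)
print(layer.weight.device)       # cuda:0, weights already live on GPU

torch.set_default_device("cpu")  # restore the default

# One common way to avoid early materialization: construct on the meta device
# (no storage allocated), shard, then materialize only the local shards.
with torch.device("meta"):
    meta_layer = nn.Linear(4096, 4096)
print(meta_layer.weight.device)  # meta, no memory allocated yet
```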
configs/10B/H100_devel.toml (outdated)
name_model = "26B"
type_model = "llama2"
let's create a 26B model config then, instead of modifying the 10B devel one
Fair. This is the config I'm changing to try to get the model sized to the machine.
Can you break this PR down into two or three PRs?
one with only the CPU offload code
one with all the other modifications
one with the Liger kernel updates
Otherwise it's hard to review and, if needed, revert.
Summary of changes:
lgtm
./scripts/simulate_multi_node_diloco.sh 1 8 src/zeroband/train.py @ configs/10B/H100_devel.toml
Before:
After:
@awgu do you have any idea what's going on here? I added CPUOffloadPolicy, but it doesn't seem to be offloading.
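(One way to verify whether the optimizer state actually left the GPU, a hypothetical probe rather than code from this PR: after the first optimizer.step(), inspect the devices of the state tensors.)

```python
import torch

def report_optimizer_state_devices(optimizer: torch.optim.Optimizer) -> None:
    """Print the device types holding optimizer state (e.g. Adam's exp_avg)."""
    devices = set()
    for state in optimizer.state.values():
        for value in state.values():
            if torch.is_tensor(value):
                devices.add(value.device.type)
    print("optimizer state device types:", devices or "<no state yet>")

# Call this after the first training step; with working CPU offload it should
# report {'cpu'} rather than {'cuda'}.
```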