Optimizer state offload to CPU #204

Merged
14 commits merged into main from ap/opt_offload on Jan 30, 2025
Conversation

@apaz-cli (Contributor) commented Jan 22, 2025

./scripts/simulate_multi_node_diloco.sh 1 8 src/zeroband/train.py @ configs/10B/H100_devel.toml

Before:

22:55:18 [DEBUG] [Rank 0] Max memory used: 49695.58 MB

After:

22:55:18 [DEBUG] [Rank 0] Max memory used: 49695.58 MB

@awgu do you have any idea what's going on here? I added CPUOffloadPolicy, but it doesn't seem to be offloading.
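
For reference, here is a minimal sketch of how a CPUOffloadPolicy is typically wired into FSDP2's fully_shard. This is an assumption about the general setup, not this repo's code: the toy nn.Sequential stands in for the real model, an initialized process group is assumed, and the import path has moved between torch releases (older versions expose the same names under torch.distributed._composable.fsdp).

import torch
import torch.nn as nn
from torch.distributed.fsdp import fully_shard, CPUOffloadPolicy  # path in recent torch releases

# Toy stand-in model; assumes torch.distributed is already initialized (e.g. via torchrun).
model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024))

# Shard each layer and then the root module, asking FSDP to keep the sharded
# parameters, gradients, and optimizer state in pinned CPU memory between uses.
for layer in model:
    fully_shard(layer, offload_policy=CPUOffloadPolicy())
fully_shard(model, offload_policy=CPUOffloadPolicy())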

@awgu commented Jan 22, 2025

I have been summoned 😮

I would recommend:

  1. Add some assertions to check whether your model.parameters() are actually on CPU when you expect them to be (e.g. after init).
  2. Use the memory snapshot tool, which records stack traces for allocations and can show why you are still seeing GPU allocations.

There might be a public API now, but this is what I usually do:

import pickle
import torch

# Add this somewhere early in your init code
torch.cuda.memory._record_memory_history()
...
# Later, dump the recorded allocation history to a pickle
snapshot = torch.cuda.memory._snapshot()
with open("snapshot.pickle", "wb") as f:
    pickle.dump(snapshot, f)
# Or, to write one file per rank:
snapshot = torch.cuda.memory._snapshot()
with open(f"snapshot_{torch.distributed.get_rank()}.pickle", "wb") as f:
    pickle.dump(snapshot, f)
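
As an aside (not from the original comment): either pickle can be opened in PyTorch's snapshot viewer at https://pytorch.org/memory_viz to browse allocation stack traces over time, and recent torch versions also provide a one-call helper for the dump. Note that _dump_snapshot, like _snapshot, is a private API and may change between releases.

# Shortcut that records the snapshot straight to disk:
torch.cuda.memory._dump_snapshot("snapshot.pickle")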

@samsja (Collaborator) commented Jan 22, 2025

thanks @awgu !!

@apaz-cli (Contributor, author) commented Jan 23, 2025

@awgu I just spent a couple hours poking at it.

Actually, as far as I can tell, the gradients are not on CPU until the first time loss.backward() is called with model.set_requires_gradient_sync(True); until that point, param.grad is None for every parameter. I expected the gradients to be there (especially because I had planned to write my own CPU optimizer using register_post_accumulate_grad_hook()), and I'm wondering what's going on. Do you know what's up with that? Is it documented anywhere?

That's a different question, though. The answer to the original question turned out to be that the optimizer actually DID get offloaded, contrary to my belief. The max memory usage happens right at the beginning, when the model is materialized before fully_shard().

So now I'm trying to figure out how to avoid materializing the full model before sharding.
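
For context on the register_post_accumulate_grad_hook() idea mentioned above, here is a hypothetical sketch of an optimizer-step-in-backward hook. It is not code from this PR; `model` and the learning rate are assumptions, and the hook only runs once a parameter's .grad has actually been accumulated, which is also why param.grad is None before the first backward pass.

import torch

def make_step_hook(lr: float = 1e-4):
    def step(param: torch.Tensor) -> None:
        # Fires right after param.grad has been fully accumulated in backward:
        # apply a plain SGD update, then drop the gradient immediately so it
        # never has to survive until a separate optimizer.step() call.
        with torch.no_grad():
            param.add_(param.grad, alpha=-lr)
        param.grad = None
    return step

for p in model.parameters():  # `model` is assumed to already exist
    if p.requires_grad:
        p.register_post_accumulate_grad_hook(make_step_hook())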

@awgu commented Jan 23, 2025

I think the unsharded model passed to FSDP is on GPU, which aligns with your observation that the peak memory is before calling fully_shard.

model = model.to(world_info.local_rank)

At this point:

logger.debug("model fsdped")

I would expect all model parameters to be on CPU:

# Run this right after the "model fsdped" log line above:
for param_name, param in model.named_parameters():
    assert param.device.type == "cpu", f"{param_name} is not on CPU!"

Could you check if that is the case?

@apaz-cli (Contributor, author) commented Jan 23, 2025

@awgu Yep, the parameters are all on CPU.

I now believe the issue was that I had uncommented torch.set_default_device("cuda") so the model would load faster, in about 5 seconds rather than 40, but that call also materializes the full model on the GPU. I should learn how that works. When I comment it out, I see the memory savings.
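
On avoiding materializing the full model before sharding: the usual FSDP2 pattern is meta-device initialization. The sketch below is an assumption about one way to apply it here, not this repo's code; the toy model, dimensions, and the final re-init loop are placeholders for the real initialization logic.

import torch
import torch.nn as nn
from torch.distributed.fsdp import fully_shard, CPUOffloadPolicy

# Build the module tree on the meta device: shapes and dtypes only, no storage.
with torch.device("meta"):
    model = nn.Sequential(nn.Linear(8192, 8192), nn.Linear(8192, 8192))  # toy stand-in

for layer in model:
    fully_shard(layer, offload_policy=CPUOffloadPolicy())
fully_shard(model, offload_policy=CPUOffloadPolicy())

# Allocate storage for this rank's shards only, then (re)initialize the weights.
# With CPUOffloadPolicy the shards live on CPU, hence device="cpu"; without
# offloading, device="cuda" would be the usual choice.
model.to_empty(device="cpu")
for module in model.modules():
    if isinstance(module, nn.Linear):
        module.reset_parameters()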

Comment on lines 1 to 2
name_model = "26B"
type_model = "llama2"
@samsja (Collaborator) commented:

let's create a 26B model config then instead of modifying the 10b_devel

@apaz-cli (Contributor, author) commented:

Fair. This is the one that I'm making changes to, to try to get the model sized to the machine.

@samsja (Collaborator) left a comment

Can you break this PR down into two or three PRs?

  • one with only the CPU offload code
  • one with all the other modifications
  • one with the Liger kernel updates

Otherwise it's hard to review, and to revert if needed.

@apaz-cli (Contributor, author):

Summary of changes:

  • Removed einops
  • Fixed types
  • Added logging to inner training loop
  • Changed the arguments on get_optimizer()
  • Fixed model initialization
  • Swap the order of detach() and clone()
  • Overlap loss all_reduce()s (see the sketch after this list)
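
Since only the summary is shown here, the following is a generic illustration of the "overlap loss all_reduce()s" item using async_op=True, not the PR's actual code; `loss` and `logger` are assumed to exist, and ReduceOp.AVG requires the NCCL backend.

import torch.distributed as dist

# Kick off the reduction of a detached copy of the loss without blocking.
loss_for_logging = loss.detach().clone()
work = dist.all_reduce(loss_for_logging, op=dist.ReduceOp.AVG, async_op=True)

loss.backward()   # the logging all_reduce overlaps with backward

work.wait()       # block only once the reduced value is actually needed
logger.info(f"avg loss: {loss_for_logging.item():.4f}")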


@samsja (Collaborator) left a comment

lfgtm

@apaz-cli merged commit eabafad into main on Jan 30, 2025 (1 of 2 checks passed)
@apaz-cli deleted the ap/opt_offload branch on January 30, 2025, 01:42