Make cache_all_gather configurable #20508

Open
lgiacomoni opened this issue Dec 13, 2024 · 2 comments

Comments

@lgiacomoni

lgiacomoni commented Dec 13, 2024

This option defaults to True and there doesn't seem to be a way to set it to False programmatically. Is there a specific reason for that? The accompanying comment seems to suggest that it should be safe to set it to False.

I am trying to get finer-grained control over how XLA allocates memory for sharded parameters, and I was wondering if cache_all_gather could help with that. At the moment I have a large model that, during the forward pass, gathers the sharded parameter array for the matmul, but then keeps it in memory to reuse in the backward pass instead of discarding the shards and gathering them again. Has anyone tried to enforce memory allocation/deallocation with XLA before? Is that even possible?
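For reference, a minimal sketch of the pattern described above (the shapes, names, and mesh layout here are hypothetical, not taken from the reporter's model):

import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Hypothetical setup: a large weight matrix sharded across all devices
# along its first dimension (FSDP-style parameter sharding).
mesh = Mesh(jax.devices(), axis_names=("model",))
w = jax.device_put(jnp.ones((8192, 8192)), NamedSharding(mesh, P("model", None)))
x = jnp.ones((64, 8192))

@jax.jit
def loss(w, x):
    # the matmul needs the full weight, so XLA inserts an all-gather here
    return jnp.sum(x @ w)

# Without rematerialization, the gathered copy of `w` may be saved as a
# residual for the backward pass instead of being re-gathered.
grads = jax.grad(loss)(w, x)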

@patrick-toulme
Contributor

Those all-gathers are probably getting CSE'd away. Try this:

from functools import partial
import jax

# adapt the prediction function to gather weights just before their use,
# and to re-gather them on the backward pass (rather than saving them)
@partial(jax.remat, policy=lambda op, *_, **__: str(op) != 'all_gather')
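For completeness, a small self-contained sketch of how this policy might be wired up (the function names and shapes below are hypothetical; the policy itself is the one suggested above):

from functools import partial
import jax
import jax.numpy as jnp

# Save everything except all-gather results, so gathered weights are
# re-gathered (rematerialized) on the backward pass instead of being kept.
not_all_gather = lambda op, *_, **__: str(op) != 'all_gather'

@partial(jax.remat, policy=not_all_gather)
def predict(w, x):  # hypothetical prediction function
    return x @ w

def loss(w, x):
    return jnp.sum(predict(w, x))

# The gradient computation recomputes the all-gather of `w` in the backward pass.
grads = jax.jit(jax.grad(loss))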

@patrick-toulme
Contributor

Run your trainer with this, and you can see the number of all-gathers after each compiler pass:

export XLA_FLAGS="--xla_dump_hlo_as_text --xla_dump_to=${HLO_DUMP_PATH} --xla_dump_hlo_pass_re='.*'"
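Once the dump is written, the all-gathers can be counted directly from the HLO text files. A rough sketch (the file-name pattern and the HLO_DUMP_PATH handling are assumptions about a typical XLA dump layout, and substring counting is only approximate):

import glob, os

# Hypothetical helper: count all-gather instructions in the dumped HLO text.
# Assumes HLO_DUMP_PATH is the directory passed to --xla_dump_to and that
# post-optimization dumps contain "after_optimizations" in their file names.
dump_dir = os.environ["HLO_DUMP_PATH"]
for path in sorted(glob.glob(os.path.join(dump_dir, "*after_optimizations*.txt"))):
    with open(path) as f:
        count = f.read().count("all-gather")
    print(f"{os.path.basename(path)}: {count} all-gather instruction(s)")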
