
Support DTensor for distributed training #85

Open
rlrs opened this issue Dec 17, 2024 · 1 comment
rlrs commented Dec 17, 2024

The current approach of using uu.Parameter doesn't work for models distributed with, for example, fully_shard, which wraps parameters in DTensor. Once the model is distributed, the optimizer no longer has access to the properties on the uu.Parameter instances.
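
For illustration, a rough sketch of the failure mode (this assumes an initialized process group; `mup_type` is a stand-in for whatever metadata uu.Parameter actually carries, and the fully_shard import path differs across recent PyTorch versions):

```python
import torch
import unit_scaling as uu
from torch.distributed.fsdp import fully_shard  # torch.distributed._composable.fsdp on older versions

model = uu.Linear(1024, 1024)
for p in model.parameters():
    # Before sharding: parameters are uu.Parameter and expose unit-scaling metadata.
    print(type(p).__name__, getattr(p, "mup_type", None))

fully_shard(model)  # parameters are swapped for DTensor-backed nn.Parameter
for p in model.parameters():
    # After sharding: the uu.Parameter subclass (and its attributes) is gone, so an
    # optimizer built from model.parameters() can no longer see the metadata.
    print(type(p).__name__, getattr(p, "mup_type", None))
```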

A minimal fix is to also keep the unit-scaling info in a lookup table that is consulted when instantiating the optimizer. I can probably contribute this if it would be valuable. I'm still trying to make this setup work with torch.compile, which currently struggles a lot with recompilation.
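
To make the idea concrete, here's a rough sketch of that lookup-table approach (the function names are made up, and `mup_type` again stands in for whichever fields the optimizer needs):

```python
import torch

def build_unit_scaling_table(model: torch.nn.Module) -> dict:
    """Record unit-scaling metadata per parameter name *before* sharding,
    while the parameters are still uu.Parameter instances."""
    return {
        name: {"mup_type": p.mup_type}
        for name, p in model.named_parameters()
        if hasattr(p, "mup_type")
    }

def make_param_groups(model: torch.nn.Module, table: dict) -> list[dict]:
    """Group parameters using the recorded metadata rather than attributes on
    the (now DTensor-wrapped) parameters."""
    groups: dict[str, list] = {}
    for name, p in model.named_parameters():
        key = table.get(name, {}).get("mup_type", "default")
        groups.setdefault(key, []).append(p)
    # Per-group lr/weight-decay scaling would be applied here based on `key`.
    return [{"params": ps, "mup_type": key} for key, ps in groups.items()]

# Usage: capture the table before fully_shard(), build the optimizer after it.
# table = build_unit_scaling_table(model)
# fully_shard(model)
# optimizer = torch.optim.AdamW(make_param_groups(model, table), lr=1e-3)
```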

thecharlieblake (Contributor) commented

Hi @rlrs, apologies for the slow response - took some time off for the holidays.

I'm afraid we don't have a neat solution for either of the DTensor or torch.compile issues you highlight. On DTensor, it's definitely something we'd like the library to support, and we're aware we haven't touched it yet. We'd be very receptive to anything you'd like to do here. If you decide to embark on an implementation we'd be keen to get involved however you wish (brainstorming the design, reviewing PRs, etc.), though at this time we don't have scope to lead on it.

For torch.compile I'm a little surprised. I haven't looked at this in a while, but from my recollection everything was fusing successfully and I don't think it was recompiling. I'd be keen to see an example of it not working. If you can send us any code/output here, I'd be happy to debug it myself and try to identify the issue.
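
In case it helps with capturing that output, recompilation reasons can be surfaced with standard Dynamo logging (nothing unit-scaling specific), e.g.:

```python
import torch

# Log recompile and graph-break reasons; equivalent to running with
# TORCH_LOGS="recompiles,graph_breaks" set in the environment.
torch._logging.set_logs(recompiles=True, graph_breaks=True)

def step(x: torch.Tensor) -> torch.Tensor:  # stand-in for the real training step
    return (x * x).sum()

compiled_step = torch.compile(step)
compiled_step(torch.randn(8))
compiled_step(torch.randn(16))  # shape change: logs a recompile reason
```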
