
Support DTensor for distributed training #85

Open
rlrs opened this issue Dec 17, 2024 · 1 comment
rlrs commented Dec 17, 2024

The current approach of using uu.Parameter doesn't work for models distributed with, for example, fully_shard, which wraps parameters in DTensor. Once the model is distributed, the optimizer no longer has access to the properties on the uu.Parameter instances.
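
For illustration, a rough sketch of the failure mode (this assumes an initialized process group; `mup_type` is a stand-in for whatever metadata uu.Parameter actually carries, and the fully_shard import path differs across recent PyTorch versions):

```python
import torch
import unit_scaling as uu
from torch.distributed.fsdp import fully_shard  # torch.distributed._composable.fsdp on older versions

model = uu.Linear(1024, 1024)
for p in model.parameters():
    # Before sharding: parameters are uu.Parameter and expose unit-scaling metadata.
    print(type(p).__name__, getattr(p, "mup_type", None))

fully_shard(model)  # parameters are swapped for DTensor-backed nn.Parameter
for p in model.parameters():
    # After sharding: the uu.Parameter subclass (and its attributes) is gone, so an
    # optimizer built from model.parameters() can no longer see the metadata.
    print(type(p).__name__, getattr(p, "mup_type", None))
```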

A minimal fix is to also keep the unit-scaling info in a lookup table that is consulted when instantiating the optimizer. I can probably contribute this if it would be valuable. I'm still trying to make this setup work with torch.compile, which currently struggles a lot with recompilation.
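
To make the idea concrete, here's a rough sketch of that lookup-table approach (the function names are made up, and `mup_type` again stands in for whichever fields the optimizer needs):

```python
import torch

def build_unit_scaling_table(model: torch.nn.Module) -> dict:
    """Record unit-scaling metadata per parameter name *before* sharding,
    while the parameters are still uu.Parameter instances."""
    return {
        name: {"mup_type": p.mup_type}
        for name, p in model.named_parameters()
        if hasattr(p, "mup_type")
    }

def make_param_groups(model: torch.nn.Module, table: dict) -> list[dict]:
    """Group parameters using the recorded metadata rather than attributes on
    the (now DTensor-wrapped) parameters."""
    groups: dict[str, list] = {}
    for name, p in model.named_parameters():
        key = table.get(name, {}).get("mup_type", "default")
        groups.setdefault(key, []).append(p)
    # Per-group lr/weight-decay scaling would be applied here based on `key`.
    return [{"params": ps, "mup_type": key} for key, ps in groups.items()]

# Usage: capture the table before fully_shard(), build the optimizer after it.
# table = build_unit_scaling_table(model)
# fully_shard(model)
# optimizer = torch.optim.AdamW(make_param_groups(model, table), lr=1e-3)
```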

thecharlieblake (Contributor) commented

Hi @rlrs, apologies for the slow response - took some time off for the holidays.

I'm afraid we don't have a neat solution for either of the DTensor or torch.compile issues you highlight. On DTensor, it's definitely something we'd like the library to support, and we're aware we haven't touched it yet. We'd be very receptive to anything you'd like to do here. If you decide to embark on an implementation we'd be keen to get involved however you wish (brainstorming the design, reviewing PRs, etc.), though at this time we don't have scope to lead on it.

For torch.compile I'm a little surprised. I haven't looked at this in a while, but from my recollection everything was fusing successfully and I don't think it was recompiling. I'd be keen to see an example of it not working. If you can send us any code/output here, I'd be happy to debug it myself and try to identify the issue.
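
In case it helps with capturing that output, recompilation reasons can be surfaced with standard Dynamo logging (nothing unit-scaling specific), e.g.:

```python
import torch

# Log recompile and graph-break reasons; equivalent to running with
# TORCH_LOGS="recompiles,graph_breaks" set in the environment.
torch._logging.set_logs(recompiles=True, graph_breaks=True)

def step(x: torch.Tensor) -> torch.Tensor:  # stand-in for the real training step
    return (x * x).sum()

compiled_step = torch.compile(step)
compiled_step(torch.randn(8))
compiled_step(torch.randn(16))  # shape change: logs a recompile reason
```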
