The current approach of using uu.Parameter doesn't work for models distributed with, for example, fully_shard, which uses DTensor. Once distributed, the optimizer no longer has access to the properties on the uu.Parameters.
A minimal fix would be to also keep the unit scaling info in a lookup table that is consulted when instantiating the optimizer, as sketched below. I can probably contribute this if it would be valuable. I'm still trying to make this setup work with torch.compile, which currently struggles a lot with recompilation.
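For concreteness, here's a minimal sketch of the lookup-table idea. It is not the library's API: the attribute names (mup_type, mup_scaling_depth) and both helper functions are placeholders for whatever metadata uu.Parameter actually carries; the point is just that metadata is keyed by parameter name, which survives sharding, rather than stored as attributes on the parameter object.

```python
import torch


def build_param_scaling_table(model: torch.nn.Module) -> dict[str, dict]:
    """Record unit-scaling metadata per parameter *name* before sharding.

    Names survive fully_shard/DTensor wrapping, whereas custom attributes on
    uu.Parameter do not, so the optimizer can look scaling info up by name.
    """
    table = {}
    for name, param in model.named_parameters():
        # Assumption: unit-scaling metadata lives in attributes like these.
        table[name] = {
            "mup_type": getattr(param, "mup_type", None),
            "mup_scaling_depth": getattr(param, "mup_scaling_depth", None),
        }
    return table


def make_param_groups(model: torch.nn.Module, table: dict[str, dict], base_lr: float):
    """Build optimizer param groups from the lookup table rather than from
    attributes on the (possibly DTensor-wrapped) parameters themselves."""
    groups = []
    for name, param in model.named_parameters():
        meta = table.get(name, {})
        # Placeholder rule: derive the per-parameter lr from the recorded
        # metadata here, using whatever rule the unit-scaled optimizer applies.
        lr = base_lr
        groups.append({"params": [param], "lr": lr, "name": name})
    return groups


# Usage (sketch): build the table *before* fully_shard, build the optimizer after.
# table = build_param_scaling_table(model)
# fully_shard(model)
# optimizer = torch.optim.AdamW(make_param_groups(model, table, base_lr=1e-3))
```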
Hi @rlrs, apologies for the slow response - took some time off for the holidays.
I'm afraid we don't have a neat solution for either of the DTensor or torch.compile issues you highlight. On DTensor, it's definitely something we'd like the library to support, and we're aware we haven't touched it yet. We'd be very receptive to anything you wish to do here. If you decide to embark on an implementation we'd be keen to get involved however you wish (brainstorming the design, reviewing PRs, etc.), though at this time we don't have scope to lead on it.
For torch.compile I'm a little surprised. I haven't looked at this in a while, but from my recollection everything was fusing successfully and I don't recall any recompilation - I'd be keen to see an example of it not working. If you can send us any code/output here I'd be happy to debug it myself and try to identify the issue.
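In case it helps gather that output, one way to surface recompile reasons is dynamo's recompile logging. This is only a sketch under the assumption of a recent torch version (the logging API has moved around between releases); running with the environment variable TORCH_LOGS="recompiles" achieves the same thing.

```python
import torch

# Print the failing guard each time the compiled function recompiles.
torch._logging.set_logs(recompiles=True)

@torch.compile
def step(model, batch):
    return model(batch).sum()

# Run a few training steps; any "Recompiling function ..." lines in the log
# show which guards (e.g. on parameter attributes) are being invalidated.
```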