Fix: forbid repeated deepspeed.initialize on training objects #6874
base: master
Conversation
@microsoft-github-policy-service agree
This fix still interferes with existing unit tests. Let me double-check before we proceed.
Force-pushed from dc81325 to d1e7777 (commit message: "…peedEngine propagates flag from the internal model")
The unit tests in I am not able to check other unit tests due to GPU memory constraints.
deepspeed/__init__.py (outdated)

```python
if _is_initialized(model):
    raise ValueError(
        "Model has already been initialized, please make sure to only call deepspeed.initialize on a model once.")
if optimizer is not None and _is_initialized(optimizer):
```
Note that optimizer could be a Callable, not an object
https://github.com/microsoft/DeepSpeed/blob/4cd1d97460b677563d57f07a293724bdc02e0ef5/deepspeed/__init__.py#L71
deepspeed/__init__.py (outdated)

```python
    raise ValueError(
        "Optimizer has already been initialized, please make sure to only call deepspeed.initialize on an optimizer once."
    )
if lr_scheduler is not None and _is_initialized(lr_scheduler):
```
deepspeed/__init__.py (outdated)

```diff
@@ -137,6 +181,10 @@ def initialize(args=None,
     zero.partition_parameters.shutdown_init_context()

+    assert model is not None, "deepspeed.initialize requires a model"
+    # enforce that model, optimizer, and lr_scheduler have not been used in a previous deepspeed.initialize call
+    _assert_trainobjs_not_inited(model, optimizer, lr_scheduler)
```
I think this call should be moved into `_mark_trainobjs_initialized()`.
Thanks for the review @tjruwase.
Regarding,
I think if we still want to keep
Thanks for raising this important design issue. Below are my thoughts and/or clarifications.
Based on the above, I am aligned with your 3rd suggestion (below) of leveraging type information to simplify this PR.
What do you think?
Got it. Yeah, option 3 is technically what should be implemented instead of letting flags just fly around. Will get the update out by EOD.
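The type-based alternative (the 3rd suggestion) can be sketched roughly as follows. `DeepSpeedEngine` here is a stand-in class for illustration, not an import of the real `deepspeed.runtime.engine.DeepSpeedEngine`:

```python
class DeepSpeedEngine:
    """Stand-in for deepspeed.runtime.engine.DeepSpeedEngine."""
    def __init__(self, model):
        self.module = model

def check_not_already_engine(model):
    # A model that already went through deepspeed.initialize comes back
    # wrapped in an engine object, so the wrapper type itself signals
    # repeated initialization; no extra boolean flag is needed.
    if isinstance(model, DeepSpeedEngine):
        raise ValueError(
            "Model has already been initialized, please make sure to only "
            "call deepspeed.initialize on a model once.")
```

The appeal of this design is that the check relies on information the library already produces (the return type of a previous `initialize` call) rather than on mutable state attached to user objects.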
Force-pushed from 7f6fc1f to 2c5806b
@tjruwase Thanks for the suggestion. Please check the updated PR.

@tjruwase Hi Olatunji, are there any further modifications you'd like me to make?

@traincheck-team - could you look into the failures on the nv-torch-latest-v100 test? We've stabilized the CI and these look to be real failures.
@loadams Thanks for getting back so quickly, and I apologize for the late reply. I inspected the 9 tests that failed. It appears that these tests failed because they initialized deepspeed with the same model multiple times, which is the exact behavior this PR is trying to forbid. The PR itself is not buggy. For example:

As far as I can see, there are two solutions:
FYI @traincheck-team - there still appear to be some unit test errors if you could investigate.
Thanks for the heads up. Will do that soon.
FYI @traincheck-team - you'll need to sign off on the DCO requirement now that we've changed the GH organization/structure. Let me know if you have any questions on that.
Force-pushed from d2f315f to efeec37

Force-pushed from efeec37 to d2f315f
@loadams I tried to amend the six problematic commits using git rebase, but since there were previous merges from main into this branch, it caused conflicts and made things messy (e.g. see the requested reviewers of this PR as code owners). To fix this, I plan to create a new local branch from main, reapply the necessary changes, and then force-push it to replace this branch. This should give us a clean history while preserving the required commits. Let me know if you have any concerns or if there's a better way to proceed!
@loadams Hi Logan, I apologize for the late reply. I've reviewed the 9 unit test failures in the recent workflow run: https://github.com/deepspeedai/DeepSpeed/actions/runs/13205140637/job/36866442471. My understanding is that these failures are caused by repeated model initialization, which this PR now detects and raises exceptions for. Evidence:

Impacted tests: the failed unit tests span 3 files.

Suggested fix: the solution is straightforward and manageable; ensure a new model object is created before the second initialization, which does not require modifying any critical logic. I suggest handling these test fixes in a separate PR. If there are any additional failures beyond these 9, please let me know.
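The suggested test fix amounts to constructing a fresh model for every initialize call. A toy sketch with a stand-in for `deepspeed.initialize` (the guard logic here is assumed for illustration, not the real implementation):

```python
class TinyModel:
    # stand-in for the nn.Module used by the failing tests
    pass

def fake_initialize(model):
    # stand-in for deepspeed.initialize with this PR's guard:
    # reject any model that already carries the ds_is_inited flag
    if getattr(model, "ds_is_inited", False):
        raise ValueError("Model has already been initialized")
    model.ds_is_inited = True
    return model

m = fake_initialize(TinyModel())
# Reusing `m` in a second call would now raise; the test fix is simply
# to build a new model object for the second initialization:
m2 = fake_initialize(TinyModel())
```

This pattern changes only test setup code, not the logic under test, which is why the fix is low-risk.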
Previously closed PR: #6848

Partially fixes #6772, #6771, and #6770 by forbidding repeated initialization.

What is changed:
- Mark models that have gone through `deepspeed.initialize` with the flag `ds_is_inited = True`.
- Mark optimizers and lr_schedulers that have gone through `deepspeed.initialize` with the flag `ds_is_inited = True`.
- In `deepspeed.initialize`, raise an exception if `ds_is_inited == True` is detected on the input `model`, `optimizer`, or `lr_scheduler`.

Expected behavior: forbid repeated `deepspeed.initialize` invocations on `model`, `optimizer`, and `lr_scheduler` objects.