Fix: forbid repeated deepspeed.initialize on training objects #6874
base: master
Conversation
@microsoft-github-policy-service agree
This fix still interferes with existing unit tests. Let me double-check before we proceed.
Force-pushed from dc81325 to d1e7777 (commit message: "…peedEngine propagates flag from the internal model")
The unit tests in I am not able to check other unit tests due to GPU memory constraints.
deepspeed/__init__.py (outdated)

```python
if _is_initialized(model):
    raise ValueError(
        "Model has already been initialized, please make sure to only call deepspeed.initialize on a model once.")
if optimizer is not None and _is_initialized(optimizer):
```
Note that optimizer could be a Callable, not an object
https://github.com/microsoft/DeepSpeed/blob/4cd1d97460b677563d57f07a293724bdc02e0ef5/deepspeed/__init__.py#L71
deepspeed/__init__.py (outdated)

```python
    raise ValueError(
        "Optimizer has already been initialized, please make sure to only call deepspeed.initialize on an optimizer once."
    )
if lr_scheduler is not None and _is_initialized(lr_scheduler):
```
deepspeed/__init__.py (outdated)

```diff
@@ -137,6 +181,10 @@ def initialize(args=None,
     zero.partition_parameters.shutdown_init_context()

+    assert model is not None, "deepspeed.initialize requires a model"
+    # enforce that model, optimizer, and lr_scheduler have not been used in a previous deepspeed.initialize call
+    _assert_trainobjs_not_inited(model, optimizer, lr_scheduler)
```
I think this call should be moved into `_mark_trainobjs_initialized()`.
Thanks for the review @tjruwase.
Regarding,
I think if we still want to keep
Thanks for raising this important design issue. Below are my thoughts and/or clarifications.
Based on the above, I am aligned with your 3rd suggestion (below) of leveraging type information to simplify this PR.
What do you think?
Got it. Yeah, option 3 is technically what should be implemented instead of letting flags just fly around. Will get the update out by EOD.
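The type-based alternative (the 3rd suggestion) can be sketched roughly as follows. `DeepSpeedEngine` here is a stand-in class for illustration, not an import of the real `deepspeed.runtime.engine.DeepSpeedEngine`:

```python
class DeepSpeedEngine:
    """Stand-in for deepspeed.runtime.engine.DeepSpeedEngine."""
    def __init__(self, model):
        self.module = model

def check_not_already_engine(model):
    # A model that already went through deepspeed.initialize comes back
    # wrapped in an engine object, so the wrapper type itself signals
    # repeated initialization; no extra boolean flag is needed.
    if isinstance(model, DeepSpeedEngine):
        raise ValueError(
            "Model has already been initialized, please make sure to only "
            "call deepspeed.initialize on a model once.")
```

The appeal of this design is that the check relies on information the library already produces (the return type of a previous `initialize` call) rather than on mutable state attached to user objects.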
Force-pushed from 7f6fc1f to 2c5806b
@tjruwase Thanks for the suggestion. Please check the updated PR.

@tjruwase Hi Olatunji, are there any further modifications you'd like me to make?

@traincheck-team - could you look into the failures on the nv-torch-latest-v100 test? We've stabilized the CI and these look to be real failures.
@loadams Thanks for getting back so quickly, and I apologize for the late reply. I inspected the 9 tests that failed. It appears that these tests failed because they initialized deepspeed with the same model multiple times, which is the exact behavior this PR is trying to forbid. The PR itself is not buggy. For example:

As far as I can see, there are two solutions:
FYI @traincheck-team - there still appear to be some unit test errors if you could investigate.
Thanks for the heads up. Will do that soon.
FYI @traincheck-team - you'll need to sign off on the DCO requirement now that we've changed the GH organization/structure. Let me know if you have any questions on that.
Force-pushed from d2f315f to efeec37

Force-pushed from efeec37 to d2f315f
@loadams I tried to amend the six problematic commits using git rebase, but since there were previous merges from main into this branch, it caused conflicts and made things messy (e.g. see the requested reviewers of this PR as code owners). To fix this, I plan to create a new local branch from main, reapply the necessary changes, and then force-push it to replace this branch. This should give us a clean history while preserving the required commits. Let me know if you have any concerns or if there's a better way to proceed!
@loadams Hi Logan, I apologize for the late reply. I've reviewed the 9 unit test failures in the recent workflow run: https://github.com/deepspeedai/DeepSpeed/actions/runs/13205140637/job/36866442471. My understanding is that these failures are caused by repeated model initialization, which this PR now detects and raises exceptions for. Evidence:

Impacted tests: the failed unit tests span 3 files.

Suggested fix: the solution is straightforward and manageable; ensure a new model object is created before the second initialization, which does not require modifying any critical logic. I suggest handling these test fixes in a separate PR. If there are any additional failures beyond these 9, please let me know.
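The suggested test fix amounts to constructing a fresh model for every initialize call. A toy sketch with a stand-in for `deepspeed.initialize` (the guard logic here is assumed for illustration, not the real implementation):

```python
class TinyModel:
    # stand-in for the nn.Module used by the failing tests
    pass

def fake_initialize(model):
    # stand-in for deepspeed.initialize with this PR's guard:
    # reject any model that already carries the ds_is_inited flag
    if getattr(model, "ds_is_inited", False):
        raise ValueError("Model has already been initialized")
    model.ds_is_inited = True
    return model

m = fake_initialize(TinyModel())
# Reusing `m` in a second call would now raise; the test fix is simply
# to build a new model object for the second initialization:
m2 = fake_initialize(TinyModel())
```

This pattern changes only test setup code, not the logic under test, which is why the fix is low-risk.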
Previously closed PR: #6848

Partially fixes #6772, #6771, and #6770 by forbidding repeated initialization.

What is changed:
- Mark models that have gone through `deepspeed.initialize` with the flag `ds_is_inited = True`.
- Mark optimizers and lr_schedulers that have gone through `deepspeed.initialize` with the flag `ds_is_inited = True`.
- In `deepspeed.initialize`, raise an exception if `ds_is_inited == True` is detected on the input `model`, `optimizer`, or `lr_scheduler`.

Expected behavior: forbid repeated `deepspeed.initialize` invocations on `model`, `optimizer`, and `lr_scheduler` objects.