[Diffusers] Regression of CPU memory usage #738

Closed · 3 tasks done
JingyaHuang opened this issue Nov 18, 2024 · 6 comments
Labels: bug (Something isn't working)

JingyaHuang (Collaborator) commented Nov 18, 2024

Issue

We were able to run SDXL artifacts compiled on inf2.8xlarge on an inf2.xlarge instance (as stated in the blog). However, we recently found that SDXL's CPU memory usage has increased, leading to OOM during inference on inf2.xlarge. In this issue, we will note down some experiment results to trace where the regression was introduced.

Tasks

  • Latest Optimum Neuron (0.0.26) on Neuron SDK 2.15.0
  • Other Neuron SDK versions
  • PyTorch 1.13.1 vs. 2.1?

Reproduction (minimal, reproducible, runnable)

optimum-cli export neuron --model stabilityai/stable-diffusion-xl-base-1.0  --batch_size 1 --height 1024 --width 1024 --num_images_per_prompt 4 --auto_cast matmul --auto_cast_type bf16 sd_neuron_xl/
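
The contents of the inference script (test_sdxl.py in the traceback below) are not included in this issue; a minimal sketch of what is run on inf2.xlarge, assuming the artifacts exported by the command above are in sd_neuron_xl/, would look like this:

# Minimal inference sketch (assumed, not the exact test_sdxl.py used for these experiments).
from optimum.neuron import NeuronStableDiffusionXLPipeline

# Load the precompiled artifacts exported by the optimum-cli command above.
stable_diffusion = NeuronStableDiffusionXLPipeline.from_pretrained("sd_neuron_xl/")

prompt = "a photo of an astronaut riding a horse on mars"
image = stable_diffusion(prompt).images[0]
image.save("astronaut.png")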

Expected behavior

The pipeline should fit on inf2.xlarge after being compiled on inf2.8xlarge.

JingyaHuang added the bug label on Nov 18, 2024
JingyaHuang (Collaborator, Author)

Experiment 1: compiled with Neuron SDK 2.15.0 + Optimum Neuron 0.0.26:

Loading only U-Net into both Neuron Cores...
You have disabled the safety checker for <class 'optimum.neuron.modeling_diffusion.NeuronStableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
  0%|                                                                                                       | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/test_sdxl.py", line 5, in <module>
    image = stable_diffusion(prompt).images[0]
  File "/home/ubuntu/optimum-neuron/optimum/neuron/modeling_diffusion.py", line 1108, in __call__
    return self.auto_model_class.__call__(self, height=height, width=width, *args, **kwargs)
  File "/home/ubuntu/aws_neuron_venv_2.15.0/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/aws_neuron_venv_2.15.0/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 1020, in __call__
    noise_pred = self.unet(
  File "/home/ubuntu/optimum-neuron/optimum/neuron/modeling_diffusion.py", line 1137, in __call__
    return self.forward(*args, **kwargs)
  File "/home/ubuntu/optimum-neuron/optimum/neuron/modeling_diffusion.py", line 1228, in forward
    outputs = self.model(*inputs)
  File "/home/ubuntu/aws_neuron_venv_2.15.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/aws_neuron_venv_2.15.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/aws_neuron_venv_2.15.0/lib/python3.10/site-packages/torch_neuronx/xla_impl/data_parallel.py", line 254, in forward
    outputs = parallel_apply(
  File "/home/ubuntu/aws_neuron_venv_2.15.0/lib/python3.10/site-packages/torch_neuronx/xla_impl/data_parallel.py", line 404, in parallel_apply
    output.reraise()
  File "/home/ubuntu/aws_neuron_venv_2.15.0/lib/python3.10/site-packages/torch/_utils.py", line 694, in reraise
    raise exception
RuntimeError: Caught RuntimeError on neuroncore 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/aws_neuron_venv_2.15.0/lib/python3.10/site-packages/torch_neuronx/xla_impl/data_parallel.py", line 390, in _worker
    output = module(*input)
  File "/home/ubuntu/aws_neuron_venv_2.15.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/aws_neuron_venv_2.15.0/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
RuntimeError: forward() is missing value for argument 'argument_4'. Declaration: forward(__torch__.torch_neuronx.xla_impl.trace.___torch_mangle_7.NeuronModule self, Tensor argument_1, Tensor argument_2, Tensor argument_3, Tensor argument_4, Tensor argument_5) -> ((Tensor))

Segmentation fault (core dumped)
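
As a side note for tracing the regression, a rough way to see how much host RAM the load itself consumes is to log the process RSS before and after loading the pipeline (a sketch, not from the original report; assumes psutil is installed and the artifacts are in sd_neuron_xl/):

# Rough RSS probe (illustrative sketch only; requires `pip install psutil`).
import os
import psutil
from optimum.neuron import NeuronStableDiffusionXLPipeline

proc = psutil.Process(os.getpid())

def rss_gb():
    # Resident set size of the current process, in GiB.
    return proc.memory_info().rss / 1024**3

print(f"RSS before load: {rss_gb():.2f} GiB")
pipe = NeuronStableDiffusionXLPipeline.from_pretrained("sd_neuron_xl/")
print(f"RSS after load:  {rss_gb():.2f} GiB")  # compare against the 16 GiB of host RAM on inf2.xlarge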

JingyaHuang (Collaborator, Author) commented Nov 20, 2024

Experiment 2: Neuron SDK 2.16.1 + Optimum Neuron 0.0.18

model_index.json: 100%|███████████████████████████████████████████████████████████████████████████| 779/779 [00:00<00:00, 8.42MB/s]
tokenizer/special_tokens_map.json: 100%|██████████████████████████████████████████████████████████| 472/472 [00:00<00:00, 6.37MB/s]
text_encoder_2/config.json: 100%|█████████████████████████████████████████████████████████████| 1.42k/1.42k [00:00<00:00, 19.9MB/s]
tokenizer/tokenizer_config.json: 100%|████████████████████████████████████████████████████████████| 704/704 [00:00<00:00, 10.2MB/s]
tokenizer/merges.txt: 100%|█████████████████████████████████████████████████████████████████████| 525k/525k [00:00<00:00, 28.2MB/s]
tokenizer_2/special_tokens_map.json: 100%|████████████████████████████████████████████████████████| 460/460 [00:00<00:00, 6.92MB/s]
text_encoder/config.json: 100%|███████████████████████████████████████████████████████████████| 1.41k/1.41k [00:00<00:00, 20.1MB/s]
scheduler/scheduler_config.json: 100%|████████████████████████████████████████████████████████████| 582/582 [00:00<00:00, 9.11MB/s]
tokenizer_2/tokenizer_config.json: 100%|██████████████████████████████████████████████████████████| 855/855 [00:00<00:00, 13.5MB/s]
unet/config.json: 100%|███████████████████████████████████████████████████████████████████████| 2.82k/2.82k [00:00<00:00, 33.1MB/s]
vae_decoder/config.json: 100%|████████████████████████████████████████████████████████████████| 1.43k/1.43k [00:00<00:00, 21.5MB/s]
vae_encoder/config.json: 100%|████████████████████████████████████████████████████████████████| 1.44k/1.44k [00:00<00:00, 21.8MB/s]
tokenizer/vocab.json: 100%|███████████████████████████████████████████████████████████████████| 1.06M/1.06M [00:00<00:00, 9.99MB/s]
model.neuron: 100%|█████████████████████████████████████████████████████████████████████████████| 426M/426M [00:56<00:00, 7.54MB/s]
model.neuron: 100%|█████████████████████████████████████████████████████████████████████████████| 376M/376M [00:59<00:00, 6.28MB/s]
model.neuron: 100%|█████████████████████████████████████████████████████████████████████████████| 825M/825M [02:04<00:00, 6.64MB/s]
model.neuron: 100%|███████████████████████████████████████████████████████████████████████████| 1.79G/1.79G [02:52<00:00, 10.4MB/s]
model.neuron: 100%|███████████████████████████████████████████████████████████████████████████| 4.18G/4.18G [06:41<00:00, 10.4MB/s]
Fetching 20 files: 100%|███████████████████████████████████████████████████████████████████████████| 20/20 [06:41<00:00, 20.07s/it]
Passing the argument `library_name` to `get_supported_tasks_for_model_type` is required, but got library_name=None. Defaulting to `transformers`. An error will be raised in a future version of Optimum if `library_name` is not provided.
Passing the argument `library_name` to `get_supported_tasks_for_model_type` is required, but got library_name=None. Defaulting to `transformers`. An error will be raised in a future version of Optimum if `library_name` is not provided.
Passing the argument `library_name` to `get_supported_tasks_for_model_type` is required, but got library_name=None. Defaulting to `transformers`. An error will be raised in a future version of Optimum if `library_name` is not provided.
Passing the argument `library_name` to `get_supported_tasks_for_model_type` is required, but got library_name=None. Defaulting to `transformers`. An error will be raised in a future version of Optimum if `library_name` is not provided.
Passing the argument `library_name` to `get_supported_tasks_for_model_type` is required, but got library_name=None. Defaulting to `transformers`. An error will be raised in a future version of Optimum if `library_name` is not provided.
Loading only U-Net into both Neuron Cores...
Killed

CPU OOM. Next step: Neuron SDK 2.16.1 with Optimum Neuron 0.0.13.

JingyaHuang (Collaborator, Author)

Experiment 3: Neuron SDK 2.16.1 + Optimum Neuron 0.0.13

JingyaHuang (Collaborator, Author) commented Nov 22, 2024

Optimum Neuron v0.0.14 + Neuron SDK 2.16.1 ✔️

https://huggingface.co/Jingya/sd_neuron_xl_2.16.1_0.0.14
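
For anyone wanting to reproduce the working setup, these precompiled artifacts can presumably be loaded straight from the Hub with the matching versions (a sketch, assuming Neuron SDK 2.16.1 + Optimum Neuron 0.0.14 are installed):

# Sanity-check sketch for the known-good artifacts published on the Hub.
from optimum.neuron import NeuronStableDiffusionXLPipeline

pipe = NeuronStableDiffusionXLPipeline.from_pretrained("Jingya/sd_neuron_xl_2.16.1_0.0.14")
image = pipe("a photo of an astronaut riding a horse on mars").images[0]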

JingyaHuang (Collaborator, Author)

Optimum Neuron v0.0.15 + Neuron SDK 2.16.1 👎

Jingya/sd_neuron_xl_2.16.1_0.0.15

Loading only U-Net into both Neuron Cores...
Killed

The regression must have been introduced between v0.0.14 and v0.0.15.

JingyaHuang (Collaborator, Author)

The regression comes from a change in the order in which submodels are loaded: we need to load the UNet first, otherwise loading it causes CPU OOM once the other models (VAE, text encoders) are already in memory.

This regression was already corrected during the diffusion pipeline refactoring (#711) and has been included since the Optimum Neuron v0.0.26 release. In addition, a PR is open to further improve the loading order: #742.
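
To illustrate the idea (a sketch of the principle only, not the actual code in #711 or #742; the file layout is assumed from the download log above): the largest artifact, the UNet, is loaded while the most host RAM is still free, before the text encoders and VAE.

# Illustrative loading-order sketch (not the actual implementation from #711 / #742).
# Compiled Neuron artifacts are TorchScript modules and can be loaded with torch.jit.load.
import torch

submodel_paths = {
    "unet": "sd_neuron_xl/unet/model.neuron",                    # largest artifact, load it first
    "text_encoder": "sd_neuron_xl/text_encoder/model.neuron",
    "text_encoder_2": "sd_neuron_xl/text_encoder_2/model.neuron",
    "vae_encoder": "sd_neuron_xl/vae_encoder/model.neuron",
    "vae_decoder": "sd_neuron_xl/vae_decoder/model.neuron",
}

# dicts preserve insertion order, so the UNet is loaded before the smaller submodels.
submodels = {name: torch.jit.load(path) for name, path in submodel_paths.items()}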

Closing this issue as solved.
