introducing components manager #10572
base: main
Conversation
cc @a-r-r-o-w here too since you're working on an offloading strategy that's targeted at UI use cases (e.g. #10503); we should make it work with this API
cc @vladmandic too, I think it might be useful for SD Next. If that's the case, let us know if you have any feedback!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
we implemented our own custom offloading (on top of accelerate hooks) a while back, and it works as goal-based: move model components based on their size until the goal is reached (the goal being configurable min/max VRAM usage thresholds). I'm not sure how this compares since there are no notes on:
Clean 💚
Apart from the in-line comments I have some notes below:
will run a custom function to decide if it needs to offload other models (for example, very commonly a UI can decide whether to offload based on the available memory and model size, and it may have a set of custom logic that it uses to decide which model to offload)
Should this follow a common template so that users can supply their own implementations?
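For the sake of discussion, a user-supplied strategy could be a simple callable along these lines (the signature and names below are hypothetical, not the ones defined in this PR):

import torch

def my_offload_strategy(hooks, model, execution_device):
    # Hypothetical rule: offload every other managed model when loading this one
    # would leave less than 2 GB free on the execution device.
    model_size = sum(p.nelement() * p.element_size() for p in model.parameters())
    free_mem, _ = torch.cuda.mem_get_info(execution_device.index)
    if free_mem - model_size < 2 * 1024**3:
        return hooks  # offload all other models
    return []         # nothing needs to move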
when offloading, it tries to offload as little as possible, i.e. it will find the model(s) with the smallest total size that meets the memory requirement.
Let's say I have exhausted the available GPU memory. What happens in that case? Do I offload some parts of the model to CPU and possibly to disk i.e., the ones that didn't fit the GPU?
components.enable_auto_cpu_offload(device)
It's not a blocker, but we could think of allowing users to pass multiple devices here as well. Perhaps this is best done with a separate offload class, or should it rather be handled with device_map (pipeline-level) completely? Okay with me if this feels like we're digressing, or we can table the discussion for later.
you can print out the components to get an overview of all the model components you have, their device/dtype info; this is what I have before we run any pipelines.
This is SO GOOD 🔥
pipe = FluxPipeline.from_pretrained(repo, **components.get(["transformer","text_encoder","text_encoder_2","vae"]), torch_dtype=dtype)
So, in this case, users don't have to do any kind of device placement on the pipe, as the components have already been mapped when we called components.enable_auto_cpu_offload(device). Yeah?
Definitely not a blocker, but **components.get(["transformer","text_encoder","text_encoder_2","vae"]) assumes that the user knows the exact names of the components. Would it be possible to automatically infer these names to make it a tad easier? Or is it far-fetched?
this is what I got: the base transformer is now moved to cpu, and canny was moved to device; max memory stays at 36.28G
This is nice but also assumes that we have enough CPU memory, which starts to add up when there are multiple models to be kept on CPU. Should we expose an argument that lets users remove these components completely?
LMK if anything's unclear. Excited to see this getting shipped soon.
from ..utils import (
    is_accelerate_available,
    logging,
)
Nit.
Suggested change:
from ..utils import is_accelerate_available, logging
def get_memory_footprint(self, return_buffers=True):
    r"""
    Get the memory footprint of a model. This will return the memory footprint of the current model in bytes.
    Useful to benchmark the memory footprint of the current model and design some tests. Solution inspired from
    the PyTorch discussions: https://discuss.pytorch.org/t/gpu-memory-that-model-uses/56822/2

    Arguments:
        return_buffers (`bool`, *optional*, defaults to `True`):
            Whether to return the size of the buffer tensors in the computation of the memory footprint. Buffers
            are tensors that do not require gradients and not registered as parameters. E.g. mean and std in batch
            norm layers. Please see: https://discuss.pytorch.org/t/what-pytorch-means-by-buffers/120266/2
    """
    mem = sum([param.nelement() * param.element_size() for param in self.parameters()])
    if return_buffers:
        mem_bufs = sum([buf.nelement() * buf.element_size() for buf in self.buffers()])
        mem = mem + mem_bufs
    return mem
I think it's okay to have this independent method. Perhaps from the model-level implementation of get_memory_footprint() we can call this method and that allows us to reuse. Can be revisited later.
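A rough sketch of the reuse being suggested, assuming the standalone helper above is importable under the same name (illustrative only, not code from the PR):

# e.g. inside the model class (sketch):
def get_memory_footprint(self, return_buffers=True):
    # Delegate to the standalone helper so the size computation lives in one place.
    # Inside a method, the bare name resolves to the module-level function, not this method.
    return get_memory_footprint(self, return_buffers=return_buffers)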
def set_strategy(self, offload_strategy: "AutoOffloadStrategy"):
    self.offload_strategy = offload_strategy

def add_other_hook(self, hook: "UserCustomOffloadHook"):
    """
    Add a hook to the list of hooks to consider for offloading.
    """
    if self.other_hooks is None:
        self.other_hooks = []
    self.other_hooks.append(hook)
Might be nice to have utilities like:
delete_hooks()
list_current_hooks()
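For instance (sketching against the attributes in the snippet above; not part of the PR):

def delete_hooks(self):
    """Drop all hooks currently registered for offloading consideration."""
    self.other_hooks = []

def list_current_hooks(self):
    """Return the hooks currently considered for offloading (empty list if none)."""
    return list(self.other_hooks or [])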
import time

# YiYi Notes: only logging time for now to monitor the overhead of offloading strategy (remove later)
start_time = time.perf_counter()
No strong opinions, but I think we could keep these with logger.debug(); it's very useful, IMO.
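Something like the following, assuming the module keeps a logger created via diffusers' logging utilities:

import time
from diffusers.utils import logging

logger = logging.get_logger(__name__)

start_time = time.perf_counter()
# ... apply the offload strategy ...
logger.debug(f"offload strategy took {time.perf_counter() - start_time:.4f}s")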
def __init__(
    self,
    execution_device: Optional[Union[str, int, torch.device]] = None,
    other_hooks: Optional[List["UserCustomOffloadHook"]] = None,
Should this just be called hooks? Or are we calling it other_hooks to avoid any potential overlaps with naming? Edit: I think other_hooks is better.
if hooks_to_offload:
    clear_device_cache()
module.to(self.execution_device)
Please help me understand why we need to do this additional device placement provided the call to self.offload_strategy() above (when it's not None)? Do we have to guard the placement with if self.offload_strategy is None?
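To make the question concrete, this is the variant being asked about (purely illustrative; whether it is correct depends on what self.offload_strategy() already does with device placement):

if hooks_to_offload:
    clear_device_cache()
if self.offload_strategy is None:
    # only fall back to a plain device placement when no strategy handled it
    module.to(self.execution_device)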
class UserCustomOffloadHook:
    """
    A simple hook grouping a model and a `CustomOffloadHook`, which provides easy APIs to call the init method of
    the hook or remove it entirely.
    """
Do users need to override this class in case they want to customize one?
current_module_size = get_memory_footprint(model)

mem_on_device = torch.cuda.mem_get_info(execution_device.index)[0]
This needs to be guarded by a check that CUDA is available. So, perhaps we can have a dispatching system to obtain the free device memory based on the device being used.
Regardless of the dispatching system, this call needs to be guarded, I think.
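One possible shape for that guarded/dispatched lookup (a sketch, not the PR's implementation; only the CUDA path is filled in):

import torch

def get_free_device_memory(device: torch.device):
    """Return the free memory in bytes on the execution device."""
    if device.type == "cuda" and torch.cuda.is_available():
        free_mem, _ = torch.cuda.mem_get_info(device.index)
        return free_mem
    # Other backends (mps, xpu, cpu, ...) would need their own query here; until then,
    # report "unlimited" so the strategy never forces an offload for them.
    return float("inf")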
This PR introduces a "components manager" (a.k.a. a diffusers built-in "model management system").
It was mainly made for modular diffusers, but IMO it is also useful for regular pipelines, and I think I'll be able to demonstrate its behavior with the regular pipeline use case. So I'm making a separate PR here for easier review, and I can start iterating on this feature before the modular diffusers PR is ready for review.
Motivations/objective:
How does it work?
we add an accelerate hook to all models where we:
before the forward pass of each model, if the model is not already on the execution device, the hook moves the model there along with all its inputs, and runs a custom function to decide whether it needs to offload other models (for example, very commonly a UI can decide whether to offload based on the available memory and model size, and it may have a set of custom logic it uses to decide which model to offload)
so it is a little bit similar to our sequential cpu model offload, but it is a lot more flexible.
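A much-simplified sketch of that mechanism (not the PR's exact code; the strategy call and hook.offload() here are assumptions about the interface):

import torch
from accelerate.hooks import ModelHook
from accelerate.utils import send_to_device

class AutoOffloadHook(ModelHook):
    def __init__(self, execution_device, offload_strategy=None, other_hooks=None):
        self.execution_device = torch.device(execution_device)
        self.offload_strategy = offload_strategy  # user-configurable decision function
        self.other_hooks = other_hooks or []      # hooks attached to the other managed models

    def pre_forward(self, module, *args, **kwargs):
        # move the model (and its inputs) to the execution device only when needed
        if next(module.parameters()).device != self.execution_device:
            if self.offload_strategy is not None:
                # ask the strategy which other models should be moved off the device first
                for hook in self.offload_strategy(self.other_hooks, module, self.execution_device):
                    hook.offload()  # assumed helper that moves that model back to cpu
            module.to(self.execution_device)
        return send_to_device(args, self.execution_device), send_to_device(kwargs, self.execution_device)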
an example
in this example, we'll work with 3 Flux-related workflows at the same time: flux text2img, flux canny control, and flux depth control. we demo a default strategy we made that's basically: when offloading, it tries to offload as little as possible, i.e. it will find the model(s) with the smallest total size that meets the memory requirement, and there is a memory_reserve_margin that you can use to adjust how aggressively it offloads, e.g. if your model size is 20G and you think the actual memory used would be around 25G, I would set memory_reserve_margin=5G. So if you're getting an OOM and the offloading strategy applied wasn't aggressive enough, i.e. it still left unused models on device, you can reduce this number; otherwise, increase it. This strategy is really just an example to show users how they can set their own strategies; you can totally use a different strategy, so feel free to help brainstorm.
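For illustration, the margin could factor into the decision roughly like this (a sketch; the helper name and signature are made up, not this PR's API):

import torch

def pick_models_to_offload(on_device_models, incoming_model_size, device, memory_reserve_margin=5 * 1024**3):
    """Greedy approximation: free at least (incoming_model_size + margin - free_mem) bytes."""
    free_mem, _ = torch.cuda.mem_get_info(device.index)
    needed = incoming_model_size + memory_reserve_margin - free_mem
    if needed <= 0:
        return []  # enough room already, nothing gets offloaded

    def size_of(model):
        return sum(p.nelement() * p.element_size() for p in model.parameters())

    selected, freed = [], 0
    for model in sorted(on_device_models, key=size_of):
        selected.append(model)
        freed += size_of(model)
        if freed >= needed:
            break
    return selected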
I made a colab notebook here too, https://colab.research.google.com/drive/1EVVS8ai4qIW5Ca2CcSGz5VsdvD_N4N8_?usp=sharing
first, let's set up and define inputs, etc.
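Roughly (repo ids, prompt, and dtype below are placeholders standing in for the notebook's actual values):

import torch

dtype = torch.bfloat16
device = torch.device("cuda:0")

repo = "black-forest-labs/FLUX.1-dev"
canny_repo = "black-forest-labs/FLUX.1-Canny-dev"
depth_repo = "black-forest-labs/FLUX.1-Depth-dev"

prompt = "a photo of a cat wearing a tiny wizard hat"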
now let's create the components manager and add all the models we need to use: that includes the CLIP/T5 text encoders, 3 flux transformers, and the vae
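Sketch of this step; the manager's class and method names below follow this PR's description but may not match the final API exactly:

from transformers import CLIPTextModel, T5EncoderModel
from diffusers import AutoencoderKL, FluxTransformer2DModel
from diffusers import ComponentsManager  # import path assumed

components = ComponentsManager()
components.add("text_encoder", CLIPTextModel.from_pretrained(repo, subfolder="text_encoder", torch_dtype=dtype))
components.add("text_encoder_2", T5EncoderModel.from_pretrained(repo, subfolder="text_encoder_2", torch_dtype=dtype))
components.add("vae", AutoencoderKL.from_pretrained(repo, subfolder="vae", torch_dtype=dtype))
components.add("transformer", FluxTransformer2DModel.from_pretrained(repo, subfolder="transformer", torch_dtype=dtype))
components.add("transformer_canny", FluxTransformer2DModel.from_pretrained(canny_repo, subfolder="transformer", torch_dtype=dtype))
components.add("transformer_depth", FluxTransformer2DModel.from_pretrained(depth_repo, subfolder="transformer", torch_dtype=dtype))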
you can print out the components to get an overview of all the model components you have, with their device/dtype info; this is what I have before we run any pipelines. You can see that all the models add up to something around 76G, and in the colab notebook instance I have around 40G of memory.
now let's apply the custom offloading strategy on the components manager
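Applying it is the single call quoted earlier in this thread:

components.enable_auto_cpu_offload(device)
print(components)  # re-check the overview of every managed component (device/dtype info)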
run the first workflow
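Using the pipeline construction quoted earlier in this conversation, plus a peak-memory readout (the generation arguments are illustrative):

from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    repo,
    **components.get(["transformer", "text_encoder", "text_encoder_2", "vae"]),
    torch_dtype=dtype,
)
# no pipe.to(device) needed: the hooks move each model to the execution device on demand
image = pipe(prompt, num_inference_steps=28, guidance_scale=3.5).images[0]
print(f"max memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")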
this is what I got after the first workflow: you can see that all the models needed for text2image were moved to device and kept there; the max memory was 36.17G (less than the available memory), so we were able to run this workflow without offloading any model
now let's clear the memory cache and run the second workflow
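Along these lines (the FluxControlPipeline usage and the control image path are illustrative; components.get() keys follow the names registered above):

from diffusers import FluxControlPipeline
from diffusers.utils import load_image

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

canny_components = dict(components.get(["transformer_canny", "text_encoder", "text_encoder_2", "vae"]))
canny_pipe = FluxControlPipeline.from_pretrained(
    canny_repo,
    transformer=canny_components.pop("transformer_canny"),
    **canny_components,
    torch_dtype=dtype,
)
control_image = load_image("canny_edge_map.png")  # placeholder: a preprocessed Canny edge map
image = canny_pipe(prompt=prompt, control_image=control_image, num_inference_steps=28).images[0]
print(f"max memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")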
this is what I got: the base transformer is now moved to cpu, and canny was moved to device; max memory stays at 36.28G
now run the last workflow