Add EasyAnimateV5.1 text-to-video, image-to-video, control-to-video generation model #10626
base: main
Conversation
Thank you for the PR @bubbliiiing! This is in great shape and already mostly in the implementation style used in diffusers 🤗
I've left some comments from a quick look through the PR. Happy to help make any of the required changes to bring it to completion.
encoder_hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
image_rotary_emb: Optional[torch.Tensor] = None,
attn2: Attention = None,
This seems similar to Flux/SD3/HunyuanVideo's Joint-attention processors that concatenate the visual and text tokens. Let's do it the same way as done here:
self.attn = Attention(
class HunyuanVideoAttnProcessor2_0:
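For reference, a minimal sketch of the concatenate-then-split pattern those processors use is below. It assumes the standard diffusers Attention module (to_q/to_k/to_v and to_out projections); the class name is made up for illustration and this is not the actual implementation.

```python
# Illustrative only -- not the diffusers implementation.
from typing import Optional, Tuple

import torch
import torch.nn.functional as F


class JointAttnProcessorSketch:  # hypothetical name
    def __call__(
        self,
        attn,  # a diffusers Attention module
        hidden_states: torch.Tensor,
        encoder_hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        image_rotary_emb: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        text_seq_length = encoder_hidden_states.shape[1]

        # Concatenate text and visual tokens into a single sequence.
        hidden_states = torch.cat([encoder_hidden_states, hidden_states], dim=1)

        query = attn.to_q(hidden_states)
        key = attn.to_k(hidden_states)
        value = attn.to_v(hidden_states)

        batch_size, seq_len, _ = hidden_states.shape
        head_dim = query.shape[-1] // attn.heads
        query = query.view(batch_size, seq_len, attn.heads, head_dim).transpose(1, 2)
        key = key.view(batch_size, seq_len, attn.heads, head_dim).transpose(1, 2)
        value = value.view(batch_size, seq_len, attn.heads, head_dim).transpose(1, 2)

        # Rotary embeddings would normally be applied to the visual tokens here; omitted for brevity.
        hidden_states = F.scaled_dot_product_attention(query, key, value, attn_mask=attention_mask)
        hidden_states = hidden_states.transpose(1, 2).reshape(batch_size, seq_len, attn.heads * head_dim)
        hidden_states = attn.to_out[0](hidden_states)

        # Split back into text and visual streams before returning.
        encoder_hidden_states, hidden_states = (
            hidden_states[:, :text_seq_length],
            hidden_states[:, text_seq_length:],
        )
        return hidden_states, encoder_hidden_states
```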
There are two things we could do here:
- Either convert the state dicts of the original-format models (that you currently have on the HuggingFace Hub) and update them to the diffusers format, which would map attn2.to_q -> add_q_proj, attn2.to_k -> add_k_proj, and attn2.to_v -> add_v_proj.
- Or create a custom attention class similar to Attention and MochiAttention, in which you are free to use layer naming of your choice (so basically keeping the same to_q, to_k and to_v).

The first approach is more closely aligned with diffusers code style but would require you to update multiple checkpoints (a rough key-remapping sketch is included after the list below). Since we are transitioning to a single-file modeling format, choosing the second approach for convenience works for us as well. Essentially, irrespective of the design you choose, we need to make sure:
- When the forward of a layer is called, it only takes tensors as input and produces tensors as output.
- Taking intermediate layers as input to forward, or calling other layers out of order, is not supported by the design of several current and upcoming features.
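For the first option, the conversion mostly amounts to key renaming. A rough sketch follows; the original-checkpoint key fragments and the target attribute prefix are assumptions for illustration, not the actual EasyAnimate keys.

```python
# Illustrative key-remapping sketch for converting an original-format checkpoint
# to diffusers format. The exact key fragments below are assumptions.
from typing import Dict

import torch

ATTN2_KEY_MAP = {
    "attn2.to_q": "attn1.add_q_proj",
    "attn2.to_k": "attn1.add_k_proj",
    "attn2.to_v": "attn1.add_v_proj",
}


def convert_state_dict(old_state_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    new_state_dict = {}
    for key, value in old_state_dict.items():
        new_key = key
        for old_fragment, new_fragment in ATTN2_KEY_MAP.items():
            if old_fragment in key:
                new_key = key.replace(old_fragment, new_fragment)
                break
        new_state_dict[new_key] = value
    return new_state_dict
```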
cc @DN6 here in case you have thoughts about the single file format and model-specific Attention classes
I moved attn2 to the processor's init; does this meet the requirements?
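For clarity, a minimal sketch of what "attn2 in the processor's init" could look like is below; the class name is hypothetical and the attention body is intentionally omitted, since the open question above is whether storing a module on the processor satisfies the tensors-in/tensors-out rule.

```python
# Hypothetical sketch: the second attention module is stored on the processor at
# construction time rather than being passed into the forward call.
from typing import Optional

import torch


class EasyAnimateAttnProcessorSketch:  # hypothetical name
    def __init__(self, attn2=None):
        # Optional second Attention module handling the text stream.
        self.attn2 = attn2

    def __call__(
        self,
        attn,
        hidden_states: torch.Tensor,
        encoder_hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        image_rotary_emb: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        # Attention computation using attn (and self.attn2 if set) would go here.
        raise NotImplementedError
```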
for name, module in self.named_children():
    _set_3dgroupnorm_for_submodule(name, module)

def single_forward(self, x: torch.Tensor) -> torch.Tensor:
This is very different from the diffusers-style implementation of encoders/decoders. Could we follow the style used in:
class Encoder(nn.Module):
class AutoencoderKLMochi(ModelMixin, ConfigMixin):
class HunyuanVideoEncoder3D(nn.Module):
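To illustrate the style being asked for, here is a minimal sketch: submodules are created in __init__ and the forward simply chains them, with tensors in and tensors out. The block structure and names are placeholders, not the actual EasyAnimate or Mochi/HunyuanVideo layers.

```python
# Illustrative sketch of the diffusers encoder style only.
import torch
import torch.nn as nn


class Encoder3DSketch(nn.Module):  # hypothetical name
    def __init__(
        self,
        in_channels: int = 3,
        block_out_channels=(128, 256, 512, 512),
        latent_channels: int = 16,
    ):
        super().__init__()
        self.conv_in = nn.Conv3d(in_channels, block_out_channels[0], kernel_size=3, padding=1)

        # Real implementations use proper down blocks; plain convolutions stand in here.
        self.down_blocks = nn.ModuleList([])
        for i in range(len(block_out_channels) - 1):
            self.down_blocks.append(
                nn.Conv3d(block_out_channels[i], block_out_channels[i + 1], kernel_size=3, stride=2, padding=1)
            )

        self.conv_norm_out = nn.GroupNorm(num_groups=32, num_channels=block_out_channels[-1])
        self.conv_act = nn.SiLU()
        self.conv_out = nn.Conv3d(block_out_channels[-1], 2 * latent_channels, kernel_size=3, padding=1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Tensors in, tensors out: no layers are passed around at call time.
        hidden_states = self.conv_in(hidden_states)
        for down_block in self.down_blocks:
            hidden_states = down_block(hidden_states)
        hidden_states = self.conv_norm_out(hidden_states)
        hidden_states = self.conv_act(hidden_states)
        hidden_states = self.conv_out(hidden_states)
        return hidden_states
```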
Sorry, does this mean that I cannot use functions like set_padding_one_frame?
Do I need to use this conv_cache in autoencoder_kl_mochi?
In the model implementations, we usually try to keep only (at least in the latest model integrations):
- Submodel initializations
- Forward method
So, unless a helper function like set_padding_one_frame is used in multiple locations, I would suggest directly substituting its code into the forward implementation. If a helper function is required, let's make it private by prefixing its name with an underscore.
The conv_cache saves a few computations when running the VAE encode/decode process from repeated frames that are used as padding. As such, it is not required to implement it if it is not needed for framewise encoding and decoding.
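To make the conv_cache idea concrete, here is a rough sketch of the pattern some diffusers video VAEs use: a causal 3D convolution carries the trailing frames of the previous chunk forward so framewise encoding/decoding does not re-pad from scratch. The class name and padding scheme are assumptions, not the Mochi/CogVideoX implementation.

```python
# Illustrative sketch of a causal conv with a conv_cache for framewise processing.
from typing import Optional, Tuple

import torch
import torch.nn as nn


class CausalConv3dSketch(nn.Module):  # hypothetical name
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        self.time_pad = kernel_size - 1
        # Spatial padding keeps height/width; temporal padding is handled manually below.
        self.conv = nn.Conv3d(in_channels, out_channels, kernel_size, padding=(0, 1, 1))

    def forward(
        self, hidden_states: torch.Tensor, conv_cache: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        if conv_cache is None:
            # First chunk: pad by repeating the first frame.
            pad = hidden_states[:, :, :1].repeat(1, 1, self.time_pad, 1, 1)
        else:
            # Later chunks: reuse the trailing frames cached from the previous chunk.
            pad = conv_cache
        hidden_states = torch.cat([pad, hidden_states], dim=2)
        new_conv_cache = hidden_states[:, :, -self.time_pad:]
        return self.conv(hidden_states), new_conv_cache
```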
Sorry for not standardizing some parts; I will make the necessary modifications. Also, I would like to ask whether I need to add test files.
Yes, we will need a test for all three pipelines as well as model tests.
Also, congratulations on the release! I tried out the original repository example and the model is very good! 🎉
What does this PR do?
This PR converts the EasyAnimateV5.1 model into a diffusers-supported inference model, including three complete pipelines and corresponding modules.
Before submitting
documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@a-r-r-o-w