@shonenkov Great work everyone!
As far as I can tell, there is only a VAE decoder with DWT and no corresponding encoder.
Encoding with get_vae(dwt=True) produces the same number of tokens as get_vae(dwt=False) for the same picture size, but the tokens themselves are different. The DWT decoder also doubles the original image size. The result is large but blurry, and I see quality loss even after scaling it back down to the original image size. An image encoded and decoded with the default VQGAN model still looks better than one from the DWT model.
@bes-dev Is this because the model needs to be retrained end to end, as you mentioned in #42?
I would expect a compatible VAE DWT encoder to encode a 512x512 image into 1024 tokens and the decoder to restore the image back to 512x512.
I think for now the VAE with DWT needs 256x256 image prompts rather than 512x512, but then the resulting quality is unfortunately not worth the effort. Looking forward to seeing DALL-E trained end-to-end on 512x512 images.
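For reference, the token arithmetic behind this expectation can be sketched roughly as follows. This is only a sketch: it assumes the default VAE downsamples by a factor of 8 per side (my assumption, not something stated in this thread), and the helper names are hypothetical and not part of the rudalle API.

```python
import math

# ASSUMPTION: the default VAE maps each 8x8 pixel patch to one token.
ASSUMED_DEFAULT_FACTOR = 8


def token_count(image_side: int, factor: int) -> int:
    """Number of tokens for a square image at a given per-side downsampling factor."""
    if image_side % factor != 0:
        raise ValueError("image side must be divisible by the downsampling factor")
    grid_side = image_side // factor
    return grid_side * grid_side


def required_factor(image_side: int, tokens: int) -> int:
    """Per-side downsampling factor an encoder needs to emit `tokens` tokens."""
    grid_side = math.isqrt(tokens)
    if grid_side * grid_side != tokens:
        raise ValueError("token count must be a perfect square")
    return image_side // grid_side


# A 256x256 prompt gives 1024 tokens under the assumed factor of 8.
tokens_256 = token_count(256, ASSUMED_DEFAULT_FACTOR)

# The DWT decoder doubles the side length: 1024 tokens decode to 512x512.
decoded_side_dwt = 2 * 256

# A matching encoder for 512x512 input would need a factor of 16
# to produce the same 1024 tokens.
matching_factor = required_factor(decoded_side_dwt, tokens_256)

print(tokens_256, decoded_side_dwt, matching_factor)
```

Under that assumption, the current encoder only round-trips cleanly with the DWT decoder when the prompt is 256x256, which is the mismatch described above.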
@ink1 Yes, the available checkpoint of the DWT VQVAE was trained for only a few iterations on a small dataset as a proof of concept. To achieve production quality, it should be trained longer on a larger dataset. At the moment I don't have enough resources to do that, but I think the Sber guys will do it on their side.