I first trained the VQVAE (Vector Quantized Variational Autoencoder) on the CelebA dataset, which contains 200,000+ images and can be found on Kaggle.
After 2 epochs, this is a sample from the VQVAE:
Model weights can be found here.
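
For readers new to the "vector quantized" part of the VQVAE, here is a minimal, dependency-free sketch of the nearest-codebook lookup that sits between the encoder and the decoder. It is not the code used in this project: the `quantize` helper, the tiny 2-D `codebook`, and the `encoder_outputs` values are all hypothetical and chosen only for illustration. A real model learns the codebook during training and runs this lookup on tensors.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector`."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


# Hypothetical codebook with three 2-D entries (assumption for illustration;
# trained models typically use many more, higher-dimensional codes).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Hypothetical encoder outputs, one vector per latent position.
encoder_outputs = [[0.9, 1.2], [-0.2, 0.1], [-0.8, 1.7]]

# Each encoder output is replaced by the index of its nearest code.
indices = [quantize(z, codebook)[0] for z in encoder_outputs]
print(indices)
```

The decoder then receives the selected codebook vectors rather than the raw encoder outputs, so each image is represented by a grid of discrete indices.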
The dataset used for training the UNET model is CelebaText.
Then, using the trained VQVAE, I trained the diffusion model. Each epoch was taking a bit longer, so I decided to stop at 5 epochs. Below is a sample generated from the prompt 'He is a man':