A Windows-compatible fork of the ShivamShrirao/diffusers DreamBooth xformers example, with prebuilt dependencies and additional tools for ease of use.
I am not actively developing or maintaining this project. While I may respond to some issues, my presence shouldn't be relied upon, as I am moving on to other projects. Issues remain open for peer-to-peer support. Inactive issues are marked stale after 60 days and closed 7 days later.
Open an Anaconda prompt and create the environment:

```bat
conda create -n dreambooth-sd-xformers python=3.8
conda activate dreambooth-sd-xformers
```
Install the requirements:

```bat
conda install torchvision==0.13.1 -c pytorch -c conda-forge
pip install ./deps/diffusers-0.7.0.dev0-py3-none-any.whl
pip install ./deps/xformers-0.0.14.dev0-cp38-cp38-win_amd64.whl
pip install -r requirements.txt
```
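As a quick optional sanity check, the installed versions can be printed from Python (a minimal sketch; the expected torch version assumes torchvision 0.13.1 pulled in torch 1.12.1):

```python
import torch
import diffusers
import xformers

print(torch.__version__)      # expected: 1.12.1 (installed alongside torchvision 0.13.1)
print(diffusers.__version__)  # expected: 0.7.0.dev0
print(xformers.__version__)   # expected: 0.0.14.dev0
```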
Copy `deps/bitsandbytes-win-prebuilt/*` to `C:\Users\%username%\.conda\envs\dreambooth-sd-xformers\Lib\site-packages\bitsandbytes` so that the `.dll` files sit alongside the `.so` files.
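For example, from the repository root (a one-liner sketch, assuming the default Anaconda environment location):

```bat
xcopy /y deps\bitsandbytes-win-prebuilt\* C:\Users\%username%\.conda\envs\dreambooth-sd-xformers\Lib\site-packages\bitsandbytes\
```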
Browse to `C:\Users\%username%\.conda\envs\dreambooth-sd-xformers\Lib\site-packages\bitsandbytes`.

In `cextension.py` (~line 91), replace

```python
self.lib = ct.cdll.LoadLibrary(binary_path)
```

with

```python
self.lib = ct.cdll.LoadLibrary(str(binary_path))
```
In `cuda_setup/main.py` (~line 119), replace

```python
if not torch.cuda.is_available(): return 'libsbitsandbytes_cpu.so', None, None, None, None
```

with

```python
if torch.cuda.is_available(): return 'libbitsandbytes_cuda116.dll', None, None, None, None
if not torch.cuda.is_available(): return 'libsbitsandbytes_cpu.dll', None, None, None, None
```
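To confirm the patch took, importing bitsandbytes in the environment should load the CUDA DLL without errors (a minimal check; the startup log text varies by bitsandbytes version):

```python
import torch
import bitsandbytes as bnb  # startup log should mention libbitsandbytes_cuda116.dll

print(torch.cuda.is_available())  # True on a working CUDA install
print(bnb.optim.Adam8bit)         # the 8-bit optimizer class used via --use_8bit_adam
```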
Log in to the Hugging Face CLI and configure accelerate:

```bat
huggingface-cli login
accelerate config
```
Train and convert using `dreambooth.bat`:

1. Download any Stable Diffusion model to work off of (e.g. stable-diffusion-v-1-4-original), convert it to the Diffusers format with the batch script, and place it into the `models` folder.
2. Place input images into `data/NAME/images`.
3. Generate class images and train with the batch script. (Optionally, when satisfied, convert the model back to an SD `.ckpt` for use with AUTOMATIC1111's webui and the like.)

Use the table below to find the best config for training. To adjust the batch script config, edit lines 8-20 of `dreambooth.bat` to fit your needs. A backup of `dreambooth.bat` is stored in `deps` in case of damage.
With just xformers memory-efficient flash attention, the model uses 15.79 GB of VRAM with `--gradient_checkpointing`, and 17.7 GB without it; neither loses any precision. `gradient_checkpointing` recomputes intermediate activations during the backward pass instead of storing them, saving memory at the cost of some speed. Caching the outputs of the VAE and text encoder, then freeing those models, also helped reduce memory.
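To illustrate the recomputation trade-off, here is a minimal sketch of the mechanism using `torch.utils.checkpoint` (illustrative only, not code from the training script):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Gradient checkpointing: don't store this block's intermediate activations;
# recompute them during backward, trading extra compute for lower memory.
block = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())
x = torch.randn(1, 512, requires_grad=True)

y = checkpoint(block, x)  # forward pass without saving intermediates
y.sum().backward()        # intermediates are recomputed here
```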
Use the table below to choose the best flags for your memory and speed requirements. Tested on a Tesla T4 GPU.
| fp16 | train_batch_size | gradient_accumulation_steps | gradient_checkpointing | use_8bit_adam | VRAM usage (GB) | Speed (it/s) |
|---|---|---|---|---|---|---|
| fp16 | 1 | 1 | TRUE | TRUE | 9.92 | 0.93 |
| no | 1 | 1 | TRUE | TRUE | 10.08 | 0.42 |
| fp16 | 2 | 1 | TRUE | TRUE | 10.4 | 0.66 |
| fp16 | 1 | 1 | FALSE | TRUE | 11.17 | 1.14 |
| no | 1 | 1 | FALSE | TRUE | 11.17 | 0.49 |
| fp16 | 1 | 2 | TRUE | TRUE | 11.56 | 1.00 |
| fp16 | 2 | 1 | FALSE | TRUE | 13.67 | 0.82 |
| fp16 | 1 | 2 | FALSE | TRUE | 13.7 | 0.83 |
| fp16 | 1 | 1 | TRUE | FALSE | 15.79 | 0.77 |
DreamBooth is a method for personalizing text-to-image models like Stable Diffusion given just a few (3-5) images of a subject. A basic training run looks like this:
```bat
set MODEL_NAME="path-to-sd-model"
set INSTANCE_DIR="path-to-instance-images"
set OUTPUT_DIR="path-to-save-model"

accelerate launch train_dreambooth.py ^
  --pretrained_model_name_or_path=%MODEL_NAME% ^
  --instance_data_dir=%INSTANCE_DIR% ^
  --output_dir=%OUTPUT_DIR% ^
  --instance_prompt="a photo of sks dog" ^
  --resolution=512 ^
  --train_batch_size=1 ^
  --gradient_accumulation_steps=1 ^
  --learning_rate=5e-6 ^
  --lr_scheduler="constant" ^
  --lr_warmup_steps=0 ^
  --max_train_steps=400
```
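After training, you can smoke-test the saved weights with a quick inference sketch (paths and prompt are placeholders):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the trained Diffusers-format model from OUTPUT_DIR and generate a sample.
pipe = StableDiffusionPipeline.from_pretrained("path-to-save-model", torch_dtype=torch.float16).to("cuda")
image = pipe("a photo of sks dog in a bucket", num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("dog-bucket.png")
```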
Prior preservation is used to avoid overfitting and language drift; refer to the paper to learn more about it. For prior preservation, we first generate images using the model with a class prompt, then use those images during training along with our data. According to the paper, it's recommended to generate `num_epochs * num_samples` images for prior preservation; 200-300 works well for most cases.
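The training script generates these class images automatically when `--num_class_images` exceeds what is already in the class directory, but if you prefer to pre-generate them yourself, a standalone sketch might look like this (paths and prompt are placeholders):

```python
import os
import torch
from diffusers import StableDiffusionPipeline

# Pre-generate class images for prior preservation using the base model.
pipe = StableDiffusionPipeline.from_pretrained("path-to-sd-model", torch_dtype=torch.float16).to("cuda")

os.makedirs("path-to-class-images", exist_ok=True)
for i in range(200):  # matches --num_class_images=200 below
    image = pipe("a photo of dog").images[0]
    image.save(f"path-to-class-images/class_{i:04d}.png")
```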
```bat
set MODEL_NAME="path-to-sd-model"
set INSTANCE_DIR="path-to-instance-images"
set CLASS_DIR="path-to-class-images"
set OUTPUT_DIR="path-to-save-model"

accelerate launch train_dreambooth.py ^
  --pretrained_model_name_or_path=%MODEL_NAME% ^
  --instance_data_dir=%INSTANCE_DIR% ^
  --class_data_dir=%CLASS_DIR% ^
  --output_dir=%OUTPUT_DIR% ^
  --with_prior_preservation --prior_loss_weight=1.0 ^
  --instance_prompt="a photo of sks dog" ^
  --class_prompt="a photo of dog" ^
  --resolution=512 ^
  --train_batch_size=1 ^
  --gradient_accumulation_steps=1 ^
  --learning_rate=5e-6 ^
  --lr_scheduler="constant" ^
  --lr_warmup_steps=0 ^
  --num_class_images=200 ^
  --max_train_steps=800
```
With the help of gradient checkpointing and the 8-bit optimizer from bitsandbytes, it's possible to train DreamBooth on a 16 GB GPU:
```bat
set MODEL_NAME="path-to-sd-model"
set INSTANCE_DIR="path-to-instance-images"
set CLASS_DIR="path-to-class-images"
set OUTPUT_DIR="path-to-save-model"

accelerate launch train_dreambooth.py ^
  --pretrained_model_name_or_path=%MODEL_NAME% ^
  --instance_data_dir=%INSTANCE_DIR% ^
  --class_data_dir=%CLASS_DIR% ^
  --output_dir=%OUTPUT_DIR% ^
  --with_prior_preservation --prior_loss_weight=1.0 ^
  --instance_prompt="a photo of sks dog" ^
  --class_prompt="a photo of dog" ^
  --resolution=512 ^
  --train_batch_size=1 ^
  --gradient_accumulation_steps=2 --gradient_checkpointing ^
  --use_8bit_adam ^
  --learning_rate=5e-6 ^
  --lr_scheduler="constant" ^
  --lr_warmup_steps=0 ^
  --num_class_images=200 ^
  --max_train_steps=800
```
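Under the hood, `--use_8bit_adam` swaps the optimizer for the bitsandbytes 8-bit implementation, roughly like this sketch (not the script's exact code; the `model` here is a stand-in for the UNet):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(512, 512)  # stand-in for the trainable UNet parameters

# 8-bit AdamW keeps optimizer state quantized to 8 bits, cutting the
# optimizer's memory footprint versus the default 32-bit torch.optim.AdamW.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=5e-6)
```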
By using DeepSpeed it's possible to offload some tensors from VRAM to either CPU or NVMe, allowing training with less VRAM. DeepSpeed needs to be enabled with `accelerate config`: during configuration, answer yes to "Do you want to use DeepSpeed?". With DeepSpeed stage 2, fp16 mixed precision, and offloading both parameters and optimizer state to CPU, it's possible to train on under 8 GB of VRAM, with the drawback of requiring significantly more RAM (about 25 GB). See the DeepSpeed documentation for more configuration options. Changing the default Adam optimizer to DeepSpeed's special version, `deepspeed.ops.adam.DeepSpeedCPUAdam`, gives a substantial speedup, but enabling it requires a CUDA toolchain with the same version as PyTorch. The 8-bit optimizer does not seem to be compatible with DeepSpeed at the moment.
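For reference, the resulting accelerate config (written to `default_config.yaml` in accelerate's cache directory) might look roughly like this hypothetical sketch; the exact keys and defaults vary by accelerate version:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero_stage: 2
mixed_precision: fp16
num_processes: 1
```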
```bat
set MODEL_NAME="path-to-sd-model"
set INSTANCE_DIR="path-to-instance-images"
set CLASS_DIR="path-to-class-images"
set OUTPUT_DIR="path-to-save-model"

accelerate launch train_dreambooth.py ^
  --pretrained_model_name_or_path=%MODEL_NAME% ^
  --instance_data_dir=%INSTANCE_DIR% ^
  --class_data_dir=%CLASS_DIR% ^
  --output_dir=%OUTPUT_DIR% ^
  --with_prior_preservation --prior_loss_weight=1.0 ^
  --instance_prompt="a photo of sks dog" ^
  --class_prompt="a photo of dog" ^
  --resolution=512 ^
  --train_batch_size=1 ^
  --sample_batch_size=1 ^
  --gradient_accumulation_steps=1 --gradient_checkpointing ^
  --learning_rate=5e-6 ^
  --lr_scheduler="constant" ^
  --lr_warmup_steps=0 ^
  --num_class_images=200 ^
  --max_train_steps=800 ^
  --mixed_precision=fp16
```
The script also allows fine-tuning the `text_encoder` along with the `unet`. It has been observed experimentally that fine-tuning the `text_encoder` gives much better results, especially on faces. Pass the `--train_text_encoder` argument to the script to enable it:
```bat
set MODEL_NAME="path-to-sd-model"
set INSTANCE_DIR="path-to-instance-images"
set CLASS_DIR="path-to-class-images"
set OUTPUT_DIR="path-to-save-model"

accelerate launch train_dreambooth.py ^
  --pretrained_model_name_or_path=%MODEL_NAME% ^
  --train_text_encoder ^
  --instance_data_dir=%INSTANCE_DIR% ^
  --class_data_dir=%CLASS_DIR% ^
  --output_dir=%OUTPUT_DIR% ^
  --with_prior_preservation --prior_loss_weight=1.0 ^
  --instance_prompt="a photo of sks dog" ^
  --class_prompt="a photo of dog" ^
  --resolution=512 ^
  --train_batch_size=1 ^
  --use_8bit_adam ^
  --gradient_checkpointing ^
  --learning_rate=2e-6 ^
  --lr_scheduler="constant" ^
  --lr_warmup_steps=0 ^
  --num_class_images=200 ^
  --max_train_steps=800
```
- Dreambooth Xformers: https://github.com/ShivamShrirao/diffusers/tree/main/examples/dreambooth
- Bitsandbytes prebuilt DLLs: https://github.com/DeXtmL/bitsandbytes-win-prebuilt
- Convert Diffusers to SD: https://gist.github.com/jachiam/8a5c0b607e38fcc585168b90c686eb05
- SKS example images: https://unsplash.com/@alvannee