
fused_backward_pass in prodigy-plus-schedule-free [SOLVED VIA EXTERNAL SOLUTION ✅] #1834

deGENERATIVE-SQUAD opened this issue Dec 13, 2024 · 5 comments



deGENERATIVE-SQUAD commented Dec 13, 2024

Hi, Kohya. I know you hardcoded fused_backward_pass to Adafactor, but prodigy-plus-schedule-free (https://github.com/LoganBooker/prodigy-plus-schedule-free) already has that feature built in, yet we can't use it. To be precise, we can pass the optimizer's built-in argument ourselves, but that breaks the training process. Can you add more flexibility here, please?
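
(For context, a fused backward pass applies each parameter's optimizer update inside backward() and frees that parameter's gradient right away, which is where the VRAM saving comes from. The sketch below shows the general pattern with plain PyTorch >= 2.1 hooks; it is only an illustration under those assumptions, not sd-scripts' or this optimizer's actual implementation.)

```python
# Minimal sketch of a "fused" backward pass: each parameter is updated inside
# backward(), so its gradient can be released immediately instead of being held
# until a global optimizer.step(). The toy model and SGD are placeholders.
import torch

model = torch.nn.Linear(1024, 1024)

# One small optimizer per parameter so each can be stepped independently.
optimizers = {p: torch.optim.SGD([p], lr=1e-3) for p in model.parameters()}

def step_and_free(param: torch.Tensor) -> None:
    opt = optimizers[param]
    opt.step()                        # update this parameter right away
    opt.zero_grad(set_to_none=True)   # drop its gradient to save VRAM

for p in model.parameters():
    # Fires once the gradient for `p` has been fully accumulated (PyTorch >= 2.1).
    p.register_post_accumulate_grad_hook(step_and_free)

loss = model(torch.randn(8, 1024)).pow(2).mean()
loss.backward()  # parameter updates happen here; no separate optimizer.step()
```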


michP247 commented Jan 6, 2025

#1866


deGENERATIVE-SQUAD commented Jan 8, 2025

> #1866

I tried both the LoRA algorithm and the GLoRA+DoRA algorithm on SDXL: no noticeable decrease in VRAM usage, and speeds are the same with and without fused_backward_pass.

Settings for example:

accelerate launch --num_cpu_threads_per_process 8 sdxl_train_network.py ^
--pretrained_model_name_or_path="model.safetensors" ^
--train_data_dir="LP3" ^
--output_dir="output_dir" ^
--output_name="LP3-glora-dora-1024p-batch1-reso2k-psteps500-biastrue-d1e4-lossl2-dcoef01-unetlr1-snr1-randcrop-bigasp-dropout01-conv32conv1-netdim32netalpha1-trainnorm-fbwp" ^
--network_args "algo=glora" "dropout=0.1" "conv_dim=32" "conv_alpha=1" "train_norm=True" "dora_wd=True" ^
--resolution="1024,1024" ^
--save_model_as="safetensors" ^
--network_module="lycoris.kohya" ^
--max_train_steps=1000 ^
--save_every_n_epochs=1 ^
--save_every_n_steps=100 ^
--save_state_on_train_end ^
--network_dim=32 ^
--network_alpha=1 ^
--train_batch_size=1 ^
--max_data_loader_n_workers=0 ^
--random_crop ^
--enable_bucket ^
--bucket_reso_steps=64 ^
--min_bucket_reso=768 ^
--max_bucket_reso=2048 ^
--mixed_precision="bf16" ^
--caption_extension=".txt" ^
--noise_offset=0.05 ^
--multires_noise_discount=0.2 ^
--multires_noise_iterations=7 ^
--gradient_checkpointing ^
--fused_backward_pass ^
--optimizer_type="prodigyplus.ProdigyPlusScheduleFree" ^
--optimizer_args "d0=1e-4" "prodigy_steps=500" "d_coef=0.1" "use_bias_correction=True" "use_adopt=False" "weight_decay_by_lr=True" ^
--loss_type="l2" ^
--unet_lr=1.0 ^
--network_train_unet_only ^
--min_snr_gamma=1 ^
--prior_loss_weight=1 ^
--seed=0 ^
--logging_dir="logs" ^


michP247 commented Jan 8, 2025

> #1866
>
> I tried both the LoRA algorithm and the GLoRA+DoRA algorithm on SDXL: no noticeable decrease in VRAM usage, and speeds are the same with and without fused_backward_pass.


Try it without my proposed changes; apparently it was already working if you set --fused_back_pass as an optimizer arg.
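
(For reference, sd-scripts forwards each key=value pair in --optimizer_args as a keyword argument to the optimizer class named in --optimizer_type, so "set fused_back_pass as an optimizer arg" amounts to the sketch below. The import path is inferred from --optimizer_type="prodigyplus.ProdigyPlusScheduleFree"; whether the optimizer accepts a boolean fused_back_pass keyword, and how it then applies updates during backward, should be checked against its own README.)

```python
# Hedged sketch: constructing the optimizer directly with its built-in fused
# backward pass enabled, using only argument names that appear in this thread.
import torch
from prodigyplus import ProdigyPlusScheduleFree  # assumed import path

model = torch.nn.Linear(16, 16)  # stand-in for the network being trained

optimizer = ProdigyPlusScheduleFree(
    model.parameters(),
    lr=1.0,                   # matches --unet_lr=1.0 in the command above
    d0=1e-4,                  # values below match the --optimizer_args used above
    prodigy_steps=500,
    d_coef=0.1,
    use_bias_correction=True,
    weight_decay_by_lr=True,
    fused_back_pass=True,     # the optimizer argument being discussed here (assumed boolean)
)
```

In sd-scripts terms, this would mean adding "fused_back_pass=True" to --optimizer_args rather than relying on the global --fused_backward_pass flag alone.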


deGENERATIVE-SQUAD commented Jan 8, 2025

> Try it without my proposed changes; apparently it was already working if you set --fused_back_pass as an optimizer arg.

Tested it earlier:

  1. fused_back_pass as an optimizer argument does not work in the dev branch: the LoRA simply has no effect, with no change in the resulting image after tons of steps/epochs. If retrained without this argument, everything works as expected.
  2. fused_back_pass as an optimizer argument does not work in your branch either: the effect is the same as in option 1.

Training speed and related metrics were about 30% faster than usual. I don't remember the exact VRAM usage, but there was no significant decrease.

I also tried --fused_backward_pass together with the optimizer argument, but the resulting LoRA does not work, just like in the cases above.

By "LoRA not working" I mean that it trains without errors or NaNs and TensorBoard shows normal-looking graphs, but when I apply the resulting LoRA to the model there is no change in the image, even at a weight of 1000.
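
(One way to separate "training ran without errors" from "the weights actually changed" is to inspect the saved file directly. A hedged sketch: the path is a placeholder, not one from this thread, and it assumes the safetensors package is installed.)

```python
# Load the saved LoRA and print per-tensor norms. In standard LoRA/LyCORIS
# layouts the "up"/"B" matrices start at zero, so if they are still all zero,
# the optimizer never applied any update, which would match "no effect on the
# image at any weight".
from safetensors.torch import load_file

state = load_file("output_dir/lora.safetensors")  # placeholder path
for name, tensor in state.items():
    print(f"{name}: shape={tuple(tensor.shape)} norm={tensor.float().norm().item():.6f}")
```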

@deGENERATIVE-SQUAD

The problem is solved in the new version of https://github.com/LoganBooker/prodigy-plus-schedule-free
According to the issue thread LoganBooker/prodigy-plus-schedule-free#7, it now works with both full finetunes and LoRAs (the problem was the lack of fused backward pass support for LoRA in sd-scripts).

deGENERATIVE-SQUAD changed the title from "fused_backward_pass in prodigy-plus-schedule-free ❌" to "fused_backward_pass in prodigy-plus-schedule-free [SOLVED VIA EXTERNAL SOLUTION ✅]" on Jan 10, 2025