
fused_backward_pass in prodigy-plus-schedule-free [SOLVED VIA EXTERNAL SOLUTION ✅] #1834

deGENERATIVE-SQUAD opened this issue Dec 13, 2024 · 5 comments



deGENERATIVE-SQUAD commented Dec 13, 2024

Hi, Kohya. I know you hardcoded fused_backward_pass to Adafactor, but prodigy-plus-schedule-free (https://github.com/LoganBooker/prodigy-plus-schedule-free) already has that feature built in, yet we can't use it. To be precise, we can pass the optimizer's built-in argument ourselves, but that breaks the training process. Can you add more flexibility here, please?
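
(For context, a fused backward pass applies each parameter's optimizer update inside backward() and frees that parameter's gradient right away, which is where the VRAM saving comes from. The sketch below shows the general pattern with plain PyTorch >= 2.1 hooks; it is only an illustration under those assumptions, not sd-scripts' or this optimizer's actual implementation.)

```python
# Minimal sketch of a "fused" backward pass: each parameter is updated inside
# backward(), so its gradient can be released immediately instead of being held
# until a global optimizer.step(). The toy model and SGD are placeholders.
import torch

model = torch.nn.Linear(1024, 1024)

# One small optimizer per parameter so each can be stepped independently.
optimizers = {p: torch.optim.SGD([p], lr=1e-3) for p in model.parameters()}

def step_and_free(param: torch.Tensor) -> None:
    opt = optimizers[param]
    opt.step()                        # update this parameter right away
    opt.zero_grad(set_to_none=True)   # drop its gradient to save VRAM

for p in model.parameters():
    # Fires once the gradient for `p` has been fully accumulated (PyTorch >= 2.1).
    p.register_post_accumulate_grad_hook(step_and_free)

loss = model(torch.randn(8, 1024)).pow(2).mean()
loss.backward()  # parameter updates happen here; no separate optimizer.step()
```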


michP247 commented Jan 6, 2025

#1866


deGENERATIVE-SQUAD commented Jan 8, 2025

> #1866

I tried both the LoRA algorithm and the GLoRA+DoRA algorithm on SDXL: no noticeable decrease in VRAM usage, and speeds are the same with and without fused_backward_pass.

Settings for example:

accelerate launch --num_cpu_threads_per_process 8 sdxl_train_network.py ^
--pretrained_model_name_or_path="model.safetensors" ^
--train_data_dir="LP3" ^
--output_dir="output_dir" ^
--output_name="LP3-glora-dora-1024p-batch1-reso2k-psteps500-biastrue-d1e4-lossl2-dcoef01-unetlr1-snr1-randcrop-bigasp-dropout01-conv32conv1-netdim32netalpha1-trainnorm-fbwp" ^
--network_args "algo=glora" "dropout=0.1" "conv_dim=32" "conv_alpha=1" "train_norm=True" "dora_wd=True" ^
--resolution="1024,1024" ^
--save_model_as="safetensors" ^
--network_module="lycoris.kohya" ^
--max_train_steps=1000 ^
--save_every_n_epochs=1 ^
--save_every_n_steps=100 ^
--save_state_on_train_end ^
--network_dim=32 ^
--network_alpha=1 ^
--train_batch_size=1 ^
--max_data_loader_n_workers=0 ^
--random_crop ^
--enable_bucket ^
--bucket_reso_steps=64 ^
--min_bucket_reso=768 ^
--max_bucket_reso=2048 ^
--mixed_precision="bf16" ^
--caption_extension=".txt" ^
--noise_offset=0.05 ^
--multires_noise_discount=0.2 ^
--multires_noise_iterations=7 ^
--gradient_checkpointing ^
--fused_backward_pass ^
--optimizer_type="prodigyplus.ProdigyPlusScheduleFree" ^
--optimizer_args "d0=1e-4" "prodigy_steps=500" "d_coef=0.1" "use_bias_correction=True" "use_adopt=False" "weight_decay_by_lr=True" ^
--loss_type="l2" ^
--unet_lr=1.0 ^
--network_train_unet_only ^
--min_snr_gamma=1 ^
--prior_loss_weight=1 ^
--seed=0 ^
--logging_dir="logs" ^


michP247 commented Jan 8, 2025

> #1866
>
> I tried both the LoRA algorithm and the GLoRA+DoRA algorithm on SDXL: no noticeable decrease in VRAM usage, and speeds are the same with and without fused_backward_pass.


Try it without my proposed changes; apparently it was already working if you set --fused_back_pass as an optimizer arg.
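
(For reference, sd-scripts forwards each key=value pair in --optimizer_args as a keyword argument to the optimizer class named in --optimizer_type, so "set fused_back_pass as an optimizer arg" amounts to the sketch below. The import path is inferred from --optimizer_type="prodigyplus.ProdigyPlusScheduleFree"; whether the optimizer accepts a boolean fused_back_pass keyword, and how it then applies updates during backward, should be checked against its own README.)

```python
# Hedged sketch: constructing the optimizer directly with its built-in fused
# backward pass enabled, using only argument names that appear in this thread.
import torch
from prodigyplus import ProdigyPlusScheduleFree  # assumed import path

model = torch.nn.Linear(16, 16)  # stand-in for the network being trained

optimizer = ProdigyPlusScheduleFree(
    model.parameters(),
    lr=1.0,                   # matches --unet_lr=1.0 in the command above
    d0=1e-4,                  # values below match the --optimizer_args used above
    prodigy_steps=500,
    d_coef=0.1,
    use_bias_correction=True,
    weight_decay_by_lr=True,
    fused_back_pass=True,     # the optimizer argument being discussed here (assumed boolean)
)
```

In sd-scripts terms, this would mean adding "fused_back_pass=True" to --optimizer_args rather than relying on the global --fused_backward_pass flag alone.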


deGENERATIVE-SQUAD commented Jan 8, 2025

> Try it without my proposed changes; apparently it was already working if you set --fused_back_pass as an optimizer arg.

Tested it earlier:

  1. fused_back_pass as an optimizer argument does not work in the dev branch: the LoRA simply has no effect, with no change in the resulting image after tons of steps/epochs. If retrained without this argument, everything works as expected.
  2. fused_back_pass as an optimizer argument does not work in your branch either: the effect is the same as in option 1.

Training speed and related metrics were about 30% faster than usual. I don't remember the exact VRAM usage, but there was no significant decrease.

I also tried --fused_backward_pass together with the optimizer argument, but the resulting LoRA does not work, just like in the cases above.

By "LoRA not working" I mean that it trains without errors or NaNs and TensorBoard shows normal-looking graphs, but when I apply the resulting LoRA to the model there is no change in the image, even at a weight of 1000.
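
(One way to separate "training ran without errors" from "the weights actually changed" is to inspect the saved file directly. A hedged sketch: the path is a placeholder, not one from this thread, and it assumes the safetensors package is installed.)

```python
# Load the saved LoRA and print per-tensor norms. In standard LoRA/LyCORIS
# layouts the "up"/"B" matrices start at zero, so if they are still all zero,
# the optimizer never applied any update, which would match "no effect on the
# image at any weight".
from safetensors.torch import load_file

state = load_file("output_dir/lora.safetensors")  # placeholder path
for name, tensor in state.items():
    print(f"{name}: shape={tuple(tensor.shape)} norm={tensor.float().norm().item():.6f}")
```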

@deGENERATIVE-SQUAD

The problem is solved in the new version of https://github.com/LoganBooker/prodigy-plus-schedule-free
According to the issue thread LoganBooker/prodigy-plus-schedule-free#7, it now works with both full finetunes and LoRAs (the problem was the lack of fused backward pass support for LoRA in sd-scripts).

deGENERATIVE-SQUAD changed the title from "fused_backward_pass in prodigy-plus-schedule-free ❌" to "fused_backward_pass in prodigy-plus-schedule-free [SOLVED VIA EXTERNAL SOLUTION ✅]" on Jan 10, 2025