I think the functionality is sufficient if we combine the --save_every_n_epochs and --save_last_n_epochs options. Saving checkpoints does take time, but if there is a problem and training ends midway, it would be a bigger problem if no checkpoints had been saved.
Let's say I wanted to save epochs 30, 50, and 55; currently this is not possible.
Also, the last time I tested --save_last_n_epochs (set to 3), it didn't work :D It tried to save the 4th checkpoint first and only deleted an old one afterwards, so I got an out-of-space error.
But I am going to test it again. I think it should delete the oldest checkpoint first and then save the next one, to fully utilize the available space.
I set "Save last N epochs" to 2; my intention was to keep just the last 2 or 3 safetensors checkpoints because of disk space restrictions, and I am saving every 25 epochs. Should I set "Save last N epochs" to 50 if I want to keep the last 2 checkpoints, or 75 to keep the last 3, or doesn't it work this way?
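As a way to reason about the question above, here is a minimal simulation of checkpoint retention. The flag names mirror the sd-scripts options, but the retention rule used here (keep the last N *saved checkpoints*, deleting the oldest) is an assumption for illustration, not confirmed behavior of the scripts:

```python
def saved_checkpoints(total_epochs, save_every_n_epochs, save_last_n_epochs):
    """Return the epoch numbers whose checkpoints would remain on disk.

    Assumes save_last_n_epochs counts saved checkpoints, not raw epochs.
    """
    kept = []
    for epoch in range(1, total_epochs + 1):
        if epoch % save_every_n_epochs == 0:
            kept.append(epoch)
            # Under this interpretation, only the newest N checkpoints survive.
            if save_last_n_epochs and len(kept) > save_last_n_epochs:
                kept.pop(0)  # oldest checkpoint deleted
    return kept

# Saving every 25 epochs over 100 epochs, keeping the last 2 checkpoints:
print(saved_checkpoints(100, 25, 2))  # -> [75, 100]
```

If the option instead counts epochs rather than checkpoints, the 50/75 values from the question would be the right settings; the simulation makes it easy to check either interpretation.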
We need to be able to save only certain epochs and steps.
For example, save epochs 30, 35, 40, and 45 and no others,
or
save steps 300, 400, and 500 and no others.
Can you please add this option? Thank you @kohya-ss
This became super important for FLUX training, since each checkpoint is 24 GB.
This is about saving checkpoints, but having the same option for saving state would be nice as well.
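The requested behavior could be sketched as an explicit epoch/step list option. Neither the flag names (--save_epochs, --save_steps) nor this parsing exists in sd-scripts; this is only a hypothetical illustration of the feature request:

```python
import argparse

def parse_int_list(value):
    """Parse a comma-separated list like '30,35,40' into a set of ints."""
    return {int(v) for v in value.split(",") if v.strip()}

parser = argparse.ArgumentParser()
parser.add_argument("--save_epochs", type=parse_int_list, default=set(),
                    help="save a checkpoint only at these epoch numbers")
parser.add_argument("--save_steps", type=parse_int_list, default=set(),
                    help="save a checkpoint only at these global steps")

def should_save_epoch(epoch, args):
    # Save only if the epoch appears in the explicit list.
    return epoch in args.save_epochs

args = parser.parse_args(["--save_epochs", "30,35,40,45"])
print(should_save_epoch(40, args))   # True
print(should_save_epoch(41, args))   # False
```

The same membership check could gate state saving as well, so both checkpoints and training state would follow the user-supplied list.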