Hello all,
We ran into an issue while training the fs-Vid2Vid model on a dataset similar to YouTube Dancing. We created the three additional folders (poses-openpose, pose_maps-densepose, human_instance_maps) for all 3000 sequences. Training failed with a ZeroDivisionError after the model completed 5 epochs. We confirmed that the dataset contains no None images in the images, pose_maps-densepose, or human_instance_maps folders, and no empty JSON files in poses-openpose. We used a batch size of 2 and trained on a single GPU. We also reduced the dataset to 500 sequences and trained again; the same error occurred after the 7th epoch.
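A dataset check of the kind described above can be sketched as follows. This is an illustrative sketch, not the exact script that was run: the folder names come from the post, while `find_bad_files` and the layout (one subfolder per sequence) are assumptions. It uses only the standard library, so it flags zero-byte image files rather than decoding each image.

```python
import json
from pathlib import Path

# Folder names taken from the post; the helper below is hypothetical.
IMAGE_FOLDERS = ("images", "pose_maps-densepose", "human_instance_maps")
JSON_FOLDER = "poses-openpose"


def find_bad_files(root):
    """Return paths under root that are zero-byte images or empty/unparseable JSON."""
    root = Path(root)
    bad = []
    for folder in IMAGE_FOLDERS:
        folder_path = root / folder
        if not folder_path.is_dir():
            continue
        for path in sorted(folder_path.rglob("*")):
            # A zero-byte file cannot be decoded. Corrupt but non-empty
            # files are NOT detected by this size check.
            if path.is_file() and path.stat().st_size == 0:
                bad.append(path)
    json_path = root / JSON_FOLDER
    if json_path.is_dir():
        for path in sorted(json_path.rglob("*.json")):
            try:
                content = json.loads(path.read_text())
            except (json.JSONDecodeError, UnicodeDecodeError):
                bad.append(path)
                continue
            if not content:  # empty dict, empty list, or null
                bad.append(path)
    return bad
```

Calling `find_bad_files("/path/to/dataset")` returns an empty list when every file passes both checks.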
Is there a fix for this error?
This is the exact error we got:
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1e-323
Gradient overflow. Skipping step, loss scaler 1 reducing loss scale to 2.53e-321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5e-324
Gradient overflow. Skipping step, loss scaler 1 reducing loss scale to 1.265e-321
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0
Gradient overflow. Skipping step, loss scaler 1 reducing loss scale to 6.3e-322
Traceback (most recent call last):
  File "train.py", line 93, in <module>
    main()
  File "train.py", line 78, in main
    trainer.gen_update(data)
  File "/mnt/fs/imaginaire/imaginaire/trainers/vid2vid.py", line 283, in gen_update
    self.get_gen_losses(data_t, net_G_output, net_D_output)
  File "/mnt/fs/imaginaire/imaginaire/trainers/vid2vid.py", line 537, in get_gen_losses
    scaled_loss.backward()
  File "/home/ubuntu/anaconda3/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/apex/amp/_process_optimizer.py", line 123, in post_backward_models_are_masters
    scaler.unscale(
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/apex/amp/scaler.py", line 117, in unscale
    1./scale)
ZeroDivisionError: float division by zero
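The arithmetic behind the last log lines can be reproduced in plain Python. This is a minimal sketch that assumes the loss scale is halved on every overflow, which is consistent with the logged values for loss scaler 0 (1e-323, 5e-324, 0.0). It does not call Apex, and the variable names are illustrative only.

```python
# Start from the value reported for loss scaler 0 in the log.
scale = 1e-323
history = [scale]

# Assumption: each "Gradient overflow" halves the scale.
# Subnormal doubles run out of precision and the value underflows to 0.0.
while scale != 0.0:
    scale = scale / 2.0
    history.append(scale)

print(history)  # 1e-323 -> 5e-324 -> 0.0

# Once the scale is exactly 0.0, computing its reciprocal (as in
# `1./scale` at the bottom of the traceback) raises ZeroDivisionError.
try:
    1.0 / scale
    raised = False
except ZeroDivisionError:
    raised = True

print(raised)
```

The scale only reaches values this small after a very long run of consecutive skipped steps, so by the time the exception is raised the overflow has been recurring on every step for a while.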
Thanks for your time.