Model training terminated #8
Comments
As I say in the paper, the best thing to do when the model diverges is to increase the number of MCMC steps or decrease the learning rate. EBMs are very finicky creatures! Thankfully, there's been lots of work on improving and stabilizing their training. One thing I read recently found that smooth nonlinearities make training considerably more stable, so you could try a Swish and see if that helps out.
Cheers
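A rough sketch of the SGLD sampler that these "MCMC steps" refer to (in JEM the MCMC sampler is SGLD). The function and argument names below, and the default values, are illustrative assumptions rather than the repo's actual code or flags:

```python
# Illustrative SGLD sampler sketch; names and defaults are assumptions, not the repo's code.
import torch

def sgld_sample(energy_model, x_init, n_steps=40, sgld_lr=1.0, sgld_std=0.01):
    """Run n_steps of SGLD on the energy E(x) defined by energy_model.

    More steps (the "MCMC steps") give the chain more time to reach
    low-energy regions, which is the main stabilization knob mentioned above.
    """
    x = x_init.clone().detach()
    for _ in range(n_steps):
        x.requires_grad_(True)
        energy = energy_model(x).sum()           # scalar E(x) summed over the batch
        grad = torch.autograd.grad(energy, x)[0]
        # Gradient step toward lower energy plus Gaussian noise (Langevin dynamics).
        x = (x - sgld_lr * grad + sgld_std * torch.randn_like(x)).detach()
    return x
```

The Swish suggestion amounts to replacing the ReLU activations in the energy network with a smooth nonlinearity, e.g. `torch.nn.SiLU()` in recent PyTorch versions.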
On Mon, Nov 23, 2020 at 1:42 PM, Divyanshu Murli wrote:
Hi, one other issue I wanted to point out was that the training process seemed to terminate about 27 epochs in, due to a diverging loss. Thanks!
[image: Screenshot 2020-11-23 at 08 46 08] https://user-images.githubusercontent.com/38363539/100001813-ea9aa800-2d80-11eb-8376-6f7ac75a9970.png
--
Will Grathwohl
Graduate Student Researcher
Machine Learning Group
University of Toronto / Vector Institute
Ah, and by MCMC steps do you mean SGLD? (Sorry, I'm not super familiar with MCMC.)
Related question: so just to be clear, the code in the repo isn't the code used to create the results in the paper?
It is, but as we write in Appendix H.3:
"We find that when using PCD, occasionally throughout training a sample will be drawn from the replay buffer that has a considerably higher-than-average energy (higher than the energy of a random initialization). This causes the gradients w.r.t. this example to be orders of magnitude larger than the gradients w.r.t. the rest of the examples and causes the model to diverge. We tried a number of heuristic approaches such as gradient clipping, energy clipping, ignoring examples with atypical energy values, and many others, but could not find an approach that stabilized training and did not hurt generative and discriminative performance."
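For context, the PCD replay buffer that passage refers to works roughly as in the sketch below. The class name, method names, and the reinitialization probability are assumptions loosely following the paper's description, not the repo's implementation:

```python
# Rough sketch of a PCD replay buffer for an image EBM; all names and values are illustrative.
import torch

class ReplayBuffer:
    def __init__(self, size, data_shape, reinit_prob=0.05):
        # Buffer initialized with uniform noise in [-1, 1] (typical for images scaled to that range).
        self.buffer = torch.rand(size, *data_shape) * 2 - 1
        self.reinit_prob = reinit_prob

    def sample_inits(self, batch_size):
        """Start most SGLD chains from stored samples, a few from fresh noise."""
        idx = torch.randint(0, self.buffer.shape[0], (batch_size,))
        x = self.buffer[idx].clone()
        reinit = torch.rand(batch_size) < self.reinit_prob
        n_reinit = int(reinit.sum().item())
        x[reinit] = torch.rand(n_reinit, *x.shape[1:]) * 2 - 1
        return x, idx

    def update(self, idx, x_final):
        """Write the final SGLD samples back. A stale, unusually high-energy entry
        here is exactly the failure mode described in Appendix H.3."""
        self.buffer[idx] = x_final.detach().cpu()
```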
I will be the first to admit that EBM training in this way is a nightmare and requires pretty consistent babysitting. At the moment these models are basically where GANs were in 2014: not easy to train, and requiring a lot of hand-tuning. The main point of this paper was to demonstrate the utility of these models if they can be trained. There have been a number of improvements since which can stabilize EBM training.
You should be able to train these models with some combination of restarts, learning-rate decreases, and MCMC step increases. I hope that helps.
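As a concrete (and entirely hypothetical) illustration of the "restarts, learning-rate decrease, MCMC step increase" recipe, an outer loop like the following checkpoints regularly and, when the loss stops being finite, rolls back with a gentler learning rate and more SGLD steps. `optimizer_fn` and `train_one_epoch` are assumed helpers, not functions from this repo:

```python
# Hypothetical restart-on-divergence wrapper; not the repo's training script.
import copy
import math

def train_with_restarts(model, optimizer_fn, train_one_epoch, n_epochs,
                        lr=1e-4, n_steps=20, lr_decay=0.5, extra_steps=20):
    checkpoint = copy.deepcopy(model.state_dict())  # last known-good weights
    epoch = 0
    while epoch < n_epochs:
        # Rebuilding the optimizer each epoch keeps the sketch simple; it discards optimizer state.
        optimizer = optimizer_fn(model.parameters(), lr=lr)
        loss = train_one_epoch(model, optimizer, n_steps=n_steps)
        if not math.isfinite(loss):
            # Diverged: restart from the last good checkpoint with a gentler setup.
            model.load_state_dict(checkpoint)
            lr *= lr_decay          # decrease the learning rate
            n_steps += extra_steps  # increase the number of SGLD/MCMC steps
            continue
        checkpoint = copy.deepcopy(model.state_dict())
        epoch += 1
    return model
```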
Definitely helpful, and much appreciated. I'm just curious whether the line of code in the README worked for you but isn't working for @divymurli. That would be surprising, since random draws from the buffer should be deterministic under the random seeds you set in the training scripts; I can't see what the source of randomness would be.
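For what it's worth, full determinism in a PyTorch training script usually requires seeding more than one RNG; the sketch below shows the typical set of switches. Whether the training scripts here set all of them (the cuDNN flags in particular) is an assumption to verify, and any unseeded source, such as nondeterministic CUDA kernels or data-loader workers, could explain different buffer draws across runs:

```python
# Illustrative reproducibility setup (assumes a PyTorch + NumPy training script).
import random
import numpy as np
import torch

def set_seed(seed=1234):
    random.seed(seed)                     # Python RNG (e.g. any buffer index sampling done in Python)
    np.random.seed(seed)                  # NumPy RNG
    torch.manual_seed(seed)               # seeds CPU and CUDA RNGs in recent PyTorch
    torch.cuda.manual_seed_all(seed)      # explicit seeding of all GPUs for older versions
    torch.backends.cudnn.deterministic = True  # force deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable autotuning, which is nondeterministic
```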
Thanks, increasing MCMC steps helps a lot.