Model training terminated #8

Open

divymurli opened this issue Nov 23, 2020 · 6 comments

@divymurli

Hi, one other issue I wanted to point out was that the training process seemed to terminate about 27 epochs in, due to a diverging loss.

Thanks!
[Screenshot 2020-11-23 at 08 46 08]

@wgrathwohl (Owner)

wgrathwohl commented Nov 23, 2020 via email

As I say in the paper, the best thing to do when the model diverges is to increase the number of MCMC steps or decrease the learning rate. EBMs are very finicky creatures! Thankfully, there has been a lot of work on improving and stabilizing the training. One thing I read recently found that smooth nonlinearities make training considerably more stable, so you could try a Swish and see if that helps. Cheers

@divymurli (Author)

Ah, and by MCMC steps do you mean SGLD (sorry, not super familiar with MCMC)?
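
For reference: the sampler here is SGLD (stochastic gradient Langevin dynamics), which is a stochastic-gradient MCMC method, so "MCMC steps" means the number of SGLD update iterations run to draw each negative sample. A minimal sketch of that update in PyTorch (the function and parameter names below are illustrative, not the repo's actual code):

```python
import torch

def sgld_sample(logp_fn, x_init, n_steps=20, step_size=1.0, noise_std=0.01):
    """Run n_steps of SGLD starting from x_init.

    logp_fn(x) should return the unnormalized log-density (negative energy)
    for each element of the batch x. As is common in practice, the gradient
    step size and the noise scale are decoupled rather than tied as in
    textbook SGLD.
    """
    x = x_init.clone().detach()
    for _ in range(n_steps):
        x.requires_grad_(True)
        # Gradient of log p(x) with respect to the *inputs*, not the weights.
        grad = torch.autograd.grad(logp_fn(x).sum(), x)[0]
        x = x.detach() + step_size * grad + noise_std * torch.randn_like(x)
    return x.detach()
```

Increasing n_steps (more MCMC steps per training iteration) or lowering the optimizer's learning rate are the two stabilizers suggested in the reply above.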

@mwcvitkovic

Related question: so just to be clear, the code in the repo isn't the code used to create the results in the paper?

@wgrathwohl (Owner)

wgrathwohl commented Nov 23, 2020 via email

@mwcvitkovic

Definitely helpful, and much appreciated. I'm just curious whether the line of code in the README worked for you but isn't working for @divymurli.

That would be surprising, considering that random draws from the buffer should be deterministic under the random seeds set in the training scripts; I can't see what the source of randomness would be.
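
For what it's worth, the determinism claim above assumes that every relevant RNG is seeded before the replay buffer is built and sampled, roughly along these lines (an illustrative sketch, not the repo's exact script):

```python
import random
import numpy as np
import torch

def set_seeds(seed=1):
    # Seed all RNGs that could affect buffer initialization and index draws.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seeds(1)
# With the seeds fixed, the buffer starts from the same noise and the same
# indices are drawn every run, so two runs should behave identically --
# unless some GPU op is nondeterministic or the data-loading order differs.
buffer = torch.rand(1000, 3, 32, 32) * 2 - 1     # uniform noise in [-1, 1]
inds = torch.randint(0, buffer.size(0), (64,))   # which buffer entries to restart chains from
batch = buffer[inds]
```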

@andiac

andiac commented Nov 17, 2023

Thanks, increasing MCMC steps helps a lot.
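
On the Swish suggestion from the earlier reply: Swish is the smooth nonlinearity x * sigmoid(x) (PyTorch ships it as torch.nn.SiLU), and swapping it in for ReLU would look roughly like this hypothetical sketch (the toy network below is illustrative, not the Wide ResNet used in the repo):

```python
import torch
import torch.nn as nn

class Swish(nn.Module):
    """Swish / SiLU: x * sigmoid(x). Smooth everywhere, unlike ReLU,
    which can help stabilize the input gradients that SGLD relies on."""
    def forward(self, x):
        return x * torch.sigmoid(x)

# Illustrative toy energy network using Swish in place of ReLU.
energy_net = nn.Sequential(
    nn.Linear(784, 256), Swish(),
    nn.Linear(256, 256), Swish(),
    nn.Linear(256, 1),
)
```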
