
Track training and validation loss #204

Closed
wsnoble opened this issue Jul 5, 2023 · 9 comments · Fixed by #376
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@wsnoble
Contributor

wsnoble commented Jul 5, 2023

I think we should add a Boolean config file option called something like loss_file that triggers creation of a TSV file containing training and validation loss information. This could have the following columns: Loss type ("train" or "validation"), Epoch, Batch, Loss. I would make the epoch a float (just the number of batches divided by the batches-per-epoch).
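The proposed file could be produced with Python's standard `csv` module. Here is a minimal sketch, assuming the column layout suggested above; the `write_loss_row` helper name is hypothetical, not part of Casanovo:

```python
import csv

def write_loss_row(path, loss_type, epoch, batch, loss):
    """Append one row to the proposed loss TSV (hypothetical helper).

    `epoch` is fractional, as suggested above: the number of batches
    seen so far divided by the batches per epoch.
    """
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        if f.tell() == 0:  # new file: write the header first
            writer.writerow(["Loss type", "Epoch", "Batch", "Loss"])
        writer.writerow([loss_type, f"{epoch:.4f}", batch, f"{loss:.6f}"])

# Example: batch 150 with 100 batches per epoch -> fractional epoch 1.5
write_loss_row("loss.tsv", "train", 150 / 100, 150, 3.4870)
```

The `f.tell() == 0` check makes the helper idempotent across calls: the header is written only when the file is empty, so the same function can be called from both the training and validation hooks.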

@wsnoble wsnoble added the enhancement New feature or request label Jul 5, 2023
@wsnoble
Contributor Author

wsnoble commented Sep 19, 2023

We should also report the learning rate as a column in this file.

@wsnoble wsnoble added the good first issue Good for newcomers label Sep 21, 2023
@wsnoble
Contributor Author

wsnoble commented Nov 6, 2023

This might be a good place to add a check for NaN training or validation loss. When printing the losses, it would be a good idea to test for NaN and terminate with an error if one occurs.
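The suggested check could look something like the following pure-Python sketch; the function name and error message are hypothetical, not existing Casanovo code:

```python
import math

def check_loss(loss: float, loss_type: str, epoch: float) -> float:
    """Raise instead of silently logging a NaN loss (sketch of the
    check suggested above; names and message are hypothetical)."""
    if math.isnan(loss):
        raise RuntimeError(
            f"NaN {loss_type} loss encountered at epoch {epoch:.2f}; "
            "terminating training"
        )
    return loss

check_loss(3.488, "train", 0.25)  # a finite loss passes through unchanged
```

In a real training loop this would run wherever the losses are reported, so a diverged run fails fast with a clear message instead of logging NaN for the remaining epochs.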

@Lilferrit
Contributor

How granular do we want this log to be (i.e., record on every step or just once per epoch)? I'm looking into a potential solution that uses the CSVLogger interface provided by PyTorch Lightning.

@Lilferrit
Contributor

I got a working solution using PyTorch Lightning's built-in CSVLogger. Unfortunately, it doesn't look like there's an easy way to change the delimiter, but if a tab delimiter would make things significantly easier, I'm sure a workaround could be found.

@wsnoble
Contributor Author

wsnoble commented Aug 1, 2024

I guess that's OK, though it's a bit unsatisfying to use a different delimiter for the different output files produced by Casanovo.

@Lilferrit
Contributor

Lilferrit commented Aug 1, 2024

Alternatively, we could implement our own TSV writer. This has the drawback of adding more maintenance overhead, but it gives us much more control over the format of the loss file. Even beyond the delimiter, while the file produced by the CSVLogger is definitely workable, it isn't exactly pretty to work with. Here's what the output from a small training run looks like:

epoch,epoch_num,learning_rate,step,train_CELoss,valid_CELoss
,0.0,1.999999987845058e-08,0,,
0,,,3,,3.488032579421997
0,,,3,3.487025737762451,
,1.0,3.999999975690116e-08,1,,
1,,,7,,3.4862751960754395
1,,,7,3.4866459369659424,
,2.0,5.99999978589949e-08,2,,
2,,,11,,3.4832537174224854
2,,,11,3.484100341796875,
,3.0,7.999999951380232e-08,3,,
3,,,15,,3.4789743423461914
3,,,15,3.4826388359069824,
,4.0,1.0000000116860974e-07,4,,
4,,,19,,3.4734537601470947
4,,,19,3.478057622909546,

Rows have different cells populated depending on when the logging operation happened (at the end of a training step, validation step, or validation epoch). While this is workable, I don't think it is exactly desirable.
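If we stick with the CSVLogger, the sparse wide file could be melted into the long format proposed in this issue as a post-processing step. A minimal sketch using only the standard library, run here on a trimmed copy of the output above (the `melt` function is hypothetical, not existing code):

```python
import csv
import io

# Trimmed copy of the sparse CSVLogger output shown above
wide_csv = """epoch,epoch_num,learning_rate,step,train_CELoss,valid_CELoss
,0.0,1.999999987845058e-08,0,,
0,,,3,,3.488032579421997
0,,,3,3.487025737762451,
,1.0,3.999999975690116e-08,1,,
1,,,7,,3.4862751960754395
1,,,7,3.4866459369659424,
"""

def melt(wide: str) -> list:
    """Convert the wide, sparse log into long (loss_type, epoch, step, loss) rows."""
    rows = []
    for rec in csv.DictReader(io.StringIO(wide)):
        for col, loss_type in (("train_CELoss", "train"),
                               ("valid_CELoss", "validation")):
            if rec[col]:  # skip rows where this loss column is blank
                rows.append((loss_type, int(rec["epoch"]),
                             int(rec["step"]), float(rec[col])))
    return rows

for row in melt(wide_csv):
    print(*row, sep="\t")
```

Rows that only carry the per-epoch learning rate (blank `epoch`, populated `epoch_num`) are dropped here; they could be merged back in as an extra column if the learning rate should appear in the long table too.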

@wsnoble
Contributor Author

wsnoble commented Aug 3, 2024

I agree that it might be nice to have finer-grained control, but I am not sure how high priority that is. @bittremieux what do you think?

@bittremieux
Collaborator

The blank entries during loss logging, both in the CSVLogger output and in the console, are what sparked this request last year. I agree that this format is silly and that a long table in a log file would be better, but unfortunately that is not possible with the built-in Lightning loggers.

Meanwhile, I also don't consider it a very high priority. Ultimately, what's the goal of this additional feature? To more easily make loss plots (unless I'm missing something). However, rather than making those plots ourselves after training has happened, if you want nice (and interactive) loss plots, it's as simple as enabling TensorBoard logging, and you get everything without any effort. This is also not a feature that (m)any users will care about, as it is only relevant for people training Casanovo (i.e. us).

@wsnoble
Contributor Author

wsnoble commented Aug 3, 2024

OK, I think we should go ahead with the CSVLogger implementation. Providing this functionality is better than not, and I do think we want to empower as many users as possible to train (or fine-tune) models, not just use our models.

4 participants