
Track training and validation loss #204

Closed
wsnoble opened this issue Jul 5, 2023 · 9 comments · Fixed by #376
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@wsnoble
Contributor

wsnoble commented Jul 5, 2023

I think we should add a Boolean config file option called something like loss_file that triggers creation of a TSV file containing training and validation loss information. This could have the following columns: Loss type ("train" or "validation"), Epoch, Batch, Loss. I would make the epoch a float (just the number of batches divided by the batches-per-epoch).
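The proposed file could be produced with Python's standard `csv` module. Here is a minimal sketch, assuming the column layout suggested above; the `write_loss_row` helper name is hypothetical, not part of Casanovo:

```python
import csv

def write_loss_row(path, loss_type, epoch, batch, loss):
    """Append one row to the proposed loss TSV (hypothetical helper).

    `epoch` is fractional, as suggested above: the number of batches
    seen so far divided by the batches per epoch.
    """
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        if f.tell() == 0:  # new file: write the header first
            writer.writerow(["Loss type", "Epoch", "Batch", "Loss"])
        writer.writerow([loss_type, f"{epoch:.4f}", batch, f"{loss:.6f}"])

# Example: batch 150 with 100 batches per epoch -> fractional epoch 1.5
write_loss_row("loss.tsv", "train", 150 / 100, 150, 3.4870)
```

The `f.tell() == 0` check makes the helper idempotent across calls: the header is written only when the file is empty, so the same function can be called from both the training and validation hooks.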

@wsnoble wsnoble added the enhancement New feature or request label Jul 5, 2023
@wsnoble
Contributor Author

wsnoble commented Sep 19, 2023

We should also report the learning rate as a column in this file.

@wsnoble wsnoble added the good first issue Good for newcomers label Sep 21, 2023
@wsnoble
Contributor Author

wsnoble commented Nov 6, 2023

This might be a good place to add a check for NaN training or validation loss. When printing the losses, it would be a good idea to test for NaN and terminate with an error if one occurs.
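The suggested check could look something like the following pure-Python sketch; the function name and error message are hypothetical, not existing Casanovo code:

```python
import math

def check_loss(loss: float, loss_type: str, epoch: float) -> float:
    """Raise instead of silently logging a NaN loss (sketch of the
    check suggested above; names and message are hypothetical)."""
    if math.isnan(loss):
        raise RuntimeError(
            f"NaN {loss_type} loss encountered at epoch {epoch:.2f}; "
            "terminating training"
        )
    return loss

check_loss(3.488, "train", 0.25)  # a finite loss passes through unchanged
```

In a real training loop this would run wherever the losses are reported, so a diverged run fails fast with a clear message instead of logging NaN for the remaining epochs.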

@Lilferrit
Contributor

How granular do we want this log to be (i.e., record on every step or just once per epoch)? I'm looking into a potential solution that uses the CSVLogger interface provided by PyTorch Lightning.

@Lilferrit
Contributor

I got a working solution using PyTorch Lightning's built-in CSVLogger. Unfortunately, it doesn't look like there's an easy way to change the delimiter, but if a tab delimiter would make things significantly easier, I'm sure a workaround could be found.

@wsnoble
Contributor Author

wsnoble commented Aug 1, 2024

I guess that's OK, though it's a bit unsatisfying to use a different delimiter for the different output files produced by Casanovo.

@Lilferrit
Contributor

Lilferrit commented Aug 1, 2024

Alternatively, we could implement our own TSV writer. This has the drawback of adding more maintenance overhead, but it gives us much more control over the format of the loss file. Even beyond the delimiter, while the file produced by the CSVLogger is definitely workable, it isn't exactly pretty to work with. Here's what the output from a small training run looks like:

epoch,epoch_num,learning_rate,step,train_CELoss,valid_CELoss
,0.0,1.999999987845058e-08,0,,
0,,,3,,3.488032579421997
0,,,3,3.487025737762451,
,1.0,3.999999975690116e-08,1,,
1,,,7,,3.4862751960754395
1,,,7,3.4866459369659424,
,2.0,5.99999978589949e-08,2,,
2,,,11,,3.4832537174224854
2,,,11,3.484100341796875,
,3.0,7.999999951380232e-08,3,,
3,,,15,,3.4789743423461914
3,,,15,3.4826388359069824,
,4.0,1.0000000116860974e-07,4,,
4,,,19,,3.4734537601470947
4,,,19,3.478057622909546,

Rows have different cells populated depending on when the logging operation happened (at the end of a training step, validation step, or validation epoch). While this is workable, I don't think it is exactly desirable.
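If we stick with the CSVLogger, the sparse wide file could be melted into the long format proposed in this issue as a post-processing step. A minimal sketch using only the standard library, run here on a trimmed copy of the output above (the `melt` function is hypothetical, not existing code):

```python
import csv
import io

# Trimmed copy of the sparse CSVLogger output shown above
wide_csv = """epoch,epoch_num,learning_rate,step,train_CELoss,valid_CELoss
,0.0,1.999999987845058e-08,0,,
0,,,3,,3.488032579421997
0,,,3,3.487025737762451,
,1.0,3.999999975690116e-08,1,,
1,,,7,,3.4862751960754395
1,,,7,3.4866459369659424,
"""

def melt(wide: str) -> list:
    """Convert the wide, sparse log into long (loss_type, epoch, step, loss) rows."""
    rows = []
    for rec in csv.DictReader(io.StringIO(wide)):
        for col, loss_type in (("train_CELoss", "train"),
                               ("valid_CELoss", "validation")):
            if rec[col]:  # skip rows where this loss column is blank
                rows.append((loss_type, int(rec["epoch"]),
                             int(rec["step"]), float(rec[col])))
    return rows

for row in melt(wide_csv):
    print(*row, sep="\t")
```

Rows that only carry the per-epoch learning rate (blank `epoch`, populated `epoch_num`) are dropped here; they could be merged back in as an extra column if the learning rate should appear in the long table too.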

@wsnoble
Contributor Author

wsnoble commented Aug 3, 2024

I agree that it might be nice to have finer-grained control, but I am not sure how high priority that is. @bittremieux what do you think?

@bittremieux
Collaborator

The blank entries during loss logging, both in the CSVLogger output and in the console, are what sparked this request last year. I agree that this format is silly and that a long table in a log file would be better, but unfortunately that is not possible with the built-in Lightning loggers.

Meanwhile, I also don't consider it a very high priority. Ultimately, what's the goal of this additional feature? To more easily make loss plots (unless I'm missing something). However, rather than making those plots ourselves after training has happened, if you want nice (and interactive) loss plots, it's as simple as enabling TensorBoard logging, and you get everything without any effort. This is also not a feature that (m)any users will care about, as it is only relevant for people training Casanovo (i.e. us).

@wsnoble
Contributor Author

wsnoble commented Aug 3, 2024

OK, I think we should go ahead with the CSVLogger implementation. Providing this functionality is better than not, and I do think we want to empower as many users as possible to train (or fine-tune) models, not just use our models.

4 participants