This repository has been archived by the owner on Apr 19, 2023. It is now read-only.

Fix tqdm serialization #169

Open

Yuyan-Li wants to merge 3 commits into master

Conversation

Yuyan-Li
Contributor

This removes the TQDM bar from the serialization.

It prevents the error when saving the trainer:
TypeError: cannot serialize '_io.TextIOWrapper' object

I think the bar will be rebuilt automatically (haven't tested it yet).
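The change relies on the standard pickle pattern of dropping the non-picklable tqdm handle from the callback's state before serialization. A minimal sketch of that pattern (the class, attribute, and method names below are illustrative placeholders, not necessarily what TQDMProgressBar uses):

import tqdm


class ProgressCallback(object):
    """Illustrative stand-in for a progress-bar callback."""

    def __init__(self):
        # tqdm writes to sys.stderr, an open _io.TextIOWrapper.
        self.progress_bar = None

    def begin_epoch(self, num_iterations):
        # (Re)create the bar lazily, so a freshly unpickled callback works too.
        self.progress_bar = tqdm.tqdm(total=num_iterations)

    def __getstate__(self):
        # Drop the tqdm handle before pickling; its open output stream is what
        # raises the TypeError when the trainer is saved.
        state = self.__dict__.copy()
        state['progress_bar'] = None
        return state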

@DerThorsten
Collaborator

Anyone have an opinion on merging this?

@nasimrahaman
Collaborator

If the deserialisation works (which @Yuyan-Li has not tested yet), sure. @DerThorsten, would you have time to do a quick test?

@DerThorsten
Collaborator

Let's see if @Yuyan-Li wants to contribute a test; if not, I'll write one soonish.

@Yuyan-Li
Contributor Author

I can test it on my system, but I don't know how to write proper unit tests. I could write a sample script showing that it works, if that's enough.

@Yuyan-Li
Contributor Author

So I tested it, and the deserialisation works. I also fixed it so that the bar shows the proper epoch when continuing the training.
But there seems to be a problem with continuing the training after loading a checkpoint: my script crashes at the end with ConnectionRefusedError: [Errno 111] Connection refused in the dataloader. This should be unrelated to the TQDM bar, because it also happens without it.
I put the script I used below (it's heavily borrowed from tests/test_training/test_basic.py). Maybe someone has time to take a look.

def _make_test_model():
    import torch.nn as nn
    from inferno.extensions.layers.reshape import AsMatrix

    toy_net = nn.Sequential(nn.Conv2d(3, 8, 3, 1, 1),
                            nn.ELU(),
                            nn.MaxPool2d(2),
                            nn.Conv2d(8, 8, 3, 1, 1),
                            nn.ELU(),
                            nn.MaxPool2d(2),
                            nn.Conv2d(8, 16, 3, 1, 1),
                            nn.ELU(),
                            nn.AdaptiveAvgPool2d((1, 1)),
                            AsMatrix(),
                            nn.Linear(16, 10))
    return toy_net


def test_serialization():
    from inferno.trainers.basic import Trainer
    from inferno.trainers.callbacks import TQDMProgressBar
    from inferno.io.box.cifar import get_cifar10_loaders
    # Make model
    net = _make_test_model()
    # Make trainer
    trainer = Trainer(model=net) \
        .build_optimizer('Adam') \
        .build_criterion('CrossEntropyLoss') \
        .build_metric('CategoricalError') \
        .validate_every((1, 'epochs')) \
        .save_every((1, 'epochs'), to_directory='saves') \
        .set_max_num_iterations(500) \
        .register_callback(TQDMProgressBar())

    train_loader, validate_loader = get_cifar10_loaders(root_directory='.', download=True)
    trainer.bind_loader('train', train_loader)
    trainer.bind_loader('validate', validate_loader)

    # Try to train
    trainer.fit()
    # Try to serialize
    trainer.save()


def test_deserialization():
    from inferno.trainers.basic import Trainer
    from inferno.io.box.cifar import get_cifar10_loaders

    net = _make_test_model()
    # Try to unserialize
    trainer = Trainer(net).save_to_directory('saves').load()

    train_loader, validate_loader = get_cifar10_loaders(root_directory='.', download=True)
    trainer.bind_loader('train', train_loader)
    trainer.bind_loader('validate', validate_loader)

    # Try to continue training
    trainer.set_max_num_iterations(800)
    trainer.fit()


if __name__ == '__main__':
    test_serialization()
    test_deserialization()
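
If a proper test is needed, here is a minimal sketch of how the two functions above could be wrapped with the standard library's unittest module (the numeric prefixes only force serialization to run before deserialization; this is not inferno's test layout):

import unittest


class TestTrainerSerialization(unittest.TestCase):

    def test_1_serialization(self):
        test_serialization()

    def test_2_deserialization(self):
        test_deserialization()


if __name__ == '__main__':
    unittest.main()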

@svenpeter42
Member

The "training epoch x" bar isn't restored correctly. If you set trainer.set_max_num_iterations(800) to something larger, you will notice

Training epoch 1: : 500it [00:16, 29.77it/s] | 2/1000 [00:16<2:20:35, 8.45s/it]

even though each epoch only has 391 iterations.

@svenpeter42
Member

The "training epoch x" bar is not restored correctly. After loading, training resumes at iteration 0, but the bar seems to be restored at the iteration when the snapshot was saved.
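
One way to address this (a hedged sketch; the hook names are placeholders rather than inferno's callback API) would be to close the stale bar and open a fresh one whenever an epoch begins after resuming, so the counter starts from the actual iteration instead of the pickled value:

import tqdm


class EpochProgressBar(object):
    """Illustrative per-epoch bar that is rebuilt instead of restored."""

    def __init__(self):
        self.bar = None

    def begin_epoch(self, epoch, iterations_per_epoch):
        # Discard whatever counter survived (un)pickling and start from 0,
        # so a resumed run does not inherit the snapshot's progress.
        if self.bar is not None:
            self.bar.close()
        self.bar = tqdm.tqdm(total=iterations_per_epoch,
                             desc='Training epoch {}'.format(epoch))

    def end_of_iteration(self):
        self.bar.update(1)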
