
Understand why DataLoader gets killed #35

Open

Temigo opened this issue Aug 9, 2019 · 11 comments

Labels: bug (Something isn't working)

Temigo (Member) commented Aug 9, 2019

This happens especially on V100 GPUs, when training UResNet (uresnet_lonely from Temigo/lartpc_mlreco3d, branch temigo) with batch size 64 and spatial size 768 px.

Traceback (most recent call last):
  File "/u/ki/ldomine/lartpc_mlreco3d/bin/run.py", line 33, in <module>
    main()
  File "/u/ki/ldomine/lartpc_mlreco3d/bin/run.py", line 28, in main
    train(cfg)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/main_funcs.py", line 36, in train
    train_loop(cfg, handlers)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/main_funcs.py", line 236, in train_loop
    res = handlers.trainer.train_step(data_blob)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/trainval.py", line 66, in train_step
    res_combined = self.forward(data_blob)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/trainval.py", line 83, in forward
    res = self._forward(blob)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/trainval.py", line 127, in _forward
    result = self._net(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/models/uresnet_lonely.py", line 140, in forward
    x = self.input((coords, features))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/sparseconvnet/ioLayers.py", line 63, in forward
    self.mode
  File "/usr/local/lib/python3.6/dist-packages/sparseconvnet/ioLayers.py", line 184, in forward
    mode
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 93813) is killed by signal: Killed.
drinkingkazu (Contributor) commented:

Can you also share a config file? Or the exact command to run would be good! That's a kernel kill... I will try to reproduce it.

Temigo (Member, Author) commented Aug 12, 2019

I just monitored the memory usage (free -h) while running my training on a single machine (nu-gpu) with 4 V100s at the same time. The jobs crash with this error when the machine runs out of memory. I think we already ran into this issue in the past and it is not a bug. @drinkingkazu please reopen if you think I am wrong and we should look into this more.
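For reference, a minimal sketch of this kind of monitoring (the interval and log file name are arbitrary), run in a second shell while the training job is going:

    import subprocess
    import time

    # Append the output of `free -h` to a log file every few seconds while
    # the training job runs in another shell.
    with open("free_mem.log", "a") as log:
        while True:
            snapshot = subprocess.check_output(["free", "-h"]).decode()
            log.write("%s\n%s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), snapshot))
            log.flush()
            time.sleep(5)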

Temigo closed this as completed Aug 12, 2019
drinkingkazu reopened this Aug 12, 2019
Temigo (Member, Author) commented Aug 12, 2019

  • In the past we were loading all data into RAM, so this is not related to what we saw previously (my mistake).
  • To clarify, it is not crashing because of using 4 GPUs at the same time, but because of running out of CPU memory.
  • The config was 12 features + image size 768 px + batch size 64 + 4 workers in the DataLoader, which uses a lot of CPU memory (~14 GB per worker).
  • Monitoring the memory usage over time, it seems to me that there is a memory leak somewhere.

Config to reproduce
https://gist.github.com/Temigo/9a58ecacbc3b0d58cd34078a5f6c92fa (12 features)
https://gist.github.com/Temigo/b067cd6f2b09bd3e30cd449d3a9dae72 (1 feature)

drinkingkazu (Contributor) commented:

Since I never do what I promise, I'll write down what I would try here...

  • I would try running the whole training process with the body of the parser functions commented out, returning the same numpy array (a constant empty array in global scope or something similar is fine) every time a parser is called; see the sketch after this list. If memory is stable, then the problem is within the parser.
  • If the above trial points to the parser, I would look into fill_3d_pcloud and fill_3d_voxel in larcv/core/PyUtils/PyUtils.cxx. One can comment out the whole function body, then un-comment it little by little to identify the location of the memory leak (but if you want to try something more direct, see below).
  • I think we need to insert PyArray_Free(pyarray, (void *)carray); at lines 157 and 215 anyway (and this might be the culprit).
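A minimal sketch of that first test (the parser name, array shape, and dtype below are placeholders, not the real lartpc_mlreco3d parser signature):

    import numpy as np

    # One constant array in global scope, so nothing new is allocated per call.
    # Shape and dtype are placeholders.
    _DUMMY_EVENT = np.zeros((1, 5), dtype=np.float32)

    def parse_dummy(data):
        # Stand-in for a real parser: ignore the larcv input and always return
        # the same pre-allocated array. If memory stays flat with this stub but
        # grows with the real parser, the leak is inside the parser / PyUtils.
        return _DUMMY_EVENT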

drinkingkazu added the bug (Something isn't working) label Aug 15, 2019
Temigo self-assigned this Aug 15, 2019
drinkingkazu (Contributor) commented:

I can reproduce the issue using the 12-feature configuration provided by @Temigo.

I wrote a simple script (roughly the sketch below) to...

  • Run only the data streaming with the DataLoader (copying the iotool section from the config file linked above)
  • Record the increase in memory usage (relative to the memory usage at process start), the iteration number, and the time (only one of iteration or time is really needed)
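Roughly a loop like this (a sketch only: DummyDataset stands in for whatever the iotool section actually builds, and psutil is an assumed way to read the resident memory of the process and its DataLoader workers):

    import time
    import numpy as np
    import psutil
    import torch
    from torch.utils.data import DataLoader, Dataset

    class DummyDataset(Dataset):
        # Stand-in for the dataset built from the iotool config section.
        def __len__(self):
            return 100000
        def __getitem__(self, idx):
            # Fake event: (N, 5) array of voxel coordinates + one feature value.
            return torch.from_numpy(np.random.rand(1000, 5).astype(np.float32))

    def total_rss(proc):
        # Resident memory of the main process plus all worker processes (bytes).
        return proc.memory_info().rss + sum(
            c.memory_info().rss for c in proc.children(recursive=True))

    if __name__ == "__main__":
        loader = DataLoader(DummyDataset(), batch_size=64, num_workers=4)
        proc = psutil.Process()
        start_mem, start_time = total_rss(proc), time.time()
        with open("mem_vs_iter.csv", "w") as log:
            log.write("iteration,seconds,mem_increase_bytes\n")
            for iteration, _batch in enumerate(loader):
                log.write("%d,%.1f,%d\n" % (
                    iteration, time.time() - start_time, total_rss(proc) - start_mem))
                log.flush()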

Here's a plot of the record, which shows a likely memory leak: usage increases linearly with iteration number (and time).

[Plot: memory usage increase vs. iteration, before the fix]

drinkingkazu (Contributor) commented:

... OK, it seems my guess was right... I implemented one of the suggestions I made earlier in this thread:

  • I think we need to insert PyArray_Free(pyarray, (void *)carray); at lines 157 and 215 anyway (and this might be the culprit).

With that implementation, here's the same script's output:
[Plot: memory usage increase vs. iteration, after adding PyArray_Free]

There's little evidence of a memory leak in this plot (memory fluctuations due to background processes dominate). We might record the free RAM in the train log (I will open another enhancement issue later) so that any long training run can be used to monitor this.
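For the train-log idea, a small helper along these lines could be called once per iteration and its output appended to the existing log (a sketch; psutil is an assumed dependency and the field names are arbitrary):

    import psutil

    def memory_snapshot():
        # System-wide memory numbers (in GB) to append to the per-iteration
        # training log, so any long training run doubles as a leak monitor.
        vm = psutil.virtual_memory()
        return {"mem_available_GB": vm.available / 1e9,
                "mem_used_GB": vm.used / 1e9}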

Anyhow, it seems this may be solved. I will close this after making the container image with the larcv2 fix available.

drinkingkazu (Contributor) commented:

... and I came back with a longer test, and I see more memory leaking! To run it longer (yet within a short period, so faster), I changed the config to read only 3 features and ran for 900 iterations using this config. Here's the result...

[Plot: memory usage vs. iteration, 3 features, 900 iterations]

We can see a clear memory increase. This is with batch size 64 and 4 workers. Next, I will try with 1 feature to see how much the memory increases. If the increase is 1/3 of this test's, which should be visible, then the leak is likely in loading the 3-channel data. If the increase is similar, that would suggest the leak is somewhere else.

drinkingkazu (Contributor) commented:

Here's a trial with 1-channel data: the memory increase over 900 iterations.

[Plot: memory usage vs. iteration, 1 feature, 900 iterations]

drinkingkazu (Contributor) commented:

  • 3-channel data ... the memory increased by roughly 2.3 GB from iteration 0 (~4.3 GB) to 700 (~6.6 GB)
  • 1-channel data ... the memory increased by roughly 1.3 GB from iteration 0 (~3.0 GB) to 700 (~4.3 GB)
  • In both cases, there's a drop in memory usage at iteration 700...?
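(Back of the envelope: ~2.3 GB over ~700 iterations is about 3.3 MB per iteration with 3 channels, versus ~1.3 GB over ~700 iterations, about 1.9 MB per iteration, with 1 channel, so the per-iteration leak grows with the channel count but not by a clean factor of 3.)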

Not very conclusive. I am running a test with 1800 iterations now.

drinkingkazu (Contributor) commented:

Here's what the 1-channel data looks like after running 1800 iterations (twice as long):

[Plot: memory usage vs. iteration, 1 feature, 1800 iterations]

drinkingkazu (Contributor) commented:

  • The memory leak takes a break around 700 iterations, then re-starts increasing around 1000 iterations.

I don't understand the behavior, but it is definitely leaking, and the leak is correlated with the number of channels in the input data.
