
Understand why DataLoader gets killed #35

Open

Temigo opened this issue Aug 9, 2019 · 11 comments

Labels: bug (Something isn't working)

Temigo (Member) commented Aug 9, 2019

This happens especially on V100 GPUs, when training UResNet (uresnet_lonely from Temigo/lartpc_mlreco3d, branch temigo) with batch size 64 and spatial size 768 px.

Traceback (most recent call last):
  File "/u/ki/ldomine/lartpc_mlreco3d/bin/run.py", line 33, in <module>
    main()
  File "/u/ki/ldomine/lartpc_mlreco3d/bin/run.py", line 28, in main
    train(cfg)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/main_funcs.py", line 36, in train
    train_loop(cfg, handlers)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/main_funcs.py", line 236, in train_loop
    res = handlers.trainer.train_step(data_blob)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/trainval.py", line 66, in train_step
    res_combined = self.forward(data_blob)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/trainval.py", line 83, in forward
    res = self._forward(blob)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/trainval.py", line 127, in _forward
    result = self._net(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/u/ki/ldomine/lartpc_mlreco3d/mlreco/models/uresnet_lonely.py", line 140, in forward
    x = self.input((coords, features))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/sparseconvnet/ioLayers.py", line 63, in forward
    self.mode
  File "/usr/local/lib/python3.6/dist-packages/sparseconvnet/ioLayers.py", line 184, in forward
    mode
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 93813) is killed by signal: Killed.
drinkingkazu (Contributor) commented:

Can you also share a config file? Or the exact command to run would be good! That's a kernel kill... I will try to reproduce it.

Temigo (Member, Author) commented Aug 12, 2019

I just monitored the memory usage (free -h) while running my training on a single machine (nu-gpu) with 4 V100s at the same time. The jobs crash with this error when the machine runs out of memory. I think we already ran into this issue in the past and it is not a bug. @drinkingkazu please reopen if you think I am wrong and we should look into this more.
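For reference, a minimal sketch of this kind of monitoring (the interval and log file name are arbitrary), run in a second shell while the training job is going:

    import subprocess
    import time

    # Append the output of `free -h` to a log file every few seconds while
    # the training job runs in another shell.
    with open("free_mem.log", "a") as log:
        while True:
            snapshot = subprocess.check_output(["free", "-h"]).decode()
            log.write("%s\n%s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), snapshot))
            log.flush()
            time.sleep(5)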

Temigo closed this as completed Aug 12, 2019
drinkingkazu reopened this Aug 12, 2019
Temigo (Member, Author) commented Aug 12, 2019

  • In the past we were loading all data into RAM, so this is not related to what we saw previously (my mistake).
  • To clarify, it is not crashing because of using 4 GPUs at the same time, but because of running out of CPU memory.
  • The config was 12 features + image size 768 px + batch size 64 + 4 workers in the DataLoader, which uses a lot of CPU memory (~14 GB per worker).
  • Monitoring the memory usage over time, it seems to me that there is a memory leak somewhere.

Config to reproduce
https://gist.github.com/Temigo/9a58ecacbc3b0d58cd34078a5f6c92fa (12 features)
https://gist.github.com/Temigo/b067cd6f2b09bd3e30cd449d3a9dae72 (1 feature)

drinkingkazu (Contributor) commented:

Since I never do what I promise, I'll write down what I would try here...

  • I would try running the whole training process with the body of the parser functions commented out, returning the same numpy array (a constant empty array in global scope or something similar is fine) every time a parser is called; see the sketch after this list. If memory is stable, then the problem is within the parser.
  • If the above trial points to the parser, I would look into fill_3d_pcloud and fill_3d_voxel in larcv/core/PyUtils/PyUtils.cxx. One can comment out the whole function body, then un-comment it little by little to identify the location of the memory leak (but if you want to try something more direct, see below).
  • I think we need to insert PyArray_Free(pyarray, (void *)carray); at lines 157 and 215 anyway (and this might be the culprit).
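A minimal sketch of that first test (the parser name, array shape, and dtype below are placeholders, not the real lartpc_mlreco3d parser signature):

    import numpy as np

    # One constant array in global scope, so nothing new is allocated per call.
    # Shape and dtype are placeholders.
    _DUMMY_EVENT = np.zeros((1, 5), dtype=np.float32)

    def parse_dummy(data):
        # Stand-in for a real parser: ignore the larcv input and always return
        # the same pre-allocated array. If memory stays flat with this stub but
        # grows with the real parser, the leak is inside the parser / PyUtils.
        return _DUMMY_EVENT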

drinkingkazu added the bug (Something isn't working) label Aug 15, 2019
Temigo self-assigned this Aug 15, 2019
drinkingkazu (Contributor) commented:

I can reproduce the issue using the 12-feature configuration provided by @Temigo.

I wrote a simple script (roughly the sketch below) to...

  • Run only the data streaming with the DataLoader (copying the iotool section from the config file linked above)
  • Record the increase in memory usage (relative to the memory usage at process start), the iteration number, and the time (only one of iteration or time is really needed)
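Roughly a loop like this (a sketch only: DummyDataset stands in for whatever the iotool section actually builds, and psutil is an assumed way to read the resident memory of the process and its DataLoader workers):

    import time
    import numpy as np
    import psutil
    import torch
    from torch.utils.data import DataLoader, Dataset

    class DummyDataset(Dataset):
        # Stand-in for the dataset built from the iotool config section.
        def __len__(self):
            return 100000
        def __getitem__(self, idx):
            # Fake event: (N, 5) array of voxel coordinates + one feature value.
            return torch.from_numpy(np.random.rand(1000, 5).astype(np.float32))

    def total_rss(proc):
        # Resident memory of the main process plus all worker processes (bytes).
        return proc.memory_info().rss + sum(
            c.memory_info().rss for c in proc.children(recursive=True))

    if __name__ == "__main__":
        loader = DataLoader(DummyDataset(), batch_size=64, num_workers=4)
        proc = psutil.Process()
        start_mem, start_time = total_rss(proc), time.time()
        with open("mem_vs_iter.csv", "w") as log:
            log.write("iteration,seconds,mem_increase_bytes\n")
            for iteration, _batch in enumerate(loader):
                log.write("%d,%.1f,%d\n" % (
                    iteration, time.time() - start_time, total_rss(proc) - start_mem))
                log.flush()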

Here's a plot of the record, which shows a likely memory leak: usage increases linearly with iteration number (and time).

[Plot: memory usage increase vs. iteration, before the fix]

drinkingkazu (Contributor) commented:

... OK, it seems my guess was right... I implemented one of the suggestions I made earlier in this thread:

  • I think we need to insert PyArray_Free(pyarray, (void *)carray); at lines 157 and 215 anyway (and this might be the culprit).

With that implementation, here's the same script's output:
[Plot: memory usage increase vs. iteration, after adding PyArray_Free]

There's little evidence of a memory leak in this plot (memory fluctuations due to background processes dominate). We might record the free RAM in the train log (I will open another enhancement issue later) so that any long training run can be used to monitor this.
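For the train-log idea, a small helper along these lines could be called once per iteration and its output appended to the existing log (a sketch; psutil is an assumed dependency and the field names are arbitrary):

    import psutil

    def memory_snapshot():
        # System-wide memory numbers (in GB) to append to the per-iteration
        # training log, so any long training run doubles as a leak monitor.
        vm = psutil.virtual_memory()
        return {"mem_available_GB": vm.available / 1e9,
                "mem_used_GB": vm.used / 1e9}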

Anyhow, it seems this may be solved. I will close this after making the container image with the larcv2 fix available.

drinkingkazu (Contributor) commented:

... and I came back with a longer test, and I see more memory leaking! To run it longer (yet within a short period, so faster), I changed the config to read only 3 features and ran for 900 iterations using this config. Here's the result...

[Plot: memory usage vs. iteration, 3 features, 900 iterations]

We can see a clear memory increase. This is with batch size 64 and 4 workers. Next, I will try with 1 feature to see how much the memory increases. If the increase is 1/3 of this test's, which should be visible, then the leak is likely in loading the 3-channel data. If the increase is similar, that would suggest the leak is somewhere else.

drinkingkazu (Contributor) commented:

Here's a trial with 1-channel data: the memory increase over 900 iterations.

[Plot: memory usage vs. iteration, 1 feature, 900 iterations]

drinkingkazu (Contributor) commented:

  • 3-channel data ... the memory increased by roughly 2.3 GB from iteration 0 (~4.3 GB) to 700 (~6.6 GB)
  • 1-channel data ... the memory increased by roughly 1.3 GB from iteration 0 (~3.0 GB) to 700 (~4.3 GB)
  • In both cases, there's a drop in memory usage at iteration 700...?
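(Back of the envelope: ~2.3 GB over ~700 iterations is about 3.3 MB per iteration with 3 channels, versus ~1.3 GB over ~700 iterations, about 1.9 MB per iteration, with 1 channel, so the per-iteration leak grows with the channel count but not by a clean factor of 3.)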

Not very conclusive. I am running a test with 1800 iterations now.

drinkingkazu (Contributor) commented:

Here's what the 1-channel data looks like after running 1800 iterations (twice as long):

[Plot: memory usage vs. iteration, 1 feature, 1800 iterations]

drinkingkazu (Contributor) commented:

  • The memory leak takes a break around 700 iterations, then re-starts increasing around 1000 iterations.

I don't understand the behavior, but it is definitely leaking, and the leak is correlated with the number of channels in the input data.
