Understand why DataLoader gets killed #35
Comments
Can you also share a config file? The exact command to run would work too! That's a kernel kill... I will try to reproduce.
I just monitored the memory usage (
Config to reproduce
Since I never do what I promise, I'll write down what I would try here...
I can reproduce the issue using the 12-feature configuration provided by @Temigo. Wrote a simple script to...
Here's the plot of the record, which shows a likely memory leak: usage increases linearly with iteration number (and time).
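For anyone who wants to repeat this kind of measurement, here is a minimal sketch of recording memory per iteration and fitting a line to it. This is not the script used above. The names `rss_bytes`, `record_memory`, and `linear_slope` are made up for illustration, and the `/proc/self/statm` reader assumes Linux. It only measures the calling process, so with multiple DataLoader workers the worker processes would have to be measured separately.

```python
import os


def rss_bytes():
    """Resident set size of the current process in bytes (Linux only)."""
    with open("/proc/self/statm") as f:
        # Second field of statm is the number of resident pages.
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def record_memory(batches, read_memory=rss_bytes):
    """Iterate over `batches`, recording (iteration, memory) after each one."""
    record = []
    for i, _batch in enumerate(batches):
        record.append((i, read_memory()))
    return record


def linear_slope(record):
    """Least-squares slope of memory versus iteration (bytes per iteration)."""
    n = len(record)
    mean_x = sum(x for x, _ in record) / n
    mean_y = sum(y for _, y in record) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in record)
    den = sum((x - mean_x) ** 2 for x, _ in record)
    return num / den
```

Passing the data loader as `batches` and calling `linear_slope` on the result gives a rough bytes-per-iteration growth rate. A slope that stays clearly positive over a long run would be consistent with a leak.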
... and I came back with a longer test, and I see more memory leak! To run more iterations in less time, I changed the config to read only 3 features and ran for 900 iterations using this config. Here's the result... We can see a clear memory increase. This is with batch size 64 and 4 workers. Next, I will try with 1 feature to see how much the memory increases. If the increase is about 1/3 of the increase in this test, which should be visible, then the leak is likely in the code that loads the feature channels. If the increase is similar, that would suggest the leak is somewhere else.
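The comparison described above can be written down as a small check. This is only a sketch of the reasoning: the function name `classify_leak` and the 0.15 tolerance are arbitrary choices for illustration, not something taken from the tests in this thread.

```python
def classify_leak(slope_1_feature, slope_3_features, tolerance=0.15):
    """Compare memory growth rates (e.g. bytes per iteration) of two runs.

    A ratio near 1/3 suggests the leak scales with the number of features;
    a ratio near 1 suggests it does not depend on the number of features.
    """
    if slope_3_features <= 0:
        # No measurable growth in the 3-feature run, so nothing to compare.
        return "inconclusive"
    ratio = slope_1_feature / slope_3_features
    if abs(ratio - 1.0 / 3.0) <= tolerance:
        return "scales with features"
    if abs(ratio - 1.0) <= tolerance:
        return "independent of features"
    return "inconclusive"
```

A ratio that falls between the two cases is reported as inconclusive, which is the situation where a longer run is needed.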
Not very conclusive. Running a test with 1800 iterations now.
I don't understand the behavior, but it's definitely leaking, and the leak correlates with the number of channels in the input data.
Especially on V100, training UResNet (uresnet_lonely from Temigo/lartpc_mlreco3d, branch temigo) with batch size 64 and spatial size 768px.