[BUG] OOM in DeepPot.eval_descriptor while dp test works #4544
Comments
Perhaps because the descriptor tensor is not detached.
Fix deepmodeling#4544. Signed-off-by: Jinzhe Zeng <[email protected]>
I can reproduce the error, though I am not sure how to fix it.
@njzjz I tried replacing the single eval_descriptor call on the whole LabeledSystem with a for-loop that calls eval_descriptor on one System at a time (in […]).
The test was done after the modification in #4547.
I suspect the key to the OOM is evaluating the whole LabeledSystem all at once.
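The per-System loop described in the comment above could be sketched as follows. This is only an illustration, not the code from the issue: `frame_chunks` and `eval_in_chunks` are hypothetical helper names, and `eval_fn` stands in for whatever evaluation call is being made. The idea is to pass a few frames at a time to the evaluator, so that only one chunk of descriptors is computed per call.

```python
from typing import Any, Callable, Iterator, List, Sequence


def frame_chunks(nframes: int, chunk_size: int) -> Iterator[slice]:
    """Yield slices that cover range(nframes) in pieces of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, nframes, chunk_size):
        yield slice(start, min(start + chunk_size, nframes))


def eval_in_chunks(
    eval_fn: Callable[[Sequence[Any], Sequence[Any]], Any],
    coords: Sequence[Any],
    cells: Sequence[Any],
    chunk_size: int = 1,
) -> List[Any]:
    """Call eval_fn on consecutive frame chunks and collect the results.

    eval_fn is a hypothetical stand-in; it receives the coords and cells
    of one chunk and returns whatever the evaluator produces for it.
    """
    results = []
    for sl in frame_chunks(len(coords), chunk_size):
        results.append(eval_fn(coords[sl], cells[sl]))
    return results


# Assumed usage with DeePMD-kit and dpdata (not executed here):
#   dp = DeepPot("model.pth")
#   descs = eval_in_chunks(
#       lambda c, b: dp.eval_descriptor(c, b, system["atom_types"]),
#       system["coords"], system["cells"], chunk_size=1,
#   )
#   descriptor = numpy.concatenate(descs, axis=0)
```

With `chunk_size=1` this matches the one-System-at-a-time loop from the comment; a larger chunk size trades memory for fewer calls.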
Bug summary
When loading a trained model in a Python environment and calling the DeepPot.eval_descriptor function on my test LabeledSystem, an OOM error occurs on my A100-40G hardware.
But dp test with the same model on the same LabeledSystem dataset completes, using ~39 GB of memory at first and then dropping to 28 GB.
DeePMD-kit Version
DeePMD-kit v3.0.0rc1.dev0+g0ad42893.d20250106
Backend and its version
Pytorch 2.5.1
How did you download the software?
pip
Input Files, Running Commands, Error Log, etc.
All reference files are available in dp_eval_desc_oom.tar.gz
or through Nutshull Cloud
https://www.jianguoyun.com/p/DV3eAQoQrZ-XCRim4ecFIAA (code : unpbns)
Steps to Reproduce
The archive contains the following files, explained below:
calc_desc.py: uses DeepPot.eval_descriptor() to generate the descriptor of a LabeledSystem
desc_all.sh: reads MultiSystems from data and calls calc_desc.py iteratively
data: directory containing the LabeledSystems that lead to OOM in eval_descriptor
model.pth: the model in use
test.sh: the script that calls dp --pt test and writes test.log
desc.log: the OOM stderr output
descrtptors: the directory meant to hold the output descriptors, which should be empty due to the OOM
One can check the OOM problem directly with these files.
Further Information, Files, and Links
Related issue: #4533
If one calls the DeepPot.eval() function directly in Python code, a similar OOM problem also emerges (from my previous test in Jan 2024). So I guess there is some difference between running dp test on the command line and using the evaluation interface directly in Python.