Mesmer randomly fails on HPC using singularity due to model loading #43

Closed
FloWuenne opened this issue Dec 18, 2023 · 1 comment
Labels
bug Something isn't working


@FloWuenne
Collaborator

Description of the bug

The DEEPCELL_MESMER module intermittently fails on the HPC cluster. The failure occurs while TensorFlow loads the Mesmer model inside the Singularity container.

Command used and terminal output

nextflow run nf-core/molkart -r 6c1eef828896a5e60fefc9aa2398ad76ab41ec63 -profile singularity -c ./core_molkart_MI.conf -params-file ./params.yml -with-tower -resume

Command error:
  2023-12-18 15:27:38.508036: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/.singularity.d/libs
  2023-12-18 15:27:38.508066: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
  2023-12-18 15:27:42.757496: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/.singularity.d/libs
  2023-12-18 15:27:42.757524: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
  2023-12-18 15:27:42.757542: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (m03n06): /proc/driver/nvidia/version does not exist
  2023-12-18 15:27:42.758824: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
  To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
  Traceback (most recent call last):
    File "/usr/src/app/run_app.py", line 60, in <module>
      run_application(dict(ARGS._get_kwargs()))
    File "/usr/src/app/deepcell_applications/app_runners.py", line 52, in run_application
      app = dca.utils.get_app(arg_dict['app'])
    File "/usr/src/app/deepcell_applications/utils.py", line 44, in get_app
      return app_map[name]['class'](**kwargs)
    File "/usr/local/lib/python3.8/dist-packages/deepcell/applications/mesmer.py", line 222, in __init__
      model = tf.keras.models.load_model(model_path)
    File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
      raise e.with_traceback(filtered_tb) from None
    File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/saved_model/load.py", line 991, in load_internal
      raise ValueError("SavedModels saved from Tensorflow 1.x or Estimator (any"
  ValueError: SavedModels saved from Tensorflow 1.x or Estimator (any version) cannot be loaded with node filters.

Work dir:
  /gpfs/bwfor/work/ws/hd_gr294-MIproject_nfcore_molkart/data/Molecular_Cartography/work/4f/9358489573beb1dcf90f8accbc901b

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
WARN: Tower request field `workflow.errorMessage` exceeds expected size | offending value: `2023-12-18 15:27:38.508036: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/.singularity.d/libs
2023-12-18 15:27:38.508066: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-12-18 15:27:42.757496: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/.singularity.d/libs
2023-12-18 15:27:42.757524: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2023-12-18 15:27:42.757542: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (m03n06): /proc/driver/nvidia/version does not exist
2023-12-18 15:27:42.758824: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/usr/src/app/run_app.py", line 60, in <module>
    run_application(dict(ARGS._get_kwargs()))
  File "/usr/src/app/deepcell_applications/app_runners.py", line 52, in run_application
    app = dca.utils.get_app(arg_dict['app'])
  File "/usr/src/app/deepcell_applications/utils.py", line 44, in get_app
    return app_map[name]['class'](**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepcell/applications/mesmer.py", line 222, in __init__
    model = tf.keras.models.load_model(model_path)
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/saved_model/load.py", line 991, in load_internal
    raise ValueError("SavedModels saved from Tensorflow 1.x or Estimator (any"
ValueError: SavedModels saved from Tensorflow 1.x or Estimator (any version) cannot be loaded with node filters.`, size: 2462 (max: 255)

Relevant files

No response

System information

HPC : https://wiki.bwhpc.de/e/Helix
Executor: Slurm
Container engine: singularity
nextflow version 23.10.0.5889

@FloWuenne added the bug label Dec 18, 2023
@FloWuenne
Collaborator Author

This is likely caused by insufficient RAM being allocated to specific jobs. When every process is given sufficient RAM, the error should not occur.
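For anyone hitting the same error, a minimal sketch of a custom config that raises the memory request for the Mesmer process, assuming the process can be selected by the name `DEEPCELL_MESMER` as it appears in the bug description. The `32.GB` starting value and the retry count are illustrative assumptions, not values tested on this cluster; adjust them to your image sizes and node limits.

```groovy
// Hypothetical override, e.g. added to the file passed with -c.
process {
    withName: 'DEEPCELL_MESMER' {
        // Assumed starting value; scales up on each retry attempt.
        memory        = { 32.GB * task.attempt }
        // Resubmit the job if it fails, with the larger memory request.
        errorStrategy = 'retry'
        maxRetries    = 2
    }
}
```

Because the pipeline is already launched with `-c ./core_molkart_MI.conf`, the block could live in that file, and `-resume` would then rerun only the failed Mesmer tasks.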
