Mesmer randomly fails on HPC using singularity due to model loading #43

FloWuenne · 2023-12-18T14:40:49Z

Description of the bug

The DEEPCELL_MESMER module intermittently fails on the HPC while loading the TensorFlow model inside the Singularity container. The failure looks random: the same command can succeed on one run and fail on another.

Command used and terminal output

nextflow run nf-core/molkart -r 6c1eef828896a5e60fefc9aa2398ad76ab41ec63 -profile singularity -c ./core_molkart_MI.conf -params-file ./params.yml -with-tower -resume

Command error:
  2023-12-18 15:27:38.508036: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/.singularity.d/libs
  2023-12-18 15:27:38.508066: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
  2023-12-18 15:27:42.757496: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/.singularity.d/libs
  2023-12-18 15:27:42.757524: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
  2023-12-18 15:27:42.757542: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (m03n06): /proc/driver/nvidia/version does not exist
  2023-12-18 15:27:42.758824: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
  To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
  Traceback (most recent call last):
    File "/usr/src/app/run_app.py", line 60, in <module>
      run_application(dict(ARGS._get_kwargs()))
    File "/usr/src/app/deepcell_applications/app_runners.py", line 52, in run_application
      app = dca.utils.get_app(arg_dict['app'])
    File "/usr/src/app/deepcell_applications/utils.py", line 44, in get_app
      return app_map[name]['class'](**kwargs)
    File "/usr/local/lib/python3.8/dist-packages/deepcell/applications/mesmer.py", line 222, in __init__
      model = tf.keras.models.load_model(model_path)
    File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
      raise e.with_traceback(filtered_tb) from None
    File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/saved_model/load.py", line 991, in load_internal
      raise ValueError("SavedModels saved from Tensorflow 1.x or Estimator (any"
  ValueError: SavedModels saved from Tensorflow 1.x or Estimator (any version) cannot be loaded with node filters.

Work dir:
  /gpfs/bwfor/work/ws/hd_gr294-MIproject_nfcore_molkart/data/Molecular_Cartography/work/4f/9358489573beb1dcf90f8accbc901b

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

 -- Check '.nextflow.log' file for details
WARN: Tower request field `workflow.errorMessage` exceeds expected size | offending value: `2023-12-18 15:27:38.508036: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/.singularity.d/libs
2023-12-18 15:27:38.508066: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-12-18 15:27:42.757496: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/.singularity.d/libs
2023-12-18 15:27:42.757524: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2023-12-18 15:27:42.757542: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (m03n06): /proc/driver/nvidia/version does not exist
2023-12-18 15:27:42.758824: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/usr/src/app/run_app.py", line 60, in <module>
    run_application(dict(ARGS._get_kwargs()))
  File "/usr/src/app/deepcell_applications/app_runners.py", line 52, in run_application
    app = dca.utils.get_app(arg_dict['app'])
  File "/usr/src/app/deepcell_applications/utils.py", line 44, in get_app
    return app_map[name]['class'](**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepcell/applications/mesmer.py", line 222, in __init__
    model = tf.keras.models.load_model(model_path)
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/saved_model/load.py", line 991, in load_internal
    raise ValueError("SavedModels saved from Tensorflow 1.x or Estimator (any"
ValueError: SavedModels saved from Tensorflow 1.x or Estimator (any version) cannot be loaded with node filters.`, size: 2462 (max: 255)

Relevant files

No response

System information

HPC: https://wiki.bwhpc.de/e/Helix
Executor: Slurm
Container engine: singularity
nextflow version 23.10.0.5889

FloWuenne · 2024-01-09T13:52:23Z

This is likely caused by specific jobs not having enough RAM available. When sufficient RAM is supplied to all processes, the failure should not occur.
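For anyone hitting the same error, here is a minimal sketch of how the memory request for this step could be raised in the custom config passed with `-c` (for example `./core_molkart_MI.conf`). The `32.GB` starting value and the retry count are assumptions for illustration, not tested values; adjust them to your images and cluster limits.

```groovy
// Sketch only: memory value and retry count are illustrative assumptions.
process {
    withName: 'DEEPCELL_MESMER' {
        // Request more RAM, and scale it up on each retry attempt.
        memory        = { 32.GB * task.attempt }
        // Resubmit the task if it fails, up to maxRetries times.
        errorStrategy = 'retry'
        maxRetries    = 2
    }
}
```

Because the task here ends with a Python `ValueError` rather than an out-of-memory kill signal, the sketch retries on any failure instead of filtering on a specific exit code. If the pipeline release you run caps resources through a parameter such as `max_memory`, that cap may also need to be raised for the larger request to take effect.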

FloWuenne added the "bug" (Something isn't working) label Dec 18, 2023

FloWuenne closed this as completed Jan 9, 2024
