IV. Potential errors

carlosuc3m edited this page Sep 13, 2023 · 2 revisions

JNA version incompatibility

One of the most common problems when trying to run Pytorch models with JDLL is due to the JNA library.

The JNA library gives Java access to native code, and Torchscript uses it to reach the Pytorch native code. Because of JDLL's custom classloading approach, loading the model will fail if a JNA JAR older than version 5 is already loaded in the classloader that is used as the parent of the new engine classloader. The Pytorch engine comes with JNA 5.13.0, the latest JAR version available. If this error happens to you, please consider updating the JNA version of the main software.

If updating your JNA version is completely impossible, there is one workaround that might help. Copy the JAR file of the JNA version you are using into the Pytorch engine folder, delete the JNA 5.13.0 JAR, and rename your old JAR so that it carries the file name of the deleted one.

If your JAR was jna-4.5.1.jar, rename it to jna-5.13.0.jar.
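The workaround above can also be scripted. The following is only a minimal sketch using the standard java.nio.file API; the class name JnaJarSwap and its swap method are hypothetical helpers and are not part of JDLL. Copying with REPLACE_EXISTING has the same effect as deleting the bundled JAR and renaming the old one.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of JDLL.
class JnaJarSwap {

    // File name of the JNA JAR bundled with the Pytorch engine, as stated above.
    static final String BUNDLED_NAME = "jna-5.13.0.jar";

    /**
     * Copies the old JNA JAR into the engine folder under the bundled file name,
     * overwriting the bundled JAR if it is present.
     */
    static Path swap(Path engineDir, Path oldJnaJar, String bundledName) throws IOException {
        Path target = engineDir.resolve(bundledName);
        return Files.copy(oldJnaJar, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: JnaJarSwap <pytorch-engine-folder> <old-jna-jar>");
            return;
        }
        Path result = swap(Paths.get(args[0]), Paths.get(args[1]), BUNDLED_NAME);
        System.out.println("Old JNA JAR copied to " + result);
    }
}
```

Consider keeping a backup of the original jna-5.13.0.jar so that the change can be reverted.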

Arm64, MacOS M, Rosetta

The Arm64 chips of the Mac M1 and Mac M2 bring new computational capabilities, but they also cause many backwards compatibility problems.

Rosetta was developed to reduce these backwards compatibility issues. Rosetta simulates an x86_64 chip so that software not compatible with arm64 chips can still be used. Java 8, for example, is not compatible with arm64, but Rosetta makes it possible to run the majority of Java 8 programs on Mac M1 and Mac M2.

However, not all software that is incompatible with arm64 can be run with Rosetta. In our case, Tensorflow 1 will not work on Arm64 based computers, as the bindings do not exist for this chip and Rosetta is not able to load the x86_64 ones. Tensorflow 2 will only work on those systems if used with the Arm64 bindings because, again, Rosetta is not able to load the Tensorflow 2 x86_64 native library.

This means that for example in ImageJ/Fiji, computers based on Arm64 chips (Mac M1 and M2) will not be able to use either Tensorflow 1 or Tensorflow 2 as ImageJ/Fiji still use Java 8.

This is not a JDLL issue; it is an incompatibility of the Tensorflow 1 and 2 libraries with Rosetta.

However, using Tensorflow 2 in Java software that runs on more modern Java versions, such as Icy, will work, with the improved speed that Arm64 chips bring.

Tensorflow 1 Java at the moment remains incompatible with Arm64 chips.
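To find out which case applies to a running application, it can help to check the architecture that the JVM reports. The following is only a sketch: ArchCheck and isArm64 are hypothetical names, not JDLL API, and the code assumes that the standard os.arch system property describes the JVM build (so an x86_64 JVM running under Rosetta is assumed not to report an Arm64 value).

```java
import java.util.Locale;

// Hypothetical helper, not part of JDLL.
class ArchCheck {

    /** Returns true if the given os.arch value names a 64-bit Arm architecture. */
    static boolean isArm64(String osArch) {
        String arch = osArch.toLowerCase(Locale.ROOT);
        return arch.equals("aarch64") || arch.equals("arm64");
    }

    public static void main(String[] args) {
        String arch = System.getProperty("os.arch");
        if (isArm64(arch)) {
            System.out.println("JVM reports " + arch + ": native Arm64 JVM.");
        } else {
            // Assumption: on a Mac M1 or M2 this means the JVM runs under Rosetta.
            System.out.println("JVM reports " + arch + ": not an Arm64 JVM.");
        }
    }
}
```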

Compatible versions

The field of Deep Learning is evolving constantly and rapidly, and the frameworks used to develop DL networks have to evolve with it. As a consequence, functions and methods that worked in previous versions might not work in future versions, and the improvements that appear in future versions do not exist in older ones.

For these reasons, guaranteeing the reproducibility of DL methods requires knowing which framework, and which version of it, was used to develop them. The Bioimage.io rdf.yaml file, for example, includes a field in which all this information is specified. In JDLL, the class EngineInfo needs to be defined for every model in order to load and run it. The purpose of these components is to identify the DL framework and version with which the model was trained, so that it loads and runs without problems.

In practice, the criterion is a little more flexible. The exact version with which a model was trained is not the only one capable of loading and running it. DL framework developers try to maximize backwards compatibility with previous versions of the framework. As a result, many versions of a framework are usually able to run the same model, and the higher the version, the more likely it is to run any given model. However, this might not always be true for models that use very rare or specific operations.

In order to ensure the maximum coverage of models running smoothly while trying to be as flexible as possible to improve usability, JDLL introduces the concept of compatible versions. Two versions from the same DL framework are considered compatible if the major version (first integer of the version String) is the same.
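The rule can be illustrated with a few lines of Java. This is a sketch of the rule as stated above, not JDLL's actual implementation; VersionCompat, major and isCompatible are hypothetical names.

```java
// Hypothetical helper, not part of JDLL.
class VersionCompat {

    /** Returns the first integer of a version String such as "2.7.0". */
    static int major(String version) {
        return Integer.parseInt(version.trim().split("\\.")[0]);
    }

    /** Two versions of the same framework are compatible if their major versions match. */
    static boolean isCompatible(String required, String installed) {
        return major(required) == major(installed);
    }

    public static void main(String[] args) {
        System.out.println(isCompatible("2.7.0", "2.11.0")); // prints true
        System.out.println(isCompatible("2.7.0", "1.15.0")); // prints false
    }
}
```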

When trying to load a model that requires Tensorflow 2.7.0, JDLL will look for a compatible installed version, and if none is installed it will not load the model. This means that as long as any Tensorflow 2 version is installed, JDLL will try to load the model. The same happens with Pytorch 1 or Tensorflow 1. Note that trying to load the model does not guarantee that it will load correctly. The only guarantee of loading a DL model correctly is to use the engine of the exact same version as the one it was trained with.

For more information about loading a model with a compatible installed version or with the exact required version please click here or here.

The safest approach is always to load models with the exact engine version that was used for training. The next safest approach is to install the most recent version of every framework and major version and run all the models with them. This covers the vast majority of models while keeping the number of installed engines very small, saving memory. JDLL provides ready-to-go methods for directly installing the newest engine for every framework and major version. For more info click here.

If loading and running a model fails, one way to try to solve the error is to load the model with the exact version needed for training, using either EngineInfo.defineDLEngine(String engine, String version, String jarsDirectory, boolean cpu, boolean gpu) or Model.createBioimageioModelWithExactWeigths(String bmzModelFolder, String enginesFolder).

On a side note, the degree of compatibility in JDLL depends on two factors. First, newer versions are more compatible; second in importance, closer subversions are more compatible.

For the set of versions [1.1, 1.12, 1.14, 1.32], ordering them by compatibility with version 1.13, from most compatible to least compatible, gives: [1.14, 1.32, 1.12, 1.1].
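One way to reproduce this ordering is sketched below. It is an illustration of the two factors described above, not JDLL's actual code: CompatibilityOrder, compareVersions and order are hypothetical names, and the sketch assumes that all candidates already share the major version of the target. Versions equal to or newer than the target come first, closest first, followed by the older versions, again closest first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of JDLL.
class CompatibilityOrder {

    /** Compares dotted versions numerically, treating missing components as 0. */
    static int compareVersions(String a, String b) {
        String[] pa = a.split("\\.");
        String[] pb = b.split("\\.");
        int n = Math.max(pa.length, pb.length);
        for (int i = 0; i < n; i++) {
            int x = i < pa.length ? Integer.parseInt(pa[i]) : 0;
            int y = i < pb.length ? Integer.parseInt(pb[i]) : 0;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return 0;
    }

    /** Orders installed versions from most to least compatible with the target. */
    static List<String> order(String target, List<String> installed) {
        List<String> newer = new ArrayList<>();
        List<String> older = new ArrayList<>();
        for (String v : installed) {
            if (compareVersions(v, target) >= 0) {
                newer.add(v);
            } else {
                older.add(v);
            }
        }
        Comparator<String> ascending = CompatibilityOrder::compareVersions;
        newer.sort(ascending);            // closest newer version first
        older.sort(ascending.reversed()); // closest older version first
        newer.addAll(older);
        return newer;
    }

    public static void main(String[] args) {
        List<String> installed = Arrays.asList("1.1", "1.12", "1.14", "1.32");
        System.out.println(order("1.13", installed)); // prints [1.14, 1.32, 1.12, 1.1]
    }
}
```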

Some methods to retrieve compatible versions:

InstalledEngines.getMostCompatibleVersionForEngine(String engine, String version, String enginesDir)

Get the version most compatible among the installed with respect to the DL framework defined by the engine argument and the version. All the engines evaluated are at the directory defined by enginesDir.

InstalledEngines.getOrderedListOfCompatibleVesionsForEngine(String engine, String version, String enginesDir)

Get a list of the installed compatible versions with respect to the DL framework defined by the engine argument and the version. All the engines evaluated are at the directory defined by enginesDir. The list is ordered from most compatible to least compatible.

Loading two engine versions in the same session

Another common pitfall, which only happens with Pytorch, is trying to load two different versions of Pytorch in the same session.

If you try to load two different models (or the same model twice) using two different versions of Pytorch with EngineInfo.defineDLEngine(String engine, String version, String jarsDirectory, boolean cpu, boolean gpu) or Model.createBioimageioModelWithExactWeigths(String bmzModelFolder, String enginesFolder), JDLL will not be able to load the second version of Pytorch. This happens because of incompatibilities between the native libraries of different Pytorch versions.

If you want to load two different versions of Pytorch, you will need to restart the JVM. Another option is to load only the latest version, which is also the most backwards compatible, and load both models with it.

This issue will be addressed in the future with the adoption of interprocessing.

Possible problems creating and loading a model

Needed engine is missing

If the engine required to load a model has not been installed, the model will not be able to load. To load the model, please install the required engine first. For more information, please click here, here or here.

Classloading problem

JDLL uses an engineered system of classloaders to avoid conflicts among Deep Learning frameworks. The particular classloader for each engine is a URLClassLoader that uses the Thread Context ClassLoader as its parent. Some applications manage classloaders in a different way, and the required classes might not be loaded in the Thread Context ClassLoader. This will cause an error when trying to build the engines.

To avoid this problem, JDLL provides special methods that allow setting the parent classloader of the engine classloader. These methods just require an extra argument, which should be a ClassLoader, when creating a model.

For more complete information on how to do this click here or here.
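The idea behind the parent classloader can be shown with the plain JDK alone. The sketch below does not use the JDLL API; EngineClassLoaderSketch and forEngine are hypothetical names, and an empty URL array stands in for the engine JARs. It only shows how a URLClassLoader is built over an explicitly chosen parent instead of the Thread Context ClassLoader.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical illustration, not part of JDLL.
class EngineClassLoaderSketch {

    /** Builds a child classloader over the given JARs with an explicit parent. */
    static URLClassLoader forEngine(URL[] engineJars, ClassLoader parent) {
        return new URLClassLoader(engineJars, parent);
    }

    public static void main(String[] args) throws Exception {
        // The parent described above: the Thread Context ClassLoader.
        ClassLoader contextLoader = Thread.currentThread().getContextClassLoader();
        // An alternative parent: the loader that loaded the application classes.
        ClassLoader appLoader = EngineClassLoaderSketch.class.getClassLoader();

        try (URLClassLoader viaContext = forEngine(new URL[0], contextLoader);
             URLClassLoader viaApp = forEngine(new URL[0], appLoader)) {
            System.out.println("Parent is context loader: " + (viaContext.getParent() == contextLoader));
            System.out.println("Parent is application loader: " + (viaApp.getParent() == appLoader));
        }
    }
}
```

Classes that the engine needs from the host application are resolved through the parent, which is why choosing a parent that actually holds those classes matters.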

Loading Onnx models

If you get an error trying to load an Onnx model, please review the engine you have installed. For older versions of Onnx, an engine that supports GPU does not support CPU, so if you do not have a GPU available with the corresponding CUDA installed, you will not be able to run your model on the CPU.

In older versions of Onnx, if you want to execute the model on the CPU, install the CPU only engine.

Loading Pytorch 1 in Windows systems

If you get an error trying to load a Pytorch 1 model on Windows, it is likely that the Visual Studio redistributables are not installed.

JDLL uses Deep Java Library as the backend to run Pytorch 1 models. This library requires the installation of Visual Studio redistributables to work. They should be Visual Studio 2019 or later redistributables.

To run Pytorch 1 models with JDLL, go to https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-170#visual-studio-2015-2017-2019-and-2022, download the required redistributables and install them.

Loading Pytorch models with CUDA installed

GPU connection for PyTorch models.

Having incompatible CUDA versions installed can be a source of conflict. On Windows, if a non-compatible version is installed, loading the model will fail. This is a known bug. For example, if Pytorch 1.13.1 is being used and CUDA 6.0 is installed, JDLL will not be able to load a model. When a CUDA version is installed but does not work with the Pytorch version, JDLL is not able to fall back to CPU mode.

If you experience this error:

  • Remove CUDA_PATH (if it exists) from your system environment variables.
  • Make sure that your PATH does not include any directory or file with the words Nvidia or CUDA.
    • Go to Edit the system environment variables or Edit environment variables for your account.
    • Click on Environment variables.
    • Check the Path and CUDA_PATH variables (note that Windows is not case sensitive so they might be written as PATH or path).
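
The same check can be done from Java with standard calls only. This is a sketch under stated assumptions: CudaPathCheck and suspiciousEntries are hypothetical names, not JDLL API, and the program only sees the environment that its own JVM was started with, so it should be launched after any change to the variables.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JDLL.
class CudaPathCheck {

    /** Returns the PATH entries that mention Nvidia or CUDA, ignoring case. */
    static List<String> suspiciousEntries(String pathValue, String separator) {
        List<String> hits = new ArrayList<>();
        if (pathValue == null) {
            return hits;
        }
        for (String entry : pathValue.split(Pattern.quote(separator))) {
            String lower = entry.toLowerCase(Locale.ROOT);
            if (lower.contains("nvidia") || lower.contains("cuda")) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String cudaPath = System.getenv("CUDA_PATH");
        System.out.println("CUDA_PATH: " + (cudaPath == null ? "not set" : cudaPath));
        for (String entry : suspiciousEntries(System.getenv("PATH"), File.pathSeparator)) {
            System.out.println("PATH entry to review: " + entry);
        }
    }
}
```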