After running the command
pip install --verbose git+https://github.com/NVIDIA/TransformerEngine.git@stable
the installation is stuck at 97%.
Command output:
[ 95%] Building CUDA object CMakeFiles/transformer_engine.dir/comm_gemm_overlap/userbuffers/userbuffers.cu.o
/leonardo/prod/opt/compilers/cuda/12.1/none/bin/nvcc -forward-unknown-to-host-compiler -DNV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING -Dtransformer_engine_EXPORTS --options-file CMakeFiles/transformer_engine.dir/includes_CUDA.rsp -Wl,--version-script=/tmp/pip-req-build-91x1et59/transformer_engine/common/libtransformer_engine.version --expt-relaxed-constexpr -O3 --threads 1 -O3 -DNDEBUG -std=c++17 "--generate-code=arch=compute_70,code=[compute_70,sm_70]" "--generate-code=arch=compute_80,code=[compute_80,sm_80]" "--generate-code=arch=compute_89,code=[compute_89,sm_89]" "--generate-code=arch=compute_90,code=[compute_90,sm_90]" -Xcompiler=-fPIC -MD -MT CMakeFiles/transformer_engine.dir/comm_gemm_overlap/userbuffers/userbuffers.cu.o -MF CMakeFiles/transformer_engine.dir/comm_gemm_overlap/userbuffers/userbuffers.cu.o.d -x cu -c /tmp/pip-req-build-91x1et59/transformer_engine/common/comm_gemm_overlap/userbuffers/userbuffers.cu -o CMakeFiles/transformer_engine.dir/comm_gemm_overlap/userbuffers/userbuffers.cu.o
[ 97%] Building CXX object CMakeFiles/transformer_engine.dir/comm_gemm_overlap/comm_gemm_overlap.cpp.o
/usr/bin/c++ -DNV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING -Dtransformer_engine_EXPORTS -I/tmp/pip-req-build-91x1et59/transformer_engine/common/.. -I/tmp/pip-req-build-91x1et59/transformer_engine/common/include -I/tmp/pip-req-build-91x1et59/transformer_engine/common/../../3rdparty/cudnn-frontend/include -I/tmp/pip-req-build-91x1et59/build/cmake/string_headers -isystem /leonardo/prod/opt/compilers/cuda/12.1/none/targets/x86_64-linux/include -Wl,--version-script=/tmp/pip-req-build-91x1et59/transformer_engine/common/libtransformer_engine.version -O3 -DNDEBUG -std=gnu++17 -fPIC -MD -MT CMakeFiles/transformer_engine.dir/comm_gemm_overlap/comm_gemm_overlap.cpp.o -MF CMakeFiles/transformer_engine.dir/comm_gemm_overlap/comm_gemm_overlap.cpp.o.d -o CMakeFiles/transformer_engine.dir/comm_gemm_overlap/comm_gemm_overlap.cpp.o -c /tmp/pip-req-build-91x1et59/transformer_engine/common/comm_gemm_overlap/comm_gemm_overlap.cpp
/tmp/pip-req-build-91x1et59/transformer_engine/common/gemm/cublaslt_gemm.cu(70): warning #550-D: variable "counter" was set but never used
void *counter = nullptr;
^
Remark: The warnings can be suppressed with "-diag-suppress "
/tmp/pip-req-build-91x1et59/transformer_engine/common/gemm/cublaslt_gemm.cu(70): warning #550-D: variable "counter" was set but never used
void *counter = nullptr;
^
Remark: The warnings can be suppressed with "-diag-suppress "
/tmp/pip-req-build-91x1et59/transformer_engine/common/gemm/cublaslt_gemm.cu(70): warning #550-D: variable "counter" was set but never used
void *counter = nullptr;
^
Remark: The warnings can be suppressed with "-diag-suppress "
ERROR: Operation cancelled by user
Oof, that's quite frustrating. I suspect the build process is using too much parallelism and it's overwhelming your system resources. Try setting MAX_JOBS=1 and NVTE_BUILD_THREADS_PER_JOB=1 in your environment, although be advised it will be slow. If this works, you can try increasing parallelism in future builds (I usually use MAX_JOBS=4 and NVTE_BUILD_THREADS_PER_JOB=4). Here is some more guidance on common build problems.
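For anyone who wants to script the retry, here is a minimal sketch of the suggestion above. It only sets the two variables named in this reply (MAX_JOBS and NVTE_BUILD_THREADS_PER_JOB) and re-runs the pip command from the report. The helper names build_env, pip_command and install are made up for illustration and are not part of Transformer Engine or pip.

```python
import os
import subprocess
import sys

# Package spec copied verbatim from the original report.
TE_SPEC = "git+https://github.com/NVIDIA/TransformerEngine.git@stable"


def build_env(max_jobs=1, threads_per_job=1, base=None):
    """Return a copy of the environment with the two build knobs set.

    `base` defaults to the current process environment; it is a parameter
    only so the helper can be checked without touching os.environ.
    """
    env = dict(os.environ if base is None else base)
    env["MAX_JOBS"] = str(max_jobs)
    env["NVTE_BUILD_THREADS_PER_JOB"] = str(threads_per_job)
    return env


def pip_command():
    """Build the same pip invocation as in the report, via this interpreter."""
    return [sys.executable, "-m", "pip", "install", "--verbose", TE_SPEC]


def install(max_jobs=1, threads_per_job=1):
    """Run the install with limited parallelism and return pip's exit code."""
    result = subprocess.run(pip_command(), env=build_env(max_jobs, threads_per_job))
    return result.returncode
```

Calling install() runs the slow but safe 1/1 configuration; if that completes, install(4, 4) would correspond to the values I mentioned using myself. In a plain shell, the equivalent is to put the two assignments in front of the pip command.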
Debugging thoughts:
userbuffers.cu and comm_gemm_overlap.cpp are near the end of the list of source files (in my own build, they are 40th and 39th out of 42). Compilation must have started for almost all source files.
cublaslt_gemm.cu is in the middle of the list of source files (15th out of 42). The fact that its warning shows up after compilation has already started for userbuffers.cu and comm_gemm_overlap.cpp implies a huge amount of parallelism.