Installation stuck at 97% #1399

Open
lorenzbaraldi opened this issue Jan 10, 2025 · 1 comment

@lorenzbaraldi

After running the command pip install --verbose git+https://github.com/NVIDIA/TransformerEngine.git@stable, the installation gets stuck at 97%.

Command output:

[ 95%] Building CUDA object CMakeFiles/transformer_engine.dir/comm_gemm_overlap/userbuffers/userbuffers.cu.o
/leonardo/prod/opt/compilers/cuda/12.1/none/bin/nvcc -forward-unknown-to-host-compiler -DNV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING -Dtransformer_engine_EXPORTS --options-file CMakeFiles/transformer_engine.dir/includes_CUDA.rsp -Wl,--version-script=/tmp/pip-req-build-91x1et59/transformer_engine/common/libtransformer_engine.version --expt-relaxed-constexpr -O3 --threads 1 -O3 -DNDEBUG -std=c++17 "--generate-code=arch=compute_70,code=[compute_70,sm_70]" "--generate-code=arch=compute_80,code=[compute_80,sm_80]" "--generate-code=arch=compute_89,code=[compute_89,sm_89]" "--generate-code=arch=compute_90,code=[compute_90,sm_90]" -Xcompiler=-fPIC -MD -MT CMakeFiles/transformer_engine.dir/comm_gemm_overlap/userbuffers/userbuffers.cu.o -MF CMakeFiles/transformer_engine.dir/comm_gemm_overlap/userbuffers/userbuffers.cu.o.d -x cu -c /tmp/pip-req-build-91x1et59/transformer_engine/common/comm_gemm_overlap/userbuffers/userbuffers.cu -o CMakeFiles/transformer_engine.dir/comm_gemm_overlap/userbuffers/userbuffers.cu.o
[ 97%] Building CXX object CMakeFiles/transformer_engine.dir/comm_gemm_overlap/comm_gemm_overlap.cpp.o
/usr/bin/c++ -DNV_CUDNN_FRONTEND_USE_DYNAMIC_LOADING -Dtransformer_engine_EXPORTS -I/tmp/pip-req-build-91x1et59/transformer_engine/common/.. -I/tmp/pip-req-build-91x1et59/transformer_engine/common/include -I/tmp/pip-req-build-91x1et59/transformer_engine/common/../../3rdparty/cudnn-frontend/include -I/tmp/pip-req-build-91x1et59/build/cmake/string_headers -isystem /leonardo/prod/opt/compilers/cuda/12.1/none/targets/x86_64-linux/include -Wl,--version-script=/tmp/pip-req-build-91x1et59/transformer_engine/common/libtransformer_engine.version -O3 -DNDEBUG -std=gnu++17 -fPIC -MD -MT CMakeFiles/transformer_engine.dir/comm_gemm_overlap/comm_gemm_overlap.cpp.o -MF CMakeFiles/transformer_engine.dir/comm_gemm_overlap/comm_gemm_overlap.cpp.o.d -o CMakeFiles/transformer_engine.dir/comm_gemm_overlap/comm_gemm_overlap.cpp.o -c /tmp/pip-req-build-91x1et59/transformer_engine/common/comm_gemm_overlap/comm_gemm_overlap.cpp
/tmp/pip-req-build-91x1et59/transformer_engine/common/gemm/cublaslt_gemm.cu(70): warning #550-D: variable "counter" was set but never used
void *counter = nullptr;
^

Remark: The warnings can be suppressed with "-diag-suppress "

/tmp/pip-req-build-91x1et59/transformer_engine/common/gemm/cublaslt_gemm.cu(70): warning #550-D: variable "counter" was set but never used
void *counter = nullptr;
^

Remark: The warnings can be suppressed with "-diag-suppress "

/tmp/pip-req-build-91x1et59/transformer_engine/common/gemm/cublaslt_gemm.cu(70): warning #550-D: variable "counter" was set but never used
void *counter = nullptr;
^

Remark: The warnings can be suppressed with "-diag-suppress "

ERROR: Operation cancelled by user

@timmoon10
Collaborator

Oof, that's quite frustrating. I suspect the build process is using too much parallelism and it's overwhelming your system resources. Try setting MAX_JOBS=1 and NVTE_BUILD_THREADS_PER_JOB=1 in your environment, although be advised it will be slow. If this works, you can try increasing parallelism in future builds (I usually use MAX_JOBS=4 and NVTE_BUILD_THREADS_PER_JOB=4). Here is some more guidance on common build problems.
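
For reference, one way to apply the suggestion above is a small shell wrapper around the install command from the original report. This is only a sketch: the function name te_install is hypothetical and not part of Transformer Engine, while MAX_JOBS and NVTE_BUILD_THREADS_PER_JOB are the two variables named in this comment. Both default to 1 here, which is the slow but safe setting.

```shell
# Hypothetical helper (not part of Transformer Engine): rerun the install from
# the original report with build parallelism capped.
#   $1 -> MAX_JOBS                    (defaults to 1)
#   $2 -> NVTE_BUILD_THREADS_PER_JOB  (defaults to 1)
te_install() {
    # The variables are set only for this one pip invocation.
    MAX_JOBS="${1:-1}" NVTE_BUILD_THREADS_PER_JOB="${2:-1}" \
        pip install --verbose "git+https://github.com/NVIDIA/TransformerEngine.git@stable"
}
```

Calling te_install with no arguments runs the fully serialized build. If that completes, te_install 4 4 corresponds to the higher setting mentioned above.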

Debugging thoughts:

  • userbuffers.cu and comm_gemm_overlap.cpp are near the end of the list of source files (in my own build, they are 40th and 39th out of 42). Compilation must have started for almost all source files.
  • cublaslt_gemm.cu is in the middle of the list of source files (15th out of 42). The fact that its warning shows up after userbuffers.cu and comm_gemm_overlap.cpp have started means files far apart in the list were compiling at the same time, which implies a huge amount of parallelism.
