I tried to run the CUDA server (rllm-cuda) from within a container, but a thread panics:
running /workspace/aici/target/release/rllm-cuda --verbose --aicirt /workspace/aici/target/release/aicirt -m microsoft/phi-2@d3186761bf5c4409f7679359284066c25ab668ee -t phi -w /workspace/aici/rllm/rllm-cuda/expected/phi-2/cats.safetensors --host 0.0.0.0 INFO [rllm::server] explicit tokenizer: phi INFO [aicirt::bintokens] loading tokenizer: microsoft/phi-1_5 INFO [hf_hub] Token file not found "/root/.cache/huggingface/token" tokenizer.json [00:00:00] [████████████████████████████████████████████████████████████████] 2.02 MiB/2.02 MiB 3.86 MiB/s (0s)INFO [rllm::engine] TokTrie building: TokRxInfo { vocab_size: 50295, tok_eos: 50256 } wl=50295 INFO [hf_hub] Token file not found "/root/.cache/huggingface/token" INFO [rllm_cuda::llm::loader] loading the model from https://huggingface.co/microsoft/phi-2/resolve/d3186761bf5c4409f7679359284066c25ab668ee/ config.json [00:00:00] [█████████████████████████████████████████████████████████████████████████] 755 B/755 B 6.69 KiB/s (0s)INFO [aicirt::bintokens] loading tokenizer: microsoft/phi-1_5 INFO [hf_hub] Token file not found "/root/.cache/huggingface/token" INFO [aicirt::bintokens] loading tokenizer: microsoft/phi-1_5 INFO [hf_hub] Token file not found "/root/.cache/huggingface/token" Listening at http://0.0.0.0:4242 INFO [hf_hub] Token file not found "/root/.cache/huggingface/token" INFO [hf_hub] Token file not found "/root/.cache/huggingface/token" INFO [rllm_cuda::llm::loader] loading the model from https://huggingface.co/microsoft/phi-2/resolve/d3186761bf5c4409f7679359284066c25ab668ee/ INFO [actix_server::builder] starting 3 workers INFO [actix_server::server] Actix runtime found; starting in Actix runtime INFO [aicirt::bintokens] loading tokenizer: microsoft/phi-1_5 INFO [hf_hub] Token file not found "/root/.cache/huggingface/token" model.safetensors.index.json [00:00:00] [██████████████████████████████████████████████] 23.72 KiB/23.72 KiB 208.11 KiB/s (0s)..del-00001-of-00002.safetensors [00:00:16] 
[████████████████████████████████████████████] 4.64 GiB/4.64 GiB 280.05 MiB/s (0s)..del-00002-of-00002.safetensors [00:00:02] [████████████████████████████████████████] 550.21 MiB/550.21 MiB 252.67 MiB/s (0s)INFO [rllm_cuda::llm::loader] building the model INFO [rllm_cuda::llm::util] cuda mem: initial current: 0.000GiB, peak: 0.000GiB, allocated: 0.000GiB, freed: 0.000GiB [00:00:01] ████████████████████████████████████████████████████████████ 325/325 [00:00:00] INFO [rllm_cuda::llm::loader] model loaded INFO [rllm_cuda::llm::util] cuda mem: model fully loaded current: 5.196GiB, peak: 5.929GiB, allocated: 15.569GiB, freed: 10.373GiB INFO [rllm_cuda::llm::paged::batch_info] profile: BatchInfo { step_no: 0, tokens: Tensor[[2048], Int], positions: Tensor[[2048], Int64], seqlens_q: [0, 1948], seqlens_k: [0, 1948], gather_mapping: 1948, slot_mapping: 2048, max_seqlen_q: 1948, max_seqlen_k: 1948, paged_block_tables: Tensor[[100, 13], Int], paged_context_lens: Tensor[[100], Int], paged_block_size: 16, paged_max_context_len: 204, seqlen_multi: 1, q_multi: 1948 } INFO [rllm_cuda::llm::util] cuda mem: before model profile current: 5.196GiB, peak: 5.196GiB, allocated: 15.569GiB, freed: 10.373GiB killing 3806 thread '<unnamed>' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tch-0.14.0/src/wrappers/tensor_generated.rs:17495:36: called `Result::unwrap()` on an `Err` value: Torch("CUDA error: no kernel image is available for execution on the device\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n\nException raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x75dcee992617 in 
/usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)\nframe #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x75dcee94d98d in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)\nframe #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x75dcf06859f8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)\nframe #3: void at::native::gpu_kernel_impl<__nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (*)(at::TensorIteratorBase&), &at::native::direct_copy_kernel_cuda, 7u>, long (long)> >(at::TensorIteratorBase&, __nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (*)(at::TensorIteratorBase&), &at::native::direct_copy_kernel_cuda, 7u>, long (long)> const&) + 0x786 (0x75dcb36bee26 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)\nframe #4: void at::native::gpu_kernel<__nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (*)(at::TensorIteratorBase&), &at::native::direct_copy_kernel_cuda, 7u>, long (long)> >(at::TensorIteratorBase&, __nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (*)(at::TensorIteratorBase&), &at::native::direct_copy_kernel_cuda, 7u>, long (long)> const&) + 0x11b (0x75dcb36bf79b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)\nframe #5: at::native::direct_copy_kernel_cuda(at::TensorIteratorBase&) + 0x338 (0x75dcb36aa0c8 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)\nframe #6: at::native::copy_device_to_device(at::TensorIterator&, bool, bool) + 0xccd (0x75dcb36aae4d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)\nframe #7: <unknown function> + 0x1590e92 (0x75dcb36ace92 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)\nframe #8: <unknown function> + 0x1ac2ebf (0x75dc9cccaebf in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #9: at::native::copy_(at::Tensor&, at::Tensor 
const&, bool) + 0x62 (0x75dc9cccc1f2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #10: at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) + 0x15f (0x75dc9d9a54af in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #11: at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1bd5 (0x75dc9cf9a7d5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #12: <unknown function> + 0x2b2f12b (0x75dc9dd3712b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #13: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x75dc9d4a1425 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #14: <unknown function> + 0x295e793 (0x75dc9db66793 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #15: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x75dc9d4a1425 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #16: <unknown function> + 0x4020ecf (0x75dc9f228ecf in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #17: <unknown function> + 0x402147e (0x75dc9f22947e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #18: at::_ops::_to_copy::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1ee (0x75dc9d52894e in 
/usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #19: at::native::to(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x11b (0x75dc9cf9212b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #20: <unknown function> + 0x2d074d1 (0x75dc9df0f4d1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #21: at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x203 (0x75dc9d6bcc13 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so)\nframe #22: <unknown function> + 0x21fb75 (0x5c405b293b75 in /workspace/aici/target/release/rllm-cuda)\nframe #23: <unknown function> + 0x223a56 (0x5c405b297a56 in /workspace/aici/target/release/rllm-cuda)\nframe #24: <unknown function> + 0x206916 (0x5c405b27a916 in /workspace/aici/target/release/rllm-cuda)\nframe #25: <unknown function> + 0x202a72 (0x5c405b276a72 in /workspace/aici/target/release/rllm-cuda)\nframe #26: <unknown function> + 0x18da7a (0x5c405b201a7a in /workspace/aici/target/release/rllm-cuda)\nframe #27: <unknown function> + 0x1a419a (0x5c405b21819a in /workspace/aici/target/release/rllm-cuda)\nframe #28: <unknown function> + 0x1d163b (0x5c405b24563b in /workspace/aici/target/release/rllm-cuda)\nframe #29: <unknown function> + 0x135f4d (0x5c405b1a9f4d in /workspace/aici/target/release/rllm-cuda)\nframe #30: <unknown function> + 0x154fc2 (0x5c405b1c8fc2 in /workspace/aici/target/release/rllm-cuda)\nframe #31: <unknown function> + 0x81e305 (0x5c405b892305 in /workspace/aici/target/release/rllm-cuda)\nframe #32: <unknown function> + 0x94ac3 (0x75dc9ac31ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)\nframe #33: clone + 0x44 (0x75dc9acc2a04 in 
/usr/lib/x86_64-linux-gnu/libc.so.6)\n") stack backtrace: 0: rust_begin_unwind at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5 1: core::panicking::panic_fmt at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14 2: core::result::unwrap_failed at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/result.rs:1653:5 3: core::result::Result<T,E>::unwrap at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/result.rs:1077:23 4: tch::wrappers::tensor_generated::<impl tch::wrappers::tensor::Tensor>::totype at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tch-0.14.0/src/wrappers/tensor_generated.rs:17495:9 5: tch_cuda::reshape_and_cache at /workspace/aici/rllm/tch-cuda/src/lib.rs:200:24 6: rllm_cuda::llm::save_attn at ./src/llm/mod.rs:158:9 7: rllm_cuda::llm::varlen_attn at ./src/llm/mod.rs:332:5 8: rllm_cuda::llm::phi::MHA::forward at ./src/llm/phi.rs:147:17 9: rllm_cuda::llm::phi::ParallelBlock::forward at ./src/llm/phi.rs:173:28 10: <rllm_cuda::llm::phi::MixFormerSequentialForCausalLM as rllm_cuda::llm::tmodel::TModelInner>::forward at ./src/llm/phi.rs:215:18 11: rllm_cuda::llm::loader::profile_model at ./src/llm/loader.rs:196:23 12: rllm_cuda::llm::loader::load_rllm_engine at ./src/llm/loader.rs:173:22 13: <rllm_cuda::llm::tmodel::TModel as rllm::exec::ModelExec>::load_rllm_engine at ./src/llm/tmodel.rs:61:9 14: rllm::server::spawn_inference_loop::{{closure}} at /workspace/aici/rllm/rllm-base/src/server/mod.rs:473:13 note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
This is running within a GCP VM with the following configuration:
Steps to reproduce:
cd .devcontainer
sudo docker build . -f Dockerfile-cuda --tag aici
sudo docker run -it --rm -p 4242:4242 -v /path/to/aici/:/workspace/aici --gpus all aici /bin/bash
cd aici/rllm/rllm-cuda/
./server.sh phi2 --host 0.0.0.0
You'll need a GPU with compute capability 8.0 or later. Honestly, I have only tried an A100.
You can try llama.cpp with CUDA instead (./server.sh --cuda ... in rllm-llamacpp).
We definitely need a better error message.
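Until the error message improves, a quick pre-flight check can tell you whether the GPU meets the 8.0 requirement mentioned above. "No kernel image is available for execution on the device" generally means the CUDA kernels were not built for the GPU's architecture. The following is a minimal sketch, not part of aici: the function names `meets_min_cc` and `check_gpus` are hypothetical, and it assumes a reasonably recent NVIDIA driver whose `nvidia-smi` supports the `compute_cap` query field.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the aici repository.

# Returns 0 if the compute capability in $1 (format "major.minor")
# is at least the minimum in $2 (defaults to 8.0, per the comment above).
meets_min_cc() {
  local cc="$1" min="${2:-8.0}"
  local cc_major="${cc%%.*}" cc_minor="${cc#*.}"
  local min_major="${min%%.*}" min_minor="${min#*.}"
  if [ "$cc_major" -gt "$min_major" ]; then
    return 0
  fi
  if [ "$cc_major" -eq "$min_major" ] && [ "$cc_minor" -ge "$min_minor" ]; then
    return 0
  fi
  return 1
}

# Prints one line per visible GPU. Assumes nvidia-smi is on PATH and
# that the installed driver supports the compute_cap query field.
check_gpus() {
  nvidia-smi --query-gpu=compute_cap --format=csv,noheader |
    while read -r cc; do
      if meets_min_cc "$cc"; then
        echo "compute capability $cc: meets the 8.0 requirement"
      else
        echo "compute capability $cc: below 8.0, consider rllm-llamacpp with --cuda"
      fi
    done
}
```

Run `check_gpus` inside the container (started with `--gpus all`) before launching `./server.sh`. The comparison is done on the major and minor parts separately rather than as a decimal number, so a value such as 10.0 would still compare correctly against 8.0.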
Didn't try with an A100, but the --cuda option for the llamacpp server works, thanks!