[Bug] LMDeploy v0.6.4-cu12 fails to start and run inference on 2x RTX 4090, while v0.6.0-cu12 starts and runs inference normally #3062

Open
3 tasks done
simonwei97 opened this issue Jan 21, 2025 · 1 comment
simonwei97 commented Jan 21, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

Running inference with internlm2_5-20b-chat on RTX 4090 machines using versions v0.6.0 and v0.6.4:

  • openmmlab/lmdeploy:v0.6.0-cu12
    • 2x 4090: starts and runs inference successfully
    • 4x 4090: starts and runs inference successfully
  • openmmlab/lmdeploy:v0.6.4-cu12
    • 2x 4090: fails to start, aborting with a runtime_error from the pointer-map assertion (see the illustrative sketch after this list):
    terminate called after throwing an instance of 'std::runtime_error'
    what():  [TM][ERROR] pointer_mapping_ does not have information of ptr at 0x150ffa3000. Assertion fail: /opt/lmdeploy/src/turbomind/utils/allocator.h:284
    • 4x 4090: starts and runs inference successfully
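
For context, the assertion in allocator.h:284 is the kind raised when a buffer is released or re-registered through an allocator that never recorded that address in its internal pointer map. A minimal Python sketch of that pattern, purely illustrative and not TurboMind's actual implementation:

# Illustrative sketch only -- NOT TurboMind's code. It mimics an allocator that
# records every buffer it hands out in a pointer map and refuses to free an
# address it never recorded, which is what the assertion above reports.
class TrackingAllocator:
    def __init__(self):
        self.pointer_mapping = {}  # address -> size, analogous to `pointer_mapping_`

    def malloc(self, address: int, size: int) -> int:
        self.pointer_mapping[address] = size
        return address

    def free(self, address: int) -> None:
        if address not in self.pointer_mapping:
            raise RuntimeError(
                f"pointer_mapping_ does not have information of ptr at {hex(address)}")
        del self.pointer_mapping[address]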

Reproduction

Launch command with 2x 4090

  • --max-batch-size 64
  • --tp 2
lmdeploy serve api_server internlm2_5-20b-chat --cache-max-entry-count 0.8 --max-batch-size 64 --tp 2 --log-level INFO --server-port 80
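
If it helps narrow things down, the same TurboMind engine initialization can presumably be exercised without the HTTP server through the Python pipeline API. A minimal sketch, assuming the lmdeploy Python package and the same model path and settings as the 2x 4090 command above:

# Minimal sketch, assuming the lmdeploy Python API; mirrors the tp/batch/cache
# settings of the 2x 4090 api_server command above.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    tp=2,
    max_batch_size=64,
    cache_max_entry_count=0.8,
)
pipe = pipeline('internlm2_5-20b-chat', backend_config=engine_config)
print(pipe(['Hello, who are you?']))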

Launch command with 4x 4090

  • --max-batch-size 128
  • --tp 4
lmdeploy serve api_server internlm2_5-20b-chat --cache-max-entry-count 0.8 --max-batch-size 128 --tp 4 --log-level INFO --server-port 80
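
For completeness, once a server does come up (v0.6.0-cu12, or v0.6.4-cu12 with 4 GPUs), inference can be checked with a request like the following. This is a sketch assuming the openai Python client and that api_server exposes its usual OpenAI-compatible /v1 endpoints on port 80:

# Sketch assuming the `openai` Python client and LMDeploy's OpenAI-compatible
# /v1 endpoints; the model name should match what the server lists under /v1/models.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:80/v1", api_key="none")
resp = client.chat.completions.create(
    model="internlm2_5-20b-chat",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)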

Environment

sys.platform: linux
Python: 3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA GeForce RTX 4090 D
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.3.0+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.18.0+cu121
LMDeploy: 0.6.4+
transformers: 4.47.0
gradio: 5.8.0
fastapi: 0.115.6
pydantic: 2.10.3
triton: 2.3.0
NVIDIA Topology:
	GPU0	GPU1	GPU2	GPU3	NIC0	NIC1	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NODE	NODE	NODE	NODE	NODE	32-63,96-127	1		N/A
GPU1	NODE	 X 	NODE	NODE	NODE	NODE	32-63,96-127	1		N/A
GPU2	NODE	NODE	 X 	NODE	NODE	NODE	32-63,96-127	1		N/A
GPU3	NODE	NODE	NODE	 X 	NODE	NODE	32-63,96-127	1		N/A
NIC0	NODE	NODE	NODE	NODE	 X 	PIX
NIC1	NODE	NODE	NODE	NODE	PIX	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: rocep171s0f0
  NIC1: rocep171s0f1

Error traceback

2025-01-21 08:07:50,565 - lmdeploy - INFO - async_engine.py:143 - input backend=turbomind, backend_config=TurbomindEngineConfig(dtype='auto', model_format=None, tp=2, session_len=None, max_batch_size=64, cache_max_entry_count=0.8, cache_chunk_size=-1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2025-01-21 08:07:50,566 - lmdeploy - INFO - async_engine.py:145 - input chat_template_config=None
2025-01-21 08:07:50,635 - lmdeploy - INFO - async_engine.py:155 - updated chat_template_onfig=ChatTemplateConfig(model_name='internlm2', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2025-01-21 08:07:50,635 - lmdeploy - INFO - turbomind.py:301 - model_source: hf_model
2025-01-21 08:07:51,144 - lmdeploy - INFO - turbomind.py:200 - turbomind model config:

{
  "model_config": {
    "model_name": "",
    "chat_template": "",
    "model_arch": "InternLM2ForCausalLM",
    "head_num": 48,
    "kv_head_num": 8,
    "hidden_units": 6144,
    "vocab_size": 92544,
    "embedding_size": 92544,
    "num_layer": 48,
    "inter_size": [
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384,
      16384
    ],
    "norm_eps": 1e-05,
    "attn_bias": 0,
    "start_id": 1,
    "end_id": 2,
    "size_per_head": 128,
    "group_size": 64,
    "weight_type": "bfloat16",
    "session_len": 32768,
    "tp": 2,
    "model_format": "hf",
    "expert_num": [],
    "expert_inter_size": 0,
    "experts_per_token": 0,
    "moe_shared_gate": false,
    "norm_topk_prob": false,
    "routed_scale": 1.0,
    "topk_group": 1,
    "topk_method": "greedy",
    "moe_group_num": 1,
    "q_lora_rank": 0,
    "kv_lora_rank": 0,
    "qk_rope_dim": 0,
    "v_head_dim": 0,
    "tune_layer_num": 1
  },
  "attention_config": {
    "rotary_embedding": 128,
    "rope_theta": 50000000.0,
    "softmax_scale": 0.0,
    "attention_factor": -1.0,
    "max_position_embeddings": 32768,
    "original_max_position_embeddings": 0,
    "rope_scaling_type": "dynamic",
    "rope_scaling_factor": 2.5,
    "use_dynamic_ntk": 1,
    "low_freq_factor": 1.0,
    "high_freq_factor": 1.0,
    "beta_fast": 32.0,
    "beta_slow": 1.0,
    "use_logn_attn": 0,
    "cache_block_seq_len": 64
  },
  "lora_config": {
    "lora_policy": "",
    "lora_r": 0,
    "lora_scale": 0.0,
    "lora_max_wo_r": 0,
    "lora_rank_pattern": "",
    "lora_scale_pattern": ""
  },
  "engine_config": {
    "dtype": "auto",
    "model_format": null,
    "tp": 2,
    "session_len": null,
    "max_batch_size": 64,
    "cache_max_entry_count": 0.8,
    "cache_chunk_size": -1,
    "cache_block_seq_len": 64,
    "enable_prefix_caching": false,
    "quant_policy": 0,
    "rope_scaling_factor": 0.0,
    "use_logn_attn": false,
    "download_dir": null,
    "revision": null,
    "max_prefill_token_num": 8192,
    "num_tokens_per_iter": 8192,
    "max_prefill_iters": 4
  }
}
[TM][WARNING] [LlamaTritonModel] `max_context_token_num` is not set, default to 32768.
[TM][INFO] Model:
head_num: 48
kv_head_num: 8
size_per_head: 128
num_layer: 48
vocab_size: 92544
attn_bias: 0
max_batch_size: 64
max_prefill_token_num: 8192
max_context_token_num: 32768
num_tokens_per_iter: 8192
max_prefill_iters: 4
session_len: 32768
cache_max_entry_count: 0.8
cache_block_seq_len: 64
cache_chunk_size: -1
enable_prefix_caching: 0
start_id: 1
tensor_para_size: 2
pipeline_para_size: 1
enable_custom_all_reduce: 0
model_name:
model_dir:
quant_policy: 0
group_size: 64
expert_per_token: 0
moe_method: 1

[TM][INFO] TM_FUSE_SILU_ACT=1
2025-01-21 08:07:51,613 - lmdeploy - WARNING - turbomind.py:231 - get 581 model params
[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 201326592

[TM][INFO] [LlamaWeight<T>::prepare] workspace size: 201326592

[TM][WARNING] Devicle 0 peer access Device 1 is not available.
[TM][WARNING] Devicle 1 peer access Device 0 is not available.
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] [BlockManager] block_size = 6 MB
[TM][INFO] [BlockManager] max_block_count = 339
[TM][INFO] [BlockManager] block_size = 6 MB
[TM][INFO] [BlockManager] max_block_count = 339
[TM][INFO] [BlockManager] chunk_size = 339
[TM][INFO] [BlockManager] chunk_size = 339
[TM][WARNING] No enough blocks for `session_len` (32768), `session_len` truncated to 21696.
[TM][INFO] LlamaBatch<T>::Start()
[TM][INFO] LlamaBatch<T>::Start()
[TM][INFO] [Gemm2] Tuning sequence: 8, 16, 32, 48, 64, 96, 128, 192, 256, 384, 512, 768, 1024, 1536, 2048, 3072, 4096, 6144, 8192
[TM][INFO] [Gemm2] 8
[TM][INFO] [Gemm2] 16
[TM][INFO] [Gemm2] 32
[TM][INFO] [Gemm2] 48
[TM][INFO] [Gemm2] 64
[TM][INFO] [Gemm2] 96
[TM][INFO] [Gemm2] 128
[TM][INFO] [Gemm2] 192
[TM][INFO] [Gemm2] 256
[TM][INFO] [Gemm2] 384
[TM][INFO] [Gemm2] 512
[TM][INFO] [Gemm2] 768
[TM][INFO] [Gemm2] 1024
[TM][INFO] [Gemm2] 1536
[TM][INFO] [Gemm2] 2048
[TM][INFO] [Gemm2] 3072
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][INFO] [InternalThreadEntry] stop requested.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [TM][ERROR] pointer_mapping_ does not have information of ptr at 0x150ffa3000. Assertion fail: /opt/lmdeploy/src/turbomind/utils/allocator.h:284
simonwei97 changed the title on Jan 21, 2025 from "[Bug] LMDeploy v0.6.4-cu12 starts and runs inference on 2x 4090, while v0.6.0-cu12 starts and runs inference normally" to "[Bug] LMDeploy v0.6.4-cu12 fails to start and run inference on 2x 4090, while v0.6.0-cu12 starts and runs inference normally"
lvhan028 (Collaborator) commented

@zhulinJulia24 can you help reproduce it on 4090?
