# AUGMXNT/inference-benchmarks

We use shisa-7b-v1 (Mistral 7B with an extended tokenizer) to test inference performance.

Full spreadsheet: https://docs.google.com/spreadsheets/d/19YaxXkMJu7VweJihBMxQfMuz290Q3VxpqeG2DYCdRws/edit?usp=sharing

Each run uses a 512-token prompt and generates 512 tokens.

All tests were run on a Ryzen 5950X workstation with an RTX 4090 and an RTX 3090, using CUDA 12.3.1, around 2023-12-10.

- Python 3.11.5
- HF Transformers 4.35.2
- vLLM 0.2.3
- cTranslate2 3.23.0
- llama.cpp fe680e3 (build 1620)
- ExLlamaV2 0.0.10
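
The Avg Tok/s and Max Mem columns below come from the linked spreadsheet. As a rough illustration only, a minimal timing harness for the HF Transformers rows might look like the sketch below; the model ID, the random-token prompt, and the per-device memory readout are assumptions, not the harness actually used for these numbers.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "augmxnt/shisa-7b-v1"  # assumption: HF repo ID for shisa-7b-v1

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Stand-in for a real 512-token prompt: 512 random token IDs.
prompt_ids = torch.randint(100, tokenizer.vocab_size, (1, 512), device=model.device)
attention_mask = torch.ones_like(prompt_ids)

torch.cuda.reset_peak_memory_stats()
with torch.inference_mode():
    start = time.time()
    out = model.generate(
        prompt_ids,
        attention_mask=attention_mask,
        max_new_tokens=512,
        min_new_tokens=512,  # force a full 512-token generation
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    elapsed = time.time() - start

new_tokens = out.shape[-1] - prompt_ids.shape[-1]
print(f"avg tok/s: {new_tokens / elapsed:.2f}")
# Peak allocation on the current device only; the spreadsheet's Max Mem values
# may have been measured differently (e.g. via nvidia-smi across both GPUs).
print(f"max mem:   {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
```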
| Software | Settings | Avg Tok/s | Max Mem | Speedup | Max Mem % | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| HF Transformers | Baseline (FP32) | 1.48 | 47677.0 | 1.0 | 100.0 | |
| HF Transformers | BF16 | 3.88 | 46211.0 | 2.63 | 97.0 | |
| HF Transformers | BF16<br>torch.no_grad() | 3.88 | 46211.0 | 2.63 | 97.0 | |
| HF Transformers | BF16<br>torch.inference_mode() | 3.89 | 45495.0 | 2.63 | 95.0 | |
| HF Transformers | BF16<br>torch.inference_mode()<br>use_flash_attention_2=True | 4.32 | 47191.0 | 2.93 | 99.0 | |
| HF Transformers | BF16<br>torch.inference_mode()<br>use_flash_attention_2=True<br>Optimum BetterTransformer | | | | | BetterTransformer doesn't support Mistral |
| HF Transformers | BF16<br>torch.inference_mode()<br>use_flash_attention_2=True<br>SDPA flash attention | 4.3 | 46851.0 | 2.91 | 98.0 | |
| HF Transformers | BF16<br>load_in_8bit=True<br>torch.inference_mode()<br>use_flash_attention_2=True<br>SDPA flash attention | 2.07 | 42623.0 | 1.4 | 89.0 | UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization |
| HF Transformers | BF16<br>load_in_8bit=True<br>bnb_8bit_compute_dtype=torch.bfloat16<br>torch.inference_mode()<br>use_flash_attention_2=True<br>SDPA flash attention | 2.06 | 45127.0 | 1.4 | 95.0 | UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization |
| HF Transformers | FP16<br>load_in_8bit=True<br>bnb_4bit_compute_dtype=torch.float16<br>torch.inference_mode()<br>use_flash_attention_2=True<br>SDPA flash attention | 2.13 | 45511.0 | 1.44 | 95.0 | bitsandbytes-foundation/bitsandbytes#490 |
| HF Transformers | FP16<br>load_in_8bit=True<br>bnb_4bit_compute_dtype=torch.float16<br>torch.compile()<br>torch.inference_mode()<br>use_flash_attention_2=True<br>SDPA flash attention | 2.11 | 45509.0 | 1.43 | 95.0 | |
| HF Transformers | FP16<br>load_in_4bit=True<br>bnb_4bit_compute_dtype=torch.float16<br>torch.compile()<br>torch.inference_mode()<br>use_flash_attention_2=True<br>SDPA flash attention | 3.51 | 44101.0 | 2.37 | 92.0 | |
| vLLM | tensor_parallel_size=1 | 55.28 | 19958.0 | 37.44 | 42.0 | vLLM is fast even at batch size 1, but batching is per SamplingParams (one set per generate() call) and you can't batch with multiple seeds |
| vLLM | tensor_parallel_size=2 | 68.31 | 47843.0 | 46.27 | 100.0 | A copy on each GPU |
| vLLM | tensor_parallel_size=2<br>quantization='awq' | 86.81 | 47175.0 | 58.8 | 99.0 | |
| vLLM | tensor_parallel_size=2<br>pipeline_parallel_size=2<br>quantization='awq' | | | | | NotImplementedError: Pipeline parallelism is not supported yet. |
| cTranslate2 | | 55.86 | 16996.0 | 37.84 | 36.0 | Requires model conversion: https://opennmt.net/CTranslate2/conversion.html; missing some of the usual generation parameters; 4090 only |
| llama.cpp | fp16 | 40.93 | 17987.0 | 27.72 | 38.0 | convert_shisa.py; 4090+3090 |
| llama.cpp | fp16 | 54.6 | 15873.0 | 36.98 | 33.0 | 4090 only |
| llama.cpp | q8 | 48.95 | 11541.0 | 33.15 | 24.0 | 4090+3090 |
| llama.cpp | q8 | 87.85 | 9919.0 | 59.49 | 21.0 | 4090 only |
| llama.cpp | q4_k_m | 53.08 | 8271.0 | 35.94 | 17.0 | 4.63 BPW; 4090+3090 |
| llama.cpp | q4_k_m | 126.67 | 6701.0 | 85.78 | 14.0 | 4090 only |
| ExLlamaV2 | EXL2 8 BPW | 92.94 | 13688.0 | 62.96 | 29.0 | 4090 only |
| ExLlamaV2 | EXL2 4.63 BPW | 134.4 | 10856.0 | 91.06 | 23.0 | 4090 only |
| ExLlamaV2 | GPTQ Q4 GS128 actorder | 131.57 | 10938.0 | 89.12 | 23.0 | 4090 only |
| MLC LLM | q0f16 | | | | | |
| MLC LLM | q8f16_1 | | | | | |
| MLC LLM | q4f16_1 | | | | | mlc_chat_cli: symbol lookup error: ... mlc-llm/dist/shisa-7b-v1-q4f16_1/shisa-7b-v1-q4f16_1-cuda.so: undefined symbol: __cudaRegisterFatBinary |
| MLC LLM | autogptq_llama_q4f16_1 | | | | | |
| gpt-fast | | | | | | many issues... |
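
The HF Transformers settings above map onto `from_pretrained()` arguments. Below is a sketch of the fastest quantized configuration (4-bit bitsandbytes, fp16 compute, flash attention 2), assuming transformers 4.35.x where `use_flash_attention_2=True` is still the accepted flag (newer releases use `attn_implementation="flash_attention_2"`); the model ID is an assumption.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "augmxnt/shisa-7b-v1"  # assumption: HF repo ID

# load_in_4bit=True + bnb_4bit_compute_dtype=torch.float16 from the table above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    use_flash_attention_2=True,  # flash-attn 2 kernels (transformers 4.35.x flag)
    device_map="auto",
)
# model = torch.compile(model)   # the torch.compile() rows wrap the model like this
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

inputs = tokenizer("<512-token prompt here>", return_tensors="pt").to(model.device)
with torch.inference_mode():     # the torch.inference_mode() rows
    out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```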

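The vLLM rows correspond to `vllm.LLM` constructor arguments. A sketch of the two-GPU AWQ configuration follows; the AWQ checkpoint ID is an assumption, and `quantization='awq'` needs pre-quantized AWQ weights.

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size=2 shards the model across the 4090 and 3090.
llm = LLM(
    model="augmxnt/shisa-7b-v1-awq",  # assumption: an AWQ-quantized copy of the model
    tensor_parallel_size=2,
    quantization="awq",
)

# One SamplingParams object applies to the whole generate() call,
# which is why you can't mix seeds within a batch.
params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(["<512-token prompt here>"], params)
print(outputs[0].outputs[0].text)
```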
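The cTranslate2 row first requires converting the HF checkpoint (see the conversion link in the table); both the conversion and generation can be driven from Python. A sketch, with the output directory name as an assumption:

```python
import ctranslate2
import transformers

MODEL_ID = "augmxnt/shisa-7b-v1"  # assumption: HF repo ID
CT2_DIR = "shisa-7b-v1-ct2"       # assumption: local output directory

# One-time conversion of the HF checkpoint to CTranslate2 format.
ctranslate2.converters.TransformersConverter(MODEL_ID).convert(
    CT2_DIR, quantization="float16"
)

tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_ID)
generator = ctranslate2.Generator(CT2_DIR, device="cuda")

# CTranslate2 works on token strings rather than raw text.
prompt_tokens = tokenizer.convert_ids_to_tokens(
    tokenizer.encode("<512-token prompt here>")
)
results = generator.generate_batch(
    [prompt_tokens],
    max_length=512,
    include_prompt_in_result=False,
)
print(tokenizer.decode(results[0].sequences_ids[0]))
```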
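The ExLlamaV2 rows load pre-quantized EXL2 or GPTQ weights. Below is a sketch against the 0.0.10-era Python API, following the class names used in the project's example scripts; the model directory is an assumption, and the single-GPU `load()` matches the "4090 only" rows.

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

MODEL_DIR = "shisa-7b-v1-exl2-4.63bpw"  # assumption: directory of EXL2-quantized weights

config = ExLlamaV2Config()
config.model_dir = MODEL_DIR
config.prepare()

model = ExLlamaV2(config)
model.load()                            # single GPU; pass a gpu_split list for 4090+3090
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()  # default sampling settings
generator.warmup()

output = generator.generate_simple("<512-token prompt here>", settings, 512)
print(output)
```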