DeepSeek-V3 Architecture

FYI, I ran the math through O1 (no code execution), Sonnet 3.5 (JS code execution), and Gemini 2.0 Pro (Python code execution) w/ the config JSON and the Python modeling code to try to get a good sense of the architecture and some more exact stats. Hopefully this is broadly right (corrections welcome):

  • 28.81B parameters activated per forward pass / 452.82B total parameters
  • Hybrid architecture: 3 dense layers + 58 MoE layers (8 routed experts activated out of 256, plus 1 shared expert)
  • Uses the YaRN RoPE extension to reach the 163840 (160K) token context
  • FP16 weights: 905.65 GB, FP8 weights: 452.82 GB
  • FP16 KV cache: 286.55 GB, FP8 KV cache: 143.28 GB (at the full 163840-token context)

At FP8 everything, it might just fit into a single H100 node; otherwise you'd need two, or an H200 or MI300X node...
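
The weight-memory numbers are just the parameter estimate times bytes per parameter; a minimal sketch of that arithmetic (the 452.82B figure is this note's estimate, not an official count, and 1 GB is taken as 1e9 bytes):

```python
# Weight memory = parameter count * bytes per parameter (1 GB taken as 1e9 bytes).
TOTAL_PARAMS = 452.82e9   # total-parameter estimate from above

for name, bytes_per_param in [("FP16", 2), ("FP8", 1)]:
    print(f"{name} weights: ~{TOTAL_PARAMS * bytes_per_param / 1e9:.2f} GB")
# -> roughly the 905.65 GB / 452.82 GB figures quoted above
```

An H100 node has 8 × 80 GB = 640 GB of HBM, which is where "might just fit at FP8" comes from: ~453 GB of weights leaves under 200 GB for KV cache and activations, while H200 (8 × 141 GB) or MI300X (8 × 192 GB) nodes give more headroom.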

vs Llama 3

Here is a comparison to Llama 3:

| Parameter | DeepSeek-V3 | Llama3-70B | Llama3-405B |
|---|---|---|---|
| Hidden Size | 7168 | 8192 | 16384 |
| Num Layers | 61 | 80 | 126 |
| Attn Heads | 128 | 64 | 128 |
| KV Heads | 128 | 8 | 8 |
| GQA Ratio | 1:1 | 8:1 | 16:1 |
| Head Dim | 56 | 128 | 128 |
| Interm Size | 18432 | 28672 | 53248 |
| Context Len | 163840 | 8192 | 131072 |
| Vocab Size | 129280 | 128256 | 128256 |
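
The GQA rows translate directly into KV-cache cost. Below is a standard per-token KV-cache estimate (2 × layers × KV heads × head dim × bytes) at FP16, using the table's values; the DeepSeek-V3 row at its full context length reproduces the 286.55 GB figure above. Note this follows the table in treating DeepSeek-V3 as 128 full KV heads of dim 56; its MLA attention actually caches a compressed latent, so the real cache is much smaller.

```python
# Per-token KV cache = 2 (K and V) * num_layers * kv_heads * head_dim * bytes_per_value.
# Values are taken from the comparison table above; FP16 storage assumed.
MODELS = {
    # name: (num_layers, kv_heads, head_dim, context_len)
    "DeepSeek-V3": (61, 128, 56, 163840),
    "Llama3-70B":  (80, 8, 128, 8192),
    "Llama3-405B": (126, 8, 128, 131072),
}
FP16_BYTES = 2

for name, (layers, kv_heads, head_dim, ctx) in MODELS.items():
    per_token = 2 * layers * kv_heads * head_dim * FP16_BYTES   # bytes per token
    print(f"{name}: {per_token / 1e6:.2f} MB/token, "
          f"{per_token * ctx / 1e9:.2f} GB at {ctx} tokens")
```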

FFN Expansion Ratios:

  • DeepSeek-V3 Dense Layers: 2.57x
  • DeepSeek-V3 MoE Experts: 0.29x (but with 257 experts)
  • Llama3-70B: 3.50x
  • Llama3-405B: 3.25x

Effective FFN Dimensions per Token:

  • DeepSeek-V3 Dense Layers: 18432
  • DeepSeek-V3 MoE Layers: 16384 (2048 × 8 experts)
  • Llama3-70B: 28672
  • Llama3-405B: 53248
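
These ratios and per-token FFN dimensions fall straight out of the table above; a quick sketch of the arithmetic:

```python
# Expansion ratio = FFN intermediate size / hidden size.
# For MoE layers, the effective per-token FFN width is
# (per-expert intermediate size) * (experts activated per token).
DSV3_HIDDEN, L70B_HIDDEN, L405B_HIDDEN = 7168, 8192, 16384
DSV3_DENSE_INTER = 18432     # DeepSeek-V3 dense-layer FFN
DSV3_EXPERT_INTER = 2048     # DeepSeek-V3 per-expert FFN
DSV3_TOP_K = 8               # routed experts activated per token

print(f"DSv3 dense expansion:      {DSV3_DENSE_INTER / DSV3_HIDDEN:.2f}x")   # 2.57x
print(f"DSv3 per-expert expansion: {DSV3_EXPERT_INTER / DSV3_HIDDEN:.2f}x")  # 0.29x
print(f"DSv3 MoE effective width:  {DSV3_EXPERT_INTER * DSV3_TOP_K}")        # 16384
print(f"Llama3-70B expansion:      {28672 / L70B_HIDDEN:.2f}x")              # 3.50x
print(f"Llama3-405B expansion:     {53248 / L405B_HIDDEN:.2f}x")             # 3.25x
```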

vs Snowflake Arctic

The dense+MoE hybrid is maybe best compared to Snowflake Arctic (128 experts). Snowflake runs its MoE with parallel routing (more like Switch Transformer?) while DeepSeek-V3 is sequential (GLaM?), but they arrive at similar effective intermediate sizes (in most other ways DeepSeek-V3 is bigger and badder, but that's to be expected); a schematic of the two layouts follows the tables below:

| Parameter | DeepSeek-V3 | Arctic |
|---|---|---|
| Hidden Size | 7168 | 7168 |
| Num Layers | 61 | 35 |
| Attention Heads | 128 | 56 |
| KV Heads | 128 | 8 |
| GQA Ratio | 1:1 | 7:1 |
| Head Dimension | 56 | 128 |
| Context Length | 163840 | 4096 |
| Vocab Size | 129280 | 32000 |

MoE Architecture:

| Parameter | DeepSeek-V3 | Arctic |
|---|---|---|
| Architecture | 3 dense + 58 MoE layers | Dense-MoE hybrid (parallel) |
| Num Experts | 257 | 128 |
| Experts/Token | 8 | 2 |
| Base Params | ~10B | 10B |
| Expert Size | ~1.7B | 3.66B |
| Total Params | ~452B | ~480B |
| Active Params | ~29B | ~17B |
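
To make the parallel-vs-sequential distinction concrete, here is a rough schematic of the layer wiring the two configs imply; this is a sketch of the idea, not either model's actual implementation (the router, expert weighting, and norms are all omitted):

```python
# Schematic only: attn, dense_ffn and moe_ffn are stand-ins for the real sub-modules.

def deepseek_v3_moe_layer(x, attn, moe_ffn):
    # Sequential (GLaM-ish): in the 58 MoE layers the routed + shared experts
    # simply take the place of the dense FFN in the usual attn -> FFN stack.
    x = x + attn(x)
    x = x + moe_ffn(x)   # top-8 routed experts + 1 shared expert
    return x

def arctic_moe_layer(x, attn, dense_ffn, moe_ffn):
    # Parallel (Arctic-style): a small dense "residual" FFN path runs alongside
    # the MoE path in every layer, and the two contributions are summed.
    x = x + attn(x)
    x = x + dense_ffn(x) + moe_ffn(x)   # dense path + top-2 of 128 experts
    return x
```

This is also why the Arctic numbers below count the dense path and MoE path together: per token, its FFN width is the dense intermediate plus the activated experts' intermediates.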

FFN Expansion Ratios (DeepSeek-V3):

  • Dense Layers: 2.57x
  • MoE Layers (per expert): 0.29x
  • MoE effective expansion: 2.29x

Effective FFN Dimensions per Token (DeepSeek-V3):

  • Dense Layers: 18432
  • MoE Layers: 16384 (2048 × 8 experts)

FFN Expansion Ratios (Arctic):

  • Dense (Residual) Path: 1.00x
  • MoE Path (per expert): 0.68x
  • Combined effective expansion: 2.36x

Effective FFN Dimensions per Token (Arctic):

  • Dense Path: 7168
  • MoE Path: 9728 (4864 × 2 experts)
  • Total: 16896
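
Same sanity check for Arctic, summing the dense and MoE paths since they run in parallel (values from the lists above):

```python
ARCTIC_HIDDEN = 7168      # same hidden size as DeepSeek-V3
DENSE_PATH_INTER = 7168   # dense (residual) FFN intermediate size
EXPERT_INTER = 4864       # per-expert intermediate size
TOP_K = 2                 # experts activated per token

moe_path = EXPERT_INTER * TOP_K                 # 9728
combined = DENSE_PATH_INTER + moe_path          # 16896
print(f"dense path expansion:   {DENSE_PATH_INTER / ARCTIC_HIDDEN:.2f}x")  # 1.00x
print(f"per-expert expansion:   {EXPERT_INTER / ARCTIC_HIDDEN:.2f}x")      # 0.68x
print(f"combined per-token dim: {combined}")                               # 16896
print(f"combined expansion:     {combined / ARCTIC_HIDDEN:.2f}x")          # 2.36x
```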

Reference

Other resources: https://simonwillison.net/2024/Dec/25/deepseek-v3/

LiveBench Results: https://www.reddit.com/r/LocalLLaMA/comments/1hm4959/benchmark_results_deepseek_v3_on_livebench/