Mixture of Experts

  1. Vanilla implementation

  2. max_model_len: derived from the model's max_position_embeddings when not set explicitly
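     A quick way to check where that default comes from (a sketch, assuming the Mixtral-8x22B checkpoint
     and that this vLLM build falls back to the HF config when --max-model-len is unset):

     ```python
     from transformers import AutoConfig

     # When --max-model-len is not passed, vLLM derives the context limit from the
     # model config; for most decoder-only models this is max_position_embeddings.
     cfg = AutoConfig.from_pretrained("mistralai/Mixtral-8x22B-v0.1")
     print(cfg.max_position_embeddings)  # 65536 for Mixtral-8x22B
     ```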

  3. Graph Capture Size

    1. max_seq_len_to_capture: naming note (this parameter was previously called max_context_len_to_capture)
    2. _verify_cuda_graph clamps it to the model length:
      self.max_seq_len_to_capture = min(self.max_seq_len_to_capture, self.max_model_len)
    3. _get_graph_batch_size: returns the padded batch size for a given actual batch size; the captured sizes are 1, 2, 4, _BATCH_SIZE_ALIGNMENT, 2*_BATCH_SIZE_ALIGNMENT, 3*_BATCH_SIZE_ALIGNMENT, ... (see the sketch below)
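     A minimal sketch of that padding rule, assuming _BATCH_SIZE_ALIGNMENT = 8 as in recent vLLM
     releases (the exact source may differ by version):

     ```python
     _BATCH_SIZE_ALIGNMENT = 8  # assumed value; check the vLLM version in use

     def _get_graph_batch_size(batch_size: int) -> int:
         """Pad an actual batch size up to the next captured size: 1, 2, 4, 8, 16, 24, ..."""
         if batch_size <= 2:
             return batch_size
         if batch_size <= 4:
             return 4
         # Round up to the next multiple of the alignment.
         return ((batch_size + _BATCH_SIZE_ALIGNMENT - 1)
                 // _BATCH_SIZE_ALIGNMENT * _BATCH_SIZE_ALIGNMENT)

     assert _get_graph_batch_size(3) == 4
     assert _get_graph_batch_size(9) == 16
     assert _get_graph_batch_size(32) == 32
     ```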
  4. Graph usage condition

    1. _use_captured_graph : link
    2. Print log for Mixtral-8x22B with --input-len 8192 --output-len 3 --batch-size 32
       (max_seq_len_to_capture is set to 8192 + 256 = 8448; a sketch of the combined condition follows the log):
               decode_only: False && not enforce_eager: False
               ,  batch_size: 65536 <= _BATCH_SIZES_TO_CAPTURE: 8192
               ,  max_decode_seq_len: 0, max_encoder_seq_len: 0 <=  max_seq_len_to_capture: 8448
                  batch_size: 65536 <= max_batchsize_to_capture: 256
                  --> result (_use_captured_graph) = False
             (the same prefill-phase check is logged three more times with identical values)
               decode_only: True && not enforce_eager: False
               ,  batch_size: 32 <= _BATCH_SIZES_TO_CAPTURE: 8192
               ,  max_decode_seq_len: **8193**, max_encoder_seq_len: 0 <=  max_seq_len_to_capture: 8448
                  batch_size: 32 <= max_batchsize_to_capture: 256
                  --> result (_use_captured_graph) = True
               decode_only: True && not enforce_eager: False
               ,  batch_size: 32 <= _BATCH_SIZES_TO_CAPTURE: 8192
               ,  max_decode_seq_len: **8194**, max_encoder_seq_len: 0 <=  max_seq_len_to_capture: 8448
                  batch_size: 32 <= max_batchsize_to_capture: 256
                  --> result (_use_captured_graph) = True
               decode_only: True && not enforce_eager: False
               ,  batch_size: 32 <= _BATCH_SIZES_TO_CAPTURE: 8192
               ,  max_decode_seq_len: **8195**, max_encoder_seq_len: 0 <=  max_seq_len_to_capture: 8448
                  batch_size: 32 <= max_batchsize_to_capture: 256
                  --> result (_use_captured_graph) = True
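       Putting the logged conditions together, the decision is roughly the conjunction below
       (a sketch reconstructed from the log, not the exact vLLM source; attribute names are assumptions):

       ```python
       def _use_captured_graph(self, batch_size: int, max_decode_seq_len: int,
                               max_encoder_seq_len: int = 0) -> bool:
           # Every condition printed in the log above must hold for the captured CUDA graph to run.
           return (self.decode_only                               # prefill batches never use the graph
                   and not self.enforce_eager                     # --enforce-eager disables capture
                   and batch_size <= _BATCH_SIZES_TO_CAPTURE[-1]  # largest captured batch size (8192 above)
                   and batch_size <= self.max_batchsize_to_capture
                   and max_decode_seq_len <= self.max_seq_len_to_capture
                   and max_encoder_seq_len <= self.max_seq_len_to_capture)
       ```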
           
  5. Chunked prefill is enabled automatically for max_model_len > 32k: link (see the sketch below)
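     It can also be toggled explicitly; a sketch, assuming the Mixtral checkpoint name and that the
     installed vLLM version exposes these engine arguments:

     ```python
     from vllm import LLM

     # enable_chunked_prefill can be forced on; when left unset, recent vLLM versions
     # enable it automatically for long-context models (max_model_len > 32k).
     llm = LLM(model="mistralai/Mixtral-8x22B-v0.1",
               enable_chunked_prefill=True,
               max_num_batched_tokens=2048)  # prefill chunk budget per scheduler step
     ```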

  6. Decode latency IS affected by prefill length.

    • BS=240, output=200

      | Input-len | Decode latency |
      |-----------|----------------|
      | 512       | 63 ms          |
      | 1024      | 66 ms          |
      | 2048      | 69 ms          |

      Reason: the KV cache is larger for a longer context length, so the paged_attn kernel takes more time
      (see the back-of-the-envelope sketch below).
                 BS=240 | _paged_attn kernel time: In512: 41 us  --vs--  In2048: 121 us
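      A back-of-the-envelope check of the reason above (assuming Mixtral-8x22B attention shapes:
      56 layers, 8 KV heads, head_dim 128, fp16 KV cache; the numbers are illustrative only):

      ```python
      # Per-token KV cache = 2 (K and V) * num_kv_heads * head_dim * bytes_per_element, per layer.
      layers, kv_heads, head_dim, dtype_bytes = 56, 8, 128, 2
      per_token = 2 * kv_heads * head_dim * dtype_bytes * layers   # bytes of KV per token

      for ctx in (512, 1024, 2048):
          gib = 240 * ctx * per_token / 2**30                      # BS=240 sequences at this context length
          print(f"context {ctx:4d}: ~{gib:5.1f} GiB of KV cache read per decode step")
      ```

      Going from 512 to 2048 tokens of context quadruples the KV cache the paged_attn kernel has to read
      each step, which lines up with the 41 us vs 121 us kernel times above.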
             
  7. Config file names to avoid confusion (see the device-name sketch below):

        MI300 file name:
             AMD_Instinct_MI300X.json
        MI308 file names:
             AMD_Instinct_MI300X_OAM.json
             AMD_Instinct_MI308X_OAM.json
             AMD_Radeon_Graphics.json
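     These names come straight from the reported GPU device string; a quick way to see which file a given
     machine will look for (a sketch; the full tuned-config file name pattern may differ across vLLM versions):

     ```python
     import torch

     # The fused-MoE kernels pick their tuning JSON from the device name with spaces
     # replaced by underscores, e.g. "AMD Instinct MI300X" -> "AMD_Instinct_MI300X".
     device_name = torch.cuda.get_device_name(0).replace(" ", "_")
     print(device_name)  # e.g. AMD_Instinct_MI300X, AMD_Instinct_MI308X_OAM, or AMD_Radeon_Graphics
     ```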