[ PROMPT_NODE_22365 ]
Inference
[ SKILL_DOCUMENTATION ]
# MoE Inference Optimization
Complete guide to optimizing MoE inference based on MoE-Inference-Bench research (arXiv 2508.17467, 2024).
## Table of Contents
- Performance Metrics
- vLLM Optimizations
- Quantization
- Expert Parallelism
- Optimization Techniques
- Production Deployment
## Performance Metrics
**Source**: MoE-Inference-Bench (arXiv 2508.17467)
### Key Metrics
1. **Time to First Token (TTFT)**
- Latency until first token generated
- Critical for user experience
2. **Inter-Token Latency (ITL)**
- Time between consecutive tokens
- Affects streaming experience
3. **Throughput**
- Formula: `(Batch Size × (Input + Output Tokens)) / Total Latency`
- Higher is better
### Benchmark Results (H100 GPU)
**LLM Performance**:
- **OLMoE-1B-7B**: Highest throughput
- **Mixtral-8x7B**: Highest accuracy, lower throughput
- **Qwen3-30B**: High accuracy, moderate throughput
**VLM Performance**:
- **DeepSeek-VL2-Tiny**: Fastest, lowest accuracy
- **DeepSeek-VL2**: Highest accuracy, lowest throughput
## vLLM Optimizations
**Source**: MoE-Inference-Bench 2024, vLLM documentation
### Expert Parallelism
Distribute experts across GPUs for parallel execution.
```python
from vllm import LLM, SamplingParams
# Enable expert parallelism
llm = LLM(
model="mistralai/Mixtral-8x7B-v0.1",
tensor_parallel_size=2, # Tensor parallelism
enable_expert_parallel=True, # Expert parallelism
gpu_memory_utilization=0.9
)
# Generate
outputs = llm.generate(
prompts=["What is mixture of experts?"],
sampling_params=SamplingParams(temperature=0.7, max_tokens=256)
)
```
### Parallelism Strategies
**From MoE-Inference-Bench**:
| Strategy | Throughput Gain | Best For |
|----------|----------------|----------|
| **Tensor Parallelism** | High | Large models, multi-GPU |
| **Expert Parallelism** | Moderate | MoE-specific, many experts |
| **Pipeline Parallelism** | Low | Very large models |
**Recommendation**: Tensor parallelism most effective for MoE models
### Fused MoE Kernels
**Performance Gain**: 12-18% throughput improvement
```python
# vLLM automatically uses fused kernels when available
llm = LLM(
model="mistralai/Mixtral-8x7B-v0.1",
use_v2_block_manager=True # Enable fused MoE kernels
)
```
**What it does**:
- Reduces kernel launch overhead
- Combines multiple operations into single kernel
- Better GPU utilization
## Quantization
**Source**: MoE-Inference-Bench quantization analysis
### FP8 Quantization
**Performance**: 20-30% throughput improvement over FP16
```python
from vllm import LLM
# FP8 quantization
llm = LLM(
model="mistralai/Mixtral-8x7B-v0.1",
quantization="fp8" # FP8 quantization
)
```
**Trade-offs**:
- Throughput: +20-30%
- Memory: -40-50%
- Accuracy: Minimal degradation (<1%)
### INT8 Quantization
```python
# INT8 weight-only quantization
llm = LLM(
model="mistralai/Mixtral-8x7B-v0.1",
quantization="awq" # or "gptq"
)
```
**Performance**:
- Throughput: +15-20%
- Memory: -50-60%
- Quality: Slight degradation (1-2%)
## Expert Configuration
**Source**: MoE-Inference-Bench hyperparameter analysis
### Active Experts
**Key Finding**: Single-expert activation → 50-80% higher throughput
```python
# Top-1 routing (best throughput)
# Mixtral default is top-2, but top-1 can be enforced at inference
# Model architecture determines this
# Cannot change at runtime, but affects deployment planning
```
**Performance vs Experts**:
- 1 expert/token: +50-80% throughput vs top-2
- 2 experts/token: Balanced (Mixtral default)
- 3+ experts/token: Lower throughput, higher quality
### Total Expert Count
**Scaling**: Non-linear, diminishing returns at high counts
| Total Experts | Throughput | Memory |
|--------------|------------|--------|
| 8 | Baseline | Baseline |
| 16 | +15% | +20% |
| 32 | +25% | +45% |
| 64 | +30% | +90% |
| 128 | +32% | +180% |
**Recommendation**: 8-32 experts for optimal throughput/memory
### FFN Dimension
**Key Finding**: Performance degrades with increasing FFN size
```python
# Smaller FFN = better throughput
# Trade-off: model capacity vs inference speed
```
| FFN Dimension | Throughput | Quality |
|---------------|------------|---------|
| 2048 | High | Moderate |
| 4096 | Moderate | High |
| 8192 | Low | Very High |
## Optimization Techniques
**Source**: MoE-Inference-Bench optimization experiments
### 1. Speculative Decoding
**Performance**: 1.5-2.5× speedup
```python
from vllm import LLM, SamplingParams
# Main model (large MoE)
main_model = LLM(model="mistralai/Mixtral-8x7B-v0.1")
# Draft model (small, fast)
draft_model = LLM(model="Qwen/Qwen3-1.7B")
# Speculative decoding with draft model
# vLLM handles automatically if draft model specified
```
**Best draft models** (from research):
- Medium-sized (1.7B-3B parameters)
- Qwen3-1.7B most effective
- Too small (7B): overhead dominates
### 2. Expert Pruning
**Performance**: 50% pruning → significant throughput gain
```python
# Prune least-used experts (offline)
# Example: Keep top-50% experts by usage
# Requires profiling on representative data:
# 1. Track expert utilization
# 2. Prune unused/rarely-used experts
# 3. Fine-tune pruned model (optional)
```
**Trade-off**:
- 50% pruning: +40-60% throughput, -2-5% accuracy
- 75% pruning: +80-120% throughput, -5-15% accuracy
### 3. Batch Size Tuning
```python
# Larger batches = better throughput (until OOM)
llm = LLM(
model="mistralai/Mixtral-8x7B-v0.1",
max_num_seqs=256, # Maximum batch size
max_num_batched_tokens=8192 # Total tokens in batch
)
```
**Optimal batch sizes** (H100):
- Mixtral-8x7B: 64-128
- Smaller MoE (8 experts): 128-256
- Larger MoE (>16 experts): 32-64
## Production Deployment
### Single GPU (Consumer Hardware)
```python
from vllm import LLM
# Optimize for single GPU
llm = LLM(
model="mistralai/Mixtral-8x7B-v0.1",
gpu_memory_utilization=0.95, # Use 95% of VRAM
max_num_seqs=32, # Smaller batches
quantization="awq" # Quantize to fit
)
```
**Minimum requirements**:
- Mixtral-8x7B: 48GB VRAM (FP16) or 24GB (INT8)
- Expert parallelism not needed
### Multi-GPU (Data Center)
```python
# Tensor parallelism + Expert parallelism
llm = LLM(
model="mistralai/Mixtral-8x7B-v0.1",
tensor_parallel_size=2, # Split across 2 GPUs
enable_expert_parallel=True, # Distribute experts
gpu_memory_utilization=0.9
)
```
**Scaling strategy**:
- 2 GPUs: Tensor parallelism
- 4+ GPUs: Tensor + expert parallelism
- 8+ GPUs: Consider pipeline parallelism
### Production Configuration
```python
# Optimized for production
llm = LLM(
model="mistralai/Mixtral-8x7B-v0.1",
# Parallelism
tensor_parallel_size=2,
enable_expert_parallel=True,
# Memory
gpu_memory_utilization=0.9,
swap_space=4, # 4GB CPU swap
# Performance
use_v2_block_manager=True, # Fused kernels
max_num_seqs=64,
max_num_batched_tokens=4096,
# Optional: Quantization
quantization="fp8"
)
```
### Monitoring
```python
import time
# Track metrics
def monitor_inference(llm, prompts):
start = time.time()
outputs = llm.generate(prompts)
end = time.time()
total_time = end - start
total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Throughput: {total_tokens / total_time:.2f} tokens/sec")
print(f"Latency: {total_time / len(prompts):.2f} sec/request")
return outputs
# Usage
outputs = monitor_inference(llm, ["Prompt 1", "Prompt 2"])
```
## Optimization Checklist
**From MoE-Inference-Bench best practices:**
- [ ] Use FP8 quantization (20-30% speedup)
- [ ] Enable fused MoE kernels (12-18% speedup)
- [ ] Tune batch size for your hardware
- [ ] Use tensor parallelism for multi-GPU
- [ ] Consider speculative decoding (1.5-2.5× speedup)
- [ ] Profile expert utilization, prune if needed
- [ ] Optimize active expert count (top-1 vs top-2)
- [ ] Monitor and tune GPU memory utilization
## Resources
- **MoE-Inference-Bench**: https://arxiv.org/abs/2508.17467
- **vLLM Documentation**: https://docs.vllm.ai
- **PyTorch MoE Optimization**: https://pytorch.org/blog/accelerating-moe-model/
Source: claude-code-templates (MIT). See About Us for full credits.