# MoE Inference Optimization

Complete guide to optimizing MoE inference based on MoE-Inference-Bench research (arXiv 2508.17467, 2025).

## Table of Contents

- Performance Metrics
- vLLM Optimizations
- Quantization
- Expert Configuration
- Optimization Techniques
- Production Deployment

## Performance Metrics

**Source**: MoE-Inference-Bench (arXiv 2508.17467)

### Key Metrics

1. **Time to First Token (TTFT)**
   - Latency until the first token is generated
   - Critical for user experience

2. **Inter-Token Latency (ITL)**
   - Time between consecutive tokens
   - Affects streaming experience

3. **Throughput**
   - Formula: `(Batch Size × (Input + Output Tokens)) / Total Latency`
   - Higher is better

### Benchmark Results (H100 GPU)

**LLM Performance**:
- **OLMoE-1B-7B**: Highest throughput
- **Mixtral-8x7B**: Highest accuracy, lower throughput
- **Qwen3-30B**: High accuracy, moderate throughput

**VLM Performance**:
- **DeepSeek-VL2-Tiny**: Fastest, lowest accuracy
- **DeepSeek-VL2**: Highest accuracy, lowest throughput

## vLLM Optimizations

**Source**: MoE-Inference-Bench, vLLM documentation

### Expert Parallelism

Distribute experts across GPUs for parallel execution.

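To make "distribute experts across GPUs" concrete, here is a minimal sketch of one possible placement. The helper names `assign_experts` and `gpus_for_token` are hypothetical, and the round-robin layout is an assumption chosen for illustration; it is not a description of how vLLM places experts internally.

```python
def assign_experts(num_experts: int, num_gpus: int) -> dict[int, list[int]]:
    """Assign expert indices to GPU ranks round-robin (illustrative placement only)."""
    if num_experts <= 0 or num_gpus <= 0:
        raise ValueError("num_experts and num_gpus must be positive")
    placement: dict[int, list[int]] = {rank: [] for rank in range(num_gpus)}
    for expert_id in range(num_experts):
        placement[expert_id % num_gpus].append(expert_id)
    return placement


def gpus_for_token(selected_experts: list[int], num_gpus: int) -> set[int]:
    """Return the GPU ranks a token must visit, given the experts it was routed to."""
    return {expert_id % num_gpus for expert_id in selected_experts}


# Mixtral-style layout: 8 experts per layer, spread over 2 GPUs.
placement = assign_experts(num_experts=8, num_gpus=2)
print(placement)  # {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}

# A token routed to experts 3 and 6 (top-2) needs work from both GPUs.
print(gpus_for_token([3, 6], num_gpus=2))  # {0, 1}
```

Each GPU holds only a subset of the expert weights, so tokens are sent to whichever ranks own their routed experts. In vLLM this placement is handled by the engine and is switched on through configuration:
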

```python
from vllm import LLM, SamplingParams

# Enable expert parallelism
llm = LLM(
    model="mistralai/Mixtral-8x7B-v0.1",
    tensor_parallel_size=2,        # Tensor parallelism
    enable_expert_parallel=True,   # Expert parallelism
    gpu_memory_utilization=0.9
)

# Generate
outputs = llm.generate(
    prompts=["What is mixture of experts?"],
    sampling_params=SamplingParams(temperature=0.7, max_tokens=256)
)
```

### Parallelism Strategies

**From MoE-Inference-Bench**:

| Strategy | Throughput Gain | Best For |
|----------|----------------|----------|
| **Tensor Parallelism** | High | Large models, multi-GPU |
| **Expert Parallelism** | Moderate | MoE-specific, many experts |
| **Pipeline Parallelism** | Low | Very large models |

**Recommendation**: Tensor parallelism is the most effective strategy for MoE models.

### Fused MoE Kernels

**Performance Gain**: 12-18% throughput improvement

```python
from vllm import LLM

# vLLM selects its fused MoE kernels automatically when they are available
# for the model and GPU; no engine argument is needed to enable them.
llm = LLM(model="mistralai/Mixtral-8x7B-v0.1")
```

**What it does**:
- Reduces kernel launch overhead
- Combines multiple operations into a single kernel
- Improves GPU utilization

## Quantization

**Source**: MoE-Inference-Bench quantization analysis

### FP8 Quantization

**Performance**: 20-30% throughput improvement over FP16

```python
from vllm import LLM

# FP8 quantization
llm = LLM(
    model="mistralai/Mixtral-8x7B-v0.1",
    quantization="fp8"  # FP8 quantization
)
```

**Trade-offs**:
- Throughput: +20-30%
- Memory: -40-50%
- Accuracy: Minimal degradation (<1%)

### INT8 Quantization

```python
from vllm import LLM

# Weight-only quantization via AWQ or GPTQ. These checkpoints are commonly
# 4-bit rather than INT8, and the weights must already be quantized with the
# chosen method: point `model` at a pre-quantized checkpoint.
llm = LLM(
    model="mistralai/Mixtral-8x7B-v0.1",  # replace with a pre-quantized checkpoint
    quantization="awq"  # or "gptq"
)
```

**Performance**:
- Throughput: +15-20%
- Memory: -50-60%
- Quality: Slight degradation (1-2%)

## Expert Configuration

**Source**: MoE-Inference-Bench hyperparameter analysis

### Active Experts

**Key Finding**: Single-expert activation → 50-80% higher throughput

```python
# Top-1 routing gives the best throughput.
# Mixtral routes each token to 2 experts (top-2) by default.
# The number of active experts is fixed by the model architecture,
# so treat it as a model-selection and deployment-planning factor
# rather than a runtime tuning knob.
```

**Performance vs Experts**:
- 1 expert/token: +50-80% throughput vs top-2
- 2 experts/token: Balanced (Mixtral default)
- 3+ experts/token: Lower throughput, higher quality

### Total Expert Count

**Scaling**: Non-linear, diminishing returns at high counts

| Total Experts | Throughput | Memory |
|--------------|------------|--------|
| 8 | Baseline | Baseline |
| 16 | +15% | +20% |
| 32 | +25% | +45% |
| 64 | +30% | +90% |
| 128 | +32% | +180% |

**Recommendation**: 8-32 experts for the best throughput/memory trade-off

### FFN Dimension

**Key Finding**: Performance degrades with increasing FFN size

```python
# Smaller FFN = better throughput
# Trade-off: model capacity vs inference speed
```

| FFN Dimension | Throughput | Quality |
|---------------|------------|---------|
| 2048 | High | Moderate |
| 4096 | Moderate | High |
| 8192 | Low | Very High |

## Optimization Techniques

**Source**: MoE-Inference-Bench optimization experiments

### 1. Speculative Decoding

**Performance**: 1.5-2.5× speedup

```python
from vllm import LLM

# Speculative decoding runs inside a single engine: the draft model is passed
# as configuration, not loaded as a second LLM instance.
# Assumptions: a recent vLLM release that accepts `speculative_config`
# (older releases used `speculative_model=` and `num_speculative_tokens=`),
# and an engine version that supports draft-model speculation.
# The draft model must share the target's vocabulary, so a Qwen3 draft is
# paired here with a Qwen3 MoE target rather than Mixtral.
llm = LLM(
    model="Qwen/Qwen3-30B-A3B",        # Main model (large MoE)
    speculative_config={
        "model": "Qwen/Qwen3-1.7B",    # Draft model (small, fast)
        "num_speculative_tokens": 5,   # Tokens proposed per step (example value)
    },
)
```

**Best draft models** (from research):
- Medium-sized (1.7B-3B parameters)
- Qwen3-1.7B most effective
- Too large (7B): overhead dominates

### 2. Expert Pruning

**Performance**: 50% pruning → significant throughput gain

```python
# Prune least-used experts (offline)
# Example: Keep top-50% experts by usage
# Requires profiling on representative data:
# 1. Track expert utilization
# 2. Prune unused/rarely-used experts
# 3. Fine-tune pruned model (optional)
```

**Trade-off**:
- 50% pruning: +40-60% throughput, -2-5% accuracy
- 75% pruning: +80-120% throughput, -5-15% accuracy

### 3. Batch Size Tuning

```python
from vllm import LLM

# Larger batches = better throughput (until OOM)
llm = LLM(
    model="mistralai/Mixtral-8x7B-v0.1",
    max_num_seqs=256,             # Maximum number of sequences per batch
    max_num_batched_tokens=8192   # Total tokens in batch
)
```

**Optimal batch sizes** (H100):
- Mixtral-8x7B: 64-128
- Smaller MoE (8 experts): 128-256
- Larger MoE (>16 experts): 32-64

## Production Deployment

### Single GPU (Consumer Hardware)

```python
from vllm import LLM

# Optimize for single GPU
llm = LLM(
    model="mistralai/Mixtral-8x7B-v0.1",  # use a pre-quantized AWQ checkpoint
    gpu_memory_utilization=0.95,  # Use 95% of VRAM
    max_num_seqs=32,              # Smaller batches
    quantization="awq"            # Quantize to fit
)
```

**Minimum requirements**:
- Mixtral-8x7B (about 47B parameters): roughly 94GB for the weights in FP16, about 47GB at 8-bit, or about 24GB at 4-bit (AWQ/GPTQ), plus memory for the KV cache and activations
- Expert parallelism is not needed on a single GPU

### Multi-GPU (Data Center)

```python
from vllm import LLM

# Tensor parallelism + Expert parallelism
llm = LLM(
    model="mistralai/Mixtral-8x7B-v0.1",
    tensor_parallel_size=2,        # Split across 2 GPUs
    enable_expert_parallel=True,   # Distribute experts
    gpu_memory_utilization=0.9
)
```

**Scaling strategy**:
- 2 GPUs: Tensor parallelism
- 4+ GPUs: Tensor + expert parallelism
- 8+ GPUs: Consider pipeline parallelism

### Production Configuration

```python
from vllm import LLM

# Optimized for production
llm = LLM(
    model="mistralai/Mixtral-8x7B-v0.1",

    # Parallelism
    tensor_parallel_size=2,
    enable_expert_parallel=True,

    # Memory
    gpu_memory_utilization=0.9,
    swap_space=4,  # 4GB CPU swap

    # Performance (fused MoE kernels are selected automatically)
    max_num_seqs=64,
    max_num_batched_tokens=4096,

    # Optional: Quantization
    quantization="fp8"
)
```

### Monitoring

```python
import time

# Track metrics
def monitor_inference(llm, prompts):
    start = time.perf_counter()
    outputs = llm.generate(prompts)
    total_time = time.perf_counter() - start

    # Counts generated (output) tokens only
    total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)

    print(f"Throughput: {total_tokens / total_time:.2f} tokens/sec")
    print(f"Avg time per request: {total_time / len(prompts):.2f} sec/request")

    return outputs

# Usage
outputs = monitor_inference(llm, ["Prompt 1", "Prompt 2"])
```

## Optimization Checklist

**From MoE-Inference-Bench best practices:**

- [ ] Use FP8 quantization (20-30% speedup)
- [ ] Enable fused MoE kernels (12-18% speedup)
- [ ] Tune batch size for your hardware
- [ ] Use tensor parallelism for multi-GPU
- [ ] Consider speculative decoding (1.5-2.5× speedup)
- [ ] Profile expert utilization, prune if needed
- [ ] Optimize active expert count (top-1 vs top-2)
- [ ] Monitor and tune GPU memory utilization

## Resources

- **MoE-Inference-Bench**: https://arxiv.org/abs/2508.17467
- **vLLM Documentation**: https://docs.vllm.ai
- **PyTorch MoE Optimization**: https://pytorch.org/blog/accelerating-moe-model/
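
## Appendix: Throughput Formula Example

The throughput formula listed under Key Metrics counts both input and output tokens. As a minimal sketch, the helper below evaluates that formula directly; the function name `throughput_tokens_per_sec` and the sample numbers are hypothetical and are not taken from the benchmark.

```python
def throughput_tokens_per_sec(
    batch_size: int,
    input_tokens: int,
    output_tokens: int,
    total_latency_s: float,
) -> float:
    """Evaluate (Batch Size × (Input + Output Tokens)) / Total Latency."""
    if total_latency_s <= 0:
        raise ValueError("total_latency_s must be positive")
    return batch_size * (input_tokens + output_tokens) / total_latency_s


# Hypothetical run: 16 requests, 512 prompt tokens and 128 generated tokens
# each, finishing in 4 seconds end to end.
tput = throughput_tokens_per_sec(
    batch_size=16, input_tokens=512, output_tokens=128, total_latency_s=4.0
)
print(f"{tput:.1f} tokens/sec")  # 16 * 640 / 4 = 2560.0 tokens/sec
```

This assumes every request in the batch has the same prompt and output length. For mixed-length batches, sum the per-request token counts instead of multiplying by the batch size.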

Source: claude-code-templates (MIT).
