Distributed Training Megatron Core – Benchmarks

# Performance Benchmarks

Performance metrics and benchmarks for Megatron-Core across different model sizes and hardware configurations.

## Model FLOP Utilization (MFU)

**H100 Clusters**: Up to 47% MFU achieved

MFU increases with larger model sizes due to higher arithmetic intensity in larger matrix multiplications (GEMMs).

## Throughput Metrics by Model Size

### GPT-3 175B
- **Hardware**: H100
- **Configuration**: TP=4, PP=8
- **GPUs**: 128-512
- **MFU**: 47% on H100
- **Throughput**: 390 TFlops/GPU on H100

### LLaMA Configurations

| Model | Size | GPUs | TP | PP | CP | Seq Length | Hardware | Notes |
|-------|------|------|----|----|----|------------|----------|-------|
| LLaMA-3 | 8B | 8 | 1 | 1 | 2 | 8K | H100 | CP for long sequences |
| LLaMA-3 | 70B | 64 | 4 | 4 | 2 | 4K | H100 | TP+PP parallelism |
| LLaMA-3.1 | 405B | 1024 | 8 | 8 | 2 | 4K | H100 | 3D parallelism |

**LLaMA-3.1 405B Details**:
- Up to 16K H100 GPUs (Meta operates two 24K-GPU clusters)
- TP=8, PP=8, CP=2
- 400 TFlops/GPU average
- 95%+ effective training time
- 3× efficiency improvement vs LLaMA 2

### Mixtral (Mixture of Experts)

| Model | Active Params | Total Params | GPUs | TP | PP | EP | Experts | Hardware |
|-------|---------------|--------------|------|----|----|----|---------|----------|
| Mixtral 8×7B | ~13B (active) | ~47B | 64 | 1 | 4 | 8 | 8 | H100 |
| Mixtral 8×22B | ~39B (active) | ~141B | 256 | 4 | 4 | 8 | 8 | H100 |

Total parameters are lower than 8 × the nominal expert size because the attention layers and embeddings are shared across experts; two of the eight experts are active per token.

### DeepSeek-V3
- **Active Parameters**: 37B per token
- **Total Parameters**: 671B
- **GPUs**: 1024 H100
- **Configuration**: TP=2, PP=16, EP=64
- **Parallelism**: 4D with Expert Parallel

### GPT-462B (Largest Benchmark)
- **Parameters**: 462B
- **GPUs**: 6144 H100
- **MFU**: 47-48%
- **Throughput**: ~390 TFlops/GPU

## Hardware Performance Characteristics

### NVIDIA H100 (Hopper)
- **Peak Performance**:
  - FP16: 1979 TFlops with sparsity (989 TFlops dense)
  - BF16: 1979 TFlops with sparsity (989 TFlops dense)
  - FP8: 3958 TFlops with sparsity (1979 TFlops dense)
- **Memory**: 80GB HBM3
- **Memory Bandwidth**: 3.35 TB/s
- **NVLink**: 900 GB/s per GPU

**Achieved MFU**: 40-47% (typical range)

### NVIDIA A100 (Ampere)
- **Peak Performance**:
  - FP16: 312 TFlops dense (624 TFlops with sparsity)
  - BF16: 312 TFlops dense (624 TFlops with sparsity)
- **Memory**: 40GB HBM2 or 80GB HBM2e
- **Memory Bandwidth**: Up to 2 TB/s (80GB model)
- **NVLink**: 600 GB/s per GPU

**Typical MFU**: 35-42%

## Weak Scaling (Fixed Per-GPU Workload)

As you add more GPUs while keeping per-GPU workload constant:

| GPUs | Model Size | MFU | Efficiency |
|------|------------|-----|------------|
| 8 | 7B | 42% | 100% (baseline) |
| 64 | 70B | 44% | 95% |
| 512 | 175B | 45% | 93% |
| 1024 | 405B | 46% | 90% |
| 6144 | 462B | 47% | 88% |

## Strong Scaling (Fixed Total Workload)

Distributing a fixed model across more GPUs:

| Model | GPUs | Time per Iteration | Speedup | Efficiency |
|-------|------|--------------------|---------|------------|
| 70B | 64 | 1.0× (baseline) | 1.0× | 100% |
| 70B | 128 | 0.52× | 1.92× | 96% |
| 70B | 256 | 0.27× | 3.70× | 93% |

Efficiency is the speedup divided by the increase in GPU count (for example, 1.92× on 2× the GPUs gives 96%).

## Throughput Calculations

**Formula**:
```
Throughput (TFlops/GPU) = Total FLOPs / (Time × Number of GPUs × 10^12)
```

**Example (GPT-3 175B)**:
- Forward pass: ~2 FLOPs per parameter per token, so ~350 billion FLOPs per token for a 175B model
- Forward + backward pass: 3 × the forward FLOPs (~1.05 trillion FLOPs per token)
- Batch size: 1536 (global)
- Sequence length: 2048
- Total FLOPs per iteration: ~3.3 × 10^18 (1.05 × 10^12 × 1536 × 2048)
- Time per iteration: ~16.5 seconds on 512 H100s
- Throughput: ~390 TFlops/GPU

## Memory Usage vs Model Size

| Model Size | Parameters | Memory (FP16) | Memory (BF16) | Memory (FP8) |
|------------|------------|---------------|---------------|--------------|
| 7B | 7 billion | 14 GB | 14 GB | 7 GB |
| 13B | 13 billion | 26 GB | 26 GB | 13 GB |
| 70B | 70 billion | 140 GB | 140 GB | 70 GB |
| 175B | 175 billion | 350 GB | 350 GB | 175 GB |
| 405B | 405 billion | 810 GB | 810 GB | 405 GB |

**Note**: These are model weights only. Training with mixed-precision Adam also stores gradients and optimizer states, which typically brings the total to about 16 bytes per parameter (roughly 8× the BF16 weight size) before activations.
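
The throughput formula and the weight-memory table above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of Megatron-Core: the function names are hypothetical, the FLOP count uses the common 6 × parameters × tokens approximation (attention FLOPs ignored), and the 16.5 s iteration time is an illustrative value chosen to be consistent with roughly 390 TFlops/GPU on 512 GPUs.

```python
"""Back-of-the-envelope estimates for dense transformer training.

All helper names here are hypothetical; nothing is imported from Megatron-Core.
"""

# Bytes needed to store one parameter in each weight format.
BYTES_PER_PARAM = {"fp16": 2, "bf16": 2, "fp8": 1}


def weight_memory_gb(num_params: float, dtype: str = "bf16") -> float:
    """Memory for model weights only, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def training_flops_per_iteration(num_params: float, global_batch: int, seq_len: int) -> float:
    """Approximate forward + backward FLOPs for one iteration.

    Assumption: 2 FLOPs per parameter per token in the forward pass and
    twice that in the backward pass (6 in total); attention FLOPs are ignored.
    """
    tokens = global_batch * seq_len
    return 6.0 * num_params * tokens


def tflops_per_gpu(total_flops: float, seconds: float, num_gpus: int) -> float:
    """Throughput (TFlops/GPU) = total FLOPs / (time x GPUs x 10^12)."""
    return total_flops / (seconds * num_gpus * 1e12)


if __name__ == "__main__":
    # Weight memory for a few of the model sizes in the table.
    for name, n in [("7B", 7e9), ("70B", 70e9), ("405B", 405e9)]:
        bf16 = weight_memory_gb(n, "bf16")
        fp8 = weight_memory_gb(n, "fp8")
        print(f"{name}: {bf16:.0f} GB (BF16), {fp8:.0f} GB (FP8)")

    # GPT-3 175B example; the 16.5 s iteration time is an illustrative assumption.
    flops = training_flops_per_iteration(175e9, global_batch=1536, seq_len=2048)
    print(f"{tflops_per_gpu(flops, seconds=16.5, num_gpus=512):.0f} TFlops/GPU")
```

Dividing the resulting TFlops/GPU by the hardware's dense peak for the chosen precision gives MFU; which peak figure to use is a reporting choice that should be stated alongside the result.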
## Communication Overhead

### Tensor Parallelism (TP)
- **Bandwidth Required**: ~20 GB/GPU for LLaMA 70B with TP=4
- **Frequency**: Every layer (80+ layers)
- **Best Practice**: Use NVLink, keep TP ≤8 within a single node

### Pipeline Parallelism (PP)
- **Bandwidth Required**: Activation size only (~100s of MB)
- **Frequency**: Between pipeline stages
- **Best Practice**: Use for cross-node scaling

### Data Parallelism (DP)
- **Bandwidth Required**: Full gradient size
- **Frequency**: Once per iteration
- **Best Practice**: Use for remaining parallelism after TP/PP

## Optimization Impact

### Flash Attention
- **Speedup**: 2-4× on attention layers
- **Memory**: 10-20× reduction in attention memory
- **Overall Impact**: ~30% faster training

### Sequence Parallelism
- **Memory Savings**: Activation memory / TP degree in the regions TP alone leaves replicated (LayerNorm, dropout)
- **Example**: With TP=4, saves 75% of that activation memory
- **No Performance Cost**: The TP all-reduce is replaced by a reduce-scatter plus all-gather of the same total volume

### Context Parallelism
- **Use Case**: Sequences >8K tokens
- **Memory Savings**: KV cache / CP degree
- **Communication**: Ring pattern exchanging key/value blocks between CP ranks

### FP8 Training (H100 and Newer)
- **Speedup**: 1.5-2× vs BF16
- **Memory**: 50% reduction vs BF16
- **Quality**: Minimal degradation with proper scaling

## Production Deployments

### Meta LLaMA 3 Training
- **Models**: 8B, 70B, 405B
- **Cluster**: Two 24K H100 clusters
- **Efficiency**: 400 TFlops/GPU sustained
- **Effective Training Time**: 95%+
- **Total Tokens**: 15 trillion for 405B model

### Microsoft/NVIDIA Megatron-Turing NLG 530B
- **GPUs**: 560 NVIDIA DGX A100 servers (4480 A100 80GB GPUs)
- **Parallelism**: DeepSpeed data and pipeline parallelism + Megatron tensor parallelism
- **Duration**: Several months
- **Year**: 2021

### NVIDIA Nemotron-4 340B
- **Architecture**: Dense decoder-only Transformer
- **Framework**: NeMo (built on Megatron-Core)
- **Production**: Commercial deployment

## Benchmarking Best Practices

1. **Measure Sustained Performance**: Average over 100+ iterations rather than reporting a peak
2. **Include All Operations**: Forward, backward, optimizer step, communication
3. **Report MFU**: Use the theoretical dense peak FLOPs of the hardware for the precision in use
4. **Specify Configuration**: TP, PP, CP, EP degrees, batch sizes, sequence length
5. **Note Optimizations**: Flash Attention, FP8, sequence parallel, etc.

## How to Measure Your Own Performance

**Enable profiling**:
```bash
torchrun pretrain_gpt.py --profile --profile-step-start 10 --profile-step-end 20
```

**Calculate MFU**:
```python
# Megatron logs this automatically
# Check logs for:
# - elapsed time per iteration (seconds)
# - samples per second
# - TFLOPs/s per GPU
# - MFU percentage
```

**Key Metrics to Track**:
- Elapsed time per iteration
- Throughput (TFlops/GPU)
- MFU (%)
- Memory usage (GB)
- Communication time (% of total)

## Troubleshooting Low Performance

**If MFU is low and communication takes more than 30% of iteration time**:
1. Reduce TP degree (especially across nodes)
2. Use interleaved pipeline schedule
3. Enable communication overlap flags
4. Check network topology (InfiniBand vs Ethernet)

**If Memory Bound**:
1. Enable gradient checkpointing
2. Use lower precision (BF16 or FP8)
3. Increase parallelism degrees
4. Reduce micro-batch size

## References

- NVIDIA Megatron-LM GitHub: https://github.com/NVIDIA/Megatron-LM
- Performance Docs: https://docs.nvidia.com/megatron-core/
- LLaMA 3 Paper: Meta AI
- DeepSeek-V3 Technical Report

Source: claude-code-templates (MIT).