# Parallelism Strategies Guide
Complete guide to parallelism in Megatron-Core: when to use each strategy, performance characteristics, and configuration examples.
## Parallelism Types
### 1. Data Parallel (DP)
**How it works**: Replicate full model on each GPU, split data batches, synchronize gradients.
**Memory**: Poor - the full model is replicated on every GPU
**Communication**: Low - only gradients (all-reduce)
**Scalability**: Good for small models
**When to use**:
- Models <10B parameters (full model fits on one GPU)
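**Configuration** (a minimal sketch; pure DP needs no model-parallel flags, and the DP degree is simply the world size, assuming the standard Megatron-LM `train.py` entry point used throughout this guide):
```bash
# Pure data parallelism: no model-parallel flags, DP degree = 8
torchrun --nproc_per_node=8 train.py \
    --micro-batch-size 4 \
    --global-batch-size 256   # gradients all-reduced across the 8 replicas
```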
### 2. Tensor Parallel (TP)
**How it works**: Split individual weight matrices (attention heads, MLP blocks) across GPUs; every GPU holds a shard of each layer.
**Memory**: Good - parameters divided by the TP degree
**Communication**: High - all-reduce inside every layer
**Scalability**: Effectively limited to GPUs within one node
**When to use**:
- Models >10B parameters
- Have NVLink-connected GPUs
- Within single node (network latency kills performance across nodes)
**Configuration**:
```bash
# Split model across 4 GPUs with TP
torchrun --nproc_per_node=4 train.py \
    --tensor-model-parallel-size 4
```
**Performance**:
- **1 node (8 GPUs, NVLink)**: 85-95% efficiency
- **Across nodes**: much lower - avoid spanning TP across node boundaries
### 3. Pipeline Parallel (PP)
**How it works**: Split the model by layers into sequential stages; microbatches stream through the stages.
**Memory**: Good - each GPU holds only its stage's layers
**Communication**: Low - point-to-point activations between adjacent stages
**Scalability**: Good across nodes
**When to use**:
- Models >70B parameters
- Multi-node training
- Limited intra-node bandwidth (no NVLink for TP)
**Configuration**:
```bash
# 4-stage pipeline
torchrun --nproc_per_node=8 --nnodes=4 train.py \
    --pipeline-model-parallel-size 4 \
    --num-layers 80 \
    --num-layers-per-virtual-pipeline-stage 2   # Interleaved schedule
```
**Performance**:
- Interleaved schedule: 90-95% efficiency
- Standard 1F1B: 75-85% efficiency
### 4. Sequence Parallel (SP)
**How it works**: Split sequence dimension across tensor-parallel GPUs, reduce activation memory.
**Memory**: Reduces activations by TP factor
**Communication**: Same as TP (already using all-reduce)
**Scalability**: Tied to TP
**When to use**:
- Long sequences (>4K tokens)
- Using TP already
- Activation memory is bottleneck
**Configuration**:
```bash
torchrun --nproc_per_node=8 train.py \
    --tensor-model-parallel-size 4 \
    --sequence-parallel   # Requires TP > 1
```
**Memory savings**:
```
70B model, 4K sequence, TP=4:
Without SP: 48GB activations per GPU
With SP: 12GB activations per GPU
Savings: 75%
```
### 5. Context Parallel (CP)
**How it works**: Split very long sequences across GPUs using Ring Attention.
**Memory**: Reduces KV cache and activations
**Communication**: Medium - ring communication pattern
**Scalability**: Good for >8K sequences
**When to use**:
- Sequences >8K tokens
- Long-context models (>32K)
- KV cache memory bottleneck
**Configuration**:
```bash
torchrun --nproc_per_node=8 train.py \
    --context-parallel-size 2 \
    --seq-length 32768   # 32K tokens
```
**Memory savings** (32K sequence; CP=4 shown for illustration):
```
Without CP: 64GB KV cache
With CP=4: 16GB KV cache per GPU
```
### 6. Expert Parallel (EP)
**How it works**: For MoE models, distribute different experts across GPUs.
**Memory**: Excellent - each GPU stores only 1/N of the experts
**Communication**: Low - only route tokens to experts
**Scalability**: Matches number of experts
**When to use**:
- Mixture of Experts models
- Want model capacity without memory cost
- Have ≥8 GPUs
**Configuration**:
```bash
# Mixtral 8x7B: 8 experts
torchrun --nproc_per_node=8 train.py \
    --num-experts 8 \
    --expert-model-parallel-size 4 \
    --tensor-model-parallel-size 2
```
**Memory** (Mixtral 8×7B):
```
Without EP: 8 experts × 7B = 56B expert params per GPU
With EP=4: 2 experts × 7B = 14B expert params per GPU
Savings: 75%
```
## Combining Parallelism Strategies
### 3D Parallelism (TP + PP + DP)
Standard for large models.
**LLaMA 3 70B on 64 GPUs**:
```bash
TP=4 # Within each node
PP=4 # Across nodes
DP=4 # Remaining dimension
Total = 4 × 4 × 4 = 64 GPUs
```
**Memory per GPU**: 70B / 4 (TP) / 4 (PP) ≈ 4.4B params ≈ 8.8GB of weights in BF16 (more once gradients and optimizer state are included)
**Configuration**:
```bash
torchrun --nproc_per_node=8 --nnodes=8 train.py \
    --tensor-model-parallel-size 4 \
    --pipeline-model-parallel-size 4
# DP is implicit: 64 / (4*4) = 4
```
### 4D Parallelism (TP + PP + DP + CP)
For very large models or long context.
**LLaMA 3 405B on 256 GPUs**:
```bash
TP=8 # Max NVLink
PP=8 # Across nodes
CP=2 # Long sequences
DP=2 # Remaining
Total = 8 × 8 × 2 × 2 = 256 GPUs
```
**Configuration**:
```bash
torchrun --nproc_per_node=8 --nnodes=32 train.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 8 \
    --context-parallel-size 2
```
### 4D + EP (5D Parallelism)
For sparse MoE models.
**DeepSeek-V3 671B (37B active) on 1024 GPUs**:
```bash
TP=2 # Limited by active params
PP=16 # Many stages
EP=64 # 256 experts / 4 experts per GPU
DP=2 # Small data parallel
Total = 1024 GPUs in practice (the naive product 2 × 16 × 64 × 2 = 4096 overcounts: EP reuses ranks from the DP dimension for the expert layers rather than adding a new axis)
```
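A hedged configuration sketch for the layout above (illustrative values using the same Megatron-Core flags as earlier examples, not DeepSeek's published recipe):
```bash
# 1024 GPUs = 128 nodes × 8; EP partitions the 256 routed experts
torchrun --nproc_per_node=8 --nnodes=128 train.py \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 16 \
    --expert-model-parallel-size 64 \
    --num-experts 256
```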
## Decision Guide
### By Model Size
| Model Size | GPUs | Recommended Strategy |
|------------|------|---------------------|
| <1B | 1-8 | DP only |
| 1-10B | 8-16 | TP=2-4 + DP |
| 10-70B | 16-64 | TP=4 + PP=2-4 + DP |
| 70-175B | 64-256 | TP=8 + PP=4-8 + DP |
| 175-500B | 256-1024 | TP=8 + PP=8-16 + CP=2 + DP |
| 500B+ | 1024+ | 4D or 5D (with EP) |
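As a concrete reading of the table, the 10-70B row on 32 GPUs might translate to flags like these (illustrative sizing, not a tuned recipe):
```bash
# 10-70B model on 4 nodes × 8 GPUs: TP=4, PP=2, DP = 32/(4*2) = 4
torchrun --nproc_per_node=8 --nnodes=4 train.py \
    --tensor-model-parallel-size 4 \
    --pipeline-model-parallel-size 2
```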
### By Hardware Topology
**Single node (8 GPUs with NVLink)**:
```bash
# Up to 70B
TP=8 # Use all NVLink bandwidth
```
**Multiple nodes (InfiniBand)**:
```bash
# Minimize cross-node communication
TP=8 # Within node only
PP=N # Across nodes
DP=remaining
```
**Limited network (Ethernet)**:
```bash
# Avoid TP across nodes
TP=1-4 # Within node
PP=many # PP has low communication
```
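Putting the multi-node pattern together, a launch under the InfiniBand layout might look like this; the rendezvous endpoint and node count are placeholders:
```bash
# Run on every node with a distinct $NODE_RANK (endpoint is hypothetical)
torchrun --nproc_per_node=8 --nnodes=4 --node_rank=$NODE_RANK \
    --rdzv_backend=c10d --rdzv_endpoint=head-node:29500 \
    train.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 2   # DP = 32/(8*2) = 2
```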
### By Sequence Length
| Sequence | Parallelism |
|----------|------------|
| <4K | None beyond what model size dictates |
| 4-8K | SP (with TP) |
| 8-32K | CP=2 |
| >32K | CP=4-8 |
## Best Practices
1. **Avoid TP across nodes** (network latency kills performance)
2. **Match TP to GPU topology** (TP=8 for 8-GPU nodes)
3. **Profile the first iteration** to check memory and communication:
```bash
--profile # Enable profiling
--profile-ranks 0 # Profile first rank only
```
## Troubleshooting
**High communication overhead (low MFU)**:
- Reduce TP degree (especially across nodes)
- Increase PP degree instead
- Enable interleaved pipeline schedule
**Out of memory**:
- Increase TP/PP (split model more)
- Enable gradient checkpointing:
```bash
--recompute-granularity full
--recompute-method block
```
- Reduce micro-batch size
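If full recomputation costs too much throughput, Megatron-LM also supports selective recomputation, which recomputes only the memory-heavy attention internals:
```bash
--recompute-granularity selective   # Cheaper than full recompute
```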
**Pipeline bubbles (low GPU util)**:
- Use interleaved schedule:
```bash
--num-layers-per-virtual-pipeline-stage 2
```
- Increase number of microbatches:
```bash
--global-batch-size 1024
--micro-batch-size 1 # More microbatches = smaller bubbles
```
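The microbatch count follows from the standard Megatron relation `num_microbatches = global_batch_size / (micro_batch_size × DP)`; a quick sanity check for the values above (DP=4 is an assumed example value):
```bash
# 1F1B bubble fraction is roughly (PP - 1) / num_microbatches,
# so more microbatches shrink the bubble
GLOBAL_BATCH=1024; MICRO_BATCH=1; DP=4
echo $(( GLOBAL_BATCH / (MICRO_BATCH * DP) ))   # 256 microbatches
```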
**Load imbalance in MoE**:
- Tune load balancing:
```bash
--moe-router-load-balancing-type aux_loss
--moe-aux-loss-coeff 0.01
```
- Increase expert parallel degree:
```bash
--expert-model-parallel-size 8   # Fewer experts per GPU
```
Source: claude-code-templates (MIT). See About Us for full credits.