# Megatron-Core - Large-Scale LLM Training
## Quick start
Megatron-Core trains LLMs from 2B to 462B parameters with up to 47% Model FLOP Utilization on H100 GPUs through advanced parallelism strategies.
**Installation**:
```bash
# Docker (recommended)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.04-py3
# Or pip
pip install megatron-core
```
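A quick way to confirm the pip install is importable before launching anything distributed (the import alone does not need a GPU):
```bash
# Sanity check: the megatron.core package imports and resolves to the installed wheel
python -c "import megatron.core as mc; print(mc.__file__)"
```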
**Simple distributed training**:
```bash
# Train with 2 GPUs using data parallelism
torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py
# Or LLaMA-3 8B training
./examples/llama/train_llama3_8b_fp8.sh
```
## Common workflows
### Workflow 1: Train LLaMA-style model with 3D parallelism
Copy this checklist:
```
LLaMA Training Setup:
- [ ] Step 1: Choose parallelism configuration
- [ ] Step 2: Configure training hyperparameters
- [ ] Step 3: Launch distributed training
- [ ] Step 4: Monitor performance metrics
```
**Step 1: Choose parallelism configuration**
Model size determines parallelism strategy:
| Model Size | GPUs | Tensor Parallel | Pipeline Parallel | Data Parallel | Context Parallel |
|------------|------|-----------------|-------------------|---------------|------------------|
| 7B | 8 | 1 | 1 | 8 | 1 |
| 13B | 8 | 2 | 1 | 4 | 1 |
| 70B | 64 | 4 | 4 | 4 | 1 |
| 405B | 128 | 8 | 8 | 2 | 2 |
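Whichever row you pick, the product of the four degrees must cover the GPU count; Megatron treats the data-parallel size as what is left over after TP, PP, and CP are fixed. A minimal sketch for the 70B row above (illustrative values only):
```bash
# DP = total_gpus / (TP * PP * CP) -- must divide evenly
TOTAL_GPUS=64; TP=4; PP=4; CP=1     # 70B row above
DP=$(( TOTAL_GPUS / (TP * PP * CP) ))
echo "DP=$DP"                       # -> 4, matching the table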
**Step 2: Configure training hyperparameters**
```bash
#!/bin/bash
# train_llama_70b.sh
GPUS_PER_NODE=8
NNODES=8 # 64 GPUs total
TP=4 # Tensor parallel
PP=4 # Pipeline parallel
CP=1 # Context parallel
# LLaMA 70B configuration
MODEL_SIZE=70 # Billion parameters
HIDDEN_SIZE=8192
NUM_LAYERS=80
NUM_HEADS=64
SEQ_LENGTH=4096
# Training hyperparameters
MICRO_BATCH=1
GLOBAL_BATCH=1024
LR=3e-4
torchrun \
    --nproc_per_node=$GPUS_PER_NODE \
    --nnodes=$NNODES \
    pretrain_gpt.py \
    --tensor-model-parallel-size $TP \
    --pipeline-model-parallel-size $PP \
    --context-parallel-size $CP \
    --sequence-parallel \
    --num-layers $NUM_LAYERS \
    --hidden-size $HIDDEN_SIZE \
    --num-attention-heads $NUM_HEADS \
    --seq-length $SEQ_LENGTH \
    --max-position-embeddings $SEQ_LENGTH \
    --micro-batch-size $MICRO_BATCH \
    --global-batch-size $GLOBAL_BATCH \
    --lr $LR \
    --train-iters 100000 \
    --lr-decay-style cosine \
    --lr-warmup-iters 2000 \
    --weight-decay 0.1 \
    --clip-grad 1.0 \
    --bf16 \
    --use-mcore-models \
    --transformer-impl transformer_engine \
    --data-path /path/to/data \
    --vocab-file /path/to/vocab.json \
    --merge-file /path/to/merges.txt
```
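`--data-path` expects the indexed dataset prefix produced by Megatron-LM's preprocessing tool. A typical invocation looks like the following (paths are placeholders; check `python tools/preprocess_data.py --help` in your checkout for the exact flags of your version):
```bash
# Convert a JSONL corpus into Megatron's indexed .bin/.idx format
python tools/preprocess_data.py \
    --input /path/to/corpus.jsonl \
    --output-prefix /path/to/data \
    --vocab-file /path/to/vocab.json \
    --merge-file /path/to/merges.txt \
    --tokenizer-type GPT2BPETokenizer \
    --append-eod \
    --workers 16
# The output is /path/to/data_text_document.{bin,idx}; pass the name without the
# extension as --data-path, e.g. --data-path /path/to/data_text_document
```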
**Step 3: Launch distributed training**
```bash
# Single node (8 GPUs)
bash train_llama_70b.sh
# Multi-node with SLURM
sbatch --nodes=8 --gpus-per-node=8 train_llama_70b.sh
```
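The `sbatch` one-liner above assumes the training script handles multi-node rendezvous itself. A minimal sketch of what that usually looks like with SLURM and `torchrun` (hostnames, port, and variable names are illustrative; adapt to your cluster):
```bash
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1

# First node in the allocation acts as the rendezvous host
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# Launch one copy of the training script per node; inside train_llama_70b.sh,
# extend the torchrun call with:
#   --node_rank=$SLURM_NODEID --rdzv_backend=c10d \
#   --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT
srun bash train_llama_70b.sh
```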
**Step 4: Monitor performance metrics**
Key metrics to track:
- **Model FLOP Utilization (MFU)**: target >40% on H100 (a rough estimation sketch follows at the end of this step)
- **Throughput**: tokens/sec/GPU
- **Per-GPU memory usage**: for reference points by model size, see Hardware requirements below
When tuning the parallelism layout against these metrics:
- **Context Parallel**: use for sequences >8K tokens
- **Data Parallel**: fill the remaining GPUs once TP/PP/CP are fixed
Example 405B on 128 H100s:
```
TP=8 (1 node)
PP=8 (across nodes)
CP=2 (long sequences)
DP=1
Total = 8 × 8 × 2 × 1 = 128 GPUs
```
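A rough way to turn measured throughput into an MFU estimate, assuming the common ~6 FLOPs-per-parameter-per-token approximation and H100 BF16 dense peak of ~989 TFLOPS (the throughput value below is a placeholder, not a measurement):
```bash
# MFU ~= (6 * params * tokens_per_sec_per_gpu) / peak_flops_per_gpu
PARAMS=70e9                  # model parameters
TOKENS_PER_SEC_PER_GPU=950   # substitute your measured throughput
PEAK_FLOPS=989e12            # H100 SXM BF16 dense peak (~989 TFLOPS)
awk -v p="$PARAMS" -v t="$TOKENS_PER_SEC_PER_GPU" -v f="$PEAK_FLOPS" \
    'BEGIN { printf "MFU ~= %.1f%%\n", 100 * 6 * p * t / f }'
```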
## When to use vs alternatives
**Use Megatron-Core when:**
- Training models >10B parameters
- Need maximum efficiency (target >40% MFU)
- Using NVIDIA GPUs (A100, H100)
- Production training at scale
- Want fine-grained parallelism control
**Use alternatives instead:**
- **PyTorch FSDP**: Models <70B, simpler API, PyTorch native
- **DeepSpeed**: Easier setup, good for <100B models
- **HuggingFace Accelerate**: Prototyping, simpler workflows
- **LitGPT**: Educational, single-file implementations
## Common issues
**Issue: Low GPU utilization (low MFU)**
Tensor-parallel groups larger than a single node (8 GPUs) push tensor-parallel all-reduces over the slower inter-node network. Keep TP at or below the GPUs per node:
```bash
--tensor-model-parallel-size 4 # Was 16
```
**Issue: Out of memory**
Reduce memory with:
```bash
--tensor-model-parallel-size 2 # Split model across GPUs
--recompute-granularity full # Gradient checkpointing
--recompute-method block # Checkpoint transformer blocks
--recompute-num-layers 1 # Checkpoint every layer
```
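If the layout includes data parallelism, sharding optimizer state across the data-parallel ranks saves further memory; Megatron-LM exposes this as `--use-distributed-optimizer` (verify against your version's argument list):
```bash
--use-distributed-optimizer   # shard optimizer state across data-parallel ranks (ZeRO-1 style)
```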
Or use CPU/NVMe offloading:
```bash
--cpu-optimizer # Offload optimizer to CPU
--cpu-optimizer-type ADAM # CPU Adam variant
```
**Issue: Training slower than expected**
Check:
1. **Network bottleneck**: Ensure InfiniBand/NVLink enabled
2. **Pipeline bubbles**: Use interleaved pipeline schedule
```bash
--num-layers-per-virtual-pipeline-stage 2
```
3. **Data loading**: Use fast data loader
```bash
--dataloader-type cyclic
```
**Issue: Diverging loss**
Stabilize training:
```bash
--lr-warmup-iters 2000 # Longer warmup
--clip-grad 1.0 # Gradient clipping
--init-method-std 0.006 # Smaller init
--attention-dropout 0.0 # No dropout in attention
--hidden-dropout 0.0 # No dropout in FFN
```
## Advanced topics
**Parallelism strategies**: See [references/parallelism-guide.md](references/parallelism-guide.md) for detailed comparison of TP/PP/DP/CP/EP with performance analysis and when to use each.
**Performance benchmarks**: See [references/benchmarks.md](references/benchmarks.md) for MFU numbers across different model sizes and GPU configurations.
**Production configurations**: See [references/production-examples.md](references/production-examples.md) for real-world setups from LLaMA 3 405B, Nemotron-4 340B, and DeepSeek-V3 671B.
**Training recipes**: See [references/training-recipes.md](references/training-recipes.md) for complete hyperparameter configurations for GPT/LLaMA/Mixtral architectures.
## Hardware requirements
- **GPU**: NVIDIA Ampere or newer (A100, H100, B200)
  - Turing works but is slower
  - FP8 requires Hopper, Ada, or Blackwell
- **Network**: InfiniBand or 400Gb+ Ethernet for multi-node
- **Memory per GPU**:
  - 7B model: 40GB+
  - 70B model: 80GB (with TP=4)
  - 405B model: 80GB (with TP=8, PP=8)
- **Storage**: Fast NVMe for checkpoints (1TB+ for 70B+ models)
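As a rough back-of-the-envelope for the per-GPU figures above (assuming BF16 weights and gradients, FP32 Adam states of ~12 bytes/param, and ignoring activation memory, which sequence length and recompute settings dominate):
```bash
# Static state per GPU ~= params * bytes_per_param / (TP * PP)
# bytes_per_param ~= 2 (bf16 weights) + 2 (grads) + 12 (fp32 master copy + Adam moments) = 16
# (--use-distributed-optimizer further shards the 12 optimizer bytes across DP ranks)
PARAMS=70e9; TP=4; PP=4; BYTES_PER_PARAM=16
awk -v p="$PARAMS" -v tp="$TP" -v pp="$PP" -v b="$BYTES_PER_PARAM" \
    'BEGIN { printf "~%.0f GB static state per GPU (before activations)\n", p * b / (tp * pp) / 1e9 }'
# -> ~70 GB for a 70B model at TP=4, PP=4, which is why 80GB GPUs plus recompute are assumed
```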
## Resources
- Docs: https://docs.nvidia.com/megatron-core/
- GitHub: https://github.com/NVIDIA/Megatron-LM
- Papers:
- "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" (2019)
- "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM" (2021)
- NeMo Framework: https://docs.nvidia.com/nemo-framework/ (built on Megatron-Core)
Source: claude-code-templates (MIT).