# Production Examples
Real-world large-scale training runs that use Megatron-Core or Megatron-LM, or that follow the same parallelism patterns.
## Meta LLaMA 3
### Overview
- **Models**: 8B, 70B, 405B parameters
- **Training Infrastructure**: Two 24,000 H100 GPU clusters
- **Total Investment**: ~3.8 × 10^25 training FLOPs for the 405B model, over months of training
- **Training Data**: 15 trillion tokens for 405B model
- **Status**: Production deployed (llama.meta.com)
### LLaMA 3.1 405B - Largest Open Model
**Architecture**:
```yaml
Parameters: 405 billion
Layers: 126
Hidden size: 16384
Attention heads: 128
Query groups: 8 (GQA)
FFN size: 53248
Vocabulary: 128,256 tokens
Max context: up to 128K tokens
Position encoding: RoPE
Activation: SwiGLU
Normalization: RMSNorm
```
**Training Configuration**:
```bash
# Example layout for 1024 H100 GPUs (128 nodes × 8 GPUs); Meta's own run used up to 16,384 GPUs
Tensor Parallel (TP): 8 # Within node
Pipeline Parallel (PP): 8 # Across nodes
Context Parallel (CP): 2 # For long sequences
Data Parallel (DP): 8 # Remaining dimension
Total GPUs: 8 × 8 × 2 × 8 = 1024
Effective batch size: 2048
Micro-batch per GPU: 1
Sequence length: 4096 tokens
```
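The arithmetic behind this layout can be checked with a short sketch. The function name `parallel_layout` is hypothetical; it only multiplies the four parallel dimensions and derives the gradient-accumulation steps implied by the batch settings above.

```python
def parallel_layout(tp, pp, cp, dp, global_batch, micro_batch):
    """Return (world_size, grad_accum_steps) for a 4D-parallel layout."""
    world_size = tp * pp * cp * dp
    # Each optimizer step processes micro_batch samples on every data-parallel replica.
    samples_per_micro_step = micro_batch * dp
    if global_batch % samples_per_micro_step != 0:
        raise ValueError("global batch must be divisible by micro_batch * dp")
    return world_size, global_batch // samples_per_micro_step


world_size, accum_steps = parallel_layout(
    tp=8, pp=8, cp=2, dp=8, global_batch=2048, micro_batch=1
)
print(world_size, accum_steps)  # 1024 256
```

With a data-parallel size of 8 and a micro-batch of 1, each replica accumulates 256 micro-batches per optimizer step to reach the global batch of 2048.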
**Performance Metrics**:
- **Sustained throughput**: 400 TFlops/GPU
- **MFU**: ~40% on H100 (400 of ~989 peak dense BF16 TFLOPs)
- **Uptime**: 95%+ over months
- **Efficiency improvement**: 3× vs LLaMA 2 training
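As a rough illustration of how throughput relates to MFU, the sketch below divides achieved TFLOPs by a peak figure. The peak of 989 TFLOPs is an assumption (dense BF16 on an H100 SXM); other H100 variants or precisions give a different denominator and therefore a different MFU.

```python
# Assumption: dense BF16 peak of an H100 SXM, in TFLOPs.
H100_BF16_PEAK_TFLOPS = 989.0


def mfu(achieved_tflops, peak_tflops=H100_BF16_PEAK_TFLOPS):
    """Model FLOPs utilization: achieved throughput as a fraction of hardware peak."""
    return achieved_tflops / peak_tflops


print(f"{mfu(400):.1%}")  # 40.4%
```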
**Training Duration**:
- 15 trillion tokens total
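- Meta reports a 54-day snapshot of pre-training on up to 16,384 H100 GPUs
- At the same per-GPU throughput, 1,024 H100 GPUs would need roughly 16× longer (years rather than months)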
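A back-of-the-envelope estimate of training time can be sketched with the common 6·N·D approximation for training FLOPs (N parameters, D tokens). This is an approximation, not Meta's accounting: it ignores restarts, evaluation, and throughput changes, so it will not reproduce reported durations exactly.

```python
SECONDS_PER_DAY = 86_400


def training_days(params, tokens, num_gpus, tflops_per_gpu):
    """Estimate wall-clock days using the 6*N*D training-FLOPs approximation."""
    total_flops = 6 * params * tokens
    cluster_flops_per_second = num_gpus * tflops_per_gpu * 1e12
    return total_flops / cluster_flops_per_second / SECONDS_PER_DAY


# 405B parameters, 15T tokens, 400 sustained TFLOPs per GPU.
print(round(training_days(405e9, 15e12, 16_384, 400)))  # 64
print(round(training_days(405e9, 15e12, 1_024, 400)))   # 1030
```

The second figure is about 2.8 years, which shows why a 1,024-GPU cluster is not a practical way to train a model of this size on 15 trillion tokens.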
**Key Optimizations Used**:
```bash
--use-mcore-models
--transformer-impl transformer_engine
--sequence-parallel
--context-parallel-size 2
--use-distributed-optimizer
--overlap-grad-reduce
--overlap-param-gather
--use-flash-attn
--bf16
```
**Production Serving**:
- Deployed on llama.meta.com
- Available via API and download
- Used in Meta products (Instagram, Facebook, WhatsApp)
### LLaMA 3 70B
**Training Configuration**:
```bash
# 64 H100 GPUs (8 nodes × 8 GPUs)
# TP=4, PP=4, CP=2, DP=2
torchrun --nproc_per_node=8 --nnodes=8 pretrain_gpt.py \
    --num-layers 80 \
    --hidden-size 8192 \
    --num-attention-heads 64 \
    --group-query-attention \
    --num-query-groups 8 \
    --seq-length 4096 \
    --micro-batch-size 1 \
    --global-batch-size 1024 \
    --tensor-model-parallel-size 4 \
    --pipeline-model-parallel-size 4 \
    --context-parallel-size 2 \
    --bf16 \
    --use-mcore-models
```
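Before launching a job like this, it can help to sanity-check that the model dimensions divide evenly across the parallel groups. The sketch below is a hypothetical pre-flight check, not part of Megatron-LM; the constraints it encodes are the typical ones, and the exact rules may differ between versions.

```python
def check_config(world_size, tp, pp, cp, num_layers, num_heads,
                 num_query_groups, global_batch, micro_batch):
    """Return (data_parallel_size, problems) for a proposed layout."""
    problems = []
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        problems.append("world size is not divisible by TP * PP * CP")
    dp = world_size // model_parallel
    if num_layers % pp != 0:
        problems.append("layers do not split evenly across pipeline stages")
    if num_heads % tp != 0:
        problems.append("attention heads do not split evenly across TP ranks")
    if num_query_groups % tp != 0:
        problems.append("query groups do not split evenly across TP ranks")
    if dp == 0 or global_batch % (micro_batch * dp) != 0:
        problems.append("global batch is not a multiple of micro_batch * DP")
    return dp, problems


dp, problems = check_config(
    world_size=64, tp=4, pp=4, cp=2, num_layers=80, num_heads=64,
    num_query_groups=8, global_batch=1024, micro_batch=1,
)
print(dp, problems)  # 2 []
```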
**Memory per GPU**:
- Model parameters: 140GB / 4 (TP) / 4 (PP) = 8.75GB
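- Optimizer states: ~17.5GB (assuming two fp32 Adam moments sharded over DP=2)
- Activations: ~3GB
- **Total**: ~30GB per H100 as a rough lower bound (fits in 80GB; gradients and fp32 master weights add to this)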
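The estimate above can be reproduced with a few lines of arithmetic. This is a simplified sketch under stated assumptions: bf16 weights (2 bytes per parameter), 8 bytes per parameter of optimizer state sharded across the data-parallel group, and a fixed activation budget. It deliberately leaves out gradients and fp32 master weights, so real usage will be higher.

```python
GB = 1e9


def per_gpu_memory_gb(params, tp, pp, dp, weight_bytes=2,
                      optimizer_bytes=8, activations_gb=3.0):
    """Return (weights_gb, optimizer_gb, total_gb) for one GPU."""
    # Parameters held by one GPU after tensor and pipeline sharding.
    shard = params / (tp * pp)
    weights_gb = shard * weight_bytes / GB
    # Assumption: optimizer state is split evenly across data-parallel ranks.
    optimizer_gb = shard * optimizer_bytes / dp / GB
    return weights_gb, optimizer_gb, weights_gb + optimizer_gb + activations_gb


weights, optimizer, total = per_gpu_memory_gb(70e9, tp=4, pp=4, dp=2)
print(weights, optimizer, total)  # 8.75 17.5 29.25
```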
## NVIDIA Nemotron-4 340B
### Overview
- **Organization**: NVIDIA
- **Parameters**: 340 billion
- **Framework**: NeMo (built on Megatron-Core)
- **Purpose**: Enterprise AI foundation model
- **Status**: Commercial deployment
**Key Features**:
- Dense decoder-only Transformer architecture
- Optimized for enterprise use cases
- NeMo framework integration
- Production-ready deployment
**Architecture**:
```yaml
Type: Dense decoder-only Transformer
Total parameters: 340B
Layers: 96
Hidden size: 18432
Attention heads: 96
Query groups: 8 (GQA)
Activation: Squared ReLU
Vocabulary: 256,000 tokens
Context length: 4096
```
**Training Infrastructure**:
- NVIDIA DGX H100 systems
- Megatron-Core + NeMo
- Multi-node training
- Enterprise-grade fault tolerance
**Production Features**:
- NeMo Guardrails integration
- Enterprise support
- Customization options
- On-premise deployment available
## Microsoft & NVIDIA Megatron-Turing NLG 530B
### Overview
- **Organization**: Microsoft + NVIDIA collaboration
- **Parameters**: 530 billion (largest dense model when released)
- **Year**: 2021
- **Framework**: Megatron-DeepSpeed (Megatron tensor parallelism + DeepSpeed pipeline and data parallelism)
- **Hardware**: Selene supercomputer, 560 NVIDIA DGX A100 servers (8 × A100 80GB each)
**Architecture**:
```yaml
Parameters: 530 billion
Layers: 105
Hidden size: 20480
Attention heads: 128
Vocabulary: 51,200 tokens
Sequence length: 2048
```
**Training Configuration**:
```bash
# Each model replica spans 280 A100 80GB GPUs
Tensor Parallel: 8       # Within a node
Pipeline Parallel: 35    # Across nodes
GPUs per replica: 8 × 35 = 280
Data Parallel: total GPUs / 280
Global batch size: 1920
```
**Innovations**:
- Combined DeepSpeed pipeline and data parallelism with Megatron tensor parallelism (3D parallelism)
- Demonstrated training at 500B+ scale
- Proved viability of extreme parallelism
**Performance**:
- Trained on 270 billion tokens drawn from a 339-billion-token corpus
- Multiple months of training
- Achieved state-of-the-art results in 2021
## BigScience BLOOM 176B
### Overview
- **Organization**: BigScience (1000+ researchers)
- **Parameters**: 176 billion
- **Year**: 2022
- **Framework**: Megatron-DeepSpeed
- **Hardware**: 384 NVIDIA A100 80GB GPUs
- **Training Duration**: ~3.5 months (March to July 2022)
**Architecture**:
```yaml
Parameters: 176 billion
Layers: 70
Hidden size: 14336
Attention heads: 112
Vocabulary: 250,680 tokens (multilingual)
Sequence length: 2048
Languages: 46 natural languages + 13 programming languages
```
**Training Configuration**:
```bash
# 384 A100 80GB GPUs on Jean Zay supercomputer
Tensor Parallel: 4
Pipeline Parallel: 12
Data Parallel: 8
Total: 4 × 12 × 8 = 384
Global batch size: 2048
Micro-batch size: 4
Learning rate: 6e-5
Optimizer: Adam (β1=0.9, β2=0.95)
```
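The batch settings above determine how many tokens each optimizer step consumes and roughly how many steps the token budget requires. The sketch below assumes a constant global batch for the whole run, which is a simplification: runs that ramp the batch size up early in training will need more steps than this.

```python
def batch_plan(global_batch, micro_batch, dp, seq_len, total_tokens):
    """Return (grad_accum_steps, tokens_per_step, optimizer_steps)."""
    grad_accum_steps = global_batch // (micro_batch * dp)
    tokens_per_step = global_batch * seq_len
    # Integer ceiling division: a partial final step still counts as a step.
    optimizer_steps = -(-total_tokens // tokens_per_step)
    return grad_accum_steps, tokens_per_step, optimizer_steps


print(batch_plan(global_batch=2048, micro_batch=4, dp=8,
                 seq_len=2048, total_tokens=366_000_000_000))
# (64, 4194304, 87262)
```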
**Training Data**:
- 366 billion tokens (1.6TB)
- ROOTS corpus (custom multilingual dataset)
- 46 natural languages
- 13 programming languages
**Key Achievements**:
- Largest multilingual open-source model at release
- Trained on public supercomputer (Jean Zay)
- Fully documented training process
- Open-source model and training code
**Public Impact**:
- Downloaded 100,000+ times
- Used in hundreds of research papers
- Enabled multilingual AI research
- Demonstrated open science at scale
## DeepSeek-V3
### Overview
- **Organization**: DeepSeek
- **Parameters**: 671 billion total, 37B active per token
- **Type**: Mixture of Experts (MoE)
- **Year**: 2024
- **Framework**: HAI-LLM (DeepSeek's in-house framework); the architecture is also supported in Megatron-Core
**Architecture**:
```yaml
Type: Mixture of Experts
Total parameters: 671B
Active parameters per token: 37B
Layers: 61
Hidden size: 7168
Attention heads: 128
Attention: Multi-head Latent Attention (MLA)
Routed experts: 256 (massive MoE)
Shared experts: 1
Router top-k: 8
Expert FFN size: 2048
Dense FFN size: 18432 (first 3 layers)
```
**Training Configuration**:
```bash
# 2048 H800 GPUs
Tensor Parallel (TP): 1       # Not used
Pipeline Parallel (PP): 16
Expert Parallel (EP): 64      # Spans 8 nodes
Data Parallel (DP): ZeRO-1
Global batch size: ramped from 3072 to 15360 sequences
Sequence length: 4096
Training tokens: 14.8 trillion
```
**Innovations**:
- Multi-head Latent Attention (MLA) to shrink the KV cache
- Shared experts + routed experts
- Ultra-large expert count (256)
- Auxiliary-loss-free load balancing
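To make the "shared experts + routed experts" idea concrete, here is a minimal pure-Python sketch of top-k routing. It is a generic illustration, not DeepSeek's implementation: the gating function (softmax over the selected scores), the toy scalar experts, and the names `route_top_k` and `moe_output` are all assumptions made for the example.

```python
import math


def route_top_k(scores, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]  # subtract peak for stability
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_output(x, scores, experts, shared_expert, k):
    """Shared expert always runs; routed experts are mixed by their gate weights."""
    routed = sum(weight * experts[i](x) for i, weight in route_top_k(scores, k))
    return shared_expert(x) + routed


# Toy experts that scale their input by 1, 2, 3 and 4.
experts = [lambda x, s=s: s * x for s in (1, 2, 3, 4)]
scores = [0.1, 2.0, -1.0, 2.0]  # router logits for one token
print(moe_output(1.0, scores, experts, shared_expert=lambda x: x, k=2))  # 4.0
```

Only the selected experts execute for a given token, which is why the active parameter count per token is far smaller than the total.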
**Performance**:
- Competitive with GPT-4
- 37B active params rivals 70B+ dense models
- Efficient inference (only 37B active)
## OpenAI GPT-3 175B (2020)
### Overview
- **Organization**: OpenAI
- **Parameters**: 175 billion
- **Year**: 2020
- **Framework**: Megatron-inspired custom implementation
- **Hardware**: Thousands of NVIDIA V100 GPUs
**Architecture**:
```yaml
Parameters: 175 billion
Layers: 96
Hidden size: 12288
Attention heads: 96
FFN size: 49152
Vocabulary: 50,257 tokens (GPT-2 BPE)
Sequence length: 2048 tokens
```
**Training Configuration**:
```bash
# Estimated configuration
Tensor Parallel: 4-8
Pipeline Parallel: 8-16
Data Parallel: Remaining GPUs
Global batch size: 1536
Learning rate: 6e-5
Training tokens: 300 billion
```
**Training Compute**:
- 3.14 × 10^23 FLOPs
- Equivalent to ~355 GPU-years on V100
- Estimated cost: $4-12 million
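The compute figure is close to what the 6·N·D approximation gives for 175 billion parameters and 300 billion tokens. The sketch below also converts total FLOPs to GPU-years; the sustained V100 throughput of 28 TFLOPs is an assumption chosen for illustration, and a different value changes the result proportionally.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def train_flops(params, tokens):
    """Approximate training compute with the 6*N*D rule."""
    return 6 * params * tokens


def gpu_years(total_flops, sustained_tflops):
    """Years a single GPU would need at the given sustained throughput."""
    return total_flops / (sustained_tflops * 1e12) / SECONDS_PER_YEAR


print(f"{train_flops(175e9, 300e9):.2e}")  # 3.15e+23
# Assumption: a V100 sustaining 28 TFLOPs.
print(f"{gpu_years(3.14e23, 28):.0f} GPU-years")
```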
**Impact**:
- Launched modern era of large language models
- Demonstrated few-shot learning
- Foundation for ChatGPT
## Stability AI StableLM
### Overview
- **Organization**: Stability AI
- **Framework**: GPT-NeoX (Megatron + DeepSpeed)
- **Hardware**: Training on supercomputers
- **Status**: Open-source
**Models**:
- StableLM-Base-Alpha: 3B, 7B
- StableLM-Tuned-Alpha: Fine-tuned versions
- StableCode: Code-specialized
**Training Configuration**:
```yaml
Framework: GPT-NeoX
Parallelism: Megatron TP/PP + DeepSpeed ZeRO
GPUs: A100 clusters
Training data: 1.5 trillion tokens (dataset built on The Pile)
```
**Key Features**:
- Fully open-source (Apache 2.0)
- GPT-NeoX framework
- Trained on an extended dataset built on The Pile
- Multiple model sizes
## Common Production Patterns
### Fault Tolerance
**Checkpoint Strategy**:
```bash
--save-interval 500 # Save every 500 iterations
--save /checkpoints/model_name # Checkpoint directory
--load /checkpoints/model_name # Auto-resume from latest
```
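When testing recovery, it is useful to confirm which checkpoint a resume would pick up. The sketch below assumes the checkpoint directory layout commonly written by Megatron-LM (a `latest_checkpointed_iteration.txt` tracker file next to `iter_XXXXXXX` directories); verify this against your installed version, since the helper `latest_checkpoint_dir` is hypothetical and not part of the library.

```python
from pathlib import Path


def latest_checkpoint_dir(save_dir):
    """Return the directory a resume would load from, or None if none exists."""
    tracker = Path(save_dir) / "latest_checkpointed_iteration.txt"
    if not tracker.is_file():
        return None
    marker = tracker.read_text().strip()
    if marker == "release":
        # Assumption: released (converted) checkpoints live in a "release" directory.
        return Path(save_dir) / "release"
    return Path(save_dir) / f"iter_{int(marker):07d}"


print(latest_checkpoint_dir("/checkpoints/model_name"))
```

Running a check like this in a scheduled job gives an early warning if checkpoints stop being written.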
**Monitoring**:
```text
# Check in progress.txt
Job throughput: 45.2 TFLOPs/GPU
Cumulative throughput: 44.8 TFLOPs/GPU
Memory usage: 68.2 GB / 80 GB
Loss: 2.143
```
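Log lines in the `name: value unit` shape shown above are easy to turn into numbers for alerting. The parser below is a sketch that assumes exactly that shape; real log formats vary between versions, so the regular expression may need adjusting.

```python
import re

# Matches "Name: 12.3 ..." and captures the name and the first number.
METRIC = re.compile(r"^(?P<name>[^:]+):\s*(?P<value>-?\d+(?:\.\d+)?)")


def parse_metrics(text):
    """Return {metric name: first numeric value} for every matching line."""
    metrics = {}
    for line in text.splitlines():
        match = METRIC.match(line.strip())
        if match:
            metrics[match.group("name").strip()] = float(match.group("value"))
    return metrics


sample = """Job throughput: 45.2 TFLOPs/GPU
Cumulative throughput: 44.8 TFLOPs/GPU
Memory usage: 68.2 GB / 80 GB
Loss: 2.143"""
metrics = parse_metrics(sample)
# Flag a job whose current throughput falls well below its running average.
slow = metrics["Job throughput"] < 0.9 * metrics["Cumulative throughput"]
print(metrics["Loss"], slow)  # 2.143 False
```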
### Data Pipeline
**Preprocessing**:
```bash
python tools/preprocess_data.py \
    --input data.jsonl \
    --output-prefix /data/processed \
    --vocab-file vocab.json \
    --merge-file merges.txt \
    --tokenizer-type GPT2BPETokenizer \
    --append-eod \
    --workers 64
```
**Training with Preprocessed Data**:
```bash
--data-path /data/processed_text_document
--split 969,30,1 # Train/valid/test split
```
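The `--split` value is a list of relative weights rather than percentages. A small sketch of the normalization, using a hypothetical helper `split_fractions`, shows what `969,30,1` means in practice.

```python
def split_fractions(split):
    """Normalize comma-separated split weights into fractions that sum to 1."""
    weights = [float(part) for part in split.split(",")]
    total = sum(weights)
    return [weight / total for weight in weights]


train, valid, test = split_fractions("969,30,1")
print(f"train={train:.1%} valid={valid:.1%} test={test:.1%}")
# train=96.9% valid=3.0% test=0.1%
```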
### Monitoring & Logging
**Key Metrics to Track**:
```bash
# Training metrics
- Loss (should steadily decrease)
- Learning rate (follows schedule)
- Gradient norm (watch for spikes)
- Throughput (TFlops/GPU)
- MFU percentage
# System metrics
- GPU utilization (>90%)
- Memory usage (<95% of capacity)
- Network bandwidth (saturated for TP)
- Data loading time (should be minimal)
```
**Production Monitoring Tools**:
- TensorBoard for loss curves
- Weights & Biases for experiment tracking
- Prometheus + Grafana for system metrics
- Custom scripts for MFU calculation
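Watching for gradient-norm spikes can be automated with a simple rolling baseline. The sketch below flags any step whose norm exceeds a multiple of the median of the preceding window; the window size and the factor of 3 are arbitrary illustrative defaults, not recommended thresholds.

```python
from statistics import median


def find_spikes(grad_norms, window=5, factor=3.0):
    """Return step indices whose norm exceeds factor * median of the prior window."""
    spikes = []
    for step in range(window, len(grad_norms)):
        baseline = median(grad_norms[step - window:step])
        if grad_norms[step] > factor * baseline:
            spikes.append(step)
    return spikes


norms = [1.0, 1.1, 0.9, 1.0, 1.2, 1.1, 9.5, 1.0, 1.1]
print(find_spikes(norms))  # [6]
```

Using the median rather than the mean keeps a single spike from inflating the baseline for the steps that follow it.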
### Multi-Datacenter Training
**Challenges**:
- Higher latency between datacenters
- Network bandwidth limitations
- Fault isolation
**Solutions**:
```bash
# Keep TP within datacenter
--tensor-model-parallel-size 8 # Single node only
# Use PP across datacenters
--pipeline-model-parallel-size 16 # Across sites
# Data parallel across everything
# Automatic from remaining GPUs
```
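Whether tensor-parallel traffic stays on fast local links depends on how ranks map to nodes. The sketch below assumes ranks are assigned consecutively within each node and that tensor-parallel groups are formed from consecutive ranks (the usual default ordering); both are assumptions to confirm for your launcher and framework version.

```python
def tp_groups(world_size, tp):
    """Group consecutive ranks into tensor-parallel groups of size tp."""
    return [list(range(start, start + tp)) for start in range(0, world_size, tp)]


def groups_within_node(groups, gpus_per_node=8):
    """True if every group's first and last rank live on the same node."""
    return all(g[0] // gpus_per_node == g[-1] // gpus_per_node for g in groups)


print(groups_within_node(tp_groups(128, 8)))   # True
print(groups_within_node(tp_groups(128, 16)))  # False
```

A tensor-parallel size larger than the node size forces the most latency-sensitive collective across the network, which is the situation this layout is meant to avoid.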
## Lessons from Production
1. **Fault Tolerance is Critical**
- Save checkpoints frequently (every 500-1000 steps)
- Test checkpoint recovery regularly
- Monitor for GPU failures
2. **Data Quality Matters More Than Quantity**
- LLaMA 3: Carefully curated 15T tokens
- Better than naive web scraping
- Investment in data preprocessing pays off
3. **Parallelism Strategy Evolves with Scale**
- <70B: TP + DP sufficient
- 70-175B: Add PP
- 175B+: 3D or 4D parallelism required
- MoE: Add EP dimension
4. **Hardware Matters**
- H100 vs A100: 2× speedup from better hardware
- NVLink topology affects TP efficiency
- InfiniBand essential for multi-node
5. **Monitoring is Essential**
- Track MFU to catch performance issues
- Monitor loss for training health
- Watch memory usage to avoid OOM
- Log everything for debugging
## References
- Meta LLaMA 3 technical report
- NVIDIA Nemotron blog posts
- Microsoft Megatron-Turing NLG paper
- BigScience BLOOM documentation
- DeepSeek-V3 technical report
Source: claude-code-templates (MIT).