
Production Examples


# Production Examples

Real-world production deployments that use, or share their parallelism design with, Megatron-Core and Megatron-LM.

## Meta LLaMA 3

### Overview

- **Models**: 8B, 70B, 405B parameters
- **Training Infrastructure**: Two 24,000 H100 GPU clusters
- **Total Investment**: Massive scale, months of training
- **Training Data**: 15 trillion tokens for 405B model
- **Status**: Production deployed (llama.meta.com)

### LLaMA 3.1 405B - Largest Open Model

**Architecture**:

```yaml
Parameters: 405 billion
Layers: 126
Hidden size: 16384
Attention heads: 128
Query groups: 8 (GQA)
FFN size: 53248
Vocabulary: 128,256 tokens
Max context: up to 128K tokens
Position encoding: RoPE
Activation: SwiGLU
Normalization: RMSNorm
```

**Training Configuration**:

```yaml
# 1024 H100 GPUs (128 nodes × 8 GPUs)
Tensor Parallel (TP): 8      # Within node
Pipeline Parallel (PP): 8    # Across nodes
Context Parallel (CP): 2     # For long sequences
Data Parallel (DP): 8        # Remaining dimension
Total GPUs: 8 × 8 × 2 × 8 = 1024
Effective batch size: 2048
Micro-batch per GPU: 1
Sequence length: 4096 tokens
```

**Performance Metrics**:

- **Sustained throughput**: 400 TFLOPs/GPU
- **MFU**: ~40% on H100 (400 of ~989 peak dense BF16 TFLOPs)
- **Effective training time**: 95%+ over months
- **Efficiency improvement**: ~3× vs LLaMA 2 training

**Training Duration**:

- 15 trillion tokens total
- ~54 days on 16,384 H100 GPUs
- At the same per-GPU throughput, 1,024 H100 GPUs would take roughly 16× longer (about 2.4 years)

**Key Optimizations Used**:

```bash
--use-mcore-models
--transformer-impl transformer_engine
--sequence-parallel
--context-parallel-size 2
--use-distributed-optimizer
--overlap-grad-reduce
--overlap-param-gather
--use-flash-attn
--bf16
```

**Production Serving**:

- Deployed on llama.meta.com
- Available via API and download
- Used in Meta products (Instagram, Facebook, WhatsApp)

### LLaMA 3 70B

**Training Configuration**:

```bash
# 64 H100 GPUs (8 nodes × 8 GPUs)
# TP=4, PP=4, CP=2, DP=2
torchrun --nproc_per_node=8 --nnodes=8 pretrain_gpt.py \
    --num-layers 80 \
    --hidden-size 8192 \
    --num-attention-heads 64 \
    --group-query-attention \
    --num-query-groups 8 \
    --seq-length 4096 \
    --micro-batch-size 1 \
    --global-batch-size 1024 \
    --tensor-model-parallel-size 4 \
    --pipeline-model-parallel-size 4 \
    --context-parallel-size 2 \
    --bf16 \
    --use-mcore-models
```

**Memory per GPU**:

- Model parameters: 140GB (70B × 2 bytes in BF16) / 4 (TP) / 4 (PP) = 8.75GB
- Optimizer states: ~17.5GB
- Activations: ~3GB
- **Total**: ~30GB per H100 (fits in 80GB)

## NVIDIA Nemotron-4 340B

### Overview

- **Organization**: NVIDIA
- **Parameters**: 340 billion
- **Framework**: NeMo (built on Megatron-Core)
- **Purpose**: Enterprise AI foundation model
- **Status**: Commercial deployment

**Key Features**:

- Dense decoder-only Transformer with grouped-query attention
- Optimized for enterprise use cases
- NeMo framework integration
- Production-ready deployment

**Architecture**:

```yaml
Type: Dense decoder-only Transformer
Total parameters: 340B
Layers: 96
Hidden size: 18432
Attention heads: 96
Query groups: 8 (GQA)
Context length: 4096
```

**Training Infrastructure**:

- NVIDIA DGX H100 systems
- Megatron-Core + NeMo
- Multi-node training
- Enterprise-grade fault tolerance

**Production Features**:

- NeMo Guardrails integration
- Enterprise support
- Customization options
- On-premise deployment available

## Microsoft & NVIDIA Megatron-Turing NLG 530B

### Overview

- **Organization**: Microsoft + NVIDIA collaboration
- **Parameters**: 530 billion (largest dense model when released)
- **Year**: 2021
- **Framework**: DeepSpeed (data and pipeline parallelism) + Megatron-LM (tensor parallelism)
- **Hardware**: 560 NVIDIA DGX A100 servers (4,480 A100 80GB GPUs)

**Architecture**:

```yaml
Parameters: 530 billion
Layers: 105
Hidden size: 20480
Attention heads: 128
Vocabulary: 51,200 tokens
Sequence length: 2048
```

**Training Configuration**:

```yaml
# 560 DGX A100 servers × 8 GPUs = 4480 A100 80GB GPUs
Tensor Parallel: 8       # Megatron-LM, within a node
Pipeline Parallel: 35    # DeepSpeed, across nodes
Data Parallel: up to 16  # Each replica spans 8 × 35 = 280 GPUs
Total: 8 × 35 × 16 = 4480
```

**Innovations**:

- Combined DeepSpeed data and pipeline parallelism with Megatron-LM tensor parallelism (3D parallelism)
- Demonstrated training at 500B+ scale
- Proved viability of extreme parallelism

**Performance**:

- Training corpus of 339 billion tokens
- Multiple months of training
- Achieved state-of-the-art results in 2021

## BigScience BLOOM 176B

### Overview

- **Organization**: BigScience (1000+ researchers)
- **Parameters**: 176 billion
- **Year**: 2022
- **Framework**: Megatron-DeepSpeed
- **Hardware**: 384 NVIDIA A100 80GB GPUs
- **Training Duration**: about 3.5 months (March to July 2022)

**Architecture**:

```yaml
Parameters: 176 billion
Layers: 70
Hidden size: 14336
Attention heads: 112
Vocabulary: 250,680 tokens (multilingual)
Sequence length: 2048
Languages: 46 natural languages + 13 programming languages
```

**Training Configuration**:

```yaml
# 384 A100 80GB GPUs on Jean Zay supercomputer
Tensor Parallel: 4
Pipeline Parallel: 12
Data Parallel: 8
Total: 4 × 12 × 8 = 384
Global batch size: 2048
Micro-batch size: 4
Learning rate: 6e-5
Optimizer: Adam (β1=0.9, β2=0.95)
```

**Training Data**:

- 366 billion tokens (1.6TB)
- ROOTS corpus (custom multilingual dataset)
- 46 natural languages
- 13 programming languages

**Key Achievements**:

- Largest multilingual open-source model at release
- Trained on public supercomputer (Jean Zay)
- Fully documented training process
- Open-source model and training code

**Public Impact**:

- Downloaded 100,000+ times
- Used in hundreds of research papers
- Enabled multilingual AI research
- Demonstrated open science at scale

## DeepSeek-V3

### Overview

- **Organization**: DeepSeek
- **Parameters**: 671 billion total, 37B active per token
- **Type**: Mixture of Experts (MoE)
- **Year**: 2024 (released December 2024)
- **Framework**: HAI-LLM (DeepSeek's in-house framework); the architecture can also be trained with Megatron-Core

**Architecture**:

```yaml
Type: Mixture of Experts
Total parameters: 671B
Active parameters per token: 37B
Layers: 61
Hidden size: 7168
Attention heads: 128
Attention: Multi-head Latent Attention (MLA)
Experts: 256 routed + 1 shared (massive MoE)
Router top-k: 8
Expert FFN size: 2048
Dense FFN size: 18432 (first 3 layers)
```

**Training Configuration**:

```yaml
# Example layout on 1024 H100 GPUs (DeepSeek's own run used 2,048 H800 GPUs)
Tensor Parallel (TP): 2
Pipeline Parallel (PP): 16
Expert Parallel (EP): 64
Context Parallel (CP): 1
Data Parallel (DP): 32       # 1024 / (TP × PP × CP)
Total: 2 × 16 × 1 × 32 = 1024
# EP is not an extra multiplier: with expert tensor parallelism of 1, the
# 64 expert-parallel ranks are the TP × DP = 64 ranks of each pipeline stage
Global batch size: 4096
Sequence length: 4096
Training tokens: 14.8 trillion
```

**Innovations**:

- Multi-head Latent Attention (MLA), which compresses the KV cache
- Shared experts + routed experts
- Ultra-large expert count (256)
- Auxiliary-loss-free load balancing

**Performance**:

- Competitive with GPT-4
- 37B active params rivals 70B+ dense models
- Efficient inference (only 37B active)

## OpenAI GPT-3 175B (2020)

### Overview

- **Organization**: OpenAI
- **Parameters**: 175 billion
- **Year**: 2020
- **Framework**: Megatron-inspired custom implementation
- **Hardware**: Thousands of NVIDIA V100 GPUs

**Architecture**:

```yaml
Parameters: 175 billion
Layers: 96
Hidden size: 12288
Attention heads: 96
FFN size: 49152
Vocabulary: 50,257 tokens (GPT-2 BPE)
Sequence length: 2048 tokens (context window)
```

**Training Configuration**:

```yaml
# Estimated configuration
Tensor Parallel: 4-8
Pipeline Parallel: 8-16
Data Parallel: Remaining GPUs
Global batch size: 1536
Learning rate: 6e-5
Training tokens: 300 billion
```

**Training Compute**:

- 3.14 × 10^23 FLOPs
- Equivalent to ~355 GPU-years on V100
- Estimated cost: $4-12 million

**Impact**:

- Launched modern era of large language models
- Demonstrated few-shot learning
- Foundation for ChatGPT

## Stability AI StableLM

### Overview

- **Organization**: Stability AI
- **Framework**: GPT-NeoX (Megatron + DeepSpeed)
- **Hardware**: Training on supercomputers
- **Status**: Open-source

**Models**:

- StableLM-Base-Alpha: 3B, 7B
- StableLM-Tuned-Alpha: Fine-tuned versions
- StableCode: Code-specialized

**Training Configuration**:

```yaml
Framework: GPT-NeoX
Parallelism: Megatron TP/PP + DeepSpeed ZeRO
GPUs: A100 clusters
Training data: 1.5 trillion tokens (dataset built on The Pile)
```

**Key Features**:

- Openly released: code under Apache 2.0, base Alpha checkpoints under CC BY-SA 4.0
- GPT-NeoX framework
- Trained on a dataset built on The Pile
- Multiple model sizes

## Common Production Patterns

### Fault Tolerance

**Checkpoint Strategy**:

```bash
--save-interval 500              # Save every 500 iterations
--save /checkpoints/model_name   # Checkpoint directory
--load /checkpoints/model_name   # Auto-resume from latest
```

**Monitoring**:

```text
# Check in progress.txt
Job throughput: 45.2 TFLOPs/GPU
Cumulative throughput: 44.8 TFLOPs/GPU
Memory usage: 68.2 GB / 80 GB
Loss: 2.143
```

### Data Pipeline

**Preprocessing**:

```bash
python tools/preprocess_data.py \
    --input data.jsonl \
    --output-prefix /data/processed \
    --vocab-file vocab.json \
    --merge-file merges.txt \
    --tokenizer-type GPT2BPETokenizer \
    --append-eod \
    --workers 64
```

**Training with Preprocessed Data**:

```bash
--data-path /data/processed_text_document
--split 969,30,1   # Train/valid/test split
```

### Monitoring & Logging

**Key Metrics to Track**:

```text
# Training metrics
- Loss (should steadily decrease)
- Learning rate (follows schedule)
- Gradient norm (watch for spikes)
- Throughput (TFLOPs/GPU)
- MFU percentage

# System metrics
- GPU utilization (>90%)
- Memory usage (<95% of capacity)
- Network bandwidth (saturated for TP)
- Data loading time (should be minimal)
```

**Production Monitoring Tools**:

- TensorBoard for loss curves
- Weights & Biases for experiment tracking
- Prometheus + Grafana for system metrics
- Custom scripts for MFU calculation

### Multi-Datacenter Training

**Challenges**:

- Higher latency between datacenters
- Network bandwidth limitations
- Fault isolation

**Solutions**:

```bash
# Keep TP within datacenter
--tensor-model-parallel-size 8     # Single node only

# Use PP across datacenters
--pipeline-model-parallel-size 16  # Across sites

# Data parallel across everything
# Automatic from remaining GPUs
```

## Lessons from Production

1. **Fault Tolerance is Critical**
   - Save checkpoints frequently (every 500-1000 steps)
   - Test checkpoint recovery regularly
   - Monitor for GPU failures

2. **Data Quality Matters More Than Quantity**
   - LLaMA 3: Carefully curated 15T tokens
   - Better than naive web scraping
   - Investment in data preprocessing pays off

3. **Parallelism Strategy Evolves with Scale**
   - <70B: TP + DP sufficient
   - 70-175B: Add PP
   - 175B+: 3D or 4D parallelism required
   - MoE: Add EP dimension

4. **Hardware Matters**
   - H100 vs A100: roughly 2× speedup from better hardware
   - NVLink topology affects TP efficiency
   - InfiniBand essential for multi-node

5. **Monitoring is Essential**
   - Track MFU to catch performance issues
   - Monitor loss for training health
   - Watch memory usage to avoid OOM
   - Log everything for debugging

## References

- Meta LLaMA 3 technical report
- NVIDIA Nemotron blog posts
- Microsoft Megatron-Turing NLG paper
- BigScience BLOOM documentation
- DeepSeek-V3 technical report
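
The parallelism layouts quoted in the training configurations above all follow the same bookkeeping: the GPUs holding one model replica number TP × PP × CP, and data parallelism is whatever is left over. The following is a minimal sketch of that arithmetic, assuming the GPU counts and batch sizes listed for LLaMA 3 and BLOOM. The `Layout` class and its method names are hypothetical helpers for illustration, not part of Megatron-Core.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical helper for this sketch; not a Megatron-Core class."""

    gpus: int
    tp: int = 1
    pp: int = 1
    cp: int = 1

    @property
    def replica_size(self) -> int:
        # GPUs that jointly hold one copy of the model.
        return self.tp * self.pp * self.cp

    @property
    def dp(self) -> int:
        # Data parallelism is the dimension left over after TP, PP and CP.
        if self.gpus % self.replica_size:
            raise ValueError(
                f"{self.gpus} GPUs is not a multiple of TP*PP*CP={self.replica_size}"
            )
        return self.gpus // self.replica_size

    def grad_accum_steps(self, global_batch: int, micro_batch: int) -> int:
        # One forward/backward pass covers micro_batch samples on each DP rank.
        per_pass = micro_batch * self.dp
        if global_batch % per_pass:
            raise ValueError("global batch must be a multiple of micro_batch * DP")
        return global_batch // per_pass


# Layouts as quoted in this document.
llama_405b = Layout(gpus=1024, tp=8, pp=8, cp=2)
llama_70b = Layout(gpus=64, tp=4, pp=4, cp=2)
bloom_176b = Layout(gpus=384, tp=4, pp=12)

for name, layout, global_batch, micro_batch in [
    ("LLaMA 3.1 405B", llama_405b, 2048, 1),
    ("LLaMA 3 70B", llama_70b, 1024, 1),
    ("BLOOM 176B", bloom_176b, 2048, 4),
]:
    steps = layout.grad_accum_steps(global_batch, micro_batch)
    print(f"{name}: DP={layout.dp}, gradient accumulation steps={steps}")
```

Running the same check on a planned layout before launching a job is a cheap way to catch a GPU count that does not divide evenly, or a global batch size that the chosen micro-batch and DP size cannot produce.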

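The training-compute, duration, and MFU figures quoted for GPT-3 and LLaMA 3 can be sanity-checked with the common 6 × parameters × tokens approximation for dense Transformer training FLOPs. The sketch below is a rough estimate under stated assumptions, not a reproduction of any team's accounting: the function names are hypothetical, and the H100 peak of 989 dense BF16 TFLOPs is an assumed datasheet value.

```python
SECONDS_PER_DAY = 86_400
H100_PEAK_BF16_TFLOPS = 989.0  # assumed dense BF16 peak for H100 SXM


def training_flops(params: float, tokens: float) -> float:
    # Roughly 6 FLOPs per parameter per token (forward plus backward pass).
    return 6.0 * params * tokens


def training_days(params: float, tokens: float, gpus: int, tflops_per_gpu: float) -> float:
    # Wall-clock days at a sustained per-GPU throughput, ignoring restarts.
    flops_per_second = gpus * tflops_per_gpu * 1e12
    return training_flops(params, tokens) / flops_per_second / SECONDS_PER_DAY


def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    # Model FLOPs utilization: achieved throughput as a fraction of hardware peak.
    return achieved_tflops / peak_tflops


gpt3_flops = training_flops(params=175e9, tokens=300e9)
days_16k = training_days(405e9, 15e12, gpus=16_384, tflops_per_gpu=400)
days_1k = training_days(405e9, 15e12, gpus=1_024, tflops_per_gpu=400)
h100_mfu = mfu(400, H100_PEAK_BF16_TFLOPS)

print(f"GPT-3 training compute: {gpt3_flops:.3g} FLOPs")
print(f"LLaMA 3.1 405B on 16,384 GPUs: {days_16k:.0f} days")
print(f"LLaMA 3.1 405B on 1,024 GPUs: {days_1k:.0f} days")
print(f"MFU at 400 TFLOPs/GPU: {h100_mfu:.1%}")
```

For GPT-3 the approximation gives 3.15 × 10^23 FLOPs, close to the 3.14 × 10^23 quoted above. For LLaMA 3.1 405B it lands on the order of two months on 16,384 GPUs, and exactly 16× longer on 1,024 GPUs at the same per-GPU throughput.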
Source: claude-code-templates (MIT). See About Us for full credits.
