moe-training
[ SKILL_DOCUMENTATION ]
# MoE Training: Mixture of Experts Models
## When to Use This Skill
Use MoE training when you need to:
- **Train larger models** on a limited compute budget (roughly 5x lower cost than a comparable dense model)
- **Scale model capacity** without a proportional increase in compute (see the arithmetic sketch after this list)
- **Get more performance per unit of compute** than a dense model delivers
- **Specialize experts** for different domains, tasks, or languages
- **Reduce inference latency** via sparse activation (Mixtral activates only 13B of its 47B parameters per token)
- **Reproduce SOTA architectures** such as Mixtral 8x7B, DeepSeek-V3, and Switch Transformers
**Well-known MoE models**: Mixtral 8x7B (Mistral AI), DeepSeek-V3, Switch Transformers (Google), GLaM (Google), NLLB-MoE (Meta)
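To make the capacity-vs-compute tradeoff concrete, here is a back-of-the-envelope sketch. The dimensions are illustrative (not Mixtral's actual configuration); the point is that stored parameters scale with `num_experts` while per-token compute scales only with `top_k`:
```python
# Back-of-the-envelope MoE parameter arithmetic (illustrative numbers)
hidden_size, ffn_mult = 4096, 4
num_experts, top_k = 8, 2

expert_params = 2 * hidden_size * (ffn_mult * hidden_size)  # up- and down-projection
total_ffn = num_experts * expert_params   # parameters you store
active_ffn = top_k * expert_params        # parameters that run per token

print(f"stored FFN params: {total_ffn / 1e9:.2f}B")   # ~1.07B
print(f"active FFN params: {active_ffn / 1e9:.2f}B")  # ~0.27B (4x fewer)
```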
## Installation
```bash
# DeepSpeed with MoE support
pip install "deepspeed>=0.6.0"

# Megatron-DeepSpeed for large-scale training
git clone https://github.com/microsoft/Megatron-DeepSpeed
cd Megatron-DeepSpeed
pip install -r requirements.txt

# Alternative: HuggingFace Transformers
pip install transformers accelerate
```
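Optionally, a quick import check confirms the installs are usable (a minimal sketch; version output will vary by environment):
```python
# Minimal sanity check that the core packages import cleanly
import deepspeed
import transformers

print("deepspeed:", deepspeed.__version__)
print("transformers:", transformers.__version__)
```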
## Quick Start
### Basic MoE Architecture
```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Sparse Mixture-of-Experts layer."""

    def __init__(self, hidden_size, num_experts=8, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k

        # Expert networks (FFNs)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

        # Gating network (router)
        self.gate = nn.Linear(hidden_size, num_experts)

    def forward(self, x):
        # x shape: (batch_size, seq_len, hidden_size)
        batch_size, seq_len, hidden_size = x.shape

        # Flatten for routing
        x_flat = x.view(-1, hidden_size)  # (batch_size * seq_len, hidden_size)

        # Compute gating scores
        gate_logits = self.gate(x_flat)  # (batch_size * seq_len, num_experts)

        # Top-k routing
        gate_scores = torch.softmax(gate_logits, dim=-1)
        topk_scores, topk_indices = torch.topk(gate_scores, self.top_k, dim=-1)

        # Renormalize the top-k scores so they sum to 1 per token
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)

        # Dispatch each token to its selected experts and combine the outputs
        output = torch.zeros_like(x_flat)
        for i in range(self.top_k):
            expert_idx = topk_indices[:, i]                  # expert id per token
            expert_weight = topk_scores[:, i].unsqueeze(-1)  # (tokens, 1)
            for e in range(self.num_experts):
                mask = expert_idx == e
                if mask.any():
                    output[mask] += expert_weight[mask] * self.experts[e](x_flat[mask])

        return output.view(batch_size, seq_len, hidden_size)
```
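A minimal smoke test for the layer above (the dimensions are illustrative):
```python
# Illustrative smoke test for the MoELayer sketch above
layer = MoELayer(hidden_size=512, num_experts=8, top_k=2)
x = torch.randn(4, 128, 512)  # (batch_size, seq_len, hidden_size)

y = layer(x)
assert y.shape == x.shape  # the MoE layer is a drop-in replacement for a dense FFN

total = sum(p.numel() for p in layer.parameters())
print(f"total parameters: {total / 1e6:.1f}M")  # all experts stored, only top_k run per token
```
Top-2 routing here follows Mixtral; Switch Transformers use top-1 routing, which trades some quality for lower per-token compute.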