moe-training
[ SKILL_DOCUMENTATION ]
# MoE Training: Mixture of Experts Models
## When to Use This Skill
Use MoE training when you need to:
- **Train larger models** on a limited compute budget (roughly 5x lower cost than a comparable dense model)
- **Scale model capacity** without a proportional increase in compute (see the arithmetic sketch after this list)
- **Get more performance per unit of compute** than a dense model delivers
- **Specialize experts** for different domains, tasks, or languages
- **Reduce inference latency** via sparse activation (Mixtral activates only 13B of its 47B parameters per token)
- **Reproduce SOTA architectures** such as Mixtral 8x7B, DeepSeek-V3, and Switch Transformers
**Well-known MoE models**: Mixtral 8x7B (Mistral AI), DeepSeek-V3, Switch Transformers (Google), GLaM (Google), NLLB-MoE (Meta)
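To make the capacity-vs-compute tradeoff concrete, here is a back-of-the-envelope sketch. The dimensions are illustrative (not Mixtral's actual configuration); the point is that stored parameters scale with `num_experts` while per-token compute scales only with `top_k`:
```python
# Back-of-the-envelope MoE parameter arithmetic (illustrative numbers)
hidden_size, ffn_mult = 4096, 4
num_experts, top_k = 8, 2

expert_params = 2 * hidden_size * (ffn_mult * hidden_size)  # up- and down-projection
total_ffn = num_experts * expert_params   # parameters you store
active_ffn = top_k * expert_params        # parameters that run per token

print(f"stored FFN params: {total_ffn / 1e9:.2f}B")   # ~1.07B
print(f"active FFN params: {active_ffn / 1e9:.2f}B")  # ~0.27B (4x fewer)
```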
## Installation
```bash
# DeepSpeed with MoE support
pip install "deepspeed>=0.6.0"

# Megatron-DeepSpeed for large-scale training
git clone https://github.com/microsoft/Megatron-DeepSpeed
cd Megatron-DeepSpeed
pip install -r requirements.txt

# Alternative: HuggingFace Transformers
pip install transformers accelerate
```
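Optionally, a quick import check confirms the installs are usable (a minimal sketch; version output will vary by environment):
```python
# Minimal sanity check that the core packages import cleanly
import deepspeed
import transformers

print("deepspeed:", deepspeed.__version__)
print("transformers:", transformers.__version__)
```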
## Quick Start
### Basic MoE Architecture
```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Sparse Mixture-of-Experts layer."""

    def __init__(self, hidden_size, num_experts=8, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k

        # Expert networks (FFNs)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

        # Gating network (router)
        self.gate = nn.Linear(hidden_size, num_experts)

    def forward(self, x):
        # x shape: (batch_size, seq_len, hidden_size)
        batch_size, seq_len, hidden_size = x.shape

        # Flatten for routing
        x_flat = x.view(-1, hidden_size)  # (batch_size * seq_len, hidden_size)

        # Compute gating scores
        gate_logits = self.gate(x_flat)  # (batch_size * seq_len, num_experts)

        # Top-k routing
        gate_scores = torch.softmax(gate_logits, dim=-1)
        topk_scores, topk_indices = torch.topk(gate_scores, self.top_k, dim=-1)

        # Renormalize the top-k scores so they sum to 1 per token
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)

        # Dispatch each token to its selected experts and combine the outputs
        output = torch.zeros_like(x_flat)
        for i in range(self.top_k):
            expert_idx = topk_indices[:, i]                  # expert id per token
            expert_weight = topk_scores[:, i].unsqueeze(-1)  # (tokens, 1)
            for e in range(self.num_experts):
                mask = expert_idx == e
                if mask.any():
                    output[mask] += expert_weight[mask] * self.experts[e](x_flat[mask])

        return output.view(batch_size, seq_len, hidden_size)
```
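A minimal smoke test for the layer above (the dimensions are illustrative):
```python
# Illustrative smoke test for the MoELayer sketch above
layer = MoELayer(hidden_size=512, num_experts=8, top_k=2)
x = torch.randn(4, 128, 512)  # (batch_size, seq_len, hidden_size)

y = layer(x)
assert y.shape == x.shape  # the MoE layer is a drop-in replacement for a dense FFN

total = sum(p.numel() for p in layer.parameters())
print(f"total parameters: {total / 1e6:.1f}M")  # all experts stored, only top_k run per token
```
Top-2 routing here follows Mixtral; Switch Transformers use top-1 routing, which trades some quality for lower per-token compute.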