ray-train
[ SKILL_DOCUMENTATION ]
# Ray Train - Distributed Training Orchestration
## Quick Start
Ray Train scales machine learning training from a single GPU to a multi-node cluster with only minimal code changes.
**Installation**:
```bash
pip install -U "ray[train]"
```
**Basic PyTorch training** (single node):
```python
import ray
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
import torch
import torch.nn as nn

# Define the training function
def train_func(config):
    # Your regular PyTorch code
    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Prepare for distributed training (Ray handles device placement)
    model = train.torch.prepare_model(model)
    device = train.torch.get_device()  # the device assigned to this worker

    for epoch in range(10):
        # Your training loop
        output = model(torch.randn(32, 10, device=device))
        loss = output.sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Report metrics (automatically logged)
        train.report({"loss": loss.item(), "epoch": epoch})

# Run distributed training
trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=4,  # 4 GPU workers
        use_gpu=True
    )
)
result = trainer.fit()
print(f"Final loss: {result.metrics['loss']}")
```
**That's it!** Ray takes care of:
- Distributed coordination
- GPU allocation
- Fault tolerance
- Checkpoint saving (see the sketch after this list)
- Metric aggregation
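
Checkpoint saving and fault-tolerance retries are opted into from the training function. Below is a minimal sketch of both, assuming the Ray 2.x `ray.train.Checkpoint`, `RunConfig`, and `FailureConfig` APIs; the `model.pt` filename is just an example.
```python
import os
import tempfile

import torch
import torch.nn as nn
from ray import train
from ray.train import Checkpoint, FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    model = train.torch.prepare_model(nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    device = train.torch.get_device()

    for epoch in range(10):
        loss = model(torch.randn(32, 10, device=device)).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Save model state and attach it to the reported metrics as a checkpoint
        # (with DDP wrapping you may prefer model.module.state_dict(); in practice
        # the checkpoint is often attached only on the rank-0 worker)
        with tempfile.TemporaryDirectory() as tmpdir:
            torch.save(model.state_dict(), os.path.join(tmpdir, "model.pt"))
            train.report(
                {"loss": loss.item(), "epoch": epoch},
                checkpoint=Checkpoint.from_directory(tmpdir),
            )

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    # Retry the run up to 3 times on worker failure, resuming from the last checkpoint
    run_config=RunConfig(failure_config=FailureConfig(max_failures=3)),
)
result = trainer.fit()
print(result.checkpoint)  # the last reported checkpoint
```
On a retried run, `train.get_checkpoint()` (Ray 2.x) returns the last reported checkpoint inside `train_func`, so the loop can restore model state and resume.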
## Common Workflows
### Workflow 1: Scale Existing PyTorch Code
**Original single-GPU code**:
```python
model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters())

for epoch in range(epochs):
    for batch in dataloader:
        optimizer.zero_grad()
        loss = model(batch)
        loss.backward()
        optimizer.step()
```
**Ray Train version** (scales to multi-GPU and multi-node):
```python
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig
from ray import train

def train_func(config):
    model = MyModel()
    optimizer = torch.optim.Adam(model.parameters())

    # Prepare for distributed training (automatic device placement)
    model = train.torch.prepare_model(model)
    # Wrap the DataLoader (created inside this function) so each worker
    # receives its own shard of the data
    dataloader = train.torch.prepare_data_loader(dataloader)

    for epoch in range(epochs):
        for batch in dataloader:
            optimizer.zero_grad()
            loss = model(batch)
            loss.backward()
            optimizer.step()

        # Report metrics
        train.report({"loss": loss.item()})

# Scale to 8 GPUs
trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True)
)
trainer.fit()
```
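
Hyperparameters reach `train_func` through its `config` argument. A brief sketch, assuming `TorchTrainer`'s `train_loop_config` parameter (Ray 2.x); the keys `lr` and `epochs` are just example names:
```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func(config):
    lr = config["lr"]          # hyperparameters arrive via the config dict
    epochs = config["epochs"]
    # ... build the model/optimizer with lr and train for `epochs` epochs ...

trainer = TorchTrainer(
    train_func,
    train_loop_config={"lr": 1e-3, "epochs": 10},  # forwarded to train_func on every worker
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
)
trainer.fit()
```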
**B