[ SKILL_DOCUMENTATION ]
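# Model Selection and Evaluation Reference
## Overview
A comprehensive guide to evaluating models, tuning hyperparameters, and selecting the best model with scikit-learn's model selection tools.
## Train-Test Split
### Basic Split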
python
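from sklearn.model_selection import train_test_split
# X and y are assumed to be an existing feature matrix and label vector
# Basic split (75/25, which is also the default when test_size is omitted)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Stratified split (preserves the class distribution in both partitions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
# Three-way split (train/validation/test): hold out 30%, then halve it into 15% + 15%
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)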
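The three-way split above relies on a small piece of arithmetic: holding out 30% and then halving the held-out part gives roughly 70/15/15. The following is a minimal sketch of that calculation on synthetic data. The array contents, the 90/10 class imbalance, and the added `stratify` arguments are assumptions made for illustration and are not part of the original snippet.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 100 samples, 90 of class 0 and 10 of class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# Step 1: hold out 30% of the data as a temporary pool
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
# Step 2: split the pool in half, giving 15% validation and 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

print("sizes:", len(X_train), len(X_val), len(X_test))
print("minority share in train:", round(float(y_train.mean()), 3))
```

Stratifying both steps keeps the minority class share close to the original 10% in the training partition. With only a handful of minority samples, the validation and test shares can differ slightly from each other.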
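## Cross-Validation
### Cross-Validation Strategies
**KFold**
- Standard k-fold cross-validation
- Splits the data into k consecutive folds (no shuffling by default; pass `shuffle=True` to randomize the order first)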
python
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
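# Positional indexing below assumes X and y are NumPy arrays;
# for pandas objects use X.iloc[train_idx] instead
for train_idx, val_idx in kf.split(X):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]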
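In practice the manual loop is often replaced by passing the splitter to `cross_val_score`, which fits and scores the model once per fold. The following is a sketch under assumptions: the synthetic dataset from `make_classification` and the choice of `LogisticRegression` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic classification data (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)  # illustrative estimator choice

# One accuracy score per fold; the model is refit from scratch each time
scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean together with the spread across folds gives a rough sense of how sensitive the estimate is to the particular split.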
**StratifiedKFold**
- Preserves the class distribution in each fold
- Use for imbalanced classification
python
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
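To make the effect of stratification concrete, the sketch below counts minority-class samples in each validation fold. The 90/10 label vector and the placeholder feature matrix are assumptions chosen so that the counts are easy to check by hand.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels (assumption): 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant to how the folds are built

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Number of positive samples landing in each validation fold
val_pos_counts = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print(val_pos_counts)
```

With 10 positives spread over 5 folds, each validation fold receives 2 of them. A plain `KFold` gives no such guarantee and could leave a fold with no positives at all.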
**TimeSeriesSplit**
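- Use for time series data
- Respects temporal order: each training set contains only observations that precede its validation set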
python
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
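Printing the indices makes the expanding-window behaviour easier to see. This sketch uses a 12-row placeholder array and `n_splits=3`, both assumptions picked to keep the output short.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered observations (assumption); row i is time step i
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = [(train_idx.tolist(), val_idx.tolist()) for train_idx, val_idx in tscv.split(X)]

for train_idx, val_idx in splits:
    # The training window grows each round; validation is always the next block
    print("train:", train_idx, "val:", val_idx)
```

Every validation block starts after the last training index, so no fold trains on information from the future.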
**GroupKFold**
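- Ensures that samples from the same group never appear in both the training and validation sets
- Use when samples are not independent (for example, repeated measurements from the same subject)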
python
from sklearn.model_selection import GroupKFold
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=group_ids):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
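The group guarantee can be checked directly by intersecting the group IDs on each side of a split. In this sketch, `group_ids` is built as 10 hypothetical groups of 4 samples each. That layout, like the placeholder `X` and `y`, is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical layout: 10 groups (e.g. subjects) with 4 samples each
group_ids = np.repeat(np.arange(10), 4)
X = np.zeros((40, 2))
y = np.tile([0, 1], 20)

gkf = GroupKFold(n_splits=5)

overlaps = []
for train_idx, val_idx in gkf.split(X, y, groups=group_ids):
    # Groups present on both sides of the split; should always be empty
    shared = set(group_ids[train_idx]) & set(group_ids[val_idx])
    overlaps.append(shared)

print(overlaps)
```

An empty intersection in every fold confirms that no group leaks from training into validation, which is what keeps the score from being optimistically biased when samples within a group are correlated.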
**LeaveOneOut (LOO)**
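- Each sample serves as the validation set exactly once
- Suited to very small datasets
- Computationally expensive: one model is fitted per sample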
python
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
for train_idx, val_idx in loo.split(X):
X_train, X_