# LLaVA Training Guide
A guide to training and fine-tuning LLaVA models.
## Training Stages
### Stage 1: Feature Alignment (Pretraining)
**Goal**: Align the vision encoder with the language model
**Data**: 558K image-text pairs (CC3M subset)

```bash
# Download the pretrained projector, or train it from scratch
bash scripts/v1_5/pretrain.sh
```

**Configuration:**
- Base model: Vicuna-7B or LLaMA-2-7B
- Vision encoder: CLIP ViT-L/14
- Training time: ~20 hours (8× A100)
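In this stage both the vision encoder and the language model are frozen; only the projector between them is trained. A minimal sketch of that module, assuming the two-layer `mlp2x_gelu` projector used by LLaVA-1.5 and illustrative dimensions (1024-d CLIP ViT-L/14 patch features, 4096-d embeddings for a 7B model):

```python
import torch.nn as nn

# Minimal sketch of LLaVA-1.5's mlp2x_gelu projector.
# Dimensions are illustrative: CLIP ViT-L/14 emits 1024-d patch
# features; Vicuna-7B/LLaMA-2-7B use 4096-d token embeddings.
projector = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)
# During Stage 1 only these projector weights receive gradients.
```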
### Stage 2: Visual Instruction Tuning
**Goal**: Teach the model to follow visual instructions
**Data**: 150K GPT-generated multimodal instruction-following samples

```bash
# Fine-tune on the instruction data
bash scripts/v1_5/finetune.sh
```

**Configuration:**
- Epochs: 1
- Batch size: 128 (across 8 GPUs)
- Learning rate: 2e-5
- Training time: ~24 hours (8× A100)
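The global batch of 128 decomposes across devices; a quick check of the arithmetic (the per-device size of 16 is an assumption consistent with 8 GPUs and no gradient accumulation):

```python
# Global batch = per-device batch x num GPUs x grad-accum steps.
per_device_batch = 16   # assumption; 128 / 8 GPUs
num_gpus = 8
grad_accum_steps = 1
assert per_device_batch * num_gpus * grad_accum_steps == 128
```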
## Data Format
### Instruction Data Format
```json
[
  {
    "id": "001",
    "image": "path/to/image.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat is in this image?"
      },
      {
        "from": "gpt",
        "value": "The image shows a dog playing in a park."
      },
      {
        "from": "human",
        "value": "What breed is the dog?"
      },
      {
        "from": "gpt",
        "value": "It appears to be a Golden Retriever."
      }
    ]
  }
]
```
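The `<image>` placeholder marks where the projected image features are spliced into the token stream; it should appear exactly once per sample, in the first human turn.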
## Fine-Tuning on Custom Data
### Prepare the Data
```python
import json

# Build the instruction data
data = []
for image_path, qa_pairs in your_dataset:
    conversations = []
    for i, (q, a) in enumerate(qa_pairs):
        # The <image> placeholder goes only in the first human turn
        prompt = f"<image>\n{q}" if i == 0 else q
        conversations.append({"from": "human", "value": prompt})
        conversations.append({"from": "gpt", "value": a})
    data.append({
        "id": str(len(data)),
        "image": image_path,
        "conversations": conversations,
    })

# Save to disk
with open("custom_data.json", "w") as f:
    json.dump(data, f, indent=2)
```
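A quick sanity check before launching training catches malformed records early; a minimal sketch, assuming the `custom_data.json` file written above:

```python
import json

with open("custom_data.json") as f:
    samples = json.load(f)

for s in samples:
    # Every record needs an id, an image path, and a conversation list.
    assert {"id", "image", "conversations"} <= s.keys()
    turns = s["conversations"]
    # Turns must alternate human -> gpt, starting with human.
    assert turns and all(
        t["from"] == ("human" if i % 2 == 0 else "gpt")
        for i, t in enumerate(turns)
    )
    # The <image> token appears exactly once, in the first human turn.
    assert turns[0]["value"].count("<image>") == 1
```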
### Fine-Tuning Script
```bash
#!/bin/bash
# Set paths
DATA_PATH="custom_data.json"
IMAGE_FOLDER="path/to/images"
MODEL_PATH="liuhaotian/llava-v1.5-7b"
OUTPUT_DIR="./checkpoints/llava-custom"

# Fine-tune (remaining flags mirror the reference scripts/v1_5/finetune.sh)
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path $MODEL_PATH \
    --version v1 \
    --data_path $DATA_PATH \
    --image_folder $IMAGE_FOLDER \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --bf16 True \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --learning_rate 2e-5 \
    --output_dir $OUTPUT_DIR
```
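Once training finishes, a quick smoke test confirms the checkpoint loads; a sketch assuming the repo's `llava.model.builder.load_pretrained_model` helper and the output path defined above:

```python
from llava.model.builder import load_pretrained_model

# Load the fine-tuned checkpoint written to OUTPUT_DIR above.
# model_base=None because this is a full fine-tune, not a LoRA delta;
# the model_name must contain "llava" for the builder to dispatch.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="./checkpoints/llava-custom",
    model_base=None,
    model_name="llava-custom",
)
print(f"Loaded model with context length {context_len}")
```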