verl Post-Training – API Reference
[ SKILL_DOCUMENTATION ]
# verl API Reference
## Core Classes
### RayPPOTrainer
The central controller for the training loop. Manages resource allocation and coordinates worker groups.
```python
from verl.trainer.ppo.ray_trainer import RayPPOTrainer

trainer = RayPPOTrainer(
    config=config,
    resource_pool_manager=resource_manager,
    ray_worker_group_cls=RayWorkerGroup,
)
trainer.init_workers()
trainer.fit()
```
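In day-to-day use the trainer is rarely constructed by hand: the `verl.trainer.main_ppo` entry point builds it from the YAML configuration described below and accepts Hydra-style dotted overrides (e.g. `data.train_files=/path/to/train.parquet`) on the command line.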
### ResourcePoolManager
Manages GPU allocation across different worker groups using Ray PlacementGroups.
```python
from verl.trainer.ppo.ray_trainer import ResourcePoolManager, Role

# The spec maps a pool name to the number of GPUs per node; roles are then mapped onto pools.
manager = ResourcePoolManager(
    resource_pool_spec={
        "actor_rollout_pool": [4],   # 4 GPUs on one node for actor/rollout/ref
        "critic_pool": [2],          # 2 GPUs on one node for the critic
    },
    mapping={Role.ActorRollout: "actor_rollout_pool", Role.Critic: "critic_pool"},
)
```
### RayWorkerGroup
Abstraction for distributed method execution. Spawns Ray actors and dispatches method calls.
```python
from verl.single_controller.ray import RayWorkerGroup, RayClassWithInitArgs

# Wrap the worker class and its init args; the group spawns one Ray actor per GPU in the pool
cls_with_init = RayClassWithInitArgs(cls=ray.remote(ActorRolloutRefWorker), config=config, role="actor_rollout")
worker_group = RayWorkerGroup(resource_pool=pool, ray_cls_with_init=cls_with_init)
```
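As an illustration of the dispatch side (the method name below is hypothetical; the callable surface of a worker group is generated from whatever methods the worker class registers):

```python
# Hypothetical call: a method registered on the worker class is exposed on the group object;
# invoking it fans the call out to every worker in the group and gathers the results.
output_batch = worker_group.generate_sequences(prompt_batch)
```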
### ActorRolloutRefWorker
Worker class implementing policy training, generation, and reference model computations. Manages hybrid engine mode switching.
```python
# Typically configured via YAML, not instantiated directly
# See configuration section below
```
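A minimal sketch of the configuration shape this worker consumes, assuming the key names used in the PPO configuration below (the `ref` block here is illustrative):

```yaml
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:                       # policy-gradient updates
    ppo_mini_batch_size: 64
  rollout:                     # generation backend (hybrid engine switches into this mode)
    name: vllm
  ref:                         # frozen reference policy used for KL terms
    log_prob_micro_batch_size: 8
```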
### RolloutReplica
Interface for inference backends with implementations for vLLM, SGLang, TensorRT-LLM, and HuggingFace.
```python
from verl.workers.rollout import RolloutReplica
```

The backend is selected via config rather than instantiated directly:

```yaml
rollout:
  name: vllm   # or: sglang, hf, tensorrt-llm
```
## Configuration Schema
### PPO Configuration (`verl/trainer/config/ppo_trainer.yaml`)
```yaml
# Data configuration
data:
  train_files: /path/to/train.parquet
  val_files: /path/to/val.parquet
  train_batch_size: 256          # Global batch size of prompts
  max_prompt_length: 512
  max_response_length: 2048

# Algorithm configuration
algorithm:
  adv_estimator: gae             # gae, grpo, rloo, reinforce_plus_plus
  gamma: 0.99                    # Discount factor
  lam: 0.95                      # GAE lambda
  use_kl_in_reward: false        # Add KL term to reward

# Actor configuration
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
    backend: fsdp                # fsdp, fsdp2, megatron
  actor:
    ppo_mini_batch_size: 64      # Mini-batch for actor updates
    ppo_epochs: 1                # Number of actor update epochs
    clip_ratio: 0.2              # PPO clip range
    use_kl_loss: true            # Use KL loss in actor
    kl_loss_coef: 0.001          # KL loss coefficient
    kl_loss_type: low_var_kl     # KL divergence calculation method
    loss_agg_mode: token-mean    # token-mean or sequence-mean
    gradient_checkpointing: true
    max_grad_norm: 1.0           # Gradient clipping
    lr: 1e-6                     # Learning rate
  rollout:
    name: vllm                   # vllm, sglang, hf
    n: 8                         # Samples per prompt
    temperature: 0.7
    top_p: 0.95
    log_prob_micro_batch_size: 8

# Critic configuration (PPO only)
critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  ppo_mini_batch_size: 64
  ppo_epochs: 1                  # Defaults to actor epochs

# Trainer configuration
trainer:
  total_epochs: 3
  n_gpus_per_node: 8
  nnodes: 1
  save_freq: 100
  experiment_name: my_experiment
  async_weight_update: false
```
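The batch-size knobs above interact. Below is a small sanity check of the relationships as we read them, assuming (as the comments above suggest) that both `train_batch_size` and `ppo_mini_batch_size` count prompts; the divisibility rule is our assumption about how verl tiles mini-batches, not a quote from its docs:

```python
# Sketch: check the batch-size relationships implied by the config above.
train_batch_size = 256        # prompts sampled per training step (global)
rollout_n = 8                 # responses generated per prompt
ppo_mini_batch_size = 64      # prompts per actor update

trajectories_per_step = train_batch_size * rollout_n             # 2048 responses scored per step
assert train_batch_size % ppo_mini_batch_size == 0               # mini-batches must tile the batch
updates_per_ppo_epoch = train_batch_size // ppo_mini_batch_size  # 4 actor updates per PPO epoch
print(trajectories_per_step, updates_per_ppo_epoch)
```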
### GRPO Configuration (`docs/algo/grpo.md`)
```yaml
algorithm:
  adv_estimator: grpo            # Enable GRPO
  gamma: 1.0
  lam: 1.0

actor_rollout_ref:
  rollout:
    n: 8                         # Must be > 1 for GRPO
  actor:
    use_kl_loss: true            # Required for GRPO
    kl_loss_coef: 0.001
    kl_loss_type: low_var_kl     # k3-style low-variance estimator; alternatives: kl, abs, mse
    loss_agg_mode: token-mean
```
### Multi-Turn Configuration (`verl/trainer/config/rollout/rollout.yaml`)
```yaml
actor_rollout_ref:
  rollout:
    name: sglang                 # Required for multi-turn
    multi_turn:
      enable: true
      tool_config_path: /path/to/tools.yaml
      interaction_config_path: /path/to/interaction.yaml
```
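The file behind `tool_config_path` declares the tools exposed to the model. The sketch below is loosely modeled on verl's GSM8K tool example; treat the class path and schema fields as illustrative and check them against the version in use:

```yaml
tools:
  - class_name: verl.tools.gsm8k_tool.Gsm8kTool   # illustrative tool implementation
    config: {}
    tool_schema:                                  # OpenAI-style function schema
      type: function
      function:
        name: calc_gsm8k_reward
        description: Score a candidate answer against the ground truth.
        parameters:
          type: object
          properties:
            answer: {type: string}
          required: [answer]
```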
## Reward Functions
### Built-in Reward Types
```yaml
# Model-based reward
reward_model:
  path: OpenRLHF/Llama-3-8b-rm-700k

# Custom function-based reward
custom_reward_function:
  path: /path/to/reward.py
  name: compute_score            # Function name, default: compute_score
```
### Custom Reward Function Signature
```python
# reward.py
def compute_score(responses: list[str], ground_truths: list[str], **kwargs) -> list[float]:
    """Compute rewards for a batch of responses.

    Args:
        responses: Generated completions.
        ground_truths: Expected answers from the dataset.
        **kwargs: Additional metadata.

    Returns:
        List of reward scores (floats), one per response.
    """
    rewards = []
    for response, gt in zip(responses, ground_truths):
        # Example reward logic: exact match after whitespace stripping
        score = 1.0 if response.strip() == gt.strip() else 0.0
        rewards.append(score)
    return rewards
```
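A quick way to sanity-check the function before wiring it into training (plain Python, no verl imports; the file name and expected scores follow the exact-match rule used above):

```python
# test_reward.py - hypothetical standalone check for reward.py
from reward import compute_score

scores = compute_score(
    responses=["42", "The answer is 7."],
    ground_truths=["42", "7"],
)
assert scores == [1.0, 0.0]
print(scores)
```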
## Backend-Specific Configuration
### FSDP Configuration
```yaml
actor_rollout_ref:
  actor:
    strategy: fsdp
    fsdp_config:
      mixed_precision: bf16
      sharding_strategy: FULL_SHARD
      offload_policy: false
```
### FSDP2 Configuration
```yaml
actor_rollout_ref:
  actor:
    strategy: fsdp2
    fsdp_config:
      offload_policy: true       # CPU offloading
      reshard_after_forward: true
```
### Megatron Configuration
```yaml
actor_rollout_ref:
  model:
    backend: megatron
  actor:
    strategy: megatron
    tensor_model_parallel_size: 8
    pipeline_model_parallel_size: 2
    megatron:
      use_mbridge: true          # Required for format conversion
```
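A quick arithmetic check on the parallel sizes above (a sketch; Megatron and verl may impose further constraints not listed here):

```python
# Sketch: model-parallel degrees must tile the GPUs allocated to the actor.
tensor_model_parallel_size = 8
pipeline_model_parallel_size = 2
gpus_for_actor = 2 * 8        # e.g. nnodes=2, n_gpus_per_node=8

model_parallel_size = tensor_model_parallel_size * pipeline_model_parallel_size  # 16
assert gpus_for_actor % model_parallel_size == 0
data_parallel_size = gpus_for_actor // model_parallel_size                       # 1
```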
### vLLM Rollout Configuration
```yaml
actor_rollout_ref:
  rollout:
    name: vllm
    tensor_parallel_size: 2
    gpu_memory_utilization: 0.9
    max_num_seqs: 256
    enforce_eager: false
```
### SGLang Rollout Configuration
```yaml
actor_rollout_ref:
  rollout:
    name: sglang
    tp_size: 2
    mem_fraction_static: 0.8
    context_length: 8192
```
## Algorithm Reference
| Algorithm | `adv_estimator` | Requires Critic | Best For |
|-----------|-----------------|-----------------|----------|
| PPO | `gae` | Yes | Dense rewards, value estimation |
| GRPO | `grpo` | No | Sparse rewards, math/reasoning |
| RLOO | `rloo` | No | Leave-one-out baseline |
| REINFORCE++ | `reinforce_plus_plus` | No | Variance reduction |
| DAPO | `dapo` | No | Decoupled clip and dynamic sampling |
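Switching estimators is, in this sketch, mostly a matter of changing `adv_estimator` (critic-free estimators ignore the critic section); the remaining keys follow the GRPO example above:

```yaml
algorithm:
  adv_estimator: rloo            # critic-free, leave-one-out baseline
actor_rollout_ref:
  rollout:
    n: 4                         # group-based estimators need several samples per prompt
```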
## Vision-Language Model Support
```yaml
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-VL-7B-Instruct
  rollout:
    name: vllm
    enable_vision: true
    max_model_len: 32768
```
## LoRA Configuration
```yaml
actor_rollout_ref:
  actor:
    lora:
      enabled: true
      r: 16
      alpha: 32
      target_modules: ["q_proj", "v_proj", "k_proj", "o_proj"]
      dropout: 0.05
```
## Resources
- Documentation: https://verl.readthedocs.io/
- GitHub: https://github.com/volcengine/verl
- Paper: https://arxiv.org/abs/2409.19256 (HybridFlow)
Source: claude-code-templates (MIT).