verl Post-Training – API Reference
[ SKILL_DOCUMENTATION ]
# verl API Reference
## Core Classes
### RayPPOTrainer
The central controller for the training loop. Manages resource allocation and coordinates worker groups.
```python
from verl.trainer.ppo.ray_trainer import RayPPOTrainer

trainer = RayPPOTrainer(
    config=config,
    resource_pool_manager=resource_manager,
    ray_worker_group_cls=RayWorkerGroup,
)
trainer.init_workers()
trainer.fit()
```
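In day-to-day use the trainer is rarely constructed by hand: the `verl.trainer.main_ppo` entry point builds it from the YAML configuration described below and accepts Hydra-style dotted overrides (e.g. `data.train_files=/path/to/train.parquet`) on the command line.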
### ResourcePoolManager
Manages GPU allocation across different worker groups using Ray PlacementGroups.
```python
from verl.trainer.ppo.ray_trainer import ResourcePoolManager, Role

# The spec maps a pool name to the number of GPUs per node; roles are then mapped onto pools.
manager = ResourcePoolManager(
    resource_pool_spec={
        "actor_rollout_pool": [4],   # 4 GPUs on one node for actor/rollout/ref
        "critic_pool": [2],          # 2 GPUs on one node for the critic
    },
    mapping={Role.ActorRollout: "actor_rollout_pool", Role.Critic: "critic_pool"},
)
```
### RayWorkerGroup
Abstraction for distributed method execution. Spawns Ray actors and dispatches method calls.
```python
from verl.single_controller.ray import RayWorkerGroup, RayClassWithInitArgs

# Wrap the worker class and its init args; the group spawns one Ray actor per GPU in the pool
cls_with_init = RayClassWithInitArgs(cls=ray.remote(ActorRolloutRefWorker), config=config, role="actor_rollout")
worker_group = RayWorkerGroup(resource_pool=pool, ray_cls_with_init=cls_with_init)
```
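As an illustration of the dispatch side (the method name below is hypothetical; the callable surface of a worker group is generated from whatever methods the worker class registers):

```python
# Hypothetical call: a method registered on the worker class is exposed on the group object;
# invoking it fans the call out to every worker in the group and gathers the results.
output_batch = worker_group.generate_sequences(prompt_batch)
```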
### ActorRolloutRefWorker
Worker class implementing policy training, generation, and reference model computations. Manages hybrid engine mode switching.
```python
# Typically configured via YAML, not instantiated directly
# See configuration section below
```
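A minimal sketch of the configuration shape this worker consumes, assuming the key names used in the PPO configuration below (the `ref` block here is illustrative):

```yaml
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  actor:                       # policy-gradient updates
    ppo_mini_batch_size: 64
  rollout:                     # generation backend (hybrid engine switches into this mode)
    name: vllm
  ref:                         # frozen reference policy used for KL terms
    log_prob_micro_batch_size: 8
```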
### RolloutReplica
Interface for inference backends with implementations for vLLM, SGLang, TensorRT-LLM, and HuggingFace.
```python
from verl.workers.rollout import RolloutReplica
```

The backend is selected via config rather than instantiated directly:

```yaml
rollout:
  name: vllm   # or: sglang, hf, tensorrt-llm
```
## Configuration Schema
### PPO Configuration (`verl/trainer/config/ppo_trainer.yaml`)
```yaml
# Data configuration
data:
  train_files: /path/to/train.parquet
  val_files: /path/to/val.parquet
  train_batch_size: 256          # Global batch size of prompts
  max_prompt_length: 512
  max_response_length: 2048

# Algorithm configuration
algorithm:
  adv_estimator: gae             # gae, grpo, rloo, reinforce_plus_plus
  gamma: 0.99                    # Discount factor
  lam: 0.95                      # GAE lambda
  use_kl_in_reward: false        # Add KL term to reward

# Actor configuration
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
    backend: fsdp                # fsdp, fsdp2, megatron
  actor:
    ppo_mini_batch_size: 64      # Mini-batch for actor updates
    ppo_epochs: 1                # Number of actor update epochs
    clip_ratio: 0.2              # PPO clip range
    use_kl_loss: true            # Use KL loss in actor
    kl_loss_coef: 0.001          # KL loss coefficient
    kl_loss_type: low_var_kl     # KL divergence calculation method
    loss_agg_mode: token-mean    # token-mean or sequence-mean
    gradient_checkpointing: true
    max_grad_norm: 1.0           # Gradient clipping
    lr: 1e-6                     # Learning rate
  rollout:
    name: vllm                   # vllm, sglang, hf
    n: 8                         # Samples per prompt
    temperature: 0.7
    top_p: 0.95
    log_prob_micro_batch_size: 8

# Critic configuration (PPO only)
critic:
  model:
    path: Qwen/Qwen2.5-7B-Instruct
  ppo_mini_batch_size: 64
  ppo_epochs: 1                  # Defaults to actor epochs

# Trainer configuration
trainer:
  total_epochs: 3
  n_gpus_per_node: 8
  nnodes: 1
  save_freq: 100
  experiment_name: my_experiment
  async_weight_update: false
```
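The batch-size knobs above interact. Below is a small sanity check of the relationships as we read them, assuming (as the comments above suggest) that both `train_batch_size` and `ppo_mini_batch_size` count prompts; the divisibility rule is our assumption about how verl tiles mini-batches, not a quote from its docs:

```python
# Sketch: check the batch-size relationships implied by the config above.
train_batch_size = 256        # prompts sampled per training step (global)
rollout_n = 8                 # responses generated per prompt
ppo_mini_batch_size = 64      # prompts per actor update

trajectories_per_step = train_batch_size * rollout_n             # 2048 responses scored per step
assert train_batch_size % ppo_mini_batch_size == 0               # mini-batches must tile the batch
updates_per_ppo_epoch = train_batch_size // ppo_mini_batch_size  # 4 actor updates per PPO epoch
print(trajectories_per_step, updates_per_ppo_epoch)
```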
### GRPO Configuration (`docs/algo/grpo.md`)
```yaml
algorithm:
  adv_estimator: grpo            # Enable GRPO
  gamma: 1.0
  lam: 1.0

actor_rollout_ref:
  rollout:
    n: 8                         # Must be > 1 for GRPO
  actor:
    use_kl_loss: true            # Required for GRPO
    kl_loss_coef: 0.001
    kl_loss_type: low_var_kl     # k3-style low-variance estimator; alternatives: kl, abs, mse
    loss_agg_mode: token-mean
```
### Multi-Turn Configuration (`verl/trainer/config/rollout/rollout.yaml`)
```yaml
actor_rollout_ref:
  rollout:
    name: sglang                 # Required for multi-turn
    multi_turn:
      enable: true
      tool_config_path: /path/to/tools.yaml
      interaction_config_path: /path/to/interaction.yaml
```
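The file behind `tool_config_path` declares the tools exposed to the model. The sketch below is loosely modeled on verl's GSM8K tool example; treat the class path and schema fields as illustrative and check them against the version in use:

```yaml
tools:
  - class_name: verl.tools.gsm8k_tool.Gsm8kTool   # illustrative tool implementation
    config: {}
    tool_schema:                                  # OpenAI-style function schema
      type: function
      function:
        name: calc_gsm8k_reward
        description: Score a candidate answer against the ground truth.
        parameters:
          type: object
          properties:
            answer: {type: string}
          required: [answer]
```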
## Reward Functions
### Built-in Reward Types
```yaml
# Model-based reward
reward_model:
  path: OpenRLHF/Llama-3-8b-rm-700k

# Custom function-based reward
custom_reward_function:
  path: /path/to/reward.py
  name: compute_score            # Function name, default: compute_score
```
### Custom Reward Function Signature
```python
# reward.py
def compute_score(responses: list[str], ground_truths: list[str], **kwargs) -> list[float]:
    """Compute rewards for a batch of responses.

    Args:
        responses: Generated completions.
        ground_truths: Expected answers from the dataset.
        **kwargs: Additional metadata.

    Returns:
        List of reward scores (floats), one per response.
    """
    rewards = []
    for response, gt in zip(responses, ground_truths):
        # Example reward logic: exact match after whitespace stripping
        score = 1.0 if response.strip() == gt.strip() else 0.0
        rewards.append(score)
    return rewards
```
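A quick way to sanity-check the function before wiring it into training (plain Python, no verl imports; the file name and expected scores follow the exact-match rule used above):

```python
# test_reward.py - hypothetical standalone check for reward.py
from reward import compute_score

scores = compute_score(
    responses=["42", "The answer is 7."],
    ground_truths=["42", "7"],
)
assert scores == [1.0, 0.0]
print(scores)
```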
## Backend-Specific Configuration
### FSDP Configuration
```yaml
actor_rollout_ref:
  actor:
    strategy: fsdp
    fsdp_config:
      mixed_precision: bf16
      sharding_strategy: FULL_SHARD
      offload_policy: false
```
### FSDP2 Configuration
```yaml
actor_rollout_ref:
  actor:
    strategy: fsdp2
    fsdp_config:
      offload_policy: true       # CPU offloading
      reshard_after_forward: true
```
### Megatron Configuration
```yaml
actor_rollout_ref:
  model:
    backend: megatron
  actor:
    strategy: megatron
    tensor_model_parallel_size: 8
    pipeline_model_parallel_size: 2
    megatron:
      use_mbridge: true          # Required for format conversion
```
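A quick arithmetic check on the parallel sizes above (a sketch; Megatron and verl may impose further constraints not listed here):

```python
# Sketch: model-parallel degrees must tile the GPUs allocated to the actor.
tensor_model_parallel_size = 8
pipeline_model_parallel_size = 2
gpus_for_actor = 2 * 8        # e.g. nnodes=2, n_gpus_per_node=8

model_parallel_size = tensor_model_parallel_size * pipeline_model_parallel_size  # 16
assert gpus_for_actor % model_parallel_size == 0
data_parallel_size = gpus_for_actor // model_parallel_size                       # 1
```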
### vLLM Rollout Configuration
```yaml
actor_rollout_ref:
  rollout:
    name: vllm
    tensor_parallel_size: 2
    gpu_memory_utilization: 0.9
    max_num_seqs: 256
    enforce_eager: false
```
### SGLang Rollout Configuration
```yaml
actor_rollout_ref:
  rollout:
    name: sglang
    tp_size: 2
    mem_fraction_static: 0.8
    context_length: 8192
```
## Algorithm Reference
| Algorithm | `adv_estimator` | Requires Critic | Best For |
|-----------|-----------------|-----------------|----------|
| PPO | `gae` | Yes | Dense rewards, value estimation |
| GRPO | `grpo` | No | Sparse rewards, math/reasoning |
| RLOO | `rloo` | No | Leave-one-out baseline |
| REINFORCE++ | `reinforce_plus_plus` | No | Variance reduction |
| DAPO | `dapo` | No | Decoupled clip and dynamic sampling |
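Switching estimators is, in this sketch, mostly a matter of changing `adv_estimator` (critic-free estimators ignore the critic section); the remaining keys follow the GRPO example above:

```yaml
algorithm:
  adv_estimator: rloo            # critic-free, leave-one-out baseline
actor_rollout_ref:
  rollout:
    n: 4                         # group-based estimators need several samples per prompt
```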
## Vision-Language Model Support
```yaml
actor_rollout_ref:
  model:
    path: Qwen/Qwen2.5-VL-7B-Instruct
  rollout:
    name: vllm
    enable_vision: true
    max_model_len: 32768
```
## LoRA Configuration
```yaml
actor_rollout_ref:
  actor:
    lora:
      enabled: true
      r: 16
      alpha: 32
      target_modules: ["q_proj", "v_proj", "k_proj", "o_proj"]
      dropout: 0.05
```
## Resources
- Documentation: https://verl.readthedocs.io/
- GitHub: https://github.com/volcengine/verl
- Paper: https://arxiv.org/abs/2409.19256 (HybridFlow)
Source: claude-code-templates (MIT).