SHAOJIE'S BOOK

Posted 2026-05-19 | Updated 2026-07-03 | Artificial Intelligence | 12 minutes read (About 1776 words)

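Introduction

The core of asynchronous RL is not simply "parallelizing PPO". It replaces the synchronization barriers between rollout, reward / logprob computation, training updates, and parameter synchronization with controllable queues and version semantics. It trades bounded staleness for higher E2E throughput, but it must also answer questions about old logprob consistency, policy lag, partial rollouts, sample dropping, and experiment reproducibility.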
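To make the bounded staleness idea concrete, here is a minimal sketch in plain Python. It is not verl's implementation: `Sample`, `BoundedStalenessQueue`, and `max_staleness` are hypothetical names introduced only for illustration. Each sample is tagged with the policy version that generated it, and the consumer drops any sample whose lag behind the trainer's version exceeds the bound.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    prompt_id: int
    policy_version: int  # version of the weights that generated this rollout


class BoundedStalenessQueue:
    """Hypothetical queue that only serves samples within a policy-lag bound."""

    def __init__(self, max_staleness: int):
        self.max_staleness = max_staleness
        self._items = deque()
        self.dropped = 0  # count of samples discarded for being too stale

    def put(self, sample: Sample) -> None:
        self._items.append(sample)

    def get_batch(self, trainer_version: int, batch_size: int) -> list:
        batch = []
        while self._items and len(batch) < batch_size:
            sample = self._items.popleft()
            lag = trainer_version - sample.policy_version
            if lag > self.max_staleness:
                self.dropped += 1  # too stale: discard instead of training on it
                continue
            batch.append(sample)
        return batch


queue = BoundedStalenessQueue(max_staleness=1)
for i, version in enumerate([0, 1, 2, 2]):
    queue.put(Sample(prompt_id=i, policy_version=version))

# The trainer is at version 2, so the version-0 sample (lag 2) is dropped.
batch = queue.get_batch(trainer_version=2, batch_size=4)
```

Dropping is only one possible policy. A real system might instead reweight stale samples or recompute their old logprobs, which is exactly the kind of choice the version semantics have to pin down.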

Posted 2026-05-19 | Updated 2026-07-03 | Artificial Intelligence | 9 minutes read (About 1298 words)

VeRL Feature Matrix

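Introduction

This article serves as an index page. For each feature it answers: how to enable it, where the code lives, what the logic is, how it performs in practice, why it is off by default, and what effect it has on MFU / SMA.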

Posted 2026-05-19 | Updated 2026-07-03 | Artificial Intelligence | 4 minutes read (About 648 words)

VeRL Checkpoint

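Introduction

An RL checkpoint is more complex than an ordinary SFT checkpoint. It must save not only the model parameters but also the optimizer, scheduler, global step, and sampling state, plus, in asynchronous mode, any queue and policy version state that may exist.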
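The list of state above can be sketched as a small metadata record. This is an illustrative layout only, not verl's checkpoint format: `RLCheckpointState` and all of its field names are assumptions, and the large tensors (model, optimizer) are represented by file paths rather than stored inline.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RLCheckpointState:
    """Hypothetical metadata describing everything an RL resume needs."""

    global_step: int
    model_path: str        # where the model parameters were written
    optimizer_path: str    # where the optimizer state was written
    scheduler_path: str    # where the LR scheduler state was written
    sampler_state: dict = field(default_factory=dict)   # e.g. epoch and offset
    policy_version: Optional[int] = None                 # async mode only
    pending_queue: list = field(default_factory=list)    # async mode only


state = RLCheckpointState(
    global_step=120,
    model_path="ckpt/step_120/model",
    optimizer_path="ckpt/step_120/optim",
    scheduler_path="ckpt/step_120/sched",
    sampler_state={"epoch": 1, "offset": 3840},
    policy_version=118,
    pending_queue=[{"prompt_id": 7, "policy_version": 117}],
)

# Round-trip through JSON to check that the metadata is fully serializable.
restored = RLCheckpointState(**json.loads(json.dumps(asdict(state))))
```

Keeping the asynchronous fields optional lets the same record describe a synchronous run, where there is no queue or separate policy version to restore.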

Posted 2026-05-19 | Updated 2026-07-03 | Artificial Intelligence | 15 minutes read (About 2269 words)

VeRL Performance Optimization

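Introduction

Low MFU / SMA does not necessarily mean the kernels are slow. It can also reflect waiting caused by rollout, reward, checkpointing, communication, asynchronous queues, or the token distribution. The first step in performance optimization is not to turn on features but to build an E2E performance model.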
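A first cut at such an E2E model can be as simple as a per-stage time breakdown. The sketch below assumes the stages run one after another within a step, and the stage names and timings are made-up illustrative numbers, not measurements. It computes each stage's share of the step time and names the largest one.

```python
def stage_shares(stage_seconds: dict) -> dict:
    """Return each stage's fraction of total step time (sequential stages assumed)."""
    total = sum(stage_seconds.values())
    if total <= 0:
        raise ValueError("total step time must be positive")
    return {name: seconds / total for name, seconds in stage_seconds.items()}


def bottleneck(stage_seconds: dict) -> str:
    """Return the name of the stage that takes the most time."""
    return max(stage_seconds, key=stage_seconds.get)


# Illustrative timings in seconds for one training step (not real data).
step = {"rollout": 6.0, "reward": 1.0, "logprob": 1.0, "actor_update": 2.0}

shares = stage_shares(step)
slowest = bottleneck(step)  # "rollout" for these numbers
```

With numbers like these, the actor update occupies only a fifth of the step, so speeding up training kernels alone could not raise end-to-end utilization much. The breakdown says to look at rollout first.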

Posted 2026-05-19 | Updated 2026-07-03 | Artificial Intelligence | 11 minutes read (About 1601 words)

VeRL Rollout Inference

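Introduction

Rollout in RL is not ordinary offline inference. Besides generating responses, it has to share the policy version with the training stage, return token-level information, and feed the subsequent logprob, reward, and advantage computation.

For the same reason, vLLM graph mode cannot be reduced to "CUDA Graph on or off". In verl rollout, enforce_eager, compilation_config.cudagraph_mode, and cudagraph_capture_sizes together determine performance, GPU memory usage, capture cost, and compatibility.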

Posted 2026-05-19 | Updated 2026-07-03 | Artificial Intelligence | 4 minutes read (About 648 words)

VeRL Training Flow

Introduction

This article focuses on verl's training pipeline: how RayPPOTrainer.fit() organizes rollout, reward, logprob, ref, and actor update, and how these stages are tied together through workers and DataProto.