SHAOJIE'S BOOK

Posted 2026-05-19 | Updated 2026-07-03 | Artificial Intelligence | 5 minutes read (About 676 words)

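Introduction

RL training metrics cannot stop at reward, loss, and throughput. A truly usable DFX system has to account for correctness, stability, GPU memory, performance, load balancing, and data quality all at once.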

Posted 2026-05-19 | Updated 2026-07-03 | Artificial Intelligence | 14 minutes read (About 2115 words)

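Introduction

This article answers only one question: as an RL sample enters the system as a prompt, passes through rollout, reward, logprob, advantage, loss, and backward, and finally returns for the next training round, how exactly does the data flow, how do the shapes change, and why does GPU memory grow?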