SHAOJIE'S BOOK

Posted 2026-07-03 · Updated 2026-07-03 · Artificial Intelligence · 23 minutes read (About 3458 words)

Introduction

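Modeling training performance does not start with "how high is the MFU?"; it starts by putting model structure, the hardware budget, parallel partitioning, the scheduling path, and measured calibration into a single estimator. MFU is the cleanest compute metric among them: it ties together the FLOPs the model theoretically requires, the device peak, and the measured step time. Whether the model fits in memory, whether communication becomes the bottleneck, whether padding is wasted, and whether EP/TP/SP are appropriate must each be accounted for separately.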
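The MFU definition above can be written down as a small calculation. This is a minimal sketch, not the estimator from the post: the `6 * N * T` FLOPs rule of thumb for dense training, the function names, and every number below are assumptions chosen for illustration.

```python
def dense_train_flops(num_params: float, tokens_per_step: float) -> float:
    # Assumption: the common 6 * N * T approximation for one dense training
    # step (forward + backward), ignoring attention-specific FLOPs.
    return 6.0 * num_params * tokens_per_step


def mfu(model_flops_per_step: float, step_time_s: float,
        peak_flops_per_device: float, num_devices: int) -> float:
    # FLOPs the model needs, divided by what the hardware could deliver
    # at peak during the measured step time.
    peak_total = step_time_s * peak_flops_per_device * num_devices
    return model_flops_per_step / peak_total


# Hypothetical numbers, for illustration only.
flops = dense_train_flops(num_params=7e9, tokens_per_step=4e6)
value = mfu(flops, step_time_s=10.0,
            peak_flops_per_device=300e12, num_devices=128)
print(f"model FLOPs per step: {flops:.3e}")
print(f"MFU: {value:.4f}")
```

The ratio says nothing about memory fit, communication stalls, or padding waste, which is why those terms need their own entries in the estimator.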

Posted 2026-07-01 · Updated 2026-07-03 · Artificial Intelligence · 20 minutes read (About 3031 words)

NPU Training Operators - GMM

Introduction

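In Qwen3.5 MoE, GMM plugs in at the two matrix multiplications of the routed experts: hidden -> gate/up and intermediate -> hidden. shared_expert remains a plain Qwen3_5MoeMLP, attention is untouched, and the ordinary MLP in the Dense version of Qwen3.5 is not a replacement target either.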

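The public diff of PR #2664 mainly adds a fused/eager consistency UT for mindspeed_mm.fsdp.ops.moe_ops.gemm.grouped_matmul and relaxes the tolerance of the unpermute UT. It can serve as evidence that the GMM wrapper interface is covered by tests, but it should not be described as the PR that integrates the full feature.[^gmm-pr-api][^gmm-pr-files]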
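To make the idea of a grouped matmul and an eager consistency check concrete, here is a pure-Python reference sketch. It is not the MindSpeed-MM wrapper and does not reproduce its signature: `grouped_matmul_ref`, the row layout (tokens already permuted so each expert's rows are contiguous), and the toy weights are all assumptions for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    # Plain dense matmul on nested lists.
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def grouped_matmul_ref(x: Matrix, weights: List[Matrix],
                       group_sizes: List[int]) -> Matrix:
    # Assumption: rows of x are sorted by expert, and group_sizes[e]
    # consecutive rows are multiplied by weights[e].
    assert sum(group_sizes) == len(x)
    out: Matrix = []
    start = 0
    for w, n in zip(weights, group_sizes):
        out.extend(matmul(x[start:start + n], w))
        start += n
    return out


x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: scale by 2
]
group_sizes = [1, 2]
y = grouped_matmul_ref(x, weights, group_sizes)

# Eager cross-check: multiply each token by its own expert, one at a time.
expert_of_row = [0, 1, 1]
y_eager = [matmul([row], weights[e])[0] for row, e in zip(x, expert_of_row)]
max_diff = max(abs(a - b) for ra, rb in zip(y, y_eager) for a, b in zip(ra, rb))
print(y, max_diff)
```

A fused kernel would be compared against a reference like `y_eager` within a tolerance rather than for exact equality, since the accumulation order may differ.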

Posted 2026-06-30 · Updated 2026-07-03 · Artificial Intelligence · 15 minutes read (About 2266 words)

NPU Training Operators - MC2

Introduction

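The core of MC2 is not asynchronous communication but the compute/communication splitting and pipelining inside a fused operator. The typical scenario in the MindSpeed-LLM documentation is matmul + all_reduce/all_gather/reduce_scatter under TP/SP; what MindSpeed-MM PR #2480 integrates is AllToAllv + GroupedMatmul and GroupedMatmul + AllToAllv under MoE expert parallelism.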

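This post records only transferable information: which files the PR changed, how ep_mc2_forward runs, what to check before migrating, how to validate, and which conclusions cannot be extrapolated directly from public material.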
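Why splitting and pipelining inside one operator can help is easiest to see with a toy timing model. The sketch below is a generic two-stage pipeline estimate, not a description of how MC2 is implemented: it assumes equal-sized chunks, perfect overlap between one chunk's communication and the next chunk's compute, and no per-chunk launch overhead or efficiency loss from smaller matmuls. All names and numbers are hypothetical.

```python
def serial_time(compute_ms: float, comm_ms: float) -> float:
    # Unfused baseline: finish all compute, then do all communication.
    return compute_ms + comm_ms


def pipelined_time(compute_ms: float, comm_ms: float, num_chunks: int) -> float:
    # Assumption: work splits into equal chunks, and chunk i's communication
    # overlaps chunk i+1's compute.
    c = compute_ms / num_chunks
    m = comm_ms / num_chunks
    # First chunk's compute, steady state bounded by the slower stage,
    # then the last chunk's communication.
    return c + (num_chunks - 1) * max(c, m) + m


baseline = serial_time(8.0, 4.0)
for k in (1, 2, 4, 8):
    print(k, pipelined_time(8.0, 4.0, k), baseline)
```

In this idealized model the total approaches the slower of the two stages as the chunk count grows; on real hardware, smaller chunks lower matmul efficiency, so the best chunk count has to be measured.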

Posted 2026-06-30 · Updated 2026-07-03 · Artificial Intelligence · 10 minutes read (About 1545 words)

VeRL Router Replay

Introduction

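The core of Router Replay is not to make MoE routing faster but to align expert selection across three paths: rollout, old-logprob recomputation, and the new-logprob update. Top-k routing in MoE is a discrete branch, and tiny numerical differences can abruptly change the selected expert set. Once the old/new logprob difference is contaminated by "the routing changed" rather than "the policy changed", the ratio, clip, and KL terms in PPO / GRPO are all distorted.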
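The discrete flip described above, and what replay does about it, can be shown with a few router logits. This is a minimal sketch and not VeRL's implementation: the `route` helper, its `replay_indices` parameter, the softmax renormalization over the selected experts, and the logit values are all assumptions for illustration.

```python
import math
from typing import List, Optional, Tuple


def topk_indices(logits: List[float], k: int) -> List[int]:
    # Indices of the k largest logits, highest first.
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def route(logits: List[float], k: int,
          replay_indices: Optional[List[int]] = None) -> Tuple[List[int], List[float]]:
    # With replay_indices, reuse the recorded expert choice instead of
    # recomputing top-k; gate weights still come from the current logits.
    idx = replay_indices if replay_indices is not None else topk_indices(logits, k)
    exps = [math.exp(logits[i]) for i in idx]
    total = sum(exps)
    return idx, [e / total for e in exps]


# Same token seen by two engines whose logits differ by tiny numeric drift.
rollout_logits = [1.000, 0.501, 0.500, -1.0]
train_logits = [1.000, 0.500, 0.501, -1.0]

rollout_idx, _ = route(rollout_logits, k=2)
free_idx, _ = route(train_logits, k=2)
replayed_idx, replayed_w = route(train_logits, k=2, replay_indices=rollout_idx)

print("rollout:", rollout_idx)
print("recomputed without replay:", free_idx)
print("recomputed with replay:", replayed_idx, replayed_w)
```

Without replay, a 0.001 drift swaps one expert, so the logprob difference would partly reflect a different compute path rather than a policy change. With replay, the expert set stays fixed and only the continuous gate weights move.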
