Frontier post-training recipe review with Finbarr Timbers
As I’ve been recapping the fundamentals of post-training to wrap up my RLHF / Post-training book, I knew I needed to get Finbarr Timbers back on the podcast to talk about the state of play. Over the last few months we’ve had many discussions on what we’d need to do to take an Olmo-style recipe to the frontier, supported by Finbarr’s extensive reading of recent model technical reports. To prepare for this, I put together a summary slide deck covering the key post-training recipes, both historical (the path from InstructGPT to today) and current (the key open frontier models). The deck is condensed below as the technical summary, but we spend 20-35 minutes on it in the podcast, so watching on YouTube is likely the best experience for this one.

I previously interviewed Finbarr in December of 2024, shortly after the release of o1 and Tülu 3 (and before he joined Ai2), on the “We are so back” era of RL.

Chapters:
00:00 Introduction & Olmo reflections
06:28 Post-train recipes review (history)
23:00 2026’s model recipes (MiMo Flash, DeepSeek V4, GLM 5, Kimi K2.6, etc.)
39:05 Open-ended post-training discussions
48:22 Career advice in the LLM race

Listen on Apple Podcasts, Spotify, and wherever you get your podcasts. For other Interconnects interviews, go here. For more educational post-training videos, see the course I’m putting together.

Technical Summary

These are notes cleaned up from a slide deck created with AI assistance, mostly useful as a discussion topic and reference.

The shape of a post-training recipe has changed more in the last year than in the prior three.

- 2022–2023 (InstructGPT): one pipeline, SFT → reward model → RL.
- 2024 (Llama 3, Tülu 3, etc.): open recipes formalize SFT → DPO → RL with verifiable rewards. Closed recipes use many stages of RLHF.
- 2025 (DeepSeek R1): reasoning RL makes large-scale RL the centerpiece.
- 2026 (MiMo Flash V2): recipes fragment into many specialist models that are merged back into one.

The new thing: MOPD

Multi-teacher On-Policy Distillation (MOPD) is the pattern showing up across the 2026 frontier.

1. Train N domain-specialist teachers (each: SFT, then RL on the relevant domains).
2. Train one general student by sampling its own trajectories (this is the final post-trained model).
3. On each rollout, minimize reverse-KL to the relevant teacher’s output distribution, token by token.

Lineage: MiMo Flash V2 introduced it → DeepSeek V4 & Nemotron 3 Ultra scale it to >10 teachers.

Why did MOPD emerge?

- RL got expensive and conflict-prone. Mixing math, code, and agentic RL in one run eventually trades capabilities off against each other.
- Specialists are cheap to make and organizationally scalable. SFT-then-RL on a single domain is well understood and parallelizable. As post-training becomes more complex, scaling it across organizations is a big win.
- On-policy distillation matured. Literature and know-how continued to emerge through the RLVR renaissance.

Sources: DeepSeek V4 §5.1, MiMo-V2-Flash

Key historical recipes

InstructGPT (Mar. 2022): the canonical 3 steps · paper
- SFT on human demonstrations
- Reward model trained on human comparisons
- PPO against the reward model

Llama 2 (Jul. 2023): multi-stage RLHF · paper · interconnects recap
- SFT, then iterative RLHF over multiple rounds
- Each round: rejection sampling → PPO
- Two reward models: separate helpfulness and safety

Llama 3 (Jul. 2024): a complex multi-stage recipe with simpler optimizers · paper · interconnects recap
- Per round: reward model → sample K per prompt → rejection sampling → SFT → DPO
- No online RL: the RM only filters. Run over 6 rounds; the best models seed the next round.

Tülu 3 (Nov. 2024): simple three-stage post-training · paper · interconnects recap
- Curated prompts → SFT → DPO → RLVR (RL with verifiable rewards; the acronym was coined in this paper).

Olmo 3 (Dec. 2025): a reasoning update to the Tülu 3 recipe · paper · interconnects recap

DeepSeek R1 (Jan. 2025): RL as the centerpiece · paper · interconnects recap
The recipe:
- R1-Zero: pure RL (GRPO) on the base, no SFT; used to seed reasoning behaviors for the full run, not a separate product
- R1: cold-start SFT → reasoning RL → rejection-sampling SFT → final RL → distill to dense
A big change in recipes: large-scale RLVR is the primary driver, with SFT used to distill and refine RL behaviors.

DeepSeek evolution after V3
- V3 · Dec ’24: SFT + GRPO RL.
- R1 · Jan ’25: multi-stage RL; reasoning emerges.
- V3.1 · Aug ’25: hybrid think / non-think in one model.
- V3.2 · Dec ’25: 6 specialists via RL → SFT distillation → one mixed GRPO.
- V4 · Apr ’26: 10+ domain experts → MOPD.

2026 style recipes!

MiMo Flash V2 (Jan. 2026): where MOPD started · paper
- Stages: Stage 1 SFT → Stage 2 train ~6 domain-specialist teachers (with older-style post-training recipes) → Stage 3 MOPD into a single student.
- First clean articulation of multi-teacher on-policy distillation as the consolidation step.
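
To make the MOPD consolidation step more concrete, here is a minimal sketch of the loss for a single student rollout. It is an illustration under stated assumptions, not the implementation from any of the reports above: the function names, the dictionary of teacher logits keyed by domain, and the choice of a full-vocabulary reverse KL at every position are all hypothetical. Real systems work on tensors inside a training loop and may instead estimate the KL from the sampled tokens only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities for one token position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) over the vocabulary at one token position."""
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    # The expectation is taken under the student, which is what makes it "reverse".
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_p_s, log_p_t))


def mopd_loss(student_logits, teacher_logits_by_domain, domain):
    """Mean per-token reverse KL against the teacher that owns this rollout's domain.

    student_logits: per-token logits for a trajectory sampled from the student.
    teacher_logits_by_domain: hypothetical mapping from a domain name to that
        teacher's per-token logits on the same student trajectory.
    domain: label used to route the rollout to the relevant teacher.
    """
    teacher_logits = teacher_logits_by_domain[domain]
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must score the same tokens")
    per_token = [reverse_kl(s, t) for s, t in zip(student_logits, teacher_logits)]
    return sum(per_token) / len(per_token)


# Toy example: a one-token rollout over a three-word vocabulary.
student = [[2.0, 0.0, 0.0]]
teachers = {
    "math": [[2.0, 0.0, 0.0]],  # agrees with the student
    "code": [[0.0, 2.0, 0.0]],  # prefers a different token
}
math_loss = mopd_loss(student, teachers, "math")
code_loss = mopd_loss(student, teachers, "code")
```

In this toy case the loss is zero against the teacher that matches the student and positive against the one that disagrees. The routing by `domain` is what separates the multi-teacher setup from single-teacher on-policy distillation: each rollout is only pulled toward the specialist for its own domain.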