How the community trained Gemma to "Think" with Tunix and TPUs
2026-05-29 · via Google Developers Blog

MAY 28, 2026

Large Language Models (LLMs) often benefit from "thinking" before they speak for complex tasks. Frontier LLMs like Gemini 3 and leading open weight models like Gemma 4 can produce explicit reasoning traces, commonly called Chain-of-Thought, before answering user questions. But how this reasoning capability is trained is often not disclosed. While there are many reasoning tutorials available on the Internet to train for simple verifiable tasks such as math or coding, accessible and easy-to-reproduce training recipes (including data, training strategy, runnable code and evaluations) for general reasoning remain scarce.

This motivated us to hold the "Google Tunix Hack: Train a model to show its work" hackathon on Kaggle, where we challenged developers to transform non-reasoning base models (Gemma-2-2B and Gemma-3-1B) into general reasoning models using Tunix and Kaggle TPUs. The response was overwhelming: over 11,000 entrants and 300+ high-quality submissions showed that the community can train solid reasoning models even on a very limited compute budget (a Kaggle TPU v5e-8 for 9 hours). In this post, we'll highlight the techniques used by the winners and share key recipes that allow models to reason across key vertical industries, so you can train your own reasoning models.

Highlighting the Winners: Key Innovations

The winning submissions demonstrated a sophisticated understanding of post-training, combining supervised learning, preference optimization, and reinforcement learning in creative ways.

🥇 1st Place: G-RaR (Rubric-Based Reinforcement Learning)

G-RaR trains Gemma models to produce structured reasoning by combining Supervised Fine-Tuning (SFT) with GRPO, driven by a novel rubric-based LLM-as-judge reward system.

  • How It Improves Reasoning: The model's reasoning power is improved by explicitly training it to "show its work" inside <reasoning> tags before outputting an answer. The technique underlying the GRPO stage, G-RaR (Rubrics as Rewards), uses a larger judge model (Gemma-3-12B) to evaluate the quality of these intermediate logical steps against task-specific rubrics. By converting discrete rubric scores into continuous, normalized reward signals, the technique provides dense, smooth feedback on the model's logic. This allows the model to keep improving its reasoning without relying solely on exact-match correctness, making it highly effective even for open-ended, non-verifiable tasks.
  • Technical Solution: The team utilized a two-stage post-training pipeline:
    • Stage 1 (SFT): The Gemma-2-2B-IT model is fine-tuned via LoRA on a ~33k sample dataset to establish a baseline. This "warm start" teaches the model to reliably output the <reasoning>...</reasoning><answer>...</answer> structure.
    • Stage 2 (GRPO): The model is then refined using GRPO, based on a composite reward function (Format Reward + Exact Answer Reward + G-RaR Score). To overcome compute constraints, the team used a split-mesh architecture on a single Kaggle TPU v5e-8, placing the policy/reference models on one mesh and the judge model on the other for true parallel execution.
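
The composite reward described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the winning team's code: the function names, the 0-5 rubric scale, and the weights are hypothetical, and the rubric scores are assumed to come from a separate judge model. The final helper shows the group-relative normalization that GRPO applies to the rewards of completions sampled for the same prompt.

```python
import re
from statistics import mean, pstdev

# Hypothetical pattern for the <reasoning>...</reasoning><answer>...</answer> structure.
FORMAT_RE = re.compile(
    r"^\s*<reasoning>(.+?)</reasoning>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the expected tag structure, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion) else 0.0


def exact_answer_reward(completion: str, reference: str) -> float:
    """1.0 if the text inside <answer> matches the reference (case-insensitive)."""
    match = FORMAT_RE.match(completion)
    if match is None:
        return 0.0
    predicted = match.group(2).strip().lower()
    return 1.0 if predicted == reference.strip().lower() else 0.0


def rubric_reward(scores: list[int], max_score: int = 5) -> float:
    """Map discrete per-criterion judge scores (0..max_score) to a value in [0, 1]."""
    if not scores:
        return 0.0
    return sum(scores) / (max_score * len(scores))


def composite_reward(
    completion: str,
    reference: str,
    rubric_scores: list[int],
    weights: tuple[float, float, float] = (0.2, 0.3, 0.5),  # assumed weights
) -> float:
    """Weighted sum of format, exact-answer, and rubric rewards."""
    w_format, w_answer, w_rubric = weights
    return (
        w_format * format_reward(completion)
        + w_answer * exact_answer_reward(completion, reference)
        + w_rubric * rubric_reward(rubric_scores)
    )


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize rewards within one prompt's sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the rubric term is continuous, two completions with the same final answer can still receive different rewards, which is what gives the policy a learning signal on open-ended prompts where exact match is uninformative.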

🥈 2nd Place: Pinocchio-1B (Creating a Reasoning Model in 3 Acts)

Evolving a 1B parameter model into a structured reasoning engine ("Pinocchio") via a highly efficient, 9-hour TPU pipeline (SFT → SimPO → GRPO).

  • How It Improves Reasoning: The model learns to generate a structured <reasoning> trace before answering, shifting from basic pattern matching to logical deduction. This is built sequentially: SFT instills foundational Chain-of-Thought, SimPO locks in strict formatting (preventing verbosity hacks), and GRPO refines logic by using an LLM-as-a-Judge to reward coherence and heavily penalize hallucinations.
  • Technical Solution: The pipeline consists of three stages:
    • SFT (Distillation): Trained on 70k prompts using an OSS-120B teacher model and a Gemini task-router.
    • SimPO (Alignment): Replaced memory-heavy DPO to efficiently enforce strict XML formatting.
    • GRPO (Refinement): Used Gemini 2.0 Flash as an asynchronous judge to dynamically reward accuracy, logic, and format.
  • Customizing Tunix: The team explicitly extended the Tunix library to support this workflow by:
    • Injecting a custom SimPO loss function (with length normalization) into the DPOTrainer.
    • Creating a high-throughput, asynchronous evaluation engine to process GRPO reward signals on the fly.
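
To make the SimPO stage more concrete, here is a minimal sketch of the length-normalized SimPO objective for a single preference pair, written in plain Python. It is not the team's Tunix integration: the function names and the default values of `beta` and `gamma` are illustrative assumptions. The inputs are the per-token log-probabilities that the policy assigns to the preferred and rejected responses.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(
    chosen_logps: list[float],
    rejected_logps: list[float],
    beta: float = 2.0,   # reward scale (assumed value)
    gamma: float = 1.0,  # target reward margin (assumed value)
) -> float:
    """SimPO loss for one (chosen, rejected) pair.

    Each response's implicit reward is its average per-token log-probability
    scaled by beta, so longer responses gain nothing from length alone.
    """
    chosen_avg = sum(chosen_logps) / len(chosen_logps)
    rejected_avg = sum(rejected_logps) / len(rejected_logps)
    margin = beta * (chosen_avg - rejected_avg) - gamma
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)
```

Unlike DPO, this objective needs no log-probabilities from a frozen reference model, so only the policy has to be held in memory during training. That is consistent with the memory savings described above.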

🥉 3rd Place: IDEA-E Distillation with Curriculum Guided GRPO Training

Distilling the structured "IDEA-E" ethical reasoning framework into a 2B model using curriculum-guided GRPO and a fast TF-IDF reward system.

  • Why It Improves Reasoning: The IDEA-E scaffold forces the model through a step-by-step logical deduction before answering, preventing premature guessing. Simultaneously, the TF-IDF reward prevents verbose "yapping" by incentivizing the use of context-relevant vocabulary in the reasoning trace.
  • Technical Solution: The pipeline features two stages:
    • SFT: Fine-tuning on teacher data to establish the IDEA-E format.
    • GRPO: Reinforcement learning using curriculum guidance and a TF-IDF reward instead of a slow LLM judge.
  • Customizing Tunix: The team extended Tunix by integrating their custom TF-IDF reward function into the Tunix GRPO pipeline, allowing for rapid, non-blocking reward calculations on the CPU.
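
One way a TF-IDF reward of this kind could work is sketched below using only the Python standard library. The post does not describe the team's exact formulation, so everything here is an assumption made for illustration: the tokenizer, the smoothed IDF formula, the use of cosine similarity between the prompt context and the reasoning trace, and all function names.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(corpus: list[str]) -> tuple[dict[str, float], float]:
    """Return smoothed IDF weights plus a default weight for unseen terms."""
    n_docs = len(corpus)
    doc_freq: Counter = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    default_idf = math.log(1 + n_docs) + 1.0
    return idf, default_idf


def tfidf_vector(text: str, idf: dict[str, float], default_idf: float) -> dict[str, float]:
    """Sparse TF-IDF vector: relative term frequency times IDF."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf.get(t, default_idf) for t, c in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def tfidf_reward(context: str, reasoning: str, idf: dict[str, float], default_idf: float) -> float:
    """Reward in [0, 1]: higher when the trace reuses context-relevant vocabulary."""
    return cosine(
        tfidf_vector(context, idf, default_idf),
        tfidf_vector(reasoning, idf, default_idf),
    )
```

Since the IDF table can be built once before training, each reward call reduces to a few dictionary lookups on the CPU. This makes such a reward far cheaper than querying an LLM judge, in line with the speed motivation given above.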

Honorable Mentions

While the top three spots took the podium, several other submissions showcased strong creativity and technical depth:

🌟 Eliciting Reasoning via On-Policy Distillation

  • The Approach: Instead of relying solely on static offline datasets, the team implemented an on-policy distillation method from scratch within the Tunix framework. They used a larger, highly capable teacher model (trained in 3 phases) to generate reasoning traces dynamically in response to the student model's generations during training, creating a tighter feedback loop.

🌟 Gemma2-Deep: Incentivizing Gemma to Reason before Answering

  • The Approach: Developed by participant TheItCrow, this project focused on custom dataset curation and structured reward modeling.
    • They curated the Deep-CoRGI (Cognitive Reasoning Guided Interface) dataset, specifically designed to teach Chain of Thought.
    • They trained a custom ThoughtTeacher reward model to evaluate not just the correctness of the final answer, but the logical flow of the reasoning steps themselves.

We were also impressed by several submissions that focused on reasoning training in specific domains, such as medicine, chemistry, law, and robotics.

  • Medical: GRPO training teaches the model to generate structured, step-by-step reasoning traces, enhancing the interpretability and reliability of its complex clinical problem-solving outputs.
  • Chemistry: Step-by-step reasoning traces enable a small language model to solve complex chemistry reasoning tasks.
  • Legal: Post-training via GRPO reinforces structured, step-by-step reasoning, enabling the Gemma 3 1B model to accurately analyze complex legal data and produce reliable, logically sound interpretations.
  • Robotics: Step-by-step reasoning generation allows the model to solve multi-step robotics planning and decision-making tasks under single-session training constraints.

Ready to build?

The Tunix Hackathon helps democratize the training of highly capable, structured reasoning models: it produced many impressive reasoning training recipes, all of which are now publicly available. With Tunix and free Kaggle TPUs, developers can achieve strong results on accessible hardware.

If you're ready to start post-training your own reasoning models, here are some resources to get started:

  1. Explore Tunix on GitHub: Check out the official Tunix repository to access the code, documentation, and community examples.
  2. Try a Colab Tutorial: Spin up a free TPU instance in Google Colab and try the Tunix examples to run your first SFT or RL loop.
  3. Learn More About Reinforcement Learning: Read the RL documentation in Tunix to understand how to use reinforcement learning to fine-tune your model.
