惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

PCI Perspectives
PCI Perspectives
Apple Machine Learning Research
Apple Machine Learning Research
Recent Announcements
Recent Announcements
量子位
H
Hackread – Cybersecurity News, Data Breaches, AI and More
腾讯CDC
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
S
Schneier on Security
Microsoft Azure Blog
Microsoft Azure Blog
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
小众软件
小众软件
Recorded Future
Recorded Future
P
Privacy International News Feed
Cisco Talos Blog
Cisco Talos Blog
Latest news
Latest news
C
Check Point Blog
O
OpenAI News
N
Netflix TechBlog - Medium
U
Unit 42
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
P
Proofpoint News Feed
Hacker News - Newest:
Hacker News - Newest: "LLM"
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
宝玉的分享
宝玉的分享
F
Full Disclosure
Know Your Adversary
Know Your Adversary
GbyAI
GbyAI
W
WeLiveSecurity
Engineering at Meta
Engineering at Meta
Scott Helme
Scott Helme
云风的 BLOG
云风的 BLOG
I
InfoQ
D
Docker
N
News | PayPal Newsroom
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
T
Tor Project blog
The GitHub Blog
The GitHub Blog
www.infosecurity-magazine.com
www.infosecurity-magazine.com
T
ThreatConnect
人人都是产品经理
人人都是产品经理
S
Securelist
G
Google Developers Blog
Martin Fowler
Martin Fowler
雷峰网
雷峰网
Stack Overflow Blog
Stack Overflow Blog
P
Privacy & Cybersecurity Law Blog
L
Lohrmann on Cybersecurity
博客园 - 【当耐特】
博客园 - 司徒正美
Hugging Face - Blog
Hugging Face - Blog

DEV Community

I built a local voice AI that can change to 9 different personalities! UXRay: I Built an AI That Roasts Your UI Like a Senior Designer Would Wyrly DI: Type-safe Dependency Injection for Modern TypeScript The contract is the interface: agent-driven Steampipe Stave in one command ⚡ WordPress Performance: The Real Truth They Don't Tell You A Mobile App Usually Needs an Admin System First Customer Portals Should Remove Repeated Admin Work Episode 4: The Time Loop (Layers & Caching) I Built ContextForge with Gemma 4: A Project Memory Generator for Developers and AI Coding Agents Why shadow DOM beat iframe for inline tooltips HOW TO CREATE USER AND ASSIGN ROLES IN AZURE WITH ENTRA ID When AI Blackmail Goes Viral Episode 3: The Secret Scroll (The Dockerfile) Monte Carlo Simulation for Engineers: Turning Uncertainty Into Numbers The tokens-per-byte trap: character-level 'compression' adds tokens Nobody Reads Your Code Anymore Why I built a collection of 5 free, zero-signup career finance tools for solo builders 🚀 New React Challenge: Instant UI with useOptimistic Resolvendo a Alucinação da IA na Arquitetura de Software com Code Property Graphs e .NET 9 S1 — Clean Backtrace Crashes: How to Diagnose and Fix Them Cómo solucionar el bucle infinito en useEffect con objetos y arrays The Brutal Reality of Running Gemma 4 Locally I made Claude Code refuse to write code unless the ticket scores 80/100 I Fed React's Entire Hooks Transition History to Gemma 4. Here's What It Found That We Missed. Building a Private RAG System: Lessons from a Local-First AI Journal CodePulse AI — Reviving an AI-Powered Repository Intelligence Platform How to Split Video into Segments with FFmpeg (CLI + API) I've audited dozens of estate agency websites. The same 5 problems show up every single time. Part 1: Taming Asynchronous JavaScript: How to Build a "Mailbox" Queue Building My AI-Powered VS Code Extension 🚀 Google Login in Express with PassportJS & JWT Great example of Gemma 4 moving beyond chatbots into real-world decision support. Using AI to guide everyday actions like recycling shows how impactful applied LLMs can be when designed for usability, not just capability. #Gemma4 #AI #Sustainability Building a Production AI Chatbot for an Educational Institute: Architecture, Lessons & Full Stack Deep-Dive Google Login in Express with PassportJS & JWT How I reclaimed 47GB on my MacBook by cleaning developer project junk Operators Are Not Oracles: How We Learned to Stop Worrying and Love the Configuration I Built 6 Free Developer Tools for AI APIs, Cron, Docker, and Self-Hosting How I Built a Real-Time Precious Metals Price Feed for 30,000 Concurrent Users in Laravel How to Use a SERP API to Validate Whether a Project Idea Is Worth Building Gemma 4 discussions often focus on capability, but real-world impact depends on deployment context. For offline education, especially in low-connectivity regions, latency, cost, and local inference matter as much as model strength. Local Mind Explores it Space Complexity + Ω and Θ Notations Google I/O 2026 Just Confirmed the Shift From AI Chatbots to AI Agents How to Add API Monitoring to an Express App in 5 Minutes (2026) Designing an In-Game Inflation Tracking Algorithm for Web Utility Apps Google AI Studio Just Changed the Shape of App Development If you struggle to learn then this is for you. Best AI Agent Security & Guardrails Tools in 2026: LLM Guard vs NeMo vs Guardrails AI Building Dynamic RBAC in React 19: From Permission Strings to Component-Level Access Control How to Build a Self-Hosted AI Code Review Tool in Python Why We Switched from React to HTMX in Production: A 200-Site Case Study Gemma-Loom: The Intent-Based Virtual Machine (IVM) for Edge Sovereignty Java实习海投攻略:3天300个沟通,我是怎么拿到面试的 I Deployed Netflix's Web Server in 30 Seconds (And So Can You) - Docker Project 1 Debugging Android 14 WebRTC Disconnects on a coturn Relay Path 1/30 Days System Design Question Testing FastAPI + SQLAlchemy with Real PostgreSQL Fixtures: No More Mocking Misery FAQ Schema Markup Generators: What They Actually Do (and What They Don't Tell You) How a pure-TypeScript flex layout engine closed the last WASM-Yoga gap Spot instances as GitHub Actions runners Agents Need Receipts, Not Just Better Prompts readmegen — Generate beautiful README.md in seconds (12 templates, open source) When AI Reads Blueprints: The Hidden Attack Surface of Multimodal Engineering Intelligence Simplicity scales — complexity kills side projects AI does exactly what you ask — that's the problem How a model upgrade silently broke our extraction prompt (and how we caught it) The Best Form Backend for Static Sites in 2026 # ⛽ I Built a Cross-Platform Fuel Finder with React & Supabase: The Indie Dev Journey The 11 Major Cloud Service Providers in 2025 Membangun Karya Visual: Mengintip Fasilitas Multimedia dan Studio Kreatif Amikom What Is IOPS? Visualizing Database Design: From Interactive Canvas to Drizzle, Prisma, and SQL in Real-time A tool to make your GitHub README impossible to ignore 🚀 Zero-Downtime Blue-Green and IP-Based Canary Deployments on ECS Fargate I reproduced a Claude Code RCE. The bug pattern is everywhere. We Replaced Our RAG Pipeline With Persistent KV Cache. Here's What We Found. Jenkins CI/CD Pipeline for a Dockerized Node.js Application: Manual Trigger vs Automatic Trigger Using GitHub Webhooks How to Stream Live Forex Rates to Google Sheets API: A Complete Guide Small Models Will Beat Giant Models (And Most People Haven’t Realized Why Yet) How I Built 5 Linux Automation Scripts on AWS EC2 I built TokenPatch to measure AI coding cost per applied patch I built a Chrome extension to stop squinting at the web Producer audit clean, six tests red Conversa — A Multi-Agent AI Platform Powered by Gemma 4 Build a Real Agent in 15 Minutes with Gemini's New Managed Agents API What I Actually Build: AI Systems That Ship, Not Demos That Impress The Box Ticked While You Read This: LinkedIn, AI Training, and the Switch You Did Not Flip Investasi Masa Depan: Mengintip Fasilitas Laboratorium Komputer Kelas Dunia di Yogyakarta I Cancelled My $20 Claude Cowork Plan After a Week With OpenWork Stop Reviewing Every Line of AI Code - Build the Trust Stack Instead How To Build an Image Cropper in Browser (Simple Steps) I built a macOS disk cleaner for developers and just launched it would love feedback Membangun Kompetensi dan Relasi: Mengapa Ekosistem Kampus Itu Penting I Built an AI That Decides Which AI to Talk To — Running 24/7 From My Living Room Codex Team Usage SOP How to Actually Become a Programmer: The Hard Part Nobody Wants to Explain Building a Production-Style Multi-Tool AI Agent with Python, Flask, React & Gemini AI The Caretaker Sandbox: An Offline-First Visual Playground & Template Engine powered by Gemma 4 # Building Instagram OSINT Projects with HikerAPI Your AI can read. Gemma 4 can see The Battle of the Senior Dev: Why AI Gives You Wings But Only If You're Ready to Pilot
Gemma 4's Hidden Superpower: Why Built-in Thinking Tokens Change Everything for Evaluation Tasks
Thy Alpha · 2026-05-23 · via DEV Community

This is a submission for the Gemma 4 Challenge: Write about Gemma 4

Most coverage of Gemma 4 focuses on the headline numbers: 128K context, 31B parameters, multimodal input. But after building a production app with it, I think the most underrated feature is something barely mentioned in the docs: native thinking tokens.

Here's what they are, why they matter, and what they unlock.

What Are Thinking Tokens?

When you call Gemma 4's API, the response includes parts with a thought: true flag:

{
  "candidates": [{
    "content": {
      "parts": [
        {
          "text": "The user asked about sorting algorithms...\n- They mentioned merge sort but not quicksort...\n- Their Big-O analysis is correct...\n- They missed the space complexity point...",
          "thought": true
        },
        {
          "text": "Great explanation of merge sort! One thing to add: while your O(n log n) analysis is correct, it's worth mentioning..."
        }
      ]
    }
  }],
  "usageMetadata": {
    "promptTokenCount": 45,
    "candidatesTokenCount": 52,
    "totalTokenCount": 312,
    "thoughtsTokenCount": 215
  }
}

Enter fullscreen mode Exit fullscreen mode

The model generates internal reasoning (215 tokens of thinking) that you, the developer, can see but that the model treats as its "inner monologue." The final output (52 tokens) is what the user sees.

This is not chain-of-thought prompting. This is chain-of-thought architecture — the model does it automatically, without being told to "think step by step."

Why This Matters for Evaluation Tasks

I discovered this while building an AI interview coach. The app asks interview questions, the user answers, and Gemma 4 evaluates the response.

Here's the thing: evaluation is harder than generation. Generating a plausible interview question is easy. Evaluating whether someone's answer is good — considering structure, specificity, depth, relevance, and experience level — requires the model to hold multiple criteria in working memory simultaneously.

Without thinking tokens (typical LLM behavior):

User: "Tell me about a time you led a team through a difficult project."
Answer: "I led my team to deliver the Q3 release on time."

Model output: "Good job mentioning leadership! Try to add more details next time."

Enter fullscreen mode Exit fullscreen mode

The model races to produce output without deeply considering what was missing.

With thinking tokens (Gemma 4):

Thinking: "The candidate claims they led a team through a difficult project.
  - STAR analysis:
    - Situation: Not specified (what project? what was difficult?)
    - Task: Vague ('deliver Q3 release')
    - Action: Missing entirely (what did THEY specifically do?)
    - Result: 'on time' — but no metrics, no impact
  - Experience level: Mid-level, so I should expect specific examples
  - Pattern: This is the third vague answer — I should push harder this time
  - Selected approach: Acknowledge the topic, then probe for specifics"

Output: "You picked a relevant example, but I need more. What specifically made
this project difficult? What decisions did YOU make as the lead? And what
was the measurable outcome? Give me numbers — team size, timeline, impact."

Enter fullscreen mode Exit fullscreen mode

The thinking tokens let the model systematically work through evaluation criteria before committing to a response. The result feels like feedback from an experienced interviewer, not a chatbot.

Three Patterns Where Thinking Tokens Excel

1. Multi-Criteria Evaluation

When your task has multiple evaluation dimensions (like grading an essay, reviewing code, or scoring an interview answer), thinking tokens let the model address each criterion before synthesizing a final judgment.

# The model naturally breaks down scoring without explicit prompting
prompt = """
Score this answer for: clarity, depth, and relevance.
Answer: "Machine learning is when computers learn from data."
"""
# Gemma 4's thinking will evaluate each dimension separately,
# then produce a balanced score. Other models might just say "7/10."

Enter fullscreen mode Exit fullscreen mode

2. Pattern Recognition Across Context

With 128K context AND thinking tokens, Gemma 4 can notice patterns across a long conversation and reason about them:

Thinking: "Looking at answers 1-7:
  - Answers 1, 3, and 5 all lack specific metrics
  - Answer 4 was the strongest — they used concrete numbers
  - This is a pattern: the candidate is good at storytelling
    but weak at quantification
  - I should focus my feedback on this specific gap"

Enter fullscreen mode Exit fullscreen mode

This combination — long context for memory + thinking for analysis — is genuinely new.

3. Calibrated Difficulty Adjustment

The model can reason about whether to make the next question easier or harder:

Thinking: "The candidate nailed the last two system design questions.
  - They correctly identified the bottleneck
  - Their caching strategy was solid
  - But they haven't been tested on distributed consensus yet
  - Since they're interviewing for a senior role, I should
    push into harder territory now"

Enter fullscreen mode Exit fullscreen mode

26B MoE vs 31B Dense: When to Use Each

After testing both extensively, here's my take:

26B MoE 31B Dense
Speed 1-3s per response 5-10s per response
Thinking quality Good — covers main points Excellent — catches edge cases
Best for Interactive chat, quick feedback Deep analysis, final reports
Thinking tokens used ~100-200 per response ~200-400 per response

My recommendation: Use 26B MoE for the conversational back-and-forth, and 31B Dense for summary/evaluation tasks where speed matters less.

In my interview coach app, I default to 26B MoE because conversational latency matters more than marginal evaluation quality. But if you're building a code review tool or essay grader where the user can wait 10 seconds, go with 31B Dense.

Practical Tips for Building with Thinking Tokens

1. Filter them in your UI

const parts = response.candidates[0].content.parts;
const visibleText = parts.filter(p => !p.thought).map(p => p.text).join('');
const thinkingText = parts.filter(p => p.thought).map(p => p.text).join('');

Enter fullscreen mode Exit fullscreen mode

Users should see the polished output, not the internal reasoning.

2. Log the thinking for debugging

The thinking tokens are incredibly useful for understanding why the model gave a particular response. I log them during development:

if (process.env.NODE_ENV === 'development') {
    console.log('🧠 Model thinking:', thinkingText);
}

Enter fullscreen mode Exit fullscreen mode

3. Don't fight the thinking — design around it

If you prompt Gemma 4 to "just give me a one-word answer," it'll still think internally. That's fine — the thinking tokens don't appear in the output. But they DO count toward your token usage.

For simple tasks (classification, yes/no), the thinking overhead might not be worth it. For complex tasks (evaluation, planning, multi-step reasoning), it's exactly what you want.

4. Temperature affects thinking quality

At low temperature (0.1-0.3), thinking tokens are more systematic and thorough. At high temperature (0.8+), they're more creative but occasionally tangential. For evaluation tasks, I recommend 0.5-0.7.

The Bigger Picture

Thinking tokens represent a shift from "prompt engineering" to "reasoning architecture." Instead of crafting elaborate prompts that force step-by-step reasoning, the model does it natively.

This matters because:

  1. Simpler prompts, better results — You don't need "Let's think step by step." The model already does.
  2. More reliable evaluation — The model is less likely to give snap judgments on complex tasks.
  3. Transparent reasoning — You can inspect the thinking to understand (and debug) the model's logic.

Combined with 128K context and the efficiency of the MoE architecture, Gemma 4 is uniquely positioned for applications that need to reason over long interactions — tutoring, coaching, mentoring, code review, and any task where shallow responses aren't good enough.

Try It Yourself

The best way to see thinking tokens in action is to build something that requires evaluation. Here's a minimal example:

curl "https://generativelanguage.googleapis.com/v1beta/models/gemma-4-26b-a4b-it:generateContent?key=YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [{"parts":[{"text":"Evaluate this Python code for bugs and style issues:\n\ndef fibonacci(n):\n  if n <= 1: return n\n  return fibonacci(n-1) + fibonacci(n-2)"}]}]
  }'

Enter fullscreen mode Exit fullscreen mode

Look at the response. You'll see the thinking tokens breaking down the code analysis before the final review appears. That's Gemma 4's hidden superpower at work.


This post was inspired by building Interview Coach, an open-source AI interview practice tool powered by Gemma 4. The code is MIT licensed — fork it and build something better.