I Raised Gemma 4's Token Cap. The Dense Model Stopped Refusing.
Ali Afana · 2026-05-21 · via DEV Community

TL;DR: Last week I argued Gemma 4 Dense regressed on a grounded-retrieval scenario under tightened prompts — and called the MoE-vs-Dense divergence architecture-mediated. The comment thread, led by Robin Converse on her sovereign Ollama stack, proposed an alternative: my max_tokens: 400 cap was starving Gemma's reasoning layer before the visible reply completed. I re-ran the same six scenarios with one variable changed — budget raised from 400 to 4096. Dense recovered on every scenario, including the false-refusal headline that anchored the original article. MoE did too. The original MoE-vs-Dense divergence largely disappears when reasoning has room to finish. The cap was doing the work. Walking it back publicly.


Why I Re-Ran It

Last week I published "I Added Three Rules to Gemma 4. The MoE Searched. The Dense Model Refused." Quick recap: I ran Gemma 4 26B MoE and Gemma 4 31B Dense through my Arabic e-commerce chat router with three prompt rules: an Arabic-first system frame, temperature capped at 0.3, and max_tokens floored at 400. The 26B MoE flipped from stalling to grounded answers. The 31B Dense flipped from working correctly to false-negative refusals on a scenario where the white shirts the customer asked about were sitting in its context. I called the divergence architecture-mediated and shipped the article.

The comment thread reframed it.

Robin Converse (Triava Labs, running the same model family on self-hosted Ollama) ran her own uncapped sweep on her sovereign stack: same scenarios, three temperatures, max_tokens unrestricted. Her MoE handled every case correctly. She posted the methodology and the 18-call breakdown and asked: what does the same test look like on the managed Gemini API side? She also separately filed two upstream Ollama bugs. She publicly walked back her framing on #15288 when maintainers clarified it was a configuration issue, and #15428 was confirmed by multiple users and resolved in a later release. That kind of fair-witness practice is what made me trust the hypothesis enough to test it.

The hypothesis, stated sharply: my max_tokens: 400 cap was starving Gemma 4's reasoning layer before the visible reply completed. Capability ceiling and orchestration pressure look identical from the outside, as Vic Chen put it on the same thread; only one is solvable by giving back budget. Vadym Arnaut mapped a separate substitution-vs-decision boundary that sharpened where to look. Edwin Realpe Preciado, Mykola Kondratiuk, Theo Valmis, and Hashevolution each named structural reframes I had to engage with. The collective shape of the thread was: the cap is one variable, the architecture is a category, and you've been calling the result by the wrong name.

The cap was the part I could test fastest. One variable. Re-run the same six scenarios on the same two architectures with the only change being the budget.


The Experiment

Single-variable change against the original v2 conditions:

| Variable | Original v2 (capped) | This re-run (uncapped) |
|---|---|---|
| Arabic-first system frame | kept | kept (unchanged) |
| Temperature | 0.3 | 0.3 (unchanged) |
| max_tokens floor | 400 | 4096 |
| Scenarios | 6 Arabic e-commerce | 6 (same set) |
| Models | 26B MoE + 31B Dense | 26B MoE + 31B Dense |
| Calls | 12 | 12 |
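
The single-variable design in the table above can be sketched as a call matrix. This is an illustration only: the scenario labels and the `buildCalls` and `diffKeys` helpers are hypothetical names chosen here, and the model labels are the ones used in this article rather than real API model identifiers.

```javascript
// Sketch of the re-run's call matrix: 6 scenarios x 2 models = 12 calls.
// Labels and helper names are illustrative assumptions, not the real sweep script.
const MODELS = ["26B MoE", "31B Dense"];
const SCENARIOS = [
  "greeting",
  "white-shirt-L",
  "suit-and-watch",
  "math-and-shipping",
  "walk-away",
  "explain-price",
];

// Everything except maxTokens is held fixed between the two rounds.
function buildCalls(maxTokens) {
  const calls = [];
  for (const model of MODELS) {
    for (const scenario of SCENARIOS) {
      calls.push({ model, scenario, temperature: 0.3, arabicFirstFrame: true, maxTokens });
    }
  }
  return calls;
}

// Names of the fields whose values differ between two paired calls.
function diffKeys(a, b) {
  return Object.keys(a).filter((key) => a[key] !== b[key]);
}

const cappedRun = buildCalls(400);
const uncappedRun = buildCalls(4096);

console.log(uncappedRun.length); // 12
console.log(diffKeys(cappedRun[0], uncappedRun[0]).join(",")); // maxTokens
```

Checking `diffKeys` on every capped/uncapped pair is a cheap way to confirm that a re-run really changed one variable and nothing else.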

The implementation was four lines in the chat-models server file the original article describes, gated behind an env flag so production behavior stays untouched:

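// before: floor the caller's max_tokens at 400
max_tokens: Math.max(params.max_tokens ?? 0, 400),
// after: read the opt-in env flag once, outside the request object...
const uncapped = process.env.GEMMA_UNCAPPED === "1";
// ...then pick the budget inside the request object
max_tokens: uncapped ? 4096 : Math.max(params.max_tokens ?? 0, 400),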


Dev server restarted with GEMMA_UNCAPPED=1. Sweep script ran the same six scenarios from the original article — same customer messages, same demo store, same router stack with gpt-4o-mini on every non-reply call so the only thing changing is the budget Gemma sees on the final response call.

No prompt edits. No router changes. No temperature changes. No new model rules. Just the budget.
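
For readers who want to try the same gate, here is a minimal sketch of it as a pure function. The name `resolveMaxTokens` and the `env` parameter are hypothetical; passing the environment in as an argument (instead of reading `process.env` directly) is a choice made here so the function can be exercised without restarting a server.

```javascript
// Sketch of the env-gated budget as a standalone function.
// resolveMaxTokens and its `env` parameter are hypothetical, for illustration only.
const FLOOR_TOKENS = 400;     // original behavior: floor on the caller's value
const UNCAPPED_TOKENS = 4096; // budget used in the re-run

function resolveMaxTokens(params, env) {
  const uncapped = env.GEMMA_UNCAPPED === "1";
  return uncapped ? UNCAPPED_TOKENS : Math.max(params.max_tokens ?? 0, FLOOR_TOKENS);
}

console.log(resolveMaxTokens({}, {}));                                       // 400
console.log(resolveMaxTokens({ max_tokens: 250 }, {}));                      // 400
console.log(resolveMaxTokens({ max_tokens: 250 }, { GEMMA_UNCAPPED: "1" })); // 4096
```

In a Node server the call would be `resolveMaxTokens(params, process.env)`. Note that `Math.max` makes 400 a floor: it only acts as the effective budget when the caller requests 400 tokens or fewer, or nothing at all.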


The Results

12 of 12 calls succeeded. Both architectures handled every scenario correctly.

| Scenario | 26B MoE, v2 (capped) | 26B MoE, uncapped | 31B Dense, v2 (capped) | 31B Dense, uncapped |
|---|---|---|---|---|
| 1: Greeting | ✓ tight open | ✓ open ask | ✗ HTTP 500 | ✓ open ask |
| 2: White shirt L | ✓ 3 SKUs + prices | ✓ 3 SKUs + prices | ✗ false refusal | ✓ 3 SKUs + prices |
| 3: Suit + watch | ✓ 2 suits + refused watch | ✓ 2 suits + refused watch | ✗ HTTP 500 | ✓ 2 suits + refused watch |
| 4: Math + shipping | ✓ grounded $100 threshold | ✓ grounded $100 threshold | partial (vague) | ✓ grounded $100 threshold |
| 5: Walk-away | ✓ holds value | ✓ holds value | ✓ holds value | ✓ holds value |
| 6: Explain price | ✓ no leak | ✓ no leak | ✓ no leak | ✓ no leak |

HTTP 500 rate: v2 Dense was 2 of 6. Uncapped Dense was 0 of 6. Uncapped MoE was 0 of 6. The reliability gap disappears.
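
The rates above are simple tallies over the results table. As a sketch, with the outcomes transcribed from that table into short labels (the labels `ok`, `http500`, `false_refusal`, and `partial` are an encoding chosen here, and `tally` is a hypothetical helper):

```javascript
// Outcomes transcribed from the results table, in scenario order 1..6.
// The outcome labels are an illustrative encoding, not fields from any API.
const outcomes = {
  "26B MoE, v2 (capped)": ["ok", "ok", "ok", "ok", "ok", "ok"],
  "26B MoE, uncapped": ["ok", "ok", "ok", "ok", "ok", "ok"],
  "31B Dense, v2 (capped)": ["http500", "false_refusal", "http500", "partial", "ok", "ok"],
  "31B Dense, uncapped": ["ok", "ok", "ok", "ok", "ok", "ok"],
};

// Count how often each outcome label appears in one run.
function tally(results) {
  const counts = {};
  for (const result of results) {
    counts[result] = (counts[result] ?? 0) + 1;
  }
  return counts;
}

for (const [run, results] of Object.entries(outcomes)) {
  const counts = tally(results);
  console.log(`${run}: HTTP 500 in ${counts.http500 ?? 0} of ${results.length}`);
}
```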

Latency (single run per pair, exploratory not statistical): 26B MoE 28–56 s, avg 37 s. 31B Dense 19–38 s, avg 27 s. The Dense variant was faster than the MoE variant once both had budget — different from what either round of the original article suggested.

Scenario 2 — the load-bearing test

The original article's headline failure was Scenario 2 on Gemma 4 31B Dense: a customer asks for a white shirt in size L, the catalog has three, and the Dense model said:

v2 capped: "والله ما عندنا قميص أبيض مقاس L حالياً. اعذرني، ما في شي متوفر من هاد الموديل حالياً."

Translation: "We don't currently have a white shirt in size L. My apologies, nothing of that model is available right now."

Same model, same context, same prompt, same temperature, budget raised from 400 to 4096:

Uncapped: "من عيوني! عندنا Urban Cool Striped Shirt بـ $65، وBordeaux Heritage Shirt بـ $80، وUrban Stripes Classic Shirt بـ $95، وكلهم متوفرين باللون الأبيض ومقاس L. بنصحك تنسق القميص مع Chinos عشان تطلع الطلة مرتبة وفخمة!"

Translation: "My pleasure! We have Urban Cool Striped Shirt for $65, Bordeaux Heritage Shirt for $80, and Urban Stripes Classic Shirt for $95, all available in white in size L. I recommend pairing the shirt with Chinos for a clean, polished look!"

False refusal → three real SKUs with real prices and a styling cross-sell. One variable changed.


What This Means

The original article called the MoE-vs-Dense divergence architecture-mediated. The honest read of the re-run is sharper than that:

  1. The cap was doing most of the work. Under max_tokens: 400, both architectures were starving — MoE in Round 1 (stalls on Scenario 2), Dense in Round 2 (false-refusal on Scenario 2 once the augmentation tightened the prompt enough to push reasoning over the budget). When both have room, both behave grounded.
  2. The architectural difference is real but quantitative, not qualitative. Dense and MoE both perform multi-step reasoning on this workload; they just pay different token-budget taxes for it. Calling the v2 result "architecture-mediated failure" over-claimed. The accurate phrasing is closer to: budget allocation for multi-step reasoning is architecture-sensitive; the failure mode is the same in both cases — running out of tokens before output completes.
  3. The "Dense is unfit for grounded chat" subtext that some readers took from the original is wrong. Dense recovered on every scenario when given the budget. The article's matrix table was misleading on that point and I should have flagged it harder.
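
One way to tell "ran out of budget" apart from "could not do it" is to look at why generation stopped. The sketch below assumes an OpenAI-style chat completion response, where `finish_reason` is `"length"` when the token limit was hit. Whether the endpoint used in this experiment exposes the same field is an assumption, `classifyStop` is a hypothetical helper, and the sample payloads are hand-written rather than captured from a run.

```javascript
// Classify why a reply ended, assuming an OpenAI-style response shape.
// classifyStop is a hypothetical helper; the response shape is an assumption.
function classifyStop(response) {
  const choice = response.choices?.[0];
  if (!choice) return "no_choice";
  if (choice.finish_reason === "length") return "budget_exhausted";
  const text = choice.message?.content ?? "";
  if (text.trim() === "") return "empty_reply";
  return "completed";
}

// Hand-written sample payloads, not captured from a real run.
const truncated = { choices: [{ finish_reason: "length", message: { content: "" } }] };
const finished = { choices: [{ finish_reason: "stop", message: { content: "3 SKUs listed" } }] };

console.log(classifyStop(truncated)); // budget_exhausted
console.log(classifyStop(finished));  // completed
```

A check like this only catches hard truncation. A refusal that ends normally would still be classified as `completed`, so it complements a single-variable re-run rather than replacing it.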

Robin called it from her stack on Tuesday. The cross-validation now exists: same architecture, two deployment contexts (sovereign Ollama, managed Gemini API), one finding — uncap the budget and the failure mode evaporates.

What still partially holds from the original article:

  • Reluctance, not hallucination, as the dominant failure mode for grounded Arabic chat on open models when the budget is too tight to complete reasoning.
  • Variant-specific prompt tuning is real — but the variant-specific thing isn't an architectural slot; it's a token budget shaped to the variant's reasoning footprint.
  • Latency on the Google API is still a chasm for interactive chat — that part wasn't a cap artifact.

What I retract:

  • The "architecture has slots for sequential sub-behavior that dense doesn't" hypothesis was too strong. Dense plus budget gets the same grounded behavior.
  • The matrix's "31B Dense regression" now reads as "31B Dense under-budgeted regression."
  • The closing line — "I think I was tuning architecture, not size" — should have been "I think I was tuning budget, mediated by architecture."

What's Still Open

  • Temperature dimension untested in this run. I held temperature at 0.3 to keep this one-variable. Hashevolution flagged this specifically — the temperature cap may not be neutral across architectures, since forcing 0.3 on an MoE might suppress the routing entropy that lets it resolve sequencing in the first place. The next sweep varies temperature (0.3 / 0.7 / default) on top of uncapped budget to separate "architecture matters" from "thermostat + architecture matters."
  • Cross-validation with Robin's data still pending as a single artifact. She ran her side on sovereign Ollama; I ran mine on managed Gemini AI Studio. Two stacks, one architecture, one combined finding. That's the version worth co-publishing — and the version where the headline becomes "what holds across deployment contexts."
  • Instruction-order ablation / typed retrieval. Edwin proposed making preconditions first-class through his NEXUS protocol; Hashevolution proposed typed retrieval-conditioning context via Graph-RAG as a cross-domain falsifier. Both bypass the model's ambiguity resolution by removing the decision from the model. The Graph-RAG cross-experiment is in motion — provider-contract landed early on Hashevolution's side, swap targeted for mid-June.
  • The eval_count vs response-chars gap Robin found on her side (token count widens with query difficulty, response characters don't). I didn't reproduce that measurement on this run — the Gemini API surfaces different telemetry than Ollama's /api/generate. Worth a follow-up that adds per-call token accounting on the managed side so the cross-validation is symmetric.
  • Third deployment context queued. Hashevolution shared that JAMES's per-stage DEFAULT_MAX_TOKENS values (200/400/400/400 across four cognitive stages) match this pathology: their 2026-05-18 internal eval reported empty responses on gemma4:e4b at exactly those four stages. Uncapped replication is queued this week. If it reproduces, that's three independent deployment contexts (sovereign Ollama, managed Gemini API, JAMES production) on the same cap pathology before any cross-experiment swap runs.
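
The eval_count vs response-chars measurement from the list above could be computed along these lines. Ollama's /api/generate response includes `response` (the generated text) and `eval_count` (the number of generated tokens). The records below are made-up placeholders that show the shape of the calculation, not Robin's data, and `tokensPerChar` is a hypothetical helper.

```javascript
// Tokens generated per visible character for one Ollama-style record.
// tokensPerChar is a hypothetical helper; the sample records are placeholders.
function tokensPerChar(record) {
  const chars = [...record.response].length; // count code points, not UTF-16 units
  return chars === 0 ? Infinity : record.eval_count / chars;
}

// Placeholder records: same visible length, different token counts.
const samples = [
  { label: "easy", response: "a".repeat(200), eval_count: 100 },
  { label: "hard", response: "b".repeat(200), eval_count: 600 },
];

for (const sample of samples) {
  console.log(sample.label, tokensPerChar(sample).toFixed(2));
}
// easy 0.50
// hard 3.00
```

A ratio that climbs with query difficulty while the visible reply stays the same length would suggest tokens are being spent somewhere other than the visible text, which is the pattern a fixed budget would cut into first.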

The Lesson

The community ran my falsifier for me. The cap is one variable; architecture is a category. If your model is failing under a tight token budget, raise the budget before you reach for architectural explanations — and if a thoughtful commenter offers you a single-variable test against your strongest claim, run it.

The article you can write a week later, with the comment thread folded in, is stronger than the one you ship alone. Robin Converse is the right person to share the next round with.