Qwen 3.6 enable_thinking — The MoE Pitfall That Broke My Agent JSON Parsing
SleepyQuant · 2026-05-18 · via DEV Community

I lost two hours last week to a Qwen 3.6 quirk that doesn't show up in any quickstart guide. My agent kept returning malformed JSON. Logs showed the model output started with <think> and a 200-token reasoning monologue before the actual JSON I asked for. Parser exploded every time.

The fix is one keyword argument. The frustration is that nothing in the obvious places — model card, MLX docs, generic chat template examples — tells you about it.

If you're running Qwen 3.6 MoE for an agent setup and your structured outputs are broken, read on.

The symptom

I had a tool-calling loop that asked Qwen to emit JSON. Something like:

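import json
from mlx_lm import generate  # assumption: generate() here is mlx_lm's

prompt = "Return a JSON object with keys 'action' and 'target'."
response = generate(model, tokenizer, prompt)  # raw completion text
data = json.loads(response)  # raises if anything precedes the JSON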

Worked fine with Qwen 2.5. Broke immediately with Qwen 3.6. The output looked like:

<think>
The user wants a JSON object. I need to think about what action and target make sense.
Let me consider the context...
[200 more tokens of reasoning]
</think>

{"action": "search", "target": "weather"}

JSON parser saw the <think> block as garbage, threw a JSONDecodeError. Easy enough to spot once I logged the raw output. But it took me a while to realize this was a model feature, not a prompt problem.

What's actually happening

Qwen 3.6 ships with reasoning mode on by default. The chat template works with a pair of markers, <think> and </think>, and the model is trained to fill them with its chain-of-thought before producing the user-facing answer. For interactive chat, this is sometimes useful: you can show the reasoning to a user or hide it, and the reasoning does improve answer quality on hard problems.

For an agent loop that parses structured output, it's silently destructive. Every response starts with a reasoning block you have to strip before you can use the actual answer. Worse, the reasoning length is unpredictable, sometimes 50 tokens and sometimes 800, so your max_tokens budget gets eaten by thinking instead of output. On a memory-tight Mac that's already running a 35B model, those wasted tokens also fragment the Metal cache faster. That's a separate problem, but the two compound. (I wrote up the memory side in my MLX memory safety checklist if that's the angle you hit first.)
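
For paths where thinking can't be switched off, or as a safety net, the stripping step can be made explicit. This is a minimal sketch, not a definitive implementation: strip_thinking is a hypothetical helper name, and it assumes the reasoning always ends with a literal </think> tag, as in the output shown earlier. It also turns the "budget eaten by thinking" case into a clear error instead of a confusing JSONDecodeError.

```python
import json


def strip_thinking(raw: str) -> str:
    """Return the text after the last </think> tag, or raw if there is none."""
    _reasoning, closing_tag, answer = raw.rpartition("</think>")
    if closing_tag:
        return answer.strip()
    if "<think>" in raw:
        # Opened but never closed: generation stopped mid-reasoning,
        # typically because max_tokens ran out before the answer began.
        raise ValueError("unterminated <think> block: output was truncated")
    return raw.strip()


raw = (
    "<think>\nThe user wants a JSON object.\n</think>\n\n"
    '{"action": "search", "target": "weather"}'
)
data = json.loads(strip_thinking(raw))
print(data["action"])  # search
```

One limitation of this sketch: a JSON payload that itself contains the string </think> would be cut in the wrong place.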

The fix

In apply_chat_template, pass enable_thinking=False:

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # <-- this
)
response = generate(model, tokenizer, text)

That's it. No <think> blocks, no reasoning preamble, just the answer. JSON parses cleanly. max_tokens budget goes to the actual response.

Where the flag has to go

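This took me embarrassingly long to figure out. The flag belongs at template apply time, not at generation time. You can't pass it to model.generate() and have it work. You can't set it as a tokenizer kwarg at load time. It only takes effect inside apply_chat_template, because enable_thinking is a variable the chat template reads while it renders the prompt. Nothing downstream of that rendering step ever sees it.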

I tried these wrong things first:

# These do nothing — flag is ignored
generate(model, tokenizer, prompt, enable_thinking=False)
tokenizer = AutoTokenizer.from_pretrained(model_id, enable_thinking=False)
model.generate(prompt, enable_thinking=False)

If you've inherited a codebase where chat formatting is wrapped in a custom function, the wrapper probably calls apply_chat_template somewhere. That's the spot. Patch it there.
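
A wrapper that swallows the flag is usually one with a fixed argument list. One way to patch it, sketched here with a hypothetical wrapper name (format_chat), is to forward arbitrary template keyword arguments so that enable_thinking and any future template-level flags reach apply_chat_template. The tokenizer is assumed to expose the same apply_chat_template method used in the snippets above.

```python
from typing import Any


def format_chat(
    tokenizer: Any,
    messages: list[dict[str, str]],
    **template_kwargs: Any,
) -> str:
    """Render messages to a prompt string, forwarding template-level flags."""
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        **template_kwargs,  # e.g. enable_thinking=False
    )
```

A call site then looks like format_chat(tokenizer, messages, enable_thinking=False), and existing callers that pass no extra arguments keep working unchanged.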

When you actually want thinking on

For interactive chat where a user reads the response, leaving enable_thinking=True (the default) usually helps. The model is genuinely smarter on multi-step reasoning when it gets to think out loud. Math problems, code debugging, multi-constraint planning — all measurably better with thinking on.

So the rule isn't "always disable." It's "disable it on any path where the output gets machine-parsed, keep it on for any path where a human reads it."

In my own setup (a multi-agent local stack on M1 Max — full hardware notes in the 19 GB memory compression writeup), I split into two generate functions:

def generate_for_agent(messages, max_tokens=512):
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True,
        enable_thinking=False  # parser-safe
    )
    return generate(model, tokenizer, text, max_tokens=max_tokens)

def generate_for_chat(messages, max_tokens=2000):
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True,
        enable_thinking=True  # quality boost for chat
    )
    return generate(model, tokenizer, text, max_tokens=max_tokens)

Two functions, two contexts. Same model, same tokenizer, different chat template flag. Clean separation.

Why the docs don't surface this

This is speculation on my part, not anything authoritative, but here's what I think happened. Qwen 3.6 launched as Alibaba's flagship reasoning model. The whole pitch is "thinks before it answers." Disabling that flag in the quickstart would undercut the marketing of the feature itself. So the docs assume you want thinking on by default, and the flag is buried in the API reference, not the first-page tutorial.

If your use case is agent JSON, you'll find this gotcha on day one. If your use case is human chat, you might never need to touch the flag and won't see why anyone would.

It's a real-world case where the default optimizes for the most demo-worthy path, not the most common production path.

Verification

After patching, you can verify the flag took effect by inspecting the rendered template before generation:

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False
)
print(text[-200:])  # tail of the prompt

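What you see in the tail depends on the template version. Some templates add nothing extra when thinking is off. The original Qwen3 template instead appends an empty, already-closed <think></think> pair to the assistant generation prompt. Either is fine. What you should not see is an open <think> with no closing tag, or a tail that looks identical whether or not you pass the flag. In that case the flag didn't apply, most likely because you're calling a wrapper that doesn't pass it through.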

You can also check by inspecting the first 100 tokens of any response. Reasoning-on output starts with <think>. Reasoning-off output starts with the actual answer.
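
A check that doesn't depend on knowing exactly what the tail should look like is to render the prompt both ways and confirm the two strings differ. The sketch below assumes a tokenizer whose apply_chat_template accepts enable_thinking. The function name is made up for illustration.

```python
from typing import Any


def thinking_flag_changes_prompt(
    tokenizer: Any, messages: list[dict[str, str]]
) -> bool:
    """Return True if enable_thinking alters the rendered prompt."""

    def render(flag: bool) -> str:
        return tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=flag,
        )

    # Identical renders mean the template (or a wrapper around it)
    # never looked at the flag.
    return render(True) != render(False)
```

If this returns False for your tokenizer or wrapper, the flag is being dropped somewhere before the template sees it.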

What this isn't

This is about the enable_thinking switch as it behaves in Qwen 3.6. Qwen 2.5 and earlier don't have the flag because reasoning mode wasn't a feature yet. It arrived with Qwen3. Other reasoning-mode models (DeepSeek-R1, the o1 family on the OpenAI API) have similar dynamics but different flags or modes, so check their respective chat templates or API docs.

If your output isn't parsable but doesn't have <think> blocks, the cause is somewhere else. Common alternatives I've hit:

  • Trailing whitespace or newlines in the response: strip before parsing
  • Markdown code-fence wrapping around the JSON: strip the opening ```json and the closing ```
  • Model adding explanatory text before/after the JSON: tighten the system prompt with an explicit "no preamble, no explanation"

The <think> block fix only solves the reasoning-leak case. The other cases need other fixes.
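
For those other cases, a tolerant parser can serve as a backstop alongside prompt fixes. This is a rough sketch with a hypothetical helper name (parse_json_reply). It assumes the reply contains a single JSON object and that no "{" appears in any text before it. It relies on json.JSONDecoder.raw_decode, which parses one value starting at a given index and ignores whatever follows.

```python
import json
from typing import Any


def parse_json_reply(raw: str) -> Any:
    """Parse the first JSON object in raw, ignoring text around it."""
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    # raw_decode stops at the end of the object, so trailing whitespace,
    # a closing code fence, or an explanation after it is ignored.
    obj, _end = json.JSONDecoder().raw_decode(raw, start)
    return obj


print(parse_json_reply('```json\n{"action": "search"}\n```'))
# {'action': 'search'}
```

Starting at the first "{" covers leading whitespace, an opening fence, and a plain-text preamble in one step. It does not rescue a reply whose preamble contains a brace, so the system prompt still matters.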

The smaller lesson

When a new model breaks an existing pipeline silently, the bug is usually in the chat template, not the generate call. The template is the interface between your code and the model's expectations. Most upstream API changes happen there.

For Qwen 3.6, the gotcha is enable_thinking. For the next model in two months, it'll be something else. The diagnostic habit — log the rendered template, not just the response — saves hours over the year.
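
That habit is easy to bake into a thin wrapper. A sketch under stated assumptions: logged_generate and the logger name are hypothetical, and generate_fn stands for any callable that takes the rendered prompt string plus keyword arguments and returns the response text.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("agent.prompt")  # hypothetical logger name


def logged_generate(
    generate_fn: Callable[..., str],
    prompt_text: str,
    preview_chars: int = 200,
    **gen_kwargs: Any,
) -> str:
    """Call generate_fn, logging the prompt tail and the response head."""
    # The tail holds the assistant generation prompt, where template-level
    # changes such as thinking markers show up.
    logger.debug("prompt tail: %r", prompt_text[-preview_chars:])
    response = generate_fn(prompt_text, **gen_kwargs)
    logger.debug("response head: %r", response[:preview_chars])
    return response
```

With a generate(model, tokenizer, text, ...) function like the one in the snippets above, functools.partial(generate, model, tokenizer) produces a suitable generate_fn. Logging at DEBUG level keeps the output silent until something breaks and you turn it on.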

If you've hit a different Qwen 3.6 surprise that nobody flags, I'd genuinely like to know. Reply on the post.

Come along for the ride — see me fall or thrive, whichever comes first.