Qwen 3.6 27B and 35B: MTP vs Standard Decoding on a 16 GB GPU
Rost · 2026-05-24 · via DEV Community

I tested speculative decoding (multi-token prediction, MTP) with Qwen 3.6 27B and 35B on an RTX 4080 with 16 GB of VRAM.

For the full picture of token speed versus VRAM trade-offs across models on the same hardware, see the 16 GB VRAM LLM benchmarks with llama.cpp.

What is MTP (multi-token prediction)?

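Multi-token prediction is a speculative decoding technique built directly into certain model checkpoints. Instead of predicting one token per forward pass, the model carries additional "multi-token prediction heads" that propose several future tokens in a single step, which are then verified in parallel. When the speculation is accepted, effective throughput rises while output quality is unchanged.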
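The propose-then-verify loop described above can be sketched in a few lines. This is a toy illustration, not llama.cpp's implementation: target_next and draft_next are hypothetical stand-ins for the full model and the MTP heads, decoding is greedy, and verification runs token by token for clarity where a real engine would check all drafted tokens in one batched forward pass.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_generate(target_next: NextFn, draft_next: NextFn,
                         prompt: List[Token], n_new: int, n_draft: int) -> List[Token]:
    """Greedy draft-and-verify decoding (toy sketch).

    Each step drafts up to n_draft tokens, keeps the longest prefix the
    target agrees with, then appends one token chosen by the target itself.
    """
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. Draft: cheaply propose a few future tokens.
        draft: List[Token] = []
        for _ in range(n_draft):
            draft.append(draft_next(out + draft))
        # 2. Verify: accept drafted tokens while the target would pick the same.
        accepted = 0
        for tok in draft:
            if target_next(out + draft[:accepted]) != tok:
                break
            accepted += 1
        out.extend(draft[:accepted])
        # 3. The target always contributes one token (a correction or a bonus).
        out.append(target_next(out))
    return out[:goal]


def plain_generate(target_next: NextFn, prompt: List[Token], n_new: int) -> List[Token]:
    """Standard autoregressive decoding: one target call per token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


# Hypothetical toy "models": the target counts upward modulo 10;
# the draft agrees except when the context length is a multiple of 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if len(ctx) % 4 == 0 else (ctx[-1] + 1) % 10
```

Because a drafted token is kept only when the target would have produced the same token, speculative_generate returns exactly what plain_generate returns for the same prompt. The draft quality changes only how many target steps are needed, which is the sense in which output quality is unchanged.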

The Qwen 3.6 family ships both as standard GGUF files and as MTP-enabled variants. In llama.cpp, MTP is turned on with:

--spec-type draft-mtp --spec-draft-n-max 3

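--spec-draft-n-max is the key tuning knob. It sets how many speculative tokens MTP proposes per step. Higher values can raise the potential speedup but need extra VRAM for the draft buffers, which is the practical limit on a 16 GB card.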

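What was tested, and how

I measured how two MTP-capable Qwen 3.6 models perform against standard decoding on a GPU (an RTX 4080 with 16 GB of VRAM).

To fit the model weights and the KV cache in VRAM, I used heavily quantized variants:

  • Qwen3.6-27B-UD-IQ3_XXS and Qwen3.6-27B-UD-IQ3_XXS-MTP
  • Qwen3.6-35B-A3B-UD-IQ3_S and Qwen3.6-35B-A3B-UD-IQ3_S-MTP

For each run I recorded two context budgets:

  • Average context: the context size at which llama.cpp needs about 14.8 GB of VRAM, leaving roughly 500 MB of headroom for other applications (Xorg, GNOME Shell, Cursor).
  • Max context: the largest context llama.cpp can allocate while the same desktop applications already occupy about 500 MB of VRAM.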
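To see how much VRAM the desktop already holds before llama.cpp starts, nvidia-smi can be queried from a script. A minimal sketch, assuming a single NVIDIA GPU with nvidia-smi on the PATH; the function names are illustrative and not part of the benchmark setup.

```python
import subprocess
from typing import Tuple


def parse_memory_line(line: str) -> Tuple[int, int]:
    """Parse one 'used, total' line (values in MiB) from nvidia-smi CSV output."""
    used, total = (int(part.strip()) for part in line.split(","))
    return used, total


def query_vram_mib(gpu_index: int = 0) -> Tuple[int, int]:
    """Return (used_mib, total_mib) for one GPU by calling nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_line(result.stdout.strip().splitlines()[0])


def free_for_llama_mib(used_mib: int, total_mib: int) -> int:
    """VRAM left over for model weights, KV cache and draft buffers."""
    return total_mib - used_mib
```

Calling query_vram_mib() with the desktop running and llama.cpp stopped gives the baseline that the context budgets have to work around.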

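Keeping the average context in a practical range matters mainly because of Hermes Agent. I use it as my main assistant, connected to llama.cpp on this machine, and it requires a context of at least 64 K. That is its default, and it refuses to start if the model's window is smaller. Models below that threshold do not have enough working memory to sustain multi-step tool use. For llama.cpp, that means passing --ctx-size 65536 or larger. Any MTP configuration that pushes the average usable context significantly below 64 K is unsuitable for everyday Hermes workloads, so the average-context numbers in the tables below are the ones that matter most for the decision.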

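Both KV cache quantization levels were tested: q8 (higher quality, more VRAM) and q5 (less VRAM, longer context). Note that moving the KV cache from q8 to q5 can cause a noticeable quality drop; in my tests the degradation was severe enough to make q5 unsuitable for my work. The q5 speed and context numbers are listed below for completeness, but you should check response quality on your own tasks before adopting it.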
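A back-of-the-envelope estimate shows why the KV cache type moves the context budget so much. The sketch below uses the GGML block layouts (q8_0 stores 32 values in 34 bytes, q5_1 in 24, q5_0 in 22). The layer count, KV head count and head dimension are hypothetical placeholders, not Qwen 3.6's actual architecture, so read the real values from the GGUF metadata before trusting any absolute number.

```python
# Bytes per block of 32 values for each cache type (GGML block layouts).
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q5_1": 24, "q5_0": 22}
BLOCK_SIZE = 32


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_tokens: int, cache_type: str) -> int:
    """Estimate KV cache size: K and V for every attention layer and token."""
    values = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens
    return values * BLOCK_BYTES[cache_type] // BLOCK_SIZE


# Hypothetical architecture, for illustration only.
ARCH = dict(n_layers=16, n_kv_heads=4, head_dim=256)

q8 = kv_cache_bytes(ctx_tokens=65536, cache_type="q8_0", **ARCH)
q5 = kv_cache_bytes(ctx_tokens=65536, cache_type="q5_1", **ARCH)
print(f"q8_0: {q8 / 2**30:.2f} GiB, q5_1: {q5 / 2**30:.2f} GiB, ratio {q5 / q8:.2f}")
```

Whatever the real dimensions are, the ratio is fixed by the block sizes: a q5_1 cache takes 24/34, about 71%, of the space of a q8_0 cache for the same context. The estimate covers the cache alone and leaves out compute buffers and MTP draft buffers.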

Qwen 3.6 27B: MTP vs standard

KV cache q8

                   MTP max 1   MTP max 2   MTP max 3   MTP max 4   Standard (IQ3_XXS)
Prompt speed       148 t/s     151 t/s     148 t/s     147 t/s     200 t/s
Generation speed   65 t/s      75 t/s      73 t/s      75 t/s      45 t/s
Avg ctx            40 K        40 K        40 K        30 K        80 K
Max ctx            60 K        60 K        60 K        50 K        100 K

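With q8 KV cache, MTP at --spec-draft-n-max 2 delivers about 67% faster generation (75 vs 45 t/s), at the cost of halving the average context window from 80 K to 40 K. Prompt ingestion falls from 200 to about 150 t/s, because MTP requires device-to-host data transfers during the prefill phase.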

KV cache q5

                   MTP max 1   MTP max 2   MTP max 3   MTP max 4   Standard (IQ3_XXS)
Prompt speed       145 t/s     144 t/s     141 t/s     139 t/s     191 t/s
Generation speed   57 t/s      62 t/s      67 t/s      66 t/s      41 t/s
Avg ctx            70 K        60 K        60 K        50 K        130 K
Max ctx            100 K       100 K       90 K        80 K        160 K

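Switching to q5 KV cache restores a meaningful context: --spec-draft-n-max 1 gives a 70 K average context at 57 t/s, a 39% generation speedup over standard decoding while keeping a usable context. At --spec-draft-n-max 3 the average context drops to 60 K while generation reaches 67 t/s (+63%).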
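The percentages quoted for the 27B model follow directly from the generation-speed rows. A quick check in Python, with the table values typed in by hand and the list positions assumed to correspond to draft depths 1 to 4:

```python
# Generation speed in t/s for Qwen 3.6 27B, copied from the two tables above.
STANDARD_TPS = {"q8": 45, "q5": 41}
MTP_TPS = {
    "q8": [65, 75, 73, 75],  # --spec-draft-n-max 1..4
    "q5": [57, 62, 67, 66],
}


def speedup_percent(mtp_tps: float, standard_tps: float) -> int:
    """Relative generation speedup of MTP over standard decoding, in percent."""
    return round((mtp_tps / standard_tps - 1) * 100)


for kv, speeds in MTP_TPS.items():
    for n_max, tps in enumerate(speeds, start=1):
        pct = speedup_percent(tps, STANDARD_TPS[kv])
        print(f"{kv} KV, n_max={n_max}: {tps} t/s, +{pct}% vs standard")
```

The same function applies to the prompt-speed rows, where the result comes out negative because MTP slows prefill.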

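Qwen 3.6 27B takeaways

MTP is a genuinely good fit for the dense 27B model. The sweet spots on 16 GB of VRAM:

  • q8 KV + --spec-draft-n-max 2: best raw speed (75 t/s), with context reduced to 40–60 K
  • q5 KV + --spec-draft-n-max 1: best balance of speed and context (57 t/s, 70 K average context)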
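Checking the 27B results against the 64 K minimum that Hermes Agent needs narrows the choice quickly. A small sketch using the average-context figures from the two 27B tables, treating 64 K as the 65536 tokens passed to --ctx-size and each table "K" as 1000 tokens; the data structure and names are illustrative.

```python
from typing import Dict, List, Tuple

MIN_CTX_TOKENS = 65536  # Hermes Agent minimum, as passed to --ctx-size

# (kv_cache, decoding mode) -> (generation t/s, average context in thousands of tokens)
RESULTS_27B: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("q8", "mtp-1"): (65, 40), ("q8", "mtp-2"): (75, 40),
    ("q8", "mtp-3"): (73, 40), ("q8", "mtp-4"): (75, 30),
    ("q8", "standard"): (45, 80),
    ("q5", "mtp-1"): (57, 70), ("q5", "mtp-2"): (62, 60),
    ("q5", "mtp-3"): (67, 60), ("q5", "mtp-4"): (66, 50),
    ("q5", "standard"): (41, 130),
}


def usable_configs(results: Dict[Tuple[str, str], Tuple[int, int]],
                   min_ctx_tokens: int = MIN_CTX_TOKENS) -> List[Tuple[Tuple[str, str], int, int]]:
    """Configs whose average context meets the minimum, fastest first."""
    ok = [(cfg, tps, ctx_k) for cfg, (tps, ctx_k) in results.items()
          if ctx_k * 1000 >= min_ctx_tokens]
    return sorted(ok, key=lambda row: row[1], reverse=True)


for (kv, mode), tps, ctx_k in usable_configs(RESULTS_27B):
    print(f"{kv} KV, {mode}: {tps} t/s at {ctx_k} K average context")
```

With these figures, the only MTP configuration that clears the bar is q5 KV with --spec-draft-n-max 1; on q8 KV only standard decoding does.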

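Qwen 3.6 35B: MTP vs standard

The 35B model is a Mixture-of-Experts (MoE) architecture: 35B-A3B means 35 billion total parameters with about 3 billion active per token. MoE models often benefit from MTP, because sparse routing makes the MTP heads cheap to compute relative to a full forward pass.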

KV cache q8

                   MTP max 1   MTP max 2   MTP max 3   MTP max 4   Standard (IQ3_S)
Prompt speed       277 t/s     277 t/s     265 t/s     275 t/s     368 t/s
Generation speed   186 t/s     189 t/s     180 t/s     171 t/s     146 t/s
Avg ctx            15 K        10 K        -           -           80 K
Max ctx            80 K        -           60 K        -           -

With q8 KV cache, --spec-draft-n-max 1 leaves an average context of only 15 K, enough only for routine tasks. At higher draft depths there is no usable average context at all on a 16 GB card.

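This is the core VRAM cost of MTP on consumer hardware: the extra draft buffers eat directly into the remaining VRAM budget, and the 35B-A3B model with q8 KV cache leaves very little to spare.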

KV cache q5

                   MTP max 1   MTP max 2   MTP max 3   MTP max 4   Standard (IQ3_S)
Prompt speed       264 t/s     266 t/s     270 t/s     264 t/s     343 t/s
Generation speed   151 t/s     147 t/s     137 t/s     131 t/s     122 t/s
Avg ctx            10 K        -           -           -           120 K
Max ctx            120 K       110 K       110 K       80 K        200 K

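Switching to q5 KV cache improves the average context only marginally. --spec-draft-n-max 1 shows a 10 K context at 151 t/s. Standard decoding at q5 runs at 122 t/s with a 120 K average context.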

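On a 16 GB GPU, the 35B MoE model with MTP hits a hard wall: the average usable context collapses to 10–15 K tokens, which is impractical for real work. Standard decoding at 122–146 t/s with 80–120 K tokens of context is far more useful.

If you have 24 GB or more of VRAM, the 35B + MTP combination becomes much more attractive: the context-window problem goes away and the speed benefit remains.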

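Choosing the right --spec-draft-n-max

There is no single right answer for the number of speculative tokens proposed per step (--spec-draft-n-max); it depends on the model architecture and the available VRAM:

  • 27B dense on 16 GB: --spec-draft-n-max 2 with q8 KV for speed, or --spec-draft-n-max 1 with q5 KV for the best context.
  • 35B MoE on 16 GB: --spec-draft-n-max 1 is the only setting that keeps a usable context, and only barely.
  • Higher values (3, 4) add VRAM pressure without a proportional speedup: in the 27B q8 runs, 4 generated at the same 75 t/s as 2 while the average context fell from 40 K to 30 K.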
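One way to see why deeper drafts pay off less and less is the usual speculative-decoding estimate: if each drafted token is accepted independently with probability a, a step with n drafted tokens yields 1 + a + a^2 + ... + a^n tokens on average. The sketch below evaluates that sum. The acceptance rate of 0.7 is a made-up illustration, not something measured in these benchmarks, and the model ignores the per-step cost of drafting and verifying.

```python
def expected_tokens_per_step(accept_prob: float, n_draft: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes each drafted token is accepted independently with probability
    accept_prob and that acceptance stops at the first rejection; the target
    model always adds one token of its own.
    """
    return sum(accept_prob ** k for k in range(n_draft + 1))


ALPHA = 0.7  # hypothetical acceptance rate, for illustration only
for n in range(1, 5):
    total = expected_tokens_per_step(ALPHA, n)
    gain = total - expected_tokens_per_step(ALPHA, n - 1)
    print(f"n_max={n}: {total:.2f} tokens/step (+{gain:.2f} over n_max={n - 1})")
```

Each extra draft position adds a^n tokens per step, so the gain shrinks geometrically while the draft buffers keep growing, which is consistent with the flat speeds seen at 3 and 4.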

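How to enable MTP in llama.cpp

Make sure you use the MTP GGUF (the filename contains MTP). If you are new to llama.cpp flags, the llama.cpp quickstart with CLI and server covers the basics. Then launch llama-server or llama-cli with: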

llama-server \
  --model Qwen3.6-27B-UD-IQ3_XXS-MTP.gguf \
  --ctx-size 40000 \
  -ngl 99 --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --spec-type draft-mtp \
  --spec-draft-n-max 2

For q5 KV cache, replace q8_0 with q5_1 or q5_0 and raise --ctx-size:

llama-server \
  --model Qwen3.6-27B-UD-IQ3_XXS-MTP.gguf \
  --ctx-size 80000 \
  -ngl 99 --flash-attn on \
  --cache-type-k q5_1 --cache-type-v q5_1 \
  --spec-type draft-mtp \
  --spec-draft-n-max 1

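MTP takes effect when llama.cpp finds MTP heads in the GGUF file and --spec-type draft-mtp is set.
So Qwen3.6-27B-UD-IQ3_XXS.gguf cannot run in MTP mode; you need Qwen3.6-27B-UD-IQ3_XXS-MTP.gguf.
The MTP file, however, works in both speculative-decoding mode and plain autoregressive mode.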
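When switching between configurations often, it can help to assemble the command line programmatically. A minimal sketch that uses only the flags shown in the commands above; the helper name is hypothetical, and the filename check relies on the naming convention of the files tested here, not on inspecting the GGUF contents.

```python
from pathlib import Path
from typing import List, Optional


def build_llama_server_args(model: str, ctx_size: int, kv_type: str = "q8_0",
                            draft_n_max: Optional[int] = None) -> List[str]:
    """Assemble a llama-server command line from the flags used in this article.

    draft_n_max=None means standard decoding; an integer enables MTP.
    """
    args = ["llama-server", "--model", model, "--ctx-size", str(ctx_size),
            "-ngl", "99", "--flash-attn", "on",
            "--cache-type-k", kv_type, "--cache-type-v", kv_type]
    if draft_n_max is not None:
        # Assumption: MTP-enabled files carry "MTP" in their filename.
        if "MTP" not in Path(model).name:
            raise ValueError(f"{model} does not look like an MTP GGUF")
        args += ["--spec-type", "draft-mtp", "--spec-draft-n-max", str(draft_n_max)]
    return args


print(" ".join(build_llama_server_args(
    "Qwen3.6-27B-UD-IQ3_XXS-MTP.gguf", 40000, "q8_0", draft_n_max=2)))
```

Leaving draft_n_max unset on the MTP file produces a plain autoregressive launch, which makes it easy to compare both modes with the same weights.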

Conclusion

On a 16 GB GPU (RTX 4080) with these quantization settings, MTP in llama.cpp is a clear win for Qwen 3.6 27B and, in practical terms, a clear loss for Qwen 3.6 35B:

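Qwen 3.6 27B (IQ3_XXS): MTP is worth using

  • q8 KV + MTP max 2 → about 67% faster generation, 40–60 K context (vs 80–100 K without MTP)
  • q5 KV + MTP max 1 → about 39% faster generation, 70–100 K context (vs 130–160 K without MTP)
  • The best balance of speed and VRAM efficiency is at --spec-draft-n-max 2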

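Qwen 3.6 35B (IQ3_S): MTP is not practical at 16 GB

  • Generation speed rises 27–29%, but the average context shrinks to 10–15 K on q8 and 5 K on q5
  • Standard decoding at 122–146 t/s with 80–120 K of context is the better fit for real work
  • The picture improves considerably with 24 GB or more of VRAM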

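On paper, q5 KV cache is the obvious answer for maximizing the context window while keeping MTP's speed benefit. In practice, the quality drop from q8 to q5 is substantial. Test q5 on your own tasks before adopting it; for my work the degradation was unacceptable, and q8 with a tighter context budget remains the better trade-off.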

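For the broader picture of LLM serving options and infrastructure trade-offs, see the 2026 LLM hosting pillar and the 2026 LLM performance article. If you are tuning Qwen 3.6 sampler settings alongside MTP, the agentic LLM inference parameters reference for Qwen 3.6 and Gemma 4 is a useful companion.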