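I tested speculative decoding (multi-token prediction, MTP) with Qwen 3.6 27B and 35B on an RTX 4080 with 16 GB of VRAM.
For the full picture of tokens-per-second and VRAM trade-offs across models on the same hardware, see the 16 GB VRAM LLM benchmarks with llama.cpp.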
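What is MTP (multi-token prediction)?
Multi-token prediction is a speculative decoding technique built directly into certain model checkpoints. Instead of predicting one token per forward pass, the model carries extra MTP heads that propose several future tokens in a single step, and those proposals are then verified in parallel. When the speculation is accepted, effective throughput rises while output quality is unchanged.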
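The draft-then-verify idea can be sketched in a few lines. This is a toy illustration, not llama.cpp's implementation: `toy_target` and `toy_draft` are made-up stand-ins for the main model and its MTP heads, verification is done one position at a time rather than in one batched pass, and only greedy decoding is modelled.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def greedy_decode(target: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Plain decoding: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[Token], n_new: int, n_draft: int
) -> Tuple[List[Token], int]:
    """Draft n_draft tokens per step and keep only what the target agrees with."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    steps = 0
    while len(seq) < goal:
        steps += 1
        # 1) The cheap drafter proposes n_draft tokens.
        proposal: List[Token] = []
        for _ in range(n_draft):
            proposal.append(draft(seq + proposal))
        # 2) The target checks each position. A real engine does this in one
        #    batched forward pass; the toy version loops instead.
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(seq + accepted)
            accepted.append(expected)  # always keep the target's own token
            if tok != expected:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every draft matched, so the verification also yields one extra token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], steps


def toy_target(seq: List[Token]) -> Token:
    """Hypothetical deterministic 'model'."""
    return (seq[-1] * 3 + 1) % 7


def toy_draft(seq: List[Token]) -> Token:
    """Hypothetical drafter that is wrong at every third position."""
    guess = toy_target(seq)
    return (guess + 1) % 7 if len(seq) % 3 == 0 else guess


plain = greedy_decode(toy_target, [1], 12)
fast, steps = speculative_decode(toy_target, toy_draft, [1], 12, n_draft=3)
print("same output:", fast == plain, "| verification steps:", steps)
```

Because every emitted token is the one the target would have chosen anyway, the output matches plain decoding; the gain is that fewer verification steps are needed than tokens generated.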
The Qwen 3.6 series ships both standard GGUF files and MTP-enabled variants. In llama.cpp, MTP is enabled with:
--spec-type draft-mtp --spec-draft-n-max 3
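--spec-draft-n-max is the key tuning knob. It sets how many speculative tokens MTP proposes per step. Higher values can raise the potential speedup, but they need extra VRAM for the draft buffers, and that is the practical limit on a 16 GB card.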
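To see why a larger draft depth gives diminishing returns, here is a rough model under a simplifying assumption that is mine, not something measured in these tests: each drafted token is accepted independently with the same probability `accept_prob`. Under that assumption the expected number of tokens emitted per verification step is 1 + p + p^2 + ... + p^n. The value 0.7 below is purely illustrative.

```python
def expected_tokens_per_step(accept_prob: float, n_draft: int) -> float:
    """Expected tokens emitted per step: sum of p**k for k = 0..n_draft.

    Assumes independent, identical acceptance probability per drafted token.
    The k = 0 term is the token the verification pass always produces.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    return sum(accept_prob ** k for k in range(n_draft + 1))


# Illustrative acceptance rate; not taken from the benchmark runs.
for n in (1, 2, 3, 4):
    print(f"n_draft={n}: {expected_tokens_per_step(0.7, n):.2f} tokens per step")
```

Each extra drafted token adds less than the previous one, while its buffer cost in VRAM does not shrink, which is one way to read the flattening generation speeds at higher `--spec-draft-n-max` values.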
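What was tested, and how
I tested how two MTP-capable Qwen 3.6 models perform against standard decoding on a GPU (RTX 4080 with 16 GB of VRAM).
To fit the model weights and the KV cache in VRAM, I used heavily quantized variants: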
- Qwen3.6-27B-UD-IQ3_XXS and Qwen3.6-27B-UD-IQ3_XXS-MTP
- Qwen3.6-35B-A3B-UD-IQ3_S and Qwen3.6-35B-A3B-UD-IQ3_S-MTP

For each run I recorded two context budgets:
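- Avg Ctx: the context size at which llama.cpp needs about 14.8 GB of VRAM, leaving roughly 500 MB of headroom for other applications (Xorg, GNOME Shell, Cursor).
- Max Ctx: the largest context llama.cpp can allocate while those same desktop applications already occupy about 500 MB of VRAM.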
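Keeping the average context in a practical range matters mainly because of Hermes Agent, the main agent I connect to llama.cpp on this machine. It requires a context of at least 64 K as standard and refuses to start if the model's window is smaller. Models below that threshold do not have enough working memory to sustain multi-step tool use. For llama.cpp, that means passing --ctx-size 65536 or larger. Any MTP configuration that pushes the average usable context well below 64 K is unsuitable for everyday Hermes workloads, so the Avg Ctx figures in the tables below are the numbers that matter most for the decision.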
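I tested two KV cache quantization levels: q8 (higher quality, more VRAM) and q5 (less VRAM, more context). Note that moving from a q8 to a q5 KV cache can cause a marked loss of quality; in my tests the degradation was severe enough to make q5 unsuitable for my work. The q5 speed and context numbers are listed below for completeness, but test the response quality on your own tasks before adopting it.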
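As a rough way to reason about why q5 stretches the context, the sketch below estimates KV cache size from ggml's block layouts (bytes per 32-element block: f16 = 64, q8_0 = 34, q5_1 = 24, q5_0 = 22). The layer count, KV head count, and head size in the example call are hypothetical placeholders, not Qwen 3.6's real architecture, and the estimate ignores weights, draft buffers, and compute buffers.

```python
# Bytes per 32-element block for the cache types used in this article.
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q5_1": 24, "q5_0": 22}
BLOCK_SIZE = 32


def kv_cache_bytes(
    ctx_tokens: int, n_attn_layers: int, n_kv_heads: int, head_dim: int, cache_type: str
) -> float:
    """Estimate KV cache size: K and V tensors for every attention layer."""
    elems_per_token = 2 * n_attn_layers * n_kv_heads * head_dim  # 2 = K plus V
    return ctx_tokens * elems_per_token * BLOCK_BYTES[cache_type] / BLOCK_SIZE


# Hypothetical architecture numbers, chosen only to make the example concrete.
ARCH = dict(n_attn_layers=16, n_kv_heads=4, head_dim=256)

for cache_type in ("q8_0", "q5_1", "q5_0"):
    gib = kv_cache_bytes(65536, cache_type=cache_type, **ARCH) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB for a 64 K context")

# For a fixed cache budget, context scales inversely with bytes per element.
print(f"q5_1 fits about {BLOCK_BYTES['q8_0'] / BLOCK_BYTES['q5_1']:.2f}x the context of q8_0")
```

The measured context gains in the tables differ from this simple ratio, since the KV cache is only one part of the VRAM budget.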
Qwen 3.6 27B: MTP vs standard
KV cache q8
| | MTP max 1 | MTP max 2 | MTP max 3 | MTP max 4 | Standard (IQ3_XXS) |
|---|---|---|---|---|---|
| Prompt speed | 148 t/s | 151 t/s | 148 t/s | 147 t/s | 200 t/s |
| Generation speed | 65 t/s | 75 t/s | 73 t/s | 75 t/s | 45 t/s |
| Avg Ctx | 40 K | 40 K | 40 K | 30 K | 80 K |
| Max Ctx | 60 K | 60 K | 60 K | 50 K | 100 K |
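With q8 KV cache, MTP at --spec-draft-n-max 2 delivers about 67% faster generation (75 vs 45 t/s), at the cost of halving the average context window from 80 K to 40 K. Prompt ingestion drops from 200 to about 150 t/s, because MTP requires device-to-host data transfers during the prefill phase.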
KV cache q5
| | MTP max 1 | MTP max 2 | MTP max 3 | MTP max 4 | Standard (IQ3_XXS) |
|---|---|---|---|---|---|
| Prompt speed | 145 t/s | 144 t/s | 141 t/s | 139 t/s | 191 t/s |
| Generation speed | 57 t/s | 62 t/s | 67 t/s | 66 t/s | 41 t/s |
| Avg Ctx | 70 K | 60 K | 60 K | 50 K | 130 K |
| Max Ctx | 100 K | 100 K | 90 K | 80 K | 160 K |
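Switching to q5 KV cache recovers meaningful context: --spec-draft-n-max 1 gives a 70 K average context at 57 t/s, a 39% generation speedup over standard decoding while keeping a useful context. At --spec-draft-n-max 3, the context drops to 60 K while generation reaches 67 t/s (+63%).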
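The percentages quoted for the 27B model follow directly from the two tables. A small sketch that recomputes them, with the columns read left to right as n_max 1 to 4 (the dictionary layout and function name are my own):

```python
def speedup_percent(mtp_tps: float, baseline_tps: float) -> float:
    """Generation speedup of an MTP run over standard decoding, in percent."""
    return (mtp_tps / baseline_tps - 1.0) * 100.0


# (generation t/s, Avg Ctx in K) copied from the 27B tables above.
Q8 = {"standard": (45, 80), 1: (65, 40), 2: (75, 40), 3: (73, 40), 4: (75, 30)}
Q5 = {"standard": (41, 130), 1: (57, 70), 2: (62, 60), 3: (67, 60), 4: (66, 50)}

for kv_name, table in (("q8", Q8), ("q5", Q5)):
    base_tps, base_ctx = table["standard"]
    for n_max in (1, 2, 3, 4):
        tps, ctx = table[n_max]
        print(
            f"{kv_name} n_max={n_max}: {speedup_percent(tps, base_tps):+.0f}% generation, "
            f"{ctx / base_ctx:.0%} of the standard Avg Ctx"
        )
```

Printing the context ratio next to the speedup makes the trade visible in one line per configuration: every MTP setting on this card pays for its speed with roughly half of the average context or more.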
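Qwen 3.6 27B takeaways
MTP is genuinely worthwhile for the 27B dense model. The sweet spots on 16 GB of VRAM:
- q8 KV + --spec-draft-n-max 2: best raw speed (75 t/s), with context reduced to 40-60 K
- q5 KV + --spec-draft-n-max 1: best balance of speed and context (57 t/s, 70 K average context)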
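Qwen 3.6 35B: MTP vs standard
The 35B model is a Mixture-of-Experts (MoE) architecture: 35B-A3B means 35 billion parameters in total, with about 3 billion active per token. MoE models often benefit from MTP, because sparse routing makes the MTP heads cheap to compute relative to a full forward pass.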
KV cache q8
| | MTP max 1 | MTP max 2 | MTP max 3 | MTP max 4 | Standard (IQ3_S) |
|---|---|---|---|---|---|
| Prompt speed | 277 t/s | 277 t/s | 265 t/s | 275 t/s | 368 t/s |
| Generation speed | 186 t/s | 189 t/s | 180 t/s | 171 t/s | 146 t/s |
| Avg Ctx | 15 K | 10 K | none | none | 80 K |
| Max Ctx | 80 K | [...] | 60 K | [...] | [...] |
--spec-draft-n-max 1 leaves an average context of only 15 K, barely enough for trivial tasks. At higher draft depths there is no usable average context at all on a 16 GB card.
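This is the core VRAM cost of MTP on consumer hardware: the extra draft buffers eat directly into the remaining VRAM budget, and the 35B-A3B model with q8 KV cache leaves very little to spare.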
KV cache q5
| | MTP max 1 | MTP max 2 | MTP max 3 | MTP max 4 | Standard (IQ3_S) |
|---|---|---|---|---|---|
| Prompt speed | 264 t/s | 266 t/s | 270 t/s | 264 t/s | 343 t/s |
| Generation speed | 151 t/s | 147 t/s | 137 t/s | 131 t/s | 122 t/s |
| Avg Ctx | 10 K | none | none | none | 120 K |
| Max Ctx | 120 K | 110 K | 110 K | 80 K | 200 K |
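Switching to q5 KV cache barely helps the average context. --spec-draft-n-max 1 shows a 10 K context at 151 t/s. Standard decoding with q5 KV cache runs at 122 t/s with a 120 K average context.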
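On a 16 GB GPU, the 35B MoE model with MTP hits a hard wall: the average usable context collapses to 10-15 K tokens, which is too little for real work. Standard decoding at 122-146 t/s with 80-120 K tokens of context is far more useful.
If you have 24 GB or more of VRAM, the 35B + MTP combination becomes much more attractive: the context window problem goes away and the speed benefit remains.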
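Choosing the right --spec-draft-n-max value
The number of speculative tokens proposed per step (--spec-draft-n-max) has no single right answer; it depends on the model architecture and the VRAM available:
- 27B dense on 16 GB: --spec-draft-n-max 2 with q8 KV for speed; --spec-draft-n-max 1 with q5 KV for the best context.
- 35B MoE on 16 GB: --spec-draft-n-max 1 is the only setting that keeps any usable context, and only barely.
- Higher values (3, 4) add VRAM pressure without a matching speedup: on the 27B with q8 KV, max 4 generates no faster than max 2 (75 t/s) yet leaves less average context (30 K vs 40 K).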
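The selection rule above can be written down as "fastest configuration that still meets a minimum average context". The sketch below applies it to the measured numbers from the four tables; the `Run` type and `fastest` function are illustrative helpers of my own, n_max 0 stands for standard decoding, and `None` marks runs with no usable average context.

```python
from typing import List, NamedTuple, Optional


class Run(NamedTuple):
    model: str
    kv: str
    n_max: int                 # 0 = standard decoding (no MTP)
    gen_tps: int               # generation speed, t/s
    avg_ctx_k: Optional[int]   # Avg Ctx in K tokens; None = no usable context


def rows(model: str, kv: str, data) -> List[Run]:
    return [Run(model, kv, n, tps, ctx) for n, (tps, ctx) in enumerate(data)]


# Index 0 is standard decoding, then n_max 1..4, as measured on the RTX 4080.
RUNS: List[Run] = (
    rows("27B", "q8", [(45, 80), (65, 40), (75, 40), (73, 40), (75, 30)])
    + rows("27B", "q5", [(41, 130), (57, 70), (62, 60), (67, 60), (66, 50)])
    + rows("35B", "q8", [(146, 80), (186, 15), (189, 10), (180, None), (171, None)])
    + rows("35B", "q5", [(122, 120), (151, 10), (147, None), (137, None), (131, None)])
)


def fastest(model: str, min_ctx_k: int, kv: Optional[str] = None) -> Optional[Run]:
    """Return the fastest run for `model` whose Avg Ctx is at least min_ctx_k."""
    candidates = [
        r for r in RUNS
        if r.model == model
        and (kv is None or r.kv == kv)
        and r.avg_ctx_k is not None
        and r.avg_ctx_k >= min_ctx_k
    ]
    return max(candidates, key=lambda r: r.gen_tps) if candidates else None


print(fastest("27B", 64))            # any KV type, Hermes-sized context
print(fastest("27B", 64, kv="q8"))   # q8 only
print(fastest("27B", 40, kv="q8"))   # smaller context is acceptable
print(fastest("35B", 64))
```

With a 64 K floor, the only MTP run that survives is the 27B at q5 KV with n_max 1; if q5 quality is unacceptable, as it was for me, standard decoding is what remains at that context size.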
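How to enable MTP in llama.cpp
Make sure you use the MTP GGUF (the file name contains MTP). If you are new to llama.cpp flags, the llama.cpp quickstart for CLI and server covers the basics. Then start llama-server or llama-cli with: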
llama-server \
--model Qwen3.6-27B-UD-IQ3_XXS-MTP.gguf \
--ctx-size 40000 \
-ngl 99 --flash-attn on \
--cache-type-k q8_0 --cache-type-v q8_0 \
--spec-type draft-mtp \
--spec-draft-n-max 2
For q5 KV cache, replace q8_0 with q5_1 or q5_0 and raise --ctx-size:
llama-server \
--model Qwen3.6-27B-UD-IQ3_XXS-MTP.gguf \
--ctx-size 80000 \
-ngl 99 --flash-attn on \
--cache-type-k q5_1 --cache-type-v q5_1 \
--spec-type draft-mtp \
--spec-draft-n-max 1
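MTP activates when llama.cpp finds MTP heads in the GGUF file and --spec-type draft-mtp is set.
So Qwen3.6-27B-UD-IQ3_XXS.gguf cannot run in MTP mode; you need Qwen3.6-27B-UD-IQ3_XXS-MTP.gguf.
Qwen3.6-27B-UD-IQ3_XXS-MTP.gguf, on the other hand, works in both speculative decoding mode and standard autoregressive mode.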
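If you launch the server from a script, a small guard can catch the wrong-file mistake before llama.cpp does. This is a sketch under two assumptions: the flags are exactly the ones shown in the commands above, and the "MTP in the file name" check is only the naming convention of these GGUFs, not something llama.cpp itself relies on. The function name is my own.

```python
import shlex
from pathlib import Path
from typing import List


def build_llama_server_args(
    model_path: str, ctx_size: int, kv_type: str = "q8_0", spec_draft_n_max: int = 0
) -> List[str]:
    """Build a llama-server command line; spec_draft_n_max=0 means standard decoding."""
    args = [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        "-ngl", "99", "--flash-attn", "on",
        "--cache-type-k", kv_type, "--cache-type-v", kv_type,
    ]
    if spec_draft_n_max > 0:
        # Heuristic based on the file naming convention used in this article.
        if "MTP" not in Path(model_path).name:
            raise ValueError(f"{model_path} does not look like an MTP GGUF")
        args += ["--spec-type", "draft-mtp", "--spec-draft-n-max", str(spec_draft_n_max)]
    return args


# Print the command instead of running it, so it can be inspected first.
print(shlex.join(build_llama_server_args(
    "Qwen3.6-27B-UD-IQ3_XXS-MTP.gguf", 40000, "q8_0", spec_draft_n_max=2
)))
```

Passing the returned list to `subprocess.run` would start the server; leaving `spec_draft_n_max` at 0 runs the same MTP file in standard autoregressive mode.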
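Conclusion
On a 16 GB GPU (RTX 4080) with these quantizations, MTP in llama.cpp is a practical win for Qwen 3.6 27B and a practical loss for Qwen 3.6 35B:
Qwen 3.6 27B (IQ3_XXS): MTP is worth using
- q8 KV + MTP max 2 → about 67% faster generation, context 40-60 K (vs 80-100 K without MTP)
- q5 KV + MTP max 1 → about 39% faster generation, context 70-100 K (vs 130-160 K without MTP)
- The best balance of speed and VRAM efficiency is at --spec-draft-n-max 2
Qwen 3.6 35B (IQ3_S): MTP is not practical on 16 GB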
- Generation is 27 to 29% faster with q8 KV, but the average context shrinks to 10-15 K on q8 and 10 K on q5
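- Standard decoding at 122-146 t/s with 80-120 K of context is the better fit for real work
- The situation improves considerably with 24 GB or more of VRAM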
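On paper, q5 KV cache is the obvious answer for maximizing the context window while keeping MTP's speed benefit. In practice, the quality drop from q8 to q5 is substantial. Test q5 on your own tasks before adopting it; for my work the degradation was unacceptable, and q8 with a tighter context budget remains the better trade-off.
For the broader picture of LLM serving options and infrastructure trade-offs, see the LLM Hosting in 2026 pillar and LLM Performance in 2026. If you are tuning Qwen 3.6 sampler settings alongside MTP, the Qwen 3.6 and Gemma 4 agentic LLM inference parameter reference is a good companion.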












