Accelerating Qwen3-8B Agent on Intel® Core™ Ultra with Depth-Pruned Draft Models

Igor Margulis, Ofir Zafrir, Shira Guskin, Guy Boudoukh, Pedro Cu · 2025-09-29 · via Hugging Face - Blog

TL;DR:

Qwen3-8B is one of the most exciting recent releases: a model with native agentic capabilities, making it a natural fit for the AI PC.
With OpenVINO.GenAI, we accelerated generation by ~1.3× using speculative decoding with a lightweight Qwen3-0.6B draft.
By applying a simple pruning process to the draft, we pushed the speedup further to ~1.4×.
We wrap up by showing how these improvements can be used to run a fast, local AI agent with 🤗 smolagents.

Qwen3

Qwen3-8B is part of the latest Qwen family, trained with explicit agentic behaviors. It supports tool invocation, multi-step reasoning, and long-context handling, capabilities that make it well-suited for complex agent workflows. When integrated with frameworks like Hugging Face 🤗smolagents, QwenAgent, or AutoGen, it enables a wide range of agentic applications built around tool use and reasoning. Unlike single-turn chatbots, agentic applications rely on reasoning models that produce “thinking aloud” traces: intermediate steps that expand token usage and make inference speed critical to responsiveness. The combination of optimized inference and built-in agentic intelligence makes Qwen3-8B a compelling foundation for next-gen AI agents.

Accelerating Qwen3-8B on Intel® Core™ Ultra with Speculative Decoding

We started by benchmarking the 4-bit optimized OpenVINO version of Qwen3-8B on an Intel Lunar Lake integrated GPU, establishing this as our baseline for further acceleration.

Speculative decoding is a method to speed up auto-regressive generation. It uses a smaller, faster draft model to propose several tokens ahead, which the larger target model then validates in a single forward pass. In our setup, Qwen3-8B served as the target model while Qwen3-0.6B was used as the draft. This approach delivered an average speedup of 1.3× over the baseline.

from openvino_genai import LLMPipeline, draft_model

# Paths to the OpenVINO-converted target and draft models
target_path = "/path/to/target/Qwen3-8B-int4-ov"
draft_path = "/path/to/draft/Qwen3-0.6B-int8-ov"
device = "GPU"

# Passing a draft model to the pipeline enables speculative decoding
model = LLMPipeline(target_path, device, draft_model=draft_model(draft_path, device))

# Print each decoded chunk as soon as it is produced
streamer = lambda x: print(x, end="", flush=True)
model.generate("What is speculative decoding and how does it improve inference speed?", max_new_tokens=100, streamer=streamer)

Before initializing the LLMPipeline, make sure both the target and draft models are converted to OpenVINO. You can either download pre-converted models from the provided links or follow these instructions to convert your own.
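
To make the draft-then-verify loop more concrete, here is a minimal toy sketch of greedy speculative decoding in plain Python. It is not how OpenVINO GenAI implements the feature: the "models" are hypothetical functions that map a token list to the next token, and the target is called once per position for readability, whereas a real implementation scores all proposed positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: token sequence in, next token out (greedy).
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], window: int) -> List[int]:
    """Run one draft-then-verify step and return the newly accepted tokens."""
    # 1. The draft proposes `window` tokens, one after another.
    proposed: List[int] = []
    for _ in range(window):
        proposed.append(draft(tokens + proposed))

    # 2. The target checks each proposed position in order.
    accepted: List[int] = []
    for guess in proposed:
        expected = target(tokens + accepted)
        if guess != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(guess)

    # 3. Every guess matched, so the target contributes one extra token.
    accepted.append(target(tokens + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, window: int = 4) -> Tuple[List[int], int]:
    """Return the generated sequence and the number of verify steps used."""
    tokens, steps = list(prompt), 0
    while len(tokens) - len(prompt) < max_new_tokens:
        tokens += speculative_step(target, draft, tokens, window)
        steps += 1
    return tokens[:len(prompt) + max_new_tokens], steps


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain one-token-at-a-time decoding with the target, for comparison."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


# Toy models (assumptions for illustration): the target counts modulo 10,
# the draft agrees except that it guesses wrong right after token 2.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 7 if seq[-1] == 2 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    output, steps = generate(toy_target, toy_draft, [0], max_new_tokens=12)
    print(output, "verify steps:", steps)
```

With greedy verification the output matches what the target alone would have produced; the gain comes from needing fewer target steps than generated tokens.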

Pushing Performance Further

The speculative decoding speedup depends on three quantities: the average number of tokens generated per forward step of the target, $E(\# generated\_tokens)$; the speculation window size, $\gamma$; and the latency ratio $c$ of the draft model relative to the target model. A smaller, faster (though less accurate) draft can often deliver greater acceleration. This inspired us to shrink the draft model while still preserving its quality, i.e. $E(\# generated\_tokens)$.

$$Speedup = \frac{E(\# generated\_tokens)}{\gamma c + 1}$$
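
The formula can be explored with a few lines of Python. The sketch below is illustrative only: `speedup` evaluates the expression above, reading c as draft latency divided by target latency, while `expected_tokens` relies on an extra simplifying assumption that is not part of this post, namely that each drafted token is accepted independently with a fixed probability. The inputs in the usage lines are made-up values, not measurements.

```python
def expected_tokens(acceptance_rate: float, window: int) -> float:
    """Estimate E(#generated_tokens) per target step.

    Assumption (not from this post): every drafted token is accepted
    independently with probability `acceptance_rate`.
    """
    if acceptance_rate >= 1.0:
        return window + 1.0  # all drafted tokens plus the target's extra token
    return (1.0 - acceptance_rate ** (window + 1)) / (1.0 - acceptance_rate)


def speedup(expected: float, window: int, cost_ratio: float) -> float:
    """Speedup = E(#generated_tokens) / (gamma * c + 1)."""
    return expected / (window * cost_ratio + 1.0)


if __name__ == "__main__":
    window = 4                                   # gamma, hypothetical value
    expected = expected_tokens(0.7, window)      # hypothetical acceptance rate
    for cost_ratio in (0.25, 0.20, 0.15):        # hypothetical draft/target latency ratios
        print(f"c={cost_ratio:.2f} -> speedup {speedup(expected, window, cost_ratio):.2f}x")
```

For a fixed E, lowering c raises the speedup, which is the motivation for making the draft cheaper without hurting its acceptance rate.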

Our recent work shows that model depth (number of layers) is a major contributor to inference latency. Drawing on recent work on layer-wise compression [1], we identify blocks of layers that contribute little, as measured by angular distance, and remove them. Using this method, we pruned 6 of the 28 layers from the Qwen3-0.6B draft model. To recover the quality of the pruned draft, we then fine-tuned it on synthetic data generated by Qwen3-8B: its responses to 500k prompts from the BAAI/Infinity-Instruct dataset.
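
As a rough illustration of the block-selection idea, the sketch below follows the angular-distance criterion described in [1]: compare the representation entering a block of n consecutive layers with the one leaving it, and pick the block where the two are most similar. The exact procedure used for the pruned draft is not spelled out here, so treat the function names, the data layout, and the toy 2-D "hidden states" as assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def angular_distance(a: Vector, b: Vector) -> float:
    """Angle between two vectors, scaled to the range [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cosine) / math.pi


def least_important_block(hidden_states: List[List[Vector]],
                          block_size: int) -> Tuple[int, float]:
    """Find the block of `block_size` layers whose removal changes the least.

    hidden_states[s][l] is the representation of sample s at the input of
    layer l; the last index holds the output of the final layer.
    Returns (index of the first layer in the block, mean angular distance).
    """
    num_layers = len(hidden_states[0]) - 1
    best_start, best_dist = -1, float("inf")
    for start in range(num_layers - block_size + 1):
        dist = sum(
            angular_distance(sample[start], sample[start + block_size])
            for sample in hidden_states
        ) / len(hidden_states)
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start, best_dist


# Toy data: one sample, five "layers", each rotating a 2-D vector.
# Layers 2 and 3 barely move it (60 -> 62 -> 64 degrees).
angles_deg = [0, 30, 60, 62, 64, 90]
toy_sample = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles_deg]

if __name__ == "__main__":
    start, dist = least_important_block([toy_sample], block_size=2)
    print(f"prune layers {start}..{start + 1}, mean angular distance {dist:.4f}")
```

In a real model the vectors would be hidden states collected over a calibration set, and the block size would match the number of layers to drop (6 in the case of the pruned draft).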

The resulting pruned draft model delivered a ~1.4× speedup over the baseline, an improvement on the ~1.3× gain achieved with the original draft. This outcome aligns with theoretical expectations: reducing draft latency improves the overall speedup.

This demonstrates how pruning combined with speculative decoding can unlock faster and more efficient inference, making local AI agents even more practical.

Check out the notebook and the Qwen3-0.6B depth-pruned draft model to reproduce our results step by step.

Integration with 🤗smolagents

To showcase the real-world potential, we deployed our optimized setup with the 🤗smolagents library. With this integration, developers can plug in Qwen3-8B (paired with our pruned draft) to build agents that call APIs and external tools, write and execute code, handle long-context reasoning, and run efficiently on Intel® Core™ Ultra. The benefits aren’t limited to Hugging Face: this model pairing can also be used with frameworks like AutoGen or QwenAgent, further strengthening the agentic ecosystem.

In our demo, we assigned the accelerated Qwen3-based agent a task: 👉 Summarize the key features of the Qwen3 model series and present them in a slide deck.

Here’s how it worked:

1. The agent used a web search tool to gather up-to-date information.
2. It then switched to the Python interpreter to generate slides with the python-pptx library.

This simple workflow highlights just a fraction of the possibilities unlocked when accelerated Qwen3 models meet frameworks like 🤗smolagents, bringing practical, efficient AI agents to life on the AI PC. Try it here 🚀

References

[1] Gromov, A., Tirumala, K., Shapourian, H., Glorioso, P., & Roberts, D. A. (2025, January 22). The unreasonable ineffectiveness of the deeper layers. Poster presented at ICLR 2025. https://arxiv.org/abs/2403.17887

Performance and legal notices

Performance results are based on internal benchmarking with OpenVINO™ 2025.2 as of September 2025, using a configuration with an Intel® Core™ Ultra 7 268V 2.20 GHz processor with an integrated Intel® Arc™ 140V GPU, paired with 32 GB of DDR5 memory.

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software, or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.

Other names and brands may be claimed as the property of others.
