

















Abstract: Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes significant accuracy loss. Despite its potential, quantization-aware training (QAT) for VLMs remains underexplored. We propose GRACE, a framework unifying knowledge distillation and QAT under the Information Bottleneck principle: quantization constrains information capacity while distillation guides what to preserve within this budget. Treating the teacher as a proxy for task-relevant information, we introduce confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints. Across extensive benchmarks on the LLaVA and Qwen families, our INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench), nearly matching teacher performance. Using a real INT4 kernel, we achieve 3$\times$ throughput with 54% memory reduction. This principled framework significantly outperforms existing quantization methods, making GRACE a compelling solution for resource-constrained deployment. Code and data are available at: this https URL.
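
The abstract names relational centered kernel alignment as the mechanism for transferring visual token structure, but does not give its formulation. As a rough illustration only, the sketch below computes the standard linear CKA similarity between two feature matrices with one row per token. It is not the authors' relational variant; the function names and the toy teacher and student features are assumptions made for this example.

```python
# Minimal sketch of standard linear CKA (not the paper's relational variant).
# Inputs are row-major matrices: one row per token, one column per feature.
# All names and the toy data below are hypothetical.

def center_columns(m):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_product(a, b):
    """Return a^T b for a (n x p) and b (n x q) as a p x q nested list."""
    return [
        [sum(va * vb for va, vb in zip(col_a, col_b)) for col_b in zip(*b)]
        for col_a in zip(*a)
    ]


def frobenius_sq(m):
    """Squared Frobenius norm of a nested-list matrix."""
    return sum(v * v for row in m for v in row)


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered inputs.

    Undefined (division by zero) if either input has no variance.
    """
    x, y = center_columns(x), center_columns(y)
    cross = frobenius_sq(cross_product(y, x))
    norm_x = frobenius_sq(cross_product(x, x)) ** 0.5
    norm_y = frobenius_sq(cross_product(y, y)) ** 0.5
    return cross / (norm_x * norm_y)


# Toy features: four tokens, two feature dimensions.
teacher = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
# A student whose features are a uniformly rescaled copy of the teacher's.
student_scaled = [[0.5 * v for v in row] for row in teacher]
# A student whose features are the teacher's rotated by 90 degrees.
student_rotated = [[-b, a] for a, b in teacher]

cka_scaled = linear_cka(teacher, student_scaled)
cka_rotated = linear_cka(teacher, student_rotated)
```

Linear CKA is invariant to isotropic scaling and orthogonal transforms of the features, so both toy students score approximately 1 against the teacher. That invariance is one plausible reason to compare token structure rather than raw activations when the student's value range is altered by quantization.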
| Comments: | Accepted to the International Conference on Machine Learning (ICML 2026) |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) |
| Cite as: | arXiv:2601.22709 [cs.CV] |
| | (or arXiv:2601.22709v4 [cs.CV] for this version) |
| | https://doi.org/10.48550/arXiv.2601.22709 (arXiv-issued DOI via DataCite) |
From: Yanlong Chen
[v1] Fri, 30 Jan 2026 08:30:52 UTC (5,275 KB)
[v2] Mon, 2 Feb 2026 06:39:48 UTC (5,275 KB)
[v3] Sun, 3 May 2026 03:38:04 UTC (5,275 KB)
[v4] Sun, 24 May 2026 01:56:03 UTC (14,110 KB)