Abstract
Adapting Large Vision-Language Models (LVLMs) to specialized domains typically demands resource-intensive fine-tuning or access to proprietary parameters (“white-box” access). While decoding-time strategies like Proxy Tuning offer a parameter-efficient alternative, they rely on rigid, static logit arithmetic that fails to account for instance-specific variations in model certainty and domain shift. In this work, we introduce Adaptive Weighted Proxy Tuning (AWPT), a gray-box steering framework that dynamically modulates the logit contributions of a large base model, a fine-tuned expert, and an untuned anti-expert. Unlike static approaches, AWPT introduces two instance-aware mechanisms: (1) a lightweight ViT-based Weight Predictor that performs amortized inference to estimate optimal mixing coefficients in real time with negligible added latency (∼0.03 s overhead), and (2) a Per-Sample Optimization objective that establishes theoretical performance bounds via gradient-based logit steering. Extensive evaluation across medical (ROCOv2, IU-Xray) and general domains (Flickr30k, MS COCO, TextCaps) demonstrates that AWPT achieves performance parity with fully fine-tuned models while leaving the generator's parameters untouched. Crucially, our dynamic weighting acts as an effective regularizer, significantly reducing object hallucinations and establishing AWPT as a robust solution for deploying general-purpose LVLMs in safety-critical contexts.
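To make the contrast between static and instance-aware logit arithmetic concrete, the following is a minimal sketch on a toy three-token vocabulary. The static rule (base logits plus the expert minus anti-expert difference) is the standard Proxy Tuning form. The weighted rule with three separate coefficients is an assumption made for illustration, since the abstract does not state AWPT's exact combination. The function names and the hand-picked weights are hypothetical; in AWPT the weights would come from the Weight Predictor or from Per-Sample Optimization.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def static_proxy_logits(base, expert, anti_expert):
    """Static Proxy Tuning: shift base logits by (expert - anti-expert)."""
    return [b + (e - a) for b, e, a in zip(base, expert, anti_expert)]


def weighted_proxy_logits(base, expert, anti_expert, weights):
    """Assumed weighted variant: one coefficient per model, chosen per instance.

    `weights` is a hypothetical (w_base, w_expert, w_anti) tuple; the paper's
    actual parameterization may differ.
    """
    w_base, w_expert, w_anti = weights
    return [
        w_base * b + w_expert * e - w_anti * a
        for b, e, a in zip(base, expert, anti_expert)
    ]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy next-token logits from the three models (made-up numbers).
base = [2.0, 1.0, 0.5]
expert = [0.5, 2.0, 0.0]
anti_expert = [1.0, 0.5, 0.0]

static = static_proxy_logits(base, expert, anti_expert)

# Hand-picked small weights stand in for a low-confidence expert signal.
adaptive = weighted_proxy_logits(base, expert, anti_expert, (1.0, 0.2, 0.2))

print("static pick:", argmax(static), softmax(static))
print("adaptive pick:", argmax(adaptive), softmax(adaptive))
```

In this toy case the static rule lets the expert override the base model's preferred token, while the down-weighted combination keeps it. That is the kind of per-instance behavior a fixed coefficient cannot express.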
- Anthology ID: 2026.acl-industry.85
- Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
- Month: July
- Year: 2026
- Address: San Diego, California, USA
- Editors: Yunyao Li, Georg Rehm, Mei Tu
- Venue: ACL
- Publisher: Association for Computational Linguistics
- Pages: 1197–1217
- URL: https://aclanthology.org/2026.acl-industry.85/
- Cite (ACL): Nafew Azim, Fuad Rahman, and Nabeel Mohammed. 2026. Adaptive Weighted Proxy Tuning: Efficient Gray-Box Steering for Image Captioning. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 1197–1217, San Diego, California, USA. Association for Computational Linguistics.
- Cite (Informal): Adaptive Weighted Proxy Tuning: Efficient Gray-Box Steering for Image Captioning (Azim et al., ACL 2026)
- PDF: https://aclanthology.org/2026.acl-industry.85.pdf