Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
2026-06-22 · via Paper Index on ACL Anthology

Abstract

Typical large vision-language models (LVLMs) apply autoregressive supervision primarily to textual responses, without fully exploiting causal learning over rich visual inputs. As a result, these models often emphasize vision-to-language alignment while potentially overlooking fine-grained visual information. While prior work has explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. ASVR trains models to autoregressively reconstruct the semantic content of input images, which consistently enhances multimodal comprehension. Notably, we show that even when provided with continuous image features as input, models can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across various multimodal understanding benchmarks. ASVR delivers significant performance gains and scalability across varying data scales, visual input, visual supervision and model architectures. In particular, ASVR generally improves baselines by 2-3% across 14 multimodal benchmarks.
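The joint objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss formula, so everything here is an assumption made for illustration: the function names, the `visual_weight` parameter, the one-position causal shift, and the idea that the discrete semantic token ids come from some separate (hypothetical) semantic tokenizer. It shows only the general shape of adding a next-token loss over visual semantic tokens to the usual next-token loss over text, not the authors' implementation.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def next_token_loss(logits_per_pos, token_ids):
    """Mean cross-entropy where position i predicts token_ids[i + 1].

    The one-step causal shift is an assumption of this sketch.
    """
    pairs = list(zip(logits_per_pos[:-1], token_ids[1:]))
    return sum(cross_entropy(l, t) for l, t in pairs) / len(pairs)


def joint_loss(visual_logits, visual_token_ids,
               text_logits, text_token_ids, visual_weight=1.0):
    """Text loss plus a weighted visual reconstruction loss.

    visual_token_ids: discrete semantic token ids for the image patches,
    assumed to come from a hypothetical semantic tokenizer.
    visual_weight: hypothetical balancing coefficient, not from the paper.
    """
    l_vis = next_token_loss(visual_logits, visual_token_ids)
    l_txt = next_token_loss(text_logits, text_token_ids)
    return l_txt + visual_weight * l_vis


# Toy example: 3 visual positions over a 4-token semantic vocabulary,
# 3 text positions over a 2-token text vocabulary, all logits uniform.
visual_logits = [[0.0] * 4 for _ in range(3)]
text_logits = [[0.0] * 2 for _ in range(3)]
loss = joint_loss(visual_logits, [1, 2, 3], text_logits, [0, 1, 0])
print(loss)
```

With uniform logits each term reduces to the log of its vocabulary size, which makes the toy case easy to check by hand. In a real model the visual logits would be produced from continuous image features at the visual positions, as the abstract notes, while the targets remain discrete.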

Anthology ID: 2026.findings-acl.1900
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 38101–38115
URL: https://aclanthology.org/2026.findings-acl.1900/
Cite (ACL): Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, and Jiaqi Wang. 2026. Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better. In Findings of the Association for Computational Linguistics: ACL 2026, pages 38101–38115, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better (Wang et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1900.pdf
Checklist: 2026.findings-acl.1900.checklist.pdf