AutoTaskEval: Towards Domain-Specific and Fine-Grained Evaluation for LLMs
2026-06-22 · via Paper Index on ACL Anthology

Abstract

Despite the rapid progress of LLMs, their evaluation remains hindered by static, manually curated benchmarks with limited task coverage and poor adaptability to emerging domains. Existing automated approaches typically operate within fixed task schemas and often fail to autonomously discover new evaluation dimensions, limiting both scalability and effectiveness. To address these gaps, we propose AutoTaskEval, an automated framework that constructs domain-specific benchmarks directly from unstructured corpora. Using a refined Bloom’s Taxonomy, the framework systematically discovers tasks, enriches contextual grounding via iterative Socratic prompting, and generates diverse, progressively challenging evaluation instances. Applied to the complex and knowledge-intensive legal domain, AutoTaskEval uncovers a broader and more fine-grained task space than expert-curated benchmarks while producing high-quality instances that preserve established model-level evaluation trends. We further validate its robustness in a low-structure e-commerce review domain. Together, these results show that AutoTaskEval enables scalable, adaptive, and high-fidelity LLM assessment across domains and model families, advancing autonomous and capability-sensitive evaluation.
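The three stages named in the abstract (task discovery, iterative context enrichment, and generation of progressively harder instances) can be pictured with a minimal sketch. This is not the authors' implementation: the abstract does not specify the refined taxonomy, the prompts, or the number of rounds, so the six standard revised Bloom levels, the prompt strings, the `Task` structure, all function names, and the `echo_llm` stand-in model are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Standard revised Bloom's Taxonomy levels. The paper's "refined" variant is
# not described in the abstract, so using these six is an assumption.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]

LLM = Callable[[str], str]  # hypothetical text-in, text-out model interface


@dataclass
class Task:
    level: str
    description: str
    context: List[str] = field(default_factory=list)


def discover_tasks(passage: str, llm: LLM) -> List[Task]:
    """Propose one candidate task per taxonomy level for a corpus passage."""
    return [
        Task(level, llm(f"Propose a '{level}' task grounded in: {passage}"))
        for level in BLOOM_LEVELS
    ]


def enrich(task: Task, llm: LLM, rounds: int = 2) -> Task:
    """Ask and answer probing questions for a fixed number of rounds."""
    for _ in range(rounds):
        question = llm(f"Ask a probing question about: {task.description}")
        task.context.append(llm(f"Answer: {question}"))
    return task


def generate_instances(task: Task, llm: LLM, n_difficulties: int = 3) -> List[dict]:
    """Emit one evaluation instance per difficulty step, easiest first."""
    return [
        {
            "level": task.level,
            "difficulty": d,
            "prompt": llm(f"Write a difficulty-{d} item for: {task.description}"),
        }
        for d in range(1, n_difficulties + 1)
    ]


def build_benchmark(corpus: List[str], llm: LLM) -> List[dict]:
    """Run discovery, enrichment, and generation over every passage."""
    instances: List[dict] = []
    for passage in corpus:
        for task in discover_tasks(passage, llm):
            instances.extend(generate_instances(enrich(task, llm), llm))
    return instances


def echo_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model so the sketch runs offline."""
    return f"<{prompt[:40]}>"
```

Calling `build_benchmark(["A contract requires offer and acceptance."], echo_llm)` yields one instance per taxonomy level and difficulty step for the single passage. A real setup would replace `echo_llm` with an actual model call and add the quality filtering that the abstract implies but does not detail.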

Anthology ID: 2026.acl-long.280
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 6191–6223
URL: https://aclanthology.org/2026.acl-long.280/
Cite (ACL): Qingqing Lyu, Linjuan Wu, Yongliang Shen, Hengwei Liu, Hao Li, Shengpei Jiang, Yin Zhang, and Weiming Lu. 2026. AutoTaskEval: Towards Domain-Specific and Fine-Grained Evaluation for LLMs. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6191–6223, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): AutoTaskEval: Towards Domain-Specific and Fine-Grained Evaluation for LLMs (Lyu et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.280.pdf
Checklist: 2026.acl-long.280.checklist.pdf