AutoTaskEval: Towards Domain-Specific and Fine-Grained Evaluation for LLMs
2026-06-22 · via Paper Index on ACL Anthology

Abstract

Despite the rapid progress of LLMs, their evaluation remains hindered by static, manually curated benchmarks with limited task coverage and poor adaptability to emerging domains. Existing automated approaches typically operate within fixed task schemas and often fail to autonomously discover new evaluation dimensions, limiting both scalability and effectiveness. To address these gaps, we propose AutoTaskEval, an automated framework that constructs domain-specific benchmarks directly from unstructured corpora. Using a refined Bloom’s Taxonomy, the framework systematically discovers tasks, enriches contextual grounding via iterative Socratic prompting, and generates diverse, progressively challenging evaluation instances. Applied to the complex and knowledge-intensive legal domain, AutoTaskEval uncovers a broader and more fine-grained task space than expert-curated benchmarks while producing high-quality instances that preserve established model-level evaluation trends. We further validate its robustness in a low-structure e-commerce review domain. Together, these results show that AutoTaskEval enables scalable, adaptive, and high-fidelity LLM assessment across domains and model families, advancing autonomous and capability-sensitive evaluation.
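The three stages named in the abstract (task discovery, Socratic enrichment, instance generation) can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than the authors' implementation: the `Task` class, the function names, the prompt wording, and the `echo_model` stub are hypothetical, and the level list is the standard revised Bloom's Taxonomy, which may differ from the paper's refined version.

```python
from dataclasses import dataclass, field
from typing import Callable

# Standard revised Bloom's Taxonomy levels, lowest to highest.
# Assumption: the paper's "refined" taxonomy may not match this list.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]

Ask = Callable[[str], str]  # hypothetical interface: prompt in, model reply out


@dataclass
class Task:
    name: str
    level: str
    context: list[str] = field(default_factory=list)


def discover_tasks(corpus: list[str], ask: Ask) -> list[Task]:
    """Ask the model for one candidate task per Bloom level (hypothetical prompt)."""
    sample = " ".join(corpus)[:500]
    return [
        Task(name=ask(f"Propose a '{level}' task for: {sample}"), level=level)
        for level in BLOOM_LEVELS
    ]


def enrich(task: Task, ask: Ask, rounds: int = 2) -> Task:
    """Iterative Socratic prompting: each round questions the previous answer."""
    question = f"What background is needed for the task: {task.name}?"
    for _ in range(rounds):
        answer = ask(question)
        task.context.append(answer)
        question = f"Why does this matter, and what is still missing? {answer}"
    return task


def generate_instances(task: Task, ask: Ask, n: int = 3) -> list[str]:
    """Generate n instances, requesting a higher difficulty each time."""
    return [
        ask(f"Write a difficulty-{d} question for task '{task.name}' "
            f"using context {task.context}")
        for d in range(1, n + 1)
    ]


def build_benchmark(corpus: list[str], ask: Ask) -> dict[str, list[str]]:
    """Run the three stages and group the generated instances by Bloom level."""
    return {
        task.level: generate_instances(enrich(task, ask), ask)
        for task in discover_tasks(corpus, ask)
    }


def echo_model(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return f"<reply to: {prompt[:40]}>"


benchmark = build_benchmark(["Article 1. Sample legal text."], echo_model)
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client; replacing `echo_model` with a real call would be the only change needed to try it on an actual corpus.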

Anthology ID: 2026.acl-long.280
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 6191–6223
URL: https://aclanthology.org/2026.acl-long.280/
Cite (ACL): Qingqing Lyu, Linjuan Wu, Yongliang Shen, Hengwei Liu, Hao Li, Shengpei Jiang, Yin Zhang, and Weiming Lu. 2026. AutoTaskEval: Towards Domain-Specific and Fine-Grained Evaluation for LLMs. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6191–6223, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): AutoTaskEval: Towards Domain-Specific and Fine-Grained Evaluation for LLMs (Lyu et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.280.pdf
Checklist: 2026.acl-long.280.checklist.pdf