

An Empirical Study of Automating Agent Evaluation
Kang Zhou, S · 2026-05-13 · via cs.CL updates on arXiv.org


Abstract: Agent evaluation requires assessing complex multi-step behaviors involving tool use and intermediate reasoning, making it costly and expertise-intensive. A natural question arises: can frontier coding assistants reliably automate this evaluation process? Our study shows that simply prompting coding assistants is insufficient for this task. Without domain-specific evaluation knowledge, frontier coding assistants achieve only a 30% execution success rate and produce over-engineered evaluations averaging 12+ metrics per agent, indicating that strong coding ability does not automatically translate to reliable agent evaluation. We introduce EvalAgent, an AI assistant that automates the end-to-end agent evaluation pipeline. EvalAgent encodes evaluation domain expertise as evaluation skills (procedural instructions, reusable code and templates, and dynamically retrieved API documentation) that compose into a trace-based pipeline producing complete evaluation artifacts including metrics, executable code, and reports. To systematically assess generated evaluations, we introduce a meta-evaluation framework alongside AgentEvalBench, a benchmark comprising 20 agents, each paired with evaluation requirements and test scenarios. We further propose the Eval@1 metric to measure whether generated evaluation code both executes and yields meaningful results on the first run. Our experiments show that EvalAgent produces focused evaluations, improving Eval@1 from 17.5% to 65%, and achieving 79.5% human expert preference over baseline approaches. Further ablation studies show that evaluation skills are critical for handling complex evaluation: removing them causes Eval@1 to drop significantly from 65% to 30%.
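The Eval@1 metric described in the abstract can be read as a pass rate over first-run attempts, where an attempt counts only if the generated evaluation code both executes and yields meaningful results. The following is a minimal sketch under that reading, not the paper's implementation: the `FirstRunResult` record, its field names, and the `eval_at_1` function are hypothetical, how "meaningful" is judged is left outside the sketch, and the sample records are toy data rather than results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FirstRunResult:
    """Hypothetical record of one first-run attempt (not from the paper)."""
    agent_id: str
    executed: bool    # generated evaluation code ran to completion
    meaningful: bool  # run produced meaningful results (judged externally)


def eval_at_1(results: Sequence[FirstRunResult]) -> float:
    """Fraction of first runs that both executed and were meaningful."""
    if not results:
        raise ValueError("eval_at_1 needs at least one result")
    passed = sum(1 for r in results if r.executed and r.meaningful)
    return passed / len(results)


# Toy data for illustration only; these are not figures from the paper.
results = [
    FirstRunResult("agent-a", executed=True, meaningful=True),
    FirstRunResult("agent-b", executed=True, meaningful=False),
    FirstRunResult("agent-c", executed=False, meaningful=False),
    FirstRunResult("agent-d", executed=True, meaningful=True),
]

print(f"Eval@1 = {eval_at_1(results):.1%}")  # prints "Eval@1 = 50.0%"
```

Requiring both conditions is what separates this reading of Eval@1 from a plain execution success rate: in the toy data, three of four runs execute, but only two also count as meaningful.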
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2605.11378 [cs.CL]
  (or arXiv:2605.11378v1 [cs.CL] for this version)
  https://doi.org/10.48550/arXiv.2605.11378

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Sangmin Woo
[v1] Tue, 12 May 2026 01:06:34 UTC (13,103 KB)