An Empirical Study of Automating Agent Evaluation

Kang Zhou, S · 2026-05-13 · via cs.CL updates on arXiv.org
