

Beyond Rubrics: Exploration-Guided Evaluation Skills for Reward Modeling
[Submitted on 5 Jun 2026] · 2026-06-08 · via cs.CL updates on arXiv.org

Abstract: Open-ended reward modeling requires judges that can follow subtle, domain-specific preferences when verifiable answers are unavailable. Existing rubric-based methods often address this by generating criteria online for each query, but the extra generation step can add inference overhead and produce rigid or misaligned guidance. We introduce Eval-Skill, an exploration-guided method that synthesizes reusable evaluation skills for reward modeling and reframes reward guidance as context evolution rather than parameter training or per-query rubric generation. Using only 100 cases per domain for skill evolution, Eval-Skill synthesizes reusable domain-level evaluation skills through two progressive stages, workflow generation followed by principle generation, with exploration and selection interleaved across both stages. Once generated, a skill is directly injected into the judge context. Across multiple RM benchmarks, Eval-Skill consistently improves diverse judge backbones; on RewardBench 2, it yields significant gains over vanilla judging for each main backbone (+13.44% for Qwen3-8B and +18.51% for DeepSeek-V4-Flash). Further analyses of evolution-time scaling, generalizability, and transferability show that compact evaluation skills offer an efficient new paradigm for LLM-based evaluation. Code is available at this https URL.
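The abstract's core idea, picking a reusable skill on a small set of labeled cases and then injecting it into the judge context, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Case`, `build_judge_prompt`, `accuracy`, `select_skill`, and `toy_judge` are hypothetical, the prompt layout is an assumption, and the two-stage workflow and principle generation is not modeled. Candidate skills are simply given as strings, and a rule-based stand-in replaces the LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Case:
    """One labeled preference pair (hypothetical structure)."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "A" or "B"


# A judge maps a full judge prompt to a verdict, "A" or "B".
Judge = Callable[[str], str]


def build_judge_prompt(skill: str, case: Case) -> str:
    """Inject the skill text into the judge context ahead of the case."""
    return (
        f"Evaluation skill:\n{skill}\n\n"
        f"Prompt:\n{case.prompt}\n\n"
        f"Response A:\n{case.response_a}\n\n"
        f"Response B:\n{case.response_b}\n\n"
        "Answer with A or B."
    )


def accuracy(judge: Judge, skill: str, cases: Sequence[Case]) -> float:
    """Fraction of cases where the judge, guided by the skill, picks the label."""
    correct = sum(judge(build_judge_prompt(skill, c)) == c.preferred for c in cases)
    return correct / len(cases)


def select_skill(judge: Judge, candidates: Sequence[str], cases: Sequence[Case]) -> str:
    """Selection step only: keep the candidate skill with the best accuracy."""
    return max(candidates, key=lambda s: accuracy(judge, s, cases))


def toy_judge(judge_prompt: str) -> str:
    """Rule-based stand-in for an LLM judge (assumption, for illustration only)."""
    skill = judge_prompt.split("\n\nPrompt:")[0]
    a = judge_prompt.split("Response A:\n")[1].split("\n\nResponse B:")[0]
    b = judge_prompt.split("Response B:\n")[1].split("\n\nAnswer with")[0]
    if "concise" in skill:
        # With a conciseness skill in context, prefer the shorter response.
        return "A" if len(a) <= len(b) else "B"
    # Without useful guidance, this toy judge always answers "A".
    return "A"


CASES = [
    Case("Define RAM.", "Volatile working memory.",
         "RAM is, broadly speaking, a kind of memory that computers tend to use.", "A"),
    Case("Define CPU.", "The CPU is, in a manner of speaking, the part that computes.",
         "The processor.", "B"),
]
CANDIDATES = [
    "Prefer the more detailed answer.",
    "Prefer the concise, correct answer.",
]

best_skill = select_skill(toy_judge, CANDIDATES, CASES)
```

In a real setting `toy_judge` would be replaced by a call to the judge model, and the selected skill would be reused for every query in the domain instead of generating a rubric per query.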

Submission history

From: Xing Yue
[v1] Fri, 5 Jun 2026 08:34:06 UTC (8,904 KB)