RAG-DIVE: A Dynamic Approach for Multi-Turn Dialogue Evaluation in Retrieval-Augmented Generation
Lorenz Brehme, Benedikt Dornauer, Jan-Henrik Böttcher, Klaus Sch · 2026-01-30 · via cs.IR updates on arXiv.org

Evaluating Retrieval-Augmented Generation (RAG) systems using static multi-turn datasets fails to capture the dynamic nature of real-world dialogues. Existing evaluation methods rely on predefined datasets, which restrict them to static, one-directional queries and limit their ability to capture the adaptive, context-dependent performance of RAG systems in interactive, multi-turn settings. We therefore introduce RAG-DIVE, a Dynamic Interactive Validation and Evaluation approach that simulates user interactions with RAG systems. RAG-DIVE leverages an LLM to generate multi-turn conversations dynamically and is organized into three components. The dialogue generation stage consists of the (1) Conversation Generator, which simulates a user by creating multi-turn queries, and the (2) Conversation Validator, which filters and corrects invalid or low-quality outputs to ensure coherent conversations. The evaluation stage is handled by the (3) Conversation Evaluator, which assesses the RAG system's performance across the entire dialogue and generates both per-turn metrics and multi-turn metrics, the latter providing an aggregated view of system behavior. We validated RAG-DIVE through two experimental setups. First, we tested a sample RAG system, including human evaluation of dialogue quality, repeated trials to assess consistency, and an ablation study showing that RAG-DIVE detects performance changes caused by system modifications. Second, we compared RAG-DIVE with a traditional static dataset evaluation on an industrial RAG system under different configurations to verify whether both approaches reveal similar performance trends. Our findings demonstrate that RAG-DIVE facilitates dynamic, interaction-driven evaluation for multi-turn conversations, thereby advancing the assessment of RAG systems.
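
The three-component loop described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the names `Turn`, `generate_query`, `validate_query`, `score_turn`, `run_dialogue`, and `evaluate` are hypothetical, the generator uses fixed templates where the paper uses an LLM, and the word-overlap score is a toy stand-in for whatever metrics the Conversation Evaluator actually computes.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Optional


@dataclass
class Turn:
    query: str
    answer: str
    score: float = 0.0


def generate_query(history: List[Turn], topic: str) -> str:
    """Stand-in for the Conversation Generator (assumption: templates, not an LLM)."""
    if not history:
        return f"What is {topic}?"
    return f"Can you expand on turn {len(history)} about {topic}?"


def validate_query(query: str, history: List[Turn]) -> Optional[str]:
    """Stand-in for the Conversation Validator: repair what it can, reject the rest."""
    cleaned = " ".join(query.split())  # collapse stray whitespace
    if not cleaned:
        return None  # reject empty queries
    if not cleaned.endswith("?"):
        cleaned += "?"  # correct a missing question mark
    if any(turn.query == cleaned for turn in history):
        return None  # reject verbatim repeats of an earlier turn
    return cleaned


def score_turn(query: str, answer: str) -> float:
    """Toy per-turn metric: fraction of query words that reappear in the answer."""
    q_words = {w.strip("?.,").lower() for w in query.split()} - {""}
    a_words = {w.strip("?.,").lower() for w in answer.split()}
    return len(q_words & a_words) / len(q_words) if q_words else 0.0


def run_dialogue(
    rag_system: Callable[[str, List[Turn]], str],
    topic: str,
    max_turns: int = 3,
) -> List[Turn]:
    """Generate, validate, and answer queries turn by turn against the RAG system."""
    history: List[Turn] = []
    for _ in range(max_turns):
        query = validate_query(generate_query(history, topic), history)
        if query is None:
            break  # stop when the validator rejects the next query
        answer = rag_system(query, history)
        history.append(Turn(query, answer, score_turn(query, answer)))
    return history


def evaluate(history: List[Turn]) -> Dict[str, object]:
    """Stand-in for the Conversation Evaluator: per-turn scores plus one aggregate."""
    per_turn = [turn.score for turn in history]
    return {"per_turn": per_turn, "multi_turn": mean(per_turn) if per_turn else 0.0}
```

To try it, pass any callable that takes a query and the dialogue history and returns an answer string as `rag_system`, then call `evaluate(run_dialogue(rag_system, "RAG"))`. Because each query is generated from the history so far, the dialogue depends on the system's earlier answers only through that history; an LLM-backed generator would use it to produce genuinely adaptive follow-ups.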