HistoRAG: Embedding Historical Methodology in Retrieval-Augmented Generation Through Critical Technical Practice
[Submitted on 16 Jun 2026] · 2026-06-17 · via cs.IR updates on arXiv.org


Abstract: Retrieval-Augmented Generation (RAG) is the prevailing architecture for grounding language model outputs in external evidence, yet its dominant evaluation paradigms and default configurations remain oriented toward factual question-answering. For interpretive disciplines such as historical studies, RAG embeds assumptions that conflict with scholarly practice. We introduce HistoRAG, a framework that translates historiographical principles into concrete architectural interventions. Separated retrieval and generation decouples source discovery from interpretation, temporal windowing enforces balanced source representation across the research period as a methodological requirement of historical inquiry, and LLM-as-judge evaluation makes relevance judgments transparent and contestable. We evaluate these interventions using SPIEGELragged, applied to 102,189 articles from Der Spiegel (1950-1979). Each intervention addresses a measurable deficiency in standard RAG: era-specific vocabulary retrieves zero chunks from the 1950s when using 1970s terminology, evidence of the temporal skew that motivates windowing; vector similarity and LLM-assessed relevance correlate only weakly (Spearman rho = 0.275), motivating post-retrieval evaluation; and keyword-based and semantic retrieval surface largely disjoint source pools, motivating an architecture in which both operate as complementary retrieval layers under a shared LLM evaluation filter. We also introduce the concept of Zwischentexte (intermediate texts that function as interpretive proposals rather than findings) as a framework for responsible integration of LLM-generated text into scholarly practice. The architecture offers a model for how domain-specific epistemological commitments can be translated into RAG design decisions, and may transfer to other interpretive disciplines working with large corpora.
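The temporal windowing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Chunk` type, the `windowed_top_k` function, the ten-year window size, and the scores are all hypothetical, chosen only to show how per-window selection differs from a single global top-k ranking when similarity scores skew toward one era.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical retrieved text chunk with a publication year."""
    text: str
    year: int
    score: float  # assumed similarity to the query; higher is better


def windowed_top_k(chunks, start, end, window_years, k_per_window):
    """Return up to k_per_window best-scoring chunks from each time window.

    Windows are consecutive, inclusive year ranges covering start..end.
    """
    selected = {}
    for lo in range(start, end + 1, window_years):
        hi = min(lo + window_years - 1, end)
        in_window = [c for c in chunks if lo <= c.year <= hi]
        in_window.sort(key=lambda c: c.score, reverse=True)
        selected[(lo, hi)] = in_window[:k_per_window]
    return selected


# Invented example data: scores favour the 1970s, as they might when the
# query uses 1970s terminology.
chunks = [
    Chunk("a", 1952, 0.41),
    Chunk("b", 1958, 0.35),
    Chunk("c", 1964, 0.62),
    Chunk("d", 1971, 0.90),
    Chunk("e", 1973, 0.88),
    Chunk("f", 1978, 0.85),
]

# A single global ranking returns only 1970s chunks.
global_top3 = sorted(chunks, key=lambda c: c.score, reverse=True)[:3]

# Per-window selection returns one chunk from each decade.
windowed = windowed_top_k(chunks, 1950, 1979, window_years=10, k_per_window=1)

for (lo, hi), picks in windowed.items():
    print(lo, hi, [c.text for c in picks])
```

In this toy data the global top three all come from the 1970s, while the windowed selection keeps one source per decade. How the paper actually sizes its windows and balances sources within them is not stated in the abstract.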

Submission history

From: Torsten Hiltmann
[v1] Tue, 16 Jun 2026 16:03:37 UTC (185 KB)