


Eywa: Provenance-Grounded Long-Term Memory for AI Agents
[Submitted on 29 May 2026] · 2026-06-01 · via cs.CL updates on arXiv.org


Abstract: AI agents that persist across sessions need memory they can retrieve, audit, update, and erase. Existing memory systems often collapse source evidence, extracted facts, retrieved context, and answer policy into one opaque prompt path, making failures difficult to diagnose: a wrong answer may come from missing evidence, unsupported extraction, stale state, retrieval loss, or answer-model behavior. We present Eywa, a provenance-grounded memory architecture built around evidence before belief. Eywa stores immutable source evidence before deriving canonical facts, validates extracted memories against typed signals and source support, and retrieves bounded memory context through a deterministic multi-route read path with zero LLM calls inside retrieval. Retrieved context is returned separately from answer instructions, allowing the same memory substrate to be evaluated across frontier, budget, and local answer models. Under a frozen, artifact-recorded retrieval configuration, Eywa reaches 90.19% judge accuracy on the LoCoMo C1-C4 split with Claude Sonnet 4.6 write and QA roles. On LongMemEval-S, it reaches 88.2% retrieval-sufficiency accuracy. On BEAM, a 700-question technical-memory stress benchmark, it reaches 81.45% mean nugget score and 85.29% pass@score >= 0.5. Full per-question artifacts, including questions, gold answers, model answers, retrieved context, and labels, are published at this https URL.
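To make the "evidence before belief" idea concrete, the following is a minimal Python sketch, not Eywa's actual implementation. All names (`MemoryStore`, `Evidence`, `Fact`, `add_fact`, `retrieve`) are hypothetical. The sketch assumes that source support can be approximated by a verbatim substring check, and that a "multi-route" read path can be approximated by two keyword-overlap routes (one over claims, one over source text). The abstract does not specify either mechanism.

```python
import hashlib
import re
from dataclasses import dataclass


def _terms(text):
    """Lowercased word set; a stand-in for whatever indexing a real system uses."""
    return set(re.findall(r"\w+", text.lower()))


@dataclass(frozen=True)
class Evidence:
    """Immutable source record (frozen, so it cannot be mutated after storage)."""
    evidence_id: str
    text: str


@dataclass(frozen=True)
class Fact:
    """Derived memory that always points back to the evidence supporting it."""
    claim: str
    support_span: str
    evidence_id: str


class MemoryStore:
    def __init__(self):
        self._evidence = {}
        self._facts = []

    def add_evidence(self, text):
        # Content-derived id: storing the same text twice yields the same record.
        eid = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._evidence.setdefault(eid, Evidence(eid, text))
        return eid

    def add_fact(self, claim, support_span, evidence_id):
        # Assumed validation rule: reject any fact whose support span is not
        # found verbatim in already-stored evidence.
        ev = self._evidence.get(evidence_id)
        if ev is None or support_span not in ev.text:
            return False
        self._facts.append(Fact(claim, support_span, evidence_id))
        return True

    def retrieve(self, query, limit=3):
        # Deterministic read path with no model calls. Two illustrative routes:
        # route 1 matches query terms against claims, route 2 against sources.
        q = _terms(query)
        scores = {}
        for i, fact in enumerate(self._facts):
            claim_hits = len(q & _terms(fact.claim))
            source_hits = len(q & _terms(self._evidence[fact.evidence_id].text))
            if claim_hits or source_hits:
                scores[i] = 2 * claim_hits + source_hits
        # Ties break on insertion order, so output is stable across runs.
        ranked = sorted(scores, key=lambda i: (-scores[i], i))
        # Bounded context only; answer instructions are the caller's concern.
        return [self._facts[i] for i in ranked[:limit]]
```

In this sketch, `retrieve` returns only facts with their evidence ids, so the same retrieved context could be handed to different answer models with different instructions, mirroring the separation the abstract describes.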

Submission history

From: Resham Joshi
[v1] Fri, 29 May 2026 02:56:35 UTC (3,432 KB)