The Value of Adaptivity in LSM Bloom-Filter Tuning: A Log-Law and a Two-Clock Frontier
[Submitted on 16 Jun 2026] · 2026-06-17 · via cs.DB updates on arXiv.org


Abstract: Log-structured merge (LSM) trees attach an approximate-membership filter to every run and must split a fixed memory budget across them. The static optimum is known (Monkey); a large systems literature then makes the allocation adaptive, tracking shifting hotness online. We ask a prior question: when is that adaptivity worth its machinery? We give three analytical answers and validate them on synthetic sweeps, real Twitter production cache traces, and a real RocksDB engine. First, a log-law: optimal bits-per-key is affine in the logarithm of access frequency, at a fixed slope. Second, a robustness law: because the workload enters only logarithmically, the excess read cost from a hotness misestimate is half the size-weighted variance of the log error, and a common-factor misestimate is absorbed by the budget multiplier, so coarse estimates lose little. Third, an adaptivity-value frontier: since compaction rebuilds filters for free on its own clock, the value of continuous tracking over an allocation recomputed only at compaction grows quadratically in the within-epoch drift, with a closed-form scale. This yields a three-regime policy (coarse-at-compaction suffices, then track, then at extreme drift fall back to uniform) and predicts that more skew makes fine tracking matter less. On a real cluster, reallocating only at compaction captures 96-99% of tracking's benefit; on RocksDB the false-positive primitive holds within four percent to eight bits per key. The contribution is a characterization of when adaptive tuning pays; we add no new filter and no engine fork. Code and pre-registration are public.
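
The log-law and the robustness law can be illustrated with a small numerical sketch. It is not the authors' code, and the abstract does not state the exact cost model. The sketch assumes the textbook Bloom-filter approximation FPR(b) = exp(-b ln^2 2) at b bits per key, a read cost equal to the sum over runs of (lookup rate times false-positive rate), and a budget large enough that no run is driven to zero bits. The function names, run sizes, lookup rates, and error values are all hypothetical. Under these assumptions the Lagrangian optimum is b_i = B/N + (ln(f_i/n_i) - mean)/ln^2 2, which is affine in log frequency with slope 1/ln^2 2, and a misestimate f_i * exp(e_i) raises cost by roughly half the size-weighted variance of e_i.

```python
import math

# Assumed model (not taken from the abstract): FPR(b) = exp(-C * b) at b bits per key.
C = math.log(2) ** 2


def allocate(freqs, sizes, budget_bits):
    """Minimize sum(f_i * exp(-C * b_i)) subject to sum(n_i * b_i) = budget_bits.

    Closed form from the Lagrangian; ignores the b_i >= 0 constraint.
    """
    total_keys = sum(sizes)
    log_q = [math.log(f / n) for f, n in zip(freqs, sizes)]  # log per-key frequency
    mean_log_q = sum(n * lq for n, lq in zip(sizes, log_q)) / total_keys
    return [budget_bits / total_keys + (lq - mean_log_q) / C for lq in log_q]


def read_cost(freqs, bits):
    """Expected wasted probes per unit time under the assumed FPR model."""
    return sum(f * math.exp(-C * b) for f, b in zip(freqs, bits))


def weighted_variance(values, weights):
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    return sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total


# Hypothetical three-run LSM tree: keys per run and lookup rate per run.
sizes = [1_000, 10_000, 100_000]
true_freqs = [500.0, 800.0, 1_000.0]
budget = 10 * sum(sizes)  # 10 bits per key on average

optimal_bits = allocate(true_freqs, sizes, budget)

# Misestimate each run's frequency by a factor exp(e_i).
log_errors = [0.3, -0.2, 0.1]
est_freqs = [f * math.exp(e) for f, e in zip(true_freqs, log_errors)]
est_bits = allocate(est_freqs, sizes, budget)

relative_excess = (
    read_cost(true_freqs, est_bits) / read_cost(true_freqs, optimal_bits) - 1
)
predicted_excess = 0.5 * weighted_variance(log_errors, sizes)

# A common-factor misestimate (every rate tripled) leaves the allocation unchanged.
scaled_bits = allocate([3.0 * f for f in true_freqs], sizes, budget)

print("optimal bits per key:", [round(b, 2) for b in optimal_bits])
print("relative excess cost:", relative_excess)
print("half weighted variance of log error:", predicted_excess)
```

In this model the slope 1/ln^2 2 is about 2.08 bits per key per natural-log unit of frequency, or about 1.44 bits per key for each doubling. The half-variance figure is a second-order approximation, so the two printed excess values agree closely only when the log errors are small.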

Submission history

From: Sandeep Kunkunuru
[v1] Tue, 16 Jun 2026 16:39:41 UTC (226 KB)