惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

H
Help Net Security
J
Java Code Geeks
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
H
Hackread – Cybersecurity News, Data Breaches, AI and More
V
Visual Studio Blog
G
Google Developers Blog
V
V2EX
The Register - Security
The Register - Security
博客园 - 三生石上(FineUI控件)
云风的 BLOG
云风的 BLOG
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
博客园_首页
S
SegmentFault 最新的问题
博客园 - Franky
Martin Fowler
Martin Fowler
Stack Overflow Blog
Stack Overflow Blog
A
About on SuperTechFans
人人都是产品经理
人人都是产品经理
aimingoo的专栏
aimingoo的专栏
罗磊的独立博客
C
Check Point Blog
MyScale Blog
MyScale Blog
T
The Blog of Author Tim Ferriss
MongoDB | Blog
MongoDB | Blog
The GitHub Blog
The GitHub Blog
Last Week in AI
Last Week in AI
Microsoft Azure Blog
Microsoft Azure Blog
IT之家
IT之家
F
Fortinet All Blogs
Jina AI
Jina AI
P
Proofpoint News Feed
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
阮一峰的网络日志
阮一峰的网络日志
B
Blog
L
LangChain Blog
月光博客
月光博客
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
宝玉的分享
宝玉的分享
博客园 - 【当耐特】
T
Tailwind CSS Blog
酷 壳 – CoolShell
酷 壳 – CoolShell
Microsoft Security Blog
Microsoft Security Blog
WordPress大学
WordPress大学
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
B
Blog RSS Feed
博客园 - 聂微东
Hugging Face - Blog
Hugging Face - Blog
M
MIT News - Artificial intelligence
GbyAI
GbyAI

Arpit Bhayani

Temporal Primer - Building Long-Running Systems What Matters in Production RAG Structure of Every LLM Chat How LLMs Really Work Your Monolith Is Already A Distributed System Databases Were Not Designed For This BM25 JOIN Algorithms Venting at Work Comes at a Reputation Cost Why Half Your Skills Expire Every Few Years Multi-Paxos - Consensus in Distributed Databases MySQL Replication Internals Bloom Filters When You Increase Kafka Partitions Product Quantization The Q, K, V Matrices The Day I Accidentally Deleted Production How LLM Inference Works What are Blocking Queues and Why We Need Them Heartbeats in Distributed Systems How Writes Work in Apache Cassandra Redis Replication Internals How to Handle Arrogant Colleagues at Work How Does a CDN Handle Content Replication You Can't Fix Everything on Day One When Emotions Spill Over at Work Why gRPC Uses HTTP2 Meetings With No Agenda Are a Waste of Time Career Longevity Beats Constant Job Hopping Stay Relevant at Higher Salary Levels Why Distributed Systems Need Consensus Algorithms Like Raft Why Do Databases Deadlock and How Do They Resolve It Why and How Cache Locality Can Make Your Code Faster Why Eventual Consistency is Preferred in Distributed Systems Why does DNS use both UDP and TCP Should You Do a Master's My Honest Take Empathy Makes Great Engineers Unstoppable Good Mentors Build People, Not Just Skills Why You Should Always Have Back-Burner Projects Before You Push Back, Know What You're Standing On Be the One They Can Count On How Much Are People Willing to Bet on You How to Get Leadership to Say Yes to Your Project Don't Let Your Best Ideas Die in Silence Be the Person Everyone Wants to Work With The XY Problem and How to Avoid It The Startup Hiring Lie Nobody Talks About You Won't Be Promoted Unless You Ask It's Not Enough to be Right; Learn to be Heard No One Ships Great Software Alone You Don't Win by Proving Others Wrong Appreciate Generously; It Costs Nothing, But Builds Everything Your Soft Skills Aren't Soft at All Before you form an opinion, experience it Why You Need Both Curiosity and Action to Thrive A Daily Worklog Changed Everything How We Handle Mistakes Defines Us Own Your Mistakes Don't Wait. Step Up. Temporary Fixes Are Permanent Why Interviews Are Biased And What Sets You Apart Saying 'This isn't my problem' is actually the problem How to Write Effective OKRs Never Lose a Battle due to Miscommunication When In Doubt, Code It Out How to Follow Up Without Annoying People Lead Projects That Land, Execution Over Everything Abstract Thinking Will Define Your Next Decade We Engineers Suck at Task Estimation Shiny Obect Syndrome in Tech When to Change Jobs - The 3P Framework Comfort and Competition - Know When to Switch Gears Paper Notes - On-demand Container Loading in AWS Lambda Paper Notes - SQL Has Problems. We Can Fix Them Pipe Syntax In SQL Paper Notes - NanoLog - A Nanosecond Scale Logging System Don't Wait, Learn - The Best Resource is Mythical Paper Notes - WTF - The Who to Follow Service at Twitter The Unexpected Benefit of Reading Random Engineering Articles Roadmaps Are Limiting Your Growth Stop Leaving Money on the Table - Negotiate Your Job Offer Never Bad-Mouth Your Past Employers Show You're a Culture Fit Quantify your resume, Know Your Numbers The Importance of Being Likeable in Interviews Questions to Ask Your Interviewer How to Build Trust Through Collaboration Do This, Once You Are Out of the Interview Cycle Stop Pitching Ideas, Start Pitching Projects Read Those Design Docs, Even the Ones That Seem Irrelevant The Best Engineering Lessons Happen During Outages Great Engineers Start Broad LLM Summaries are Ruining Your Learning Turn System Design Interviews into Discussions Title Inflation At Work, Find Your Own Projects 6 Simple Strategies to Cracking Any Tech Interview How to Remain Unblocked Solving the Knapsack Problem with Evolutionary Algorithms Generating Pseudorandom Numbers with LFSR Local vs Global Indexes in Partitioned Databases
The RUM Conjecture - Storage System Trade-offs
Arpit Bhayani · 2020-05-31 · via Arpit Bhayani

The RUM Conjecture states that we cannot design an access method for a storage system that is optimal in all the following three aspects - Reads, Updates, and, Memory. The conjecture puts forth that we always have to trade one to make the other two optimal and this makes the three constitutes a competing triangle, very similar to the famous CAP theorem.

https://user-images.githubusercontent.com/4745789/83323578-6eb21e00-a27d-11ea-941b-43e875169c97.png

Access Method

Data access refers to an ability to access and retrieve data stored within a storage system driven by an optional storage engine. Usually, a storage system is designed to be optimal for serving a niche use case and achieve that by carefully and judiciously deciding the memory and disk storage requirements, defining well-structured access and retrieval pattern, designing data structures for primary and auxiliary data and picking additional techniques like compression, encryption, etc. These decisions define, and to some extent restricts, the possible ways the storage engine can read and update the data in the system.

RUM Overheads

An ideal storage system would be the one that has an access method that provides lowest Read Overhead, minimal Update Cost, and does not require any extra Memory or Storage space, over the main data. In the real-world, achieving this is near impossible and that is something that is dictated by this conjecture.

Read Overhead

Read Overhead occur when the storage engine performs reads on auxiliary data to fetch the required intended main data. This usually happens when we use an auxiliary data structure like a Secondary Index to speed up reads. The reads happening on this auxiliary structure constitutes read overheads.

Read Overhead is measured through Read Amplification and it is defined as the ratio between the total amount of data read (main + auxiliary) and the amount of main data intended to be read.

Update Overhead

Update Overhead occur when the storage engine performs writes on auxiliary data or on some unmodified main data along with intended updates on the main data. A typical example of Update Overheads is the writes that happen on an auxiliary structure like Secondary Index alongside the write happening on intended main data.

Update Overhead is measured through Write Amplification and it is defined as the ratio between the total amount of data written (main + auxiliary) and the amount of main data intended to be updated.

Memory Overhead

Memory overhead occurs when the storage system uses an auxiliary data structure to speed up reads, writes, or to serve common access patterns. This storage is in addition to the storage needs of the main data.

Memory Overhead is measured through Space Amplification and it is defined as the ratio between the space utilized for auxiliary and main data and space utilized by the main data.

The Conjecture

The RUM Conjecture, in a formal way, states that

An access method that can set an upper bound for two out of the read, update, and memory overheads, also sets a lower bound for the third overhead.

This is not a hard rule that is followed and hence it is not a theorem but a conjecture - widely observed but not proven. But we can safely keep this in mind while designing the next big storage system serving a use case.

Categorizing Storage Systems

Now that we have seen RUM overheads and the RUM Conjecture we take a look at examples of Storage Systems that classify into one of the three types.

Read Optimised

Read Optimised storage systems offer very low read overhead but require some extra auxiliary space to gain necessary performance that again comes at a cost of updates required to keep auxiliary data in sync with main data which adds to update overheads. When the updates, on main data, become frequent the performance of a read optimized storage system takes a dip.

A fine example of a read optimized storage system is the one that supports Point Indexes, also called Hash-based indexes, offering constant time access. The systems that provide logarithmic time access, like B-Trees and Skiplists, also fall into this category.

Update Optimised

Update Optimised storage systems offer very low Update Overhead by usually using an auxiliary space holding differential data (delta) and flushing them over main data in a bulk operation. The need of having an auxiliary data to keep track of delta to perform a bulk update adds to Memory Overhead.

A few examples of Update Optimised systems are LSM Trees, Partitioned B Trees, and FD Tree. These structures offer very good performance for an update-heavy system but suffer from an increased read and space overheads. While reading data from LSM Tree, the engine needs to perform read on all the tiers and then perform a conflict resolution, and maintaining tiers of data itself is a huge Space Overhead.

Memory Optimised

Memory Optimised storage systems are designed to minimize auxiliary memory required for access and updates on the main data. To be memory-optimized the systems usually use compress the main data and auxiliary storages, or allow some error rate, like false positives.

A few examples of Memory Optimises systems are lossy index structures like Bloom Filters, Count-min sketches, Lossy encodings, and Sparse Indexes. Keeping either main or auxiliary data compressed, to be memory efficient, the system takes a toll on writes and reads as they now have additionally performed compression and decompressions adding to the Update and Read overheads.

https://user-images.githubusercontent.com/4745789/83323560-55a96d00-a27d-11ea-9d33-4001c672b920.png

Storage System examples for RUM Conjecture

Block-based Clustered Indexing

Block-based Clustered Indexing, sits comfortably between these three optimized systems types. It is not read Read efficient but also efficient on Updates and Memory. It builds a very short tree for its auxiliary data, by storing a few pointers to pages and since the data is clustered i.e. the main data itself is stored in the index, the system does not go to fetch the main data from the main storage and hence provides a minimal Read overhead.

Being RUM Adaptive

Storage systems have always been rigid with respect to the kind of use cases it aims to solve. the application, the workload, and the hardware should dictate how we access our data, and not the constraints of our systems. Storage systems could be designed to be RUM Adaptive and they should possess an ability to be tuned to reduce the RUM overheads depending on the data access pattern and computation knowledge. RUM Adaptive storage systems are part of the discussion for some other day.

Conclusion

There will always be trade-offs, between Read, Update, and Memory, while either choosing one storage system over others; the RUM conjecture facilitates and to some extent formalizes the entire process. Although this is just a conjecture, it still helps us disambiguate and make an informed, better and viable decision that will go a long way.

This essay was heavily based on the original research paper introducing The RUM Conjecture.

References