惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

Attack and Defense Labs
Attack and Defense Labs
T
Threatpost
C
Cybersecurity and Infrastructure Security Agency CISA
H
Hackread – Cybersecurity News, Data Breaches, AI and More
I
Intezer
C
Cyber Attacks, Cyber Crime and Cyber Security
The Register - Security
The Register - Security
量子位
Security Latest
Security Latest
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
大猫的无限游戏
大猫的无限游戏
小众软件
小众软件
Exploit-DB.com RSS Feed
Exploit-DB.com RSS Feed
C
CXSECURITY Database RSS Feed - CXSecurity.com
MyScale Blog
MyScale Blog
J
Java Code Geeks
Apple Machine Learning Research
Apple Machine Learning Research
Google DeepMind News
Google DeepMind News
WordPress大学
WordPress大学
Spread Privacy
Spread Privacy
Jina AI
Jina AI
博客园 - 【当耐特】
P
Palo Alto Networks Blog
Last Week in AI
Last Week in AI
SecWiki News
SecWiki News
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
cs.AI updates on arXiv.org
cs.AI updates on arXiv.org
G
GRAHAM CLULEY
宝玉的分享
宝玉的分享
Hacker News - Newest:
Hacker News - Newest: "LLM"
T
The Blog of Author Tim Ferriss
V
Vulnerabilities – Threatpost
有赞技术团队
有赞技术团队
T
Tor Project blog
H
Hacker News: Front Page
A
Arctic Wolf
NISL@THU
NISL@THU
A
About on SuperTechFans
云风的 BLOG
云风的 BLOG
Engineering at Meta
Engineering at Meta
V
V2EX
N
News and Events Feed by Topic
Webroot Blog
Webroot Blog
Know Your Adversary
Know Your Adversary
P
Privacy International News Feed
I
InfoQ
D
Docker
L
LINUX DO - 最新话题
K
KPMG report finds enterprise disconnect between AI and its ROI | CIO
U
Unit 42

Zhach's News & Views

The Real Gap Between What We See and What We Know Rise of Liquid News: Formats for the User How I Left YouTube The AI Editor: Can We Trust the AI Fact-Checker? (Part 2) Short-Form Content and the Social Cost of a Quick Scroll How to Save Money with Big Data: Finding Matches (Part 3) The Independent Reporter's Playbook: How to Thrive in the New News Ecosystem The AI Editor: How to Automate Fact-Checking? (Part 1) How to Save Money with Big Data: Counting Things (Part 2) When Does Someone’s Confidence Need a Check-In? Protecting the Human: Why the EU AI Act is a Good Move The Math of Truth: How Data is Reshaping Journalism The Plateau is Here: This is What's Next For AI I Used to Work at Google. This Is How Your News Is Created.
How to Save Money with Big Data: Understanding Hashes (Part 1)
https://www.facebook.com/zhachory.volker · 2025-10-15 · via Zhach's News & Views

WARNING: this 3-part series is a little more technical than others. If you like cool computer theories or are curious about how things work, enjoy! If you don’t, you can check out my other stories at

zhach.news

.

I am currently working with my tutoring student about some computer science data structures and we got to one of my favorite topics, hash functions and tables! And I can’t wait to teach them all about the wonderful world of Probabilistic Data Structures (or PDS). But I realized that not a lot of people know about these things, and they are everywhere, so I decided to write about it.

A “Hash Function” is simply defined as turning an arbitrary long piece of data into a fixed length piece of data. A simple hash function could be just turning the letters of a sentence into numbers and adding them together (like this).​

By Jorge Stolfi - Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=6601264

The foundation of nearly all efficient, large-scale data processing rests upon the concept of hashing. Before we can appreciate how probabilistic data structures save us petabytes of memory, we must first clearly understand this building block: the hash function.


What does a Hash Function do?

In its simplest form, a hash function is a mathematical algorithm that takes an input (or a "key") of arbitrary length and produces an output of a fixed, generally much shorter, length. That’s it. Like I mentioned before, a simple hash function could be just turning the letters of a sentence into numbers and adding them together. So if I use the Unicode number for any letter, I can just add up the letters and use that as my hash function.

So imagine a super long sentence.

sentence_1 = """The old lighthouse keeper, Elias, watched the relentless sweep of the beam cut through the fog, a familiar, hypnotic rhythm that had governed his life for thirty years. Outside, the gulls cried a mournful chorus against the steady roar of the Atlantic, their sound muffled by the thick, salty air. He remembered a story from his childhood, about a ship that had mistaken a particularly bright star for its light and met disaster on the jagged reefs below. Tonight, the sea was a canvas of inky black, disturbed only by whitecaps that seemed to grin menacingly in the sparse illumination, a perfect, lonely night for a man who preferred the company of waves and silence."""

A hash function can take this and just turn it into a number (what we call a hash number or value).

hash_number_1 = hash_function(sentence_1)
print(hash_number_1) # prints "85733"

Let’s try it for a different sentence.

sentence_2 = "My name is Zhach, and I like Becky."
hash_number_2 = hash_function(sentence_2)
print(hash_number_2) # prints "11726"

And the fun part is that those same sentences will always return that same number, no matter what.

hash_number_1 = hash_function(sentence_1)
print(hash_number_1) # prints "85733"

hash_number_2 = hash_function(sentence_2)
print(hash_number_2) # prints "11726"

But if you change even one letter (like if you misspell my name), the hash function will return an entirely different number.

sentence_3 = "My name is Zach, and I like Becky."
hash_number_3 = hash_function(sentence_3)
print(hash_number_3) # prints "69718"

This may seem useless at first, but this can have a number of applications!

  1. Fixed-Length Conversion: It converts a variable-length key (like a user's full name or a document's title) into a fixed-length value. This is vital for memory efficiency, as a 100-character document title, for example, is consistently represented by a short integer.
  2. Uniform Distribution: The function is designed to scramble the bits of the input key in such a way that the resulting hash values are distributed as uniformly as possible across the entire keyspace. This minimizes the likelihood of different inputs generating the same output, an event known as a collision (more on that later).
  3. Table Mapping: Finally, the hash value is typically mapped into an index that is within the size limits of a particular table or array. This is how we achieve the fast lookups associated with data structures like a hash map.

But what if two things get the same number?

Aha! Now you are getting to the tradeoff here. I can take anything and turn it into a number, but the number of possible sentences in the universe have to be greater than the number I can generate in a program right?

This is where collisions come in. Basically, whenever two data points get the same hash value, that is a collision. Like the last sentence I wrote and this other sentence:

sentence_3 = "My name is Zach, and I like Becky."
sentence_4 = "Lz mbme ht Ybch, bnc J kike Cfbkx."

hash_number_3 = hash_function(sentence_3)
hash_number_4 = hash_function(sentence_4)

print(hash_number_3) # prints "69718"
print(hash_number_4) # prints "69718"

Two different sentences, but the same number! 🤯

Is this a problem? Depends on the application! How we handle collisions and how often they happen is the knob that most of the upcoming Probabilistic Data Structures really play with.


Accuracy vs. Resources

I believe this is the most important concept to understand when talking about big data systems. The reason hash functions are so important to Probabilistic Data Structures is because they facilitate the biggest trade-off in scalable systems: exchanging perfect accuracy for massive resource savings.

Consider the volume of data that is processed on YouTube, for example. To keep track of every unique user ID for every video impression, for example, would require petabytes of storage. In that scenario, an exact, 100% accurate count is simply infeasible in real-time or too costly to maintain (don’t I know it 😢)

Probabilistic Data Structures, powered by efficient hashing, enable us to answer questions like: "Is this item likely in this set?" or "How many unique items have we seen, with a guaranteed error rate of less than 2%?" In many real-world applications—from web analytics to network traffic monitoring—a highly accurate approximation is not only sufficient but, sometimes, the only viable path to a real-time, scalable solution. 

For instance, the user id to video watch problem, I can track all the views on a video with a database of user IDs. If 10 million people watch a video, I am storing 80 MB of data for this 1 video. If I have 10 million videos like this, I am already at 800 TBs of data.

If I use something like a Bloom Filter, I can track all the users for 10 million videos with only 80 TB of memory and a 4% error rate (link). We are saving 90% of the data using a bloom filter!

Diagram of how the probability of error decreases as memory increases

If we say that a Terabyte costs 100$, we saved $72,000 with that change. That’s the power of hashes. We could also increase the memory we use to reduce the error rate, which is a knob that is at our disposal now!


What are some of your favorite examples of where an approximate answer, powered by a simple hash function, is significantly more valuable than a precise one? Share your thoughts below! Follow to read parts 2 and 3!