惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

H
Help Net Security
Scott Helme
Scott Helme
爱范儿
爱范儿
WordPress大学
WordPress大学
博客园 - 三生石上(FineUI控件)
阮一峰的网络日志
阮一峰的网络日志
博客园 - Franky
V
V2EX
腾讯CDC
博客园_首页
博客园 - 司徒正美
酷 壳 – CoolShell
酷 壳 – CoolShell
T
Tailwind CSS Blog
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
小众软件
小众软件
J
Java Code Geeks
大猫的无限游戏
大猫的无限游戏
月光博客
月光博客
Microsoft Azure Blog
Microsoft Azure Blog
B
Blog
雷峰网
雷峰网
Stack Overflow Blog
Stack Overflow Blog
IT之家
IT之家
罗磊的独立博客
Recorded Future
Recorded Future
博客园 - 聂微东
O
OpenAI News
S
Secure Thoughts
Hacker News: Ask HN
Hacker News: Ask HN
S
Schneier on Security
Hacker News - Newest:
Hacker News - Newest: "LLM"
Y
Y Combinator Blog
C
Cyber Attacks, Cyber Crime and Cyber Security
Project Zero
Project Zero
宝玉的分享
宝玉的分享
K
Kaspersky official blog
N
Netflix TechBlog - Medium
T
The Exploit Database - CXSecurity.com
Google Online Security Blog
Google Online Security Blog
cs.CL updates on arXiv.org
cs.CL updates on arXiv.org
cs.CV updates on arXiv.org
cs.CV updates on arXiv.org
Webroot Blog
Webroot Blog
云风的 BLOG
云风的 BLOG
Simon Willison's Weblog
Simon Willison's Weblog
C
Check Point Blog
D
Darknet – Hacking Tools, Hacker News & Cyber Security
L
LINUX DO - 热门话题
美团技术团队
L
Lohrmann on Cybersecurity

Arpit Bhayani

Temporal Primer - Building Long-Running Systems What Matters in Production RAG Structure of Every LLM Chat How LLMs Really Work Your Monolith Is Already A Distributed System Databases Were Not Designed For This BM25 JOIN Algorithms Venting at Work Comes at a Reputation Cost Why Half Your Skills Expire Every Few Years Multi-Paxos - Consensus in Distributed Databases MySQL Replication Internals Bloom Filters When You Increase Kafka Partitions Product Quantization The Q, K, V Matrices The Day I Accidentally Deleted Production How LLM Inference Works What are Blocking Queues and Why We Need Them Heartbeats in Distributed Systems How Writes Work in Apache Cassandra Redis Replication Internals How to Handle Arrogant Colleagues at Work How Does a CDN Handle Content Replication You Can't Fix Everything on Day One When Emotions Spill Over at Work Why gRPC Uses HTTP2 Meetings With No Agenda Are a Waste of Time Career Longevity Beats Constant Job Hopping Stay Relevant at Higher Salary Levels Why Distributed Systems Need Consensus Algorithms Like Raft Why Do Databases Deadlock and How Do They Resolve It Why and How Cache Locality Can Make Your Code Faster Why Eventual Consistency is Preferred in Distributed Systems Why does DNS use both UDP and TCP Should You Do a Master's My Honest Take Empathy Makes Great Engineers Unstoppable Good Mentors Build People, Not Just Skills Why You Should Always Have Back-Burner Projects Before You Push Back, Know What You're Standing On Be the One They Can Count On How Much Are People Willing to Bet on You How to Get Leadership to Say Yes to Your Project Don't Let Your Best Ideas Die in Silence Be the Person Everyone Wants to Work With The XY Problem and How to Avoid It The Startup Hiring Lie Nobody Talks About You Won't Be Promoted Unless You Ask It's Not Enough to be Right; Learn to be Heard No One Ships Great Software Alone You Don't Win by Proving Others Wrong Appreciate Generously; It Costs Nothing, But Builds Everything Your Soft Skills Aren't Soft at All Before you form an opinion, experience it Why You Need Both Curiosity and Action to Thrive A Daily Worklog Changed Everything How We Handle Mistakes Defines Us Own Your Mistakes Don't Wait. Step Up. Temporary Fixes Are Permanent Why Interviews Are Biased And What Sets You Apart Saying 'This isn't my problem' is actually the problem How to Write Effective OKRs Never Lose a Battle due to Miscommunication When In Doubt, Code It Out How to Follow Up Without Annoying People Lead Projects That Land, Execution Over Everything Abstract Thinking Will Define Your Next Decade We Engineers Suck at Task Estimation Shiny Obect Syndrome in Tech When to Change Jobs - The 3P Framework Comfort and Competition - Know When to Switch Gears Paper Notes - On-demand Container Loading in AWS Lambda Paper Notes - SQL Has Problems. We Can Fix Them Pipe Syntax In SQL Paper Notes - NanoLog - A Nanosecond Scale Logging System Don't Wait, Learn - The Best Resource is Mythical Paper Notes - WTF - The Who to Follow Service at Twitter The Unexpected Benefit of Reading Random Engineering Articles Roadmaps Are Limiting Your Growth Stop Leaving Money on the Table - Negotiate Your Job Offer Never Bad-Mouth Your Past Employers Show You're a Culture Fit Quantify your resume, Know Your Numbers The Importance of Being Likeable in Interviews Questions to Ask Your Interviewer How to Build Trust Through Collaboration Do This, Once You Are Out of the Interview Cycle Stop Pitching Ideas, Start Pitching Projects Read Those Design Docs, Even the Ones That Seem Irrelevant The Best Engineering Lessons Happen During Outages Great Engineers Start Broad LLM Summaries are Ruining Your Learning Turn System Design Interviews into Discussions Title Inflation At Work, Find Your Own Projects 6 Simple Strategies to Cracking Any Tech Interview How to Remain Unblocked Solving the Knapsack Problem with Evolutionary Algorithms Generating Pseudorandom Numbers with LFSR Local vs Global Indexes in Partitioned Databases
MinHash - Fast Jaccard Similarity at Scale
Arpit Bhayani · 2020-11-08 · via Arpit Bhayani

Set similarity measure finds its application spanning the Computer Science spectrum; some applications being - user segmentation, finding near-duplicate webpages/documents, clustering, recommendation generation, sequence alignment, and many more. In this essay, we take a detailed look into a set-similarity measure called - Jaccard’s Similarity Coefficient and how its computation can be optimized using a neat technique called MinHash.

Jaccard Similarity Coefficient

Jaccard Similarity Coefficient quantifies how similar two finite sets really are and is defined as the size of their intersection divided by the size of their union. This similarity measure is very intuitive and we can clearly see that it is a real-valued measure bounded in the interval [0, 1].

https://user-images.githubusercontent.com/4745789/98461673-302d7180-21d4-11eb-9722-41f473c1fe84.png

The coefficient is 0 when the two sets are mutually exclusive (disjoint) and it is 1 when the sets are equal. Below we see the one-line python function that computes this similarity measure.

def similarity_jaccard(a: set, b: set) -> float:
    return len(a.intersection(b)) / len(a.union(b))

Jaccard Similarity Coefficient as Probability

Jaccard Coefficient can also be interpreted as the probability that an element picked at random from the universal set U is present in both sets A and B.

https://user-images.githubusercontent.com/4745789/98462221-8dc3bd00-21d8-11eb-95bf-5a9267e88b97.png

Another analogy for this probability is the chances of throwing a dart and it hitting the intersection. Thus we see how we can transform the Jaccard Similarity Coefficient into a simple probability statement. This will come in very handy when we try to optimize the computation at scale.

Problem at Scale

Computing Jaccard Similarity Coefficient is very simple, all we require is a union operation and an intersection operation on the participating sets. But these computations go haywire when things run at scale.

Computing set similarity is usually a subproblem fitting in a bigger picture, for example, near-duplicate detection which finds near-duplicate articles across millions of documents. When we tokenize the documents and apply raw Jaccard Similarity Coefficient for every two combinations of documents we find that the computation will take years.

Instead of finding the true value for this coefficient, we can rely on an approximation if we can get a considerable speedup and this is where a technique called MinHash fits well.

MinHash

MinHash algorithm gives us a fast approximation to the Jaccard Similarity Coefficient between any two finite sets. Instead of computing the unions and the intersections every single time, this method once creates MinHash Signature for each set and use it to approximate the coefficient.

Computing single MinHash

MinHash h of the set S is the index of the first element, from a permuted Universal Set, that is present in the set S. But since permutation is a computation heavy operation especially for large sets we use a hashing/mapping function that typically reorders the elements using simple math operation. One such hashing function is

https://user-images.githubusercontent.com/4745789/98463097-f7df6080-21de-11eb-8b61-a84ff7ad85de.png

If u is the total number of elements in the Universal Set U then a and b are the random integers less than u and c is the prime number slightly higher than u. A sample permute function could be

def permute_fn(x: int) -> int:
    return (23 * x + 67) % 199

Now that we have defined permutation as a simple mathematical operation that spits out the new row index, we can find MinHash of a set as the element that has the minimum new row number. Hence we can define the MinHash function as

def minhash(s: set) -> int:
    return min([permute_fn(e) for e in s])

A surprising property of MinHash

MinHash has a surprising property, according to which, the probability that the MinHash of random permutation produces the same value for the two sets equals the Jaccard Similarity Coefficient of those sets.

https://user-images.githubusercontent.com/4745789/98463732-8229c380-21e3-11eb-9b26-04ec08bc8753.png

The above equality holds true because the probability of MinHash of two sets to be the same is the number of elements present in both the sets divided by the total number of elements in both the sets combined; which in fact is the definition of Jaccard Similarity Coefficient.

Hence to approximate Similarity Coefficient using MinHash all we have to do is find the Probability of MinHash of two sets to be the same, and this is where the MinHash Signature comes in to play.

MinHash Signature

MinHash Signature of a set S is a collection of k MinHash values corresponding to k different MinHash functions. The size k depends on the error tolerance, keeping it higher leads to more accurate approximations.

def minhash_signature(s: set):
    return [minhash(s) for minhash in minhash_fns]

MinHash functions usually differ in the permutation parameters i.e. coefficients a, b and c.

Now in order to compute Pr[h(A) = h(B)] we have to compare the MinHash Signature of the participating sets A and B and find how many values in their signatures match; dividing this number by the number of hash functions k will give the required probability and in turn an approximation of Jaccard Similarity Coefficient.

def similarity_minhash(a: set, b: set) -> float:
    sign_a = minhash_signature(a)
    sign_b = minhash_signature(b)
    return sum([1 for a, b in zip(sign_a, sign_b) if a == b]) / len(sign_a)

MinHash Signature could well be computed just once per set.

Thus to compute set similarity, we need not perform heavy computation like Union and Intersection and that too across millions of sets at scale, rather we can simply compare k items of in their signatures and get a fairly good estimate of it.

How good is the estimate?

In order to find how close the estimate is we compute the Jaccard Similarity Coefficient and its approximate using MinHash on two disjoint sets having equal cardinality. One of the sets will undergo a transition where one element of it will be replaced with one element of the other set. So with time, the sets will go from disjoint to being equal.

https://user-images.githubusercontent.com/4745789/98465023-860e1380-21ec-11eb-8813-7cb6920bc1fd.png

The illustration above shows the two plots and we can clearly see that the MinHash technique provides a fairly good estimate of Jaccard Similarity Coefficient with much fewer computations.

References