惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

H
Heimdal Security Blog
小众软件
小众软件
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
罗磊的独立博客
Google DeepMind News
Google DeepMind News
大猫的无限游戏
大猫的无限游戏
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
Hugging Face - Blog
Hugging Face - Blog
阮一峰的网络日志
阮一峰的网络日志
A
About on SuperTechFans
宝玉的分享
宝玉的分享
博客园 - 聂微东
月光博客
月光博客
Cyberwarzone
Cyberwarzone
Microsoft Security Blog
Microsoft Security Blog
V
Visual Studio Blog
Project Zero
Project Zero
T
Tor Project blog
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
L
LINUX DO - 最新话题
博客园 - 叶小钗
Recent Commits to openclaw:main
Recent Commits to openclaw:main
Attack and Defense Labs
Attack and Defense Labs
Spread Privacy
Spread Privacy
Forbes - Security
Forbes - Security
Simon Willison's Weblog
Simon Willison's Weblog
N
Netflix TechBlog - Medium
P
Proofpoint News Feed
Engineering at Meta
Engineering at Meta
Hacker News: Ask HN
Hacker News: Ask HN
I
InfoQ
M
MIT News - Artificial intelligence
AI
AI
博客园 - 三生石上(FineUI控件)
W
WeLiveSecurity
C
Check Point Blog
The Hacker News
The Hacker News
C
Cyber Attacks, Cyber Crime and Cyber Security
Application and Cybersecurity Blog
Application and Cybersecurity Blog
T
Tenable Blog
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
The Cloudflare Blog
Blog — PlanetScale
Blog — PlanetScale
美团技术团队
D
Darknet – Hacking Tools, Hacker News & Cyber Security
GbyAI
GbyAI
Hacker News - Newest:
Hacker News - Newest: "LLM"
腾讯CDC
K
Kaspersky official blog

Blog — PlanetScale

Keeping a Postgres queue healthy — PlanetScale Patterns for Postgres Traffic Control — PlanetScale Graceful degradation in Postgres — PlanetScale High memory usage in Postgres is good, actually — PlanetScale Stripe Projects partnership: Provision PlanetScale Postgres and MySQL databases from the Stripe CLI — PlanetScale Enhanced tagging in Postgres Query Insights — PlanetScale Behind the scenes: How Database Traffic Control works — PlanetScale Introducing Database Traffic Control — PlanetScale Scaling Postgres connections with PgBouncer — PlanetScale Drizzle joins PlanetScale — PlanetScale Video Conferencing with Postgres — PlanetScale Faster PlanetScale Postgres connections with Cloudflare Hyperdrive — PlanetScale Introducing the PlanetScale MCP server — PlanetScale Database Transactions — PlanetScale Automating our changelog with Cursor commands — PlanetScale Postgres 18 is now available — PlanetScale Using MotherDuck with PlanetScale — PlanetScale $50 PlanetScale Metal is GA for Postgres — PlanetScale AI-Powered Postgres index suggestions — PlanetScale $5 PlanetScale is live — PlanetScale Announcing Vitess 23 — PlanetScale $50 PlanetScale Metal — PlanetScale Report on our investigation of the 2025-10-20 incident in AWS us-east-1 — PlanetScale $5 PlanetScale — PlanetScale Benchmarking Postgres 17 vs 18 — PlanetScale Larger than RAM Vector Indexes for Relational Databases — PlanetScale Partnering with Cloudflare to bring you the fastest globally distributed applications — PlanetScale Processes and Threads — PlanetScale PlanetScale for Postgres is now GA — PlanetScale Postgres High Availability with CDC — PlanetScale Announcing Neki — PlanetScale Caching — PlanetScale The principles of extreme fault tolerance — PlanetScale Announcing PlanetScale for Postgres — PlanetScale Benchmarking Postgres — PlanetScale Announcing Vitess 22 — PlanetScale The Real Failure Rate of EBS — PlanetScale IO devices and latency — PlanetScale Announcing PlanetScale Metal — PlanetScale PlanetScale Metal: There’s no replacement for displacement — PlanetScale Upgrading Query Insights to Metal — PlanetScale Automating cherry-picks between OSS and private forks — PlanetScale Database Sharding — PlanetScale Anatomy of a Throttler, part 3 — PlanetScale Introducing sharding on PlanetScale with workflows — PlanetScale Announcing Vitess 21 — PlanetScale Announcing the PlanetScale vectors public beta — PlanetScale Anatomy of a Throttler, part 2 — PlanetScale Instant deploy requests — PlanetScale Anatomy of a Throttler, part 1 — PlanetScale Increase IOPS and throughput with sharding — PlanetScale Tracking index usage with Insights — PlanetScale Faster backups with sharding — PlanetScale Building data pipelines with Vitess — PlanetScale The State of Online Schema Migrations in MySQL — PlanetScale Optimizing aggregation in the Vitess query planner — PlanetScale Dealing with large tables — PlanetScale Announcing Vitess 20 — PlanetScale Self-managed Vitess vs Managed Vitess with PlanetScale — PlanetScale Achieving data consistency with the consistent lookup Vindex — PlanetScale The MySQL adaptive hash index — PlanetScale Introducing global replica credentials — PlanetScale Profiling memory usage in MySQL — PlanetScale Summer 2023: Fuzzing Vitess at PlanetScale — PlanetScale How PlanetScale makes schema changes — PlanetScale Identifying and profiling problematic MySQL queries — PlanetScale The Problem with Using a UUID Primary Key in MySQL — PlanetScale Announcing Vitess 19 — PlanetScale PlanetScale forever — PlanetScale Introducing schema recommendations — PlanetScale Amazon Aurora Pricing: The many surprising costs of running an Aurora database — PlanetScale Three common MySQL database design mistakes — PlanetScale OAuth applications are now available to everyone — PlanetScale Deprecating the Scaler plan — PlanetScale PlanetScale branching vs. Amazon Aurora blue/green deployments — PlanetScale Databases at scale — PlanetScale Considerations for building a database disaster recovery plan — PlanetScale Working with Geospatial Features in MySQL — PlanetScale PlanetScale vs Amazon Aurora replication — PlanetScale Introducing the Vantage and PlanetScale integration — PlanetScale MySQL isolation levels and how they work — PlanetScale Introducing the schemadiff command line tool — PlanetScale $ pscale ping — PlanetScale Announcing foreign key constraints support — PlanetScale The challenges of supporting foreign key constraints — PlanetScale What is HTAP? — PlanetScale Introducing Insights Anomalies — PlanetScale Webhook security: a hands-on guide — PlanetScale MySQL replication: Best practices and considerations — PlanetScale A guide to HTML email with Ruby on Rails and Tailwind CSS — PlanetScale Sharding for cost-effective database management — PlanetScale PlanetScale ranks 188th in Deloitte’s top 500 fastest-growing companies — PlanetScale Announcing the Fivetran integration — PlanetScale Introducing webhooks — PlanetScale What is MySQL replication and when should you use it? — PlanetScale Sync user data between Clerk and a PlanetScale MySQL database — PlanetScale Introducing database reports — PlanetScale Distributed caching systems and MySQL — PlanetScale What is MySQL partitioning? — PlanetScale MySQL High Availability: Connection handling and concurrency — PlanetScale
Generated Hash Columns in MySQL — PlanetScale
Aaron Francis · 2023-06-15 · via Blog — PlanetScale

Aaron Francis |

One of the hidden gems in the MySQL documentation is this note in section 8.3.6:

As an alternative to a composite index, you can introduce a column that is “hashed” based on information from other columns. If this column is short, reasonably unique, and indexed, it might be faster than a “wide” index on many columns.

We will build on this idea by creating generated hash columns for indexed lookups on large values and enforcing uniqueness across many columns. Instead of creating huge composite indexes, we'll index the compact generated hashes for fast lookups.

Before diving into generated hash columns, let's look at generated columns in general.

Generated columns in MySQL

A generated column can be considered a calculated, computed, or derived column. It is a column whose value results from an expression rather than direct data input. The expression can contain literal values, built-in functions, or references to other columns. The result of the expression must be scalar and deterministic.

To create a generated column, you give MySQL an expression, and then the database is in charge of populating the column with the result of that expression. The value in the generated column is always up to date and can never fall out of sync.

When creating a generated column, you can make it a STORED or VIRTUAL column. When using a STORED generated column, the value of the expression is written to disk as if it were a regular column. A VIRTUAL generated column is calculated at runtime every time. Performance metrics of VIRTUAL or STORED are highly dependent on the situation, but a good rule of thumb is that if it's expensive to calculate the value, store it.

The syntax for creating a generated column is:

col_name data_type [GENERATED ALWAYS] AS (expr)
  [VIRTUAL | STORED] [NOT NULL | NULL]
  [UNIQUE [KEY]] [[PRIMARY] KEY]
  [COMMENT 'string']

Let's create a simple circles table that declares the diameter as a regular column and then uses generated columns to calculate the radius and area columns according to their mathematical formulas:

CREATE TABLE circles (
  diameter DOUBLE,
  radius DOUBLE AS (diameter / 2),        -- Generated radius
  area DOUBLE AS (PI() * POW(radius, 2))  -- Generated area
);

Note

Both columns are VIRTUAL as that is the default.

Now if we insert 3 different diameters and then read the table back, we'll see that all columns are populated with values.

INSERT INTO circles (diameter) VALUES(1), (5), (10);

SELECT * FROM circles;

-- | diameter | radius | area               |
-- |----------|--------|--------------------|
-- |        1 |    0.5 | 0.7853981633974483 |
-- |        5 |    2.5 | 19.634954084936208 |
-- |       10 |      5 |  78.53981633974483 |

Now that we know how generated columns work, let's move on to making generated hash columns.

Generated hash columns for strict equality lookups

A generated hash column is nothing special. It's merely a generated column in which the value generated is a hash of something else. Usually, this is an MD5 hash, but it could be a CRC32, SHA256, or any other hash that suits your needs. It must be noted that, importantly, this technique has nothing to do with securely storing passwords or other sensitive information. Securely storing sensitive information is an entirely separate topic. We're using hashing functions to create very small, deterministic results for speedy lookups, not to protect information.

The first scenario where you might want a generated hash column is when you need to do a strict equality lookup on a value that is too large to add a B-tree index. Any B-tree indexes added to BLOB or TEXT columns require a prefix because they are too large to be indexed in their entirety. Again, from the MySQL documentation:

Rather than testing for equality against a very long text string, you can store a hash of the column value in a separate column, index that column, and test the hashed value in queries. (Use the MD5() or CRC32() function to produce the hash value.)

An example of a column that might need this is a url column. Since URLs can be incredibly long, you might store them in a TEXT column to adequately accommodate the full possible length. To perform a strict equality, indexed lookup on this column, we'll create a generated hash in an example table called visits.

CREATE TABLE visits (
  url TEXT,
  url_md5 CHAR(32) AS (MD5(url)),
  -- [other columns...]

  KEY(url_md5)
);

In this table, we've created the hash and put an index on that hash. If we were to look up the value for https://planetscale.com, we could take advantage of the index by looking up the hash instead of the URL:

SELECT * FROM visits WHERE url_md5 = MD5("https://planetscale.com");

-- | url                     | url_md5                          |
-- |-------------------------|----------------------------------|
-- | https://planetscale.com | fccb2478f30ecea53d4f2f294b5e891a |

Alternatively, you could calculate the hashed value on the application side instead of MySQL, resulting in the same outcome.

SELECT * FROM visits WHERE url_md5 = "fccb2478f30ecea53d4f2f294b5e891a";

-- | url                     | url_md5                          |
-- |-------------------------|----------------------------------|
-- | https://planetscale.com | fccb2478f30ecea53d4f2f294b5e891a |

I prefer to let MySQL do the hashing, making the queries more readable.

You may be worried about hash collisions using this strategy. While the odds of a randomly occurring MD5 collision are vanishingly small, they are not strictly zero. The odds are much higher if you use a different algorithm like CRC32.

To work around every possibility of a collision, no matter how unlikely, you can treat the MD5 hash as a redundant condition. A redundant condition is a condition that does not change the results of the query but does provide MySQL with the option to choose a different execution path, usually an index-assisted path.

Using our example from earlier, we'll add another condition on url alone to eliminate any potential hash collisions but still use the url_md5 index.

SELECT * FROM visits
  WHERE
  url_md5 = MD5("https://planetscale.com")  -- This allows MySQL to use the index
  and
  url = "https://planetscale.com"           -- This eliminates any MD5 hash collisions

-- | url                     | url_md5                          |
-- |-------------------------|----------------------------------|
-- | https://planetscale.com | fccb2478f30ecea53d4f2f294b5e891a |

We can make one more optimization with the current setup before we move on to hashing multiple columns.

Storing MD5 hashes in binary columns

Every character column in MySQL is defined with a character set and a collation. The character set determines what characters are allowed to go in the column, and the collation is a set of rules that determines how those characters compare to each other.

In our example above, we defined the url_md5 column as a CHAR(32) character column. We're not interested in the character aspect of the MD5 hash, i.e., we won't be sorting the hashes or doing partial string matches. We're only interested in strict equality checks against a sequence of bytes. Since that is true, storing the data in a character representation is unnecessary. Storing the value of the hash as a BINARY(16) is more efficient than storing it as a CHAR(32).

To store the values as a binary string of data, we need to use the UNHEX function to convert the characters into a sequence of bytes.

CREATE TABLE visits (
  url TEXT,
  url_md5 BINARY(16) AS (UNHEX(MD5(url))), -- Binary instead of character
  -- [other columns...]

  KEY(url_md5)
);

We need to carry this change through to the query itself as well, by using the UNHEX function there too:

SELECT * FROM visits WHERE url_md5 = UNHEX(MD5("https://planetscale.com"));

-- | url                     | url_md5                          |
-- |-------------------------|----------------------------------|
-- | https://planetscale.com | FCCB2478F30ECEA53D4F2F294B5E891A |

Note

Even though url_md5 is now a binary column, we still see the results as a string of characters! If this is the case for you, your client is likely using the --binary-as-hex option to give you a more human-readable display. You can use the --skip-binary-as-hex option to see the raw bytes.

We've reached the point of maximum performance for a single hash column. Let's look at the use cases for hashing multiple columns into one.

Generated hashes across multiple columns

The same basic principle applies to hashing multiple columns as hashing single columns: compact indexes can be faster than wide indexes. We can use the compact hash for strict equality searches or to enforce uniqueness across multiple columns at once.

Composite indexes in MySQL can consist of up to 16 columns, which is rarely necessary! Composite indexes have many advantages over the hashing strategy, primarily because each piece of the composite index can be used, provided you work left to right and don't skip any columns. (For an in-depth guide, see our video on composite indexes in MySQL.) However, a hash can be exactly what you're looking for in certain circumstances.

To create a composite hash, we first need to generate a concatenated string to hash. We'll do this using MySQL's CONCAT_WS function.

The WS in CONCAT_WS stands for "with separator," allowing us to define a separator to be used in between the column values. If we were to use the standard CONCAT function, we might end up with ambiguous concatenated values:

SELECT CONCAT("a", "b", "c"), CONCAT("ab", "c");

-- | CONCAT("a", "b", "c") | CONCAT("ab", "c") |
-- |-----------------------|-------------------|
-- | abc                   | abc               |

The CONCAT function also doesn't handle NULL values very elegantly, turning the result completely NULL.

SELECT CONCAT("a", "b", "c"), CONCAT("ab", NULL, "c")

-- | CONCAT("a", "b", "c") | CONCAT("ab", NULL, "c") |
-- |-----------------------|-------------------------|
-- | abc                   |                         |

The CONCAT_WS function solves both of these issues:

SELECT CONCAT_WS("|", "a", "b", "c"), CONCAT_WS("|", "ab", "c");

-- | CONCAT_WS("|", "a", "b", "c") | CONCAT_WS("|", "ab", "c") |
-- |-------------------------------|---------------------------|
-- | a|b|c                         | ab|c                      |

SELECT CONCAT_WS("|", "a", "b", "c"), CONCAT_WS("|", "ab", NULL, "c");

-- | CONCAT_WS("|", "a", "b", "c") | CONCAT_WS("|", "ab", NULL, "c") |
-- |-------------------------------|---------------------------------|
-- | a|b|c                         | ab|c                            |

Using the CONCAT_WS function, we can now create a composite, generated hash. Let's make an addresses table that stores CASS standardized U.S. addresses.

CREATE TABLE `addresses` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `primary_line` varchar(255),
  `secondary_line` varchar(255),
  `urbanization` varchar(255),
  `last_line` varchar(255),

  PRIMARY KEY (`id`)
)

Each standardized address consists of the primary and secondary lines, the urbanization, and the last line. Here is a random sampling of what this table looks like:

| id | primary_line          | secondary_line    | urbanization      | last_line               |
|----|-----------------------|-------------------|-------------------|-------------------------|
|  1 | 457 HAZELWOOD CV      | KATHLEEN LAMARRE  |                   | COPPELL TX 78759-2042   |
|  2 | 5322 MUSKET RDG       | MARK E STEKLE     |                   | AUSTIN TX 75019-2042    |
|  3 | M543 PASEO DEL MONTE  |                   | URB MONTE CLARO   | BAYAMON PR 00961-5806   |
|  4 | PO BOX 990            | 3-D SIGNS         |                   | CYPRESS TX 77410-0852   |
|  5 | 949 CALLE PORTO MAYOR | STE D3            | URB PORTOBELLO    | TOA ALTA PR 00953-5407  |
|  6 | 621 CARR 382          |                   | PARC SOLEDAD      | MAYAGUEZ PR 00682-7603  |
|  7 | 1602 CIMARRON TRL     | TEN C ENTERPRISES |                   | HURST TX 76053-3921     |
|  8 | 45C CALLE REINA       |                   | PARC CARMEN       | VEGA ALTA PR 00951-5806 |
|  9 | 1910 ROSEMONT ST      | CABILLO DANIEL    |                   | MESQUITE TX 75149-1549  |
| 10 | 110 N FM 3083 RD      | ATTN PROPERTY TAX |                   | CONROE TX 00738-1866    |
| 11 | C2 CALLE B            |                   | VILLAS DE CAPARRA | BAYAMON PR 00959-7605   |
| 12 | 18307 STEDMAN DR      | 3330 TRADITION    |                   | DALLAS TX 75252-5745    |
| 13 | 291 CALLE ZIRCONIA    |                   | URB COSTA BRAVA   | ISABELA PR 00662-6307   |
| 14 | 7206 INDIAN DIVIDE RD | FAMILY TRUST      |                   | SPICEWOOD TX 79169-1646 |
| 15 | 4992 CALLE HIGUERILLO |                   | URB FAJARDO GDNS  | FAJARDO PR 75149-3088   |

To prevent a duplicate normalized address from being added to the table, we're going to create a composite hashed index of all four address parts:

ALTER TABLE addresses ADD COLUMN address_hash BINARY(16) GENERATED ALWAYS AS (
  UNHEX(MD5(
    CONCAT_WS('|', primary_line, secondary_line, urbanization, last_line)
  ))
);

Now we have a compact hash that represents the entirety of the address in one discrete value:

| id | primary_line          | secondary_line    | urbanization      | last_line               | address_hash                     |
|----|-----------------------|-------------------|-------------------|-------------------------|----------------------------------|
|  1 | 457 HAZELWOOD CV      | KATHLEEN LAMARRE  |                   | COPPELL TX 78759-2042   | 25EE1343804992671067D73BDD50B27E |
|  2 | 5322 MUSKET RDG       | MARK E STEKLE     |                   | AUSTIN TX 75019-2042    | 5492C5B33E3F769FD1CAA5EBB2FF56F8 |
|  3 | M543 PASEO DEL MONTE  |                   | URB MONTE CLARO   | BAYAMON PR 00961-5806   | EEE4D086755BEAB36171054F08BA0436 |
|  4 | PO BOX 990            | 3-D SIGNS         |                   | CYPRESS TX 77410-0852   | F19757F95985A0F6105AA4E781F46E14 |
|  5 | 949 CALLE PORTO MAYOR | STE D3            | URB PORTOBELLO    | TOA ALTA PR 00953-5407  | 2ADBA74C2D742420C60CAE46D42FFA7D |
|  6 | 621 CARR 382          |                   | PARC SOLEDAD      | MAYAGUEZ PR 00682-7603  | 99033D25A9C3D0109A7370CE0885EB77 |
|  7 | 1602 CIMARRON TRL     | TEN C ENTERPRISES |                   | HURST TX 76053-3921     | 153FE5967FB62138DE23F518B3600C4A |
|  8 | 45C CALLE REINA       |                   | PARC CARMEN       | VEGA ALTA PR 00951-5806 | CCF810BCD560CD0142D09EFCC2349D7F |
|  9 | 1910 ROSEMONT ST      | CABILLO DANIEL    |                   | MESQUITE TX 75149-1549  | 48BA2E351F5C58A1B976CEC55ABDB270 |
| 10 | 110 N FM 3083 RD      | ATTN PROPERTY TAX |                   | CONROE TX 00738-1866    | F98D0C9EE23D25D9B40EC4EC59DAC776 |
| 11 | C2 CALLE B            |                   | VILLAS DE CAPARRA | BAYAMON PR 00959-7605   | 8577AF223E10CFF124574D97F8C35BF9 |
| 12 | 18307 STEDMAN DR      | 3330 TRADITION    |                   | DALLAS TX 75252-5745    | B082CA1BE2B30C6B9D2101565C5F5DF2 |
| 13 | 291 CALLE ZIRCONIA    |                   | URB COSTA BRAVA   | ISABELA PR 00662-6307   | 17101B8E6420B8FC93682B53BDDBD74D |
| 14 | 7206 INDIAN DIVIDE RD | FAMILY TRUST      |                   | SPICEWOOD TX 79169-1646 | 103B79EE14A330F5D2D79BA2764F738D |
| 15 | 4992 CALLE HIGUERILLO |                   | URB FAJARDO GDNS  | FAJARDO PR 75149-3088   | 75102EA91197F96E20242EEE0264F2CA |

To prevent duplicates from entering the table, we'll add a unique index to the generated column:

ALTER table addresses ADD UNIQUE INDEX (address_hash);

Again, if you're worried about hash collisions, you might use a different hashing algorithm like SHA256. Note that the column size must change to a BINARY(32).

ALTER TABLE addresses ADD COLUMN address_hash BINARY(32) GENERATED ALWAYS AS (
  UNHEX(SHA2(
    CONCAT_WS('|', primary_line, secondary_line, urbanization, last_line),
    256
  ))
);

Functional indexes in MySQL

All of the work we've done so far has been using generated columns to calculate the hashes for us. As of MySQL 8.0.13, we no longer have to use a generated column to create an index on the result of a function. We can use a functional index.

Functional indexes are implemented as virtual generated columns by MySQL, so there is no performance difference. It's merely a preference!

To create a functional index without a generated column, you can create an index as you usually would, but use a second set of parenthesis to denote that it's functional:

ALTER TABLE addresses ADD INDEX address_hash_functional ((  -- Note the two parentheses here!
  UNHEX(SHA2(
    CONCAT_WS('|', primary_line, secondary_line, urbanization, last_line),
    256
  ))
));

Whether you use the generated column or functional index is up to you! Regardless of which method you choose, hopefully this strategy can be helpful when building your applications!