惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

H
Heimdal Security Blog
小众软件
小众软件
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
罗磊的独立博客
Google DeepMind News
Google DeepMind News
大猫的无限游戏
大猫的无限游戏
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
Hugging Face - Blog
Hugging Face - Blog
阮一峰的网络日志
阮一峰的网络日志
A
About on SuperTechFans
宝玉的分享
宝玉的分享
博客园 - 聂微东
月光博客
月光博客
Cyberwarzone
Cyberwarzone
Microsoft Security Blog
Microsoft Security Blog
V
Visual Studio Blog
Project Zero
Project Zero
T
Tor Project blog
freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More
L
LINUX DO - 最新话题
博客园 - 叶小钗
Recent Commits to openclaw:main
Recent Commits to openclaw:main
Attack and Defense Labs
Attack and Defense Labs
Spread Privacy
Spread Privacy
Forbes - Security
Forbes - Security
Simon Willison's Weblog
Simon Willison's Weblog
N
Netflix TechBlog - Medium
P
Proofpoint News Feed
Engineering at Meta
Engineering at Meta
Hacker News: Ask HN
Hacker News: Ask HN
I
InfoQ
M
MIT News - Artificial intelligence
AI
AI
博客园 - 三生石上(FineUI控件)
W
WeLiveSecurity
C
Check Point Blog
The Hacker News
The Hacker News
C
Cyber Attacks, Cyber Crime and Cyber Security
Application and Cybersecurity Blog
Application and Cybersecurity Blog
T
Tenable Blog
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
The Cloudflare Blog
Blog — PlanetScale
Blog — PlanetScale
美团技术团队
D
Darknet – Hacking Tools, Hacker News & Cyber Security
GbyAI
GbyAI
Hacker News - Newest:
Hacker News - Newest: "LLM"
腾讯CDC
K
Kaspersky official blog

Blog — PlanetScale

Keeping a Postgres queue healthy — PlanetScale Patterns for Postgres Traffic Control — PlanetScale Graceful degradation in Postgres — PlanetScale High memory usage in Postgres is good, actually — PlanetScale Stripe Projects partnership: Provision PlanetScale Postgres and MySQL databases from the Stripe CLI — PlanetScale Enhanced tagging in Postgres Query Insights — PlanetScale Behind the scenes: How Database Traffic Control works — PlanetScale Introducing Database Traffic Control — PlanetScale Scaling Postgres connections with PgBouncer — PlanetScale Drizzle joins PlanetScale — PlanetScale Video Conferencing with Postgres — PlanetScale Faster PlanetScale Postgres connections with Cloudflare Hyperdrive — PlanetScale Introducing the PlanetScale MCP server — PlanetScale Database Transactions — PlanetScale Automating our changelog with Cursor commands — PlanetScale Postgres 18 is now available — PlanetScale Using MotherDuck with PlanetScale — PlanetScale $50 PlanetScale Metal is GA for Postgres — PlanetScale AI-Powered Postgres index suggestions — PlanetScale $5 PlanetScale is live — PlanetScale Announcing Vitess 23 — PlanetScale $50 PlanetScale Metal — PlanetScale Report on our investigation of the 2025-10-20 incident in AWS us-east-1 — PlanetScale $5 PlanetScale — PlanetScale Benchmarking Postgres 17 vs 18 — PlanetScale Larger than RAM Vector Indexes for Relational Databases — PlanetScale Partnering with Cloudflare to bring you the fastest globally distributed applications — PlanetScale Processes and Threads — PlanetScale PlanetScale for Postgres is now GA — PlanetScale Postgres High Availability with CDC — PlanetScale Announcing Neki — PlanetScale Caching — PlanetScale The principles of extreme fault tolerance — PlanetScale Announcing PlanetScale for Postgres — PlanetScale Benchmarking Postgres — PlanetScale Announcing Vitess 22 — PlanetScale The Real Failure Rate of EBS — PlanetScale IO devices and latency — PlanetScale Announcing PlanetScale Metal — PlanetScale PlanetScale Metal: There’s no replacement for displacement — PlanetScale Upgrading Query Insights to Metal — PlanetScale Automating cherry-picks between OSS and private forks — PlanetScale Database Sharding — PlanetScale Anatomy of a Throttler, part 3 — PlanetScale Introducing sharding on PlanetScale with workflows — PlanetScale Announcing Vitess 21 — PlanetScale Announcing the PlanetScale vectors public beta — PlanetScale Anatomy of a Throttler, part 2 — PlanetScale Instant deploy requests — PlanetScale Anatomy of a Throttler, part 1 — PlanetScale Increase IOPS and throughput with sharding — PlanetScale Tracking index usage with Insights — PlanetScale Faster backups with sharding — PlanetScale Building data pipelines with Vitess — PlanetScale The State of Online Schema Migrations in MySQL — PlanetScale Optimizing aggregation in the Vitess query planner — PlanetScale Dealing with large tables — PlanetScale Announcing Vitess 20 — PlanetScale Self-managed Vitess vs Managed Vitess with PlanetScale — PlanetScale Achieving data consistency with the consistent lookup Vindex — PlanetScale The MySQL adaptive hash index — PlanetScale Introducing global replica credentials — PlanetScale Profiling memory usage in MySQL — PlanetScale Summer 2023: Fuzzing Vitess at PlanetScale — PlanetScale How PlanetScale makes schema changes — PlanetScale Identifying and profiling problematic MySQL queries — PlanetScale The Problem with Using a UUID Primary Key in MySQL — PlanetScale Announcing Vitess 19 — PlanetScale PlanetScale forever — PlanetScale Introducing schema recommendations — PlanetScale Amazon Aurora Pricing: The many surprising costs of running an Aurora database — PlanetScale Three common MySQL database design mistakes — PlanetScale OAuth applications are now available to everyone — PlanetScale Deprecating the Scaler plan — PlanetScale PlanetScale branching vs. Amazon Aurora blue/green deployments — PlanetScale Databases at scale — PlanetScale Considerations for building a database disaster recovery plan — PlanetScale Working with Geospatial Features in MySQL — PlanetScale PlanetScale vs Amazon Aurora replication — PlanetScale Introducing the Vantage and PlanetScale integration — PlanetScale MySQL isolation levels and how they work — PlanetScale Introducing the schemadiff command line tool — PlanetScale $ pscale ping — PlanetScale Announcing foreign key constraints support — PlanetScale The challenges of supporting foreign key constraints — PlanetScale What is HTAP? — PlanetScale Introducing Insights Anomalies — PlanetScale Webhook security: a hands-on guide — PlanetScale MySQL replication: Best practices and considerations — PlanetScale A guide to HTML email with Ruby on Rails and Tailwind CSS — PlanetScale Sharding for cost-effective database management — PlanetScale PlanetScale ranks 188th in Deloitte’s top 500 fastest-growing companies — PlanetScale Announcing the Fivetran integration — PlanetScale Introducing webhooks — PlanetScale What is MySQL replication and when should you use it? — PlanetScale Sync user data between Clerk and a PlanetScale MySQL database — PlanetScale Introducing database reports — PlanetScale Distributed caching systems and MySQL — PlanetScale What is MySQL partitioning? — PlanetScale MySQL High Availability: Connection handling and concurrency — PlanetScale
What is database sharding and how does it work? — PlanetScale
Justin Gage · 2023-04-06 · via Blog — PlanetScale

Justin Gage |

What is sharding?

If you’ve used Google or YouTube, you’ve probably accessed sharded data.

Sharding is a strategy for scaling out your database by storing partitions of your data across multiple servers instead of putting everything on a single giant one. Each partition of data is called a shard. Splitting your database out into shards can help reduce the load on your database, leading to improved performance.

This post will help you understand exactly what database sharding is by walking through how sharding works, how to think about implementing your own sharded database, and some useful tools out there that can help, with a particular focus on MySQL and Postgres.

Sharding to scale out relational databases

Scene: you’ve upsized your MySQL on RDS instance for the 3rd time this quarter and your CFO just put 30 minutes on your calendar to “chat budget.” It might be time to scale out instead of scaling up! [1] Read replicas in RDS seem straightforward enough, but reading data is only half of the problem. What is an overwhelmed developer to do?

Sharding — a term that probably originally came from a video game — is how you scale out relational databases. You’ve probably seen this table before, about how scaling out helps you take this users table, all stored on a single server:

user_idfirst_namelast_nameemail...
ZpaDr20TTD4ZL7WmaPeterGibbonspeter@initech.net...
bI32htQ1PsEQioC7GBillLumberghbill@initech.net...
99J3x257SGP7J4IkFMiltonWaddamsstapler@initech.net...
0SH0pyi9bO5RM4I03Lawrencetwo@onetime.com
...............

And turn it into this users table, stored across 2 (or 1,000) servers:

user_idfirst_namelast_nameemailServer
ZpaDr20TTD4ZL7WmaPeterGibbonspeter@initech.netServer A
bI32htQ1PsEQioC7GBillLumberghbill@initech.netServer B
99J3x257SGP7J4IkFMiltonWaddamsstapler@initech.netServer B
0SH0pyi9bO5RM4I03Lawrencetwo@onetime.comServer A
............

But that’s only one type of sharding (row level, or horizontal). There are tons of different ways to split up your data across servers to best match how your business and data model works. Vertical sharding, for example, is when you split things at the schema or table level. More on this later!

Partitioning has existed – especially in OLAP setups – for a long time, primarily as a mechanism for improving query speed. Nightmares of sifting through HDFS partitions to find the missing snapshot pervade my sleep schedule... Anyway, sharding takes that concept and applies it to distributed systems: in addition to splitting up data into logical groups, let’s put those groups across multiple servers that talk to each other. Even Oracle does it!

For as long as relational databases have existed, they’ve been designed to run on a single server. Partially because of that, and partially because of fundamental laws of physics, sharding your data properly is, uh, not very easy.

How database sharding works under the hood

To shard your database, you’ll need to do a few things:

  1. Decide on a sharding scheme — What data gets split up, and how? How is it organized?
  2. Organize your target infrastructure — How many servers are you sharding to? How much data will be on each one?
  3. Create a routing layer — How does your application know where to store new data, and query existing data?
  4. Planning and executing the migration — How do you migrate from a single database to many with minimal downtime?

There’s no hard and fast playbook for each; everyone’s data model and business constraints are different. Let’s dive in.

Sharding schemes and algorithms

How you decide to split up your data into shards – also referred to as your partition strategy – should be a direct function of how your business runs, and where your query load is concentrated. For a B2B SaaS company where every user belongs to an organization, sharding by splitting up organization-level data probably makes sense. If you’re a consumer company, you may want to shard based on a random hash. Notion manually sharded their Postgres database by simply splitting on team ID. All of this is to say that sharding can be as simple or as complicated as you make it.

With that in mind, there are a few popular “algorithms” to decide which rows are stored together and on which servers:

  • Hash based sharding (also known as key based) – Take a value from the row, hash it, and send buckets of hashes to the same server. Whichever column you choose to hash is your shard key.
  • Range based sharding – Pick a column, create ranges, and allocate shards based on those ranges. Most useful for numerical columns that are (somewhat) randomly or evenly distributed.
  • Directory based sharding – Pick a column, allocate shards manually, and maintain a lookup table so you know where each row is stored.

If your sharding scheme isn’t random (e.g. hash based), you can begin to see why query profiling and understanding how your load is distributed can be useful.

Imagine you’re Amazon, and you want to shard your MySQL database that stores customer orders. On the surface, there seems to be no meaningful clustering: sure, you’ve got customers who order a lot of stuff, but that volume (and the associated reads during the shopping process) are basically random. It might make sense to use hash based sharding, and use the order ID as the shard key.

A big part of your sharding scheme is considering which tables are stored together. Joins across databases in a distributed system are difficult and costly, so ideally all of the data you need to answer a particular query exists on the same physical machine. For Amazon, that means the orders table and the products table containing the products in the orders table need to be physically colocated. This also requires incremental maintenance: if a customer makes a new order, the product data for that order needs to be included in the new shard so it can be read quickly later on.

Sharding maintenance is an oft underappreciated piece of scaling out your relational database. Depending on what your partition strategy is, you’ll likely end up with hotspots, where a particular server in your cluster is either storing too much data or handling too much throughput. In our Amazon example, it could be because a large business started ordering a metric-ton of stuff, and all of their data is on one server. Managing those hotspots, redistributing data and load, and reorganizing your partition strategy to prevent future issues is part of what you’re signing up for when you shard.

Deciding on what servers to use

With your sharding scheme set, it’s time to decide on how many machines you want to store data on, and how big you need them to be. There’s no formula here; this decision depends on your budget, projections for future database load, cloud provider, etc.

A common approach is maximizing flexibility. Start with a small number of hosts, and add more as needed. To maintain an even distribution of shards across your servers, you’ll need to re-balance every time you add a host. This is why companies like to choose a number of shards that’s divisible by a lot of smaller numbers; it allows you to scale out the number of servers incrementally while maintaining that smooth, even distribution.

Routing your sharded queries to the right databases

With your data distributed across multiple databases (imagine 20 of them), how does your application know which database to query? You need to build some sort of routing layer that decides. But how?

For those building sharding from scratch, the most common answer is in the application layer. You need to build logic into your application code that decides which database (and schema) to connect to for a particular query, conditional on the data inside that query and where it belongs in your sharding scheme. The logic looks something like:

if data.sharding_key in database_1.sharding_keys:
  …connect to database_1
else if data.sharding_key in database_2.sharding_keys:
  …connect to database_2

Depending on how you’ve partitioned your data and the number of physical machines / databases you’re working with, this logic can be relatively simple and stored in a JSON blob, config file, etc. More commonly, teams will use some sort of key value store or a lookup table in a database. The important thing is to have the information that ties a piece of data to its destination encoded somewhere so your application knows where to issue the query.

Building this for the first time is actually not that difficult; it’s the operational maintenance that becomes the real problem over time. If you move shards from database to database, rebalance, add new machines, remove machines, change any database properties…you’ll need to update that application logic to account for it. ProxySQL isn’t a full fledged solution for this, but it could be classified as a rough “shard routing” service.

Planning and executing your migration to a sharded solution

Once you’ve taken care of all of the above and have your physical servers running with empty databases on them, plus a plan for routing in your application logic, you’re faced with the age-old problem of how to migrate without (too much) downtime. Unlike a (potentially) more straightforward migration to a single new database provider, moving to sharding introduces a lot more things that can go wrong and in more ways.

Notion’s engineering team suggested a useful framework for thinking about the migration in their post about how they implemented sharding:

  1. Double-write: Incoming writes get applied to both the old and new databases.

  2. Backfill: Once double-writing has begun, migrate the old data to the new database.

  3. Verification: Ensure the integrity of data in the new database.

  4. Switch-over: Actually switch to the new database. This can be done incrementally, e.g. double-reads, then migrate all reads.

Each of these steps still introduces the possibility of downtime; it’s just a risk you’re going to have to take for changes at this scale.

Though many teams do build sharding for their database of choice from scratch, there is an ecosystem of tools, albeit perhaps less mature than the database software they’re built on.

Vitess

Vitess was built at YouTube when they needed to shard MySQL, and is now available to you and me. It’s basically a layer on top of MySQL that gives you sharding, and a lot of other neat stuff related to really big workloads: connection pooling, dynamic re-sharding and balancing, and monitoring tools, among other things. For a technical overview of how Vitess improves on vanilla MySQL, check out their comparison here.

As far as I’m aware, Vitess is the most mature and the most popular OSS sharding layer for a relational database. It served all YouTube DB traffic for years, and is in production at Slack, GitHub, NewRelic, Pinterest, Square, etc.

Note

PlanetScale offers fully-managed Vitess clusters. If you're looking for a pain-free sharding solution for your MySQL database, we can help. Contact us and we'll be in touch shortly.

Citus

Citus does what Vitess does for MySQL, but for Postgres (minus some more flashy features). It’s open source, designed as a Postgres extension, and can be run as a single node or several. It’s in production at Algolia, Heap, Cisco, and a few more. Their docs have good general advice for picking your sharding scheme, Citus or otherwise.

If you don't need sharding, but are interested in ultra fast NVMe-backed instances, check out PlanetScale for Postgres.

The serverless database wave

I suppose the more fundamental question is: why are you not using a database that does sharding for you? Over the past few years the so-called “serverless” database has gotten a lot more traction. Starting with the infamous Spanner paper, many have been thinking about how running a distributed system should be native to the database itself, the foremost of which has been CockroachDB. You can even run cloud Spanner on GCP.

You’re reading this blog on the PlanetScale website. They sell a shard-native (did I just coin this?) database built on MySQL and Vitess. I’m not a PlanetScale employee, but I am a big proponent of what they’re doing, specifically shifting the focus in databases to developer experience instead of infrastructure maintenance.

The question is starting to become: if you’re paying someone like AWS to run your database for you, why are you busy figuring out how to scale out that database? And I think that’s a good question the major cloud providers should be asking themselves.

References

[1] There is no shortage of opinion pieces on the web arguing against premature sharding. This post assumes an educated reader who can judge when scaling out is the right decision and when it isn’t.

FAQs

What is database sharding?

Database sharding is a strategy for scaling out a database by splitting its data across multiple servers instead of storing everything on one. Each chunk of data is called a shard. Rather than upgrading to a bigger single server (scaling up), sharding lets you add more servers (scaling out) to handle more data and traffic.

What's the difference between database sharding and replication?

Sharding and replication are complementary but distinct strategies. Sharding splits data across multiple servers so each server holds only a subset of the total data, reducing write load and storage pressure. Replication copies the same data to multiple servers, improving read performance and redundancy. Many production systems use both: sharded databases are often replicated within each shard for high availability.

When should you not shard your database?

Sharding can add significant operational complexity — you're responsible for choosing a sharding scheme, building a routing layer, managing hotspots, and executing a careful migration. It's generally not the first option you should pursue if your database can still scale vertically (bigger machines), if read replicas solve your bottleneck, or if your data model relies heavily on cross-table joins (which become expensive across shards).

That said, there are managed databases like PlanetScale (Vitess and Neki) that handle sharding at a proxy layer, transparent to your application code. This significantly decreases the complexity of operating a sharded database and are worth considering before implementing app-level sharding yourself.