FoundationsDatabasesmedium15 min read

Why We Split the Database

Servers are easy to clone because they're forgetful. The database is the opposite: it's the one thing that remembers everything, which is exactly why it's the hardest thing to scale. Splitting it is a series of trades, each buying capacity with a little correctness.

In the last explainer you learned to scale the app by running many forgetful servers behind a load balancer. So you did. You have ten servers now. And the site is still slow under load.

Here's why. All ten servers talk to the same single database. You cloned the easy part (the stateless workers) and left the hard part (the one machine that remembers everything) exactly as it was. Now ten servers are all queuing up at one database's door.

The database is hard to scale precisely because it holds state. You can't just run ten copies and call it a day, because then you'd have ten different versions of the truth. So scaling a database is its own journey, with its own order of moves. Let's walk it.

First, see what's actually slow

Before splitting anything, look at the traffic. Almost every app does far more reading than writing. Loading a feed, opening a profile, searching products — all reads. Posting, liking, buying — writes, and there are way fewer of them.

~90%
Reads
typical read-heavy app
~10%
Writes
the rarer, harder case

That imbalance is a gift, because reads are easy to copy and writes are not. It tells you where to push first.

Move one: add read replicas

You keep your main database (call it the primary) as the one place writes go. Then you add one or more copies of it, called read replicas. The primary continuously streams its changes to the replicas so they stay nearly up to date. Reads get spread across the replicas; writes still all go to the primary.

App serverswrites → primary · reads → a replica
Primarythe one place writes happen
Replica 1 · Replica 2 · Replica 3copies, serve reads only
Writes go to the primary. It copies changes to the replicas. Reads spread across the replicas.

Since reads are 90% of your traffic, spreading them across three replicas roughly cuts the busiest database's load to a fraction of what it was. For a read-heavy app this one move buys an enormous amount of headroom, and it needs no change to how your data is shaped.

But — and you can probably feel the pattern by now — it introduces a new problem.

The new problem: replication lag

Copying changes from the primary to the replicas isn't instant. It usually takes milliseconds, but under load it can stretch to seconds. In that gap, a replica is showing slightly old data. This delay is replication lag, and it produces a very specific, confusing bug.

What the user does

They edit their profile name and hit save. The write goes to the primary. The page reloads and reads from a replica that hasn't received the change yet. Their old name is back. They think the save failed, so they do it again.

Why it happens

The save genuinely worked — it's on the primary. The replica just hasn't caught up in the few hundred milliseconds since. This is called "read-your-own-writes" inconsistency: a user can't immediately see a change they just made.

The common fix is targeted: for the brief window right after a user writes something, read their data from the primary instead of a replica, so they always see their own latest change. Everyone else can keep reading from replicas, where being a second behind doesn't matter. You're not demanding the whole system be instantly consistent, just the one spot where staleness is visibly wrong.

DecisionAccept slightly stale reads on replicas in exchange for far more read capacity.

Most data is fine being a second out of date — nobody's harmed if a like count lags briefly. You trade that small staleness for the ability to serve many times more reads. The places where staleness genuinely matters (a user viewing their own just-saved change, a balance check before a payment) get routed to the primary. The skill is knowing which reads can be stale and which can't.

Before you shard: exhaust the cheaper moves

Replicas solved reads. But every write still goes to the single primary, and replicas don't help with that — they're copies, not extra write capacity. So the obvious next thought is "split the writes too." Resist it for as long as you can, because there's a lot of road left before you have to.

A bigger machine is the first answer, and it's not a joke answer. Modern hardware is enormous; a single well-tuned primary handles a write rate that would have needed a cluster a decade ago. Before that, fix what's actually slow: add the missing index, kill the query that scans the whole table, batch chatty writes, move rarely-changing computed data into a cache. Most "we need to shard" moments are really "we have one unindexed query" moments.

There's also a middle step between replicas and full sharding: functional partitioning (sometimes called federation). Instead of slicing one table across machines, you move whole tables to their own database — billing on one box, the product catalog on another, the activity log on a third. Each piece gets its own primary and its own write capacity, and you've gained a lot without the pain that comes from splitting a single table. The limit is that one table can still outgrow one machine, and a feature can't be moved to its own database if it's tangled into everything else.

When one primary still can't cope

Eventually, on a genuinely large system, two things force the harder move even after you've done all of the above:

  1. Too many writes. One table takes writes faster than any single machine can commit them.
  2. Too much data. A single table grows past what one machine can store or index efficiently, and queries slow down from sheer size.

You've already met the answer to "one machine isn't enough" — use more machines. But you can't just clone the primary, because writes would conflict. Instead you split the data itself across several databases, so each one holds a slice and handles the writes for that slice. This splitting is called sharding, and each slice is a shard.

How sharding works

You pick a shard key — a field you'll use to decide which database a row lives in. User ID is a common one. Then a rule sends each row to a shard based on that key.

Write for user 91
Shard rule
Shard B (users 50–99)
A row's shard key decides which database holds it. Each shard is a full, independent database for its slice of the data.

Now users 0–49 live on shard A, 50–99 on shard B, and so on. Each shard is a smaller, faster database handling only its slice of users and their writes. Add a fourth shard and you've added write capacity and storage. This is horizontal scaling again, applied to data instead of compute.

It works, and it's how the largest systems store their data. But sharding is the most expensive move in this whole sequence, and you should respect why.

What sharding takes away

Once data is split across shards, anything that needs to span shards gets hard. A query like "list all users sorted by signup date" now has to ask every shard and merge the results, instead of one quick query. A transaction touching two users on different shards can't be a simple all-or-nothing database transaction anymore. And picking the wrong shard key can leave one shard handling most of the traffic while others sit idle (a "hot shard"). Sharding trades a lot of convenience for capacity. Don't do it until a single primary genuinely can't cope.

The order, and the discipline

The whole journey, in the order the pain forces it:

  1. Ten servers, one database, still slow

    The servers were the easy part. The shared database is the real bottleneck.

  2. Reads are most of the load

    Add read replicas and spread reads across them. The primary handles writes. Huge, cheap win for a read-heavy app.

  3. Users don't see their own just-saved changes

    That's replication lag. Route a user's own reads to the primary right after they write; let everyone else read from replicas.

  4. Writes outgrow the primary, after the cheap fixes run out

    First a bigger machine, the missing index, functional partitioning by feature. Only when those are spent do you shard: split one table across databases by a shard key, each handling a slice. Powerful, and the costliest step — cross-shard queries and transactions get hard.

The discipline is the same as everywhere else in backend work: stop at the first move that solves your problem. A great many successful apps never shard at all, because one primary with a few read replicas carries them for years. Sharding is the move you make when you've earned the scale that demands it — not a day sooner.

The one idea to take away

Servers scale easily because they remember nothing; databases scale painfully because remembering everything is their entire job. So you split the database in stages, and every stage buys capacity by giving up a little consistency or convenience — stale replica reads first, then cross-shard simplicity. Knowing which trade you're making, and that you don't have to make all of them at once, is the whole skill.

Test yourself

Questions· say the answer out loud before you open it. If you can't, the chapter isn't done.

QWhy don't read replicas help with write-heavy load?+

Replicas are read-only copies of the primary. Every write still has to go to the primary first, which then streams the change out to the replicas. Adding replicas multiplies read capacity but does nothing for write capacity. When the primary can't keep up with the write rate, replicas are the wrong tool — you need to shard.

QWhat is replication lag and what bug does it cause?+

Replication lag is the short delay (usually milliseconds, sometimes seconds under load) between a write landing on the primary and that change reaching the replicas. It causes "read-your-own-writes" bugs: a user saves a change, the next read hits a replica that hasn't caught up, and they see their old data and think the save failed. The fix is to route a user's own reads to the primary briefly after they write.

QWhat is a shard key and why does choosing it well matter so much?+

A shard key is the field used to decide which shard a row lives on, like user ID. It matters because it determines how evenly load and data spread across shards. A bad key can route most traffic to one shard (a hot shard) while others idle, and it makes some queries impossible to answer without contacting every shard. The key essentially locks in how the data is divided, so it's hard to change later.

QWhy is a transaction across two shards harder than a transaction within one database?+

A normal database transaction is all-or-nothing within a single database, which can coordinate it internally. When the two rows live on different shards (different databases), no single database can guarantee both changes commit or both roll back together. You need distributed coordination (sagas, two-phase commit, or compensating actions), which is far more complex and a common reason to avoid sharding until you truly must.

QMost apps never need to shard. Why start with replicas instead?+

Because reads are usually ~90% of traffic, and replicas absorb reads with no change to how data is shaped — it's a cheap, low-risk win that buys years of headroom. Sharding adds write capacity and storage but breaks cross-shard queries and transactions and is hard to undo. You take the simple, reversible step first and only shard when a single primary genuinely can't handle the writes or hold the data.

Before you leave — how confident are you with this?

Your honest rating shapes when you'll see this again. No grades, no shame.

More deep dives

Comments

to join the discussion.

Loading comments…