Surviving the IRCTC Tatkal Rush

Every morning at 10:00 AM, the Tatkal window opens and millions of people try to book a few thousand seats in the same minute. This is the mirror image of live streaming. There, everyone reads the same data, and caching saves you. Here, everyone writes to the same tiny set of rows, and caching can't help you at all, because the one thing you must guarantee is that two people never walk away with the same seat.

What we're building

Functional · what it does

Search trains and see live seat availability
Hold a seat while the user pays
Confirm the booking on successful payment
Release the hold if payment fails or times out
Show a fair queue position during the rush

Non-functional · what it must survive

Never oversell a seat, ever
Survive millions of attempts in a 60-second window
A failed payment must not lose the user's money
Stay responsive (or honestly say "busy") under load
Be fair: first come should mean first served

The defining constraint is correctness under contention. A streaming glitch annoys someone. A double-booked seat means a person standing on a train with a valid ticket and no seat, followed by a refund fight. Correctness wins over throughput here, every time.

The shape of the load

The cliff: ~10:00:00 (near-zero, then everything)
Attempts / min: millions (for thousands of seats)
Demand to supply: 1000:1+ (most users will fail)
The bottleneck: hot rows (one train, one quota)

Two facts shape everything. First, the contention is concentrated on a handful of rows: one popular train's seat inventory. Second, the vast majority of attempts must fail, because there simply aren't enough seats. A good design fails the losers fast and cheaply, and protects the few winners' transactions absolutely.

The core problem: don't oversell

Naively, booking looks like "read available count, if greater than zero subtract one." That's a classic race. Two requests both read "1 seat left," both think they won, both subtract, and now you've sold the same seat twice.

There are two correct ways to handle this, and you should know both.

Pessimistic locking

Lock the row before touching it: SELECT ... FOR UPDATE. Other writers wait. Reliable and simple to reason about, but the lock serialises everyone through one row, which limits throughput and can pile up waiters under a rush.

Optimistic / atomic update

Don't lock ahead of time. Do the decrement as one conditional statement and let the database's row lock be the gate: UPDATE inventory SET available = available - 1 WHERE id = ? AND available >= 1. If it updates zero rows, you lost. Short locks, high throughput.

That conditional UPDATE is the single most important line in the whole system. The WHERE available >= 1 guard is what rules out overselling: the database holds a brief row lock for each statement, so the checks are serialised and at most as many updates succeed as there are seats available.
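To see the race and the fix side by side, here is a minimal sketch in Python against an in-memory SQLite database. The table layout, the function names (make_db, naive_read, naive_write, atomic_book) and the single inventory row with id 1 are assumptions made for illustration. The bad interleaving is forced by hand rather than produced by real concurrency, so the result is the same on every run.

```python
import sqlite3


def make_db(seats: int) -> sqlite3.Connection:
    """Create a throwaway inventory table with one row (id = 1)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory (id INTEGER PRIMARY KEY, available INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO inventory (id, available) VALUES (1, ?)", (seats,))
    conn.commit()
    return conn


def naive_read(conn: sqlite3.Connection) -> int:
    """Step 1 of the naive flow: read the count into application memory."""
    return conn.execute("SELECT available FROM inventory WHERE id = 1").fetchone()[0]


def naive_write(conn: sqlite3.Connection, seen: int) -> bool:
    """Step 2 of the naive flow: decide from the (possibly stale) value read earlier."""
    if seen < 1:
        return False
    conn.execute("UPDATE inventory SET available = ? WHERE id = 1", (seen - 1,))
    conn.commit()
    return True


def atomic_book(conn: sqlite3.Connection) -> bool:
    """Check and decrement in one statement; zero rows updated means you lost."""
    cur = conn.execute(
        "UPDATE inventory SET available = available - 1 "
        "WHERE id = 1 AND available >= 1"
    )
    conn.commit()
    return cur.rowcount == 1


# Naive version: both requests read "1 seat left" before either one writes.
naive_db = make_db(seats=1)
seen_a = naive_read(naive_db)
seen_b = naive_read(naive_db)
naive_winners = [naive_write(naive_db, seen_a), naive_write(naive_db, seen_b)]

# Atomic version: the same two requests, but the check lives inside the UPDATE.
atomic_db = make_db(seats=1)
atomic_winners = [atomic_book(atomic_db), atomic_book(atomic_db)]

print(naive_winners)   # [True, True]  -> one seat, two "winners"
print(atomic_winners)  # [True, False] -> one seat, one winner
```

The gap in the naive version is the time between naive_read and naive_write. The atomic version has no such gap, because the check and the decrement are the same statement.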


Let the database be the referee

Don't try to coordinate seat allocation in your application code with counters in memory across many servers. That's a distributed-consensus problem you'll get wrong. Push the decision into a single atomic statement on a single row and let the database's locking do the hard part. The database is built for exactly this.

Hold, then confirm: the two-phase booking

A user needs time to pay, but you can't give away the seat for free while they fumble with a UPI PIN, and you can't let them hold it forever. So booking is two phases:

Hold. The atomic decrement reserves a seat and creates a hold with a short expiry (a few minutes). The seat is now neither available nor confirmed.
Pay. The user completes payment. This is a separate, slow, external step that can fail, time out, or succeed after the user gave up.
Confirm or release. On success, the hold becomes a confirmed booking. On failure or expiry, a reaper releases the hold and returns the seat to the pool.

The expiring hold is what keeps the system honest. A user who abandons checkout doesn't lock a seat forever, and a seat is never given to two people because it's removed from the available pool the instant it's held. The reaper that releases expired holds is not optional; it's the safety valve that prevents inventory from slowly leaking away into dead holds.
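The hold lifecycle can be sketched as a small in-memory model. This is an illustration under assumptions, not a production design: the Inventory class, the 300-second HOLD_TTL_SECONDS, and the explicit now argument are all invented here. In a real system hold() would be the conditional UPDATE plus an insert into a holds table, inside one short transaction.

```python
from dataclasses import dataclass, field

HOLD_TTL_SECONDS = 300.0  # assumed value; the text only says "a few minutes"


@dataclass
class Inventory:
    available: int
    holds: dict = field(default_factory=dict)    # hold_id -> expiry timestamp
    confirmed: set = field(default_factory=set)  # hold_ids that became bookings

    def hold(self, hold_id: str, now: float) -> bool:
        """Phase 1: take a seat out of the pool and start the expiry clock."""
        if self.available < 1:
            return False
        self.available -= 1
        self.holds[hold_id] = now + HOLD_TTL_SECONDS
        return True

    def confirm(self, hold_id: str, now: float) -> bool:
        """Phase 3a: only a live, unexpired hold can become a booking."""
        expiry = self.holds.get(hold_id)
        if expiry is None or expiry <= now:
            return False  # unknown or expired; the reaper cleans up expired holds
        del self.holds[hold_id]
        self.confirmed.add(hold_id)
        return True

    def reap(self, now: float) -> int:
        """Phase 3b: return seats from expired holds to the pool."""
        expired = [h for h, expiry in self.holds.items() if expiry <= now]
        for hold_id in expired:
            del self.holds[hold_id]
            self.available += 1
        return len(expired)


inv = Inventory(available=1)
print(inv.hold("user-a", now=0.0))      # True: seat leaves the pool
print(inv.hold("user-b", now=1.0))      # False: nothing left to hold
print(inv.reap(now=300.0))              # 1: user-a never paid, seat returns
print(inv.hold("user-b", now=301.0))    # True
print(inv.confirm("user-b", now=310.0)) # True
```

Without reap(), the abandoned hold for user-a would keep the seat out of the pool indefinitely, which is the slow inventory leak described above.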

Payments must be idempotent

Payment is the riskiest step because it's slow and external. The user's network drops after they pay but before they see confirmation. They retry. Now you risk charging twice or, worse, taking their money and confirming nothing.

The fix is the idempotency key from the API chapter: every booking attempt carries a unique key, the payment is recorded against it, and a retry with the same key returns the original result instead of charging again. Pair this with a clear state machine for each booking (held → paid → confirmed, or held → expired → released) so that no matter how many times a request is retried or how it interleaves with the reaper, the booking lands in exactly one valid state.
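Both halves of that fix can be sketched in a few lines. The PaymentLedger class, the key format, and the amount_paise field are hypothetical names for illustration, and the dictionary stands in for a durable table with a unique constraint on the key. In a real system the "look up or record" step must itself be atomic.

```python
# Allowed transitions, taken from the state machine described in the text.
ALLOWED = {
    "held": {"paid", "expired"},
    "paid": {"confirmed"},
    "expired": {"released"},
    "confirmed": set(),
    "released": set(),
}


def transition(current: str, target: str) -> str:
    """Move a booking to a new state, refusing anything the state machine forbids."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target


class PaymentLedger:
    """Records each charge against its idempotency key (in memory, for illustration)."""

    def __init__(self) -> None:
        self._by_key = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_paise: int) -> dict:
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            return existing  # a retry: return the original result, charge nothing
        self.charges_made += 1  # stands in for the real call to the payment provider
        result = {"status": "paid", "amount_paise": amount_paise}
        self._by_key[idempotency_key] = result
        return result


ledger = PaymentLedger()
first = ledger.charge("booking-123", 150000)
retry = ledger.charge("booking-123", 150000)  # network dropped, user retried
print(ledger.charges_made)  # 1

state = transition("held", "paid")
state = transition(state, "confirmed")
print(state)  # confirmed
```

A late reaper that tries to expire an already paid booking would hit transition("paid", "expired") and be rejected, which is how the state machine keeps retries and the reaper from stepping on each other.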

The money-but-no-seat failure

The nightmare case: payment succeeds but confirmation fails (the app crashed, the hold expired in between). The user is charged with no ticket. You must reconcile this. Either confirm-after-pay is driven by a durable workflow that retries until the booking is confirmed or the payment is refunded, or a reconciliation job continuously matches successful payments against confirmed bookings and auto-refunds the orphans. Never leave it to chance.
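The reconciliation-job variant might look roughly like the sketch below. The record shapes (payment_id, booking_id, status) and the refund callback are assumptions for illustration. A production job would likely also wait out a grace period so it doesn't refund a confirmation that is still in flight.

```python
from typing import Callable, Iterable


def find_orphans(payments: Iterable[dict], confirmed_booking_ids: set) -> list:
    """Successful payments whose booking never reached the confirmed state."""
    return [
        p for p in payments
        if p["status"] == "succeeded" and p["booking_id"] not in confirmed_booking_ids
    ]


def reconcile(
    payments: Iterable[dict],
    confirmed_booking_ids: set,
    refund: Callable[[str], None],
) -> list:
    """Refund every orphaned payment and report which ones were refunded."""
    refunded = []
    for payment in find_orphans(payments, confirmed_booking_ids):
        refund(payment["payment_id"])  # should be idempotent: the job runs repeatedly
        refunded.append(payment["payment_id"])
    return refunded


payments = [
    {"payment_id": "p1", "booking_id": "b1", "status": "succeeded"},
    {"payment_id": "p2", "booking_id": "b2", "status": "succeeded"},  # charged, no ticket
    {"payment_id": "p3", "booking_id": "b3", "status": "failed"},
]
confirmed = {"b1"}

refund_log = []
print(reconcile(payments, confirmed, refund_log.append))  # ['p2']
```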

Handling the 10:00 AM cliff

Correctness handles double-booking. It does nothing for the stampede of millions of requests hitting your servers in one second. For that you need to control how many requests even reach the booking logic.

The answer is a virtual waiting room. When the rush hits, you don't try to serve everyone at once. You admit users into the actual booking system at a controlled rate and hold the rest in a queue with a visible position. This does three things: it protects the database from a load it can't survive, it makes failure honest ("you're number 40,000 in line") instead of a spinning page, and it preserves fairness by admitting roughly in arrival order.

Decision: Put a queue in front and admit users at a sustainable rate.

Letting all the load hit the database means it falls over and nobody gets a seat, including the people who would have won. A queue that admits, say, a few thousand users a second keeps the core system inside its safe operating range, so the seats that exist actually get sold. The cost is that most users wait and then learn they didn't get a seat. That's the honest truth of 1000:1 demand, surfaced instead of hidden.
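The admission mechanics can be sketched with a plain FIFO queue. Everything here is a simplification under assumptions: the WaitingRoom class and the admit_per_tick parameter are invented for illustration, and a real waiting room would live in a shared store in front of the booking servers, not in one process's memory.

```python
from collections import deque
from typing import Optional


class WaitingRoom:
    """FIFO queue that releases a fixed number of users per tick."""

    def __init__(self, admit_per_tick: int) -> None:
        self.admit_per_tick = admit_per_tick
        self._queue = deque()

    def join(self, user_id: str) -> int:
        """Add a user to the back of the line and return their 1-based position."""
        self._queue.append(user_id)
        return len(self._queue)

    def position(self, user_id: str) -> Optional[int]:
        """Current 1-based position, or None if the user is not waiting."""
        try:
            return self._queue.index(user_id) + 1  # O(n); fine for a sketch
        except ValueError:
            return None

    def admit_batch(self) -> list:
        """Called once per tick: let the next few users into the booking core."""
        count = min(self.admit_per_tick, len(self._queue))
        return [self._queue.popleft() for _ in range(count)]


room = WaitingRoom(admit_per_tick=2)
for user in ["a", "b", "c", "d", "e"]:
    room.join(user)

print(room.admit_batch())  # ['a', 'b']  arrival order is preserved
print(room.position("e"))  # 3           the honest "you're number N in line"
```

The booking core only ever sees admit_per_tick users per tick, no matter how many joined the queue.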

You also reject as early as possible. If a train is sold out, that fact can be cached and served from the edge, so "sold out" requests never reach the booking core at all. Only requests that might actually succeed should spend the expensive resource, which is a transaction on a hot inventory row.
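A minimal sketch of that early rejection follows. The SoldOutGate class and the core_book callback are hypothetical, and the sketch assumes core_book returns False only when there are no seats. Because expired holds can return seats to the pool, a real sold-out flag would need a short TTL or explicit invalidation, which this sketch leaves out.

```python
from typing import Callable


class SoldOutGate:
    """Remembers sold-out trains so doomed requests skip the booking core."""

    def __init__(self, core_book: Callable[[str, str], bool]) -> None:
        self._core_book = core_book
        self._sold_out = set()
        self.core_calls = 0

    def try_book(self, train_id: str, journey_date: str) -> str:
        key = (train_id, journey_date)
        if key in self._sold_out:
            return "sold_out"  # answered from the cache, no hot-row transaction
        self.core_calls += 1
        if self._core_book(train_id, journey_date):
            return "held"
        self._sold_out.add(key)  # the first loser records the fact for everyone after
        return "sold_out"


# Stand-in for the booking core: one seat on one (made-up) train.
remaining = {("TRAIN_A", "2025-01-15"): 1}


def core_book(train_id: str, journey_date: str) -> bool:
    key = (train_id, journey_date)
    if remaining[key] < 1:
        return False
    remaining[key] -= 1
    return True


gate = SoldOutGate(core_book)
results = [gate.try_book("TRAIN_A", "2025-01-15") for _ in range(5)]
print(results)          # ['held', 'sold_out', 'sold_out', 'sold_out', 'sold_out']
print(gate.core_calls)  # 2: one winner, one loser, three answered by the cache
```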

Where the data lives

The inventory rows are the hottest, most contended data in the system. A few design notes:

Shard by train and date. Different trains are independent, so their inventory can live on different shards. This spreads the contention across the fleet instead of concentrating every train on one database. The contention within one popular train is irreducible, but at least one hot train doesn't slow bookings for every other train.
Keep the hot transaction tiny. The hold transaction touches one inventory row and one holds row, and does nothing else (no emails, no analytics, no third-party calls). Everything non-essential happens afterward, off a queue. A short transaction holds its locks briefly, which is exactly what you want when thousands are queued behind it.
Read availability from a cache, write through the database. The "seats available" number shown during browsing can come from a slightly stale cache; correctness is enforced only at the moment of the atomic decrement. Don't make every availability check a hot-row read.
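Routing by train and date can be as simple as hashing the pair. The sketch below is one possible scheme, not a recommendation: the shard_for name, the key format, and the shard count of 8 are assumptions. Plain modulo hashing also makes later resharding painful, so real deployments often use a lookup table or consistent hashing instead.

```python
import hashlib


def shard_for(train_id: str, journey_date: str, num_shards: int) -> int:
    """Map one train on one date to a stable shard index."""
    key = f"{train_id}:{journey_date}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use a stable hash: Python's built-in hash() is randomised per process.
    return int.from_bytes(digest[:8], "big") % num_shards


# The same train and date always land on the same shard, so every hold for
# that inventory row goes through one database.
shard = shard_for("TRAIN_A", "2025-01-15", num_shards=8)
print(shard == shard_for("TRAIN_A", "2025-01-15", num_shards=8))  # True
```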

The one idea to take away

Booking is the opposite of streaming. You can't cache the write, so you make the write as small, atomic, and serialised as possible, and you protect it with a queue so only a survivable number of requests reach it. Correctness lives in one conditional UPDATE; everything else (holds, payments, queueing) exists to feed that one statement safely.

Test yourself

Questions · say the answer out loud before you open it. If you can't, the chapter isn't done.

Q: Two users see '1 seat left' and both click book. How do you guarantee only one succeeds?

Make the decrement atomic and conditional: UPDATE inventory SET available = available - 1 WHERE id = ? AND available >= 1. The database holds a brief row lock per statement, so the two updates are serialised, and only the one that runs while available >= 1 succeeds. The loser's update affects zero rows and is told the seat is gone. Never read-then-write in application code.

Q: Why hold a seat instead of booking it directly on click?

Because payment is slow and can fail. A hold removes the seat from the available pool immediately (so it can't be double-sold) but doesn't confirm it until payment succeeds. A short expiry returns abandoned holds to the pool. Direct booking on click would either give seats to people who never pay or block the seat forever on a failed payment.

Q: A user pays, their network drops, and they retry. How do you avoid double-charging?

Idempotency keys. Each attempt carries a unique key; the payment is recorded against it, and a retry with the same key returns the original result instead of charging again. Combined with a booking state machine, the operation lands in exactly one valid state no matter how many times it's retried.

Q: Payment succeeded but confirmation failed. The user is charged with no ticket. Now what?

You reconcile. Either a durable workflow retries confirm-after-pay until the booking is confirmed or the payment is refunded, or a reconciliation job continuously matches successful payments against confirmed bookings and auto-refunds orphans. This case is guaranteed to happen at scale, so it must be handled by design, not hope.

Q: Millions of requests hit at 10:00:00. How do you keep the database alive?

A virtual waiting room in front. Admit users into the booking core at a sustainable rate and queue the rest with a visible position. This caps the load on the hot inventory rows so the system stays inside its safe range, keeps failure honest, and preserves rough first-come fairness. Also cache 'sold out' so doomed requests never reach the core.

Q: Pessimistic vs optimistic locking for seat inventory?

Pessimistic (SELECT ... FOR UPDATE) locks the row up front; reliable but serialises everyone and piles up waiters under a rush. Optimistic (a conditional atomic UPDATE) holds only a brief per-statement lock and fails losers immediately, giving much higher throughput. For high-contention inventory, the atomic conditional update is usually the better fit.

Q: How do you stop one popular train from slowing bookings for every other train?

Shard inventory by train and date so independent trains live on independent databases. The contention within one hot train is irreducible (everyone wants the same seats), but sharding keeps that hot spot from dragging down unrelated bookings. Also keep the hot transaction tiny so locks are held briefly.