UPI processes billions of transactions a month. Every single one moves real money between real people, and the system has exactly zero tolerance for losing a rupee, creating one from nothing, or charging someone twice. This is the deep dive where the lessons from every other one (cache it, make it eventual, approximate it) all get thrown out, because none of them are acceptable when the number is someone's bank balance.
What we're building
Functional · what it does
- Move money from one account to another
- Show an always-correct balance
- Keep a full, auditable history of every transaction
- Support refunds and reversals
- Handle retries from flaky mobile networks
Non-functional · what it must survive
- Never lose or duplicate money, ever
- Every balance must be provably correct
- Survive crashes mid-transaction with no corruption
- High throughput despite strong consistency
- Complete audit trail for compliance
Notice what's not on the list: "lowest possible latency" and "infinite scale at any cost." Those are negotiable. Correctness is not. The design optimises for being provably right first, and fast second.
The first rule: never edit a balance
The instinct is to store a balance column and UPDATE it on every transaction. Don't. A stored, mutable balance has no history, no audit trail, and no way to tell whether it's right. If it's ever wrong, you can't reconstruct how it got that way.
Instead you use a double-entry ledger, the same system accountants have used for centuries. Every transaction writes two entries: a debit from one account and a credit to another. Balances aren't stored; they're derived by summing an account's entries. The ledger is append-only: you never update or delete an entry, you only add new ones.
The magic property is the invariant: every transaction's two entries sum to zero, so the sum of all entries in the system is always zero. Money is never created or destroyed, only moved. If that global sum is ever non-zero, you have a bug, and you can detect it immediately. That self-checking property is why money systems are built this way.
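A minimal sketch of the idea, assuming an in-memory list in place of a real database table. The names (Entry, Ledger, the "settlement" funding account) are illustrative, not part of any real system. Debits are stored as negative amounts and credits as positive ones, so the invariant is a plain sum.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Entry:
    txn_id: str
    account: str
    amount: Decimal  # negative = debit, positive = credit


class Ledger:
    def __init__(self) -> None:
        self._entries: list[Entry] = []  # append-only: never updated or deleted

    def transfer(self, txn_id: str, src: str, dst: str, amount: Decimal) -> None:
        if amount <= 0:
            raise ValueError("amount must be positive")
        # Two entries per transaction, and they sum to zero.
        self._entries.append(Entry(txn_id, src, -amount))
        self._entries.append(Entry(txn_id, dst, amount))

    def balance(self, account: str) -> Decimal:
        # Balances are derived from entries, never stored.
        return sum((e.amount for e in self._entries if e.account == account), Decimal("0"))

    def global_sum(self) -> Decimal:
        # The self-check: anything other than zero means a bug.
        return sum((e.amount for e in self._entries), Decimal("0"))


ledger = Ledger()
ledger.transfer("t0", "settlement", "alice", Decimal("1000.00"))  # hypothetical funding account
ledger.transfer("t1", "alice", "bob", Decimal("250.00"))
ledger.transfer("t2", "bob", "carol", Decimal("100.00"))
```

Amounts use Decimal rather than float so that sums are exact. Storing integer paise would be an equally reasonable choice.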
Decision: Store an append-only ledger and derive balances; never store a mutable balance.
Appending two entries per transaction uses more storage and means a balance is a sum rather than a lookup. In return you get a complete audit trail, the ability to reconstruct any balance at any point in time, and a built-in correctness check (all entries sum to zero). For money, that trade is not even close; correctness and auditability win.
For performance, you don't re-sum millions of entries on every balance check. You keep a cached or periodically-snapshotted balance (a materialised balance) for fast reads, but the ledger remains the source of truth. The cached balance is a convenience derived from the ledger, never the authority. If they ever disagree, the ledger is right and the cache is rebuilt.
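One way to picture a materialised balance is a snapshot plus the tail of entries written after it. The sketch below assumes each ledger row for an account carries a monotonically increasing sequence number. The function names and the snapshot shape are hypothetical.

```python
from decimal import Decimal

# Append-only ledger rows for one account: (sequence number, signed amount).
entries = [
    (1, Decimal("1000.00")),
    (2, Decimal("-250.00")),
    (3, Decimal("-100.00")),
    (4, Decimal("40.00")),
]


def full_balance(rows):
    # The authoritative answer: sum every entry.
    return sum((amt for _, amt in rows), Decimal("0"))


def take_snapshot(rows, upto_seq):
    # Materialise the balance as of a given sequence number.
    total = sum((amt for seq, amt in rows if seq <= upto_seq), Decimal("0"))
    return {"upto_seq": upto_seq, "balance": total}


def fast_balance(snapshot, rows):
    # Fast read: snapshot plus only the entries written after it.
    tail = sum((amt for seq, amt in rows if seq > snapshot["upto_seq"]), Decimal("0"))
    return snapshot["balance"] + tail


def verify_or_rebuild(snapshot, rows):
    # If the cache disagrees with the ledger, the ledger wins.
    expected = take_snapshot(rows, snapshot["upto_seq"])
    return snapshot if snapshot == expected else expected
```

The point of the shape is that the snapshot can always be thrown away and recomputed, because nothing in it is information the ledger does not already hold.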
Idempotency on every rupee
The mobile network will fail between charging the user and showing them the result. They will retry. Without protection, the retry moves the money again. So every money-moving request carries an idempotency key, and the ledger records the result against that key. A retry with the same key returns the original outcome instead of creating a second transaction.
This is the API chapter's idempotency pattern, but here it's load-bearing, not a nice-to-have. A payments system without idempotency keys is a double charge waiting for the first network blip.
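A minimal sketch of the key check, assuming an in-memory dict in place of durable storage. PaymentService and its fields are hypothetical names. In a real system the key record and the ledger entries would need to be written in the same database transaction, with a unique constraint on the key, so that two concurrent retries cannot both pass the check.

```python
from decimal import Decimal


class PaymentService:
    def __init__(self):
        self.results = {}  # idempotency key -> recorded outcome
        self.entries = []  # append-only ledger entries

    def pay(self, idempotency_key, src, dst, amount):
        # A retry with a known key returns the original outcome untouched.
        if idempotency_key in self.results:
            return self.results[idempotency_key]

        txn_id = f"txn-{len(self.results) + 1}"
        self.entries.append((txn_id, src, -amount))
        self.entries.append((txn_id, dst, amount))

        outcome = {"txn_id": txn_id, "status": "completed"}
        self.results[idempotency_key] = outcome
        return outcome


service = PaymentService()
first = service.pay("key-123", "alice", "bob", Decimal("499.00"))
retry = service.pay("key-123", "alice", "bob", Decimal("499.00"))  # same key: no new entries
```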
The double-submit is the default, not the edge case
In payments you assume every request arrives more than once. Users double-tap, apps auto-retry, gateways replay. The idempotency key isn't there for the rare failure; it's there because duplicate delivery is the normal state of the world. Design as if every request is a duplicate of one you've already seen, and let the key tell you whether it actually is.
Moving money across two systems
The hardest case: the two accounts live in different banks, so the debit and the credit happen in different systems that can each fail independently. You can't wrap them in one database transaction. This is the distributed-transaction problem, and the honest answer is you don't use a distributed transaction (they don't work reliably across organisations). You use a state machine for the transfer.
- Initiate: Create the transfer record in a pending state with its idempotency key. Nothing has moved yet, but the intent is durably recorded.
- Debit the sender: Reserve and debit the source account. If this fails, the transfer goes to failed and nothing else happens.
- Credit the receiver: Credit the destination. If it succeeds, the transfer is completed.
- Compensate on failure: If the credit fails after the debit succeeded, you run a compensating action: reverse the debit (refund the sender). The transfer ends reversed.
This is a saga: a multi-step process where each step has an undo, driven by a durable state machine that can be retried from wherever it crashed. The transfer record's state is persisted at every step, so a crash mid-transfer resumes correctly instead of leaving money in limbo. The states (pending, completed, failed, reversed) are explicit and exhaustive, so there's no undefined in-between.
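The transitions can be sketched as a small function. This is an illustration under assumptions, not a production saga: the transfer is a plain dict standing in for a durable record, and debit, credit and refund are hypothetical callables that raise BankError on failure. Persistence after each step, retries, and idempotent bank calls are all omitted.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"
    REVERSED = "reversed"


class BankError(Exception):
    """Raised by a (hypothetical) bank call that did not succeed."""


def run_transfer(transfer, debit, credit, refund):
    transfer["state"] = State.PENDING  # intent recorded, nothing moved yet
    try:
        debit(transfer)
    except BankError:
        transfer["state"] = State.FAILED  # nothing moved, nothing to undo
        return transfer
    try:
        credit(transfer)
    except BankError:
        refund(transfer)  # compensating action: undo the debit
        transfer["state"] = State.REVERSED
        return transfer
    transfer["state"] = State.COMPLETED
    return transfer


calls = []  # records which steps ran, for illustration only


def make_step(name, succeeds=True):
    def step(transfer):
        calls.append(name)
        if not succeeds:
            raise BankError(f"{name} failed")
    return step


happy = run_transfer({"id": "tr-1"}, make_step("debit"), make_step("credit"), make_step("refund"))
```

A real implementation would also have to handle the refund itself failing, typically by retrying it until it succeeds and flagging the transfer for reconciliation in the meantime.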
Reversals are new entries, not deletions
When you refund or reverse, you do not delete the original ledger entries. You append new compensating entries that cancel them out. The history shows "money moved, then moved back," which is the truth. Deleting the original would destroy the audit trail and break the append-only guarantee. In a ledger, you fix mistakes by adding, never by erasing.
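As a sketch, a reversal is a second transaction whose entries mirror the first. The tuple layout and the function names here are assumptions for illustration.

```python
from decimal import Decimal

ledger = []  # append-only list of (txn_id, account, signed amount)


def post(txn_id, src, dst, amount):
    ledger.append((txn_id, src, -amount))
    ledger.append((txn_id, dst, amount))


def reverse(original_txn_id, reversal_txn_id):
    # Append mirror-image entries; the originals are left untouched.
    originals = [e for e in ledger if e[0] == original_txn_id]
    for _, account, amount in originals:
        ledger.append((reversal_txn_id, account, -amount))


def balance(account):
    return sum((amt for _, acct, amt in ledger if acct == account), Decimal("0"))


post("t1", "alice", "bob", Decimal("500.00"))
reverse("t1", "t1-reversal")
```

After the reversal the ledger holds four entries, both balances net to zero, and the history still shows the money moving and moving back.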
Where strong consistency is non-negotiable
Within a single account, the debit must be atomic and isolated: two simultaneous withdrawals must not both succeed against the same balance, or you've allowed an overdraft that shouldn't exist. This is the same hot-row contention as the IRCTC seat, solved the same way: a conditional atomic update guarded by a balance check (... WHERE balance >= amount), or an explicit row lock, so concurrent debits are serialised.
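Here is what the conditional update can look like, sketched with Python's built-in sqlite3 module and a hypothetical accounts table. The balance column is assumed to be the materialised balance, held in paise as an integer. A fuller version would append the ledger entries in the same database transaction as the update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 1000)")  # balance in paise
conn.commit()


def try_debit(conn, account_id, amount):
    # The balance check and the decrement happen in one atomic statement,
    # so two debits can never both pass the check against the same funds.
    cur = conn.execute(
        "UPDATE accounts SET balance = balance - ? WHERE id = ? AND balance >= ?",
        (amount, account_id, amount),
    )
    conn.commit()
    return cur.rowcount == 1  # zero rows updated means insufficient funds


first = try_debit(conn, "alice", 700)   # succeeds, leaves 300
second = try_debit(conn, "alice", 700)  # rejected: only 300 left
remaining = conn.execute("SELECT balance FROM accounts WHERE id = 'alice'").fetchone()[0]
```

The caller learns the outcome from the affected row count rather than from a separate read, which is what removes the read-then-write race.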
Across the whole system you don't need global strong consistency, only per-account consistency. Different accounts are independent, so they can live on different shards and proceed in parallel. The expensive guarantee is scoped to the one place it's required (a single account's balance), which is what keeps an otherwise strict system fast.
Decision: Strong consistency per account, sharded across accounts.
Global strong consistency would serialise the entire system and destroy throughput. Per-account consistency gives you the only guarantee that actually matters (no account is ever overdrawn or double-spent) while letting unrelated accounts on different shards run fully in parallel. You pay the cost of strictness exactly once, on the hot row, and nowhere else.
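Routing by account can be as simple as a stable hash of the account ID. This is a sketch under assumptions: the shard count and the choice of SHA-256 are illustrative, and a real deployment might use a lookup table or consistent hashing so that shards can be added without moving every account.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count


def shard_for(account_id: str) -> int:
    # A stable hash, so the same account always lands on the same shard
    # regardless of process or machine (unlike Python's built-in hash()).
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

Every debit for a given account then goes to one shard, where it can be serialised locally, while debits for other accounts proceed elsewhere.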
Reconciliation: trust, but verify
Even with all of the above, distributed money movement produces discrepancies: a partner bank's records disagree with yours, a callback was lost, a transfer is stuck pending. So a payments system runs continuous reconciliation: jobs that compare your ledger against external systems and against itself, flag anything that doesn't match, and either auto-correct (refund an orphaned charge) or escalate to a human.
This is the safety net under the safety net. The ledger's internal invariant (entries sum to zero) catches your own bugs; reconciliation catches disagreements with the outside world. A payments system without reconciliation is one lost callback away from a balance nobody can explain.
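At its core, a reconciliation pass is a diff between two sets of records. The sketch below assumes both sides can be reduced to a map from transaction ID to amount. The function name, the issue labels and the sample data are made up for illustration.

```python
from decimal import Decimal


def reconcile(ours, theirs):
    """Compare two {txn_id: amount} maps and return the mismatches."""
    issues = []
    for txn_id in sorted(ours.keys() | theirs.keys()):
        mine, other = ours.get(txn_id), theirs.get(txn_id)
        if mine is None:
            issues.append((txn_id, "missing_in_ledger"))
        elif other is None:
            issues.append((txn_id, "missing_at_partner"))
        elif mine != other:
            issues.append((txn_id, "amount_mismatch"))
    return issues


ours = {"t1": Decimal("100.00"), "t2": Decimal("50.00"), "t3": Decimal("75.00")}
theirs = {"t1": Decimal("100.00"), "t2": Decimal("55.00"), "t4": Decimal("20.00")}
issues = reconcile(ours, theirs)
```

What happens to each flagged item (auto-correct or escalate) is a policy decision layered on top of the diff.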
The one idea to take away
Money flips every default. You never edit a balance; you append to an immutable double-entry ledger and derive balances, so the system is auditable and self-checking. Every operation is idempotent because duplicates are the norm. Cross-system transfers are sagas with compensating reversals, not distributed transactions. Strong consistency is scoped to a single account so it stays fast. And reconciliation runs forever, because at this scale something is always slightly wrong and you'd rather find it than wait for a customer to.
Test yourself
Questions · say the answer out loud before you open it. If you can't, the chapter isn't done.
Q: Why not just store a balance column and update it on each transaction?
Because a mutable balance has no history and no way to prove it's correct. If it's ever wrong you can't tell how it happened. A double-entry append-only ledger stores every movement and derives balances by summing, giving you a full audit trail and a built-in correctness check (all entries sum to zero).
Q: What invariant makes a double-entry ledger self-checking?
Every transaction writes two entries (a debit and a credit) that sum to zero, so the sum of all entries across the whole system is always zero. Money is only moved, never created or destroyed. If that global sum is ever non-zero, there's a bug, and you can detect it immediately rather than discovering it via an angry customer.
Q: A user's network drops after paying and they retry. How do you avoid moving the money twice?
Every money-moving request carries an idempotency key, and the ledger records the outcome against it. A retry with the same key returns the original result instead of creating a second transaction. In payments this is load-bearing, not optional, because duplicate delivery is the normal state of the network.
Q: The debit and credit are in different banks. How do you move money safely?
A saga: a durable state machine (pending → completed / failed / reversed) where each step persists its progress. Debit the sender, then credit the receiver; if the credit fails after the debit, run a compensating reversal to refund the sender. You don't use a distributed transaction, because those don't work reliably across organisations.
Q: How do you handle a refund or reversal in the ledger?
You append new compensating entries that cancel the originals; you never delete or edit the original entries. The history then truthfully shows money moving and moving back. Deleting would destroy the audit trail and break the append-only guarantee. In a ledger, you correct by adding, never by erasing.
Q: Where is strong consistency actually required, and where isn't it?
It's required per account: concurrent debits against one balance must be serialised so you can't overdraw, solved with a conditional atomic update or a row lock. It's not required globally; different accounts are independent and can live on different shards running in parallel. Scoping strictness to the single hot row keeps the system fast.
Q: Why does a payments system need continuous reconciliation?
Because cross-system money movement produces discrepancies: lost callbacks, stuck pending transfers, partner records that disagree with yours. Reconciliation jobs compare your ledger against external systems and itself, flag mismatches, and auto-correct or escalate. The ledger invariant catches your own bugs; reconciliation catches disagreements with the outside world.