
A Payments Ledger for Nationwide, Real-Time Money Movement

Everywhere else you trade consistency for scale. With money you can't. A lost or duplicated rupee is a person's salary, so correctness is absolute and the design bends around that.

A national real-time payments network can process more than 20 billion transactions a month, on the order of 700 million a day. Every single one moves real money between real people, and the system has exactly zero tolerance for losing a rupee, creating one from nothing, or charging someone twice. This is the deep dive where the lessons from every other deep dive (cache it, make it eventual, approximate it) all get thrown out, because none of them are acceptable when the number is someone's bank balance.

One architectural note up front, because it shapes everything else. These networks are usually built around a central switch that is largely stateless: it routes a payment request to the right banks, validates it, times it, and enforces the rules — but it doesn't hold anyone's money. The account balances live at the participant banks. So the switch can scale horizontally like any stateless service, while the hard consistency work happens at the banks and in the ledger that records what moved.

What we're building

Functional · what it does

  • Move money from one account to another
  • Show an always-correct balance
  • A full, auditable history of every transaction
  • Refunds and reversals
  • Handle retries from flaky mobile networks

Non-functional · what it must survive

  • Never lose or duplicate money, ever
  • Every balance must be provably correct
  • Survive crashes mid-transaction with no corruption
  • High throughput despite strong consistency
  • Complete audit trail for compliance

Notice what's not on the list: "lowest possible latency" and "infinite scale at any cost." Those are negotiable. Correctness is not. The design optimises for being provably right first, and fast second.

The first rule: never edit a balance

The instinct is to store a balance column and UPDATE it on every transaction. Don't. A stored, mutable balance has no history, no audit trail, and no way to tell whether it's right. If it's ever wrong, you can't reconstruct how it got that way.

Instead you use a double-entry ledger, the same system accountants have used for centuries. Every transaction writes two entries: a debit from one account and a credit to another. Balances aren't stored; they're derived by summing an account's entries. The ledger is append-only: you never update or delete an entry, you only add new ones.


The magic property is the invariant: every transaction's two entries sum to zero, so the sum of all entries in the system is always zero. Money is never created or destroyed, only moved. If that global sum is ever non-zero, you have a bug, and you can detect it immediately. That self-checking property is why money systems are built this way.
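
A minimal sketch of the idea in Python, assuming amounts are held as integer paise and that a signed amount distinguishes a debit (negative) from a credit (positive). The `Entry` and `Ledger` names and the `bank_float` funding account are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an entry cannot be modified once written
class Entry:
    txn_id: str
    account: str
    amount: int  # integer paise; negative = debit, positive = credit


class Ledger:
    def __init__(self):
        self._entries = []  # append-only: no update or delete is exposed

    def transfer(self, txn_id, src, dst, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        # Two entries per transaction, and they sum to zero.
        self._entries.append(Entry(txn_id, src, -amount))
        self._entries.append(Entry(txn_id, dst, +amount))

    def balance(self, account):
        # Balances are derived by summing, never stored as the authority.
        return sum(e.amount for e in self._entries if e.account == account)

    def global_sum(self):
        # The invariant: this must always be zero.
        return sum(e.amount for e in self._entries)


ledger = Ledger()
ledger.transfer("t0", "bank_float", "alice", 50_000)
ledger.transfer("t1", "alice", "bob", 20_000)
assert ledger.global_sum() == 0, "money was created or destroyed"
```

Because every transfer appends a matched pair, the final assertion can run after any batch of writes as a cheap self-check.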

Decision: Store an append-only ledger and derive balances; never store a mutable balance.

Appending two entries per transaction uses more storage and means a balance is a sum rather than a lookup. In return you get a complete audit trail, the ability to reconstruct any balance at any point in time, and a built-in correctness check (all entries sum to zero). For money, that trade is not even close; correctness and auditability win.

For performance, you don't re-sum millions of entries on every balance check. You keep a cached or periodically-snapshotted balance (a materialised balance) for fast reads, but the ledger remains the source of truth. The cached balance is a convenience derived from the ledger, never the authority. If they ever disagree, the ledger is right and the cache is rebuilt.
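
One way to picture a materialised balance is a snapshot plus the entries appended since it was taken. The sketch below assumes a simple in-memory ledger of `(account, amount)` pairs; the `Snapshot` type and the `derive`, `read_balance`, and `rebuild` helpers are hypothetical names, not part of any particular system.

```python
from dataclasses import dataclass

# Append-only ledger as (account, signed amount in paise) pairs.
LEDGER = [
    ("bank_float", -50_000), ("alice", +50_000),
    ("alice", -20_000), ("bob", +20_000),
]


@dataclass
class Snapshot:
    balance: int  # balance after the first `upto` ledger entries
    upto: int     # how many ledger entries are already folded in


def derive(account, ledger, start=0):
    """Authoritative answer: sum the account's entries from `start` onward."""
    return sum(amount for acct, amount in ledger[start:] if acct == account)


def read_balance(account, snap, ledger):
    """Fast path: cached balance plus only the entries added after it."""
    return snap.balance + derive(account, ledger, snap.upto)


def rebuild(account, ledger):
    """The ledger is the source of truth, so a snapshot can always be redone."""
    return Snapshot(balance=derive(account, ledger), upto=len(ledger))


snap = rebuild("alice", LEDGER)
LEDGER += [("alice", -5_000), ("carol", +5_000)]
current = read_balance("alice", snap, LEDGER)

# Simulate a corrupted cache: the ledger wins and the cache is rebuilt.
bad = Snapshot(balance=999, upto=snap.upto)
if read_balance("alice", bad, LEDGER) != derive("alice", LEDGER):
    bad = rebuild("alice", LEDGER)
```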

Idempotency on every rupee

The mobile network will fail between charging the user and showing them the result. They will retry. Without protection, the retry moves the money again. So every money-moving request carries an idempotency key, and the ledger records the result against that key. A retry with the same key returns the original outcome instead of creating a second transaction.

This is the API chapter's idempotency pattern, but here it's load-bearing rather than a nice-to-have. A missing idempotency key in a payments system is a double-spend waiting for the first network blip.
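
A minimal in-memory sketch of the key check. `PaymentService` and `IdempotencyConflict` are hypothetical names, and rejecting a reused key that arrives with a different request body is a common extra guard rather than something described above. In a real store, the key record and the ledger entries would need to be written in one atomic transaction, with a uniqueness constraint on the key.

```python
class IdempotencyConflict(Exception):
    """The same key was reused with a different request body."""


class PaymentService:
    def __init__(self):
        self.entries = []  # append-only ledger entries
        self.by_key = {}   # idempotency key -> (request, recorded outcome)

    def pay(self, key, src, dst, amount):
        request = (src, dst, amount)
        if key in self.by_key:
            seen_request, outcome = self.by_key[key]
            if seen_request != request:
                raise IdempotencyConflict(key)
            return outcome  # replay: original outcome, no new entries

        txn_id = f"txn-{len(self.by_key) + 1}"
        self.entries.append((txn_id, src, -amount))
        self.entries.append((txn_id, dst, +amount))
        outcome = {"txn_id": txn_id, "status": "completed"}
        self.by_key[key] = (request, outcome)
        return outcome


svc = PaymentService()
first = svc.pay("key-123", "alice", "bob", 20_000)
retry = svc.pay("key-123", "alice", "bob", 20_000)  # e.g. the app auto-retried
```

The retry returns the stored outcome and the ledger still holds exactly one debit and one credit.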

The double-submit is the default, not the edge case

In payments you assume every request arrives more than once. Users double-tap, apps auto-retry, gateways replay. The idempotency key isn't there for the rare failure; it's there because duplicate delivery is the normal state of the world. Design as if every request is a duplicate of one you've already seen, and let the key tell you whether it actually is.

Moving money across two systems

The hardest case: the two accounts live in different banks, so the debit and the credit happen in different systems that can each fail independently. You can't wrap them in one database transaction. This is the distributed-transaction problem, and the honest answer is you don't use a distributed transaction (they don't work reliably across organisations). You use a state machine for the transfer.

  1. Initiate

    Create the transfer record in a pending state with its idempotency key. Nothing has moved yet, but the intent is durably recorded.

  2. Debit the sender

    Reserve and debit the source account. If this fails, the transfer goes to failed and nothing else happens.

  3. Credit the receiver

    Credit the destination. If it succeeds, the transfer is completed.

  4. Compensate on failure

    If the credit fails after the debit succeeded, you run a compensating action: reverse the debit (refund the sender). The transfer ends reversed.

This is a saga: a multi-step process where each step has an undo, driven by a durable state machine that can be retried from wherever it crashed. The transfer record's state is persisted at every step, so a crash mid-transfer resumes correctly instead of leaving money in limbo. The states (pending, completed, failed, reversed) are explicit and exhaustive, so there's no undefined in-between.
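
The steps above can be sketched as a small resumable state machine. This is an illustration under stated assumptions: the `DEBITED` state is an assumed intermediate marker for "debit done, credit not yet attempted" that the text does not name, and `debit`, `credit`, `refund`, and `save` are hypothetical callables standing in for the bank calls and the durable store. `BankError` here means a definite rejection; the timeout case is a separate concern.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    DEBITED = "debited"  # assumed intermediate state, not named in the text
    COMPLETED = "completed"
    FAILED = "failed"
    REVERSED = "reversed"


class BankError(Exception):
    """A definite rejection from a bank (not a timeout)."""


def run_transfer(transfer, debit, credit, refund, save):
    """Advance a transfer from whatever state it was last persisted in.

    Each bank call is assumed to be idempotent on the transfer id, so
    re-running a step after a crash cannot move the money twice.
    """
    if transfer["state"] is State.PENDING:
        try:
            debit(transfer)
        except BankError:
            transfer["state"] = State.FAILED
            save(transfer)
            return transfer
        transfer["state"] = State.DEBITED
        save(transfer)

    if transfer["state"] is State.DEBITED:
        try:
            credit(transfer)
            transfer["state"] = State.COMPLETED
        except BankError:
            refund(transfer)  # compensating action: give the sender back
            transfer["state"] = State.REVERSED
        save(transfer)
    return transfer


calls = []


def reject(transfer):
    raise BankError("credit rejected")


reversed_transfer = run_transfer(
    {"id": "tr-1", "state": State.PENDING},
    debit=lambda t: calls.append("debit"),
    credit=reject,
    refund=lambda t: calls.append("refund"),
    save=lambda t: calls.append(t["state"].value),
)

# Resuming after a crash: the record says DEBITED, so the debit is skipped.
resumed = run_transfer(
    {"id": "tr-2", "state": State.DEBITED},
    debit=lambda t: calls.append("debit-again"),
    credit=lambda t: None,
    refund=lambda t: None,
    save=lambda t: None,
)
```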

The 'I don't know' outcome is a real state

The genuinely scary case isn't failure, it's uncertainty: you sent the credit to the receiver's bank and never heard back. Did it land or not? Real networks handle this with an explicit deemed outcome. The switch records "treated as credited" or "treated as failed" per the rulebook, tells both sides, and lets a later settlement or dispute process confirm the truth. A timeout is not "nothing happened." It's its own state you record and resolve, never something you silently retry into a double-credit.
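
A sketch of treating the timeout as its own recorded state. The state name `unknown`, the `deemed` field, and the `deemed_outcome_rule` callable are hypothetical; the actual outcome a network deems depends on its rulebook, which is not specified here.

```python
class CreditTimeout(Exception):
    """The credit request was sent but no response arrived in time."""


def attempt_credit(transfer, send_credit, deemed_outcome_rule):
    try:
        send_credit(transfer)
        transfer["state"] = "completed"
    except CreditTimeout:
        # Not "nothing happened": record the uncertainty explicitly and
        # leave confirmation to settlement or a dispute process.
        transfer["state"] = "unknown"
        transfer["deemed"] = deemed_outcome_rule(transfer)
        transfer["needs_reconciliation"] = True
        # Deliberately no blind retry here: that is how double-credits happen.
    return transfer


sent = []


def flaky_send(transfer):
    sent.append(transfer["id"])
    raise CreditTimeout()


result = attempt_credit(
    {"id": "tr-9", "state": "debited"},
    flaky_send,
    deemed_outcome_rule=lambda t: "deemed_credited",  # placeholder rule
)
```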

Reversals are new entries, not deletions

When you refund or reverse, you do not delete the original ledger entries. You append new compensating entries that cancel them out. The history shows "money moved, then moved back," which is the truth. Deleting the original would destroy the audit trail and break the append-only guarantee. In a ledger, you fix mistakes by adding, never by erasing.
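
As a small illustration, assuming ledger entries are `(txn_id, account, amount)` tuples with signed integer amounts, a reversal appends the mirror image of each original entry. The `reverse` and `balance` helpers and the `rev-txn-1` id are hypothetical.

```python
ledger = [
    ("txn-1", "alice", -20_000),
    ("txn-1", "bob", +20_000),
]


def reverse(ledger, txn_id, reversal_id):
    """Append compensating entries; the originals are never touched."""
    originals = [entry for entry in ledger if entry[0] == txn_id]
    for _, account, amount in originals:
        ledger.append((reversal_id, account, -amount))


def balance(ledger, account):
    return sum(amount for _, acct, amount in ledger if acct == account)


reverse(ledger, "txn-1", "rev-txn-1")
```

After the call the ledger holds four entries: the history shows the money moving and moving back, and both balances net to zero.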

Where strong consistency is non-negotiable

Within a single account, the debit must be atomic and isolated: two simultaneous withdrawals must not both succeed against the same balance, or you've allowed an overdraft that shouldn't exist. This is the same hot-row contention as the ticket seat in the previous deep dive, solved the same way: a conditional atomic update guarded by a balance check (... WHERE balance >= amount), or an explicit row lock, so concurrent debits are serialised.
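
A sketch of the conditional update using Python's built-in `sqlite3` module. It assumes the `balance` column is the materialised balance and that it is updated in the same database transaction that appends the ledger entry, so the two cannot drift apart. The table layout is hypothetical, and only the debit leg is shown; the matching credit entry is omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("CREATE TABLE entries (txn_id TEXT, account TEXT, amount INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 1000)")
conn.commit()


def debit(conn, txn_id, account, amount):
    """Debit only if funds are sufficient; returns True when the debit applied."""
    with conn:  # one transaction: commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?",
            (amount, account, amount),
        )
        if cur.rowcount != 1:
            return False  # guard failed: insufficient funds or unknown account
        conn.execute(
            "INSERT INTO entries VALUES (?, ?, ?)", (txn_id, account, -amount)
        )
        return True


first = debit(conn, "t1", "alice", 700)   # balance 1000 -> 300
second = debit(conn, "t2", "alice", 700)  # only 300 left, so the guard rejects it
remaining = conn.execute(
    "SELECT balance FROM accounts WHERE id = 'alice'"
).fetchone()[0]
```

The check and the decrement happen in one statement, so two concurrent debits cannot both pass the check against the same starting balance.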

Across the whole system you don't need global strong consistency, only per-account consistency. Different accounts are independent, so they can live on different shards and proceed in parallel. The expensive guarantee is scoped to the one place it's required (a single account's balance), which is what keeps an otherwise strict system fast.

Decision: Strong consistency per account, sharded across accounts.

Global strong consistency would serialise the entire system and destroy throughput. Per-account consistency gives you the only guarantee that actually matters (no account is ever overdrawn or double-spent) while letting unrelated accounts on different shards run fully in parallel. You pay the cost of strictness exactly once, on the hot row, and nowhere else.
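
To make "sharded across accounts" concrete, here is one simple routing scheme, offered as an assumption since the text does not say how accounts are placed: hash the account id and take it modulo the shard count. `shard_for` is a hypothetical helper.

```python
import hashlib


def shard_for(account_id: str, num_shards: int) -> int:
    """Map an account to a shard deterministically.

    A stable hash (not Python's per-process randomised hash()) keeps the
    mapping identical across processes and restarts.
    """
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


alice_shard = shard_for("alice", 16)
bob_shard = shard_for("bob", 16)
```

Every debit for one account lands on the same shard, which is where that account's serialisation happens, while accounts on other shards proceed in parallel.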

Reconciliation: trust, but verify

Even with all of the above, distributed money movement produces discrepancies: a partner bank's records disagree with yours, a callback was lost, a transfer is stuck pending. So a payments system runs continuous reconciliation: jobs that compare your ledger against external systems and against itself, flag anything that doesn't match, and either auto-correct (refund an orphaned charge) or escalate to a human.
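
A reconciliation pass can be sketched as a comparison of two views of the same transfers. This assumes both sides can be reduced to a map of transfer id to `(status, amount)`; the `reconcile` function and the issue labels are hypothetical.

```python
def reconcile(ours, theirs):
    """Compare two {transfer_id: (status, amount)} maps and list mismatches."""
    issues = []
    for transfer_id in sorted(ours.keys() | theirs.keys()):
        mine = ours.get(transfer_id)
        other = theirs.get(transfer_id)
        if mine is None:
            issues.append((transfer_id, "missing_in_ours"))
        elif other is None:
            issues.append((transfer_id, "missing_at_partner"))
        elif mine != other:
            issues.append((transfer_id, "mismatch"))
    return issues


ours = {
    "tr-1": ("completed", 20_000),
    "tr-2": ("pending", 5_000),    # e.g. the completion callback was lost
    "tr-3": ("completed", 700),
}
theirs = {
    "tr-1": ("completed", 20_000),
    "tr-2": ("completed", 5_000),
    "tr-4": ("completed", 900),
}
issues = reconcile(ours, theirs)
```

Each flagged item would then be auto-corrected where a rule exists, or escalated to a human where it does not.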

This is the safety net under the safety net. The ledger's internal invariant (entries sum to zero) catches your own bugs; reconciliation catches disagreements with the outside world. A payments system without reconciliation is one lost callback away from a balance nobody can explain.

At national scale this isn't just good engineering, it's often the law. In India the RBI's Turn Around Time framework requires that a debit which never reaches the beneficiary be auto-reversed within a fixed window, and if the reversal is late the bank pays the customer a fixed daily penalty automatically, with no complaint required. Disputes and chargebacks are similarly time-boxed, after which silence counts as acceptance. So your reconciliation and reversal flows aren't only protecting your books; they're meeting a clock the regulator is holding.

The one idea to take away

Money flips every default. You never edit a balance; you append to an immutable double-entry ledger and derive balances, so the system is auditable and self-checking. Every operation is idempotent because duplicates are the norm. Cross-system transfers are sagas with compensating reversals, not distributed transactions. Strong consistency is scoped to a single account so it stays fast. And reconciliation runs forever, because at this scale something is always slightly wrong and you'd rather find it than wait for a customer to.

Test yourself

Questions · say the answer out loud before you read it. If you can't, the chapter isn't done.

Q: Why not just store a balance column and update it on each transaction?

Because a mutable balance has no history and no way to prove it's correct. If it's ever wrong you can't tell how it happened. A double-entry append-only ledger stores every movement and derives balances by summing, giving you a full audit trail and a built-in correctness check (all entries sum to zero).

Q: What invariant makes a double-entry ledger self-checking?

Every transaction writes two entries (a debit and a credit) that sum to zero, so the sum of all entries across the whole system is always zero. Money is only moved, never created or destroyed. If that global sum is ever non-zero, there's a bug, and you can detect it immediately rather than discovering it via an angry customer.

Q: A user's network drops after paying and they retry. How do you avoid moving the money twice?

Every money-moving request carries an idempotency key, and the ledger records the outcome against it. A retry with the same key returns the original result instead of creating a second transaction. In payments this is load-bearing, not optional, because duplicate delivery is the normal state of the network.

Q: The debit and credit are in different banks. How do you move money safely?

A saga: a durable state machine (pending → completed / failed / reversed) where each step persists its progress. Debit the sender, then credit the receiver; if the credit fails after the debit, run a compensating reversal to refund the sender. You don't use a distributed transaction, because those don't work reliably across organisations.

Q: How do you handle a refund or reversal in the ledger?

You append new compensating entries that cancel the originals; you never delete or edit the original entries. The history then truthfully shows money moving and moving back. Deleting would destroy the audit trail and break the append-only guarantee. In a ledger, you correct by adding, never by erasing.

Q: Where is strong consistency actually required, and where isn't it?

It's required per account: concurrent debits against one balance must be serialised so you can't overdraw, solved with a conditional atomic update or a row lock. It's not required globally; different accounts are independent and can live on different shards running in parallel. Scoping strictness to the single hot row keeps the system fast.

Q: Why does a payments system need continuous reconciliation?

Because cross-system money movement produces discrepancies: lost callbacks, stuck pending transfers, partner records that disagree with yours. Reconciliation jobs compare your ledger against external systems and itself, flag mismatches, and auto-correct or escalate. The ledger invariant catches your own bugs; reconciliation catches disagreements with the outside world.
