Building Chat at Global Messaging Scale

A chat app is the classic "I could build that in a weekend" system. And you could build the weekend version. The production version delivers tens of billions of messages a day to phones that are offline more often than online, over networks that drop constantly, while guaranteeing that the message you send arrives, in order, exactly once as far as the recipient can tell. That set of guarantees, on top of unreliable phones, is the real problem.

The scale is real and famously lopsided against headcount. One well-known messaging service, at around half a billion users, reported handling roughly 19 billion inbound and 40 billion outbound messages a day on the order of 550 servers — run by a backend team you could count on your fingers. That ratio is only possible because the design is ruthless about doing the cheap thing: hold connections efficiently, persist before you promise, and push per-user complexity to the edges.

What we're building

Functional · what it does

One-to-one and group messaging
Delivered to phones that are often offline
Sent / delivered / read receipts (the ticks)
Online and "last seen" presence
Media (images, voice notes) and message ordering

Non-functional · what it must survive

Billions of messages a day
No message ever lost
No message shown twice (to the user)
Delivered in seconds, in order per chat
Hundreds of millions of long-lived connections

The non-functional list is where the difficulty lives. "No message lost" and "delivered in order" sound basic and are surprisingly hard once the network is allowed to fail at any moment, which it is.

The connection layer

A phone keeps a long-lived connection open to a connection server (a WebSocket, or a persistent TCP connection). This is how the server pushes a message to a phone the instant it arrives, instead of the phone polling.

A single tuned box can hold a startling number of these. One team famously pushed past 2 million open TCP connections on one server (after raising the operating system's socket and file-descriptor limits into the millions), then deliberately ran closer to 1 million per box in production — trading peak density for headroom, so losing one server hurts less. The point: the connection layer is cheap if you keep each connection idle and dumb; it gets expensive only when each one does real work.

The catch: hundreds of millions of these connections, and the recipient of any given message is connected to some other connection server than the sender. So you need a way to find the right server for a recipient and route to it.

Sender

Conn server A

Route by recipient

Conn server B

Recipient

Sender's server doesn't hold the recipient. A routing layer finds the recipient's server and hands off the message.

A session registry (often in Redis) maps each online user to the connection server currently holding them. To deliver a message, the sender's server looks up the recipient, finds their server, and forwards the message there over the internal network. If the recipient isn't connected at all, the message takes the offline path instead.

Delivery: the part that's actually hard

The core guarantee is "store, then deliver." A message is persisted before it's acknowledged to the sender. Only once it's safely written do you tell the sender "sent" (the first tick). This ordering is non-negotiable: if you acknowledged first and crashed before writing, the message would vanish and the sender would think it went through.

Send and persist
The sender's message reaches a server, which writes it to durable storage keyed by conversation. Only now does the sender get the single tick (sent).
Deliver if online
If the recipient is connected, push the message to their connection server. The recipient's app acknowledges receipt, which produces the double tick (delivered) back to the sender.
Queue if offline
If the recipient is offline, the message waits in their per-user inbox. When they reconnect, the server replays everything they missed, in order.
Read receipt
When the recipient actually opens the chat, their app sends a read event, producing the blue ticks.

This is at-least-once delivery. A dropped acknowledgement means the server re-sends, so the recipient's phone might receive the same message twice. The phone deduplicates by message ID before showing it, which turns at-least-once on the wire into exactly-once for the human. It's the same "at-least-once plus idempotency" pattern from the API chapter, with the dedup running on a phone instead of a server.

Exactly-once is a client-side illusion, built on purpose

You can't get exactly-once delivery over an unreliable network. What you build is at-least-once delivery with a unique message ID, and the recipient discards duplicates by ID. The user perceives exactly-once. Chasing true exactly-once at the network layer is a trap; dedup by ID is the standard, working answer.

Ordering

Messages in one chat must appear in the order they were sent, even though they may travel through different servers and arrive out of order. The fix is to attach an ordering key per conversation (a sequence number or a timestamp from a monotonic source) and have the client sort by it. You don't need global ordering across the whole service, only per-conversation ordering, which is far cheaper. Scoping the guarantee to what users actually perceive (the order within a single chat) is what makes it affordable.

One large messenger took this further with a neat trick: a single totally-ordered queue per user that carries both new messages and state changes (delivered, read) as one stream of updates, with two independent pointers — "last update sent to the device" and "last update written to storage." Decoupling those pointers means a slow disk or an offline phone never stalls real-time delivery; the queue just keeps advancing the other pointer. It's the same "one ordered log, many readers at their own pace" idea you'll meet again in event systems.

Offline phones and the inbox

The defining fact of mobile chat: the recipient is usually not connected. So every user has a durable inbox of undelivered messages. When they come online, the server drains the inbox in order and pushes everything they missed, then marks those messages delivered.

For phones that have been offline a long time (or never received a push), a push notification through APNs (Apple) or FCM (Google) wakes the app or at least shows a banner. Push is best-effort and unreliable on its own, so it's a nudge to reconnect and sync, never the delivery mechanism itself. The durable inbox is the source of truth; push just prompts the phone to come get its messages.

Group messages: fanout again

A group message is a fanout problem. One send must reach every member. You don't store one copy and have everyone read it; you fan the message out into each member's delivery path (online push or offline inbox), exactly as one-to-one delivery works, just repeated per member.

The cost of naive fanout is easy to underestimate. A real-time platform measured each cross-server message hand-off at tens of microseconds, which sounds tiny until you fan out to a busy room: broadcasting to a 30,000-member channel could take 0.9 to 2.1 seconds of nothing but sending. Their fix was to stop sending one-by-one: group the recipients by which server holds them and send one batched message per server, which that server then splits locally. Many small network hops become a few big ones. This is the same instinct as request coalescing — collapse N trips into one wherever a fan-out gets hot.

DecisionFan out group messages per member, but cap group size.

Treating each recipient independently keeps the delivery logic uniform and the per-chat ordering intact. But a message to a 1000-member group is 1000 deliveries, so group size is capped and very large groups (broadcast channels) switch to a different, more pull-based model closer to the news feed's celebrity case. Uniform fanout for normal groups, a separate regime for the giant ones.

Presence and "last seen"

Presence (online / last seen) seems minor and is a sneaky scaling problem, because status changes constantly and everyone wants their contacts' status in real time. Naively broadcasting every status change to every contact is a fanout storm.

The pragmatic approach mirrors the live-streaming counter: don't be perfectly real-time. Track presence with a heartbeat and a short TTL in a fast store (a user is "online" if they've pinged in the last N seconds), and push status changes lazily, often only when a contact actually opens the chat rather than to everyone continuously. Presence is a feature you make approximate on purpose to keep it affordable.

Where the data lives

Messages are stored keyed by conversation ID, in a store optimised for append and range reads (read recent messages in a chat). Sharding by conversation keeps one chat's history together and spreads load across chats. At the extreme this gets huge: one platform storing trillions of messages found its busiest channels became hot partitions, and its database's garbage-collection pauses caused latency spikes — so it moved to a store with no garbage collector and a shard-per-core design, and put a small data-service layer in front that coalesces concurrent reads of the same row into one query. The recurring lessons: pick the shard key so no single partition runs hot, and collapse the stampede when everyone opens the popular channel at once.
The session registry (who's connected where) is ephemeral and lives in Redis with TTLs, since it's recreated as users connect.
Per-user inboxes of undelivered messages are durable queues, drained on reconnect.
Media doesn't travel through the message path. The sender uploads it to object storage, gets back a reference, and the message carries only that reference plus a thumbnail. Large blobs never clog the real-time delivery system.

End-to-end encryption changes who can do what

Privacy-focused messengers encrypt messages end-to-end with the Signal Protocol, so the server routes and stores ciphertext it can't read. That's great for privacy and it removes server-side options: no server-side search, no content-based spam scanning, and group fanout has to encrypt per recipient device because each device holds its own keys. The design assumes the server is a blind courier. The server still needs the recipient's address on the outside of the envelope to route and store-and-forward; "sealed sender" goes further and hides the sender's identity from the server too, using short-lived certificates so spoofing and abuse can still be blocked without the server reading who sent what.

The one idea to take away

Chat is a reliability problem wearing a real-time costume. Persist before you acknowledge, deliver at-least-once and dedup by message ID on the client, keep a durable inbox because phones are usually offline, and scope your guarantees (ordering, presence) to what the user actually perceives so they stay affordable. Push notifications nudge; the inbox delivers.

Test yourself

Questions· say the answer out loud before you open it. If you can't, the chapter isn't done.

QWhy must a message be persisted before the sender gets the 'sent' tick?+

Because if you acknowledged first and then crashed before writing, the message would be lost while the sender believed it was sent. Persist-then-acknowledge guarantees that once the sender sees a tick, the message is durably stored and will eventually be delivered. The ordering of those two steps is the whole reliability guarantee.

QThe recipient is connected to a different server than the sender. How does the message get there?+

A session registry (usually Redis) maps each online user to the connection server currently holding them. The sender's server looks up the recipient, finds their server, and forwards the message over the internal network to be pushed to the phone. If the recipient isn't connected, the message goes to their durable inbox instead.

QHow do you guarantee a message is shown exactly once if the network can duplicate it?+

You don't guarantee it on the wire; you deliver at-least-once and let the recipient deduplicate by unique message ID before displaying. A dropped acknowledgement causes a re-send, the phone sees the duplicate ID and discards it, and the human perceives exactly-once. True network-level exactly-once isn't achievable.

QHow do you keep messages in order when they travel through different servers?+

Attach a per-conversation ordering key (sequence number or monotonic timestamp) and have the client sort by it. You only need ordering within a single chat, not globally across the whole system, which is far cheaper to provide. Scope the guarantee to what the user actually sees.

QWhat's the role of push notifications (APNs/FCM) in delivery?+

They're a best-effort nudge to wake the app and prompt it to reconnect and sync from its durable inbox, not the delivery mechanism itself. Push is unreliable on its own, so the inbox remains the source of truth. The phone fetches missed messages on reconnect; push just tells it there's something to fetch.

QHow do you handle presence / 'last seen' without a fanout storm?+

Make it approximate. Track presence with a heartbeat and a short TTL in a fast store (online if pinged recently), and push status changes lazily, often only when a contact opens the chat rather than broadcasting every change to everyone. Perfect real-time presence isn't worth the fanout cost.

QHow does end-to-end encryption constrain the design?+

The server routes and stores ciphertext it can't read, so it acts as a blind courier. That removes server-side search and content-based spam scanning, and group fanout has to encrypt the message separately for every recipient device, since each device holds its own keys. Many otherwise-tempting server-side features become impossible, which has to be assumed from the start.