When India plays Pakistan in a World Cup, more people press play at the same instant than for almost any other event on earth. JioHotstar has reported concurrency records north of 6 crore (60 million) simultaneous viewers on a single stream. Your instinct from app development ("add more servers") is the wrong starting point here, and understanding why is the whole lesson.
The saving grace is this: at any given second, all 60 million people want the exact same two seconds of video. That sameness is what makes the problem solvable. Let's build the system from the ground up.
What we're actually building
Functional · what it does
- Watch the live match on phone, TV, web
- Pick or auto-adjust quality (240p to 1080p+)
- Pause, and rewind into the recent past (DVR)
- See a live "X are watching" counter
- Chat, reactions, and live score widgets
- Ads stitched into the stream
Non-functional · what it must survive
- 60M+ concurrent viewers on one stream
- Glass-to-glass latency in seconds, not minutes
- Survive a sudden 10x spike at the toss
- Degrade gracefully, never go fully dark
- Work on a 2G train connection and on fibre
- 99.99% available during the match window
The shape of the load is unusual and worth naming. It isn't steady traffic. It's a scheduled spike: nothing, then tens of millions of people in a ten-minute window around the toss, then sharp surges on every wicket and boundary. You know exactly when it's coming, which is a gift, and you have to be ready for all of it at once, which is the hard part.
Back-of-the-envelope: how big is this really?
Sixty million viewers at HD bitrates is roughly 300 Tbps of egress, and that number is the one that should stop you. No single data centre, no single origin, nothing you can buy as one box, serves that. For comparison, it dwarfs the total capacity most companies ever provision. The only way to move that much video is to never serve most of it from your own infrastructure. That sentence is the entire architecture in disguise.
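To see where a figure like 300 Tbps comes from, here is the arithmetic as a small sketch. The 5 Mbps average per-viewer bitrate is an assumption chosen for illustration (it reproduces the 300 Tbps figure); it is not a measured number.

```python
# Back-of-the-envelope egress estimate.
# ASSUMPTION: 5 Mbps is an illustrative average bitrate per viewer,
# not a measured value.
CONCURRENT_VIEWERS = 60_000_000
AVG_BITRATE_MBPS = 5


def peak_egress_tbps(viewers: int, avg_mbps: float) -> float:
    """Total egress in terabits per second (1 Tbps = 1,000,000 Mbps)."""
    return viewers * avg_mbps / 1_000_000


total_tbps = peak_egress_tbps(CONCURRENT_VIEWERS, AVG_BITRATE_MBPS)

# How many fully saturated 100 Gbps links that would take.
links_100g = total_tbps * 1000 / 100

print(f"{total_tbps:.0f} Tbps, about {links_100g:.0f} saturated 100 Gbps links")
```

Halve the assumed bitrate and you still need 150 Tbps, so the conclusion doesn't depend on the exact input.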
How video actually reaches your phone
You don't stream "a video." You stream a rapid sequence of tiny files. This is the single most important idea in the whole design, so let's make it concrete.
The camera feed is encoded, then chopped into segments of a few seconds each. Every segment is just a small file sitting on a server, like an image. A text file called the manifest (or playlist) lists the segments in order. Your player reads the manifest, downloads segment 1, plays it, downloads segment 2, and so on. For a live stream, the manifest keeps getting new segments appended to its end, and the player keeps re-reading it.
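A rough sketch of what the player does with a manifest. The playlist below uses real HLS tags, but the segment names and sequence numbers are made up, and a real player handles far more tags than this.

```python
# A hypothetical live HLS media playlist. Segment names are illustrative.
PLAYLIST = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:4469
#EXTINF:6.0,
segment_4469.ts
#EXTINF:6.0,
segment_4470.ts
#EXTINF:6.0,
segment_4471.ts
"""


def segment_uris(playlist: str) -> list[str]:
    """Segment URIs in play order: every non-blank line that is not a tag."""
    return [
        line.strip()
        for line in playlist.splitlines()
        if line.strip() and not line.startswith("#")
    ]


def new_segments(playlist: str, already_played: set[str]) -> list[str]:
    """What a live player still needs after re-reading the manifest."""
    return [uri for uri in segment_uris(playlist) if uri not in already_played]
```

Each time the player re-reads the manifest, it calls something like `new_segments` and downloads whatever was appended since the last read.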
The two common formats are HLS (Apple's, ubiquitous) and DASH. They differ in detail but share the idea: a manifest plus a pile of segment files. Because segments are ordinary files served over plain HTTP, every piece of HTTP caching you learned applies directly. That's why this scales when a custom streaming protocol wouldn't.
The encoding ladder and adaptive bitrate
A viewer on office fibre and a viewer on a 2G train can't get the same file. So the encoder produces the same match at several quality levels at once. That set of levels is the bitrate ladder.
The player measures how fast segments are arriving and picks the highest rung it can sustain, dropping down when the network dips and climbing back when it recovers. This is adaptive bitrate (ABR), and it all happens on the client. The server just offers the choices.
The key design choice: ABR lives in the player, not the server. The server stays dumb and cacheable. It offers the same handful of renditions to everyone, and 60 million players individually decide what to pull. No server-side per-user logic means no server-side bottleneck.
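A minimal sketch of the client-side decision, assuming a simple throughput-based rule. The ladder bitrates and the 0.8 safety factor are illustrative assumptions; production players use more elaborate heuristics (buffer level, smoothing, and so on).

```python
# ASSUMPTION: these ladder bitrates are illustrative, not a real service's ladder.
LADDER_KBPS = {"240p": 300, "360p": 700, "480p": 1200, "720p": 2500, "1080p": 5000}


def throughput_kbps(segment_bytes: int, download_seconds: float) -> float:
    """Measured throughput from the last segment download."""
    return segment_bytes * 8 / 1000 / download_seconds


def pick_rendition(
    measured_kbps: float,
    ladder: dict[str, int] = LADDER_KBPS,
    safety: float = 0.8,
) -> str:
    """Highest rung whose bitrate fits within a safety margin of throughput."""
    budget = measured_kbps * safety
    affordable = [name for name, kbps in ladder.items() if kbps <= budget]
    if not affordable:
        # Nothing fits: fall back to the lowest rung rather than stopping.
        return min(ladder, key=ladder.get)
    return max(affordable, key=ladder.get)
```

Note that nothing here talks to a server. The player only chooses which ordinary file to request next, which is what keeps the server side cacheable.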
Why this caches so beautifully
Here's the payoff. At 8:46:02 PM, every viewer watching 720p wants 720p/segment_4471.ts. The same file. For all of them.
So when the first viewer in a city requests that segment, the nearby CDN edge server fetches it once from the origin and caches it. The next several hundred thousand viewers in that city are served from the edge's memory. The origin saw one request and the CDN served hundreds of thousands.
Decision · Live video is cached like static files, with very short TTLs.
A segment is immutable once created, so it can be cached hard. The manifest changes every few seconds, so it gets a tiny TTL (a second or two). The result is a CDN offload ratio above 99%: your origin serves a trickle while the CDN serves the flood. This is why "add more origin servers" is the wrong instinct, and "make it cacheable" is the right one.
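One way to express that split is in the `Cache-Control` header the origin attaches to each response. This is a sketch: the file extensions and the exact TTL values are assumptions, and a real deployment would tune them to its segment length and DVR window.

```python
def cache_control_for(path: str) -> str:
    """Pick a Cache-Control header by file type (TTL values are assumed)."""
    if path.endswith(".m3u8"):
        # Manifest: changes every few seconds, so cache it very briefly.
        return "public, max-age=1"
    if path.endswith((".ts", ".m4s")):
        # Segment: never changes once written, so cache it hard.
        return "public, max-age=3600, immutable"
    return "no-store"


def offload_ratio(total_requests: int, origin_requests: int) -> float:
    """Fraction of requests the CDN answered without touching the origin."""
    return 1 - origin_requests / total_requests
```

For example, if the CDN handles 1,000,000 requests and only 5,000 reach the origin, `offload_ratio(1_000_000, 5_000)` is 0.995, or 99.5% offload.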
There's one subtlety that trips people up. With hundreds of thousands of requests for a not-yet-cached segment arriving at the same instant, a naive cache would forward all of them to the origin (a cache stampede). CDNs solve this with request coalescing: the first request goes to origin, the rest wait for that single fetch and share its result. You should know the term, because the same pattern protects your own Redis caches.
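Request coalescing can be sketched in a few lines. This is an in-process illustration using threads, not how any particular CDN implements it; `CoalescingCache` and `demo` are hypothetical names.

```python
import threading
import time
from concurrent.futures import Future


class CoalescingCache:
    """Cache where concurrent misses for one key share a single origin fetch."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._lock = threading.Lock()
        self._cache = {}
        self._in_flight = {}  # key -> Future for the fetch in progress

    def get(self, key):
        with self._lock:
            if key in self._cache:
                return self._cache[key]
            future = self._in_flight.get(key)
            leader = future is None
            if leader:
                future = Future()
                self._in_flight[key] = future
        if not leader:
            # Follower: wait for the leader's fetch and share its result.
            return future.result()
        try:
            value = self._fetch(key)
        except Exception as exc:
            with self._lock:
                del self._in_flight[key]
            future.set_exception(exc)
            raise
        with self._lock:
            self._cache[key] = value
            del self._in_flight[key]
        future.set_result(value)
        return value


def demo(viewers: int = 50):
    """Simulate many viewers requesting one cold segment at the same time."""
    origin_calls = []

    def slow_origin(key):
        origin_calls.append(key)
        time.sleep(0.05)  # pretend the origin takes 50 ms
        return f"bytes-of-{key}"

    cache = CoalescingCache(slow_origin)
    results = []
    threads = [
        threading.Thread(target=lambda: results.append(cache.get("720p/segment_4471.ts")))
        for _ in range(viewers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(origin_calls), results
```

Calling `demo()` sends fifty simultaneous requests through the cache, and the origin function runs once. The first caller becomes the leader and everyone else waits on its result.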
The full architecture
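Putting the pieces in layers, top (closest to the broadcast) to bottom (closest to the viewer): the encoder turns the camera feed into the bitrate ladder and chops it into segments; the origin holds those segments and the manifests; an origin shield sits in front of the origin; the edges of several CDNs sit in front of the shield; and a steering layer points each player at a healthy, nearby edge.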
A few of these layers deserve a closer look.
Origin shield. You don't want every CDN edge hitting your origin independently on a cache miss. You put a shielding tier in front: a smaller set of mid-tier caches that the edges fall back to, which in turn fall back to the origin. So a cold segment is fetched from origin essentially once globally, not once per edge. The origin's job shrinks to "answer a few thousand requests per second," which is boring and survivable.
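A toy model makes the effect of the shield visible. The class names and the figure of 1,000 edges are assumptions for illustration; each tier is just a dictionary that asks its upstream on a miss.

```python
class Origin:
    """Counts how many times it is actually asked for a file."""

    def __init__(self):
        self.fetches = 0

    def get(self, key):
        self.fetches += 1
        return f"bytes-of-{key}"


class CacheTier:
    """A cache that falls back to its upstream on a miss."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.store = {}

    def get(self, key):
        if key not in self.store:
            self.store[key] = self.upstream.get(key)
        return self.store[key]


SEGMENT = "720p/segment_4471.ts"

# Without a shield: every cold edge goes straight to the origin.
origin_direct = Origin()
for edge in [CacheTier(upstream=origin_direct) for _ in range(1000)]:
    edge.get(SEGMENT)

# With a shield: the edges miss, but the shield absorbs all but one fetch.
origin_shielded = Origin()
shield = CacheTier(upstream=origin_shielded)
for edge in [CacheTier(upstream=shield) for _ in range(1000)]:
    edge.get(SEGMENT)

print(origin_direct.fetches, origin_shielded.fetches)  # 1000 1
```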
Multi-CDN. No single CDN has enough capacity or reliability for a 60M-viewer event, so you use several (Akamai, Cloudflare, Jio's own CDN, and so on) at once. A client-side or DNS-level steering layer routes each viewer to the CDN that's healthiest and closest right now. If one CDN starts erroring or slows down, traffic shifts to the others within seconds.
Decision · Run multiple CDNs in parallel, not one with a backup.
A hot standby that only takes over on total failure doesn't help when a CDN degrades partially under load (the common case). Active multi-CDN steering lets you shift, say, 30% of traffic away from a struggling provider in real time. The cost is more operational complexity and more vendor relationships, which at this scale is unavoidable anyway.
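A sketch of what "shift traffic in real time" might look like, assuming the steering layer sees a recent error rate per CDN. The CDN names, the linear scoring rule, and the 5% cut-off are all hypothetical; real steering also weighs latency, cost, and contracted capacity.

```python
import random


def steering_weights(
    error_rates: dict[str, float], max_error_rate: float = 0.05
) -> dict[str, float]:
    """Turn per-CDN error rates into traffic shares that sum to 1."""
    # Health score: 1.0 at zero errors, falling linearly to 0.0 at the cut-off.
    scores = {
        cdn: max(0.0, 1.0 - rate / max_error_rate)
        for cdn, rate in error_rates.items()
    }
    total = sum(scores.values())
    if total == 0:
        # Everything looks unhealthy: spread evenly rather than send nothing.
        return {cdn: 1 / len(scores) for cdn in scores}
    return {cdn: score / total for cdn, score in scores.items()}


def pick_cdn(weights: dict[str, float], rng: random.Random) -> str:
    """Choose a CDN for one viewer in proportion to its weight."""
    return rng.choices(list(weights), weights=list(weights.values()))[0]
```

With `{"cdn_a": 0.0, "cdn_b": 0.0, "cdn_c": 0.05}`, the weights come out as half each for `cdn_a` and `cdn_b` and nothing for `cdn_c`. A partially degraded CDN gets a smaller share rather than being cut off entirely.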
Latency: the fight you can't fully win
"Glass to glass" is the delay from the real-world moment to your screen. With ordinary HLS using 6-second segments, you're often 20 to 45 seconds behind live, because the player buffers a few segments for safety. For cricket that's painful: your neighbour's TV erupts at a wicket before your stream shows the ball being bowled, and the WhatsApp spoilers arrive first.
The levers to cut latency:
- Shorter segments (2 seconds instead of 6) mean the player is less far behind, at the cost of more requests and more manifest churn.
- Low-Latency HLS (LL-HLS) breaks segments into smaller "parts" the player can fetch before the full segment is ready, pushing latency toward 3 to 6 seconds.
- Smaller player buffer cuts delay but raises the risk of a stall when the network hiccups.
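A crude model shows why the first and third levers work. It assumes latency is roughly the buffered segments plus a fixed pipeline delay for encoding, packaging, and CDN propagation. The 5-second pipeline figure and the 3-segment buffer are assumptions for illustration, not measurements.

```python
def estimated_latency_s(
    segment_s: float, buffered_segments: int, pipeline_s: float
) -> float:
    """Rough glass-to-glass estimate: player buffer plus pipeline delay."""
    return segment_s * buffered_segments + pipeline_s


# ASSUMPTION: 3 buffered segments and a 5 s pipeline delay in both cases.
classic = estimated_latency_s(segment_s=6, buffered_segments=3, pipeline_s=5)
shorter = estimated_latency_s(segment_s=2, buffered_segments=3, pipeline_s=5)

print(classic, shorter)  # 23 11
```

The same buffer depth in segments costs far fewer seconds when each segment is shorter, which is the whole appeal of the first lever.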
Latency vs scale vs smoothness: pick two
Every latency win costs you something. Shorter segments multiply request volume and reduce cache efficiency. Smaller buffers make stalls more likely on weak networks, which in India is most networks. The honest engineering answer is a target like "5 to 8 seconds, almost never stalling," not "as low as possible." A stall during the final over is worse than being eight seconds behind.
The thundering herd at every wicket
Steady-state streaming is the easy part. The brutal moments are the synchronised actions:
The toss
Tens of millions open the app inside a few minutes. This is mostly an authentication and manifest-fetch spike, not a video-bandwidth spike yet.
A wicket or six
Everyone reacts at once: some seek back to replay it, many open chat, the concurrency counter jumps. A correlated burst across several subsystems in the same second.
Innings break
A huge fraction pause or background the app together, then come back together. A coordinated drop and re-spike.
The last over
Peak concurrency, peak chat, peak emotional cost of any failure. The worst possible time for anything to wobble.
The defences are the patterns from the data and API chapters, applied hard. Authentication is the first wall to fall at the toss, so token checks must be cheap and cacheable, with sessions in a fast store and as little per-request database work as possible. Anything that can be pre-computed before the match (entitlements, manifests, ad metadata) is pre-computed and warmed. And the system is pre-scaled: because you know the match schedule, you provision and warm capacity before the toss rather than autoscaling reactively, since reactive autoscaling is far too slow for a spike that arrives in minutes.
The "X million watching" counter
That live counter looks trivial and is secretly one of the hardest parts. You cannot run COUNT(*) across 60 million live connections every second, and you can't increment one hot row 60 million times without it becoming a bottleneck.
The trick is to give up on exactness. Nobody needs to know it's 59,412,008 versus 59,398,771. "About 59 million" is perfect. So:
- Each connection server tracks its own local count of viewers it's holding.
- Servers report their local counts to an aggregator every few seconds.
- The aggregator sums them and publishes an approximate total, which is pushed to clients.
This turns 60 million per-second updates into a few thousand servers reporting a number occasionally. For counting unique viewers rather than connections, a probabilistic structure like HyperLogLog estimates cardinality in kilobytes of memory instead of gigabytes. The theme runs through the whole design: at this scale, the right amount of accuracy is the least you can get away with.
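The three steps above can be sketched as a tiny aggregator. The class name, the 15-second staleness window, and the server ids are assumptions; the point is only that the aggregator handles one number per server, not one event per viewer.

```python
class ViewerCountAggregator:
    """Sums the latest local counts reported by connection servers."""

    def __init__(self, stale_after_s: float = 15.0):
        self._reports = {}  # server_id -> (local_count, reported_at)
        self._stale_after_s = stale_after_s

    def report(self, server_id: str, local_count: int, now: float) -> None:
        """Called by each connection server every few seconds."""
        self._reports[server_id] = (local_count, now)

    def approximate_total(self, now: float) -> int:
        """Sum of recent reports; servers that went quiet are ignored."""
        return sum(
            count
            for count, reported_at in self._reports.values()
            if now - reported_at <= self._stale_after_s
        )


def display_label(total: int) -> str:
    """Round to the nearest 0.1 million for display."""
    return f"{total / 1_000_000:.1f}M watching"
```

Ignoring stale reports means a crashed connection server drops out of the total on its own, without any cleanup protocol.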
The social and money layers
Chat and reactions are a fanout problem (see the news-feed deep dive). One message must reach millions, so you never store-and-forward per recipient. Messages go to a pub/sub layer, connection servers subscribe for the users they hold, and slow clients get dropped rather than allowed to slow everyone down. At cricket scale, chat is often sampled or rate-limited per region, because no human can read a feed moving at a million messages a second anyway.
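A minimal sketch of "slow clients get dropped", assuming each connection server keeps a small bounded queue per client. `ChatFanout` and the backlog size are hypothetical; a real server would also close the socket and let the client reconnect.

```python
import queue


class ChatFanout:
    """Delivers each message to every client, dropping clients that fall behind."""

    def __init__(self, max_backlog: int = 100):
        self._max_backlog = max_backlog
        self._clients = {}  # client_id -> bounded outbound queue

    def connect(self, client_id: str) -> queue.Queue:
        outbound = queue.Queue(maxsize=self._max_backlog)
        self._clients[client_id] = outbound
        return outbound

    def publish(self, message: str) -> list[str]:
        """Enqueue for all clients; return the ids dropped for being too slow."""
        dropped = []
        for client_id, outbound in self._clients.items():
            try:
                outbound.put_nowait(message)  # never block on a slow reader
            except queue.Full:
                dropped.append(client_id)
        for client_id in dropped:
            del self._clients[client_id]
        return dropped
```

The key line is `put_nowait`: the publisher never waits for any single reader, so one phone on a bad connection cannot slow delivery to everyone else.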
Ads use server-side ad insertion (SSAI): ad segments are stitched directly into the stream so they arrive as ordinary video segments, which means they can't be blocked and they cache like everything else. Personalised ads complicate the caching story (different viewers may get different ad segments), so you balance how much to personalise against how much cache efficiency you're willing to lose during the most expensive minutes of the year.
When things go wrong
At this scale something is always degraded somewhere. The goal is never to go fully dark.
A CDN starts failing
Edges in a region throw errors or slow down. The steering layer detects elevated error rates and shifts those viewers to other CDNs within seconds. Viewers might see a brief quality drop as their player re-buffers, but the stream continues.
Graceful degradation
If the system is genuinely overloaded, you shed load deliberately: cap the top rendition (1080p off, everyone to 720p) to cut bandwidth, make the concurrency counter update less often, throttle chat. The match keeps playing. A slightly softer picture beats a spinner.
The principle worth carrying everywhere: decide your degradation order before the match. Under stress, drop the least important things first (counter freshness, chat, top quality) so the one thing that matters (the live video) survives.
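Deciding the order in advance can be as simple as a table checked against current load. The thresholds and action names below are hypothetical; what matters is that the order is written down before the match and that live video never appears in it.

```python
# ASSUMPTION: thresholds are fractions of provisioned capacity, chosen
# for illustration. Least important feature is shed first.
DEGRADATION_ORDER = [
    (0.80, "slow_counter_updates"),
    (0.90, "throttle_chat"),
    (0.95, "cap_top_rendition_at_720p"),
]


def active_degradations(load_fraction: float) -> list[str]:
    """Every action whose threshold the current load has reached."""
    return [
        action
        for threshold, action in DEGRADATION_ORDER
        if load_fraction >= threshold
    ]
```

At 92% load, `active_degradations(0.92)` returns the counter and chat actions but leaves picture quality alone. The top rendition is capped only once load reaches 95%.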
The one idea to take away
Scaling live video isn't about serving 60 million people. It's about serving one segment to a CDN and letting it serve the 60 million. Make the hot path immutable and cacheable, push all per-user decisions to the client, pre-scale for the spike you can see coming, and decide in advance what you'll sacrifice when something breaks.
Test yourself
Questions · say the answer out loud before you read it. If you can't, the chapter isn't done.
Q · Why can't you just add more origin servers to handle 60M viewers?
Because 60M viewers at HD is roughly 300 Tbps of egress, which no origin fleet can serve. The design instead makes the hot path (segments) immutable and cacheable so the CDN absorbs 99%+ of traffic. The origin only needs to serve each segment a handful of times globally, behind a shield tier. The bottleneck isn't compute, it's bandwidth, and bandwidth is solved by caching, not by more origins.
Q · Why is live video so cacheable when it's 'live'?
Because at any instant every viewer at a given quality wants the exact same segment file. That file is immutable once created, so an edge fetches it once and serves it to everyone nearby. Only the manifest changes frequently, and it gets a tiny TTL. Sameness across viewers is what turns a 300 Tbps problem into a caching problem.
Q · What is a cache stampede here, and how is it handled?
When a popular segment isn't cached yet and hundreds of thousands of requests for it hit an edge in the same instant, a naive cache forwards all of them to origin. The fix is request coalescing: the first request goes to origin, the rest wait and share its result. The same pattern protects your own Redis caches from a hot-key stampede.
Q · A viewer complains the stream is 30 seconds behind live. What are your levers?
Shorter segments (2s instead of 6s), Low-Latency HLS so the player fetches partial segments before they're complete, and a smaller player buffer. Each has a cost: more requests, lower cache efficiency, and a higher chance of stalls on weak networks. The realistic target is a few seconds behind while almost never stalling, not zero latency.
Q · How do you show a live concurrency counter for 60M viewers?
Approximately. Each connection server counts its own viewers and reports to an aggregator every few seconds; the aggregator sums and publishes a rounded total. For unique counts, HyperLogLog estimates cardinality in kilobytes. You never run COUNT(*) or increment one hot row 60M times. Exactness isn't a requirement, so you trade it for scale.
Q · Why multi-CDN instead of one CDN with a failover?
No single CDN reliably has the capacity for a 60M event, and CDNs usually degrade partially under load rather than failing outright. Active multi-CDN steering lets you shift a fraction of traffic away from a struggling provider in real time, which a standby that only takes over on total failure can't do. The cost is operational complexity, unavoidable at this scale.
Q · The match starts in 10 minutes and traffic will 50x. Do you rely on autoscaling?
No. Reactive autoscaling reacts in minutes and the spike arrives in minutes, so you'd be permanently behind. Because the schedule is known, you pre-scale: provision and warm capacity, pre-compute entitlements and manifests, and prime caches before the toss. Autoscaling handles the gentle ramp afterward, not the toss spike.
Q · The system is overloaded mid-match. What do you sacrifice first?
Whatever matters least to the viewer: cap the top rendition to cut bandwidth, slow the concurrency counter, throttle or sample chat. The live video stream is sacrosanct. The key is deciding this degradation order before the match so the system sheds the right load automatically under stress instead of failing randomly.