You build a URL shortener in a weekend. It works perfectly. Then it goes viral.
At first it’s just your friends. Pages load instantly, and they love what you’ve built. They share the link with others, and you watch the user count tick towards a hundred. That quiet excitement hits; you’ve made something real. Then more people start using it.
You keep opening your dashboard. The numbers are climbing faster than you can refresh. You are half excited, half terrified. Hundreds is turning into thousands. Then the notifications start: the app is slow, links are not redirecting, users are complaining publicly. The same system that handled a hundred users without blinking is now falling apart under a thousand. You have not changed a single line of code.
So what went wrong?
Nothing went wrong. You just hit the wall that every growing system hits eventually.
The question is:
What do you actually do?
The Scaling Roadmap
Scaling isn’t a single decision; it’s a series of targeted upgrades, each one unlocking the next order of magnitude. Here’s the progression most high-traffic apps follow:
•Single Server - handles your first few thousand users
•Load Balancer - distributes traffic across multiple servers
•Caching Layer - serves popular data from memory instead of the database
•Content Delivery Network (CDN) - pushes content closer to users globally
•Distributed Cache - spreads cache across multiple machines for millions of users
Single Server : Your first deployment runs everything on one machine: request handling, database queries, link generation, and page serving. This is perfectly fine up to a few thousand users; the pages load fast and the setup is simple. Don’t over-engineer this stage.
Load Balancer : Once you’re in the tens of thousands, a single server starts to buckle. Requests queue up, response times climb, and occasional timeouts start appearing. A load balancer sits in front of your servers and distributes incoming traffic across a pool of app servers, ensuring no single machine becomes a bottleneck. Traffic spikes that would have crashed your app are now absorbed gracefully.
Caching Layer : At hundreds of thousands of users, a pattern becomes obvious: the same short codes are being resolved over and over. Instead of hitting the database every time, a cache layer stores the most frequently accessed mappings in memory. A lookup that previously cost a 40ms database round-trip now completes in under 1ms. Database load drops dramatically, and your app can handle far more concurrent users on the same hardware.
Content Delivery Network (CDN) : Once your users are spread across the globe, physical distance becomes a problem. A CDN places copies of your static assets and cacheable responses at edge locations around the world. A user in Lagos, Berlin, or Sydney gets their redirect served from a nearby edge node rather than your origin server in, say, Virginia. Round-trip latency can drop from hundreds of milliseconds to a few tens, or even single digits when the edge node is close.
Distributed Cache : At millions of users, even a single powerful cache server becomes a constraint. A distributed cache, such as Redis Cluster, spreads data across multiple nodes. The most popular short links are served from memory, read throughput scales horizontally, and the system stays fast even under massive, sustained load.
Load Balancing: Distributing Traffic Across Servers
Round-Robin : Round-robin is the simplest traffic distribution strategy: each incoming request is sent to the next server in rotation, cycling back to the start. It works well when servers are equally capable and traffic is fairly uniform. For a URL shortener handling stateless redirect requests, round-robin is a reasonable starting point at modest scale.
But round-robin has a critical blind spot. It knows nothing about data locality. If one server has cached a hot short code in memory, round-robin may send the next request for that code to a different server entirely, causing a cache miss. At scale, this causes unnecessary database pressure and unpredictable latency. Adding or removing servers also reshuffles which server handles which requests, wiping out accumulated cache state.
The Rehashing Problem : Imagine your URL shortener has four servers, each caching a quarter of your popular short codes. You add a fifth server to handle increased load. With naive modulo hashing (hash(short_code) % number_of_servers), a key stays on its old server only if its hash gives the same remainder modulo 4 and modulo 5, which happens for about 20% of keys. The other 80% or so now map to different servers. Users experience slow or failed redirects while servers frantically rebuild their caches.
It’s like rearranging a warehouse mid-shipment.
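To make the 80% figure concrete, here is a minimal sketch that measures how many keys change servers when a pool grows from four to five under modulo hashing. The key names and the helper functions (`stable_hash`, `server_for`, `fraction_remapped`) are made up for illustration; MD5 is used only as a stable, well-spread hash, not for security.

```python
import hashlib

def stable_hash(key: str) -> int:
    """Deterministic integer hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

def server_for(key: str, num_servers: int) -> int:
    """Naive mod-N placement: the server index depends on the pool size."""
    return stable_hash(key) % num_servers

def fraction_remapped(keys, old_n: int, new_n: int) -> float:
    """Share of keys whose server index changes when the pool is resized."""
    moved = sum(1 for k in keys if server_for(k, old_n) != server_for(k, new_n))
    return moved / len(keys)

# Hypothetical short codes, purely for the experiment.
keys = [f"code{i}" for i in range(100_000)]
moved = fraction_remapped(keys, old_n=4, new_n=5)
print(f"{moved:.1%} of keys changed servers")
```

The measured share should land close to 80%, which is why a resize under mod-N hashing behaves like a near-total cache flush.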
Consistent Hashing: The Production Solution
Consistent hashing solves this cleanly. Picture a ring. Servers occupy fixed positions along the ring, and each short code is hashed to a point on the ring. Requests route clockwise to the nearest server. When you add a new server, only the keys in the arc immediately preceding its position need to migrate: roughly 1/N of total keys, where N is the number of servers after the change. Virtual nodes (multiple positions per server) smooth out load distribution even further.
For your URL shortener, consistent hashing on the short_code ensures that popular links reliably route to the server holding their cache, and that adding capacity during a traffic spike doesn’t cascade into a cache stampede.
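The ring described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a production router: the `HashRing` class, the server names, and the choice of 100 virtual nodes per server are all hypothetical.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Map a string to a stable position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, servers, vnodes: int = 100):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> server name
        for server in servers:
            self.add(server)

    def add(self, server: str) -> None:
        """Place `vnodes` virtual nodes for this server on the ring."""
        for i in range(self.vnodes):
            point = ring_hash(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def lookup(self, key: str) -> str:
        """Walk clockwise from the key's position to the next virtual node."""
        idx = bisect.bisect_right(self._points, ring_hash(key)) % len(self._points)
        return self._owner[self._points[idx]]

keys = [f"code{i}" for i in range(20_000)]
ring = HashRing(["app1", "app2", "app3", "app4"])
before = {k: ring.lookup(k) for k in keys}

ring.add("app5")  # scale out during a traffic spike
moved = [k for k in keys if ring.lookup(k) != before[k]]
print(f"{len(moved) / len(keys):.1%} of keys moved, all of them to app5")
```

Only the keys claimed by the new server move, and that share should sit near one fifth. Every other key keeps routing to the server that already holds its cache entry.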
Here’s how round-robin looks in an NGINX upstream configuration:
upstream app_servers {
    server app1.example.com;
    server app2.example.com;
    server app3.example.com;
}
server {
    location / {
        proxy_pass http://app_servers;
    }
}
Algorithm comparison:
| Algorithm | Keys Moved on Change | Used By | Best For |
| Round-Robin | N/A (no cache affinity) | NGINX default | Stateless, uniform requests |
| Mod-N Hashing | ~N/(N+1) when a server is added (80% for 4 to 5) | Legacy systems | Static server pools only |
| Consistent Hashing | ~1/N (minimum possible) | DynamoDB, Cassandra, Akamai | Dynamic scaling, cache affinity |
| Power of Two Choices | N/A (load-aware) | AWS Lambda, Envoy | Multi-LB environments, service mesh |
Real-world precedent : Netflix applies consistent hashing to route requests to the servers holding cached video segment data. Popular content is served without repeatedly querying origin storage, keeping playback smooth even under massive load. The same principle applies directly to your URL shortener.
HTTP Caching: Making the Web Faster
HTTP caching is built into the web protocol. When configured correctly, browsers and CDN edge nodes store responses locally, eliminating redundant trips to your origin servers. The key headers are:
•Cache-Control - defines how long content should be stored and by whom
•ETag - a fingerprint that lets clients check whether cached content is still fresh
•Vary - specifies which request headers affect the cached response
Understanding Cache-Control : A common misconception: Cache-Control: no-cache does not mean “don’t cache.” It means “cache, but revalidate before serving.” The response can live in memory; it just can’t be served without checking freshness first. Understanding this distinction is essential to using caching effectively.
A more powerful pattern is splitting browser and CDN TTLs:
Cache-Control: public, max-age=60, s-maxage=3600
This tells browsers to cache for 60 seconds (so users get fast responses on repeated clicks) and CDNs to cache for an hour (so your origin servers rarely see requests for popular links). Browsers validate frequently; CDNs absorb the bulk of the load.
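As a rough illustration of how the two TTLs are read, the sketch below parses the header and picks the lifetime that applies to each kind of cache. The function names are hypothetical and the parser is deliberately simplified (it ignores quoted values and other directives that affect freshness). The rule it encodes, that s-maxage takes precedence over max-age for shared caches, is from RFC 9111.

```python
def parse_cache_control(header: str) -> dict:
    """Split a Cache-Control value into {directive: value or True}."""
    directives = {}
    for part in header.split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = value or True
    return directives

def freshness_lifetime(header: str, shared_cache: bool) -> int:
    """Seconds a response stays fresh in a browser or a shared cache (CDN)."""
    directives = parse_cache_control(header)
    if shared_cache and "s-maxage" in directives:
        return int(directives["s-maxage"])  # shared caches prefer s-maxage
    return int(directives.get("max-age", 0))

header = "public, max-age=60, s-maxage=3600"
browser_ttl = freshness_lifetime(header, shared_cache=False)
cdn_ttl = freshness_lifetime(header, shared_cache=True)
print(browser_ttl, cdn_ttl)
```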
ETags and Conditional Requests : On the first request, your server returns a response with an ETag header, a hash or version identifier. The browser stores it. On the next request, the browser sends the ETag back in an If-None-Match header. If the content hasn’t changed, the server responds with 304 Not Modified. No body is sent, bandwidth is saved, and the page loads from the local copy. For a URL shortener, this matters for any metadata pages where content changes infrequently.
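The exchange can be sketched as a single handler function. This is a minimal illustration, assuming a content-hash ETag; `make_etag` and `respond` are hypothetical names, and the sketch ignores weak validators (`W/`) and the `*` wildcard.

```python
import hashlib
from typing import Optional

def make_etag(body: bytes) -> str:
    """Derive a quoted ETag from the response body."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'

def respond(body: bytes, if_none_match: Optional[str] = None):
    """Return (status, headers, body), honoring If-None-Match."""
    etag = make_etag(body)
    if if_none_match is not None:
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if etag in candidates:
            return 304, {"ETag": etag}, b""   # client copy is still valid
    return 200, {"ETag": etag}, body

page = b"<html>stats for /abc123</html>"
status, headers, _ = respond(page)                    # first visit
status_again, _, body_again = respond(page, headers["ETag"])  # revalidation
print(status, status_again, len(body_again))
```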
Stale-while-revalidate : The stale-while-revalidate directive allows a cache to serve an expired entry immediately, for a bounded window after expiry, while it fetches a fresh copy in the background. Applied to your URL shortener, this means a redirect response can still be served from cache shortly after its TTL expires, with the cache refreshed transparently. Within that window, users don’t wait on the origin during high-traffic bursts.
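One way to picture the three states a cached entry can be in is the sketch below. The function name and return labels are hypothetical. The boundaries follow the usual reading of the directive: fresh while the age is below max-age, then servable while stale for a further stale-while-revalidate seconds.

```python
def cache_decision(age: int, max_age: int, stale_while_revalidate: int) -> str:
    """Classify a cached response by its age in seconds."""
    if age < max_age:
        return "serve_fresh"
    if age < max_age + stale_while_revalidate:
        # Answer from cache now; refresh in the background.
        return "serve_stale_and_revalidate"
    return "fetch_from_origin"

# Assumed policy: Cache-Control: max-age=60, stale-while-revalidate=30
for age in (10, 75, 200):
    print(age, cache_decision(age, max_age=60, stale_while_revalidate=30))
```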
The Vary Header Trap
Vary: User-Agent forces caches to store a separate copy for every distinct User-Agent string, and there are thousands of them across browser versions and devices. This silently destroys cache efficiency: every variation gets its own slot, and cache hit rates collapse. Avoid broad Vary headers unless you’re genuinely serving different content per device.
CDN Architecture: Bringing Content Closer to Users
At its core, a CDN is a distributed HTTP cache. Instead of routing every request back to your origin server, copies of your content live at dozens of edge locations worldwide. For a URL shortener, this means viral links, the small fraction that receive massive traffic, can be served entirely from the edge, with zero database involvement.
Pull CDN vs Push CDN
Pull CDN : lazily fetches content from your origin only when a user first requests it. The cache fills naturally over time. Ideal for dynamic or unpredictable content, like short codes whose popularity you can’t know in advance.
Push CDN : requires you to proactively upload content to edge nodes. Best for static resources or pre-generated redirect tables for your most popular links.
Real-world precedent : Netflix Open Connect achieves a 98% CDN cache hit rate for video streams. Nearly every video chunk is served from the edge, not from Netflix’s origin data centers. The same model applies directly to a URL shortener: the top 0.1% of links can be handled entirely at the edge, leaving your database untouched.
Cache invalidation strategies:
| Strategy | How It Works | Speed | Use Case |
| TTL Expiration | Content expires automatically after N seconds | Delayed (waits for TTL) | Slow-changing content: blog posts, product pages |
| Purge API | Manual API call instantly removes cached content | Fastly: 150ms global | News, e-commerce inventory, breaking content |
| Surrogate Keys | Tag responses; purge all tagged objects at once | Same as purge | Complex relationships: purge all product-123 pages |
| Soft Purge | Mark stale, serve old while refreshing in background | Immediate serve | High-traffic pages where downtime is unacceptable |
CDN provider comparison:
| Feature | Cloudflare | AWS CloudFront | Fastly |
| PoPs | 330+ cities | 750+ PoPs + 1,140 embedded | ~200 strategic |
| Routing | Anycast (single IP, BGP routing) | DNS-based (+ Anycast option) | Anycast |
| Purge Speed | Sub-150ms global | Seconds to minutes | 150ms global (since 2011) |
| Edge Compute | Workers: V8 Isolates, <1ms cold start | Lambda@Edge or CF Functions | Compute@Edge: WebAssembly |
| Cache Invalidation | Purge API + Cache Rules | API (slow) + versioned URLs | Surrogate keys: best in class |
| Free Tier | Generous: unlimited bandwidth | Pay per GB from first byte | No free tier |
Redis: Application-Level Caching
Beyond the HTTP layer, your application needs its own in-memory cache. Redis is the industry standard: it stores data in RAM rather than on disk, making look-ups orders of magnitude faster than a database query. For a URL shortener, Redis is the layer that makes redirect responses feel instantaneous.
Cache-Aside: The Recommended Pattern
When a user clicks a short link, your app checks Redis first. If the mapping is there, it’s returned immediately. If not, the app queries the database, returns the result, and stores it in Redis for future requests. Most subsequent clicks on that link never touch the database.
def get_short_url(short_code):
    url = cache.get(short_code)          # Step 1: check the cache
    if url is None:                      # Step 2: cache miss
        url = db.query(short_code)       # Query the database
        if url is not None:              # Don't cache lookups for unknown codes
            cache.set(short_code, url)   # Step 3: populate the cache
    return url
Write-Through vs Write-Behind
Write-Through : writes to both cache and database simultaneously. Guarantees consistency but doubles write latency. Use this when data correctness is non-negotiable.
Write-Behind : writes to cache first and flushes to the database asynchronously. Faster writes, but risks data loss if the cache crashes before the flush completes. Use this for high-throughput analytics where some loss is acceptable.
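The difference is easiest to see side by side. The sketch below uses plain dictionaries as stand-ins for the cache and the database, and the class names are hypothetical; a real write-behind setup would flush from a background worker rather than an explicit `flush()` call.

```python
from collections import deque

class WriteThroughStore:
    """Every write lands in the database and the cache before returning."""
    def __init__(self):
        self.cache, self.db = {}, {}

    def put(self, key, value):
        self.db[key] = value      # durable write first
        self.cache[key] = value   # then keep the cache in step

class WriteBehindStore:
    """Writes land in the cache immediately and reach the database later."""
    def __init__(self):
        self.cache, self.db = {}, {}
        self.pending = deque()    # lost if the process dies before flush()

    def put(self, key, value):
        self.cache[key] = value
        self.pending.append((key, value))

    def flush(self) -> int:
        """Drain queued writes into the database; returns how many were written."""
        flushed = 0
        while self.pending:
            key, value = self.pending.popleft()
            self.db[key] = value
            flushed += 1
        return flushed

links = WriteThroughStore()
links.put("abc123", "https://example.com/long")   # mapping must not be lost

clicks = WriteBehindStore()
clicks.put("abc123:clicks", 41)
clicks.put("abc123:clicks", 42)
print(clicks.db)          # still empty: nothing flushed yet
print(clicks.flush(), clicks.db)
```

For a URL shortener, that split is a reasonable default: link mappings go through the write-through path, while click counters can tolerate the write-behind risk.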
Cache Stampede: The Failure Mode You Must Plan For
A cache stampede happens when a popular cache key expires and thousands of concurrent requests simultaneously find a miss. Each one fires a database query. The database buckles under the load. For a URL shortener, a single viral link expiring at the wrong moment can trigger exactly this scenario.
Three defences:
1. TTL jitter: randomize expiry times slightly so keys don’t expire simultaneously
2. Distributed lock (Redis SET NX EX): only one request rebuilds the cache; others wait
3. XFetch (probabilistic early expiration): refresh hot keys shortly before they expire, so most requests never see the miss
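The first and third defences are small enough to sketch directly. The helper names, the 10% jitter spread, and the sample numbers are assumptions for illustration. The early-refresh test follows the XFetch rule of refreshing when `now - cost * beta * ln(rand)` reaches the expiry time, where `cost` is how long a rebuild takes and `beta` tunes how eager the refresh is.

```python
import math
import random

def jittered_ttl(base_ttl: float, spread: float = 0.1, rng=random) -> float:
    """Return base_ttl adjusted by up to +/- spread (10% by default)."""
    return base_ttl * (1 + rng.uniform(-spread, spread))

def should_refresh_early(now: float, expiry: float, recompute_cost: float,
                         beta: float = 1.0, rng=random) -> bool:
    """XFetch-style check: refresh with rising probability as expiry nears."""
    u = 1.0 - rng.random()            # uniform in (0, 1], avoids log(0)
    head_start = -recompute_cost * beta * math.log(u)   # always >= 0
    return now + head_start >= expiry

ttl = jittered_ttl(300)               # somewhere between 270 and 330 seconds
print(f"store key with ttl={ttl:.0f}s")

# Far from expiry the check is effectively never true; at expiry it always is.
print(should_refresh_early(now=0, expiry=1000, recompute_cost=0.05))
print(should_refresh_early(now=1000, expiry=1000, recompute_cost=0.05))
```

Because each request rolls its own random number, usually only one or a few requests decide to rebuild early, and the rest keep reading the still-valid entry.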
Memory Optimization
Real-world precedent: Instagram needed to keep about 300 million ID mappings (media ID to user ID) in Redis. Stored as plain keys, they were projected to need roughly 21 GB of memory. By bucketing the keys into Redis hashes small enough to use the compact ziplist encoding, they brought that down to about 5 GB, a 76% reduction. For your URL shortener, similar techniques (efficient serialization, compact data structures) can dramatically cut infrastructure costs at scale.
Eviction Policy
allkeys-lru (Least Recently Used) is the right default for general workloads. If your traffic follows an 80/20 pattern, with 20% of links generating 80% of clicks, then allkeys-lfu (Least Frequently Used) keeps your hottest links in memory while evicting cold ones. Choosing the right policy ensures cache performance holds under sustained load.
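A toy comparison shows why the policy matters for a viral link. The two classes below are conceptual sketches with hypothetical names, using exact bookkeeping; Redis itself approximates both policies by sampling keys rather than tracking them precisely.

```python
from collections import Counter, OrderedDict

class LRUCache:
    """Evicts the key that was touched longest ago."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # drop the least recently used

class LFUCache:
    """Evicts the key with the fewest accesses."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = {}
        self.hits = Counter()

    def get(self, key):
        if key not in self.data:
            return None
        self.hits[key] += 1
        return self.data[key]

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            coldest = min(self.data, key=lambda k: self.hits[k])
            del self.data[coldest]
            del self.hits[coldest]
        self.data[key] = value
        self.hits[key] += 1

lru, lfu = LRUCache(2), LFUCache(2)
for cache in (lru, lfu):
    cache.put("viral", "https://example.com/hit")
    for _ in range(5):
        cache.get("viral")                  # the hot link is clicked a lot
    cache.put("one-off-a", "https://example.com/a")
    cache.put("one-off-b", "https://example.com/b")

print("LRU kept viral:", lru.get("viral") is not None)
print("LFU kept viral:", lfu.get("viral") is not None)
```

A short burst of one-off links pushes the hot link out of the LRU cache, while the LFU cache evicts one of the cold links instead.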
Everything Applied: One Request, End to End
So let’s say a user in Nigeria clicks a short link in a tweet. Here’s exactly what happens across the full stack:
Step 1: CDN Edge (~5ms)
The request hits the nearest CDN edge location. For the top 0.1% of viral links, the ones cached at the edge via s-maxage, the redirect response is returned immediately. The request never reaches your servers. TTL jitter ensures popular links don’t expire in sync, preventing coordinated cache misses.
Step 2: Redis / Cache-Aside (~10ms)
If the CDN doesn’t have the link, the request reaches your app servers. Cache-Aside checks Redis for the short_code. A hit returns the mapping in under 10ms. A miss triggers the database path. Distributed locks or XFetch prevent simultaneous misses from cascading into a stampede.
Step 3: Database (~40ms)
On a cache miss, the app queries the sharded database (consistent hashing routes the query to the correct shard), retrieves the mapping, writes it back to Redis, and responds. Ziplist encoding and appropriate eviction policies keep the cache lean and performant for the next request.
Step 4: Redirect Response: 301 vs 302
Why 302 and not 301? Analytics-driven shorteners often prefer 302 because click data is their core product. A 301 is cacheable by default, so a browser may keep following its cached copy of the redirect without contacting your servers, making those later clicks invisible to tracking. A 302 is not cached unless you explicitly allow it, so every click reaches your servers and is recorded. For your URL shortener, the answer depends on whether analytics matter more than marginal performance gains.
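In code, the choice comes down to one status code and one caching header. The sketch below is an illustration only: `build_redirect` is a hypothetical helper, and the Cache-Control values are assumed defaults you would tune for your own traffic.

```python
def build_redirect(long_url: str, track_clicks: bool):
    """Return (status, headers) for a short-link redirect."""
    if track_clicks:
        # Temporary redirect, not stored: every click comes back to the app.
        return 302, {"Location": long_url, "Cache-Control": "private, max-age=0"}
    # Permanent redirect that browsers and CDNs may reuse for a day.
    return 301, {"Location": long_url, "Cache-Control": "public, max-age=86400"}

status, headers = build_redirect("https://example.com/long", track_clicks=True)
print(status, headers["Location"])
```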
Performance at scale: back of the envelope
| Metric | Calculation | Result |
| New URLs created | 100M per month / 30 days / 86,400 sec | ~40 writes/second |
| URL redirects (100:1 read ratio) | 40 writes/sec × 100 | 4,000 reads/sec (40K at peak) |
| Short code space (7 chars, Base62) | 62^7 | 3.52 trillion combinations (~2,900 years at 100M new URLs per month) |
| Storage (5 years) | 100M × 12 months × 5 years × 500 bytes | ~3 TB before replication |
| Redis hot cache (top 1% = 90% of traffic) | 1% of ~3.3M daily URLs × 500 bytes | ~17 MB caches 90%+ of reads |
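These estimates are easy to re-derive, which is worth doing before you size anything. The sketch below recomputes the first four rows from the table’s stated inputs (100M new URLs per month, a 100:1 read ratio, 500 bytes per record); the variable names are arbitrary.

```python
SECONDS_PER_DAY = 86_400
URLS_PER_MONTH = 100_000_000
READ_WRITE_RATIO = 100
BYTES_PER_RECORD = 500

writes_per_sec = URLS_PER_MONTH / 30 / SECONDS_PER_DAY
reads_per_sec = writes_per_sec * READ_WRITE_RATIO

code_space = 62 ** 7                                  # 7-character Base62 codes
years_to_exhaust = code_space / (URLS_PER_MONTH * 12)

storage_bytes = URLS_PER_MONTH * 12 * 5 * BYTES_PER_RECORD   # five years

print(f"writes/sec: {writes_per_sec:.1f}")
print(f"reads/sec:  {reads_per_sec:.0f}")
print(f"code space: {code_space:,} codes, ~{years_to_exhaust:,.0f} years")
print(f"storage:    {storage_bytes / 1e12:.1f} TB before replication")
```

The table rounds the write rate up to 40 per second, which is where the 4,000 reads per second figure comes from.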
What breaks at each scale stage:
| Scale | What Breaks | The Fix |
| 0-1K users | Nothing | Single server, SQLite or MySQL, no Redis needed |
| 1K-10K users | Database read bottleneck | Add Redis cache-aside, add read replica |
| 10K-100K users | App server CPU ceiling | Load balancer + 2-3 app servers (Round-Robin) |
| 100K-500K users | Cache miss spikes overwhelming DB | CDN for redirects, TTL jitter, Redis cluster |
| 500K-1M users | Database write throughput ceiling | Sharding with consistent hashing, async analytics |
| 1M+ users | Single-region latency for global users | Multi-region, GeoDNS routing, regional Redis clusters |
Key Trade-Offs
| Decision | Option A | Option B | Choose Based On |
| Redirect type | 301 Permanent (browser caches; no return trips) | 302 Temporary (every click reaches your servers) | Need analytics? Use 302. Click data is the product for companies like Bitly. |
| Short code generation | Auto-increment + Base62 encode (predictable, zero collisions) | Hash (MD5 truncated; collision risk) | At scale, auto-increment + XOR obfuscation beats hash complexity. |
| Database choice | NoSQL (DynamoDB/Cassandra): horizontal sharding native | SQL (MySQL): simpler, vertical scaling ceiling | At 4,000 reads/sec, NoSQL with consistent hashing wins. |
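The auto-increment option from the table can be sketched as follows. The alphabet order and the `SECRET_MASK` constant are assumptions made up for this example. XOR with a fixed mask is a one-to-one mapping, so the zero-collision property is preserved, but it only hides the sequence from casual inspection and is not cryptographically secure.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 symbols
SECRET_MASK = 0x5DEECE66D   # hypothetical constant; keep yours private

def encode_base62(number: int) -> str:
    """Encode a non-negative integer using the 62-character alphabet."""
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number > 0:
        number, remainder = divmod(number, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))

def decode_base62(code: str) -> int:
    """Inverse of encode_base62."""
    number = 0
    for char in code:
        number = number * 62 + ALPHABET.index(char)
    return number

def short_code_for(row_id: int) -> str:
    """Obfuscate the sequential ID, then encode it."""
    return encode_base62(row_id ^ SECRET_MASK)

def row_id_for(short_code: str) -> int:
    """Recover the database ID from a short code."""
    return decode_base62(short_code) ^ SECRET_MASK

for row_id in (1, 2, 3):
    print(row_id, short_code_for(row_id))
```

Consecutive IDs still produce similar-looking codes with a single mask. If that matters, a stronger bijection (for example a small block cipher over the ID) is the usual next step.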
The Engineering Mindset
The best engineers are not the ones who know every tool. They are the ones who understand trade-offs deeply enough to make the right call for their specific system, their specific constraints, and their specific users.
For example, as a developer in Nigeria, your constraints are real: users are on expensive data plans, connectivity drops without warning, and servers are geographically far away. Every caching decision you make is an act of empathy for the person on a 3G connection in Kano trying to load your app. Build accordingly. The goal isn’t to over-engineer early; it’s to design systems that evolve as demand increases.
Scaling to one million users is less about powerful hardware and more about smart architecture. Three principles drive it:
•Distribute traffic using load balancing
•Reduce repeated computation through caching
•Move content closer to users using CDNs
When combined, these strategies dramatically reduce database load, lower infrastructure costs, and keep your application fast, from your first hundred users to your first million.
Resources
•Designing Data-Intensive Applications. Kleppmann (chapters 5-6)
•RFC 9111. HTTP Caching (current standard)
•ByteByteGo YouTube. Alex Xu; visual system design
•Instagram Engineering Blog. Redis memory optimization
•Scaling Memcache at Facebook. NSDI 2013 (Nishtala et al.)
•Redis Official Docs. Caching patterns & eviction reference
What’s Next?
Once caching and load balancing are in place and your system is serving one million users reliably from cache, the next frontier is real-time communication at scale. Technologies like WebSockets and Server-Sent Events introduce a fundamentally different set of constraints: persistent connections, event fan-out, and stateful session management.
The next post will cover WebSockets, HTTP polling, and Server-Sent Events. Follow or subscribe so you don’t miss it.