System Design Interview Preparation: The Complete Roadmap

System design interviews are the biggest differentiator between mid-level and senior engineering roles. They test whether you can think about systems holistically: scalability, reliability, trade-offs, and real-world constraints.

The problem is that most engineers study by memorizing specific system designs (URL shortener, chat app, etc.) without understanding the underlying patterns. When they get a question they haven't seen, they freeze.

This guide takes a different approach. It teaches you a repeatable framework and the core building blocks, so you can design any system on the spot.

The Framework: How to Structure Your Answer

Every system design interview should follow this structure. Internalize it.

Step 1: Clarify Requirements (3-5 minutes)

Before designing anything, ask questions. This shows maturity and prevents wasted effort.

Functional requirements:

What are the core features?
Who are the users?
What are the inputs/outputs?

Non-functional requirements:

What's the expected scale? (users, requests/sec, data volume)
What are the latency requirements?
Is availability or consistency more important?
What's the read/write ratio?

Example for "Design Twitter":

Functional:
- Post tweets (text, 280 chars)
- Follow/unfollow users
- View home timeline (tweets from followed users)
- Search tweets

Non-functional:
- 500M users, 200M DAU
- ~600 tweets/sec writes, ~600K reads/sec
- Timeline latency < 200ms
- Availability > consistency (eventual consistency OK)
- Read-heavy: ~1000:1 read/write ratio

Step 2: High-Level Design (5-10 minutes)

Draw the major components and how data flows between them:

Client → Load Balancer → API Gateway → Services
                                          │
                              ┌───────────┼───────────┐
                              ▼           ▼           ▼
                         Tweet Service  Timeline    User Service
                              │         Service         │
                              ▼           │             ▼
                         Tweet DB         ▼          User DB
                              │       Cache Layer
                              ▼     (Timeline Cache)
                         Message Queue
                              │
                              ▼
                      Fan-out Service

Step 3: Deep Dive (15-20 minutes)

Pick the most critical components and design them in detail. The interviewer will guide you, but be prepared to dive into:

Database schema and choice
API design
Scaling strategy
Caching approach
Failure handling

Step 4: Trade-offs and Bottlenecks (5 minutes)

Discuss what could break, what you'd monitor, and alternative approaches.

Core Building Blocks You Must Know

These are the Lego pieces of system design. Learn these deeply, and you can assemble any system.

1. Load Balancing

Distributes traffic across servers. Know the algorithms:

Algorithm	When to Use
Round Robin	Equal server capacity, stateless services
Weighted Round Robin	Mixed server capacities
Least Connections	Long-lived connections (WebSocket)
IP Hash	Session affinity needs
Consistent Hashing	Distributed caches, database sharding
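
As a rough illustration of two rows in the table above, here is a minimal sketch of Round Robin and Least Connections. The class names and the in-memory connection counter are assumptions made for this example; a real load balancer tracks connections at the network layer and also handles health checks.

```python
import itertools


class RoundRobinBalancer:
    """Cycles through servers in order; suits equal-capacity, stateless backends."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Picks the server with the fewest open connections (ties go to the first listed)."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when a connection closes so the count stays accurate.
        self.active[server] -= 1
```

Least Connections needs per-server state, which is why it pays off mainly when connections are long-lived and unevenly loaded.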

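Key point: know the difference between L4 (TCP) and L7 (HTTP) load balancing. L7 can route on request content (URL path, headers) but adds latency because it must parse each request. L4 is faster but only sees IP addresses and ports, so it cannot make content-aware decisions.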

2. Caching

Caching is in every system design answer. Know the patterns:

Cache-Aside (Lazy Loading):
1. App checks cache
2. Cache miss → read from DB
3. Write result to cache
4. Return to client

Write-Through:
1. App writes to cache
2. Cache writes to DB
3. Return to client

Write-Behind (Write-Back):
1. App writes to cache
2. Cache async writes to DB (batched)
3. Return to client immediately
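
The cache-aside and write-through steps above can be sketched in a few lines. This is an illustrative toy, not a production client: the `CacheAsideStore` name, the plain dicts standing in for the cache and the database, and the `db_reads` counter are all assumptions made for the example.

```python
class CacheAsideStore:
    def __init__(self, db):
        self.db = db        # hypothetical backing store (a plain dict here)
        self.cache = {}     # stand-in for Redis or Memcached
        self.db_reads = 0   # counts how often we fall through to the DB

    def get(self, key):
        # Cache-aside: check the cache, fall back to the DB on a miss,
        # then populate the cache before returning.
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def put_write_through(self, key, value):
        # Write-through: update the cache and the DB in the same
        # synchronous operation, so the next read is a cache hit.
        self.cache[key] = value
        self.db[key] = value
```

Write-behind would differ only in the last method: the DB write would be queued and flushed later in batches, which is where the data loss risk comes from.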

When to use what:

Cache-aside: Default choice. Works for read-heavy workloads.
Write-through: When you can't afford cache misses on recently written data.
Write-behind: High write throughput, OK with some data loss risk.

Cache invalidation strategies:

TTL (Time-To-Live): Simple, eventual consistency. Set TTL = acceptable staleness.
Event-based: Invalidate on write. More complex but fresher data.
Version tags: Include version in cache key. New version = automatic miss.
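
To make TTL and version-tag invalidation concrete, here is a small sketch. `TTLCache` and `versioned_key` are hypothetical names for this example, and the injectable `clock` parameter exists only so expiry can be demonstrated without waiting.

```python
import time


class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def set(self, key, value):
        self._entries[key] = (value, self.clock() + self.ttl)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            # Entry is older than the acceptable staleness: treat as a miss.
            del self._entries[key]
            return None
        return value


def versioned_key(name, version):
    # Bumping the version changes the key, so old entries are never read
    # again and simply age out through TTL or eviction.
    return f"{name}:v{version}"
```

For example, after a profile update you would read `versioned_key("profile:42", 2)` instead of version 1, which misses automatically without any explicit delete.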

3. Database Selection

Requirement	Database Type	Examples
Structured data, ACID	Relational	PostgreSQL, MySQL
Flexible schema, high write	Document	MongoDB, DynamoDB
Social graphs, relationships	Graph	Neo4j, Amazon Neptune
Time-series metrics	Time-series	InfluxDB, TimescaleDB
Full-text search	Search engine	Elasticsearch, OpenSearch
Session data, leaderboards	Key-Value	Redis, Memcached
Wide-column, massive scale	Column-family	Cassandra, HBase

4. Database Scaling Patterns

Vertical scaling — Bigger machine. Simple but has a ceiling.

Read replicas — Primary handles writes, replicas handle reads. Works for read-heavy workloads.

Sharding — Split data across multiple databases by a shard key.

Shard by user_id:
  user_id % 4 = 0 → Shard A
  user_id % 4 = 1 → Shard B
  user_id % 4 = 2 → Shard C
  user_id % 4 = 3 → Shard D

Problems with naive sharding:
- Hot shards (uneven distribution)
- Cross-shard queries are expensive
- Rebalancing when adding shards

Better: Consistent hashing with virtual nodes
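
The rebalancing problem, and why consistent hashing helps, is easier to see in code. The sketch below is a simplified ring under stated assumptions: SHA-256 as the hash function, 100 virtual nodes per shard, and shard names used as plain strings. None of these choices come from a specific database.

```python
import bisect
import hashlib


def stable_hash(value):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []    # sorted list of (hash, node) points
        self._hashes = []  # parallel list of hashes for bisect
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each physical node owns many points on the ring (virtual nodes),
        # which evens out the key distribution.
        for i in range(self.vnodes):
            self._ring.append((stable_hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    def node_for(self, key):
        # Walk clockwise to the first point after the key's hash,
        # wrapping around to the start of the ring.
        idx = bisect.bisect_right(self._hashes, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


def modulo_fraction_moved(keys, old_shards, new_shards):
    """Fraction of integer keys that change shard under key % N sharding."""
    moved = sum(1 for k in keys if k % old_shards != k % new_shards)
    return moved / len(keys)
```

With modulo sharding, going from 4 to 5 shards relocates 80% of sequential integer keys. On the ring, adding a fifth node moves only the keys that now belong to that node, which is roughly one fifth of them.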

5. Message Queues

Decouple producers from consumers. Essential for async processing.

Producer → Queue → Consumer

Use cases:
- Order processing (place order → queue → payment → queue → fulfillment)
- Notifications (event → queue → email/push/SMS services)
- Data pipelines (change event → queue → downstream processing)

Key concepts:
- At-least-once delivery (most common)
- Exactly-once semantics (harder; Kafka supports it through idempotent producers and transactions, mainly for Kafka-to-Kafka pipelines)
- Dead letter queues (failed messages go here)
- Message ordering (per-partition in Kafka)
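
Here is a minimal in-memory sketch of how at-least-once delivery, an idempotent consumer, and a dead letter queue fit together. The `consume` function, the message dict shape (`id`, `body`, `attempts`), and the use of `collections.deque` as the queue are assumptions for illustration; a real broker handles redelivery for you.

```python
from collections import deque


def consume(queue, handler, max_attempts=3):
    """Drain the queue; returns (processed_ids, dead_letters)."""
    processed_ids = set()  # dedup store that makes redelivery safe
    dead_letters = []
    while queue:
        message = queue.popleft()
        if message["id"] in processed_ids:
            continue  # duplicate delivery: already handled, skip it
        try:
            handler(message)
            processed_ids.add(message["id"])
        except Exception:
            message["attempts"] = message.get("attempts", 0) + 1
            if message["attempts"] >= max_attempts:
                dead_letters.append(message)  # park it for investigation
            else:
                queue.append(message)  # redeliver later
    return processed_ids, dead_letters
```

The dedup set is what turns at-least-once delivery into effectively-once processing: the broker may deliver a message twice, but the handler's side effect runs once.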

6. The CAP Theorem (Practical Version)

In a distributed system during a network partition, you must choose:

CP (Consistency + Partition tolerance): Every read gets the most recent write, but some requests may fail. (Banking, inventory)
AP (Availability + Partition tolerance): Every request gets a response, but it might be stale. (Social media feeds, DNS)

In practice, most systems pick AP for user-facing reads and CP for critical writes.

7. Rate Limiting

Protect services from abuse and cascading failures.

Algorithms:
1. Token Bucket — Allows bursts, smooth average rate
2. Sliding Window — Precise, more memory
3. Fixed Window — Simple, edge-case bursts at window boundaries
4. Leaky Bucket — Constant output rate, good for APIs
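
Token Bucket is the one interviewers most often ask you to sketch. The version below is a single-process illustration; the `TokenBucket` class, its parameter names, and the injectable `clock` are assumptions for this example. A distributed limiter would keep the counters in a shared store.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second (average rate)
        self.capacity = capacity  # bucket size (maximum burst)
        self.clock = clock
        self.tokens = float(capacity)
        self.last_refill = clock()

    def allow(self, cost=1):
        # Refill lazily based on elapsed time, capped at capacity.
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A bucket with `rate=1` and `capacity=3` lets a client burst three requests at once, then settles to one request per second.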

Where to implement:
- API Gateway (global rate limiting)
- Per-service (service-specific limits)
- Per-user/API-key (fairness)

Practice Problems with Solution Outlines

Problem 1: Design a URL Shortener

Requirements: 100M URLs/day, 1000:1 read/write, < 10ms redirect latency

Key decisions:

ID generation: Base62 encoding of auto-increment or snowflake ID. 7 chars = 3.5 trillion URLs.
Storage: Key-value store (Redis for hot URLs, DynamoDB for persistence)
Caching: Cache-aside with Redis. URL popularity is heavily skewed (roughly Zipf-like, often summarized as "the top 20% of URLs get 80% of the traffic"), so a small cache absorbs most reads.
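Read path: Cache → DB → redirect. A 301 lets browsers cache the redirect and reduces load; a 302 keeps every click visible to your analytics.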
Analytics: Async via Kafka → Analytics service
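
The ID generation step can be sketched as follows. The alphabet ordering (digits, lowercase, uppercase) and the function names are choices made for this example; any fixed 62-character alphabet works.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62


def encode_base62(n):
    """Turn a non-negative integer ID into a short code."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode_base62(code):
    """Recover the integer ID from a short code."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

Since 62^7 = 3,521,614,606,208, every ID below that value fits in 7 characters, which is where the 3.5 trillion figure comes from.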

Problem 2: Design a Notification System

Requirements: Multi-channel (push, email, SMS, in-app), 100M notifications/day, prioritization

Key decisions:

Architecture: Event-driven with priority queues
Queue design: Separate queues per channel, priority levels within each
Rate limiting: Per-user per-channel to prevent spam
Template engine: Pre-compiled templates with variable substitution
Delivery tracking: State machine (created → queued → sent → delivered → read)
Failure handling: Exponential backoff with max retries, DLQ for investigation
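
The failure handling decision can be sketched as a retry loop. The `deliver` function, its parameters, and the list used as a dead letter queue are hypothetical names for this example; `send` stands for whatever channel client (push, email, SMS) you would call.

```python
import time


def deliver(send, notification, dead_letter_queue,
            max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Try to send; on failure wait 1s, 2s, 4s, ... then give up to the DLQ."""
    for attempt in range(max_retries + 1):
        try:
            send(notification)
            return True
        except Exception:
            if attempt == max_retries:
                break  # retries exhausted
            # Exponential backoff: the delay doubles after each failure.
            sleep(base_delay * 2 ** attempt)
    dead_letter_queue.append(notification)
    return False
```

In practice you would usually add random jitter to each delay so that many failing senders do not retry in lockstep.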

Problem 3: Design a Distributed Cache

Requirements: Sub-millisecond latency, 1TB data, fault-tolerant

Key decisions:

Partitioning: Consistent hashing with virtual nodes
Replication: Each partition replicated to 3 nodes
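Consistency: Tunable quorums. With N replicas, choosing W + R > N makes every read quorum overlap the latest write quorum; a smaller R gives faster reads that are only eventually consistent.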
Eviction: LRU per node, with global TTL
Hot key handling: Local caching on client, key splitting
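
Two of these decisions are small enough to sketch directly: per-node LRU eviction and the quorum overlap condition. `LRUCache` and `quorums_overlap` are illustrative names, and the sketch uses `collections.OrderedDict` from the standard library in place of a hand-rolled linked list.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used


def quorums_overlap(n, w, r):
    """True when any read quorum must include a replica from the last write quorum."""
    return w + r > n
```

With N = 3 replicas, W = 2 and R = 2 overlap, while W = 1 and R = 1 do not, which is the trade between read freshness and latency.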

12-Week Study Plan

Week	Focus Area	Practice Problem
1-2	Scaling fundamentals, load balancing, caching	URL Shortener
3-4	Database design, SQL vs NoSQL, sharding	Instagram/Twitter
5-6	Message queues, async processing	Notification System
7-8	Real-time systems, WebSockets, pub/sub	Chat Application
9-10	Search systems, indexing, ranking	Search Engine
11-12	Distributed systems, consensus, replication	Distributed Cache

Daily Practice Routine

Morning (30 min): Review one building block concept in depth
Evening (60 min): Practice one design problem end-to-end
Weekend (2 hours): Mock interview with a peer or recording

Mistakes That Sink Interviews

Jumping to the solution — Always clarify requirements first. The interviewer is testing your process.
Not doing back-of-envelope math — "How many servers do we need?" You should be able to estimate.
Ignoring failure modes — "What happens when this component fails?" Always address this.
Over-engineering — Start simple, then add complexity as needed. Don't design for Google scale if the requirements say 10K users.
Not discussing trade-offs — There is no perfect design. Every choice has a cost. Articulate it.

Back-of-Envelope Calculations Cheat Sheet

Useful numbers:
- 1 day = ~100K seconds (86,400)
- 1 year = ~30M seconds
- QPS from daily users: DAU × avg requests per user per day / 86,400
- Storage: items × size × retention_period
- Bandwidth: QPS × avg_response_size

Example: Twitter timeline reads
- 200M DAU, each refreshes 10x/day
- QPS = 200M × 10 / 86400 ≈ 23K QPS
- Peak = 2-3× average ≈ 46K-69K QPS (plan for ~60K)
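
If you want to check your mental math while practicing, the formulas above translate directly into a few helper functions. The function names are made up for this sketch; the inputs in the usage note are the ones from the timeline example.

```python
SECONDS_PER_DAY = 86_400


def average_qps(dau, requests_per_user_per_day):
    """QPS = DAU x requests per user per day / seconds per day."""
    return dau * requests_per_user_per_day / SECONDS_PER_DAY


def peak_qps(avg_qps, peak_factor):
    """Peak traffic as a multiple of the average."""
    return avg_qps * peak_factor


def storage_bytes(items, bytes_per_item, retention_periods):
    """Storage = items per period x size x number of periods retained."""
    return items * bytes_per_item * retention_periods


def bandwidth_bytes_per_sec(qps, avg_response_bytes):
    """Bandwidth = QPS x average response size."""
    return qps * avg_response_bytes
```

`average_qps(200_000_000, 10)` comes out to about 23,148, and applying a peak factor of 2 to 3 gives roughly 46K to 69K QPS.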

Summary

System design interviews test three things:

Can you break down ambiguous problems? (Requirements gathering)
Do you know the building blocks? (Technical knowledge)
Can you make and defend trade-offs? (Engineering judgment)

Master the framework, deeply understand 6-8 building blocks, and practice 10-15 problems. That's the formula.

Accelerate Your Interview Prep

Studying system design from scattered blog posts is inefficient. The System Design Cheat Sheets from Interview Prep Pro give you 50+ architecture diagrams covering real-world systems, with the exact patterns interviewers look for.

The full Interview Prep Pro collection includes 11 products: system design guides, behavioral question banks, coding patterns, resume templates, salary negotiation playbooks, and a 90-day study tracker.

Use code LAUNCH40 for 40% off, or STUDENT for 50% off (student email required).

Browse the Interview Prep Pro store
