# The Complete Guide to System Design: From Fundamentals to Mastery
> *"A system that is not designed will not work. A system that is designed will barely work."* β Unknown Engineer
---
## π Table of Contents
1. [Introduction to System Design](#introduction)
2. [Scalability](#scalability)
3. [Reliability & Availability](#reliability)
4. [Networking Fundamentals](#networking)
5. [Load Balancing](#load-balancing)
6. [Caching](#caching)
7. [Databases](#databases)
8. [Distributed Systems](#distributed-systems)
9. [Microservices Architecture](#microservices)
10. [Message Queues & Event-Driven Architecture](#message-queues)
11. [API Design](#api-design)
12. [Storage Systems](#storage-systems)
13. [Security in System Design](#security)
14. [Monitoring & Observability](#monitoring)
15. [Case Studies](#case-studies)
---
## 1. Introduction to System Design {#introduction}
System design is the process of **defining the architecture, components, modules, interfaces, and data** for a system to satisfy specified requirements. It is one of the most critical phases in software engineering, bridging the gap between requirements and implementation.
Whether you are designing a **small web application** or a **planet-scale distributed system** serving billions of users, understanding the foundational principles of system design is absolutely essential.
### 1.1 Why System Design Matters
System design matters for multiple reasons:
- ποΈ **Scalability** β Can your system handle 10x, 100x, or 1000x more users?
- π **Reliability** β Does your system work correctly even when parts fail?
- β‘ **Performance** β Is your system fast enough for real-world usage?
- π° **Cost Efficiency** β Is your system economical to build and operate?
- π§ **Maintainability** β Can your team evolve and debug the system easily?
### 1.2 The Two Pillars: Functional vs Non-Functional Requirements
Before diving into design, every engineer must distinguish between two types of requirements:
#### Functional Requirements
These define ***what*** the system should do:
- Users can upload photos
- Users can follow other users
- The system should send email notifications
- The system must process payments
#### Non-Functional Requirements
These define ***how well*** the system does it:
| Requirement | Description | Example |
|---|---|---|
| **Latency** | Response time | < 200ms for API calls |
| **Throughput** | Requests per second | 100,000 RPS |
| **Availability** | Uptime percentage | 99.99% (52 min downtime/year) |
| **Consistency** | Data correctness | All users see same data |
| **Durability** | Data persistence | No data loss after failure |
| **Scalability** | Growth capacity | 10M to 1B users |
### 1.3 The STAR Framework for System Design Interviews
When approaching any system design problem, use the **STAR framework**:
```
S β Scope the problem (clarify requirements)
T β Think about capacity estimation
A β Architect the high-level design
R β Refine with deep dives
```
### 1.4 Back-of-the-Envelope Calculations
One of the most underrated skills in system design is **estimation**. Engineers must be comfortable doing rough math to validate design decisions.
#### Common Numbers Every Engineer Should Know
```
Latency Numbers (Approximate):
ββββββββββββββββββββββββββββββββ
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lock/unlock 25 ns
Main memory reference 100 ns
Compress 1K bytes with 3,000 ns (3 ΞΌs)
Snappy
Send 1K bytes over 1 Gbps 10,000 ns (10 ΞΌs)
network
Read 4K randomly from SSD 150,000 ns (150 ΞΌs)
Read 1 MB sequentially 250,000 ns (250 ΞΌs)
from memory
Round trip within same DC 500,000 ns (0.5 ms)
Read 1 MB sequentially 1,000,000 ns (1 ms)
from SSD
Disk seek 10,000,000 ns (10 ms)
Read 1 MB sequentially 10,000,000 ns (10 ms)
from disk
Send packet CAβNetherlands 150,000,000 ns (150 ms)
βCA
```
#### Data Volume Calculations
> **Pro Tip:** When estimating storage, always think in terms of: *users Γ data per user Γ time period*
**Example:** Estimating storage for a Twitter-like service
- 500 million daily active users
- Each user sends ~2 tweets/day
- Each tweet = 300 bytes (text) + 200 KB (media, 30% of tweets)
- **Daily storage = 500M Γ 2 Γ 300B + 500M Γ 0.3 Γ 200KB**
- **= ~30 TB/day just for media**
---
## 2. Scalability {#scalability}
Scalability is the ability of a system to handle **increased load gracefully**. It is arguably the most discussed topic in system design, and for good reason β it directly impacts user experience and business viability.
### 2.1 Vertical Scaling (Scale Up)
**Vertical scaling** means adding more power to an existing machine β more CPU, more RAM, faster disks.
```
Before Scaling: After Vertical Scaling:
ββββββββββββββββ ββββββββββββββββββββββββ
β Server β β Upgraded Server β
β CPU: 4 core β βββΊ β CPU: 64 core β
β RAM: 16 GB β β RAM: 512 GB β
β SSD: 500 GB β β SSD: 10 TB β
ββββββββββββββββ ββββββββββββββββββββββββ
```
**Advantages:**
- β Simple β no application changes needed
- β No distributed system complexity
- β Stronger consistency (single machine)
**Disadvantages:**
- β Has a hard limit (you can't infinitely upgrade hardware)
- β Single point of failure
- β Expensive at high end
- β Downtime during upgrades
### 2.2 Horizontal Scaling (Scale Out)
**Horizontal scaling** means adding more machines to the pool of resources.
```
Before Scaling: After Horizontal Scaling:
ββββββββββββ
ββββββββββββ β Server 1 β
β Server 1 β βββΊ β Server 2 β β Load Balancer distributes
ββββββββββββ β Server 3 β
β Server N β
ββββββββββββ
```
**Advantages:**
- β Theoretically unlimited scaling
- β No single point of failure (high availability)
- β Cost-effective with commodity hardware
- β Zero-downtime upgrades (rolling deploys)
**Disadvantages:**
- β Application must be stateless (or use distributed state)
- β Increased complexity (distributed systems problems)
- β Network latency between nodes
### 2.3 The Scalability Bottlenecks
Scaling isn't just about servers. Every layer of your stack can become a bottleneck:
#### The Scalability Stack
```
βββββββββββββββββββββββββββββββββββββββ
β DNS Resolution β β Often ignored
βββββββββββββββββββββββββββββββββββββββ€
β Load Balancer β β Can become bottleneck
βββββββββββββββββββββββββββββββββββββββ€
β Web/API Servers β β Usually easiest to scale
βββββββββββββββββββββββββββββββββββββββ€
β Cache Layer β β Critical for performance
βββββββββββββββββββββββββββββββββββββββ€
β Application Servers β β Business logic
βββββββββββββββββββββββββββββββββββββββ€
β Databases β β Hardest to scale
βββββββββββββββββββββββββββββββββββββββ€
β Storage Systems β β File/blob storage
βββββββββββββββββββββββββββββββββββββββ
```
### 2.4 Stateless vs Stateful Architecture
One of the most important design decisions when scaling horizontally is managing **state**.
#### Stateless Architecture β (Preferred for scaling)
In a stateless system, **each request contains all the information needed to process it**. The server does not store any session information.
```
Request βββΊ Server A βββΊ Response
Request βββΊ Server B βββΊ Response (same result!)
Request βββΊ Server C βββΊ Response (same result!)
```
*State is stored externally in:*
- Databases
- Redis (shared session store)
- JWTs (client-side tokens)
#### Stateful Architecture β οΈ (Problematic for scaling)
In a stateful system, the server remembers previous requests. This makes load balancing difficult.
```
Request 1 βββΊ Server A (stores session)
Request 2 βββΊ Server B (no session!) βββΊ ERROR β
```
*Solutions:*
- **Sticky sessions** (route same user to same server β but limits scaling)
- **Centralized session store** (Redis, Memcached)
### 2.5 The CAP Theorem
One of the most fundamental theorems in distributed systems, stated by Eric Brewer in 2000:
> ***"In a distributed system, you can only guarantee two of the following three properties simultaneously: Consistency, Availability, and Partition Tolerance."***
```
C (Consistency)
/ \
/ \
/ \
/ CA \
/ RDBMS \
/ββββββββββΊ\
/ \
CP AP
MongoDB Cassandra
HBase CouchDB
Redis DynamoDB
\ /
\ /
\ /
A βββββββ P
(Availability) (Partition
Tolerance)
```
| System Type | Guarantee | Examples | Trade-off |
|---|---|---|---|
| **CP** | Consistency + Partition Tolerance | MongoDB, HBase, Zookeeper | May be unavailable during partition |
| **AP** | Availability + Partition Tolerance | Cassandra, DynamoDB, CouchDB | May return stale data |
| **CA** | Consistency + Availability | Traditional RDBMS (single node) | Cannot handle network partitions |
> **β οΈ Important:** In practice, *Partition Tolerance is non-negotiable* in distributed systems (networks fail). So the real choice is between **CP** and **AP**.
### 2.6 PACELC Theorem (Extension of CAP)
PACELC extends CAP by acknowledging that even **without** partitions, there's a trade-off between **latency** and **consistency**.
```
If (Partition):
Choose between: Availability β or β Consistency
Else (Normal operation):
Choose between: Latency β or β Consistency
```
---
## 3. Reliability & Availability {#reliability}
### 3.1 Defining Reliability
**Reliability** is the probability that a system will perform its required function **without failure** over a specified period under given conditions.
A reliable system is one that:
- Continues to work correctly even when things go wrong
- Handles hardware faults, software faults, and human errors
- Degrades gracefully rather than failing catastrophically
### 3.2 The Nines of Availability
Availability is typically expressed as a percentage of uptime:
| Availability | Downtime per Year | Downtime per Month | Downtime per Week |
|---|---|---|---|
| **90%** (1 nine) | 36.5 days | 72 hours | 16.8 hours |
| **99%** (2 nines) | 3.65 days | 7.2 hours | 1.68 hours |
| **99.9%** (3 nines) | 8.76 hours | 43.8 minutes | 10.1 minutes |
| **99.99%** (4 nines) | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| **99.999%** (5 nines) | 5.26 minutes | 26.3 seconds | 6.05 seconds |
| **99.9999%** (6 nines) | 31.5 seconds | 2.63 seconds | 0.60 seconds |
> **π‘ Industry Standard:** Most production systems aim for **99.9% to 99.99%**. Five nines (99.999%) is the gold standard for critical infrastructure like telephone networks.
### 3.3 Fault Tolerance Patterns
#### Circuit Breaker Pattern π
The Circuit Breaker prevents cascading failures by detecting when a service is down and *failing fast* instead of waiting for timeouts.
```
States:
βββββββββββββ failures exceed βββββββββββ
β CLOSED β ββββββββββββββΊ β OPEN β
β (normal) β β (fast β
βββββββββββββ β fail) β
β² βββββββββββ
β β
β success β timeout
β βΌ
β ββββββββββββββ
ββββββββββββββββββββββββββ HALF-OPEN β
β (test req) β
ββββββββββββββ
```
#### Retry with Exponential Backoff
```
Attempt 1: wait 1 second
Attempt 2: wait 2 seconds
Attempt 3: wait 4 seconds
Attempt 4: wait 8 seconds
Attempt 5: wait 16 seconds + give up
```
> **Always add jitter (randomness) to backoff to prevent thundering herd!**
#### Bulkhead Pattern π’
Inspired by ship design β isolate components so failures in one don't sink the whole ship.
```
Without Bulkhead: With Bulkhead:
ββββββββββββββββββ ββββββββ ββββββββ ββββββββ
β All services β β Svc Aβ β Svc Bβ β Svc Cβ
β share one β β Pool β β Pool β β Pool β
β thread pool β β 10 β β 10 β β 10 β
β β βthreads threads threadsβ
β Svc A fails β β ββββββββ ββββββββ ββββββββ
β ALL fail β β A fails β only A affected β
ββββββββββββββββββ
```
### 3.4 Redundancy Strategies
| Strategy | Description | Use Case |
|---|---|---|
| **Active-Active** | All nodes handle traffic simultaneously | Load distribution + HA |
| **Active-Passive** | One node handles traffic; backup on standby | Simpler failover |
| **N+1 Redundancy** | N required nodes + 1 spare | Cost-effective HA |
| **Geographic Redundancy** | Multiple data centers in different regions | Disaster recovery |
### 3.5 Mean Time Between Failures (MTBF) and Mean Time to Recovery (MTTR)
```
Availability = MTBF / (MTBF + MTTR)
MTBF = Mean Time Between Failures
MTTR = Mean Time to Recovery
Example:
MTBF = 720 hours (fails once a month)
MTTR = 1 hour (takes 1 hour to recover)
Availability = 720 / (720 + 1) = 99.86%
```
---
## 4. Networking Fundamentals {#networking}
### 4.1 The OSI Model
Understanding the OSI (Open Systems Interconnection) model is essential for any system designer:
```
ββββββββββββββββββββββββββββββββββββββββββββ
β Layer 7 β Application (HTTP, SMTP, FTP)β
ββββββββββββββββββββββββββββββββββββββββββββ€
β Layer 6 β Presentation (SSL/TLS, JPEG) β
ββββββββββββββββββββββββββββββββββββββββββββ€
β Layer 5 β Session (NetBIOS, PPTP) β
ββββββββββββββββββββββββββββββββββββββββββββ€
β Layer 4 β Transport (TCP, UDP) β
ββββββββββββββββββββββββββββββββββββββββββββ€
β Layer 3 β Network (IP, ICMP) β
ββββββββββββββββββββββββββββββββββββββββββββ€
β Layer 2 β Data Link (Ethernet, MAC) β
ββββββββββββββββββββββββββββββββββββββββββββ€
β Layer 1 β Physical (Cables, WiFi) β
ββββββββββββββββββββββββββββββββββββββββββββ
```
### 4.2 TCP vs UDP
| Feature | TCP | UDP |
|---|---|---|
| **Connection** | Connection-oriented (3-way handshake) | Connectionless |
| **Reliability** | Guaranteed delivery | No guarantee |
| **Ordering** | In-order delivery | No ordering |
| **Speed** | Slower (overhead) | Faster |
| **Use Cases** | HTTP, email, file transfer | DNS, video streaming, gaming |
| **Flow Control** | Yes | No |
| **Error Checking** | Yes (with correction) | Yes (detection only) |
### 4.3 DNS β The Internet's Phone Book
DNS (Domain Name System) translates **human-readable domain names** into **IP addresses**.
```
User types: www.example.com
β
βΌ
βββββββββββββββββ
β Browser Cacheβ βββ found? serve it β
βββββββββ¬ββββββββ
β not found
βΌ
βββββββββββββββββ
β OS DNS Cache β βββ found? serve it β
βββββββββ¬ββββββββ
β not found
βΌ
βββββββββββββββββ
β ISP Resolver β βββ found? serve it β
βββββββββ¬ββββββββ
β not found
βΌ
βββββββββββββββββ
β Root Nameserverβ β directs to .com TLD
βββββββββ¬ββββββββ
βΌ
βββββββββββββββββ
β TLD Nameserverβ β directs to example.com NS
βββββββββ¬ββββββββ
βΌ
βββββββββββββββββββββββββ
β Authoritative NS β β returns 93.184.216.34
βββββββββββββββββββββββββ
```
#### DNS Record Types
| Record | Purpose | Example |
|---|---|---|
| **A** | IPv4 address | example.com β 93.184.216.34 |
| **AAAA** | IPv6 address | example.com β 2606:2800::1 |
| **CNAME** | Canonical name (alias) | www β example.com |
| **MX** | Mail exchange | mail.example.com |
| **TXT** | Text record (SPF, DKIM) | v=spf1 include:... |
| **NS** | Nameserver | ns1.example.com |
| **SOA** | Start of Authority | Zone metadata |
### 4.4 HTTP/HTTPS and HTTP/2 vs HTTP/3
#### HTTP/1.1 Problems
- One request per connection (or pipelining, which is flawed)
- Head-of-line blocking
- Plain text headers (no compression)
#### HTTP/2 Improvements β
- **Multiplexing** β multiple requests over single connection
- **Header compression** (HPACK)
- **Server push** β server can proactively send resources
- **Binary protocol** β more efficient than text
#### HTTP/3 (QUIC) Improvements β
- Built on **UDP** instead of TCP
- **0-RTT connection establishment**
- **Better mobile performance** (handles IP changes)
- **Eliminates head-of-line blocking** at the transport layer
### 4.5 WebSockets vs Long Polling vs Server-Sent Events
When building **real-time features**, engineers must choose the right communication protocol:
```
Long Polling:
Client βββΊ Server (request)
Server waits...
Server βββΊ Client (response when data ready)
Client βββΊ Server (immediately new request)
[High latency, many connections]
Server-Sent Events (SSE):
Client βββΊ Server (one-time request)
Server βββΊ Client (stream of events)
[One-way, simple, HTTP-based]
WebSockets:
Client ββββΊ Server (persistent bidirectional connection)
[Low latency, bidirectional, complex]
```
| Feature | Long Polling | SSE | WebSocket |
|---|---|---|---|
| **Direction** | Bidirectional | Server β Client only | Bidirectional |
| **Protocol** | HTTP | HTTP | WS/WSS |
| **Latency** | Medium | Low | Very Low |
| **Complexity** | Low | Low | Medium |
| **Use Case** | Notifications | Live feeds | Chat, gaming |
---
## 5. Load Balancing {#load-balancing}
A **load balancer** distributes incoming network traffic across multiple servers to ensure no single server is overwhelmed.
### 5.1 Load Balancing Algorithms
#### Round Robin
```
Request 1 βββΊ Server A
Request 2 βββΊ Server B
Request 3 βββΊ Server C
Request 4 βββΊ Server A (cycle repeats)
```
*Simple and equal distribution. Good when all servers have equal capacity.*
#### Weighted Round Robin
```
Server A (weight: 3) βββΊ gets 3 out of every 5 requests
Server B (weight: 2) βββΊ gets 2 out of every 5 requests
```
*Good when servers have different capacities.*
#### Least Connections
```
Server A: 100 connections
Server B: 45 connections β Next request goes here
Server C: 78 connections
```
*Best for long-lived connections (WebSockets, databases).*
#### IP Hash / Sticky Sessions
```
User IP 192.168.1.1 βββΊ always routed to Server A
User IP 10.0.0.5 βββΊ always routed to Server B
```
*Good when session state is stored on servers (though stateless is preferred).*
#### Least Response Time
```
Server A: avg 120ms
Server B: avg 45ms β Next request goes here
Server C: avg 89ms
```
*Optimal performance but requires active health monitoring.*
#### Random
*Simple, works well at large scale due to the law of large numbers.*
### 5.2 Layer 4 vs Layer 7 Load Balancing
#### Layer 4 (Transport Layer)
- Operates on **TCP/UDP** packets
- Routes based on **IP address + port**
- **Very fast** β no packet inspection
- Cannot make content-aware decisions
- *Example tools: HAProxy (TCP mode), AWS NLB*
#### Layer 7 (Application Layer)
- Operates on **HTTP/HTTPS** content
- Can route based on **URL path, headers, cookies, content type**
- More powerful but slightly higher overhead
- Enables **A/B testing, canary deployments, content-based routing**
- *Example tools: Nginx, HAProxy (HTTP mode), AWS ALB*
```
L7 Load Balancer routing:
/api/* βββΊ API Server Farm
/static/* βββΊ CDN / Static Server Farm
/ws/* βββΊ WebSocket Server Farm
/admin/* βββΊ Admin Server (with auth)
```
### 5.3 Health Checks
Load balancers continuously monitor server health:
```
Active Health Check:
Load Balancer βββΊ GET /health βββΊ Server
βββ 200 OK βββ
If 200: server stays in pool β
If timeout/5xx: server removed from pool β
```
### 5.4 DNS Load Balancing
DNS can also be used for basic load balancing by returning multiple A records:
```
example.com β [93.184.216.34, 198.51.100.1, 203.0.113.2]
Client selects one (usually first)
TTL controls how long it's cached
```
*Limitations: No health checking, TTL delays failover*
### 5.5 Global Server Load Balancing (GSLB)
For **multi-region** systems, GSLB routes users to the geographically closest healthy datacenter:
```
User in India βββΊ Mumbai DC
User in USA βββΊ Virginia DC
User in Europe βββΊ Frankfurt DC
User in Australia βββΊ Singapore DC
```
---
## 6. Caching {#caching}
> ***"There are only two hard things in Computer Science: cache invalidation and naming things."*** β Phil Karlton
Caching is storing copies of data in a faster storage layer to speed up future requests. It is one of the **most impactful optimizations** in system design.
### 6.1 Cache Hierarchy
```
Registers (< 1 ns)
β
L1 Cache (~0.5 ns, 32-64 KB)
β
L2 Cache (~7 ns, 256 KB - 1 MB)
β
L3 Cache (~30 ns, 4-32 MB)
β
RAM (~100 ns, GBs)
β
SSD (~150 ΞΌs, TBs)
β
HDD (~10 ms, TBs)
β
Network/Remote Cache (~0.5 ms)
β
Database (~10-100 ms)
```
### 6.2 Caching Strategies
#### Cache-Aside (Lazy Loading) β Most Common
```
Read flow:
App βββΊ Cache βββΊ Miss? βββΊ Database βββΊ Store in Cache βββΊ Return to App
Write flow:
App βββΊ Write to Database βββΊ Invalidate Cache
```
**Pros:** Only caches what's actually needed
**Cons:** Cache miss causes 3 network trips; possible stale data
#### Write-Through
```
Write: App βββΊ Cache βββΊ Database (synchronously)
Read: App βββΊ Cache (always fresh)
```
**Pros:** Cache always up-to-date
**Cons:** Write latency higher; may cache data that's never read
#### Write-Behind (Write-Back)
```
Write: App βββΊ Cache (immediately returns)
Cache βββΊ Database (asynchronously later)
```
**Pros:** Very fast writes
**Cons:** Risk of data loss if cache crashes before writing to DB
#### Read-Through
```
Read: App βββΊ Cache βββΊ Cache fetches from DB on miss (not the app)
```
**Pros:** Simplified application code
**Cons:** Cache miss still slow; cold start problem
### 6.3 Cache Eviction Policies
When the cache is full, **which data gets removed?**
| Policy | Full Name | How It Works | Best For |
|---|---|---|---|
| **LRU** | Least Recently Used | Remove least recently accessed item | General purpose |
| **LFU** | Least Frequently Used | Remove least frequently accessed | Long-lived caches |
| **MRU** | Most Recently Used | Remove most recently accessed | Specific access patterns |
| **FIFO** | First In First Out | Remove oldest item | Simple queues |
| **Random** | Random Replacement | Remove random item | Simple, low overhead |
| **TTL** | Time To Live | Remove expired items | All caches (combined) |
### 6.4 Cache Invalidation
This is the hard part. Three main approaches:
1. **TTL (Time To Live)** β Cache expires after N seconds automatically
```
SET key value EX 3600 # expires in 1 hour
```
2. **Event-based invalidation** β Invalidate when data changes
```
User updates profile β DELETE cache:user:123
```
3. **Cache versioning** β Include version in cache key
```
cache:user:123:v5 # bump version on update
```
### 6.5 Cache Problems and Solutions
#### Cache Stampede (Thundering Herd) π
**Problem:** Cache expires β thousands of requests hit database simultaneously
**Solutions:**
- **Mutex/Lock** β Only one request refreshes cache; others wait
- **Probabilistic Early Expiration** β Randomly refresh before expiry
- **Stale-While-Revalidate** β Serve stale data while refreshing in background
#### Cache Penetration π³οΈ
**Problem:** Requests for **non-existent data** always bypass cache and hit database
```
Malicious user queries: user_id=-1, user_id=-2, user_id=-3...
Each misses cache and hammers database
```
**Solutions:**
- **Cache null results** β Store "NOT FOUND" in cache with short TTL
- **Bloom Filter** β Probabilistic data structure to check existence without DB hit
#### Cache Avalanche βοΈ
**Problem:** Many cache keys expire at the **same time** β massive DB load
**Solutions:**
- Add **random jitter** to TTL values: `TTL = base_ttl + random(0, base_ttl * 0.1)`
- Use **different TTLs** for different data types
- **Warm up cache** gradually before traffic shift
### 6.6 Distributed Caching Systems
#### Redis
```
Features:
- In-memory key-value store
- Rich data structures: String, Hash, List, Set, Sorted Set, Stream
- Persistence: RDB snapshots + AOF logs
- Pub/Sub messaging
- Cluster mode (horizontal scaling)
- Sentinel (high availability)
Use Cases:
- Session management
- Rate limiting
- Leaderboards (Sorted Sets)
- Pub/Sub
- Distributed locks
```
#### Memcached
```
Features:
- Simple key-value store
- Multi-threaded (better raw throughput than Redis)
- No persistence
- No replication (simpler)
Use Cases:
- Simple object caching
- When you need maximum throughput
- When data loss is acceptable
```
### 6.7 CDN (Content Delivery Network)
A CDN is a **geographically distributed** cache for static content:
```
Without CDN:
User in Tokyo βββββββββββββββββββΊ Origin Server in Virginia
150ms RTT
With CDN:
User in Tokyo βββΊ CDN Edge in Tokyo βββΊ (cache hit!)
5ms RTT
(cache miss) βββΊ Origin Server in Virginia
(refills CDN edge)
```
**What CDNs cache:**
- Static assets (images, CSS, JS)
- Videos (HLS/DASH chunks)
- API responses (with cache headers)
- HTML pages (for static sites)
**Popular CDNs:** Cloudflare, AWS CloudFront, Fastly, Akamai
---
## 7. Databases {#databases}
Databases are the **backbone of most systems**. Choosing the right database is one of the most consequential system design decisions.
### 7.1 Relational Databases (SQL)
Relational databases store data in **structured tables** with **predefined schemas** and use **SQL** for querying.
```
Users Table:
ββββββ¬βββββββββββ¬βββββββββββββββββββββ¬βββββββββββββββββββββββββ
β id β username β email β created_at β
ββββββΌβββββββββββΌβββββββββββββββββββββΌβββββββββββββββββββββββββ€
β 1 β alice β alice@example.com β 2024-01-15 10:30:00 β
β 2 β bob β bob@example.com β 2024-01-16 14:22:00 β
β 3 β charlie β charlie@example.comβ 2024-01-17 09:15:00 β
ββββββ΄βββββββββββ΄βββββββββββββββββββββ΄βββββββββββββββββββββββββ
```
#### ACID Properties
The gold standard for database transactions:
| Property | Definition | Example |
|---|---|---|
| **A**tomicity | All-or-nothing transactions | Bank transfer: debit + credit both succeed or both fail |
| **C**onsistency | Data always in valid state | Cannot have negative balance |
| **I**solation | Concurrent transactions don't interfere | Two simultaneous transfers don't corrupt data |
| **D**urability | Committed data survives failures | After power outage, committed transaction persists |
#### Transaction Isolation Levels
```
READ UNCOMMITTED β weakest (dirty reads possible)
READ COMMITTED
REPEATABLE READ
SERIALIZABLE β strongest (most consistent but slowest)
```
**Common concurrency problems:**
- **Dirty Read** β Reading uncommitted data from another transaction
- **Non-Repeatable Read** β Same query returns different results in same transaction
- **Phantom Read** β New rows appear between reads in same transaction
### 7.2 NoSQL Databases
NoSQL databases trade some ACID guarantees for **flexibility, scalability, and performance**.
#### Key-Value Stores
```
Key: "user:123:session"
Value: {"token": "abc123", "expires": "2024-12-31"}
Examples: Redis, DynamoDB, Memcached
Best for: Sessions, caching, leaderboards
```
#### Document Databases
```json
{
"_id": "user_123",
"name": "Alice Johnson",
"email": "alice@example.com",
"addresses": [
{"type": "home", "city": "New York"},
{"type": "work", "city": "San Francisco"}
],
"preferences": {
"theme": "dark",
"notifications": true
}
}
```
*Examples: MongoDB, CouchDB, Firestore*
*Best for: User profiles, product catalogs, content management*
#### Column-Family Stores (Wide-Column)
```
Row Key: "user_123"
Columns:
profile: {name: "Alice", email: "alice@example.com"}
activity: {last_login: "2024-01-15", login_count: "142"}
settings: {theme: "dark", lang: "en"}
Examples: Apache Cassandra, HBase, ScyllaDB
Best for: Time-series data, IoT, write-heavy workloads
```
#### Graph Databases
```
(Alice) ββ[FOLLOWS]βββΊ (Bob)
(Alice) ββ[LIKES]βββββΊ (Post:123)
(Bob) ββ[CREATED]βββΊ (Post:123)
(Bob) ββ[FOLLOWS]βββΊ (Charlie)
Examples: Neo4j, Amazon Neptune, ArangoDB
Best for: Social networks, fraud detection, recommendation engines
```
#### Time-Series Databases
```
timestamp | sensor_id | temperature | humidity
--------------------|-----------|-------------|----------
2024-01-15 10:00:00 | sensor_1 | 23.5 | 65.2
2024-01-15 10:00:01 | sensor_1 | 23.6 | 65.1
2024-01-15 10:00:02 | sensor_1 | 23.4 | 65.3
Examples: InfluxDB, TimescaleDB, Prometheus
Best for: Metrics, monitoring, IoT, financial data
```
### 7.3 Database Indexing
An index is a **data structure** that improves the speed of data retrieval operations.
```
Without Index:
SELECT * FROM users WHERE email = 'alice@example.com'
β Full table scan: check ALL 10 million rows = O(n) = SLOW β
With Index on email:
β B-tree lookup: O(log n) = FAST β
```
#### Types of Indexes
**B-Tree Index** β Most common, good for range queries and equality
```sql
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_orders_date ON orders(created_at);
```
**Hash Index** β Only for equality queries, very fast
```
hash("alice@example.com") β bucket 4829 β row pointer
```
**Composite Index** β Multiple columns
```sql
CREATE INDEX idx_orders ON orders(user_id, status, created_at);
-- Efficient for: WHERE user_id = 1 AND status = 'pending'
-- Uses index prefix rule
```
**Covering Index** β Index contains all columns needed by query
```sql
-- Query:
SELECT id, email FROM users WHERE username = 'alice';
-- Covering index:
CREATE INDEX idx_covering ON users(username, id, email);
-- Never touches actual table rows!
```
**Full-Text Index** β For text search
```sql
CREATE FULLTEXT INDEX idx_content ON posts(title, body);
SELECT * FROM posts WHERE MATCH(title, body) AGAINST('system design');
```
#### The Index Trade-off
> βοΈ **More indexes = faster reads, slower writes, more storage**
Every INSERT/UPDATE/DELETE must update all indexes on the table.
### 7.4 Database Replication
Replication copies data to **multiple servers** for availability and read scaling.
#### Primary-Replica (Master-Slave) Replication
```
Writes
β
βΌ
ββββββββββββββ
β Primary ββββ¬βββΊ Replica 1 (async)
β (Master) β ββββΊ Replica 2 (async)
ββββββββββββββ ββββΊ Replica 3 (async)
Reads βββΊ Any Replica
Writes βββΊ Primary Only
```
**Synchronous replication:** Primary waits for replica to confirm write
*Pros:* No data loss | *Cons:* Higher write latency
**Asynchronous replication:** Primary doesn't wait
*Pros:* Lower write latency | *Cons:* Potential data loss (replication lag)
#### Multi-Primary (Multi-Master) Replication
```
ββββββββββββ sync/async ββββββββββββ
β Primary Aβ ββββββββββββββΊ β Primary Bβ
ββββββββββββ ββββββββββββ
Writes + Reads Writes + Reads
```
*Allows writes to multiple nodes β more complex conflict resolution needed.*
### 7.5 Database Sharding (Horizontal Partitioning)
Sharding splits data across multiple database instances, each holding a **subset of the data**.
```
Without Sharding: With Sharding:
ββββββββββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ
β Database β β Shard 1 β β Shard 2 β β Shard 3 β
β users: 1-100M β βββΊ β users: β β users: β β users: β
β orders: 1-500M β β 1-33M β β 34-66M β β 67-100M β
ββββββββββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ
```
#### Sharding Strategies
**Range-based sharding:**
```
user_id 1-1,000,000 β Shard 1
user_id 1,000,001-2M β Shard 2
user_id 2,000,001-3M β Shard 3
```
*Problem: Hot spots if users are not uniformly distributed*
**Hash-based sharding:**
```
shard = hash(user_id) % num_shards
hash("user_123") % 3 = Shard 2
hash("user_456") % 3 = Shard 0
```
*Problem: Resharding when adding shards (consistent hashing solves this)*
**Consistent Hashing:**
```
Hash Ring (0 to 2^32):
0
β
βββΌββ Node A (at position 100)
β
βββΌββ Node B (at position 200)
β
βββΌββ Node C (at position 300)
β
360Β°
Key maps to nearest node clockwise.
Adding node only rehashes ~1/N keys!
```
#### Sharding Challenges
- β **Cross-shard queries** β JOINs across shards are expensive
- β **Non-uniform distribution** β Hot shards
- β **Rebalancing** β Moving data when adding/removing shards
- β **Global transactions** β ACID across shards is very complex
### 7.6 SQL vs NoSQL β When to Use What
| Criteria | SQL | NoSQL |
|---|---|---|
| **Data Structure** | Well-defined schema | Flexible / dynamic schema |
| **Relationships** | Complex joins needed | Denormalized / embedded |
| **Scaling** | Vertical (mostly) | Horizontal (designed for) |
| **Consistency** | Strong (ACID) | Eventual (usually) |
| **Query Complexity** | Complex SQL queries | Simple key-based lookups |
| **Write Pattern** | Moderate writes | High-volume writes |
| **Use Cases** | Finance, ERP, CRM | Social media, IoT, catalog |
### 7.7 NewSQL Databases
NewSQL databases aim to provide the **scalability of NoSQL with the ACID guarantees of SQL**:
- **Google Spanner** β Globally distributed SQL with TrueTime
- **CockroachDB** β PostgreSQL-compatible distributed database
- **TiDB** β MySQL-compatible distributed database
- **YugabyteDB** β PostgreSQL-compatible distributed database
---
## 8. Distributed Systems {#distributed-systems}
### 8.1 The Fallacies of Distributed Computing
Peter Deutsch and James Gosling identified **8 fallacies** that developers often assume (incorrectly) about distributed systems:
1. π« ~~The network is reliable~~
2. π« ~~Latency is zero~~
3. π« ~~Bandwidth is infinite~~
4. π« ~~The network is secure~~
5. π« ~~Topology doesn't change~~
6. π« ~~There is one administrator~~
7. π« ~~Transport cost is zero~~
8. π« ~~The network is homogeneous~~
> **Every distributed system design must account for these realities.**
### 8.2 Consistency Models
#### Strong Consistency
Every read returns the **most recent write**. After a write completes, all subsequent reads from any node will return that value.
*Trade-off: Higher latency, lower availability*
#### Eventual Consistency
Given enough time with no new updates, all replicas will **converge to the same value**.
```
t=0: Write "Alice" to Node A
t=1: Read from Node B β "Bob" (stale) β temporarily inconsistent
t=5: Node B syncs from Node A
t=6: Read from Node B β "Alice" β (converged)
```
#### Read-Your-Writes Consistency
A user **always sees their own writes**, even if other users might see stale data.
*Implementation: Route reads from same user to same replica (or use primary)*
#### Monotonic Read Consistency
A user never reads **older data** after reading newer data (no time travel backwards).
#### Causal Consistency
Operations that are **causally related** are seen in order:
```
Alice posts message β Bob replies to message
All users who see Bob's reply must also see Alice's original message
```
### 8.3 Distributed Transactions
#### Two-Phase Commit (2PC)
```
Phase 1: Prepare
Coordinator βββΊ "Prepare to commit?" βββΊ Participant A
Coordinator βββΊ "Prepare to commit?" βββΊ Participant B
βββ "Ready" βββββββββββββ
βββ "Ready" βββββββββββββ
Phase 2: Commit
Coordinator βββΊ "Commit!" βββΊ Participant A
Coordinator βββΊ "Commit!" βββΊ Participant B
If ANY participant says "Abort" in Phase 1:
Coordinator βββΊ "Rollback!" βββΊ All Participants
```
**Problems with 2PC:**
- Blocking protocol β if coordinator fails between phases, participants are stuck
- Poor performance (2 round trips)
#### SAGA Pattern
Saga breaks a **distributed transaction into a sequence of local transactions**, each with a **compensating transaction** for rollback:
```
Create Order β Reserve Inventory β Process Payment β Ship Order
β fails β fails β fails
Cancel Order β Unreserve Inv β Refund Payment
```
*Two implementations:*
- **Choreography** β Each service publishes events; next service listens
- **Orchestration** β A central orchestrator tells each service what to do
### 8.4 Distributed Consensus
How do distributed nodes **agree on a value** when nodes can fail?
#### Paxos (The Classic, Hard to Understand)
Paxos is a family of protocols for solving consensus in a network of unreliable processors.
Key roles: **Proposer**, **Acceptor**, **Learner**
#### Raft (The Understandable Alternative)
Raft decomposes consensus into **leader election** + **log replication**:
```
Normal operation:
Leader βββΊ AppendEntries RPC βββΊ Follower 1
βββΊ AppendEntries RPC βββΊ Follower 2
βββΊ AppendEntries RPC βββΊ Follower 3
Majority acknowledges β entry is committed
Leader fails:
Follower 1: hasn't heard from leader (election timeout)
Follower 1: becomes Candidate, sends RequestVote to others
Majority votes for Candidate β New Leader elected β
```
*Used by: etcd, CockroachDB, TiKV, Consul*
#### ZooKeeper (ZAB Protocol)
ZooKeeper uses **ZAB (ZooKeeper Atomic Broadcast)** for coordination.
*Used for: Service discovery, distributed locking, configuration management*
### 8.5 Vector Clocks and Conflict Resolution
In distributed systems with no central clock, we need **logical clocks** to track causality:
```
Vector Clock: [A:0, B:0, C:0]
A sends event: [A:1, B:0, C:0]
B receives, sends: [A:1, B:1, C:0]
C receives, sends: [A:1, B:1, C:1]
If A and B have concurrent events:
A: [A:2, B:1, C:1]
B: [A:1, B:2, C:1]
Neither dominates β conflict! Must be resolved by application.
```
*Conflict resolution strategies:*
- **Last-write-wins** (LWW) β Use timestamps
- **Application-level merge** β e.g., CRDT (Conflict-Free Replicated Data Types)
- **User-level resolution** β Present both versions to user (Git-style)
---
## 9. Microservices Architecture {#microservices}
### 9.1 Monolith vs Microservices
#### Monolithic Architecture
```
ββββββββββββββββββββββββββββββββββββββββββ
β Monolith Application β
β ββββββββββββ ββββββββββββ β
β β User β β Payment β β
β β Service β β Service β β
β ββββββββββββ ββββββββββββ β
β ββββββββββββ ββββββββββββ β
β β Product β β Order β β
β β Service β β Service β β
β ββββββββββββ ββββββββββββ β
β Single Database β
ββββββββββββββββββββββββββββββββββββββββββ
```
**Pros:**
- β Simple development, testing, deployment
- β Low latency (in-process calls)
- β Easy transactions (single DB)
**Cons:**
- β Scaling must scale everything
- β Long build and deploy times
- β Tech stack lock-in
- β Single point of failure
#### Microservices Architecture
```
API Gateway
β
ββββββββββΌβββββββββ
β β β
βΌ βΌ βΌ
βββββββββ βββββββββ βββββββββ
β User β βProductβ β Order β
βServiceβ βServiceβ βServiceβ
β DB β β DB β β DB β
βββββββββ βββββββββ βββββββββ
β β
βββββ Message Bus βββ
```
**Pros:**
- β Independent scaling per service
- β Independent deployment
- β Technology diversity
- β Fault isolation
**Cons:**
- β Network latency between services
- β Distributed system complexity
- β Data consistency challenges
- β Operational overhead
### 9.2 Service Communication Patterns
#### Synchronous Communication
```
Service A βββΊ HTTP/gRPC βββΊ Service B βββΊ Response βββΊ Service A
```
*Simple but creates coupling and cascading failures*
#### Asynchronous Communication
```
Service A βββΊ Message Queue βββΊ Service B (processes independently)
```
*Decoupled but adds complexity and eventual consistency*
### 9.3 API Gateway
An API Gateway is the **single entry point** for all clients:
```
Mobile App βββ
Web App βββ€
3rd Party βββΌβββΊ API Gateway βββΊ Authentication
Partners βββ€ Rate Limiting
IoT Devices βββ Routing
Load Balancing
SSL Termination
Request Transformation
Monitoring/Analytics
β
ββββββββββββββΌβββββββββββββ
βΌ βΌ βΌ
User Service Order Service Payment Service
```
*Examples: Kong, AWS API Gateway, Netflix Zuul, Nginx*
### 9.4 Service Discovery
With many microservices, how do services **find each other**?
#### Client-Side Discovery
```
Service A βββΊ Service Registry (Consul/Eureka) βββΊ "User Service is at 10.0.0.5:8080"
Service A βββΊ 10.0.0.5:8080 (direct call)
```
#### Server-Side Discovery
```
Service A βββΊ Load Balancer βββΊ Service Registry βββΊ Routes to healthy instance
```
### 9.5 The Strangler Fig Pattern
A proven strategy to **migrate from monolith to microservices** incrementally:
```
Phase 1: All traffic to Monolith
Phase 2: Extract User Service β Route /users to microservice
Phase 3: Extract Payment Service β Route /payments to microservice
Phase 4: Extract Order Service β Route /orders to microservice
Phase N: Monolith is "strangled" β all functionality migrated
```
### 9.6 Twelve-Factor App Methodology
The **12-Factor App** is a methodology for building cloud-native, scalable microservices:
1. **Codebase** β One codebase, many deploys
2. **Dependencies** β Explicitly declare dependencies
3. **Config** β Store config in environment variables
4. **Backing Services** β Treat as attached resources
5. **Build, Release, Run** β Strictly separate stages
6. **Processes** β Execute as stateless processes
7. **Port Binding** β Export services via port binding
8. **Concurrency** β Scale out via process model
9. **Disposability** β Fast startup and graceful shutdown
10. **Dev/Prod Parity** β Keep environments similar
11. **Logs** β Treat as event streams
12. **Admin Processes** β Run as one-off processes
---
## 10. Message Queues & Event-Driven Architecture {#message-queues}
### 10.1 Why Message Queues?
Message queues provide **asynchronous communication**, **decoupling**, and **buffering** between services.
```
Without Queue:
Order Service βββΊ Payment Service
If Payment Service is down β Order Service fails β
With Queue:
Order Service βββΊ Queue βββΊ Payment Service
If Payment Service is down β messages queue up β
When it comes back up β processes all queued messages β
```
**Benefits:**
- **Decoupling** β Producer and consumer are independent
- **Load leveling** β Handle traffic spikes without overloading downstream
- **Reliability** β Messages persist even if consumer is down
- **Parallel processing** β Multiple consumers process simultaneously
- **Retry logic** β Failed messages can be retried
### 10.2 Message Queue Models
#### Point-to-Point (Queue)
```
Producer βββΊ [ Queue ] βββΊ Consumer A
(each message consumed by exactly ONE consumer)
```
*Use case: Task distribution, order processing*
#### Publish-Subscribe (Topic)
```
Publisher βββΊ [ Topic ] βββΊ Subscriber A
ββββΊ Subscriber B
ββββΊ Subscriber C
(each message delivered to ALL subscribers)
```
*Use case: Event notifications, logging, analytics*
### 10.3 Apache Kafka
Kafka is a **distributed event streaming platform** designed for high-throughput:
```
Architecture:
Producers βββΊ Topics (partitioned) βββΊ Consumer Groups
Topics: log_events, user_signups, order_placed
Partitions: distribute across brokers for parallelism
Consumer Groups: multiple consumers share the load
Partition 0 βββΊ Consumer 1
Partition 1 βββΊ Consumer 2
Partition 2 βββΊ Consumer 3
```
**Key Kafka Concepts:**
| Concept | Description |
|---|---|
| **Topic** | A category/feed of messages |
| **Partition** | Ordered, immutable sequence of records |
| **Offset** | Position of a message in a partition |
| **Consumer Group** | Group of consumers sharing partitions |
| **Broker** | A Kafka server |
| **Retention** | How long messages are kept (default: 7 days) |
**Kafka vs Traditional Message Queues:**
| Feature | Kafka | RabbitMQ/SQS |
|---|---|---|
| **Message retention** | Persists (days/forever) | Deleted after consumption |
| **Throughput** | Millions/sec | Thousands/sec |
| **Ordering** | Per-partition | Per-queue (FIFO) |
| **Replay** | Yes β re-read old messages | No |
| **Use Case** | Event streaming, log aggregation | Task queues, RPC |
### 10.4 Event-Driven Architecture Patterns
#### Event Sourcing
Instead of storing **current state**, store **all events** that led to that state:
```
Traditional: Store current balance
Account: { id: 123, balance: $450 }
Event Sourcing: Store all events
Event 1: AccountOpened { amount: $1000 }
Event 2: Withdrawal { amount: $200 }
Event 3: Deposit { amount: $150 }
Event 4: Withdrawal { amount: $500 }
βββββββββββββββββββββββββββββββββββββ
Replay all events β Balance = $450
```
**Benefits:**
- Complete audit trail
- Time travel (reconstruct past states)
- Natural fit for event-driven systems
**Drawbacks:**
- Querying current state requires replaying events
- Eventual consistency
- Schema evolution complexity
#### CQRS (Command Query Responsibility Segregation)
Separate the **write model (Commands)** from the **read model (Queries)**:
```
User Action
β
ββββΊ Command Model (Write) βββΊ Event Store βββΊ Projects to
β - Handles writes Read Model DB
β - Validates business rules
β
ββββΊ Query Model (Read) βββΊ Optimized Read DB
- Handles reads
- Denormalized for performance
```
---
## 11. API Design {#api-design}
### 11.1 REST API Design Principles
REST (Representational State Transfer) is the most widely used API design paradigm.
#### REST Constraints
1. **Client-Server** β Separation of concerns
2. **Stateless** β Each request contains all necessary information
3. **Cacheable** β Responses declare cacheability
4. **Uniform Interface** β Consistent resource identification
5. **Layered System** β Client doesn't know if it's talking to final server
6. **Code on Demand** (optional) β Server can send executable code
#### RESTful Resource Design
```
β Good REST API Design:
GET /users β List all users
POST /users β Create a user
GET /users/{id} β Get specific user
PUT /users/{id} β Replace specific user
PATCH /users/{id} β Partially update user
DELETE /users/{id} β Delete specific user
GET /users/{id}/orders β List user's orders
POST /users/{id}/orders β Create order for user
β Bad REST API Design (RPC-style):
GET /getUser
POST /createUser
POST /updateUser
GET /deleteUser β Using GET for side effects!
POST /getUserOrders
```
#### HTTP Status Codes
```
2xx β Success
200 OK
201 Created
204 No Content
3xx β Redirection
301 Moved Permanently
302 Found (temporary redirect)
304 Not Modified (cached response still valid)
4xx β Client Errors
400 Bad Request
401 Unauthorized (not authenticated)
403 Forbidden (authenticated but not authorized)
404 Not Found
405 Method Not Allowed
409 Conflict
422 Unprocessable Entity
429 Too Many Requests
5xx β Server Errors
500 Internal Server Error
502 Bad Gateway
503 Service Unavailable
504 Gateway Timeout
```
### 11.2 GraphQL
GraphQL is a **query language for APIs** that lets clients request exactly the data they need.
```graphql
# REST Problem: Over-fetching and Under-fetching
GET /users/123 β returns ALL user fields (over-fetch)
GET /users/123, GET /users/123/posts, GET /users/123/followers β multiple requests (under-fetch)
# GraphQL Solution:
query {
user(id: "123") {
name
posts(last: 3) {
title
createdAt
}
followers {
count
}
}
}
# Returns EXACTLY what was requested in ONE request β
```
**GraphQL vs REST:**
| Feature | REST | GraphQL |
|---|---|---|
| **Over-fetching** | Common | Eliminated |
| **Under-fetching** | Common (N+1) | Eliminated |
| **Versioning** | URL versioning needed | Schema evolution |
| **Caching** | HTTP caching built-in | Complex (no URL per query) |
| **Learning curve** | Low | Medium |
| **Type Safety** | Depends on tooling | Built-in |
### 11.3 gRPC
gRPC is a **high-performance RPC framework** using **Protocol Buffers**:
```protobuf
// Define service in .proto file
service UserService {
rpc GetUser(GetUserRequest) returns (User);
rpc ListUsers(ListUsersRequest) returns (stream User);
rpc CreateUser(CreateUserRequest) returns (User);
rpc WatchUser(WatchUserRequest) returns (stream UserEvent);
}
message User {
int64 id = 1;
string name = 2;
string email = 3;
}
```
**gRPC vs REST:**
| Feature | REST | gRPC |
|---|---|---|
| **Protocol** | HTTP/1.1 (usually) | HTTP/2 |
| **Payload** | JSON (verbose) | Protobuf (binary, compact) |
| **Streaming** | Limited | Native (bidirectional) |
| **Contract** | OpenAPI/Swagger | .proto files |
| **Performance** | Baseline | ~5-10x faster |
| **Browser support** | Native | Requires proxy |
| **Use Case** | Public APIs | Internal microservices |
### 11.4 API Rate Limiting
Rate limiting protects APIs from abuse and ensures fair usage.
#### Rate Limiting Algorithms
**Token Bucket:**
```
Bucket capacity: 100 tokens
Refill rate: 10 tokens/second
Request arrives: consume 1 token
No tokens left: request rejected (429)
Allows burst up to bucket capacity β
```
**Leaky Bucket:**
```
Requests enter bucket (queue)
Bucket leaks at constant rate (e.g., 10 req/sec)
Bucket overflows: reject request
Smooths bursty traffic β
```
**Fixed Window Counter:**
```
Window: [10:00:00 - 10:01:00]
Limit: 100 requests/minute
Counter: 0
Each request: increment counter
Counter > 100: reject (429)
At 10:01:00: reset counter to 0
Problem: Burst at window boundary (200 req in 2 seconds)
```
**Sliding Window Log:**
```
Store timestamp of each request in log
For each new request:
Remove timestamps older than window
If len(log) >= limit: reject
Else: add timestamp, allow
Most accurate but memory intensive
```
### 11.5 API Versioning Strategies
```
1. URL Path Versioning (most common):
https://api.example.com/v1/users
https://api.example.com/v2/users
2. Query Parameter:
https://api.example.com/users?version=2
3. HTTP Header:
Accept: application/vnd.example.v2+json
4. Content Negotiation:
Accept: application/json; version=2.0
```
---
## 12. Storage Systems {#storage-systems}
### 12.1 Block Storage
Block storage divides data into **fixed-size blocks**, each with a unique address. The OS treats it like a physical disk.
```
Block Storage:
ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ
β B1 β B2 β B3 β B4 β B5 β B6 β B7 β B8 β
ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ
File system manages which blocks belong to which files
```
*Examples: AWS EBS, Google Persistent Disk, Azure Managed Disk*
*Use Cases: Databases, virtual machines, boot volumes*
### 12.2 File Storage (Network File System)
File storage presents a **file system interface** over a network:
```
Server βββΊ NFS/SMB βββΊ Client sees: /mnt/shared/
βββ file1.txt
βββ folder1/
βββ file2.pdf
```
*Examples: AWS EFS, Azure Files, NFS*
*Use Cases: Shared files across servers, home directories, media files*
### 12.3 Object Storage
Object storage stores data as **objects** with metadata and a unique ID. No hierarchy β flat namespace.
```
Object: {
key: "users/profile-photos/user_123.jpg"
data: <binary image data>
metadata: {
content-type: "image/jpeg",
size: 245678,
uploaded-by: "user_123",
custom: { "crop": "center" }
}
}
```
*Examples: AWS S3, Google Cloud Storage, Azure Blob Storage*
*Use Cases: Images, videos, backups, static website hosting, data lakes*
**Key properties:**
- Virtually unlimited scalability
- No hierarchical file system
- Accessed via HTTP (REST API)
- Highly durable (AWS S3: 99.999999999% β 11 nines!)
### 12.4 RAID (Redundant Array of Independent Disks)
| Level | Description | Min Disks | Fault Tolerance | Performance |
|---|---|---|---|---|
| **RAID 0** | Striping (no redundancy) | 2 | None | Best read/write |
| **RAID 1** | Mirroring | 2 | 1 disk | Good read |
| **RAID 5** | Striping + distributed parity | 3 | 1 disk | Good read |
| **RAID 6** | Striping + double parity | 4 | 2 disks | Good read |
| **RAID 10** | Striping + mirroring | 4 | 1 per mirror | Very good |
### 12.5 Data Replication and Disaster Recovery
#### Recovery Point Objective (RPO)
*How much data can we afford to lose?*
```
RPO = 1 hour β backup every hour β lose at most 1 hour of data
RPO = 0 β synchronous replication β no data loss acceptable
```
#### Recovery Time Objective (RTO)
*How long can we afford to be down?*
```
RTO = 4 hours β system can be offline for 4 hours during disaster
RTO = 0 β requires hot standby, automatic failover
```
#### Disaster Recovery Tiers
```
Tier 0: No recovery (backups only) RTO: Days Cost: $
Tier 1: Cold standby RTO: Hours Cost: $$
Tier 2: Warm standby RTO: Minutes Cost: $$$
Tier 3: Hot standby (active-passive) RTO: Seconds Cost: $$$$
Tier 4: Active-active RTO: ~0 Cost: $$$$$
```
---
## 13. Security in System Design {#security}
### 13.1 Authentication vs Authorization
| Concept | Question | Example | Technologies |
|---|---|---|---|
| **Authentication** | Who are you? | Username + password | JWT, OAuth, SSO |
| **Authorization** | What can you do? | Can you access this resource? | RBAC, ABAC, ACL |
### 13.2 Authentication Mechanisms
#### JWT (JSON Web Token)
```
Header.Payload.Signature
Header: {"alg": "HS256", "typ": "JWT"}
Payload: {
"sub": "user_123",
"email": "alice@example.com",
"role": "admin",
"iat": 1704067200,
"exp": 1704153600
}
Signature: HMACSHA256(base64(header) + "." + base64(payload), secret)
```
**Pros:** Stateless, self-contained, works across services
**Cons:** Cannot be revoked (until expiry), larger than session tokens
#### OAuth 2.0 / OpenID Connect
```
OAuth 2.0 Authorization Code Flow:
User βββΊ "Login with Google"
App βββΊ Google Authorization Server
Google: "Allow App to access your profile?"
User: "Allow"
Google βββΊ App (authorization code)
App βββΊ Google (code + client_secret)
Google βββΊ App (access_token + refresh_token)
App uses access_token to call Google APIs
```
### 13.3 Common Security Vulnerabilities (OWASP Top 10)
1. **Injection** (SQL, NoSQL, OS command injection)
```sql
-- Vulnerable:
SELECT * FROM users WHERE id = '" + userId + "'";
-- Safe (parameterized query):
SELECT * FROM users WHERE id = ?
```
2. **Broken Authentication** β Weak passwords, session management flaws
3. **Sensitive Data Exposure** β Unencrypted PII, weak hashing
4. **XML External Entities (XXE)** β Malicious XML parsing
5. **Broken Access Control** β IDOR vulnerabilities
```
GET /api/user/123/profile β your profile
GET /api/user/456/profile β another user's profile (should be forbidden!)
```
6. **Security Misconfiguration** β Default passwords, open ports
7. **XSS (Cross-Site Scripting)**
```html
<!-- Vulnerable input rendered as HTML -->
<script>document.cookie</script> β injected malicious script
```
8. **Insecure Deserialization**
9. **Using Components with Known Vulnerabilities**
10. **Insufficient Logging & Monitoring**
### 13.4 Encryption
#### Encryption at Rest
Data encrypted when stored on disk:
```
AES-256 encryption for databases, file systems, backups
AWS S3: Server-Side Encryption (SSE-S3, SSE-KMS, SSE-C)
```
#### Encryption in Transit
Data encrypted when transmitted:
```
HTTPS/TLS for web traffic
TLS for internal service communication
mTLS (mutual TLS) for microservices authentication
```
#### Hashing Passwords
```
β Never store plaintext passwords
β Never use MD5 or SHA-1 for passwords (too fast = brute-forceable)
β Use: bcrypt, scrypt, argon2 (slow by design, adaptive cost)
bcrypt example:
$2b$12$LQv3c1yqBWVHxkd0LHAkCOYz6TtxMQJqhN8/LewdBPj2gUexhXWym
β β
β cost factor (12 = 2^12 = 4096 iterations)
algorithm version
```
### 13.5 DDoS Protection Strategies
```
Layer 3/4 DDoS (volumetric):
β Anycast network diffusion (CDN absorbs traffic)
β Rate limiting at network edge
β IP reputation filtering
β ISP-level scrubbing
Layer 7 DDoS (application):
β Bot detection (CAPTCHA, behavioral analysis)
β Rate limiting per IP/user
β WAF (Web Application Firewall)
β Challenge suspicious traffic
```
---
## 14. Monitoring & Observability {#monitoring}
### 14.1 The Three Pillars of Observability
> **Observability** is the ability to understand what's happening inside your system from its external outputs.
#### 1. Metrics
*Numerical measurements over time*
```
System Metrics:
CPU usage: 67%
Memory: 4.2GB / 8GB
Network I/O: 245 MB/s
Application Metrics:
Requests per second: 12,450
Error rate: 0.02%
P99 latency: 187ms
Business Metrics:
Orders placed: 1,240/minute
Revenue: $45,230/hour
Active users: 234,567
```
*Tools: Prometheus, Datadog, CloudWatch, Grafana*
#### 2. Logs
*Immutable, timestamped records of discrete events*
```json
{
"timestamp": "2024-01-15T10:30:45.123Z",
"level": "ERROR",
"service": "payment-service",
"trace_id": "abc123def456",
"user_id": "user_789",
"message": "Payment processing failed",
"error": "Card declined",
"metadata": {
"amount": 99.99,
"currency": "USD",
"attempt": 2
}
}
```
*Tools: ELK Stack (Elasticsearch + Logstash + Kibana), Splunk, Loki*
#### 3. Traces
*Record of a request's journey through distributed services*
```
Request ID: abc123
β
ββ [0ms] API Gateway (2ms)
ββ [2ms] User Service (5ms)
β ββ [3ms] DB Query (4ms)
ββ [7ms] Product Service (12ms)
β ββ [8ms] Cache Hit (1ms)
β ββ [9ms] DB Query (10ms)
ββ [19ms] Order Service (8ms)
β ββ [20ms] Payment Service (45ms)
β ββ [21ms] Stripe API (42ms)
ββ [64ms] Total response time
```
*Tools: Jaeger, Zipkin, AWS X-Ray, Datadog APM*
### 14.2 The RED Method and USE Method
#### RED Method (for services)
- **R**ate β Requests per second
- **E**rror rate β Fraction of requests that are errors
- **D**uration β Time to handle each request (latency distribution)
#### USE Method (for resources)
- **U**tilization β % time resource is busy
- **S**aturation β Amount of work resource can't process (queue length)
- **E**rrors β Error events count
### 14.3 SLI, SLO, and SLA
```
SLI (Service Level Indicator):
The actual measured metric
"Our 99th percentile latency is 187ms"
SLO (Service Level Objective):
Your internal target
"99th percentile latency < 200ms for 99.9% of requests"
SLA (Service Level Agreement):
Legal/contractual commitment to customers
"We guarantee 99.9% uptime; you get credit if we fall below"
SLO > SLA (SLO should be stricter than SLA β gives buffer before SLA breach)
```
### 14.4 Error Budgets
Error budget = *How much unreliability you're allowed before violating your SLO*
```
SLO: 99.9% availability
Error budget: 0.1% = 43.8 minutes/month
If you've used 30 minutes of downtime this month:
Remaining budget: 13.8 minutes
β Slow down risky deployments
β Focus on reliability improvements
If you have budget remaining:
β Ship new features confidently!
```
### 14.5 Alerting Best Practices
```
Avoid alert fatigue:
β Alert on SLO violations (symptoms), not causes
β Set appropriate thresholds (not too sensitive)
β Alert should be actionable
β Different severity levels (page vs ticket vs info)
Bad alert: CPU > 80% for 1 minute (not actionable, too frequent)
Good alert: Error rate > 1% for 5 minutes (SLO at risk!)
Alerting hierarchy:
P0 (Critical) β Page on-call immediately, 24/7
P1 (High) β Page on-call during business hours
P2 (Medium) β Create ticket for next sprint
P3 (Low) β Log for weekly review
```
---
## 15. Real-World System Design Case Studies {#case-studies}
### 15.1 Case Study: Design Twitter/X
#### Requirements
- 500M daily active users
- 150M tweets/day
- Read-heavy (read:write = 100:1)
- Timeline generation (home feed)
#### High-Level Architecture
```
CDN (Static content)
β
Client βββΊ API Gateway βββΊ Auth Service
β
βββββββββββββΌββββββββββββ
βΌ βΌ βΌ
Tweet User Timeline
Service Service Service
β β β
βΌ βΌ βΌ
Tweets DB Users DB Cache (Redis)
β β²
ββββΊ Fanout Service βββββββ
β
Message Queue (Kafka)
```
#### Timeline Generation: Push vs Pull
**Push (Fanout on Write):**
```
Alice posts tweet
β Kafka event
β Fanout service reads Alice's 10,000 followers
β Writes tweet to each follower's timeline cache
Read: O(1) β just read from cache
Write: O(n) β n = number of followers
Problem: Celebrity with 100M followers = 100M writes per tweet!
```
**Pull (Fanout on Read):**
```
Alice visits home page
β Fetch list of people she follows
β Query each person's recent tweets
β Merge, sort, return
Read: O(n) where n = number of accounts followed
Write: O(1)
Problem: High read latency, many DB queries
```
**Twitter's Hybrid Solution:**
```
Regular users: Push (pre-computed timelines)
Celebrities (>X followers): Pull (queried at read time)
Combine both approaches in timeline service
```
### 15.2 Case Study: Design a URL Shortener (bit.ly)
#### Requirements
- 100M URLs shortened per day
- 10B URL redirects per day (100:1 read:write)
- Short URL must be ~7 characters
- URLs must not expire (or expire after N years)
#### Capacity Estimation
```
Write: 100M / 86400 = ~1,160 URLs/second
Read: 10B / 86400 = ~115,740 redirects/second
Storage (5 years):
100M URLs/day Γ 365 Γ 5 = 182.5B URLs
Each URL record: ~500 bytes
Total: 182.5B Γ 500B = ~91 TB
Short URL key space:
7 characters, base62 (a-z, A-Z, 0-9):
62^7 = 3.5 trillion unique URLs β
```
#### URL Encoding Strategies
**MD5 hashing:**
```
MD5("https://example.com/very/long/url") = "1a79a4d60de6718e8e5b326e338ae533"
Take first 7 chars: "1a79a4d" β bit.ly/1a79a4d
Problem: Collisions possible
```
**Base62 encoding of auto-increment ID:**
```
DB auto-increment ID: 100000
Convert to base62: "q0" (much shorter)
ID 100000000 β base62 β "FXoSg" (6 chars)
```
**Distributed ID generation (Twitter Snowflake):**
```
64-bit ID structure:
1 bit (unused) | 41 bits (timestamp ms) | 10 bits (machine ID) | 12 bits (sequence)
Generates ~4M unique IDs per second per machine β
Sortable by time β
No coordination needed β
```
#### Architecture
```
Client βββΊ Load Balancer βββΊ URL Shortener Service
β
βββββββββββ΄ββββββββββ
βΌ βΌ
Write Path Read Path
β β
MySQL DB Redis Cache (TTL)
(source of truth) β
MySQL DB (cache miss)
```
### 15.3 Case Study: Design Netflix
#### Key Challenges
- 200M+ subscribers globally
- Billions of hours streamed per month
- Video files are massive (1 hour HD = ~2 GB)
- Users on wildly different network conditions
#### Video Ingestion Pipeline
```
Raw Video File
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Transcoding Service β
β β
β Input: "Inception.mp4" (20 GB, 4K RAW) β
β β
β Output: Multiple formats & resolutions β
β βββ inception_240p.mp4 (low bandwidth) β
β βββ inception_480p.mp4 β
β βββ inception_720p.mp4 β
β βββ inception_1080p.mp4 β
β βββ inception_4k.mp4 (high bandwidth) β
β βββ inception_hdr.mp4 β
β βββ inception_audio_en.aac (+ 20 languages) β
β β
β Also: DRM encryption, content validation β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
AWS S3 (origin storage)
β
βΌ
CDN (Akamai, Netflix Open Connect)
```
#### Adaptive Bitrate Streaming (ABR)
```
Player monitors bandwidth continuously:
Bandwidth: 20 Mbps β Stream 4K
Bandwidth drops to 8 Mbps β Switch to 1080p seamlessly
Bandwidth drops to 2 Mbps β Switch to 720p
Bandwidth = 0.5 Mbps β Switch to 480p
HLS: Video split into 2-10 second chunks
Player fetches chunks + adjusts quality per chunk
```
#### Netflix's Chaos Engineering
> Netflix pioneered **Chaos Engineering** with **Chaos Monkey** β a tool that ***randomly terminates production instances*** to ensure systems are resilient.
---
## π Summary: Key Principles Cheat Sheet
### The System Design Hierarchy of Needs
```
βββββββββββββββββββ
β BUSINESS β β Features, Cost
β GOALS β
ββββ΄ββββββββββββββββββ΄βββ
β RELIABILITY β β Works correctly
ββββ΄ββββββββββββββββββββββββ΄βββ
β SCALABILITY β β Handles growth
ββββ΄ββββββββββββββββββββββββββββββ΄βββ
β PERFORMANCE β β Fast enough
ββββ΄ββββββββββββββββββββββββββββββββββββ΄βββ
β SECURITY β β Not exploitable
ββββ΄ββββββββββββββββββββββββββββββββββββββββββ΄βββ
β MAINTAINABILITY β β Evolvable
ββββββββββββββββββββββββββββββββββββββββββββββββββ
```
### Quick Reference: Technology Choices
| Need | Technology |
|---|---|
| **Relational data** | PostgreSQL, MySQL |
| **Document store** | MongoDB |
| **Cache** | Redis, Memcached |
| **Wide-column** | Cassandra, HBase |
| **Search** | Elasticsearch |
| **Graph** | Neo4j |
| **Time-series** | InfluxDB, TimescaleDB |
| **Message queue** | Kafka, RabbitMQ, SQS |
| **Object storage** | S3, GCS |
| **Load balancer** | Nginx, HAProxy, AWS ALB |
| **Service discovery** | Consul, Eureka |
| **Container orchestration** | Kubernetes |
| **Monitoring** | Prometheus + Grafana |
| **Distributed tracing** | Jaeger, Zipkin |
| **CDN** | Cloudflare, CloudFront |
### The Golden Rules of System Design
> 1. π **Start simple** β Don't over-engineer. Add complexity only when needed.
> 2. π **Estimate before designing** β Know your scale before choosing solutions.
> 3. π **Embrace trade-offs** β Every design decision is a trade-off. Know what you're trading.
> 4. π§± **Design for failure** β Assume everything will fail. Build for resilience.
> 5. π **Scale horizontally** β Design stateless services that scale out, not up.
> 6. ποΈ **Cache aggressively** β Cache at every layer, but handle invalidation carefully.
> 7. π **Measure everything** β You can't improve what you don't measure.
> 8. π **Iterate** β No design survives contact with production unchanged.
---
*This guide covers the core concepts of system design. The field is constantly evolving β stay curious, read engineering blogs from companies like Netflix, Uber, Airbnb, and Cloudflare, and always tie theoretical knowledge to real implementations.*
---
**Further Reading:**
- *Designing Data-Intensive Applications* β Martin Kleppmann
- *The System Design Interview* β Alex Xu
- *Site Reliability Engineering* β Google (free online)
- *Building Microservices* β Sam Newman
- Engineering blogs: Netflix Tech Blog, Uber Engineering, Cloudflare Blog, AWS Architecture Blog