The Complete Guide to System Design: From Fundamentals to Mastery

April 8, 2026•2 hour


# The Complete Guide to System Design: From Fundamentals to Mastery

> *"A system that is not designed will not work. A system that is designed will barely work."* — Unknown Engineer

---

## 📌 Table of Contents

1. [Introduction to System Design](#introduction)

2. [Scalability](#scalability)

3. [Reliability & Availability](#reliability)

4. [Networking Fundamentals](#networking)

5. [Load Balancing](#load-balancing)

6. [Caching](#caching)

7. [Databases](#databases)

8. [Distributed Systems](#distributed-systems)

9. [Microservices Architecture](#microservices)

10. [Message Queues & Event-Driven Architecture](#message-queues)

11. [API Design](#api-design)

12. [Storage Systems](#storage-systems)

13. [Security in System Design](#security)

14. [Monitoring & Observability](#monitoring)

15. [Case Studies](#case-studies)

---

## 1. Introduction to System Design {#introduction}

System design is the process of **defining the architecture, components, modules, interfaces, and data** for a system to satisfy specified requirements. It is one of the most critical phases in software engineering, bridging the gap between requirements and implementation.

Whether you are designing a **small web application** or a **planet-scale distributed system** serving billions of users, understanding the foundational principles of system design is absolutely essential.

### 1.1 Why System Design Matters

System design matters for multiple reasons:

- 🏗️ **Scalability** — Can your system handle 10x, 100x, or 1000x more users?

- 🔒 **Reliability** — Does your system work correctly even when parts fail?

- ⚡ **Performance** — Is your system fast enough for real-world usage?

- 💰 **Cost Efficiency** — Is your system economical to build and operate?

- 🔧 **Maintainability** — Can your team evolve and debug the system easily?

### 1.2 The Two Pillars: Functional vs Non-Functional Requirements

Before diving into design, every engineer must distinguish between two types of requirements:

#### Functional Requirements

These define ***what*** the system should do:

- Users can upload photos

- Users can follow other users

- The system should send email notifications

- The system must process payments

#### Non-Functional Requirements

These define ***how well*** the system does it:

| Requirement | Description | Example |
|---|---|---|
| **Latency** | Response time | < 200ms for API calls |
| **Throughput** | Requests per second | 100,000 RPS |
| **Availability** | Uptime percentage | 99.99% (52 min downtime/year) |
| **Consistency** | Data correctness | All users see same data |
| **Durability** | Data persistence | No data loss after failure |
| **Scalability** | Growth capacity | 10M to 1B users |

### 1.3 The STAR Framework for System Design Interviews

When approaching any system design problem, use the **STAR framework**:

```

S — Scope the problem (clarify requirements)

T — Think about capacity estimation

A — Architect the high-level design

R — Refine with deep dives

```

### 1.4 Back-of-the-Envelope Calculations

One of the most underrated skills in system design is **estimation**. Engineers must be comfortable doing rough math to validate design decisions.

#### Common Numbers Every Engineer Should Know

```
Latency Numbers (Approximate):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
L1 cache reference                            0.5 ns
Branch mispredict                               5 ns
L2 cache reference                              7 ns
Mutex lock/unlock                              25 ns
Main memory reference                         100 ns
Compress 1K bytes with Snappy               3,000 ns  (3 μs)
Send 1K bytes over 1 Gbps network          10,000 ns  (10 μs)
Read 4K randomly from SSD                 150,000 ns  (150 μs)
Read 1 MB sequentially from memory        250,000 ns  (250 μs)
Round trip within same DC                 500,000 ns  (0.5 ms)
Read 1 MB sequentially from SSD         1,000,000 ns  (1 ms)
Disk seek                              10,000,000 ns  (10 ms)
Read 1 MB sequentially from disk       10,000,000 ns  (10 ms)
Send packet CA→Netherlands→CA         150,000,000 ns  (150 ms)
```

#### Data Volume Calculations

> **Pro Tip:** When estimating storage, always think in terms of: *users × data per user × time period*

**Example:** Estimating storage for a Twitter-like service

- 500 million daily active users

- Each user sends ~2 tweets/day

- Each tweet = 300 bytes (text) + 200 KB (media, 30% of tweets)

- **Daily text storage = 500M × 2 × 300 B = 300 GB**

- **Daily media storage = 500M × 2 × 0.3 × 200 KB = 60 TB**

- **Total ≈ 60 TB/day, almost all of it media**
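The *users × data per user × time period* rule is easy to turn into a few lines of arithmetic. The sketch below is only an illustration: the function names and the 5-year retention window are assumptions, and it covers just the text portion of the example above (media follows the same pattern with a larger per-item size).

```python
def daily_storage_bytes(users: int, items_per_user: float, bytes_per_item: float) -> float:
    """users x data per user, for one day."""
    return users * items_per_user * bytes_per_item


def human_readable(num_bytes: float) -> str:
    """Format a byte count using decimal units (1 KB = 1000 B)."""
    value = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if value < 1000:
            return f"{value:.1f} {unit}"
        value /= 1000
    return f"{value:.1f} PB"


# Text only: 500M daily users, ~2 tweets each, ~300 bytes per tweet.
text_per_day = daily_storage_bytes(500_000_000, 2, 300)
print(human_readable(text_per_day))            # 300.0 GB
print(human_readable(text_per_day * 365 * 5))  # 547.5 TB over an assumed 5 years
```

Keeping the inputs as named arguments makes it cheap to rerun the estimate when an assumption changes, which is usually the point of a back-of-the-envelope calculation.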

---

## 2. Scalability {#scalability}

Scalability is the ability of a system to handle **increased load gracefully**. It is arguably the most discussed topic in system design, and for good reason — it directly impacts user experience and business viability.

### 2.1 Vertical Scaling (Scale Up)

**Vertical scaling** means adding more power to an existing machine — more CPU, more RAM, faster disks.

```

Before Scaling: After Vertical Scaling:

┌──────────────┐ ┌──────────────────────┐

│ Server │ │ Upgraded Server │

│ CPU: 4 core │ ──► │ CPU: 64 core │

│ RAM: 16 GB │ │ RAM: 512 GB │

│ SSD: 500 GB │ │ SSD: 10 TB │

└──────────────┘ └──────────────────────┘

```

**Advantages:**

- ✅ Simple — no application changes needed

- ✅ No distributed system complexity

- ✅ Stronger consistency (single machine)

**Disadvantages:**

- ❌ Has a hard limit (you can't infinitely upgrade hardware)

- ❌ Single point of failure

- ❌ Expensive at high end

- ❌ Downtime during upgrades

### 2.2 Horizontal Scaling (Scale Out)

**Horizontal scaling** means adding more machines to the pool of resources.

```
Before Scaling:        After Horizontal Scaling:

                       ┌──────────┐
┌──────────┐           │ Server 1 │
│ Server 1 │   ──►     │ Server 2 │  ← Load Balancer distributes
└──────────┘           │ Server 3 │
                       │ Server N │
                       └──────────┘
```

**Advantages:**

- ✅ Theoretically unlimited scaling

- ✅ No single point of failure among the servers (high availability), provided the load balancer itself is also redundant

- ✅ Cost-effective with commodity hardware

- ✅ Zero-downtime upgrades (rolling deploys)

**Disadvantages:**

- ❌ Application must be stateless (or use distributed state)

- ❌ Increased complexity (distributed systems problems)

- ❌ Network latency between nodes

### 2.3 The Scalability Bottlenecks

Scaling isn't just about servers. Every layer of your stack can become a bottleneck:

#### The Scalability Stack

```

┌─────────────────────────────────────┐

│ DNS Resolution │ ← Often ignored

├─────────────────────────────────────┤

│ Load Balancer │ ← Can become bottleneck

├─────────────────────────────────────┤

│ Web/API Servers │ ← Usually easiest to scale

├─────────────────────────────────────┤

│ Cache Layer │ ← Critical for performance

├─────────────────────────────────────┤

│ Application Servers │ ← Business logic

├─────────────────────────────────────┤

│ Databases │ ← Hardest to scale

├─────────────────────────────────────┤

│ Storage Systems │ ← File/blob storage

└─────────────────────────────────────┘

```

### 2.4 Stateless vs Stateful Architecture

One of the most important design decisions when scaling horizontally is managing **state**.

#### Stateless Architecture ✅ (Preferred for scaling)

In a stateless system, **each request contains all the information needed to process it**. The server does not store any session information.

```

Request ──► Server A ──► Response

Request ──► Server B ──► Response (same result!)

Request ──► Server C ──► Response (same result!)

```

*State is stored externally in:*

- Databases

- Redis (shared session store)

- JWTs (client-side tokens)

#### Stateful Architecture ⚠️ (Problematic for scaling)

In a stateful system, the server remembers previous requests. This makes load balancing difficult.

```

Request 1 ──► Server A (stores session)

Request 2 ──► Server B (no session!) ──► ERROR ❌

```

*Solutions:*

- **Sticky sessions** (route same user to same server — but limits scaling)

- **Centralized session store** (Redis, Memcached)

### 2.5 The CAP Theorem

One of the most fundamental theorems in distributed systems, stated by Eric Brewer in 2000:

> ***"In a distributed system, you can only guarantee two of the following three properties simultaneously: Consistency, Availability, and Partition Tolerance."***

```
                    C (Consistency)
                         /\
                        /  \
              CA       /    \       CP
            RDBMS     /      \    MongoDB
                     /        \   HBase
                    /          \  Redis
                   /____________\
  A (Availability)       AP       P (Partition Tolerance)
                     Cassandra
                     CouchDB
                     DynamoDB
```


> **⚠️ Important:** In practice, *Partition Tolerance is non-negotiable* in distributed systems (networks fail). So the real choice is between **CP** and **AP**.

### 2.6 PACELC Theorem (Extension of CAP)

PACELC extends CAP by acknowledging that even **without** partitions, there's a trade-off between **latency** and **consistency**.

```

If (Partition):

Choose between: Availability ← or → Consistency

Else (Normal operation):

Choose between: Latency ← or → Consistency

```

---

## 3. Reliability & Availability {#reliability}

### 3.1 Defining Reliability

**Reliability** is the probability that a system will perform its required function **without failure** over a specified period under given conditions.

A reliable system is one that:

- Continues to work correctly even when things go wrong

- Handles hardware faults, software faults, and human errors

- Degrades gracefully rather than failing catastrophically

### 3.2 The Nines of Availability

Availability is typically expressed as a percentage of uptime:

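| Availability | Downtime per year (365 days) | Downtime per month (30 days) | Downtime per day |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.2 hours | 14.4 minutes |
| 99.9% (three nines) | 8.76 hours | 43.2 minutes | 1.44 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.32 minutes | 8.64 seconds |
| 99.999% (five nines) | 5.26 minutes | 25.9 seconds | 0.86 seconds |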

> **💡 Industry Standard:** Most production systems aim for **99.9% to 99.99%**. Five nines (99.999%) is the gold standard for critical infrastructure like telephone networks.

### 3.3 Fault Tolerance Patterns

#### Circuit Breaker Pattern 🔌

The Circuit Breaker prevents cascading failures by detecting when a service is down and *failing fast* instead of waiting for timeouts.

```
States:

┌───────────┐   failures exceed threshold   ┌───────────┐
│  CLOSED   │ ────────────────────────────► │   OPEN    │
│ (normal)  │                               │ (fast     │
└───────────┘                               │  fail)    │
      ▲                                     └───────────┘
      │                                           │
      │ success                                   │ timeout
      │                                           ▼
      │                                     ┌────────────┐
      └──────────────────────────────────── │ HALF-OPEN  │
                                            │ (test req) │
                                            └────────────┘

A failed test request in HALF-OPEN sends the breaker back to OPEN.
```
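The three states can be sketched in a small class. This is a minimal, single-threaded illustration rather than a production breaker: the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for the example. It also assumes the common convention that a failed trial request in HALF-OPEN reopens the circuit.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay OPEN before a trial call
        self.clock = clock                          # injectable so the timeout can be tested
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast, circuit is open")
            self.state = "HALF_OPEN"  # timeout elapsed: let one trial request through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = "CLOSED"
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as a signal to serve a fallback.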

#### Retry with Exponential Backoff

```

Attempt 1: wait 1 second

Attempt 2: wait 2 seconds

Attempt 3: wait 4 seconds

Attempt 4: wait 8 seconds

Attempt 5: wait 16 seconds + give up

```

> **Always add jitter (randomness) to backoff to prevent thundering herd!**
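The schedule above, with jitter added, might look like the following. It is a sketch under assumptions: the helper names are made up for illustration, and it uses the "full jitter" variant, where each wait is drawn uniformly between zero and the exponential ceiling. The `sleep` and `rng` parameters are injectable only so the behavior can be tested without real delays.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=16.0, rng=random.random):
    """Full-jitter delay for a zero-based attempt: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def retry(func, max_attempts=5, base=1.0, cap=16.0, sleep=time.sleep, rng=random.random):
    """Call func until it succeeds or max_attempts is reached, backing off between tries."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(backoff_delay(attempt, base, cap, rng))
```

In real code it is usually worth retrying only on errors known to be transient (timeouts, connection resets) rather than on every exception, and only for operations that are safe to repeat.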

#### Bulkhead Pattern 🚢

Inspired by ship design — isolate components so failures in one don't sink the whole ship.

```
Without Bulkhead:          With Bulkhead:

┌────────────────┐         ┌─────────┐ ┌─────────┐ ┌─────────┐
│ All services   │         │ Svc A   │ │ Svc B   │ │ Svc C   │
│ share one      │         │ Pool    │ │ Pool    │ │ Pool    │
│ thread pool    │         │ 10      │ │ 10      │ │ 10      │
│                │         │ threads │ │ threads │ │ threads │
│ Svc A fails →  │         └─────────┘ └─────────┘ └─────────┘
│ ALL fail ❌    │         A fails → only A affected ✅
└────────────────┘
```
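One lightweight way to get bulkhead-style isolation is to cap concurrent calls per dependency with a semaphore, so a slow dependency can exhaust only its own slots. The sketch below assumes that approach; the `Bulkhead` class and its names are hypothetical, and a dedicated thread pool per service would be an equally valid design.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency has no free slots left."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, func, *args, **kwargs):
        # Reject immediately instead of queueing, so callers are not tied up waiting.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name}: no free slots")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per downstream service, each with its own limit.
service_a = Bulkhead("svc-a", max_concurrent=10)
service_b = Bulkhead("svc-b", max_concurrent=10)
```

If every slot of `service_a` is held by hung requests, calls through `service_b` are unaffected, which is the isolation the diagram describes.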

### 3.4 Redundancy Strategies

| Strategy | Description | Use Case |
|---|---|---|
| **Active-Active** | All nodes handle traffic simultaneously | Load distribution + HA |
| **Active-Passive** | One node handles traffic; backup on standby | Simpler failover |
| **N+1 Redundancy** | N required nodes + 1 spare | Cost-effective HA |
| **Geographic Redundancy** | Multiple data centers in different regions | Disaster recovery |

### 3.5 Mean Time Between Failures (MTBF) and Mean Time to Recovery (MTTR)

```

Availability = MTBF / (MTBF + MTTR)

MTBF = Mean Time Between Failures

MTTR = Mean Time to Recovery

Example:

MTBF = 720 hours (fails once a month)

MTTR = 1 hour (takes 1 hour to recover)

Availability = 720 / (720 + 1) = 99.86%

```
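The formula is simple enough to check directly, and doing so shows how much leverage recovery time has. The helper names below are assumptions for illustration; the first call reproduces the example above.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float, hours_per_year: float = 365 * 24) -> float:
    """Expected downtime in hours for a given availability fraction."""
    return (1 - avail) * hours_per_year


monthly_failure = availability(720, 1)
print(f"{monthly_failure:.2%}")                          # 99.86%
print(f"{downtime_hours_per_year(monthly_failure):.1f}")  # 12.1 hours per year

# Same failure rate, but recovery in 6 minutes instead of 1 hour.
print(f"{availability(720, 0.1):.2%}")                   # 99.99%
```

Cutting MTTR from an hour to six minutes moves the same system from under three nines to roughly four, without making failures any rarer.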

---

## 4. Networking Fundamentals {#networking}

### 4.1 The OSI Model

Understanding the OSI (Open Systems Interconnection) model is essential for any system designer:

```

┌──────────────────────────────────────────┐

│ Layer 7 — Application (HTTP, SMTP, FTP)│

├──────────────────────────────────────────┤

│ Layer 6 — Presentation (SSL/TLS, JPEG) │

├──────────────────────────────────────────┤

│ Layer 5 — Session (NetBIOS, PPTP) │

├──────────────────────────────────────────┤

│ Layer 4 — Transport (TCP, UDP) │

├──────────────────────────────────────────┤

│ Layer 3 — Network (IP, ICMP) │

├──────────────────────────────────────────┤

│ Layer 2 — Data Link (Ethernet, MAC) │

├──────────────────────────────────────────┤

│ Layer 1 — Physical (Cables, WiFi) │

└──────────────────────────────────────────┘

```

### 4.2 TCP vs UDP

| Feature | TCP | UDP |
|---|---|---|
| **Connection** | Connection-oriented (3-way handshake) | Connectionless |
| **Reliability** | Reliable delivery (acknowledgments + retransmission) | Best effort, no delivery guarantee |
| **Ordering** | In-order delivery | No ordering |
| **Speed** | Slower (handshake and acknowledgment overhead) | Faster (minimal overhead) |
| **Use Cases** | HTTP/1.1 and HTTP/2, email, file transfer | DNS, live video and voice, gaming |
| **Flow Control** | Yes | No |
| **Error Checking** | Checksum, with lost or corrupted segments retransmitted | Checksum only (detection, no recovery) |

### 4.3 DNS — The Internet's Phone Book

DNS (Domain Name System) translates **human-readable domain names** into **IP addresses**.

```

User types: www.example.com

│

▼

┌───────────────┐

│ Browser Cache│ ─── found? serve it ✅

└───────┬───────┘

│ not found

▼

┌───────────────┐

│ OS DNS Cache │ ─── found? serve it ✅

└───────┬───────┘

│ not found

▼

┌───────────────┐

│ ISP Resolver │ ─── found? serve it ✅

└───────┬───────┘

│ not found

▼

┌───────────────┐

│ Root Nameserver│ → directs to .com TLD

└───────┬───────┘

▼

┌───────────────┐

│ TLD Nameserver│ → directs to example.com NS

└───────┬───────┘

▼

┌───────────────────────┐

│ Authoritative NS │ → returns 93.184.216.34

└───────────────────────┘

```

#### DNS Record Types

| Record | Purpose | Example |
|---|---|---|
| **A** | IPv4 address | example.com → 93.184.216.34 |
| **AAAA** | IPv6 address | example.com → 2606:2800::1 |
| **CNAME** | Canonical name (alias) | www → example.com |
| **MX** | Mail exchange | mail.example.com |
| **TXT** | Text record (SPF, DKIM) | v=spf1 include:... |
| **NS** | Nameserver | ns1.example.com |
| **SOA** | Start of Authority | Zone metadata |

### 4.4 HTTP/HTTPS and HTTP/2 vs HTTP/3

#### HTTP/1.1 Problems

- One request at a time per connection (pipelining exists but is rarely enabled in practice)

- Head-of-line blocking

- Plain text headers (no compression)

#### HTTP/2 Improvements ✅

- **Multiplexing** — multiple requests over single connection

- **Header compression** (HPACK)

- **Server push** — server can proactively send resources

- **Binary protocol** — more efficient than text

#### HTTP/3 (QUIC) Improvements ✅

- Built on **UDP** instead of TCP

- **Faster connection setup**: 1-RTT for new connections, 0-RTT when resuming a previous one

- **Better mobile performance** (handles IP changes)

- **Eliminates head-of-line blocking** at the transport layer

### 4.5 WebSockets vs Long Polling vs Server-Sent Events

When building **real-time features**, engineers must choose the right communication protocol:

```

Long Polling:

Client ──► Server (request)

Server waits...

Server ──► Client (response when data ready)

Client ──► Server (immediately new request)

[High latency, many connections]

Server-Sent Events (SSE):

Client ──► Server (one-time request)

Server ──► Client (stream of events)

[One-way, simple, HTTP-based]

WebSockets:

Client ◄──► Server (persistent bidirectional connection)

[Low latency, bidirectional, complex]

```

| Feature | Long Polling | SSE | WebSockets |
|---|---|---|---|
| **Complexity** | Low | Low | Medium |

---

## 5. Load Balancing {#load-balancing}

A **load balancer** distributes incoming network traffic across multiple servers to ensure no single server is overwhelmed.

### 5.1 Load Balancing Algorithms

#### Round Robin

```

Request 1 ──► Server A

Request 2 ──► Server B

Request 3 ──► Server C

Request 4 ──► Server A (cycle repeats)

```

*Simple and equal distribution. Good when all servers have equal capacity.*

#### Weighted Round Robin

```

Server A (weight: 3) ──► gets 3 out of every 5 requests

Server B (weight: 2) ──► gets 2 out of every 5 requests

```

*Good when servers have different capacities.*

#### Least Connections

```

Server A: 100 connections

Server B: 45 connections ← Next request goes here

Server C: 78 connections

```

*Best for long-lived connections (WebSockets, databases).*

#### IP Hash / Sticky Sessions

```

User IP 192.168.1.1 ──► always routed to Server A

User IP 10.0.0.5 ──► always routed to Server B

```

*Good when session state is stored on servers (though stateless is preferred).*

#### Least Response Time

```

Server A: avg 120ms

Server B: avg 45ms ← Next request goes here

Server C: avg 89ms

```

*Optimal performance but requires active health monitoring.*

#### Random

*Simple, works well at large scale due to the law of large numbers.*
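Three of the algorithms above fit in a few lines each. This is a minimal sketch with hypothetical class names, not how any particular load balancer implements them. In particular, the weighted version here sends requests in bursts (A, A, A, B, B), whereas production balancers typically interleave servers more smoothly.

```python
import itertools


class RoundRobin:
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class WeightedRoundRobin:
    def __init__(self, weights):
        # weights maps server name -> integer weight, e.g. {"A": 3, "B": 2}
        expanded = [server for server, w in weights.items() for _ in range(w)]
        self._cycle = itertools.cycle(expanded)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1  # the caller must release() when the connection closes
        return server

    def release(self, server):
        self.active[server] -= 1


rr = RoundRobin(["A", "B", "C"])
print([rr.pick() for _ in range(4)])  # ['A', 'B', 'C', 'A']
```

Least Connections is the only one of the three that needs feedback from the servers (the `release` call), which is why it adapts to long-lived connections while the other two do not.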

### 5.2 Layer 4 vs Layer 7 Load Balancing

#### Layer 4 (Transport Layer)

- Operates on **TCP/UDP** packets

- Routes based on **IP address + port**

- **Very fast** — no packet inspection

- Cannot make content-aware decisions

- *Example tools: HAProxy (TCP mode), AWS NLB*

#### Layer 7 (Application Layer)

- Operates on **HTTP/HTTPS** content

- Can route based on **URL path, headers, cookies, content type**

- More powerful but slightly higher overhead

- Enables **A/B testing, canary deployments, content-based routing**

- *Example tools: Nginx, HAProxy (HTTP mode), AWS ALB*

```

L7 Load Balancer routing:

/api/* ──► API Server Farm

/static/* ──► CDN / Static Server Farm

/ws/* ──► WebSocket Server Farm

/admin/* ──► Admin Server (with auth)

```

### 5.3 Health Checks

Load balancers continuously monitor server health:

```

Active Health Check:

Load Balancer ──► GET /health ──► Server

◄── 200 OK ◄──

If 200: server stays in pool ✅

If timeout/5xx: server removed from pool ❌

```

### 5.4 DNS Load Balancing

DNS can also be used for basic load balancing by returning multiple A records:

```

example.com → [93.184.216.34, 198.51.100.1, 203.0.113.2]

Client selects one (usually first)

TTL controls how long it's cached

```

*Limitations: No health checking, TTL delays failover*

### 5.5 Global Server Load Balancing (GSLB)

For **multi-region** systems, GSLB routes users to the geographically closest healthy datacenter:

```

User in India ──► Mumbai DC

User in USA ──► Virginia DC

User in Europe ──► Frankfurt DC

User in Australia ──► Singapore DC

```

---

## 6. Caching {#caching}

> ***"There are only two hard things in Computer Science: cache invalidation and naming things."*** — Phil Karlton

Caching is storing copies of data in a faster storage layer to speed up future requests. It is one of the **most impactful optimizations** in system design.

### 6.1 Cache Hierarchy

```
Registers (< 1 ns)
↓
L1 Cache (~0.5 ns, 32-64 KB)
↓
L2 Cache (~7 ns, 256 KB - 1 MB)
↓
L3 Cache (~30 ns, 4-32 MB)
↓
RAM (~100 ns, GBs)
↓
SSD (~150 μs, TBs)
↓
Network/Remote Cache (~0.5 ms)
↓
HDD (~10 ms, TBs)
↓
Database (~10-100 ms)
```

### 6.2 Caching Strategies

#### Cache-Aside (Lazy Loading) — Most Common

```

Read flow:

App ──► Cache ──► Miss? ──► Database ──► Store in Cache ──► Return to App

Write flow:

App ──► Write to Database ──► Invalidate Cache

```

**Pros:** Only caches what's actually needed

**Cons:** Cache miss causes 3 network trips; possible stale data
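The read and write flows for cache-aside can be sketched with two dictionaries standing in for the cache and the database. Everything here is illustrative: the class name and the `db_reads` counter are assumptions, and a real deployment would talk to something like Redis and a SQL database instead.

```python
class CacheAsideStore:
    def __init__(self, database: dict):
        self.database = database  # stands in for the real database
        self.cache = {}           # stands in for Redis/Memcached
        self.db_reads = 0         # counts how often the database is hit

    def get(self, key):
        if key in self.cache:             # cache hit
            return self.cache[key]
        self.db_reads += 1                # cache miss: the app reads the database
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value       # populate the cache for later reads
        return value

    def put(self, key, value):
        self.database[key] = value        # write to the database first
        self.cache.pop(key, None)         # then invalidate the cached copy


store = CacheAsideStore({"user:1": "alice"})
store.get("user:1")      # miss, loads from the database
store.get("user:1")      # hit, served from the cache
print(store.db_reads)    # 1
```

Invalidating on write rather than updating the cache keeps the write path simple; the next read repopulates the entry.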

#### Write-Through

```

Write: App ──► Cache ──► Database (synchronously)

Read: App ──► Cache (always fresh)

```

**Pros:** Cache always up-to-date

**Cons:** Write latency higher; may cache data that's never read

#### Write-Behind (Write-Back)

```

Write: App ──► Cache (immediately returns)

Cache ──► Database (asynchronously later)

```

**Pros:** Very fast writes

**Cons:** Risk of data loss if cache crashes before writing to DB

#### Read-Through

```

Read: App ──► Cache ──► Cache fetches from DB on miss (not the app)

```

**Pros:** Simplified application code

**Cons:** Cache miss still slow; cold start problem

### 6.3 Cache Eviction Policies

When the cache is full, **which data gets removed?**

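| Policy | Evicts | Good fit | Drawback |
|---|---|---|---|
| **LRU** (Least Recently Used) | The entry that has gone longest without being accessed | General-purpose workloads with temporal locality | A large one-off scan can flush hot entries |
| **LFU** (Least Frequently Used) | The entry with the fewest accesses | Stable popularity distributions | Once-popular entries linger; needs per-entry counters |
| **FIFO** (First In, First Out) | The oldest inserted entry | Simple caches where age tracks usefulness | Ignores how recently or often an entry is used |
| **Random** | A randomly chosen entry | Very cheap eviction with no bookkeeping | Can evict hot entries |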
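As one concrete policy, least-recently-used (LRU) eviction removes whichever entry has gone longest without being read or written. A minimal sketch using the standard library's `OrderedDict` follows; the class name and method names are assumptions for the example.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()  # ordered from least to most recently used

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # a read makes the entry most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")           # "a" is now the most recently used
cache.put("c", 3)        # evicts "b"
print(cache.get("b"))    # None
```

For caching function results in Python, `functools.lru_cache` provides the same policy without any custom code.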

### 6.4 Cache Invalidation

This is the hard part. Three main approaches:

1. **TTL (Time To Live)** — Cache expires after N seconds automatically

```

SET key value EX 3600 # expires in 1 hour

```

2. **Event-based invalidation** — Invalidate when data changes

```

User updates profile → DELETE cache:user:123

```

3. **Cache versioning** — Include version in cache key

```

cache:user:123:v5 # bump version on update

```

### 6.5 Cache Problems and Solutions

#### Cache Stampede (Thundering Herd) 🐘

**Problem:** Cache expires → thousands of requests hit database simultaneously

**Solutions:**

- **Mutex/Lock** — Only one request refreshes cache; others wait

- **Probabilistic Early Expiration** — Randomly refresh before expiry

- **Stale-While-Revalidate** — Serve stale data while refreshing in background
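The mutex approach can be sketched as a per-key lock with a second check inside it, so that only the first request for a missing key runs the expensive load. This is an in-process illustration with assumed names; across multiple application servers the lock would have to live in a shared system, and the lock table here is never cleaned up.

```python
import threading


class SingleFlightCache:
    def __init__(self, loader):
        self.loader = loader              # expensive function, e.g. a database query
        self.cache = {}
        self._locks = {}
        self._guard = threading.Lock()    # protects the lock table itself

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have filled the cache while this one waited.
            if key not in self.cache:
                self.cache[key] = self.loader(key)
        return self.cache[key]
```

With twenty threads requesting the same cold key at once, `loader` runs a single time; the other nineteen wait on the lock and then read the cached value.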

#### Cache Penetration 🕳️

**Problem:** Requests for **non-existent data** always bypass cache and hit database

```

Malicious user queries: user_id=-1, user_id=-2, user_id=-3...

Each misses cache and hammers database

```

**Solutions:**

- **Cache null results** — Store "NOT FOUND" in cache with short TTL

- **Bloom Filter** — Probabilistic data structure to check existence without DB hit
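A Bloom filter answers "definitely not present" or "possibly present", which is exactly what is needed to reject lookups for IDs that were never stored. The sketch below is a toy version with assumed sizes and names; real deployments size the bit array and hash count from the expected number of items and the acceptable false-positive rate.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means present or a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


known_users = BloomFilter()
for user_id in range(1, 101):
    known_users.add(f"user:{user_id}")

print(known_users.might_contain("user:42"))  # True: added items are never missed
```

A request handler would check `might_contain` first and return "not found" immediately on `False`, touching the cache and database only when the filter says the key may exist.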

#### Cache Avalanche ❄️

**Problem:** Many cache keys expire at the **same time** → massive DB load

**Solutions:**

- Add **random jitter** to TTL values: `TTL = base_ttl + random(0, base_ttl * 0.1)`

- Use **different TTLs** for different data types

- **Warm up cache** gradually before traffic shift

### 6.6 Distributed Caching Systems

#### Redis

```

Features:

- In-memory key-value store

- Rich data structures: String, Hash, List, Set, Sorted Set, Stream

- Persistence: RDB snapshots + AOF logs

- Pub/Sub messaging

- Cluster mode (horizontal scaling)

- Sentinel (high availability)

Use Cases:

- Session management

- Rate limiting

- Leaderboards (Sorted Sets)

- Pub/Sub

- Distributed locks

```

#### Memcached

```

Features:

- Simple key-value store

- Multi-threaded (can use multiple cores per node, which often helps raw get/set throughput compared with Redis's single-threaded command execution)

- No persistence

- No replication (simpler)

Use Cases:

- Simple object caching

- When you need maximum throughput

- When data loss is acceptable

```

### 6.7 CDN (Content Delivery Network)

A CDN is a **geographically distributed** cache for static content:

```

Without CDN:

User in Tokyo ──────────────────► Origin Server in Virginia

150ms RTT

With CDN:

User in Tokyo ──► CDN Edge in Tokyo ──► (cache hit!)

5ms RTT

(cache miss) ──► Origin Server in Virginia

(refills CDN edge)

```

**What CDNs cache:**

- Static assets (images, CSS, JS)

- Videos (HLS/DASH chunks)

- API responses (with cache headers)

- HTML pages (for static sites)

**Popular CDNs:** Cloudflare, AWS CloudFront, Fastly, Akamai

---

## 7. Databases {#databases}

Databases are the **backbone of most systems**. Choosing the right database is one of the most consequential system design decisions.

### 7.1 Relational Databases (SQL)

Relational databases store data in **structured tables** with **predefined schemas** and use **SQL** for querying.

```

Users Table:

┌────┬──────────┬────────────────────┬────────────────────────┐

│ id │ username │ email │ created_at │

├────┼──────────┼────────────────────┼────────────────────────┤

│ 1 │ alice │ alice@example.com │ 2024-01-15 10:30:00 │

│ 2 │ bob │ bob@example.com │ 2024-01-16 14:22:00 │

│ 3 │ charlie │ charlie@example.com│ 2024-01-17 09:15:00 │

└────┴──────────┴────────────────────┴────────────────────────┘

```

#### ACID Properties

The gold standard for database transactions:

| Property | Definition | Example |
|---|---|---|
| **A**tomicity | All-or-nothing transactions | Bank transfer: debit + credit both succeed or both fail |
| **C**onsistency | Data always in valid state | Cannot have negative balance |
| **I**solation | Concurrent transactions don't interfere | Two simultaneous transfers don't corrupt data |
| **D**urability | Committed data survives failures | After power outage, committed transaction persists |

#### Transaction Isolation Levels

```

READ UNCOMMITTED ← weakest (dirty reads possible)

READ COMMITTED

REPEATABLE READ

SERIALIZABLE ← strongest (most consistent but slowest)

```

**Common concurrency problems:**

- **Dirty Read** — Reading uncommitted data from another transaction

- **Non-Repeatable Read** — Same query returns different results in same transaction

- **Phantom Read** — New rows appear between reads in same transaction

### 7.2 NoSQL Databases

NoSQL databases trade some ACID guarantees for **flexibility, scalability, and performance**.

#### Key-Value Stores

```

Key: "user:123:session"

Value: {"token": "abc123", "expires": "2024-12-31"}

Examples: Redis, DynamoDB, Memcached

Best for: Sessions, caching, leaderboards

```

#### Document Databases

```json
{
  "_id": "user_123",
  "name": "Alice Johnson",
  "email": "alice@example.com",
  "addresses": [
    {"type": "home", "city": "New York"},
    {"type": "work", "city": "San Francisco"}
  ],
  "preferences": {
    "theme": "dark",
    "notifications": true
  }
}
```

*Examples: MongoDB, CouchDB, Firestore*

*Best for: User profiles, product catalogs, content management*

#### Column-Family Stores (Wide-Column)

```

Row Key: "user_123"

Columns:

profile: {name: "Alice", email: "alice@example.com"}

activity: {last_login: "2024-01-15", login_count: "142"}

settings: {theme: "dark", lang: "en"}

Examples: Apache Cassandra, HBase, ScyllaDB

Best for: Time-series data, IoT, write-heavy workloads

```

#### Graph Databases

```

(Alice) ──[FOLLOWS]──► (Bob)

(Alice) ──[LIKES]────► (Post:123)

(Bob) ──[CREATED]──► (Post:123)

(Bob) ──[FOLLOWS]──► (Charlie)

Examples: Neo4j, Amazon Neptune, ArangoDB

Best for: Social networks, fraud detection, recommendation engines

```

#### Time-Series Databases

```

timestamp | sensor_id | temperature | humidity

--------------------|-----------|-------------|----------

2024-01-15 10:00:00 | sensor_1 | 23.5 | 65.2

2024-01-15 10:00:01 | sensor_1 | 23.6 | 65.1

2024-01-15 10:00:02 | sensor_1 | 23.4 | 65.3

Examples: InfluxDB, TimescaleDB, Prometheus

Best for: Metrics, monitoring, IoT, financial data

```

### 7.3 Database Indexing

An index is a **data structure** that improves the speed of data retrieval operations.

```

Without Index:

SELECT * FROM users WHERE email = 'alice@example.com'

→ Full table scan: check ALL 10 million rows = O(n) = SLOW ❌

With Index on email:

→ B-tree lookup: O(log n) = FAST ✅

```

#### Types of Indexes

**B-Tree Index** — Most common, good for range queries and equality

```sql

CREATE INDEX idx_users_email ON users(email);

CREATE INDEX idx_orders_date ON orders(created_at);

```

**Hash Index** — Only for equality queries, very fast

```

hash("alice@example.com") → bucket 4829 → row pointer

```

**Composite Index** — Multiple columns

```sql

CREATE INDEX idx_orders ON orders(user_id, status, created_at);

-- Efficient for: WHERE user_id = 1 AND status = 'pending'

-- Uses the leftmost-prefix rule: the index helps only when the filter starts with user_id

```

**Covering Index** — Index contains all columns needed by query

```sql

-- Query:

SELECT id, email FROM users WHERE username = 'alice';

-- Covering index:

CREATE INDEX idx_covering ON users(username, id, email);

-- Never touches actual table rows!

```

**Full-Text Index** — For text search

```sql

CREATE FULLTEXT INDEX idx_content ON posts(title, body);

SELECT * FROM posts WHERE MATCH(title, body) AGAINST('system design');

```

#### The Index Trade-off

> ⚖️ **More indexes = faster reads, slower writes, more storage**

Every INSERT/UPDATE/DELETE must update all indexes on the table.

### 7.4 Database Replication

Replication copies data to **multiple servers** for availability and read scaling.

#### Primary-Replica (Master-Slave) Replication

```

Writes

│

▼

┌────────────┐

│ Primary │──┬──► Replica 1 (async)

│ (Master) │ ├──► Replica 2 (async)

└────────────┘ └──► Replica 3 (async)

Reads ──► Any Replica

Writes ──► Primary Only

```

**Synchronous replication:** Primary waits for replica to confirm write

*Pros:* No data loss | *Cons:* Higher write latency

**Asynchronous replication:** Primary doesn't wait

*Pros:* Lower write latency | *Cons:* Potential data loss (replication lag)

#### Multi-Primary (Multi-Master) Replication

```

┌──────────┐ sync/async ┌──────────┐

│ Primary A│ ◄────────────► │ Primary B│

└──────────┘ └──────────┘

Writes + Reads Writes + Reads

```

*Allows writes to multiple nodes — more complex conflict resolution needed.*

### 7.5 Database Sharding (Horizontal Partitioning)

Sharding splits data across multiple database instances, each holding a **subset of the data**.

```

Without Sharding: With Sharding:

┌──────────────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐

│ Database │ │ Shard 1 │ │ Shard 2 │ │ Shard 3 │

│ users: 1-100M │ ──► │ users: │ │ users: │ │ users: │

│ orders: 1-500M │ │ 1-33M │ │ 34-66M │ │ 67-100M │

└──────────────────┘ └──────────┘ └──────────┘ └──────────┘

```

#### Sharding Strategies

**Range-based sharding:**

```

user_id 1-1,000,000 → Shard 1

user_id 1,000,001-2M → Shard 2

user_id 2,000,001-3M → Shard 3

```

*Problem: Hot spots if users are not uniformly distributed*

**Hash-based sharding:**

```

shard = hash(user_id) % num_shards

hash("user_123") % 3 = Shard 2

hash("user_456") % 3 = Shard 0

```

*Problem: Resharding when adding shards (consistent hashing solves this)*
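The resharding problem is easy to see by counting how many keys land on a different shard when a fourth shard is added. The snippet below is a sketch with assumed helper names; it uses a SHA-256 digest because Python's built-in `hash()` is randomized per process for strings and would not give stable shard assignments.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Hash that returns the same value in every process and on every machine."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shard_for(key: str, num_shards: int) -> int:
    return stable_hash(key) % num_shards


keys = [f"user_{i}" for i in range(10_000)]
moved = sum(1 for k in keys if shard_for(k, 3) != shard_for(k, 4))
print(f"{moved / len(keys):.0%} of keys change shard when going from 3 to 4 shards")
```

A key stays put only when its hash gives the same remainder modulo 3 and modulo 4, which happens for about a quarter of hash values, so roughly three quarters of the data has to move.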

**Consistent Hashing:**

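```
Hash ring (positions shown as degrees; real rings use a range such as 0 to 2^32 - 1):

      0° / 360°
         │
       ──┼── Node A (at position 100°)
         │
       ──┼── Node B (at position 200°)
         │
       ──┼── Node C (at position 300°)
         │
      360° (wraps around to 0°)

A key maps to the first node clockwise from its hash position.
Adding a node remaps only about 1/N of the keys (N = node count after the addition).
```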
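A hash ring can be sketched with a sorted list and binary search. This is an illustrative implementation under assumptions: the `HashRing` class, the 100 virtual nodes per server, and the SHA-256-based position function are choices made for the example, not a reference design.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Stable position on a ring of size 2^32."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes  # virtual nodes per server, to even out the load
        self._ring = []       # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (ring_position(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def get_node(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring entry at or after the key's position, wrapping past the end.
        index = bisect.bisect_left(self._ring, (ring_position(key),))
        return self._ring[index % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user_{i}" for i in range(1000)]
before = {k: ring.get_node(k) for k in keys}
ring.add_node("node-d")
moved = [k for k in keys if ring.get_node(k) != before[k]]
print(f"{len(moved) / len(keys):.0%} of keys moved, all of them to node-d")
```

Every key that moves goes to the new node and nothing is shuffled between the existing ones, which is the property that makes adding capacity cheap.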

#### Sharding Challenges

- ❌ **Cross-shard queries** — JOINs across shards are expensive

- ❌ **Non-uniform distribution** — Hot shards

- ❌ **Rebalancing** — Moving data when adding/removing shards

- ❌ **Global transactions** — ACID across shards is very complex

### 7.6 SQL vs NoSQL — When to Use What

| Criteria | SQL | NoSQL |
|---|---|---|
| **Data Structure** | Well-defined schema | Flexible / dynamic schema |
| **Relationships** | Complex joins needed | Denormalized / embedded |
| **Scaling** | Vertical (mostly) | Horizontal (designed for) |
| **Consistency** | Strong (ACID) | Eventual (usually) |
| **Query Complexity** | Complex SQL queries | Simple key-based lookups |
| **Write Pattern** | Moderate writes | High-volume writes |
| **Use Cases** | Finance, ERP, CRM | Social media, IoT, catalog |

### 7.7 NewSQL Databases

NewSQL databases aim to provide the **scalability of NoSQL with the ACID guarantees of SQL**:

- **Google Spanner** — Globally distributed SQL with TrueTime

- **CockroachDB** — PostgreSQL-compatible distributed database

- **TiDB** — MySQL-compatible distributed database

- **YugabyteDB** — PostgreSQL-compatible distributed database

---

## 8. Distributed Systems {#distributed-systems}

### 8.1 The Fallacies of Distributed Computing

L. Peter Deutsch and others at Sun Microsystems, including James Gosling, identified **8 fallacies**: false assumptions that developers often make about distributed systems:

1. 🚫 ~~The network is reliable~~

2. 🚫 ~~Latency is zero~~

3. 🚫 ~~Bandwidth is infinite~~

4. 🚫 ~~The network is secure~~

5. 🚫 ~~Topology doesn't change~~

6. 🚫 ~~There is one administrator~~

7. 🚫 ~~Transport cost is zero~~

8. 🚫 ~~The network is homogeneous~~

> **Every distributed system design must account for these realities.**

### 8.2 Consistency Models

#### Strong Consistency

Every read returns the **most recent write**. After a write completes, all subsequent reads from any node will return that value.

*Trade-off: Higher latency, lower availability*

#### Eventual Consistency

Given enough time with no new updates, all replicas will **converge to the same value**.

```

t=0: Write "Alice" to Node A

t=1: Read from Node B → "Bob" (stale) ← temporarily inconsistent

t=5: Node B syncs from Node A

t=6: Read from Node B → "Alice" ✅ (converged)

```

#### Read-Your-Writes Consistency

A user **always sees their own writes**, even if other users might see stale data.

*Implementation: Route reads from same user to same replica (or use primary)*

#### Monotonic Read Consistency

A user never reads **older data** after reading newer data (no time travel backwards).

#### Causal Consistency

Operations that are **causally related** are seen in order:

```

Alice posts message → Bob replies to message

All users who see Bob's reply must also see Alice's original message

```

### 8.3 Distributed Transactions

#### Two-Phase Commit (2PC)

```

Phase 1: Prepare

Coordinator ──► "Prepare to commit?" ──► Participant A

Coordinator ──► "Prepare to commit?" ──► Participant B

◄── "Ready" ─────────────

Phase 2: Commit

Coordinator ──► "Commit!" ──► Participant A

Coordinator ──► "Commit!" ──► Participant B

If ANY participant says "Abort" in Phase 1:

Coordinator ──► "Rollback!" ──► All Participants

```

**Problems with 2PC:**

- Blocking protocol — if coordinator fails between phases, participants are stuck

- Poor performance (2 round trips)
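
The prepare/commit exchange above can be sketched in a few lines. This is a minimal in-memory illustration, not a production protocol: the `Participant` class and `two_phase_commit` function are hypothetical names, and the sketch ignores timeouts, durable logging, and coordinator failure, which are exactly the hard parts noted in the problems list.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical participant that votes in phase 1 and applies the decision in phase 2."""
    name: str
    can_commit: bool = True
    state: str = "init"

    def prepare(self) -> bool:
        # Phase 1: vote "ready" only if the local transaction can be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: list[Participant]) -> bool:
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2: unanimous "ready", so every participant commits.
        for p in participants:
            p.commit()
        return True
    # Phase 2: at least one participant voted to abort, so everyone rolls back.
    for p in participants:
        p.rollback()
    return False
```

A real coordinator would persist its decision before sending phase 2 messages so that it can resume after a crash.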

#### SAGA Pattern

Saga breaks a **distributed transaction into a sequence of local transactions**, each with a **compensating transaction** for rollback:

```

Create Order → Reserve Inventory → Process Payment → Ship Order

↓ fails ↓ fails ↓ fails

Cancel Order ← Unreserve Inv ← Refund Payment

```

*Two implementations:*

- **Choreography** — Each service publishes events; next service listens

- **Orchestration** — A central orchestrator tells each service what to do
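
As a rough sketch of the orchestration variant, the function below runs local steps in order and, when one fails, runs the compensations of the already completed steps in reverse. The step names and the `run_saga` helper are illustrative assumptions; a real orchestrator would also persist its progress and retry compensations.

```python
from typing import Callable

# (name, action, compensation); all three are hypothetical placeholders.
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: list[str] = []
    completed: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"{name}: failed")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"{done_name}: compensated")
            return log
        completed.append(step)
        log.append(f"{name}: done")
    return log


def noop() -> None:
    pass


def fail_payment() -> None:
    raise RuntimeError("card declined")


steps = [
    ("create_order", noop, noop),
    ("reserve_inventory", noop, noop),
    ("process_payment", fail_payment, noop),
    ("ship_order", noop, noop),
]
result = run_saga(steps)
```

Here the payment step fails, so the inventory reservation and the order are undone and the shipping step never runs.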

### 8.4 Distributed Consensus

How do distributed nodes **agree on a value** when nodes can fail?

#### Paxos (The Classic, Hard to Understand)

Paxos is a family of protocols for solving consensus in a network of unreliable processors.

Key roles: **Proposer**, **Acceptor**, **Learner**

#### Raft (The Understandable Alternative)

Raft decomposes consensus into **leader election** + **log replication**:

```

Normal operation:

Leader ──► AppendEntries RPC ──► Follower 1

──► AppendEntries RPC ──► Follower 2

──► AppendEntries RPC ──► Follower 3

Majority acknowledges → entry is committed

Leader fails:

Follower 1: hasn't heard from leader (election timeout)

Follower 1: becomes Candidate, sends RequestVote to others

Majority votes for Candidate → New Leader elected ✅

```

*Used by: etcd, CockroachDB, TiKV, Consul*

#### ZooKeeper (ZAB Protocol)

ZooKeeper uses **ZAB (ZooKeeper Atomic Broadcast)** for coordination.

*Used for: Service discovery, distributed locking, configuration management*

### 8.5 Vector Clocks and Conflict Resolution

In distributed systems with no central clock, we need **logical clocks** to track causality:

```

Vector Clock: [A:0, B:0, C:0]

A sends event: [A:1, B:0, C:0]

B receives, sends: [A:1, B:1, C:0]

C receives, sends: [A:1, B:1, C:1]

If A and B have concurrent events:

A: [A:2, B:1, C:1]

B: [A:1, B:2, C:1]

Neither dominates → conflict! Must be resolved by application.

```

*Conflict resolution strategies:*

- **Last-write-wins** (LWW) — Use timestamps

- **Application-level merge** — e.g., CRDT (Conflict-Free Replicated Data Types)

- **User-level resolution** — Present both versions to user (Git-style)
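
The "neither dominates" check can be made concrete with a small comparison function. This is a minimal sketch that represents a vector clock as a plain dictionary; the function names `merge` and `compare` are assumptions for illustration.

```python
def merge(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Element-wise maximum, applied when a node receives another node's clock."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def compare(a: dict[str, int], b: dict[str, int]) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for clock a relative to b."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b
    if b_le_a:
        return "after"       # b happened before a
    return "concurrent"      # neither dominates, so the writes conflict


clock_a = {"A": 2, "B": 1, "C": 1}
clock_b = {"A": 1, "B": 2, "C": 1}
relation = compare(clock_a, clock_b)
```

Only the `concurrent` case needs one of the resolution strategies listed above; in the other cases the later clock simply wins.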

---

## 9. Microservices Architecture {#microservices}

### 9.1 Monolith vs Microservices

#### Monolithic Architecture

```

┌────────────────────────────────────────┐

│ Monolith Application │

│ ┌──────────┐ ┌──────────┐ │

│ │ User │ │ Payment │ │

│ │ Service │ │ Service │ │

│ └──────────┘ └──────────┘ │

│ ┌──────────┐ ┌──────────┐ │

│ │ Product │ │ Order │ │

│ │ Service │ │ Service │ │

│ └──────────┘ └──────────┘ │

│ Single Database │

└────────────────────────────────────────┘

```

**Pros:**

- ✅ Simple development, testing, deployment

- ✅ Low latency (in-process calls)

- ✅ Easy transactions (single DB)

**Cons:**

- ❌ Scaling one component means scaling the whole application

- ❌ Long build and deploy times

- ❌ Tech stack lock-in

- ❌ Single point of failure

#### Microservices Architecture

```

API Gateway

│

┌────────┼────────┐

│ │ │

▼ ▼ ▼

┌───────┐ ┌───────┐ ┌───────┐

│ User │ │Product│ │ Order │

│Service│ │Service│ │Service│

│ DB │ │ DB │ │ DB │

└───────┘ └───────┘ └───────┘

│ │

└──── Message Bus ──┘

```

**Pros:**

- ✅ Independent scaling per service

- ✅ Independent deployment

- ✅ Technology diversity

- ✅ Fault isolation

**Cons:**

- ❌ Network latency between services

- ❌ Distributed system complexity

- ❌ Data consistency challenges

- ❌ Operational overhead

### 9.2 Service Communication Patterns

#### Synchronous Communication

```

Service A ──► HTTP/gRPC ──► Service B ──► Response ──► Service A

```

*Simple but creates coupling and cascading failures*

#### Asynchronous Communication

```

Service A ──► Message Queue ──► Service B (processes independently)

```

*Decoupled but adds complexity and eventual consistency*

### 9.3 API Gateway

An API Gateway is the **single entry point** for all clients:

```

Mobile App ──┐

Web App ──┤

3rd Party ──┼──► API Gateway ──► Authentication

Partners ──┤ Rate Limiting

IoT Devices ──┘ Routing

Load Balancing

SSL Termination

Request Transformation

Monitoring/Analytics

│

┌────────────┼────────────┐

▼ ▼ ▼

User Service Order Service Payment Service

```

*Examples: Kong, AWS API Gateway, Netflix Zuul, Nginx*

### 9.4 Service Discovery

With many microservices, how do services **find each other**?

#### Client-Side Discovery

```

Service A ──► Service Registry (Consul/Eureka) ──► "User Service is at 10.0.0.5:8080"

Service A ──► 10.0.0.5:8080 (direct call)

```

#### Server-Side Discovery

```

Service A ──► Load Balancer ──► Service Registry ──► Routes to healthy instance

```

### 9.5 The Strangler Fig Pattern

A proven strategy to **migrate from monolith to microservices** incrementally:

```

Phase 1: All traffic to Monolith

Phase 2: Extract User Service → Route /users to microservice

Phase 3: Extract Payment Service → Route /payments to microservice

Phase 4: Extract Order Service → Route /orders to microservice

Phase N: Monolith is "strangled" — all functionality migrated

```

### 9.6 Twelve-Factor App Methodology

The **12-Factor App** is a methodology for building cloud-native, scalable microservices:

1. **Codebase** — One codebase, many deploys

2. **Dependencies** — Explicitly declare dependencies

3. **Config** — Store config in environment variables

4. **Backing Services** — Treat as attached resources

5. **Build, Release, Run** — Strictly separate stages

6. **Processes** — Execute as stateless processes

7. **Port Binding** — Export services via port binding

8. **Concurrency** — Scale out via process model

9. **Disposability** — Fast startup and graceful shutdown

10. **Dev/Prod Parity** — Keep environments similar

11. **Logs** — Treat as event streams

12. **Admin Processes** — Run as one-off processes

---

## 10. Message Queues & Event-Driven Architecture {#message-queues}

### 10.1 Why Message Queues?

Message queues provide **asynchronous communication**, **decoupling**, and **buffering** between services.

```

Without Queue:

Order Service ──► Payment Service

If Payment Service is down → Order Service fails ❌

With Queue:

Order Service ──► Queue ──► Payment Service

If Payment Service is down → messages queue up ✅

When it comes back up → processes all queued messages ✅

```

**Benefits:**

- **Decoupling** — Producer and consumer are independent

- **Load leveling** — Handle traffic spikes without overloading downstream

- **Reliability** — Messages persist even if consumer is down

- **Parallel processing** — Multiple consumers process simultaneously

- **Retry logic** — Failed messages can be retried

### 10.2 Message Queue Models

#### Point-to-Point (Queue)

```

Producer ──► [ Queue ] ──► Consumer A

(each message consumed by exactly ONE consumer)

```

*Use case: Task distribution, order processing*

#### Publish-Subscribe (Topic)

```

Publisher ──► [ Topic ] ──► Subscriber A

└──► Subscriber B

└──► Subscriber C

(each message delivered to ALL subscribers)

```

*Use case: Event notifications, logging, analytics*

### 10.3 Apache Kafka

Kafka is a **distributed event streaming platform** designed for high-throughput:

```

Architecture:

Producers ──► Topics (partitioned) ──► Consumer Groups

Topics: log_events, user_signups, order_placed

Partitions: distribute across brokers for parallelism

Consumer Groups: multiple consumers share the load

Partition 0 ──► Consumer 1

Partition 1 ──► Consumer 2

Partition 2 ──► Consumer 3

```

**Key Kafka Concepts:**

| Concept | Description |
|---|---|
| **Topic** | A category/feed of messages |
| **Partition** | Ordered, immutable sequence of records |
| **Offset** | Position of a message in a partition |
| **Consumer Group** | Group of consumers sharing partitions |
| **Broker** | A Kafka server |
| **Retention** | How long messages are kept (default: 7 days) |

**Kafka vs Traditional Message Queues:**

| Feature | Kafka | RabbitMQ/SQS |
|---|---|---|
| **Message retention** | Persists (days/forever) | Deleted after consumption |
| **Throughput** | Millions/sec | Thousands/sec |
| **Ordering** | Per-partition | Per-queue (FIFO) |
| **Replay** | Yes (re-read old messages) | No |
| **Use Case** | Event streaming, log aggregation | Task queues, RPC |

### 10.4 Event-Driven Architecture Patterns

#### Event Sourcing

Instead of storing **current state**, store **all events** that led to that state:

```

Traditional: Store current balance

Account: { id: 123, balance: $450 }

Event Sourcing: Store all events

Event 1: AccountOpened { amount: $1000 }

Event 2: Withdrawal { amount: $200 }

Event 3: Deposit { amount: $150 }

Event 4: Withdrawal { amount: $500 }

─────────────────────────────────────

Replay all events → Balance = $450

```

**Benefits:**

- Complete audit trail

- Time travel (reconstruct past states)

- Natural fit for event-driven systems

**Drawbacks:**

- Querying current state requires replaying events

- Eventual consistency

- Schema evolution complexity
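
To make the replay step concrete, here is a minimal sketch that folds the four events from the example into a balance. The `Event` class and `replay` function are hypothetical names, and amounts are whole dollars to keep the arithmetic exact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "AccountOpened", "Deposit", or "Withdrawal"
    amount: int  # whole dollars


def replay(events: list[Event]) -> int:
    """Rebuild the current balance by applying every event in order."""
    balance = 0
    for event in events:
        if event.kind in ("AccountOpened", "Deposit"):
            balance += event.amount
        elif event.kind == "Withdrawal":
            balance -= event.amount
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return balance


events = [
    Event("AccountOpened", 1000),
    Event("Withdrawal", 200),
    Event("Deposit", 150),
    Event("Withdrawal", 500),
]
current_balance = replay(events)
balance_after_two_events = replay(events[:2])  # "time travel" to an earlier state
```

Replaying a prefix of the log is what gives event sourcing its ability to reconstruct past states; systems with long logs usually add periodic snapshots so that reads do not replay from the beginning.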

#### CQRS (Command Query Responsibility Segregation)

Separate the **write model (Commands)** from the **read model (Queries)**:

```

User Action

│

├──► Command Model (Write) ──► Event Store ──► Projects to

│ - Handles writes Read Model DB

│ - Validates business rules

│

└──► Query Model (Read) ──► Optimized Read DB

- Handles reads

- Denormalized for performance

```

---

## 11. API Design {#api-design}

### 11.1 REST API Design Principles

REST (Representational State Transfer) is the most widely used API design paradigm.

#### REST Constraints

1. **Client-Server** — Separation of concerns

2. **Stateless** — Each request contains all necessary information

3. **Cacheable** — Responses declare cacheability

4. **Uniform Interface** — Consistent resource identification

5. **Layered System** — Client doesn't know if it's talking to final server

6. **Code on Demand** (optional) — Server can send executable code

#### RESTful Resource Design

```

✅ Good REST API Design:

GET /users → List all users

POST /users → Create a user

GET /users/{id} → Get specific user

PUT /users/{id} → Replace specific user

PATCH /users/{id} → Partially update user

DELETE /users/{id} → Delete specific user

GET /users/{id}/orders → List user's orders

POST /users/{id}/orders → Create order for user

❌ Bad REST API Design (RPC-style):

GET /getUser

POST /createUser

POST /updateUser

GET /deleteUser ← Using GET for side effects!

POST /getUserOrders

```

#### HTTP Status Codes

```

2xx — Success

200 OK

201 Created

204 No Content

3xx — Redirection

301 Moved Permanently

302 Found (temporary redirect)

304 Not Modified (cached response still valid)

4xx — Client Errors

400 Bad Request

401 Unauthorized (not authenticated)

403 Forbidden (authenticated but not authorized)

404 Not Found

405 Method Not Allowed

409 Conflict

422 Unprocessable Entity

429 Too Many Requests

5xx — Server Errors

500 Internal Server Error

502 Bad Gateway

503 Service Unavailable

504 Gateway Timeout

```

### 11.2 GraphQL

GraphQL is a **query language for APIs** that lets clients request exactly the data they need.

```graphql

# REST problem: over-fetching and under-fetching
#   GET /users/123 → returns ALL user fields (over-fetch)
#   GET /users/123, GET /users/123/posts, GET /users/123/followers → multiple requests (under-fetch)

# GraphQL solution:
query {
  user(id: "123") {
    name
    posts(last: 3) {
      title
      createdAt
    }
    followers {
      count
    }
  }
}

# Returns EXACTLY what was requested in ONE request ✅

```

**GraphQL vs REST:**

| Feature | REST | GraphQL |
|---|---|---|
| **Over-fetching** | Common | Eliminated |
| **Under-fetching** | Common (N+1) | Eliminated |
| **Versioning** | URL versioning needed | Schema evolution |
| **Caching** | HTTP caching built-in | Complex (no URL per query) |
| **Learning curve** | Low | Medium |
| **Type Safety** | Depends on tooling | Built-in |

### 11.3 gRPC

gRPC is a **high-performance RPC framework** using **Protocol Buffers**:

```protobuf

// Define service in .proto file

service UserService {

rpc GetUser(GetUserRequest) returns (User);

rpc ListUsers(ListUsersRequest) returns (stream User);

rpc CreateUser(CreateUserRequest) returns (User);

rpc WatchUser(WatchUserRequest) returns (stream UserEvent);

}

message User {

int64 id = 1;

string name = 2;

string email = 3;

}

```

**gRPC vs REST:**

| Feature | REST | gRPC |
|---|---|---|
| **Protocol** | HTTP/1.1 (usually) | HTTP/2 |
| **Payload** | JSON (verbose) | Protobuf (binary, compact) |
| **Streaming** | Limited | Native (bidirectional) |
| **Contract** | OpenAPI/Swagger | .proto files |
| **Performance** | Baseline | Often faster (compact binary payloads, HTTP/2 multiplexing); gains depend on workload |
| **Browser support** | Native | Requires proxy |
| **Use Case** | Public APIs | Internal microservices |

### 11.4 API Rate Limiting

Rate limiting protects APIs from abuse and ensures fair usage.

#### Rate Limiting Algorithms

**Token Bucket:**

```

Bucket capacity: 100 tokens

Refill rate: 10 tokens/second

Request arrives: consume 1 token

No tokens left: request rejected (429)

Allows burst up to bucket capacity ✅

```
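
A token bucket fits in a short class. The sketch below uses the capacity and refill rate from the example above; the `TokenBucket` name and the choice to pass the current time in explicitly (which keeps the behavior deterministic) are assumptions made for illustration.

```python
class TokenBucket:
    """Token bucket limiter; the caller supplies the current time in seconds."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = max(self.last, now)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller would respond with HTTP 429


bucket = TokenBucket(capacity=100, refill_rate=10)
burst_allowed = sum(bucket.allow(now=0.0) for _ in range(150))
allowed_one_second_later = sum(bucket.allow(now=1.0) for _ in range(20))
```

In a service the `now` argument would typically come from `time.monotonic()`, and the bucket state would live in a shared store such as Redis when several instances enforce the same limit.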

**Leaky Bucket:**

```

Requests enter bucket (queue)

Bucket leaks at constant rate (e.g., 10 req/sec)

Bucket overflows: reject request

Smooths bursty traffic ✅

```

**Fixed Window Counter:**

```

Window: [10:00:00 - 10:01:00]

Limit: 100 requests/minute

Counter: 0

Each request: increment counter

Counter > 100: reject (429)

At 10:01:00: reset counter to 0

Problem: Burst at window boundary (200 req in 2 seconds)

```

**Sliding Window Log:**

```

Store timestamp of each request in log

For each new request:

Remove timestamps older than window

If len(log) >= limit: reject

Else: add timestamp, allow

Most accurate but memory intensive

```
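
The sliding window log pseudocode above translates almost line for line. This is a minimal single-process sketch; the `SlidingWindowLog` class name is an assumption, and timestamps are passed in explicitly so the example is deterministic.

```python
from collections import deque


class SlidingWindowLog:
    """Allow at most `limit` requests in any window of `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.log: deque[float] = deque()  # timestamps of accepted requests

    def allow(self, now: float) -> bool:
        # Remove timestamps that have fallen out of the window.
        cutoff = now - self.window
        while self.log and self.log[0] <= cutoff:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True


limiter = SlidingWindowLog(limit=3, window_seconds=60)
first_three = [limiter.allow(t) for t in (0.0, 10.0, 20.0)]
fourth = limiter.allow(30.0)
after_oldest_expires = limiter.allow(60.0)
```

Memory grows with the limit because one timestamp is stored per accepted request, which is the cost mentioned above.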

### 11.5 API Versioning Strategies

```

1. URL Path Versioning (most common):

https://api.example.com/v1/users

https://api.example.com/v2/users

2. Query Parameter:

https://api.example.com/users?version=2

3. HTTP Header:

Accept: application/vnd.example.v2+json

4. Content Negotiation:

Accept: application/json; version=2.0

```

---

## 12. Storage Systems {#storage-systems}

### 12.1 Block Storage

Block storage divides data into **fixed-size blocks**, each with a unique address. The OS treats it like a physical disk.

```

Block Storage:

┌────┬────┬────┬────┬────┬────┬────┬────┐

│ B1 │ B2 │ B3 │ B4 │ B5 │ B6 │ B7 │ B8 │

└────┴────┴────┴────┴────┴────┴────┴────┘

File system manages which blocks belong to which files

```

*Examples: AWS EBS, Google Persistent Disk, Azure Managed Disk*

*Use Cases: Databases, virtual machines, boot volumes*

### 12.2 File Storage (Network File System)

File storage presents a **file system interface** over a network:

```

Server ──► NFS/SMB ──► Client sees: /mnt/shared/

├── file1.txt

├── folder1/

└── file2.pdf

```

*Examples: AWS EFS, Azure Files, NFS*

*Use Cases: Shared files across servers, home directories, media files*

### 12.3 Object Storage

Object storage stores data as **objects** with metadata and a unique ID. No hierarchy — flat namespace.

```

Object: {
  key: "users/profile-photos/user_123.jpg",
  data: <binary image data>,
  metadata: {
    content-type: "image/jpeg",
    size: 245678,
    uploaded-by: "user_123",
    custom: { "crop": "center" }
  }
}

```

*Examples: AWS S3, Google Cloud Storage, Azure Blob Storage*

*Use Cases: Images, videos, backups, static website hosting, data lakes*

**Key properties:**

- Virtually unlimited scalability

- No hierarchical file system

- Accessed via HTTP (REST API)

- Highly durable (AWS S3: 99.999999999% — 11 nines!)

### 12.4 RAID (Redundant Array of Independent Disks)


### 12.5 Data Replication and Disaster Recovery

#### Recovery Point Objective (RPO)

*How much data can we afford to lose?*

```

RPO = 1 hour → backup every hour → lose at most 1 hour of data

RPO = 0 → synchronous replication → no data loss acceptable

```

#### Recovery Time Objective (RTO)

*How long can we afford to be down?*

```

RTO = 4 hours → system can be offline for 4 hours during disaster

RTO = 0 → requires hot standby, automatic failover

```

#### Disaster Recovery Tiers

```

Tier 0: No recovery (backups only) RTO: Days Cost: $

Tier 1: Cold standby RTO: Hours Cost: $$

Tier 2: Warm standby RTO: Minutes Cost: $$$

Tier 3: Hot standby (active-passive) RTO: Seconds Cost: $$$$

Tier 4: Active-active RTO: ~0 Cost: $$$$$

```

---

## 13. Security in System Design {#security}

### 13.1 Authentication vs Authorization


### 13.2 Authentication Mechanisms

#### JWT (JSON Web Token)

```

Header.Payload.Signature

Header: {"alg": "HS256", "typ": "JWT"}

Payload: {

"sub": "user_123",

"email": "alice@example.com",

"role": "admin",

"iat": 1704067200,

"exp": 1704153600

}

Signature: HMACSHA256(base64url(header) + "." + base64url(payload), secret)

```

**Pros:** Stateless, self-contained, works across services

**Cons:** Hard to revoke before expiry (requires a denylist or short token lifetimes), larger than session tokens
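
To show how the three segments fit together, here is a minimal HS256 sketch built only on the Python standard library. The `sign` and `verify` helpers are hypothetical names, and the sketch checks only the signature and the `exp` claim; a production service would normally use a maintained JWT library and also validate the `alg` header, issuer, and audience.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    # JWT uses URL-safe base64 without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(payload: dict, secret: bytes) -> str:
    header = {"alg": "HS256", "typ": "JWT"}
    segments = [
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    ]
    signing_input = ".".join(segments).encode("ascii")
    signature = hmac.new(secret, signing_input, hashlib.sha256).digest()
    return ".".join(segments + [b64url(signature)])


def verify(token: str, secret: bytes, now: int) -> dict:
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, signature_b64 = parts
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(expected, b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    payload = json.loads(b64url_decode(payload_b64))
    if payload.get("exp", float("inf")) <= now:
        raise ValueError("token expired")
    return payload


token = sign({"sub": "user_123", "role": "admin", "exp": 1704153600}, b"demo-secret")
claims = verify(token, b"demo-secret", now=1704067200)
```

Because verification needs only the shared secret, any service holding it can validate the token without a session lookup, which is the stateless property listed under pros.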

#### OAuth 2.0 / OpenID Connect

```

OAuth 2.0 Authorization Code Flow:

User ──► "Login with Google"

App ──► Google Authorization Server

Google: "Allow App to access your profile?"

User: "Allow"

Google ──► App (authorization code)

App ──► Google (code + client_secret)

Google ──► App (access_token + refresh_token)

App uses access_token to call Google APIs

```

### 13.3 Common Security Vulnerabilities (OWASP Top 10, 2017 edition)

1. **Injection** (SQL, NoSQL, OS command injection)

```sql

-- Vulnerable:

query = "SELECT * FROM users WHERE id = '" + userId + "'";

-- Safe (parameterized query):

SELECT * FROM users WHERE id = ?

```

2. **Broken Authentication** — Weak passwords, session management flaws

3. **Sensitive Data Exposure** — Unencrypted PII, weak hashing

4. **XML External Entities (XXE)** — Malicious XML parsing

5. **Broken Access Control** — IDOR vulnerabilities

```

GET /api/user/123/profile ← your profile

GET /api/user/456/profile ← another user's profile (should be forbidden!)

```

6. **Security Misconfiguration** — Default passwords, open ports

7. **XSS (Cross-Site Scripting)**


8. **Insecure Deserialization**

9. **Using Components with Known Vulnerabilities**

10. **Insufficient Logging & Monitoring**
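
The injection example in item 1 can be reproduced end to end with an in-memory SQLite database. This is a small sketch for illustration; the table contents and the function names `find_user_unsafe` and `find_user_safe` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("1", "alice"), ("2", "bob")])


def find_user_unsafe(user_id: str) -> list:
    # Vulnerable: user input is concatenated into the SQL text.
    query = "SELECT name FROM users WHERE id = '" + user_id + "'"
    return conn.execute(query).fetchall()


def find_user_safe(user_id: str) -> list:
    # Safe: the driver binds the value, so it is never parsed as SQL.
    return conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchall()


malicious = "x' OR '1'='1"
leaked_rows = find_user_unsafe(malicious)  # the OR clause matches every row
safe_rows = find_user_safe(malicious)      # treated as a literal id, matches nothing
```

The same input returns every user through the concatenated query and nothing through the parameterized one.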

### 13.4 Encryption

#### Encryption at Rest

Data encrypted when stored on disk:

```

AES-256 encryption for databases, file systems, backups

AWS S3: Server-Side Encryption (SSE-S3, SSE-KMS, SSE-C)

```

#### Encryption in Transit

Data encrypted when transmitted:

```

HTTPS/TLS for web traffic

TLS for internal service communication

mTLS (mutual TLS) for microservices authentication

```

#### Hashing Passwords

```

❌ Never store plaintext passwords

❌ Never use MD5 or SHA-1 for passwords (too fast = brute-forceable)

✅ Use: bcrypt, scrypt, argon2 (slow by design, adaptive cost)

bcrypt example:

$2b$12$LQv3c1yqBWVHxkd0LHAkCOYz6TtxMQJqhN8/LewdBPj2gUexhXWym

│ │

│ cost factor (12 = 2^12 = 4096 iterations)

algorithm version

```
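
As a sketch of slow, salted password hashing, the example below uses `hashlib.scrypt` from the Python standard library rather than bcrypt, purely so that it has no third-party dependency. The cost parameters (`n=2**14`, `r=8`, `p=1`) and the helper names are assumptions, not a recommendation; tune the cost for your own hardware.

```python
import hashlib
import hmac
import os
from typing import Optional


def hash_password(password: str, salt: Optional[bytes] = None) -> tuple[bytes, bytes]:
    """Return (salt, digest); a fresh random salt is generated when none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.scrypt(
        password.encode("utf-8"), salt=salt, n=2**14, r=8, p=1, dklen=32
    )
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    _, digest = hash_password(password, salt)
    # Constant-time comparison of the stored and recomputed digests.
    return hmac.compare_digest(digest, expected)


salt, stored = hash_password("correct horse battery staple")
ok = verify_password("correct horse battery staple", salt, stored)
rejected = verify_password("wrong guess", salt, stored)
```

Store the salt and the cost parameters alongside the digest so that the cost can be raised later without invalidating existing hashes.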

### 13.5 DDoS Protection Strategies

```

Layer 3/4 DDoS (volumetric):

→ Anycast network diffusion (CDN absorbs traffic)

→ Rate limiting at network edge

→ IP reputation filtering

→ ISP-level scrubbing

Layer 7 DDoS (application):

→ Bot detection (CAPTCHA, behavioral analysis)

→ Rate limiting per IP/user

→ WAF (Web Application Firewall)

→ Challenge suspicious traffic

```

---

## 14. Monitoring & Observability {#monitoring}

### 14.1 The Three Pillars of Observability

> **Observability** is the ability to understand what's happening inside your system from its external outputs.

#### 1. Metrics

*Numerical measurements over time*

```

System Metrics:

CPU usage: 67%

Memory: 4.2GB / 8GB

Network I/O: 245 MB/s

Application Metrics:

Requests per second: 12,450

Error rate: 0.02%

P99 latency: 187ms

Business Metrics:

Orders placed: 1,240/minute

Revenue: $45,230/hour

Active users: 234,567

```

*Tools: Prometheus, Datadog, CloudWatch, Grafana*

#### 2. Logs

*Immutable, timestamped records of discrete events*

```json

{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "user_id": "user_789",
  "message": "Payment processing failed",
  "error": "Card declined",
  "metadata": {
    "amount": 99.99,
    "currency": "USD",
    "attempt": 2
  }
}

```

*Tools: ELK Stack (Elasticsearch + Logstash + Kibana), Splunk, Loki*

#### 3. Traces

*Record of a request's journey through distributed services*

```

Request ID: abc123

│

├─ [0ms] API Gateway (2ms)

├─ [2ms] User Service (5ms)

│ └─ [3ms] DB Query (4ms)

├─ [7ms] Product Service (12ms)

│ ├─ [8ms] Cache Hit (1ms)

│ └─ [9ms] DB Query (10ms)

├─ [19ms] Order Service (45ms)

│ └─ [20ms] Payment Service (44ms)

│ └─ [21ms] Stripe API (42ms)

└─ [64ms] Total response time

```

*Tools: Jaeger, Zipkin, AWS X-Ray, Datadog APM*

### 14.2 The RED Method and USE Method

#### RED Method (for services)

- **R**ate — Requests per second

- **E**rror rate — Fraction of requests that are errors

- **D**uration — Time to handle each request (latency distribution)

#### USE Method (for resources)

- **U**tilization — % time resource is busy

- **S**aturation — Amount of work resource can't process (queue length)

- **E**rrors — Error events count

### 14.3 SLI, SLO, and SLA

```

SLI (Service Level Indicator):

The actual measured metric

"Our 99th percentile latency is 187ms"

SLO (Service Level Objective):

Your internal target

"99% of requests complete in under 200ms, measured over a rolling 30-day window"

SLA (Service Level Agreement):

Legal/contractual commitment to customers

"We guarantee 99.9% uptime; you get credit if we fall below"

SLO > SLA (SLO should be stricter than SLA — gives buffer before SLA breach)

```

### 14.4 Error Budgets

Error budget = *How much unreliability you're allowed before violating your SLO*

```

SLO: 99.9% availability

Error budget: 0.1% = 43.8 minutes/month

If you've used 30 minutes of downtime this month:

Remaining budget: 13.8 minutes

→ Slow down risky deployments

→ Focus on reliability improvements

If you have budget remaining:

→ Ship new features confidently!

```
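
The budget arithmetic is easy to check in code. This minimal sketch assumes an average month of 365.25 / 12 days, which is what yields the 43.8-minute figure; the function name is hypothetical.

```python
AVERAGE_DAYS_PER_MONTH = 365.25 / 12  # about 30.44 days


def error_budget_minutes(slo: float, days: float = AVERAGE_DAYS_PER_MONTH) -> float:
    """Minutes of allowed downtime in the period for an availability SLO such as 0.999."""
    return (1 - slo) * days * 24 * 60


budget = error_budget_minutes(0.999)
remaining = budget - 30  # after 30 minutes of downtime this month
```

With a strict 30-day window the same SLO allows 43.2 minutes, so state the window explicitly when publishing a budget.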

### 14.5 Alerting Best Practices

```

Avoid alert fatigue:

✅ Alert on SLO violations (symptoms), not causes

✅ Set appropriate thresholds (not too sensitive)

✅ Alert should be actionable

✅ Different severity levels (page vs ticket vs info)

Bad alert: CPU > 80% for 1 minute (not actionable, too frequent)

Good alert: Error rate > 1% for 5 minutes (SLO at risk!)

Alerting hierarchy:

P0 (Critical) → Page on-call immediately, 24/7

P1 (High) → Page on-call during business hours

P2 (Medium) → Create ticket for next sprint

P3 (Low) → Log for weekly review

```

---

## 15. Real-World System Design Case Studies {#case-studies}

### 15.1 Case Study: Design Twitter/X

#### Requirements

- 500M daily active users

- 150M tweets/day

- Read-heavy (read:write = 100:1)

- Timeline generation (home feed)

#### High-Level Architecture

```

CDN (Static content)

│

Client ──► API Gateway ──► Auth Service

│

┌───────────┼───────────┐

▼ ▼ ▼

Tweet User Timeline

Service Service Service

│ │ │

▼ ▼ ▼

Tweets DB Users DB Cache (Redis)

│ ▲

└──► Fanout Service ──────┘

│

Message Queue (Kafka)

```

#### Timeline Generation: Push vs Pull

**Push (Fanout on Write):**

```

Alice posts tweet

→ Kafka event

→ Fanout service reads Alice's 10,000 followers

→ Writes tweet to each follower's timeline cache

Read: O(1) — just read from cache

Write: O(n) — n = number of followers

Problem: Celebrity with 100M followers = 100M writes per tweet!

```

**Pull (Fanout on Read):**

```

Alice visits home page

→ Fetch list of people she follows

→ Query each person's recent tweets

→ Merge, sort, return

Read: O(n) where n = number of accounts followed

Write: O(1)

Problem: High read latency, many DB queries

```

**Twitter's Hybrid Solution:**

```

Regular users: Push (pre-computed timelines)

Celebrities (>X followers): Pull (queried at read time)

Combine both approaches in timeline service

```

### 15.2 Case Study: Design a URL Shortener (bit.ly)

#### Requirements

- 100M URLs shortened per day

- 10B URL redirects per day (100:1 read:write)

- Short URL must be ~7 characters

- URLs must not expire (or expire after N years)

#### Capacity Estimation

```

Write: 100M / 86400 = ~1,160 URLs/second

Read: 10B / 86400 = ~115,740 redirects/second

Storage (5 years):

100M URLs/day × 365 × 5 = 182.5B URLs

Each URL record: ~500 bytes

Total: 182.5B × 500B = ~91 TB

Short URL key space:

7 characters, base62 (a-z, A-Z, 0-9):

62^7 = 3.5 trillion unique URLs ✅

```
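
The same estimate can be written as a short script, which makes it easy to rerun with different assumptions. The variable names are illustrative; the inputs are the requirement figures stated above.

```python
SECONDS_PER_DAY = 86_400

writes_per_day = 100_000_000     # URLs shortened per day
reads_per_day = 10_000_000_000   # redirects per day
record_bytes = 500               # assumed size of one URL record
years = 5

writes_per_second = writes_per_day / SECONDS_PER_DAY
reads_per_second = reads_per_day / SECONDS_PER_DAY

total_urls = writes_per_day * 365 * years
storage_tb = total_urls * record_bytes / 1e12  # decimal terabytes

keyspace = 62 ** 7  # 7-character base62 keys
```

The key space exceeds the five-year URL count by roughly a factor of nineteen, which is why 7 characters are enough.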

#### URL Encoding Strategies

**MD5 hashing:**

```

MD5("https://example.com/very/long/url") → 128-bit digest (32 hex characters), e.g. "1a79a4d6..." (illustrative)

Take first 7 chars: "1a79a4d" → bit.ly/1a79a4d

Problem: Collisions possible; 7 hex characters give only 16^7 ≈ 268M distinct keys

```

**Base62 encoding of auto-increment ID:**

```

DB auto-increment ID: 100000

Convert to base62 (digits 0-9, a-z, A-Z): "q0U" (much shorter)

ID 100000000 → base62 → "6LAze" (5 chars)

```
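
Base62 conversion is repeated division by 62. The sketch below assumes the digit order 0-9, then a-z, then A-Z; any fixed order works as long as encoding and decoding agree. The function names are illustrative.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62


def encode(number: int) -> str:
    """Convert a non-negative integer ID to its base62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))  # most significant digit first


def decode(text: str) -> int:
    """Convert a base62 string back to the integer ID."""
    number = 0
    for char in text:
        number = number * BASE + ALPHABET.index(char)
    return number


short_key = encode(100_000_000)
original_id = decode(short_key)
```

Sequential IDs produce guessable keys, so some designs shuffle the alphabet or encode a non-sequential ID instead.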

**Distributed ID generation (Twitter Snowflake):**

```

64-bit ID structure:

1 bit (unused) | 41 bits (timestamp ms) | 10 bits (machine ID) | 12 bits (sequence)

Generates ~4M unique IDs per second per machine ✅

Sortable by time ✅

No coordination needed ✅

```
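
The bit layout above can be sketched as a small generator. This is an illustration under stated assumptions rather than Twitter's implementation: the class name, the epoch constant, and the injected clock are all choices made for this example, and the machine ID must still be assigned uniquely per node by some external means.

```python
import threading
import time

EPOCH_MS = 1_288_834_974_657  # assumed custom epoch; any fixed past instant works


class SnowflakeGenerator:
    """41-bit timestamp | 10-bit machine ID | 12-bit sequence."""

    def __init__(self, machine_id: int, clock=lambda: int(time.time() * 1000)):
        if not 0 <= machine_id < 1024:
            raise ValueError("machine_id must fit in 10 bits")
        self.machine_id = machine_id
        self.clock = clock
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = self.clock()
            if now < self.last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # All 4096 sequence numbers used; wait for the next millisecond.
                    while now <= self.last_ms:
                        now = self.clock()
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.machine_id << 12) | self.sequence


generator = SnowflakeGenerator(machine_id=7)
ids = [generator.next_id() for _ in range(10_000)]
```

The 12-bit sequence allows 4,096 IDs per millisecond per machine, which is where the figure of roughly 4M IDs per second comes from.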

#### Architecture

```

Client ──► Load Balancer ──► URL Shortener Service

│

┌─────────┴─────────┐

▼ ▼

Write Path Read Path

│ │

MySQL DB Redis Cache (TTL)

(source of truth) │

MySQL DB (cache miss)

```

### 15.3 Case Study: Design Netflix

#### Key Challenges

- 200M+ subscribers globally

- Billions of hours streamed per month

- Video files are massive (1 hour HD = ~2 GB)

- Users on wildly different network conditions

#### Video Ingestion Pipeline

```

Raw Video File

│

▼

┌─────────────────────────────────────────────────────┐

│ Transcoding Service │

│ │

│ Input: "Inception.mp4" (20 GB, 4K RAW) │

│ │

│ Output: Multiple formats & resolutions │

│ ├── inception_240p.mp4 (low bandwidth) │

│ ├── inception_480p.mp4 │

│ ├── inception_720p.mp4 │

│ ├── inception_1080p.mp4 │

│ ├── inception_4k.mp4 (high bandwidth) │

│ ├── inception_hdr.mp4 │

│ └── inception_audio_en.aac (+ 20 languages) │

│ │

│ Also: DRM encryption, content validation │

└─────────────────────────────────────────────────────┘

│

▼

AWS S3 (origin storage)

│

▼

CDN (Akamai, Netflix Open Connect)

```

#### Adaptive Bitrate Streaming (ABR)

```

Player monitors bandwidth continuously:

Bandwidth: 20 Mbps → Stream 4K

Bandwidth drops to 8 Mbps → Switch to 1080p seamlessly

Bandwidth drops to 2 Mbps → Switch to 720p

Bandwidth = 0.5 Mbps → Switch to 480p

HLS: Video split into 2-10 second chunks

Player fetches chunks + adjusts quality per chunk

```
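
The quality switch can be sketched as a lookup over a bitrate ladder. The thresholds below are taken from the example above and are illustrative only; the 240p fallback and the function name are assumptions, and real players also consider buffer level and smooth their bandwidth estimates.

```python
# (minimum bandwidth in Mbps, rendition), ordered from highest to lowest quality.
LADDER = [
    (20.0, "4K"),
    (8.0, "1080p"),
    (2.0, "720p"),
    (0.5, "480p"),
]
FALLBACK = "240p"  # assumed lowest rendition for very poor connections


def pick_rendition(bandwidth_mbps: float) -> str:
    """Return the highest rendition whose minimum bandwidth is satisfied."""
    for min_mbps, rendition in LADDER:
        if bandwidth_mbps >= min_mbps:
            return rendition
    return FALLBACK


# The player re-evaluates before fetching each chunk.
measured = [20.0, 8.0, 2.0, 0.5, 0.2]
choices = [pick_rendition(mbps) for mbps in measured]
```

Because the decision is made per chunk, a drop in bandwidth affects only the chunks that have not been downloaded yet.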

#### Netflix's Chaos Engineering

> Netflix pioneered **Chaos Engineering** with **Chaos Monkey** — a tool that ***randomly terminates production instances*** to ensure systems are resilient.

---

## 📚 Summary: Key Principles Cheat Sheet

### The System Design Hierarchy of Needs

```

┌─────────────────┐

│ BUSINESS │ ← Features, Cost

│ GOALS │

┌──┴─────────────────┴──┐

│ RELIABILITY │ ← Works correctly

┌──┴───────────────────────┴──┐

│ SCALABILITY │ ← Handles growth

┌──┴─────────────────────────────┴──┐

│ PERFORMANCE │ ← Fast enough

┌──┴───────────────────────────────────┴──┐

│ SECURITY │ ← Not exploitable

┌──┴─────────────────────────────────────────┴──┐

│ MAINTAINABILITY │ ← Evolvable

└────────────────────────────────────────────────┘

```

### Quick Reference: Technology Choices

| Need | Technology |
|---|---|
| **Relational data** | PostgreSQL, MySQL |
| **Document store** | MongoDB |
| **Cache** | Redis, Memcached |
| **Wide-column** | Cassandra, HBase |
| **Search** | Elasticsearch |
| **Graph** | Neo4j |
| **Time-series** | InfluxDB, TimescaleDB |
| **Message queue** | Kafka, RabbitMQ, SQS |
| **Object storage** | S3, GCS |
| **Load balancer** | Nginx, HAProxy, AWS ALB |
| **Service discovery** | Consul, Eureka |
| **Container orchestration** | Kubernetes |
| **Monitoring** | Prometheus + Grafana |
| **Distributed tracing** | Jaeger, Zipkin |
| **CDN** | Cloudflare, CloudFront |

### The Golden Rules of System Design

> 1. 📐 **Start simple** — Don't over-engineer. Add complexity only when needed.

> 2. 📊 **Estimate before designing** — Know your scale before choosing solutions.

> 3. 🔁 **Embrace trade-offs** — Every design decision is a trade-off. Know what you're trading.

> 4. 🧱 **Design for failure** — Assume everything will fail. Build for resilience.

> 5. 📏 **Scale horizontally** — Design stateless services that scale out, not up.

> 6. 🏎️ **Cache aggressively** — Cache at every layer, but handle invalidation carefully.

> 7. 🔍 **Measure everything** — You can't improve what you don't measure.

> 8. 🚀 **Iterate** — No design survives contact with production unchanged.

---

*This guide covers the core concepts of system design. The field is constantly evolving — stay curious, read engineering blogs from companies like Netflix, Uber, Airbnb, and Cloudflare, and always tie theoretical knowledge to real implementations.*

---

**Further Reading:**

- *Designing Data-Intensive Applications* — Martin Kleppmann

- *The System Design Interview* — Alex Xu

- *Site Reliability Engineering* — Google (free online)

- *Building Microservices* — Sam Newman

- Engineering blogs: Netflix Tech Blog, Uber Engineering, Cloudflare Blog, AWS Architecture Blog