System Design & Software Architecture: Building Scalable Systems

System design is the art of building scalable, reliable, and maintainable systems. This guide covers fundamental principles and advanced patterns used by tech giants to handle millions of users.

Fundamental Concepts

Scalability Types

Vertical Scaling (Scale Up)
├── Add more CPU, RAM, Storage to existing server
├── Simpler to implement
├── Hardware limits exist
└── Single point of failure

Horizontal Scaling (Scale Out)
├── Add more servers to the pool
├── No single-machine ceiling, though coordination overhead grows with cluster size
├── Requires distributed system design
└── More complex but more resilient

CAP Theorem

A distributed data store can guarantee at most two of the following three properties at the same time:

Consistency - Every read sees the most recent write (all nodes agree on the data)
Availability - Every request to a non-failing node receives a non-error response
Partition Tolerance - The system keeps operating when the network drops or delays messages between nodes

In practice, partition tolerance is non-negotiable in distributed systems. The real choice is between consistency and availability during network partitions.

CP Systems (Consistency + Partition Tolerance)
├── MongoDB (with majority write concern)
├── HBase
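├── Redis Cluster (minority partitions stop accepting writes, but asynchronous replication means acknowledged writes can still be lost)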
└── Use when: Financial transactions, inventory management

AP Systems (Availability + Partition Tolerance)
├── Cassandra
├── DynamoDB (eventually consistent reads by default; strongly consistent reads are optional)
├── CouchDB
└── Use when: Social media feeds, analytics, caching
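
To make the consistency-versus-availability choice concrete, here is a minimal sketch of a single replica during a partition. It is a toy under stated assumptions: the `Replica` class, its `mode` flag, and the `reachable_peers` counter are hypothetical names for illustration, not the API of any database listed above. A CP replica refuses writes when it cannot reach a majority, while an AP replica accepts them and leaves reconciliation for later.

```python
class Replica:
    """Toy replica illustrating CP vs AP behavior during a partition (hypothetical)."""

    def __init__(self, mode, cluster_size):
        self.mode = mode                          # "CP" or "AP"
        self.cluster_size = cluster_size
        self.reachable_peers = cluster_size - 1   # Healthy network: all peers reachable
        self.data = {}

    def has_majority(self):
        # This replica plus the peers it can reach must form a strict majority
        return (self.reachable_peers + 1) > self.cluster_size // 2

    def write(self, key, value):
        if self.mode == "CP" and not self.has_majority():
            # Stay consistent by giving up availability
            raise RuntimeError("no quorum: write rejected")
        # AP (or CP with quorum): accept the write locally
        self.data[key] = value
        return "ok"


cp = Replica("CP", cluster_size=3)
ap = Replica("AP", cluster_size=3)
cp.reachable_peers = ap.reachable_peers = 0   # Simulate a partition isolating each replica

ap_result = ap.write("cart", ["book"])        # Accepted; may conflict with other replicas
try:
    cp_result = cp.write("cart", ["book"])
except RuntimeError as exc:
    cp_result = str(exc)                      # Rejected to preserve consistency
```

The AP replica stays writable but may now disagree with its peers; the CP replica stays correct but returns an error until the partition heals.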

Load Balancing Strategies

Layer 4 vs Layer 7 Load Balancing

Layer 4 (Transport Layer)
├── Routes based on IP and TCP/UDP port
├── Faster, less CPU intensive
├── No content inspection
└── Use for: TCP/UDP traffic, gaming, streaming

Layer 7 (Application Layer)
├── Routes based on HTTP headers, URL, cookies
├── Content-aware routing
├── TLS (SSL) termination
└── Use for: Web applications, API routing
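
The difference in what each layer can see is easiest to show with two routing functions. This is a rough sketch, not how any particular load balancer is configured: the pool names (`api`, `realtime`, `web`) and the function names are made up for illustration.

```python
def route_l4(dst_port, pools_by_port):
    """Layer 4: only addresses and ports are visible, so route on the port."""
    return pools_by_port[dst_port]


def route_l7(path, headers, default_pool="web"):
    """Layer 7: the proxy has parsed the HTTP request, so it can route on content."""
    if path.startswith("/api/"):
        return "api"
    if headers.get("Upgrade", "").lower() == "websocket":
        return "realtime"
    return default_pool


l4_pool = route_l4(443, {443: "tls-pool", 25565: "game-pool"})
api_pool = route_l7("/api/users", {})
ws_pool = route_l7("/chat", {"Upgrade": "WebSocket"})
static_pool = route_l7("/index.html", {})
```

The Layer 7 router can send `/api/` traffic and WebSocket upgrades to different pools on the same port, which the Layer 4 router cannot distinguish.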

Load Balancing Algorithms

# Round Robin - Simple rotation
servers = ['server1', 'server2', 'server3']
current = 0
 
def round_robin():
    global current
    server = servers[current]
    current = (current + 1) % len(servers)
    return server
 
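# Weighted Round Robin - Based on server capacity
import itertools

weighted_servers = [
    {'host': 'server1', 'weight': 5},  # 50% traffic
    {'host': 'server2', 'weight': 3},  # 30% traffic
    {'host': 'server3', 'weight': 2},  # 20% traffic
]

# Repeat each host `weight` times, then rotate through the expanded list
_weighted_cycle = itertools.cycle(
    [s['host'] for s in weighted_servers for _ in range(s['weight'])]
)

def weighted_round_robin():
    return next(_weighted_cycle)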
 
# Least Connections - Route to server with fewest active connections
def least_connections(servers):
    return min(servers, key=lambda s: s.active_connections)
 
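# IP Hash - Consistent routing for same client
import hashlib

def ip_hash(client_ip, servers):
    # Built-in hash() is salted per process for strings, so the same client
    # could map to different servers after a restart; use a stable digest
    digest = hashlib.sha256(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]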
 
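# Consistent Hashing - Minimizes redistribution when servers change
import bisect
import hashlib

class ConsistentHash:
    def __init__(self, nodes, virtual_nodes=150):
        self.ring = {}
        self.sorted_keys = []
        
        for node in nodes:
            for i in range(virtual_nodes):
                key = self._hash(f"{node}:{i}")
                self.ring[key] = node
                self.sorted_keys.append(key)
        
        self.sorted_keys.sort()
    
    @staticmethod
    def _hash(value):
        # Stable across processes, unlike the built-in hash() for strings
        return int(hashlib.sha256(value.encode()).hexdigest(), 16)
    
    def get_node(self, key):
        hash_key = self._hash(key)
        # Binary search for the first ring position at or after the key's hash
        index = bisect.bisect_left(self.sorted_keys, hash_key)
        if index == len(self.sorted_keys):
            index = 0  # Wrap around to the start of the ring
        return self.ring[self.sorted_keys[index]]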
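
To see why consistent hashing "minimizes redistribution", it helps to count how many keys change owner when a fourth server joins. The sketch below is self-contained and uses its own small helpers (`stable_hash`, `build_ring`, `ring_owner` are illustrative names, and SHA-256 is an arbitrary choice of stable hash). With three nodes growing to four, modulo hashing is expected to remap about three quarters of keys, while the ring should remap roughly one quarter.

```python
import bisect
import hashlib


def stable_hash(value):
    # Deterministic across processes, unlike the built-in hash() for strings
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


def modulo_owner(key, nodes):
    return nodes[stable_hash(key) % len(nodes)]


def build_ring(nodes, virtual_nodes=100):
    # Each node is placed on the ring virtual_nodes times
    points = sorted((stable_hash(f"{n}:{i}"), n) for n in nodes for i in range(virtual_nodes))
    positions = [position for position, _ in points]
    owners = [node for _, node in points]
    return positions, owners


def ring_owner(key, ring):
    positions, owners = ring
    # First ring position at or after the key's hash, wrapping past the end
    index = bisect.bisect_left(positions, stable_hash(key)) % len(positions)
    return owners[index]


keys = [f"user-{i}" for i in range(10_000)]
old_nodes = ["server1", "server2", "server3"]
new_nodes = old_nodes + ["server4"]

old_ring, new_ring = build_ring(old_nodes), build_ring(new_nodes)
moved_modulo = sum(modulo_owner(k, old_nodes) != modulo_owner(k, new_nodes) for k in keys)
moved_ring = sum(ring_owner(k, old_ring) != ring_owner(k, new_ring) for k in keys)

print(f"modulo hashing moved {moved_modulo / len(keys):.0%} of keys")
print(f"consistent hashing moved {moved_ring / len(keys):.0%} of keys")
```

On the ring, every key that moves lands on the new server; no key is shuffled between the existing three.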

Database Scaling Patterns

Read Replicas

                    ┌─────────────┐
                    │   Primary   │
                    │  (Writes)   │
                    └──────┬──────┘
                           │
           ┌───────────────┼───────────────┐
           │               │               │
           ▼               ▼               ▼
    ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
    │  Replica 1  │ │  Replica 2  │ │  Replica 3  │
    │   (Reads)   │ │   (Reads)   │ │   (Reads)   │
    └─────────────┘ └─────────────┘ └─────────────┘

class DatabaseRouter:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self.replica_index = 0
    
    def get_connection(self, operation):
        if operation in ('INSERT', 'UPDATE', 'DELETE'):
            return self.primary
        else:
            # Round-robin across replicas
            replica = self.replicas[self.replica_index]
            self.replica_index = (self.replica_index + 1) % len(self.replicas)
            return replica

Database Sharding

Horizontal Sharding (by rows)
├── Range-based: users 1-1,000,000 → Shard1, 1,000,001-2,000,000 → Shard2
├── Hash-based: hash(user_id) % num_shards
├── Directory-based: Lookup service maps keys to shards
└── Geographic: Users by region

Vertical Sharding (by columns/tables)
├── User data → Database A
├── Order data → Database B
└── Analytics → Database C

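import hashlib

class ShardRouter:
    def __init__(self, shards):
        self.shards = shards
        self.num_shards = len(shards)
    
    def _shard_index(self, user_id):
        # Modulo hashing over a stable digest. Built-in hash() is salted per
        # process for strings, so it would route differently after a restart.
        # This is not consistent hashing: changing num_shards remaps most keys.
        digest = hashlib.sha256(str(user_id).encode()).hexdigest()
        return int(digest, 16) % self.num_shards
    
    def get_shard(self, user_id):
        return self.shards[self._shard_index(user_id)]
    
    def get_shard_for_range_query(self, start_id, end_id):
        # Hashing scatters adjacent IDs, so range queries may need to hit multiple shards
        indexes = set()
        for user_id in range(start_id, end_id + 1):
            indexes.add(self._shard_index(user_id))
            if len(indexes) == self.num_shards:
                break  # Every shard is already needed; stop scanning
        return [self.shards[i] for i in sorted(indexes)]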
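
For comparison, the range-based strategy from the list above can be sketched with a sorted list of boundaries and a binary search. `RangeShardRouter` and its `upper_bounds` parameter are hypothetical names chosen for this example, and the boundaries reuse the one-million-per-shard split mentioned earlier.

```python
import bisect


class RangeShardRouter:
    """Range-based routing: each shard owns a contiguous block of user IDs."""

    def __init__(self, upper_bounds, shards):
        # upper_bounds[i] is the largest user_id stored on shards[i]; must be sorted
        if len(upper_bounds) != len(shards) or list(upper_bounds) != sorted(upper_bounds):
            raise ValueError("need one sorted upper bound per shard")
        self.upper_bounds = list(upper_bounds)
        self.shards = list(shards)

    def _index(self, user_id):
        index = bisect.bisect_left(self.upper_bounds, user_id)
        if index == len(self.shards):
            raise KeyError(f"user_id {user_id} is beyond the last shard's range")
        return index

    def get_shard(self, user_id):
        return self.shards[self._index(user_id)]

    def get_shards_for_range(self, start_id, end_id):
        # A contiguous ID range maps to a contiguous run of shards; no per-ID scan
        return self.shards[self._index(start_id):self._index(end_id) + 1]


router = RangeShardRouter(
    [1_000_000, 2_000_000, 3_000_000],
    ["shard1", "shard2", "shard3"],
)
boundary_shard = router.get_shard(1_000_000)
next_shard = router.get_shard(1_000_001)
spanning = router.get_shards_for_range(999_999, 1_000_001)
```

Range queries stay cheap here, at the cost of hot spots when recent IDs all land on the last shard; hash-based sharding makes the opposite trade.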

Caching Architecture

Multi-Level Caching

Request Flow:
                                                    
Client → CDN → Load Balancer → App Server → Cache → Database
         ↓           ↓              ↓          ↓
      Static     Session       Application  Query
      Assets     Affinity       Cache       Cache

Cache Levels:
├── L1: Browser Cache (client-side)
├── L2: CDN Cache (edge locations)
├── L3: Application Cache (Redis/Memcached)
├── L4: Database Query Cache
└── L5: Database Buffer Pool
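
The lookup order across levels can be sketched with two of them: a per-process dictionary in front of a shared cache, falling through to a loader. This is a simplified illustration, assuming a plain `dict` as a stand-in for Redis/Memcached and omitting TTLs and eviction; `TwoLevelCache` and its `stats` counter are names invented for the example.

```python
class TwoLevelCache:
    """Check a small in-process dict first, then a shared cache, then the loader."""

    def __init__(self, shared_cache, loader):
        self.local = {}                 # Per-process level (fastest, smallest)
        self.shared = shared_cache      # Stand-in for Redis/Memcached; a dict here
        self.loader = loader            # Falls through to the database
        self.stats = {"local": 0, "shared": 0, "loader": 0}

    def get(self, key):
        if key in self.local:
            self.stats["local"] += 1
            return self.local[key]
        if key in self.shared:
            self.stats["shared"] += 1
            value = self.shared[key]
        else:
            self.stats["loader"] += 1
            value = self.loader(key)
            self.shared[key] = value    # Populate the shared level on the way back
        self.local[key] = value         # Populate the local level on the way back
        return value


shared = {}
cache = TwoLevelCache(shared, loader=lambda key: f"row-for-{key}")
first = cache.get("user:1")    # Misses both levels and calls the loader
second = cache.get("user:1")   # Served from the local level
```

Each miss fills the levels above it, so repeated reads move progressively closer to the caller.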

Cache Invalidation Strategies

class CacheManager:
    def __init__(self, cache, db, queue=None):
        self.cache = cache
        self.db = db
        self.queue = queue  # Pending DB writes, drained by a background worker (write-behind)
    
    # Cache-Aside (Lazy Loading)
    def get_user(self, user_id):
        key = f"user:{user_id}"
        
        # Try cache first (compare with None so cached falsy values still count as hits)
        user = self.cache.get(key)
        if user is not None:
            return user
        
        # Cache miss - load from DB
        user = self.db.get_user(user_id)
        if user is not None:
            self.cache.set(key, user, ttl=3600)
        
        return user
    
    # Write-Through
    def update_user_write_through(self, user_id, data):
        # Update DB first
        user = self.db.update_user(user_id, data)
        
        # Then update cache
        self.cache.set(f"user:{user_id}", user, ttl=3600)
        
        return user
    
    # Write-Behind (Async)
    def update_user_write_behind(self, user_id, data):
        key = f"user:{user_id}"
        
        # Update cache immediately
        self.cache.set(key, data, ttl=3600)
        
        # Queue DB write for async processing
        self.queue.push({
            'operation': 'update_user',
            'user_id': user_id,
            'data': data
        })
        
        return data
    
    # Cache Invalidation
    def invalidate_user(self, user_id):
        # Delete specific key
        self.cache.delete(f"user:{user_id}")
        
        # Invalidate related caches
        self.cache.delete(f"user:{user_id}:orders")
        self.cache.delete(f"user:{user_id}:preferences")
        
        # Tag-based invalidation
        self.cache.delete_by_tag(f"user:{user_id}")
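
Write-behind only works if something eventually drains the queue. The following is a minimal sketch of that consumer side, assuming jobs shaped like the dictionaries enqueued by `update_user_write_behind` and using the standard library's `queue.Queue` in place of a real message broker. `flush_pending_writes` and `InMemoryDB` are hypothetical helpers, not part of the class above.

```python
import queue


def flush_pending_writes(pending, db, max_batch=100):
    """Drain up to max_batch queued writes and apply them to the database."""
    applied = 0
    while applied < max_batch:
        try:
            job = pending.get_nowait()
        except queue.Empty:
            break                       # Nothing left to flush
        if job["operation"] == "update_user":
            db.update_user(job["user_id"], job["data"])
        applied += 1
    return applied


class InMemoryDB:
    """Stand-in database used only to make the sketch runnable."""

    def __init__(self):
        self.users = {}

    def update_user(self, user_id, data):
        self.users[user_id] = data
        return data


pending = queue.Queue()
pending.put({"operation": "update_user", "user_id": 7, "data": {"name": "Ada"}})

db = InMemoryDB()
applied = flush_pending_writes(pending, db)
```

Until a job is flushed, the cache holds data the database does not, so a cache or queue failure in that window loses the write. That risk is the price of the lower write latency.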

Microservices Architecture

Service Communication Patterns

Synchronous Communication
├── REST APIs
│   └── Simple, stateless, HTTP-based
├── gRPC
│   └── High performance, binary protocol, streaming
└── GraphQL
    └── Flexible queries, single endpoint

Asynchronous Communication
├── Message Queues (RabbitMQ, SQS)
│   └── Point-to-point, typically at-least-once delivery (consumers should be idempotent)
├── Event Streaming (Kafka)
│   └── Pub/sub, event sourcing, replay capability
└── Event Bus
    └── Loose coupling, broadcast events
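
The practical difference between pub/sub and point-to-point delivery is who receives each message. Here is a small in-process sketch, with `EventBus` and `WorkQueue` as illustrative stand-ins rather than the APIs of Kafka, RabbitMQ, or SQS.

```python
from collections import defaultdict, deque


class EventBus:
    """Pub/sub: every subscriber to a topic receives every event."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:
            handler(event)


class WorkQueue:
    """Point-to-point: each message is consumed by exactly one worker."""

    def __init__(self):
        self.messages = deque()

    def send(self, message):
        self.messages.append(message)

    def receive(self):
        return self.messages.popleft() if self.messages else None


bus = EventBus()
emails, analytics = [], []
bus.subscribe("order.created", emails.append)
bus.subscribe("order.created", analytics.append)
bus.publish("order.created", {"order_id": "order-123"})   # Both handlers run

jobs = WorkQueue()
jobs.send({"task": "resize-image"})
worker_a = jobs.receive()   # Gets the message
worker_b = jobs.receive()   # Gets None: the message was already consumed
```

Adding a third subscriber to the bus requires no change to the publisher, which is the loose coupling the list above refers to.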

Service Mesh Architecture

┌─────────────────────────────────────────────────────────┐
│                    Control Plane                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │   Config    │  │  Discovery  │  │    Certs    │     │
│  │   Server    │  │   Service   │  │   Manager   │     │
│  └─────────────┘  └─────────────┘  └─────────────┘     │
└─────────────────────────────────────────────────────────┘
                           │
         ┌─────────────────┼─────────────────┐
         │                 │                 │
         ▼                 ▼                 ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│   Service A     │ │   Service B     │ │   Service C     │
│ ┌─────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │
│ │    App      │ │ │ │    App      │ │ │ │    App      │ │
│ └─────────────┘ │ │ └─────────────┘ │ │ └─────────────┘ │
│ ┌─────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │
│ │   Sidecar   │ │ │ │   Sidecar   │ │ │ │   Sidecar   │ │
│ │   Proxy     │◄┼─┼─►   Proxy     │◄┼─┼─►   Proxy     │ │
│ └─────────────┘ │ │ └─────────────┘ │ │ └─────────────┘ │
└─────────────────┘ └─────────────────┘ └─────────────────┘

Circuit Breaker Pattern

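import time
from enum import Enum

class CircuitBreakerOpenException(Exception):
    """Raised when a call is rejected because the circuit is open."""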
 
class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing if service recovered
 
class CircuitBreaker:
    def __init__(
        self,
        failure_threshold=5,
        recovery_timeout=30,
        expected_exception=Exception
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.expected_exception = expected_exception
        
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = None
        self.success_count = 0
    
    def call(self, func, *args, **kwargs):
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                self.state = CircuitState.HALF_OPEN
            else:
                raise CircuitBreakerOpenException()
        
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except self.expected_exception:
            self._on_failure()
            raise  # Re-raise the original exception with its traceback intact
    
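    def _on_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= 3:  # Require 3 successes to close
                self.state = CircuitState.CLOSED
                self.failure_count = 0
                self.success_count = 0
        else:
            # A success while closed clears the streak, so only
            # consecutive failures trip the breaker
            self.failure_count = 0
    
    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        self.success_count = 0  # A failed probe discards half-open progress
        
        # Any failure while half-open reopens the circuit immediately
        if (self.state == CircuitState.HALF_OPEN
                or self.failure_count >= self.failure_threshold):
            self.state = CircuitState.OPEN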
    
    def _should_attempt_reset(self):
        return (
            time.time() - self.last_failure_time >= self.recovery_timeout
        )
 
# Usage
user_service_breaker = CircuitBreaker(
    failure_threshold=5,
    recovery_timeout=30
)
 
def get_user(user_id):
    return user_service_breaker.call(
        external_user_service.get,
        user_id
    )
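
Circuit breakers are usually paired with retries for transient errors. Below is a minimal sketch of retrying with exponential backoff and jitter; the function name, its defaults, and the choice of `ConnectionError`/`TimeoutError` as retryable are assumptions made for this example. The `sleep` parameter exists only so the demonstration can skip real waiting.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func, retrying transient errors with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise                      # Out of attempts: surface the last error
            # Delay ceiling doubles each attempt (base, 2x, 4x, ...), capped at max_delay
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))   # Jitter keeps clients from retrying in lockstep


attempts = {"count": 0}


def flaky_call():
    # Fails twice, then succeeds, to imitate a transient outage
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = retry_with_backoff(flaky_call, sleep=lambda seconds: None)  # Skip real sleeping
```

Retries belong inside the breaker's protection, not around it: once the circuit opens, further retries should fail fast instead of adding load to a struggling service.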

Event-Driven Architecture

Event Sourcing

from dataclasses import dataclass
from datetime import datetime
from typing import List
 
@dataclass
class Event:
    event_type: str
    aggregate_id: str
    data: dict
    timestamp: datetime
    version: int
 
class EventStore:
    def __init__(self):
        self.events = []
    
    def append(self, event: Event):
        self.events.append(event)
    
    def get_events(self, aggregate_id: str) -> List[Event]:
        return [e for e in self.events if e.aggregate_id == aggregate_id]
 
class Order:
    def __init__(self, order_id: str):
        self.order_id = order_id
        self.status = None
        self.items = []
        self.total = 0
        self.tracking_number = None  # Set once an OrderShipped event is applied
        self.version = 0
    
    def apply(self, event: Event):
        if event.event_type == "OrderCreated":
            self.status = "created"
            self.items = event.data["items"]
            self.total = event.data["total"]
        elif event.event_type == "OrderPaid":
            self.status = "paid"
        elif event.event_type == "OrderShipped":
            self.status = "shipped"
            self.tracking_number = event.data["tracking_number"]
        elif event.event_type == "OrderCancelled":
            self.status = "cancelled"
        
        self.version = event.version
    
    @classmethod
    def rebuild(cls, events: List[Event]) -> "Order | None":
        # No events means the aggregate never existed
        if not events:
            return None
        
        order = cls(events[0].aggregate_id)
        for event in events:
            order.apply(event)
        
        return order
 
# Usage
event_store = EventStore()
 
# Create order
event_store.append(Event(
    event_type="OrderCreated",
    aggregate_id="order-123",
    data={"items": [{"sku": "ABC", "qty": 2}], "total": 99.99},
    timestamp=datetime.now(),
    version=1
))
 
# Rebuild order state from events
events = event_store.get_events("order-123")
order = Order.rebuild(events)
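
Because the event log is the source of truth, two writers appending to the same aggregate at once can corrupt its history. A common safeguard is an optimistic concurrency check on append. The sketch below is one way to express it, using plain dictionaries instead of the `Event` dataclass to stay self-contained; `VersionedEventStore`, `ConcurrencyError`, and the `expected_version` parameter are names introduced for this example.

```python
from collections import defaultdict


class ConcurrencyError(Exception):
    """Raised when another writer appended to the stream first."""


class VersionedEventStore:
    def __init__(self):
        self.streams = defaultdict(list)   # aggregate_id -> ordered list of events

    def append(self, aggregate_id, event_type, data, expected_version):
        stream = self.streams[aggregate_id]
        current_version = len(stream)      # Versions are 1-based and gapless here
        if expected_version != current_version:
            raise ConcurrencyError(
                f"expected version {expected_version}, stream is at {current_version}"
            )
        event = {"event_type": event_type, "data": data, "version": current_version + 1}
        stream.append(event)
        return event


store = VersionedEventStore()
store.append("order-123", "OrderCreated", {"total": 99.99}, expected_version=0)
store.append("order-123", "OrderPaid", {}, expected_version=1)

try:
    # A stale writer that still believes the stream is at version 1
    store.append("order-123", "OrderCancelled", {}, expected_version=1)
    conflict = False
except ConcurrencyError:
    conflict = True
```

The stale writer is expected to reload the stream, re-evaluate its command against the new state, and try again.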

CQRS (Command Query Responsibility Segregation)

                    ┌─────────────────┐
                    │     Client      │
                    └────────┬────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
              ▼                             ▼
       ┌─────────────┐              ┌─────────────┐
       │  Commands   │              │   Queries   │
       │   (Write)   │              │   (Read)    │
       └──────┬──────┘              └──────┬──────┘
              │                             │
              ▼                             ▼
       ┌─────────────┐              ┌─────────────┐
       │   Command   │              │    Query    │
       │   Handler   │              │   Handler   │
       └──────┬──────┘              └──────┬──────┘
              │                             │
              ▼                             ▼
       ┌─────────────┐              ┌─────────────┐
       │   Write     │    Events    │    Read     │
       │   Model     │─────────────►│    Model    │
       │ Normalized  │              │Denormalized │
       └─────────────┘              └─────────────┘
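
The flow in the diagram can be sketched in a few lines: a command handler updates the write model and emits an event, and a projection folds that event into a read model shaped for one query. All class and method names here are hypothetical, and the events are applied inline for simplicity where a real system would carry them over a queue or log.

```python
class OrderWriteModel:
    """Write side: validates commands and records the resulting events."""

    def __init__(self):
        self.orders = {}      # order_id -> normalized record
        self.outbox = []      # Events waiting to be projected

    def handle_create_order(self, order_id, customer, total):
        if order_id in self.orders:
            raise ValueError(f"order {order_id} already exists")
        self.orders[order_id] = {"customer": customer, "total": total}
        self.outbox.append({"type": "OrderCreated", "order_id": order_id,
                            "customer": customer, "total": total})


class CustomerSummaryReadModel:
    """Read side: a denormalized view shaped for one query."""

    def __init__(self):
        self.by_customer = {}   # customer -> {"order_count": int, "total_spent": float}

    def project(self, event):
        if event["type"] == "OrderCreated":
            row = self.by_customer.setdefault(
                event["customer"], {"order_count": 0, "total_spent": 0.0})
            row["order_count"] += 1
            row["total_spent"] += event["total"]

    def get_summary(self, customer):
        return self.by_customer.get(customer, {"order_count": 0, "total_spent": 0.0})


write_model = OrderWriteModel()
read_model = CustomerSummaryReadModel()

write_model.handle_create_order("order-1", "alice", 40.0)
write_model.handle_create_order("order-2", "alice", 60.0)

# Apply pending events to the read side (inline here; asynchronous in practice)
while write_model.outbox:
    read_model.project(write_model.outbox.pop(0))

summary = read_model.get_summary("alice")
```

Because projection normally runs asynchronously, the read model lags the write model slightly; queries see eventually consistent data.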

High Availability Patterns

Active-Passive Failover

Normal Operation:
┌─────────────┐     ┌─────────────┐
│   Active    │────►│   Passive   │
│   Server    │     │   Server    │
│  (Primary)  │     │  (Standby)  │
└─────────────┘     └─────────────┘
       │
       ▼
   [Traffic]

After Failover:
┌─────────────┐     ┌─────────────┐
│   Failed    │     │   Active    │
│   Server    │     │   Server    │
│  (Down)     │     │  (Promoted) │
└─────────────┘     └─────────────┘
                           │
                           ▼
                       [Traffic]
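
The promotion step shown above is typically driven by missed heartbeats. The sketch below assumes a monitor that is told the current time explicitly, which keeps the example deterministic; `FailoverMonitor`, `heartbeat_interval`, and `missed_limit` are illustrative names. It deliberately leaves out fencing the old primary, which a real deployment needs to avoid split-brain.

```python
class FailoverMonitor:
    """Promote the standby after the active node misses too many heartbeats."""

    def __init__(self, active, standby, heartbeat_interval=1.0, missed_limit=3):
        self.active = active
        self.standby = standby
        self.timeout = heartbeat_interval * missed_limit
        self.last_heartbeat = 0.0

    def record_heartbeat(self, now):
        self.last_heartbeat = now

    def check(self, now):
        # Returns the node that should receive traffic at time `now`
        if self.standby is not None and now - self.last_heartbeat > self.timeout:
            self.active, self.standby = self.standby, None   # Promote the standby
        return self.active


monitor = FailoverMonitor(active="server-a", standby="server-b")
monitor.record_heartbeat(now=10.0)
before = monitor.check(now=12.0)    # 2s since last heartbeat, within the 3s timeout
after = monitor.check(now=14.5)     # 4.5s since last heartbeat, standby is promoted
```

A shorter timeout means faster failover but more false promotions during brief network hiccups, so the threshold is itself a trade-off.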

Active-Active (Multi-Master)

┌─────────────┐     ┌─────────────┐
│   Active    │◄───►│   Active    │
│   Server 1  │     │   Server 2  │
└──────┬──────┘     └──────┬──────┘
       │                   │
       └─────────┬─────────┘
                 │
          ┌──────┴──────┐
          │Load Balancer│
          └──────┬──────┘
                 │
             [Traffic]

Conclusion

System design is about making informed trade-offs based on requirements. There's no one-size-fits-all solution: the best architecture depends on your specific scale, consistency requirements, team expertise, and business constraints.

Key takeaways:

Understand CAP theorem and its implications
Start simple, scale when needed
Use caching strategically at multiple levels
Design for failure with circuit breakers and retries
Consider event-driven architecture for loose coupling
Monitor everything and plan for observability
