Scene — Overview

// A visual guide to

Distributed
Systems

Modern software doesn't run on one machine. It runs on thousands — connected, collaborating, and constantly failing. This page teaches you how distributed systems work through live 3D graphics. Scroll to explore each concept.

Scroll to explore

CAP Theorem

A distributed system can simultaneously guarantee only two of three properties: Consistency, Availability, and Partition Tolerance. The 3D triangle above shows all three nodes — the red fault line represents a network partition.

C — Consistency

Every read receives the most recent write, or an error. All nodes see identical data at any point in time. Example: a bank balance must be exact.

A — Availability

Every request receives a response (not necessarily the latest). The system never refuses a read or write. Example: a DNS lookup always returns something.

P — Partition Tolerance

The system continues operating even when network messages are lost between nodes. All real distributed systems must tolerate partitions — so you choose C or A.

Trade-offSystem TypeReal Examples
CPConsistent + Partition TolerantHBase, Zookeeper, Etcd, MongoDB (strong)
APAvailable + Partition TolerantCassandra, DynamoDB, CouchDB, Riak
CAConsistent + Available (no partition)Traditional RDBMS on single node

Consensus & Raft

How do distributed nodes agree on a single value even when some of them fail? The Raft algorithm solves this with leader election and append-only logs. The golden node above is the current leader; cyan nodes are followers casting votes.

Phase 1

Leader Election

A node becomes a candidate when it hasn't heard from a leader in a timeout window. It requests votes from peers. A majority win makes it the Leader. Others become Followers.

Phase 2

Log Replication

The Leader receives all client writes. It appends them to its log and sends AppendEntries RPCs to followers. Once a majority acknowledge the entry, it is committed.

Phase 3

Safety & Terms

Each election starts a new term. A leader with a lower term is immediately rejected, preventing split-brain. Leaders must have the most up-to-date log to win election.

Leader heartbeat → [F1] [F2] [F3] [F4] ↑ ↑ Vote granted Vote granted (majority = commit)
Strong Consistency Majority Quorum Used by: etcd · Kafka · CockroachDB

Replication Strategies

Replication copies data across multiple nodes for fault tolerance and read scalability. The cyan node above is the Primary; purple nodes are Replicas. Watch data flow from primary to replicas — the red dot on one replica indicates replication lag.

Synchronous Replication

The primary waits for all replicas to acknowledge before confirming the write. Zero data loss, but higher latency. Used in: PostgreSQL synchronous standbys, Aurora.

Asynchronous Replication

The primary confirms the write immediately and replicates in the background. Lower latency but replicas may lag — if the primary crashes, recent writes can be lost.

Semi-Synchronous

Wait for at least one replica to acknowledge. Balances durability and performance. Reduces the blast radius of primary failure to a single replica lag window.

TypeRPO (data loss)Write LatencyFailover
Synchronous0 msHighClean
AsynchronousSeconds to minutesLowPossible data loss
Semi-syncBounded lagMediumNear-zero loss

Network Partitions

A network partition occurs when nodes can no longer communicate with each other — even though the nodes themselves are alive. The red fault line above splits two sub-clusters. Packets (red dots) bounce back rather than crossing the partition.

Split-Brain

Each sub-cluster believes it is the primary and accepts writes — leading to conflicting data when the network heals. Detecting and resolving this is one of distributed systems' hardest problems.

Fencing & STONITH

"Shoot The Other Node In The Head" — ensure only one partition can write by physically fencing the other (power-cycling, removing network access). Prevents split-brain at hardware level.

Conflict Resolution

When partitions heal, conflicting writes must be merged. Strategies: last-write-wins, vector clocks, CRDTs (Conflict-Free Replicated Data Types), or application-level merge logic.

Before partition: [A] ←──→ [B] (both agree on x=1) During partition: [A] | | [B] A receives: x=2 B receives: x=3 After heal: ??? (conflict — who wins?)
Split-Brain Risk Vector Clocks CRDTs Last-Write-Wins

Observability

You can't fix what you can't see. Observability is the ability to understand the internal state of a distributed system from its external outputs. The three pillars — Logs, Metrics, and Traces — flow into the golden collector node above.

Logs

Timestamped, structured records of discrete events. Answer: "What happened?" Tool: Loki, Elasticsearch. Emit in JSON with trace-ID for correlation.

Metrics

Numeric measurements over time (counters, gauges, histograms). Answer: "How fast / how many?" Tool: Prometheus, Mimir. Drive dashboards and alerts.

Traces

Context-propagated, causally-linked spans across service boundaries. Answer: "Why was this slow?" Tool: Jaeger, Tempo, OpenTelemetry. Track a request end-to-end.

SignalCardinalityStorage CostLatency Insight
LogsHighHighVerbose, exact context
MetricsLowLowTrends & rate
TracesMediumMediumRequest-level bottlenecks
OpenTelemetry (OTEL) Prometheus Jaeger / Tempo Grafana

Service Mesh

A service mesh is a dedicated infrastructure layer for service-to-service communication. Each service (blue orb) has a sidecar proxy (gold orb) injected alongside it. All network traffic flows through these proxies — enabling mTLS, retries, circuit breaking, and traffic shaping without any changes to application code.

mTLS — Zero Trust

Mutual TLS ensures both sides authenticate each connection. Even inside a private network, every service proves its identity. Certificates rotate automatically.

Circuit Breaker

When a downstream service fails repeatedly, the proxy opens a circuit and stops sending requests — preventing cascade failures. Retries with exponential back-off are applied first.

Traffic Control

Split traffic by weight (canary deploys), by header (A/B testing), or by fault injection (chaos testing). All configured at the mesh level, not in service code. Tools: Istio, Linkerd, Cilium.

Service A → [Envoy Proxy] ══mTLS══ [Envoy Proxy] → Service B ↓ ↓ Metrics/Traces Metrics/Traces ↓ ↓ [Control Plane — Istiod / xDS API]
Istio Linkerd Cilium Envoy Proxy mTLS

Consistent Hashing

Traditional sharding uses key mod N — any server change remaps almost every key. Consistent hashing places both keys and servers on a circular ring so only 1/N keys move when topology changes. Above: the white dot is a key traversing the ring clockwise, landing on the first server it encounters. Colored arcs show each server's ownership range.

The Ring

Keys and servers both map to positions on a [0, 2π) circle via a hash function. A key is owned by the first server encountered clockwise from its position — no central coordinator needed.

Virtual Nodes

Each physical server owns multiple virtual tokens scattered across the ring. This evens load distribution and means a failure redistributes tokens across many neighbors — not just one — preventing hotspots.

Minimal Disruption

A joining server claims tokens only from its clockwise neighbor. A departing server passes its range to its successor. Only K/N keys move — no full reshuffle, no downtime.

Ring: [ 0 ─── Server A ─── Server B ─── Server C ─── Server D ─── 2π ] hash("user:123") = 0.41 → owned by Server B (next clockwise) hash("tx:456") = 0.78 → owned by Server C Server B leaves: range [A..B] transfers to Server C Only ~1/N keys remapped — no full rehash needed.
Cassandra DynamoDB Redis Cluster Chord DHT Riak

Load Balancing

A load balancer distributes inbound traffic across a pool of backends so no single server becomes a bottleneck. Above: the purple Load Balancer receives requests from clients (blue) and fans out to backends (cyan). Health checks probe each backend — failing nodes are automatically removed from rotation without manual intervention.

Algorithms

Round Robin — rotate through backends sequentially.
Least Connections — route to server with fewest open connections.
Weighted — assign more traffic to higher-capacity servers.
IP Hash — same client always hits the same backend (sticky sessions).

L4 vs L7

L4 (Transport): routes on IP + port, no HTTP awareness — ultra-fast. Example: AWS NLB, HAProxy TCP.
L7 (Application): routes on HTTP headers, paths, cookies — enables canary deploys, A/B testing, URL-based routing. Example: AWS ALB, nginx, Envoy.

Health & Failover

LBs probe backends periodically (HTTP 200 on /health). N consecutive failures pulls a backend from the pool. Passive checks also monitor real-traffic error rates — catching slow-to-respond backends before they cascade.

AlgorithmBest ForLimitation
Round RobinUniform stateless requestsIgnores server capacity
Least ConnectionsWebSockets, long-lived connectionsRequires LB state
IP HashSession-dependent appsUneven on small pools
RandomTruly stateless microservicesNo load awareness
AWS ALB / NLB HAProxy nginx Envoy Traefik

Message Queues & Event Streaming

Message queues decouple producers from consumers — a producer doesn't need to know who reads its messages, or if the consumer is even online. Above: producers (blue) push events into the Kafka broker (gold layered discs), which fans them out to consumer groups (cyan). Messages are persisted on disk and can be replayed from any historical offset.

Delivery Guarantees

At most once — fire & forget. May lose messages, no duplicates.
At least once — retry until acked. May duplicate, never loses.
Exactly once — idempotent producers + transactional consumers. Available in Kafka 0.11+. Highest overhead.

Consumer Groups

A consumer group distributes partitions across its members — each partition owned by exactly one group member at a time. Multiple independent groups can read the same topic simultaneously at different offsets, enabling fan-out to analytics, audit, and downstream services.

Queue vs Stream

Queue (RabbitMQ, SQS): push-based, messages deleted on consumption. Good for task distribution.
Stream (Kafka, Kinesis): pull-based, messages retained for days. Consumers track their own offset — supports replay, event sourcing, and time-travel debugging.

Topic: "orders" (3 partitions, replication-factor=3) +-------------------------------------------+ | Partition 0: [msg0] [msg3] [msg6] ... | <- Consumer Group A: Worker 1 | Partition 1: [msg1] [msg4] [msg7] ... | <- Consumer Group A: Worker 2 | Partition 2: [msg2] [msg5] [msg8] ... | <- Consumer Group A: Worker 3 +-------------------------------------------+ Consumer Group B: reads all partitions (analytics pipeline, own offset) Consumer Group C: replays from offset 0 (new service catching up)
Apache Kafka RabbitMQ AWS SQS / SNS Google Pub/Sub NATS