SYSTEM DESIGN · RESEARCH REFERENCE

18 Case Studies for
Software Engineers

Research papers, architecture deep-dives, and academic resources for every major system design topic — plus distributed database systems.

18
Case Studies
80+
Research Papers
19
Topic Areas
01
URL Shortener
Hashing, consistent hashing, key-value storage at scale
02
Meta Serverless (11.5M calls/sec)
Serverless architectures, FaaS, cold-start optimization
03
Google Docs
OT, CRDTs, collaborative editing, real-time sync
04
Apache Kafka
Distributed log, pub-sub, stream processing
05
Stock Exchange
Order matching engines, ultra-low latency, financial systems
06
Uber Driver Matching
Geospatial indexing, S2 cells, 1M req/sec
07
Reddit
Vote aggregation, hot rankings, feed generation at scale
08
Tinder
Geo-based matching, swipe mechanics, recommendation systems
09
Spotify
CDN for audio, recommendation ML, Backstage architecture
10
WhatsApp
Erlang/OTP, end-to-end encryption, 50B messages/day
11
Bluesky
AT Protocol, federated social, decentralized identity
12
Slack
WebSockets, channel fan-out, search at scale
13
Stripe Idempotent API
Exactly-once payments, idempotency keys, distributed transactions
14
ChatGPT
Transformer inference, token streaming, GPU orchestration
15
Twitter Timeline
Fan-out on write vs read, celebrity problem, caching
16
Amazon S3
Object storage, erasure coding, 11 nines durability
17
YouTube & MySQL (2.49B users)
Vitess, sharding, video transcoding pipeline
18
Scale to 10M Users on AWS
Auto-scaling, CDN, load balancing, database tiers
Distributed DB Systems
GFS, Dynamo, Spanner, Bigtable, Raft, CRDT, consensus
01

How URL Shortener Works

Hashing Key-Value Store Consistent Hashing Redirection

System Overview

A URL shortener converts long URLs into short slugs (6–8 characters), stores the mapping in a fast key-value store, and issues an HTTP 301/302 redirect on lookup. Core challenges include generating unique short codes at high write throughput, serving lookups globally at low latency (sub-10ms), and handling read/write ratios of roughly 100:1 at scale.

Encoding
Base62 (a-z, A-Z, 0-9)
Short Code Length
7 chars = 3.5 trillion URLs
Redirect Type
301 (permanent) vs 302
Storage
Cassandra / DynamoDB / Redis
ID Generation
Snowflake / ZooKeeper range
Write Rate
~100M new URLs/day (bit.ly)

Key Design Decisions

301 vs 302 Redirect: 301 (permanent) is cached by browsers, reducing load on servers but breaking analytics. 302 (temporary) routes every request through your servers, enabling click tracking but increasing server load.

Approach 1 — Counter-based: Use a global distributed counter (ZooKeeper, Redis INCR), convert to Base62. Simple but requires coordination.
Approach 2 — Hash-based: Hash the long URL with MD5/SHA-256 and take the first 7 characters. Collision probability grows with volume (the birthday problem), so collisions must be detected and resolved, e.g., by re-hashing with a salt.
Approach 3 — Snowflake ID: 64-bit ID from datacenter + machine + timestamp + sequence. Decentralized, collision-free, sortable.
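
A minimal sketch of the counter-based approach: assuming a unique integer ID is already issued by a counter or Snowflake-style generator, Base62 encoding turns it into a short slug (helper names are illustrative, not from any particular shortener).

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"  # Base62

    def encode_base62(n: int) -> str:
        """Convert a non-negative integer ID into a Base62 short code."""
        if n == 0:
            return ALPHABET[0]
        out = []
        while n > 0:
            n, rem = divmod(n, 62)
            out.append(ALPHABET[rem])
        return "".join(reversed(out))

    def decode_base62(code: str) -> int:
        """Invert encode_base62 so a short code maps back to its ID."""
        n = 0
        for ch in code:
            n = n * 62 + ALPHABET.index(ch)
        return n

    # Example: ID 1_000_000_007 becomes a 6-character slug and round-trips cleanly.
    slug = encode_base62(1_000_000_007)
    assert decode_base62(slug) == 1_000_000_007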

Research Papers & Key Resources

Consistent Hashing and Random Trees
Karger et al. · MIT · STOC 1997
Foundation for how to distribute URL keys across nodes with minimal remapping when nodes join/leave. Directly applicable to sharding URL stores. Introduced the concept of virtual nodes on a ring.
→ ACM DL
Chord: A Scalable Peer-to-peer Lookup Service
Stoica et al. · MIT · SIGCOMM 2001
Distributed hash table used to map URL keys to storage nodes. O(log N) lookups. Heavily cited in distributed key-value design for URL shorteners.
→ Free PDF
Dynamo: Amazon's Highly Available Key-Value Store
DeCandia et al. · Amazon · SOSP 2007
The blueprint for key-value stores used in URL shortening backends. Covers consistent hashing, vector clocks, and eventual consistency — exactly what large-scale URL systems use.
→ Free PDF
Snowflake: A Distributed Unique ID Generator
Twitter Engineering Blog · 2010
The canonical approach to generating unique, roughly sortable 64-bit IDs without coordination, used in many URL shortener ID generation strategies.
→ Twitter Blog
02

How Meta Handles 11.5 Million Serverless Function Calls per Second

FaaS Serverless Cold Start Edge Computing

System Overview

Meta's serverless platform (used for content ranking, ads, and feed computation) runs thousands of isolated function workers, each responding to event triggers. At 11.5M calls/second the platform needs near-zero cold-start overhead, automatic scaling, fault isolation, and efficient bin-packing of compute resources. Meta's internal FaaS differs from AWS Lambda in that it uses persistent warm workers and aggressive pre-warming based on traffic prediction.

Function Calls/sec
11.5 Million
Runtime
Hack (PHP-derived), Python
Isolation
Container + cgroups
Scheduling
Tupperware (internal)

Key Insight: The Cold Start Problem

Cold start latency (50–500ms) is the #1 enemy of serverless at scale. Meta solves it via pre-warming (keep N idle workers hot based on traffic forecast), snapshot restore (fork from a frozen process image), and co-location with data dependencies to minimize network round trips.
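
A minimal sketch of the pre-warming idea, assuming a traffic forecast is available from some prediction service (hypothetical function and parameter names, not Meta's actual system): size the warm pool from Little's law, arrival rate times service time, plus headroom for bursts.

    import math

    def warm_pool_target(forecast_rps: float, avg_exec_ms: float, headroom: float = 1.3) -> int:
        """Little's law: average concurrency ~= arrival rate x service time."""
        concurrency = forecast_rps * (avg_exec_ms / 1000.0)
        return math.ceil(concurrency * headroom)   # spare warm workers absorb bursts

    # A function forecast at 20k calls/sec with a 15 ms average runtime needs
    # roughly 20000 * 0.015 * 1.3 = 390 workers kept resident and warm.
    print(warm_pool_target(20_000, 15))            # -> 390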

Research Papers & Key Resources

Serverless Computing: One Step Forward, Two Steps Back
Hellerstein et al. · UC Berkeley · CIDR 2019
Critical analysis of serverless limitations: poor for stateful apps, high cold starts, communication overhead. Sets the research agenda that Meta's platform attempts to solve.
→ Free PDF
SOCK: Rapid Task Provisioning with Serverless-Optimized Containers
Oakes et al. · UW-Madison · USENIX ATC 2018
Introduces optimized container startup for serverless functions. Achieves 18x lower cold-start latency using lean containers. Directly relevant to Meta's approach.
→ Free PDF
Firecracker: Lightweight Virtualization for Serverless Applications
Agache et al. · Amazon · NSDI 2020
AWS Lambda's microVM approach. 125ms boot time, 5MB overhead per VM. Foundation for modern FaaS isolation. Comparable to Meta's container strategy.
→ Free PDF
Faasm: Lightweight Isolation for Efficient Stateful Serverless Computing
Shillaker & Pietzuch · Imperial College · USENIX ATC 2020
WebAssembly-based isolation for serverless with shared memory regions. Pushes towards stateful serverless, a key concern at Meta's scale.
→ Free PDF
Twine: A Unified Cluster Management System for Shared Infrastructure
Tang et al. · Meta · OSDI 2020
Meta's actual cluster management system that underlies their serverless infrastructure. Describes how they schedule millions of tasks across fleets.
→ USENIX
03

How Google Docs Works

OT / CRDT Real-time Collaboration WebSocket Conflict Resolution

System Overview

Google Docs uses Operational Transformation (OT) to synchronize edits from multiple users in real time. Every character insertion/deletion is represented as an operation that gets transformed against concurrent operations before being applied. The server is the single source of truth and serializes all operations. Google open-sourced a closely related OT implementation as part of Google Wave.

Algorithm
Operational Transformation (OT)
Connection
WebSocket + Long Polling
Storage
Bigtable + Colossus (GFS2)
Alternative
CRDTs (used by Figma, Notion)

OT vs CRDTs

OT requires a central server to serialize operations — simpler reasoning but a bottleneck. CRDTs (Conflict-free Replicated Data Types) are commutative by design and need no coordinator, making them suitable for offline-first and P2P collaboration. Both aim to satisfy convergence (all replicas eventually reach the same state) and intention preservation.
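
A minimal sketch of one of the simplest CRDTs, the grow-only counter (G-Counter): each replica increments only its own slot, and merging takes the element-wise maximum, so replicas converge regardless of delivery order, with no central server required.

    class GCounter:
        """Grow-only counter CRDT: one slot per replica, merge = element-wise max."""
        def __init__(self, replica_id: str):
            self.replica_id = replica_id
            self.counts = {}                       # replica_id -> local increments

        def increment(self, n: int = 1):
            self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

        def value(self) -> int:
            return sum(self.counts.values())

        def merge(self, other: "GCounter"):
            # Commutative, associative, idempotent: any merge order converges.
            for rid, c in other.counts.items():
                self.counts[rid] = max(self.counts.get(rid, 0), c)

    # Two replicas increment concurrently, then exchange state in either order.
    a, b = GCounter("a"), GCounter("b")
    a.increment(3); b.increment(2)
    a.merge(b); b.merge(a)
    assert a.value() == b.value() == 5

Sequence CRDTs for collaborative text (as used in Figma or Notion) are far more involved, but the convergence argument is the same.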

Research Papers & Key Resources

Operational Transformation in Real-Time Group Editors
Ellis & Gibbs · MCC · CSCW 1989
The original OT paper. Foundational to Google Docs architecture. Introduces the GROVE (Group Outline Viewing and Editing) system. All collaborative editor systems reference this work.
Google Wave Operational Transformation
Wang et al. · Google · 2010
Google's whitepaper on the OT algorithm used in Wave and subsequently Google Docs. Covers transformation functions for rich text documents.
→ Free PDF
CRDTs: Consistency Without Concurrency Control
Shapiro et al. · INRIA · 2011
Comprehensive introduction to Conflict-free Replicated Data Types — the alternative to OT used in Figma, Notion, and offline-first apps. Covers G-Counter, PN-Counter, LWW-Element-Set.
→ Free PDF
LSEQ: an Adaptive Structure for Sequences in Distributed Collaborative Editing
Nédelec et al. · DocEng 2013
Solves the "tombstone" problem in sequence CRDTs. Highly cited in collaborative editing literature. Used in modern CRDT-based editors.
Real Differences Between OT and CRDT for Co-Editors
Sun & Sun · ACM CSCW 2020
Empirical comparison of OT vs CRDT approaches in production collaborative editors. Breaks down performance, consistency, and implementation complexity trade-offs.
04

How Apache Kafka Works

Distributed Log Pub-Sub Stream Processing Exactly-Once Semantics

System Overview

Kafka is a distributed commit log. Topics are divided into partitions, each an ordered, immutable sequence of records stored on disk. Producers append to partition tails; consumers read at their own pace using offsets. The key insight: Kafka's performance comes from sequential disk I/O (fast on all storage types), page cache reliance (avoids double buffering), and zero-copy transfer via sendfile() system call.

Original Publisher
LinkedIn (2011)
Throughput
Millions msgs/sec per broker
Retention
Time-based (default 7 days)
Coordination
ZooKeeper → KRaft (2.8+)
Delivery
At-least, At-most, Exactly-once
Replication
ISR (In-Sync Replicas)

The Log: Foundation of Modern Data Systems

Jay Kreps (Kafka co-creator) argues in "The Log: What every software engineer should know about real-time data's unifying abstraction" that the append-only log is the simplest yet most powerful abstraction in distributed systems — it unifies databases, stream processors, and messaging queues.
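
A toy model of that abstraction: an append-only partition addressed by offset, with consumers that own their read position (this is the idea, not Kafka's API).

    class Partition:
        """Toy Kafka partition: an ordered, append-only list addressed by offset."""
        def __init__(self):
            self.records = []

        def append(self, record: bytes) -> int:
            self.records.append(record)
            return len(self.records) - 1           # the record's offset

        def read(self, offset: int, max_records: int = 100):
            return self.records[offset:offset + max_records]

    class Consumer:
        """Consumers track their own offset, so they can replay the log at will."""
        def __init__(self, partition: Partition):
            self.partition, self.offset = partition, 0

        def poll(self, n: int = 10):
            batch = self.partition.read(self.offset, n)
            self.offset += len(batch)              # committing would persist this offset
            return batch

    p = Partition()
    for i in range(5):
        p.append(f"event-{i}".encode())
    c = Consumer(p)
    print(c.poll())                                # the first five records
    c.offset = 0                                   # rewinding the offset replays the log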

Research Papers & Key Resources

Kafka: A Distributed Messaging System for Log Processing
Kreps, Narkhede, Rao · LinkedIn · NetDB 2011
The original Kafka paper. Describes the pull-based consumer model, sequential disk access, zero-copy data transfer, and the design decisions that gave Kafka orders-of-magnitude better throughput than ActiveMQ and RabbitMQ. Essential reading.
→ Free PDF
Samza: Stateful Scalable Stream Processing at LinkedIn
Noghabi et al. · LinkedIn · VLDB 2017
LinkedIn's stream processing system built on Kafka. Covers stateful operators, fault tolerance via checkpointing, and how Kafka's log enables replay-based recovery.
→ Free PDF
Exactly-Once Semantics Are Possible: Here's How Kafka Does It
Narkhede · Confluent Engineering Blog · 2017
Explains the idempotent producer + transactional API added in Kafka 0.11. Two-phase commit within Kafka's log. Critical for payment and financial pipelines.
→ Blog
The Log: What Every Software Engineer Should Know
Kreps · LinkedIn Engineering · 2013
Seminal blog post (read as a paper) explaining logs as a unifying abstraction. Explains how databases, distributed systems, and Kafka all reduce to append-only logs.
→ Blog
Apache Flink: Stream and Batch Processing in a Single Engine
Carbone et al. · TU Berlin · IEEE Data Eng. Bulletin 2015
Flink is the most common stream processor on top of Kafka. This paper covers stateful stream operators, event time processing, and exactly-once checkpointing.
→ Free PDF
05

How Stock Exchange Works

Order Matching Low Latency LMAX Disruptor FIX Protocol

System Overview

A stock exchange is one of the most latency-sensitive systems in the world. NYSE processes ~20M trades/day with microsecond-scale matching. The order book (bid/ask) is a priority queue maintained in memory. The matching engine is single-threaded to avoid locks. Modern exchanges use FPGAs and kernel-bypass networking (DPDK, RDMA) to push latencies into the nanosecond-to-microsecond range.

Matching Engine
Single-threaded, in-memory
Order Book DS
Price-time priority (heap/tree)
Latency Target
<10µs (NASDAQ), <1µs (HFT)
Network
Kernel-bypass (DPDK, RDMA)
Protocol
FIX, ITCH, OUCH
Persistence
LMAX Disruptor ring buffer
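
A minimal price-time-priority matching sketch using two heaps. This is only for intuition about the data structure; a production engine is a single-threaded loop over pre-allocated structures (LMAX-style), not per-order heap allocations.

    import heapq, itertools

    seq = itertools.count()            # arrival order breaks price ties (time priority)
    bids, asks = [], []                # bids: max-heap via negated price; asks: min-heap

    def submit(side: str, price: float, qty: int):
        """Insert an order, then match while the best bid crosses the best ask."""
        if side == "buy":
            heapq.heappush(bids, (-price, next(seq), [qty]))
        else:
            heapq.heappush(asks, (price, next(seq), [qty]))
        while bids and asks and -bids[0][0] >= asks[0][0]:
            _, _, bid_qty = bids[0]
            ask_price, _, ask_qty = asks[0]
            traded = min(bid_qty[0], ask_qty[0])
            print(f"trade {traded} @ {ask_price}")   # toy simplification: fill at the ask
            bid_qty[0] -= traded; ask_qty[0] -= traded
            if bid_qty[0] == 0: heapq.heappop(bids)
            if ask_qty[0] == 0: heapq.heappop(asks)

    submit("sell", 101.0, 50)
    submit("sell", 100.5, 30)
    submit("buy", 101.0, 60)           # fills 30 @ 100.5, then 30 @ 101.0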

Research Papers & Key Resources

LMAX Disruptor: High Performance Alternative to Bounded Queues
Thompson et al. · LMAX Exchange · 2011
The architecture behind LMAX Exchange — 6 million orders/sec on a single thread. Uses a pre-allocated ring buffer with memory barriers instead of locks. Revolutionary for financial systems engineering.
→ White Paper
Mechanical Sympathy: The Art of Low-Latency Architecture
Martin Thompson · GOTO Conference 2012
Discusses CPU cache lines, false sharing, and memory access patterns. Essential for understanding why exchange matching engines are designed the way they are.
Design Patterns for Low Latency Applications with Intel® DPDK
Intel · 2014
Kernel-bypass networking for sub-microsecond packet processing. Used by exchanges to eliminate OS overhead from the network stack.
→ Intel
High Frequency Trading: Market Microstructure and Economics
Aldridge · John Wiley & Sons · 2013 (Book)
Covers the economic motivation behind HFT and the technical architecture needed. Essential context for why exchanges push to nanosecond latency.
06

How Uber Finds Nearby Drivers at 1 Million Requests/Second

Geospatial S2 Geometry Real-time Location Dispatch System

System Overview

Uber's dispatch system must find the nearest available driver within milliseconds. The naive approach (iterate all drivers) fails at scale. Uber uses Google's S2 library which maps Earth's surface to 1D using a Hilbert space-filling curve (preserves spatial locality). Each driver location is stored in a key-value store indexed by S2 cell ID. Nearby search = range scan on adjacent cell IDs.

Spatial Index
Google S2 (Hilbert Curve)
Location Update
Every 4 seconds per driver
Storage
Redis (in-memory, geo commands)
Peak Load
1M requests/second
Architecture
DISCO / Ringpop (old); Cadence
DB
Schemaless (MySQL → Cassandra)
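
A simplified sketch of the cell-ID trick, using a flat lat/lng grid in place of real S2/Hilbert cells so it stays self-contained; the point is that "find nearby drivers" collapses to looking up a handful of adjacent cell keys rather than scanning every driver.

    from collections import defaultdict

    CELL_DEG = 0.01                        # ~1 km grid cell; a stand-in for an S2 level

    def cell_id(lat: float, lng: float) -> tuple:
        return (int(lat // CELL_DEG), int(lng // CELL_DEG))

    drivers_by_cell = defaultdict(set)     # cell -> driver ids (Redis/KV in production)

    def update_location(driver_id: str, lat: float, lng: float):
        drivers_by_cell[cell_id(lat, lng)].add(driver_id)

    def nearby_drivers(lat: float, lng: float, ring: int = 1):
        """Scan the rider's cell plus its neighbours instead of every driver."""
        cx, cy = cell_id(lat, lng)
        found = set()
        for dx in range(-ring, ring + 1):
            for dy in range(-ring, ring + 1):
                found |= drivers_by_cell.get((cx + dx, cy + dy), set())
        return found

    update_location("d1", 37.7750, -122.4194)
    update_location("d2", 37.8000, -122.5000)      # several cells away
    print(nearby_drivers(37.7753, -122.4190))      # -> {'d1'}

Real S2 cells additionally preserve locality along the Hilbert curve, so "adjacent cells" becomes a small number of contiguous ID ranges that a key-value store can range-scan.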

Research Papers & Key Resources

Space-Filling Curves: An Introduction With Applications in Scientific Computing
Bader · Springer · 2012 (Book)
Mathematical foundation for Hilbert curves used in S2. Understanding this is key to grasping why Uber's spatial index is efficient — locality-preserving mapping collapses 2D to 1D.
Uber's Schemaless: A Scalable MySQL-Based Key-Value Store
Uber Engineering Blog · 2016
Describes how Uber stored driver/trip data at scale using MySQL with a denormalized key-value layer. Evolved into the Docstore system (also built on MySQL).
→ Uber Blog
Ringpop: Scalable Fault-Tolerant Application-Layer Sharding
Uber Engineering · 2015
Uber's consistent-hash ring for distributing driver location ownership across their dispatch fleet. Enables Uber to shard "who owns which S2 cell" across thousands of servers.
→ Uber Blog
H3: Uber's Hexagonal Hierarchical Spatial Index
Brodsky · Uber Open Source · 2018
Uber's open-source replacement for S2. Uses hexagonal grids (better for distance approximations) at 16 resolution levels. Now used for surge pricing, demand forecasting, and driver matching.
→ Uber Blog
Reliable, Scalable Dispatching Architecture: Uber's Dispatch System
Uber Engineering Blog · 2015
Covers the full dispatch pipeline from rider request to driver assignment. Discusses the Supply Service, Demand Service, and Dispatch Service architecture with Ringpop sharding.
→ Uber Blog
07

How Reddit Works

Vote Aggregation Feed Generation Memcached Hot Ranking

System Overview

Reddit's hotness algorithm combines score (upvotes − downvotes) with time decay. The ranking is recomputed periodically and cached aggressively in Memcached. Reddit uses PostgreSQL for persistent storage, Cassandra for vote aggregation, and Kafka for event streaming. The main scaling challenges are "vote bombing" (millions of votes on trending posts) and front-page cache invalidation.

Hot Algorithm
Log-scaled score + time decay
Primary DB
PostgreSQL (sharded)
Vote Store
Cassandra (eventual consistency)
Cache
Memcached (Mcrouter)
Search
Elasticsearch
CDN
Fastly
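
For intuition, here is a close paraphrase of the hot-ranking function Reddit has open-sourced (the epoch constant and 45000-second divisor come from that public code): log-scaled net score plus a linear recency bonus, so every ~12.5 hours of age costs a post one factor of 10 in score.

    import time
    from math import log10

    REDDIT_EPOCH = 1134028003              # constant from Reddit's open-sourced ranking code

    def hot(ups: int, downs: int, posted_unix: float) -> float:
        s = ups - downs
        order = log10(max(abs(s), 1))      # 10 points vs 100 points differs by 1, not 90
        sign = 1 if s > 0 else (-1 if s < 0 else 0)
        seconds = posted_unix - REDDIT_EPOCH
        return round(sign * order + seconds / 45000, 7)

    now = time.time()
    # 45000 s = 12.5 h: a post that is 12.5 hours older needs ~10x the net score to tie.
    print(hot(1_000, 50, now - 45_000), hot(95, 0, now))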

Research Papers & Key Resources

Reddit's Architecture: Scaling a Community Platform
Reddit Engineering Blog · Various Posts 2012–2020
Reddit's engineering blog covers their PostgreSQL sharding strategy, move to AWS, Cassandra adoption for votes, and Kafka for event streaming. Practical case study in iterative scaling.
→ Reddit Eng
How Not to Sort by Average Rating (Wilson Score)
Miller · 2009
Explains the Wilson score confidence interval, the statistically sound way to rank items by positive/negative ratings at varying sample sizes. Reddit uses it for its "best" comment sort (the front-page hot ranking uses log-score plus time decay instead).
→ Article
Scaling Memcache at Facebook
Nishtala et al. · Facebook · NSDI 2013
Facebook's battle-tested Memcached deployment (analogous to Reddit's Mcrouter). Covers lease-based cache invalidation, regional replication, and thundering herd prevention — all directly applicable to Reddit's caching layer.
→ Free PDF
Cassandra: A Decentralized Structured Storage System
Lakshman & Malik · Facebook · LADIS 2009
The Cassandra paper — used by Reddit for vote storage. Wide-column, tunable consistency. Combines Dynamo's distribution model with BigTable's data model.
→ Free PDF
08

How Tinder Works

Geo Matching Recommendation ELO Score Swipe Engine

System Overview

Tinder's core challenge: for a given user, return a stack of nearby profiles ordered by desirability, in real time, from a pool of 50M+ active users. Tinder uses a geo-sharded user store (Geosharded MongoDB initially, then Cassandra), an ELO-like attractiveness score for ranking, and a Recs service that pre-computes recommendation stacks. When two users both right-swipe, they match.

Geo Index
S2 Cells / Geohash
Ranking
ELO score (chess-derived)
Storage
Cassandra + Redis
Recs
Pre-computed, cached in Redis
Images
S3 + CloudFront CDN
Streaming
Kafka for match events
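
Tinder has said it used an Elo-like "desirability" score before moving to ML-based ranking; a standard Elo update applied to swipes might look like the sketch below, treating a right-swipe received as a "win" for the swiped profile (the k-factor and ratings are illustrative, not Tinder's actual values).

    def expected(r_a: float, r_b: float) -> float:
        """Probability that profile A receives the positive outcome against B."""
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def elo_update(r_swiped: float, r_swiper: float, liked: bool, k: float = 16) -> float:
        """A like from a highly rated user moves your score more than one from a low-rated user."""
        outcome = 1.0 if liked else 0.0
        return r_swiped + k * (outcome - expected(r_swiped, r_swiper))

    r = 1500.0
    r = elo_update(r, 1900.0, liked=True)      # like from a high-rated user: large gain
    r = elo_update(r, 1200.0, liked=False)     # pass from a low-rated user: large loss
    print(round(r, 1))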

Research Papers & Key Resources

Tinder's Architecture: Petabytes of Data, Billions of Swipes
Tinder Engineering Blog · 2020
Direct technical blog post from Tinder engineers covering their petabyte-scale architecture, Cassandra adoption, and how they moved from monolith to microservices.
→ Tinder Blog
Collaborative Filtering for Implicit Feedback Datasets
Hu, Koren, Volinsky · ICDM 2008
Foundation for recommendation systems. Tinder's "Smart Photos" and profile ordering use collaborative filtering variants. Understanding implicit feedback (swipes) vs explicit ratings is critical.
Real-Time Machine Learning for Recommendations at Tinder
Tinder Engineering · 2020
Describes how Tinder moved from ELO to ML-based scoring. Uses online learning with user swipe feedback to improve recommendation stacks in real time.
→ Blog
ELO Rating System in Competitive Environments
Elo, A.E. · Original Chess System · 1960s
Tinder publicly acknowledged using an ELO-like system (called "desirability score") before switching to ML. Understanding the original ELO math is essential context.
09

How Spotify Works

Audio CDN Recommendation ML Backstage Microservices

System Overview

Spotify serves 100M+ songs to 600M+ users. Audio files are stored on GCS (Google Cloud Storage) and served via CDN with adaptive bitrate streaming. Spotify's recommendation engine (Discover Weekly, Daily Mix) combines collaborative filtering (matrix factorization on listen history), NLP over playlist and music-related text, and deep-learning audio analysis. The backend runs on GCP with 800+ microservices.

Audio Storage
Google Cloud Storage
Streaming
OGG Vorbis (bitrate adaptive)
Reco Engine
Matrix Factorization + NLP
Event Stream
Kafka + BigQuery
Service Catalog
Backstage (open-sourced 2020)
ML Platform
Kubeflow + Vertex AI
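
The "playlists as sentences, tracks as words" idea can be sketched with an off-the-shelf Word2Vec implementation (assuming the gensim package; the track IDs and playlists below are made up):

    from gensim.models import Word2Vec

    # Toy co-occurrence data: two electronic playlists and two jazz playlists.
    playlists = [
        ["trk_daft_punk", "trk_justice", "trk_moderat"],
        ["trk_daft_punk", "trk_justice", "trk_chemical_bros"],
        ["trk_miles_davis", "trk_coltrane", "trk_bill_evans"],
        ["trk_coltrane", "trk_bill_evans", "trk_miles_davis"],
    ]

    # Skip-gram embeddings: tracks that co-occur in playlists end up with nearby vectors.
    model = Word2Vec(sentences=playlists, vector_size=32, window=3,
                     min_count=1, sg=1, epochs=200, seed=7)

    # Nearest neighbours of a track should come from the same "genre" cluster.
    print(model.wv.most_similar("trk_daft_punk", topn=2))

At Spotify's scale the same idea runs over billions of playlists, and the resulting vectors feed Related Artists, Radio, and candidate generation for Discover Weekly.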

Research Papers & Key Resources

Deep Content-Based Music Recommendation
van den Oord et al. · Ghent University · NeurIPS 2013
Spotify's approach to cold-start problem: use deep CNNs on raw audio spectrograms to predict latent factors, enabling recommendations for new songs with no listen history.
→ Free PDF
Word2Vec Applied to Recommendations: Spotify's Metadata Embeddings
Spotify Research · 2016
Spotify treats songs as "words" and playlists as "sentences," training Word2Vec to learn song embeddings. Similar songs have similar vectors. Powers Related Artists and Radio.
→ Blog
Podcast Recommendations at Spotify
Spotify Research · 2020
Extends recommendation to audio content with transcripts. Combines BERT-based text understanding with collaborative filtering signals. Describes the multi-modal recommendation architecture.
Backstage: Empowering Product Teams with a Service Catalog
Spotify Engineering · 2020 (Open Source)
Spotify's internal developer portal, open-sourced and now a CNCF project. Solved the "800+ microservices, no way to navigate them" problem via a software catalog and TechDocs.
→ Blog
10

How WhatsApp Works

Erlang/OTP E2E Encryption XMPP 50B msgs/day

System Overview

WhatsApp serves 50 billion messages/day with just ~50 engineers at acquisition (2014). The secret: Erlang's actor model gives each user a lightweight process (~2KB), enabling millions of concurrent connections per server. The Signal Protocol provides end-to-end encryption. Messages are persisted in Mnesia (Erlang's built-in distributed DB) and deleted after delivery. Media is stored in S3, not the message store.

Backend Language
Erlang (BEAM VM)
Protocol
Modified XMPP over WebSocket
Encryption
Signal Protocol (E2E)
Message DB
Mnesia (Erlang native)
Media
Amazon S3 (deleted post-download)
OS
FreeBSD (for high file handles)
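
Erlang itself is the key ingredient here, but the shape of the design can be illustrated in Python: dedicate one cheap, isolated unit of concurrency to each connection (asyncio tasks standing in for Erlang processes; this is an analogy, not WhatsApp's stack).

    import asyncio

    async def handle_user(reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
        """One lightweight task per connected client; a crash here stays isolated."""
        try:
            while data := await reader.readline():
                writer.write(b"ack:" + data)       # echo stands in for message routing
                await writer.drain()
        finally:
            writer.close()

    async def main():
        server = await asyncio.start_server(handle_user, "127.0.0.1", 8888)
        async with server:
            await server.serve_forever()

    # asyncio.run(main())   # uncomment to run; each new client gets its own task

Erlang goes further: preemptive scheduling, per-process heaps, supervision trees that restart crashed processes, and hot code reloading, which is part of why a small team could operate the service.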

Research Papers & Key Resources

The Signal Protocol: A Modern Cryptographic Standard
Marlinspike & Perrin · Open Whisper Systems · 2016
Technical specification for the Signal Protocol used in WhatsApp, Signal, and Facebook Messenger. Covers the Double Ratchet algorithm (combining X3DH key agreement with message ratcheting) for perfect forward secrecy.
→ Specification
Scaling Erlang for Messaging: WhatsApp's Approach
Fred Hébert · Erlang Factory · 2014
Explains how Erlang's actor model (gen_server processes per user) allows WhatsApp to handle 2.5M+ concurrent connections per node. Covers OTP supervision trees and hot code reloading.
→ Slides
End-to-End Encryption in Messaging: Technical and Policy Dimensions
Green et al. · 2020
Comprehensive survey of E2E encryption in messaging apps. Covers WhatsApp's multi-device challenges, key management, and the policy debates around backdoors.
Engineering WhatsApp for Two Billion Users
WhatsApp/Facebook Engineering · 2020
Blog post detailing the multi-device architecture introduced in 2021, where the same account can be active on multiple devices simultaneously, requiring a new key distribution model.
→ FB Engineering
11

How Bluesky Works

AT Protocol Federated Social DID (Decentralized ID) PDS / Relay / AppView

System Overview

Bluesky is built on the AT Protocol (Authenticated Transfer Protocol) — a federated social protocol where users own their data. Each user has a Personal Data Server (PDS) holding their signed repository (a Merkle tree of all posts/follows). A Relay ("Firehose") crawls all PDSes and emits a global event stream. AppViews (e.g., the Bluesky app) subscribe to the relay and build indexes for search/feeds. Users can migrate between PDSes without losing followers.

Identity
DID (Decentralized Identifier)
Data Model
Merkle DAG (signed repos)
Protocol
AT Protocol (atproto.com)
Components
PDS → Relay → AppView
Algorithms
Custom feed generators (open)
Inspiration
ActivityPub, SSB, Fediverse
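
A toy content-addressed Merkle tree shows why signing a single root commits a user to their whole repository (the real atproto repo is a Merkle Search Tree over CBOR records identified by CIDs; this sketch only illustrates the principle):

    import hashlib

    def cid(data: bytes) -> str:
        """Toy content identifier: a hash of the bytes stands in for a real CID."""
        return hashlib.sha256(data).hexdigest()

    def merkle_root(record_hashes: list) -> str:
        """Pair hashes level by level until one root commits to every record."""
        level = record_hashes or [cid(b"")]
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])            # duplicate the odd hash out
            level = [cid((a + b).encode()) for a, b in zip(level[::2], level[1::2])]
        return level[0]

    posts = [cid(b'{"text": "hello from my PDS"}'), cid(b'{"text": "second post"}')]
    root = merkle_root(posts)
    # Signing `root` with the key behind the user's DID vouches for the entire repo:
    # altering or dropping any record changes the root, so relays and AppViews can
    # verify the data they crawl without trusting the PDS that served it.
    print(root)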

Research Papers & Key Resources

The AT Protocol (atproto) Specification
Bluesky Team · 2023
The full technical specification for the Authenticated Transfer Protocol. Covers lexicons (schema language), repo structure (Merkle trees), DIDs, handle resolution, and the PDS/Relay/AppView architecture.
→ atproto.com
Decentralized Social Networks
Raman et al. · IMC 2019
Measurement study of Mastodon/Fediverse. Covers federation dynamics, content moderation challenges, and cross-instance data distribution — directly applicable to Bluesky's design decisions.
→ ACM DL
Decentralized Identifiers (DIDs) Specification
W3C · 2022
The W3C standard for DIDs used in Bluesky. A DID is a self-sovereign identifier cryptographically bound to a key pair. Enables portability without a central authority (similar to domain ownership).
→ W3C Spec
Merkle DAGs and Content-Addressed Storage
Benet · IPFS Whitepaper · 2014
IPFS whitepaper covering Merkle DAGs — the data structure Bluesky uses for signed user repositories. Content-addressing ensures data authenticity and enables verifiable history.
→ PDF
12

How Slack Works

WebSocket Channel Fan-out Message Search Presence

System Overview

Slack maintains persistent WebSocket connections to every active client. When a message is sent to a channel with 1000 members, Slack must fan out that message to all 1000 connections simultaneously. The "presence" system (online/typing indicators) generates 10x more traffic than messages themselves. Slack uses Vitess (MySQL sharding) for message storage and Elasticsearch for search across billions of messages.

Connection
WebSocket (persistent)
Message Store
MySQL + Vitess
Search
Elasticsearch
Cache
Memcached
Fan-out
Channel membership lookup → push
Job Queue
Redis
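
A minimal sketch of the write path: look up channel membership (served from an edge cache like Flannel in production, not the database), then push the message to every member's open connection concurrently. The data structures and names here are illustrative.

    import asyncio

    channel_members = {"C123": ["u1", "u2", "u3"]}                 # channel -> member ids
    connections = {uid: asyncio.Queue() for uid in ["u1", "u2"]}   # user -> open socket (u3 offline)

    async def send_message(channel_id: str, sender: str, text: str):
        """Fan one message out to every connected member of the channel."""
        payload = {"channel": channel_id, "user": sender, "text": text}
        pushes = [
            connections[uid].put(payload)
            for uid in channel_members.get(channel_id, [])
            if uid in connections            # offline members catch up on reconnect
        ]
        await asyncio.gather(*pushes)        # one concurrent push per socket

    async def demo():
        await send_message("C123", "u1", "standup in 5")
        print(await connections["u2"].get())

    asyncio.run(demo())

For a 1,000-member channel this loop runs 1,000 pushes per message, which is why membership lookups must come from cache and why presence updates (far chattier than messages) are aggregated before they are fanned out.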

Research Papers & Key Resources

Slack's Architecture: Moving to a Cellular Design
Slack Engineering · 2020
Describes Slack's "grid" architecture transition where each workspace's message store is isolated in its own cell (MySQL shard), enabling independent scaling and fault isolation per workspace.
→ Slack Eng
Vitess: Scaling MySQL for the Web
Sougou · YouTube/PlanetScale · 2011
Vitess provides MySQL sharding with connection pooling and query routing. Used by Slack and YouTube to scale MySQL horizontally. Covers resharding, vertical splits, and cross-shard queries.
→ Vitess Docs
Flannel: An Application-Level Edge Cache for Slack
Slack Engineering · 2019
Slack built an edge caching layer (Flannel) using Memcached to cache per-workspace session membership data, dramatically reducing database load during message fan-out to large channels.
→ Blog
Presence at Scale: Tracking Millions of Users
Slack Engineering · 2017
The presence system (green dot = online) generates enormous traffic. Slack's solution: gossip-based presence aggregation with hysteresis (delay offline status to avoid flicker).
13

How Stripe Prevents Double Payment Using Idempotent API

Idempotency Keys Exactly-Once Distributed Transactions Payment Systems

System Overview

In payment systems, network retries can cause duplicate charges. Stripe's solution: clients include an idempotency key (a UUID) with each request. Stripe stores the key + response in a database. If a retry comes in with the same key, Stripe returns the cached response without re-executing. The key must be scoped to a user account, stored before the charge, and garbage collected after 24 hours.

Idempotency Key
UUID, 24hr TTL
Storage
Database (before charge)
Retry Strategy
Exponential backoff + jitter
Consistency
Serializable isolation (Postgres)
Core DB
PostgreSQL + Vitess
Events
Kafka for async webhooks

The Danger of Retries Without Idempotency

In a distributed system, "did the request succeed?" is unknowable if you lose the connection after sending but before receiving the response. Without idempotency keys, every retry risks a double charge. The fix: generate the idempotency key client-side, store it server-side before acting, and return the same result for any retry with that key.
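
A server-side sketch of that recipe, using sqlite3 so it stays self-contained (table and helper names are hypothetical, not Stripe's schema): reserve the key first, execute the charge once, record the response, and replay the recorded response for any retry carrying the same key.

    import json, sqlite3, uuid

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE idempotency (key TEXT PRIMARY KEY, response TEXT)")

    def charge_card(amount_cents: int) -> dict:
        """Stand-in for the real side effect that must not be repeated."""
        return {"charge_id": str(uuid.uuid4()), "amount": amount_cents, "status": "succeeded"}

    def create_charge(idempotency_key: str, amount_cents: int) -> dict:
        try:
            # Reserve the key BEFORE acting; the primary key makes this atomic.
            db.execute("INSERT INTO idempotency (key, response) VALUES (?, NULL)", (idempotency_key,))
        except sqlite3.IntegrityError:
            row = db.execute("SELECT response FROM idempotency WHERE key = ?", (idempotency_key,)).fetchone()
            if row[0] is not None:
                return json.loads(row[0])          # duplicate request: no second charge
            raise RuntimeError("original request still in flight; retry later")
        result = charge_card(amount_cents)
        db.execute("UPDATE idempotency SET response = ? WHERE key = ?", (json.dumps(result), idempotency_key))
        db.commit()
        return result

    key = str(uuid.uuid4())                        # generated client-side, one per logical request
    first = create_charge(key, 2000)
    retry = create_charge(key, 2000)               # e.g. the client timed out and retried
    assert first == retry                          # same charge_id: the customer paid once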

Research Papers & Key Resources

Idempotency
Stripe Engineering Blog · Brandur Leach · 2017
The definitive blog post on idempotency in payments. Covers idempotency keys, database-level atomicity, API design patterns, and how to handle partial failures in multi-step payment workflows.
→ Stripe Blog
Exactly-Once Delivery and Transactional Messaging at Scale
Kafka Engineering · 2017
Explains the producer idempotency + transactions API in Kafka 0.11, enabling exactly-once message delivery — directly used in Stripe's event pipeline for webhook delivery.
→ Blog
Designing Data-Intensive Applications (Chapter 9: Consistency and Consensus)
Kleppmann · O'Reilly · 2017 (Book)
The industry's most-read book on distributed systems. Chapter 9 covers linearizability, distributed transactions, and exactly-once semantics — the theoretical foundation for idempotent APIs.
→ O'Reilly
Patterns for Distributed Transactions within a Microservices Architecture
Richardson · 2019
Covers the Saga pattern (choreography and orchestration) for multi-step financial workflows like a payment: reserve funds → charge card → update ledger → issue receipt. Each step must be idempotent.
→ Article
14

How ChatGPT Works

Transformer Inference Token Streaming RLHF GPU Clusters

System Overview

ChatGPT is a GPT-4-class autoregressive Transformer that generates responses token by token. Inference is computationally dominated by matrix multiplications on GPU tensors. To serve millions of requests, OpenAI runs massive GPU clusters (A100/H100) with tensor parallelism (split model layers across GPUs) and KV-cache (cache K/V attention matrices to avoid recomputation per token). Responses are streamed via Server-Sent Events (SSE).

Architecture
GPT (Transformer Decoder)
Training
RLHF (PPO + SFT)
Parallelism
Tensor + Pipeline + Data
Hardware
NVIDIA A100/H100 clusters
Optimization
KV Cache, FlashAttention
Streaming
Server-Sent Events (SSE)
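
A toy single-head decode loop shows why the KV cache matters: each new token projects only itself into keys/values and attends over everything cached so far, instead of recomputing attention inputs for the whole prefix every step (numpy, dimensions illustrative).

    import numpy as np

    rng = np.random.default_rng(0)
    d = 16                                         # head dimension (illustrative)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    k_cache, v_cache = [], []                      # grows by one entry per generated token

    def decode_step(x_new):
        """One autoregressive step: project only the new token, attend over the cache."""
        q = x_new @ Wq
        k_cache.append(x_new @ Wk)                 # O(1) new projection work per step,
        v_cache.append(x_new @ Wv)                 #   instead of O(sequence length)
        K, V = np.stack(k_cache), np.stack(v_cache)
        scores = K @ q / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                         # attention output for the new position

    for _ in range(5):                             # stand-ins for embedded tokens
        out = decode_step(rng.standard_normal(d))
    print(out.shape, len(k_cache))                 # (16,) 5

The price is memory: the cache grows linearly with context length per request, which is exactly what continuous-batching servers in the Orca/vLLM lineage are built to manage across many concurrent requests.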

Research Papers & Key Resources

Attention Is All You Need
Vaswani et al. · Google Brain · NeurIPS 2017
The Transformer paper — the architectural foundation of GPT, BERT, and all modern LLMs. Introduces multi-head self-attention, positional encoding, and a purely attention-based encoder-decoder architecture. Arguably the most important ML paper of the decade.
→ arXiv
Language Models are Few-Shot Learners (GPT-3)
Brown et al. · OpenAI · NeurIPS 2020
GPT-3 paper demonstrating in-context learning at scale. Directly precedes ChatGPT's architecture. Shows that scale alone (175B params) produces remarkable few-shot capabilities without fine-tuning.
→ arXiv
Training Language Models to Follow Instructions with Human Feedback (InstructGPT)
Ouyang et al. · OpenAI · NeurIPS 2022
The RLHF paper behind ChatGPT. Shows how SFT + reward model + PPO fine-tuning aligns LLMs with human preferences. A 1.3B InstructGPT model outperforms 175B GPT-3 on human evaluations.
→ arXiv
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Dao et al. · Stanford · NeurIPS 2022
Reduces the memory complexity of attention from O(n²) to O(n) by tiling GPU SRAM operations. Enables 2-4x faster training and inference for long-context models. Used by every major LLM serving system.
→ arXiv
Efficient Large Scale Language Modeling with Mixtures of Experts
Artetxe et al. · Meta AI · EMNLP 2021
MoE architecture (suspected in GPT-4) where only a subset of parameters activate per token. Achieves dense model quality at a fraction of the compute cost. Key to serving trillion-parameter models economically.
→ arXiv
Orca: A Distributed Serving System for Transformer-Based Generative Models
Yu et al. · USENIX OSDI 2022
Introduces continuous batching for LLM inference — serving multiple requests simultaneously, inserting new requests mid-batch. Used by vLLM and TGI. Dramatically improves GPU utilization.
→ USENIX
15

How Twitter Timeline Works

Fan-out on Write Celebrity Problem Redis Timeline Cache Hybrid Approach

System Overview

Twitter's timeline is a merge of tweets from all accounts you follow, sorted by recency. Twitter uses a hybrid fan-out: when a normal user tweets, their tweet is immediately pushed (fan-out on write) to all followers' Redis timeline caches. But for celebrities (Obama, Beyoncé — millions of followers), fan-out would take minutes. So Twitter uses fan-out on read for celebrities, merging their tweets at query time.

Timeline Cache
Redis (sorted set per user)
Fan-out (normal)
Write-time push to followers
Fan-out (celebrity)
Read-time merge
Tweet Store
Manhattan (Twitter's own KV)
Media
S3 + Fastly CDN
Search
Earlybird (Lucene-based)
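
A sketch of the hybrid policy, with plain dicts standing in for Redis sorted sets (the follower-count cutoff and structures are illustrative): normal accounts are fanned out at write time, celebrity tweets are skipped and merged in at read time.

    import heapq, itertools

    CELEB_THRESHOLD = 10_000                       # illustrative follower-count cutoff
    ts = itertools.count()                         # stands in for a timestamp/Snowflake ID

    followers = {"alice": ["bob"], "beyonce": ["bob"]}
    follower_counts = {"alice": 120, "beyonce": 50_000_000}
    home_timelines = {"bob": []}                   # per-user cache (Redis sorted set in production)
    celebrity_tweets = {}                          # kept only with the author, merged at read time

    def post_tweet(author: str, text: str):
        tweet = (next(ts), author, text)
        if follower_counts[author] >= CELEB_THRESHOLD:
            celebrity_tweets.setdefault(author, []).append(tweet)      # fan-out on read
        else:
            for f in followers.get(author, []):
                home_timelines.setdefault(f, []).append(tweet)         # fan-out on write

    def home_timeline(user: str, followed_celebs=("beyonce",), limit: int = 20):
        """Merge the precomputed timeline with followed celebrities' recent tweets."""
        merged = home_timelines.get(user, []) + [
            t for c in followed_celebs for t in celebrity_tweets.get(c, [])
        ]
        return heapq.nlargest(limit, merged)       # newest first by the ID/timestamp field

    post_tweet("alice", "shipping the new feature")
    post_tweet("beyonce", "hello, everyone")
    print(home_timeline("bob"))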

Research Papers & Key Resources

Rainbird: Realtime Analytics at Twitter
Twitter Engineering · 2011
Twitter's real-time counters system (like/retweet counts). Uses Cassandra with counter columns. Foundational to understanding Twitter's data infrastructure that serves the timeline system.
→ Twitter Blog
The Architecture Twitter Uses to Deal with 150M+ Active Users
High Scalability · 2013
Comprehensive breakdown of Twitter's 2013 architecture: Flock (social graph), Gizzard (sharding), Snowflake (ID generation), Manhattan (NoSQL), and the timeline fanout service.
→ Article
Earlybird: Real-Time Search at Twitter
Busch et al. · Twitter · ICDE 2012
Twitter's real-time search engine based on Lucene. Tweets are indexed in seconds. Uses an inverted index with 100% in-memory segments for recency, and disk for older tweets.
Manhattan: Twitter's Real-Time, Multi-Tenant Distributed Database
Twitter Engineering · 2014
Twitter's custom distributed NoSQL database replacing Cassandra for their most critical workloads. Covers their approach to multi-tenancy, strong consistency, and predictable latency.
→ Blog
16

How Amazon S3 Works

Object Storage Erasure Coding 11-Nine Durability Bitrot Detection

System Overview

S3 stores trillions of objects with 99.999999999% (11 nines) durability by distributing erasure-coded chunks across multiple Availability Zones. An object is split into k data chunks + m parity chunks using Reed-Solomon coding; any m failures can be tolerated. S3's metadata (bucket/key → object location) is stored in a separate distributed database. S3 uses "scrubbing" to detect and repair bitrot proactively.

Durability
99.999999999% (11 nines)
Coding
Reed-Solomon erasure coding
Replication
≥3 Availability Zones
Consistency
Strong read-after-write (2020)
Storage Classes
Standard, IA, Glacier, etc.
Objects Stored
~280 Trillion (2023 estimate)
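
S3's actual scheme is Reed-Solomon with several parity chunks spread across AZs, but a single XOR parity chunk (k data chunks, m = 1) is enough to show the mechanics: any one lost chunk is rebuilt from the survivors.

    def xor_bytes(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def encode(obj: bytes, k: int = 4):
        """Split into k equal data chunks plus one XOR parity chunk (toy m = 1 case)."""
        obj = obj.ljust(-(-len(obj) // k) * k, b"\0")        # pad to a multiple of k
        size = len(obj) // k
        chunks = [obj[i * size:(i + 1) * size] for i in range(k)]
        parity = chunks[0]
        for c in chunks[1:]:
            parity = xor_bytes(parity, c)
        return chunks + [parity]

    def rebuild(chunks: list, lost: int) -> bytes:
        """Recover the one missing chunk by XORing every surviving chunk."""
        survivors = [c for i, c in enumerate(chunks) if i != lost]
        out = survivors[0]
        for c in survivors[1:]:
            out = xor_bytes(out, c)
        return out

    stored = encode(b"object payload destined for multiple availability zones")
    assert rebuild(stored, lost=2) == stored[2]              # one lost chunk, fully recovered

Reed-Solomon generalizes this to m parity chunks tolerating any m simultaneous losses, which is what turns per-disk failure rates into eleven nines of object durability.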

Research Papers & Key Resources

Amazon S3 — How It Works (AWS Documentation)
Amazon Web Services · 2023
Official documentation covering bucket policies, storage classes, lifecycle rules, and replication. The most authoritative source on S3's operational behavior and architecture decisions.
→ AWS Docs
The Google File System
Ghemawat, Gobioff, Leung · Google · SOSP 2003
GFS inspired S3's chunked storage model. First paper to propose treating component failures as the norm, not the exception. Pioneered chunk replication and a master for metadata. Essential reading for object storage.
→ Free PDF
Erasure Coding in Windows Azure Storage
Huang et al. · Microsoft · USENIX ATC 2012
Azure's erasure coding approach (similar to S3). Reduces storage overhead by 1.33x vs 3x replication while maintaining durability. Describes the Local Reconstruction Code (LRC) design.
→ Free PDF
Building and Operating a Large Distributed Storage System
Amazon re:Invent Talks · 2022
S3 Principal Engineer Marc Brooker's talks on S3's shuffle sharding, availability design, and how they achieve 11 nines durability. Includes the "shuffle sharding" technique to limit blast radius.
→ Werner's Blog
17

How YouTube Supports 2.49 Billion Users with MySQL

Vitess / MySQL Video Transcoding CDN Bigtable

System Overview

YouTube is one of the world's largest MySQL deployments, made possible by Vitess — a sharding middleware that makes MySQL horizontally scalable. Uploaded videos go through a transcoding pipeline on Google's internal infrastructure that produces hundreds of format/resolution renditions per video. Video chunks are served from Google's CDN. YouTube uses Bigtable for view counts and recommendation features, and Spanner for globally consistent data.

Primary DB
MySQL + Vitess (sharding)
Video Storage
Google Colossus (GFS2)
Transcoding
500+ format/resolution combos
Streaming
DASH / HLS (adaptive bitrate)
Counts
Bigtable (approx. at scale)
Reco
Deep Neural Nets (DNN-based)
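
A sketch of the keyspace-ID idea behind Vitess-style sharding (the ranges, shard names, and hash function here are illustrative; Vitess uses its own hash vindexes): hash the sharding key into a 64-bit keyspace and route each query to the shard whose range covers it.

    import bisect, hashlib

    # Four shards covering the 64-bit keyspace-ID space, named by hex range.
    SHARD_UPPER_BOUNDS = [0x4000000000000000, 0x8000000000000000,
                          0xC000000000000000, 0x10000000000000000]
    SHARD_NAMES = ["-40", "40-80", "80-c0", "c0-"]

    def keyspace_id(sharding_key: str) -> int:
        """Hash the sharding key (e.g. a channel or user id) into the keyspace."""
        return int.from_bytes(hashlib.sha256(sharding_key.encode()).digest()[:8], "big")

    def shard_for(sharding_key: str) -> str:
        kid = keyspace_id(sharding_key)
        return SHARD_NAMES[bisect.bisect_right(SHARD_UPPER_BOUNDS, kid)]

    # Every row for a given channel hashes to the same shard, so the common
    # queries (all videos of a channel, comments keyed by channel) stay single-shard.
    for channel in ("channel_veritasium", "channel_mkbhd"):
        print(channel, "->", shard_for(channel))

Resharding then means splitting a range (e.g. "40-80" into "40-60" and "60-80") and moving only the rows whose keyspace IDs fall in the moved half.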

Research Papers & Key Resources

Vitess: Scalable MySQL for the Web
Sougou · YouTube · VLDB 2012
Original paper on Vitess by its YouTube creator. Describes connection pooling, query routing to shards, row-based replication management, and how it makes MySQL horizontally scalable without application changes.
→ Vitess
Deep Neural Networks for YouTube Recommendations
Covington, Adams, Sargin · Google · RecSys 2016
YouTube's 2-stage recommendation system: candidate generation (recall ~100 videos from millions) + ranking (precision scoring). Uses DNN with user watch history embeddings. One of the most-cited recommendation papers.
→ Google Research
Bigtable: A Distributed Storage System for Structured Data
Chang et al. · Google · TOCS 2008
Bigtable powers YouTube's view counter, metadata, and ML feature store. Wide-column store with row-key range scans. Handles millions of reads/writes per second. Inspired Cassandra, HBase.
→ Free PDF
YouTube View Counts: Accuracy vs Speed Trade-off
YouTube Engineering · 2012
Why YouTube famously froze view counts at 301 and batch-updated them. At scale, approximate counting (HyperLogLog) and deferred writes are necessary to avoid hot-row contention on viral videos.
→ YouTube Blog
MPEG-DASH Standard: Dynamic Adaptive Streaming over HTTP
ISO/IEC 23009-1 · 2012
The international standard for adaptive bitrate video streaming used by YouTube. Client switches between quality levels based on available bandwidth, eliminating buffering.
18

How to Scale an App to 10 Million Users on AWS

Auto Scaling CDN / CloudFront Database Tiers Multi-Region

Scaling Progression

1 user: Single EC2 + single RDS.
1K users: Add ELB, move to RDS Multi-AZ, add ElastiCache (Redis).
100K users: Add CloudFront CDN, read replicas on RDS, separate static asset servers, Route 53 health checks.
1M users: RDS Aurora (auto-scaling), S3 for media, SQS/SNS for async, EC2 Auto Scaling Groups, Lambda for edge compute.
10M users: Multi-region Active-Active, Global Accelerator, DynamoDB Global Tables, database sharding or migration to Aurora Global, chaos engineering.

Research Papers & Key Resources

On Designing and Deploying Internet-Scale Services
Hamilton · Microsoft · USENIX LISA 2007
James Hamilton's seminal paper on what it takes to build internet-scale services. Covers operability, health monitoring, graceful degradation, automatic recovery, and avoiding single points of failure. Required reading.
→ Free PDF
The AWS Well-Architected Framework
Amazon Web Services · 2023
AWS's official framework covering Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability. The practical blueprint for scaling applications on AWS.
→ AWS
Chaos Engineering: Building Confidence in System Behavior through Experiments
Basiri et al. · Netflix · IEEE Software 2016
Netflix Chaos Monkey and the chaos engineering discipline. At 10M+ users, you must assume failures are happening at all times. This paper establishes the principles of proactive failure injection.
→ IEEE
Harvest, Yield, and Scalable Tolerant Systems
Fox & Brewer · HOTOS 1999
Introduces the harvest/yield framing: degrade harvest (return partial data) before yield (availability) drops. Real-world application of CAP theorem for 10M-user systems.
→ Free PDF


Distributed Database Systems — Deep Dive

Consensus CAP Theorem Replication Sharding

Foundational Concepts

The CAP Theorem (Brewer, 2000) states a distributed system cannot simultaneously guarantee Consistency, Availability, and Partition Tolerance. In practice, network partitions are unavoidable, so systems choose between CP (strong consistency, may be unavailable) or AP (always available, may be inconsistent). Most modern systems are tunable between the two.

The Essential 10 Papers — Every Distributed Systems Engineer Must Read

1. The Google File System (GFS)
Ghemawat, Gobioff, Leung · Google · SOSP 2003
Introduced fault-tolerant distributed file storage on commodity hardware. Chunked storage (64MB), master-slave coordination, pipelined replication. Inspired HDFS, Amazon S3's architecture. The paper that started modern distributed storage.
→ Free PDF
2. MapReduce: Simplified Data Processing on Large Clusters
Dean & Ghemawat · Google · OSDI 2004
Programming model for parallel processing of large datasets. Map phase (transform) → shuffle (redistribute by key) → Reduce (aggregate). Directly inspired Apache Hadoop. Revolutionized big data processing.
→ Free PDF
3. Bigtable: A Distributed Storage System for Structured Data
Chang et al. · Google · TOCS 2008
Wide-column NoSQL database. Data model: (row key, column family:qualifier, timestamp) → value. Tablets split at 200MB, assigned to Tablet Servers via Master. Inspired Cassandra, HBase, DynamoDB.
→ Free PDF
4. Dynamo: Amazon's Highly Available Key-Value Store
DeCandia et al. · Amazon · SOSP 2007
Eventual consistency, consistent hashing, vector clocks, quorum reads/writes (R+W>N), Merkle trees for anti-entropy, gossip for membership. Every NoSQL database designer references this paper.
→ Free PDF
5. Cassandra: A Decentralized Structured Storage System
Lakshman & Malik · Facebook · LADIS 2009
Combines Bigtable's data model with Dynamo's distribution model. Tunable consistency (ONE/QUORUM/ALL), wide-column, ring topology, no single point of failure. Now one of the most widely deployed open-source NoSQL databases.
→ Free PDF
6. Spanner: Google's Globally-Distributed Database
Corbett et al. · Google · OSDI 2012 / TOCS 2013
First globally distributed database with external consistency. TrueTime API (atomic clocks + GPS) assigns globally ordered timestamps. Enables serializable distributed transactions at global scale. Influenced CockroachDB, YugabyteDB.
→ Free PDF
7. In Search of an Understandable Consensus Algorithm (Raft)
Ongaro & Ousterhout · Stanford · USENIX ATC 2014
Raft was designed to be easier to understand than Paxos. Leader election + log replication + safety guarantees. Used in etcd (Kubernetes), CockroachDB, TiKV. The go-to consensus algorithm for new systems.
→ Free PDF
8. Paxos Made Simple
Lamport · Microsoft Research · 2001
The simplest explanation of the Paxos consensus algorithm. Paxos underpins Google Chubby, Google Spanner, Apache Zookeeper, and many distributed databases. A 2-phase protocol: Prepare/Promise then Accept/Accepted.
→ Free PDF
9. ZooKeeper: Wait-free Coordination for Internet-Scale Systems
Hunt et al. · Yahoo · USENIX ATC 2010
Distributed coordination service: configuration management, leader election, distributed locks, barriers. Used by Kafka (pre-KRaft), HBase, Storm. ZAB protocol (Zookeeper Atomic Broadcast) is a Paxos variant.
→ Free PDF
10. Chord: A Scalable P2P Lookup Service for Internet Applications
Stoica et al. · MIT · SIGCOMM 2001
Distributed Hash Table using consistent hashing on a ring. O(log N) lookup, O(log N) routing table per node. Foundational for understanding how Dynamo, Cassandra, and Riak distribute data without a central coordinator.
→ Free PDF

Advanced Topics — Consistency, Replication & Transactions

CAP Twelve Years Later: How the "Rules" Have Changed
Brewer · IEEE Computer · 2012
Brewer himself refines the CAP theorem: the choice between C and A is not binary, partitions are rare in practice, and systems should focus on minimizing impact rather than picking one side permanently.
→ IEEE
Lamport Clocks — Time, Clocks, and the Ordering of Events in a Distributed System
Lamport · ACM · 1978
Foundational paper on logical clocks. In a distributed system, there's no global clock. Lamport clocks assign logical timestamps to events, establishing a partial order (happens-before relation). Prerequisite for understanding vector clocks.
→ Free PDF
Impossibility of Distributed Consensus with One Faulty Process (FLP)
Fischer, Lynch, Paterson · ACM JACM · 1985
The FLP impossibility result: in an asynchronous system with even one potentially faulty process, there is no deterministic algorithm that guarantees consensus. This is why practical systems use timeouts and randomization.
A Comprehensive Study of Convergent and Commutative Replicated Data Types
Shapiro et al. · INRIA · 2011 (CRDTs)
The seminal CRDT paper. Introduces state-based (CvRDT) and operation-based (CmRDT) replicated data types that converge without coordination. Used in Riak, Redis, Figma, Notion, and Bluesky's data model.
→ Free PDF
The Byzantine Generals Problem
Lamport, Shostak, Pease · ACM TOPLAS · 1982
Consensus in the presence of traitors (malicious nodes). BFT (Byzantine Fault Tolerant) consensus requires 3f+1 nodes to tolerate f Byzantine failures. Foundation for blockchain consensus (PBFT, HotStuff).
→ Free PDF
Percolator: Large-Scale Incremental Processing Using Distributed Transactions and Notifications
Peng & Dabek · Google · OSDI 2010
Google's distributed transaction layer over Bigtable (used by Google Search's index). Introduces optimistic concurrency via MVCC with a two-phase commit protocol backed by Chubby locks.
→ Free PDF

Modern Distributed Databases — Research & Architecture

CockroachDB: The Resilient Geo-Distributed SQL Database
Taft et al. · Cockroach Labs · SIGMOD 2020
Open-source Spanner-inspired database. Uses Raft for consensus, MVCC for transactions, and a SQL layer on top of a distributed KV store. Describes geo-partitioning for data residency compliance.
→ ACM DL
TiDB: A Raft-based HTAP Database
Huang et al. · PingCAP · VLDB 2020
Hybrid Transactional/Analytical Processing (HTAP) database. TiKV (Raft-based distributed KV) + TiFlash (columnar replica). Enables real-time analytics on fresh transactional data without ETL.
→ Free PDF
Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases
Verbitski et al. · Amazon · SIGMOD 2017
Aurora's key innovation: the log IS the database. Redo log records are shipped to 6 storage nodes across 3 AZs; storage nodes apply log records lazily. Eliminates the write-amplification of traditional MySQL replication.
→ ACM DL
F1: A Distributed SQL Database That Scales
Shute et al. · Google · VLDB 2013
Google's distributed SQL database built on Spanner, used for AdWords (billions of dollars). Full SQL, ACID transactions, automatically maintained indexes, optimistic locking. Describes SQL at global scale.
→ Free PDF
Megastore: Providing Scalable, Highly Available Storage for Interactive Services
Baker et al. · Google · CIDR 2011
Google's database bridging the gap between OLTP and scalable NoSQL. Entity groups (mini-databases with ACID transactions) replicated via Paxos. Precursor to Spanner. Powers Gmail, App Engine.
→ Free PDF
CRAQ: Chain Replication with Apportioned Queries
Terrace & Freedman · Princeton · USENIX ATC 2009
Extends chain replication to allow reads at any replica (not just tail), dramatically improving read throughput. Highly relevant for understanding replication strategies in distributed databases.
→ Free PDF

Sharding, Partitioning & Data Distribution

Consistent Hashing and Random Trees
Karger et al. · MIT · STOC 1997
Introduces consistent hashing — when N nodes change to N±1, only ~1/N of keys move. Used in every distributed cache and database (Dynamo, Cassandra, Redis Cluster, CDNs); a minimal demonstration appears after this list.
→ ACM DL
Slicer: Auto-Sharding for Datacenter Applications
Adya et al. · Google · OSDI 2016
Google's automatic sharding system for stateful services. Auto-rebalances load across servers based on observed traffic, avoiding hot shards that plague manual sharding approaches.
→ USENIX
PNUTS: Yahoo!'s Hosted Data Serving Platform
Cooper et al. · Yahoo · VLDB 2008
Yahoo's geographically distributed database. Timeline consistency (per-record sequential writes), regional master per record, pub-sub for replication. Alternative consistency model between Dynamo's AP and Spanner's CP.
→ ACM DL
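
A minimal ring with virtual nodes demonstrates the property the Karger et al. paper proves (a sketch, not any particular library): adding a node remaps only about 1/N of the keys, instead of nearly all of them as with modulo hashing.

    import bisect, hashlib

    def h(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class Ring:
        """Consistent-hash ring with virtual nodes per physical node."""
        def __init__(self, nodes, vnodes: int = 100):
            self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
            self.points = [p for p, _ in self.ring]

        def node_for(self, key: str) -> str:
            idx = bisect.bisect(self.points, h(key)) % len(self.ring)
            return self.ring[idx][1]            # first virtual node clockwise of the key

    keys = [f"url:{i}" for i in range(10_000)]
    before = Ring(["n1", "n2", "n3", "n4"])
    after = Ring(["n1", "n2", "n3", "n4", "n5"])               # add one node
    moved = sum(before.node_for(k) != after.node_for(k) for k in keys)
    print(f"{moved / len(keys):.1%} of keys moved")            # ~20%, i.e. ~1/N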

Essential Books for Distributed Systems

Kleppmann (2017)
Designing Data-Intensive Applications — The bible
Tanenbaum (2017)
Distributed Systems: Principles and Paradigms
van Steen (2023)
Distributed Systems (free online)
Coulouris (2011)
Distributed Systems: Concepts & Design
Lamport (multiple)
All papers at lamport.azurewebsites.net
Martin Fowler
Patterns of Enterprise Application Architecture

Key Research Venues to Follow

SOSP
ACM Symposium on Operating Systems Principles
OSDI
USENIX Symposium on OS Design & Impl.
VLDB
Very Large Data Bases Conference
SIGMOD
ACM Management of Data Conference
NSDI
USENIX Networked Systems Design & Impl.
EUROSYS
European Conference on Computer Systems