System Design
RESHADED Framework
A step-by-step structure for system design interviews. Covers all the bases an interviewer expects you to address.
R — Requirements Functional + non-functional. Ask before designing.
E — Estimation Scale: QPS, storage, bandwidth. Justify your numbers.
S — Storage Data model, DB choice (SQL vs NoSQL), schemas.
H — High-Level Design Draw the main components and how they connect.
A — APIs Define key endpoints or interfaces.
D — Deep Dive Pick 2-3 critical components and go deep.
E — Evaluation Discuss trade-offs, failure modes, and improvements.
D — Done? Confirm you've addressed all requirements.
When to use: Any open-ended system design question (“Design Twitter”, “Design a rate limiter”).
Interview tip: Do not skip Requirements. Spending the first 5 minutes asking clarifying questions signals seniority and prevents designing the wrong system.
CAP Theorem — Quick Reference
When designing distributed systems, you can guarantee at most two of the three properties. Partitions are unavoidable in practice, so during a partition the real choice is between consistency and availability.
C — Consistency Every read returns the most recent write (or an error).
A — Availability Every request gets a response (may not be the latest data).
P — Partition Tol. System continues operating despite dropped network messages.
CP systems: Prioritise correctness. (e.g., ZooKeeper, HBase)
AP systems: Prioritise uptime. (e.g., Cassandra, DynamoDB in default mode)
When to use: Justifying a database choice, discussing trade-offs in distributed design, replication strategies.
Interview tip: Frame it as a trade-off: “Given that network partitions are unavoidable, I’m choosing between consistency and availability. For this use case I’d prefer AP because…”
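In Dynamo-style stores, the C/A dial often surfaces as quorum tuning. A toy check of the standard overlap condition (R + W > N), using Cassandra-style consistency levels as the illustrative example:

```python
# Quorum tuning in an N-replica store: reads are guaranteed to see the
# latest write only when read and write quorums must overlap, i.e. R + W > N.

def overlapping_quorums(n: int, w: int, r: int) -> bool:
    """True if every read quorum intersects every write quorum."""
    return r + w > n

# With N=3 replicas:
strong = overlapping_quorums(n=3, w=2, r=2)  # QUORUM writes + QUORUM reads
fast = overlapping_quorums(n=3, w=1, r=1)    # ONE/ONE: faster, may read stale
```

Tuning W and R per operation lets one system lean CP for some calls and AP for others.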
Scaling Ladder
When asked “how would you scale this?”, walk up the ladder one rung at a time rather than jumping straight to microservices.
1. Vertical scaling Bigger machine. Easiest, hits a ceiling fast.
2. Caching Redis/Memcached. Eliminate redundant computation.
3. DB read replicas Separate read and write load.
4. Load balancing Distribute traffic across multiple app servers.
5. DB sharding / partitioning Horizontal split of data.
6. Async processing Message queues (Kafka, SQS) for bursty workloads.
7. Microservices Split by domain for independent scaling.
8. CDN / edge caching Push static or cacheable content closer to users.
When to use: “The system works — now how would you handle 10× the traffic?”
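Rung 2 of the ladder is usually implemented as cache-aside. A minimal sketch, with a plain dict standing in for Redis and a hypothetical fetch_user_from_db loader:

```python
# Cache-aside: check the cache, fall back to the DB on a miss, and populate
# the cache so subsequent reads skip the DB entirely. The dicts and
# fetch_user_from_db are stand-ins for illustration only.

cache: dict[str, dict] = {}
db = {"u1": {"name": "Ada"}}

def fetch_user_from_db(user_id: str) -> dict:
    return db[user_id]

def get_user(user_id: str) -> dict:
    if user_id in cache:                  # hit: no DB work at all
        return cache[user_id]
    user = fetch_user_from_db(user_id)    # miss: go to the source of truth
    cache[user_id] = user                 # populate for future requests
    return user

get_user("u1")   # miss, reads the DB
get_user("u1")   # hit, served from cache
```

A production version adds TTLs and invalidates entries on writes, otherwise stale data is served indefinitely.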
Back-of-the-Envelope Estimation Cheat Sheet
Useful numbers to memorise for estimation questions.
Latency
L1 cache reference ~1 ns
Main memory reference ~100 ns
SSD random read ~100 µs
HDD seek ~10 ms
Network roundtrip (same DC) ~500 µs
Network roundtrip (cross-region) ~150 ms
Throughput
Single server HTTP ~10K–50K RPS (varies widely)
Kafka topic ~100K–1M msgs/s
Storage
1 char = 1 byte (ASCII; non-Latin UTF-8 text takes 2–4 bytes per char)
1 UUID = 16 bytes
1 tweet (280 chars) ≈ 300 bytes
1 hour HD video ≈ 1 GB (compressed)
Time
1 day = 86,400 s ≈ 100K s
1 month = ~2.6M s ≈ 3M s
1 year = ~31.5M s ≈ 30M s
When to use: Sizing storage, estimating QPS from DAU, justifying the need for caching or sharding.
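Putting the cheat sheet to work, a worked estimate with hypothetical numbers (10M DAU, Twitter-like workload), using the rounded constants above:

```python
# Worked estimate (assumed workload): 10M DAU, each making 10 reads and
# posting 2 tweets per day, with the rounded constants from the cheat sheet.

DAU = 10_000_000
READS_PER_USER_PER_DAY = 10
TWEETS_PER_USER_PER_DAY = 2
BYTES_PER_TWEET = 300            # 280 chars plus metadata
SECONDS_PER_DAY = 100_000        # 86,400 rounded for mental math

avg_read_qps = DAU * READS_PER_USER_PER_DAY / SECONDS_PER_DAY        # 1,000 QPS
peak_read_qps = 2 * avg_read_qps                                     # assume 2x peak
daily_tweet_bytes = DAU * TWEETS_PER_USER_PER_DAY * BYTES_PER_TWEET  # 6 GB/day
yearly_tweet_gb = daily_tweet_bytes * 365 / 1e9                      # ~2.2 TB/year
```

In an interview, state each assumption out loud before multiplying; the round numbers matter less than the visible reasoning.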
Data Store Decision Tree
Need ACID transactions?
Yes → Relational DB (PostgreSQL, MySQL)
No ↓
Need flexible / evolving schema?
Yes → Document store (MongoDB, DynamoDB)
No ↓
Need ultra-low latency key-value access?
Yes → In-memory store (Redis, Memcached)
No ↓
Need time-series / append-only data?
Yes → Time-series DB (InfluxDB, Prometheus, TimescaleDB)
No ↓
Need full-text search?
Yes → Search engine (Elasticsearch, OpenSearch)
No ↓
Need graph relationships?
Yes → Graph DB (Neo4j, Amazon Neptune)
No → Probably still a relational DB
When to use: Justifying any storage technology choice in system design.
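The tree is first-match-wins, which can be sketched directly as a function (the flag names are illustrative, not any real library's API):

```python
# The decision tree above as a first-match-wins function: the first
# requirement that applies determines the store.

def pick_datastore(*, acid=False, flexible_schema=False, low_latency_kv=False,
                   time_series=False, full_text=False, graph=False) -> str:
    if acid:
        return "relational (PostgreSQL, MySQL)"
    if flexible_schema:
        return "document store (MongoDB, DynamoDB)"
    if low_latency_kv:
        return "in-memory store (Redis, Memcached)"
    if time_series:
        return "time-series DB (InfluxDB, Prometheus, TimescaleDB)"
    if full_text:
        return "search engine (Elasticsearch, OpenSearch)"
    if graph:
        return "graph DB (Neo4j, Amazon Neptune)"
    return "relational (default)"
```

Real systems often combine several (e.g., PostgreSQL for transactions plus Elasticsearch for search); the tree picks the primary store per access pattern, not one store for everything.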
API Design Decision Guide
Start: What are your primary constraints?
↓
Need simple, broad client compatibility (browsers, third-party)?
Yes → REST
- Use when: public APIs, CRUD resources, diverse clients
- Trade-offs: verbose payloads, multiple round-trips for complex data
Need high-performance internal service communication?
Yes → gRPC
- Use when: microservice-to-microservice, low latency critical, streaming
- Trade-offs: harder to debug (binary protocol), poor browser support
Need flexible queries, multiple clients with different data needs?
Yes → GraphQL
- Use when: mobile + web with different payload needs, rapid UI iteration
- Trade-offs: complex caching, N+1 query risk, schema overhead
Need real-time push from server to client?
Yes → WebSocket (or SSE for one-way)
- Use when: live feeds, notifications, collaborative editing
- Trade-offs: stateful connections, harder to scale horizontally
Quick reference:
Use case → Best choice
─────────────────────────────────────────────────
Public API (external developers) → REST
Mobile + web with different needs → GraphQL
Internal microservices → gRPC
Real-time notifications → WebSocket / SSE
Simple event streaming → SSE (Server-Sent Events)
When to use: The “A — APIs” step of RESHADED, any “how would you design the API layer?” question.
Interview tip: Lead with REST as your default, then justify deviating. “I’d start with REST for broad compatibility. If internal latency becomes a bottleneck between services, I’d switch those calls to gRPC.”
Observability & Alerting Design Framework
A structured approach for answering “how would you monitor and alert on this system?” — common in system design interviews and oncall readiness discussions.
Quick reference:
1. User Experience Anchor everything to what users actually feel.
2. SLIs Pick 2–4 measurable indicators of user experience.
3. SLOs Set reliability targets and define error budgets.
4. Signals Golden Signals + dimensions (region, endpoint, path).
5. Instrumentation Metrics (aggregate), logs (debug), traces (latency).
6. Alerting SLO-based burn-rate alerts — page only on user impact.
7. Failure Modes Identify likely failure patterns and verify detectability.
8. Scaling & Cost Control cardinality, sample traces, tier storage.
When to use: Any “design the observability strategy” question, the “D — Deep Dive” or “E — Evaluation” steps of RESHADED, or oncall system design discussions.
Interview tip: Lead with user experience, not infrastructure metrics. “I’d start by defining what good looks like for the user, pick a small set of SLIs, attach SLOs, and build alerting around burn-rate violations — not raw thresholds.”
Step 1 — User Experience
Goal: Anchor everything to what users actually feel, not what is easy to measure.
- Define what “good” looks like for this system
- Identify the critical user journeys (e.g., checkout, search, feed load)
Key questions:
- What does success look like to the user?
- What matters most: latency, availability, or data freshness?
- Are there different user tiers with different expectations (cached vs real-time)?
Step 2 — SLIs (Service Level Indicators)
Goal: Choose measurable proxies for user experience.
- Pick 2–4 SLIs maximum — keep it tight and user-facing
- Avoid infra metrics (CPU, disk) as primary SLIs
Key questions:
- What metric reflects user happiness?
- How do we distinguish success from failure?
- Do we need per-region or per-endpoint SLIs?
Common SLIs:
Availability → % of requests returning a successful response
Latency → p95 or p99 response time
Freshness → % of responses containing data within an acceptable age
Error rate → % of requests resulting in an error
Step 3 — SLOs (Service Level Objectives)
Goal: Set reliability targets and define acceptable failure budgets.
- Assign a numeric target to each SLI
- Define the error budget: how much failure is tolerable in the window
Key questions:
- What is acceptable failure for this system?
- What is the business tolerance for downtime or degradation?
- Do critical endpoints need tighter SLOs than others?
Examples:
99.9% availability (allows ~43 min downtime/month)
99% of requests < 300ms at p95
95% of responses contain data < 60 seconds old
Step 4 — Signals (Golden Signals + Dimensions)
Goal: Break down behavior along dimensions so you can isolate problems quickly.
Golden Signals:
Latency — How long requests take (split p50 / p95 / p99)
Errors — Rate of failed requests
Traffic — Request volume (RPS or events/sec)
Saturation — Resource pressure (CPU, memory, queue depth)
Add dimensions — this is the critical step:
Region / AZ — Is degradation localized?
Endpoint / path — Is one API route affected?
Dependency — Is it the DB, cache, or downstream service?
Customer segment — Is one tenant affected?
Cache vs DB path — Did a miss spike cause the issue?
Metrics without dimensions are nearly useless at scale. A single error rate number tells you nothing about where to look.
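The difference dimensions make can be shown with a toy labelled counter (a simplified stand-in for Prometheus-style metric labels):

```python
from collections import Counter

# One "error count" metric sliced by labels. The global number alone says
# "3 errors"; the labelled view says exactly where to look.

errors = Counter()

def record_error(region: str, endpoint: str, dependency: str) -> None:
    errors[(region, endpoint, dependency)] += 1

record_error("us-east-1", "/checkout", "payments-db")
record_error("us-east-1", "/checkout", "payments-db")
record_error("eu-west-1", "/search", "search-cache")

total_errors = sum(errors.values())        # 3 -- tells you nothing actionable
by_region = Counter()
for (region, _endpoint, _dep), n in errors.items():
    by_region[region] += n                 # us-east-1 /checkout stands out
```

The same slicing applies to latency histograms and traffic counters; the label set is what turns "something is wrong" into "checkout in us-east-1 is failing against the payments DB".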
Step 5 — Instrumentation
Goal: Get full visibility across the system using the right tool for each job.
Metrics → Aggregation and alerting. Track SLIs, golden signals, saturation.
Logs → Debugging. Capture error context, request details, stack traces.
Traces → Latency breakdown. Show which service or span is the bottleneck.
Key questions:
- Where is latency introduced along the request path?
- Can I correlate a metric anomaly to a specific trace or log line?
- Do I have request-level visibility across service boundaries?
Step 6 — Alerting Strategy
Goal: Page only when real users are impacted — avoid alert fatigue.
Use SLO-based burn-rate alerting:
- Alert when the error budget is being consumed faster than it can recover
- Fast burn (1-hour window): urgent — something is actively degrading the SLO
- Slow burn (6-hour window): warning — gradual erosion that needs attention soon
Separate alert tiers:
Page (on-call wake) → SLO burn rate exceeded, confirmed user impact
Ticket (non-urgent) → Elevated error rate below burn threshold, capacity warning
Log only → Transient anomalies, auto-remediated events
Key questions:
- Does this alert reflect actual user pain?
- Can this be auto-remediated before it requires human intervention?
- Will this alert fire during normal traffic spikes?
Alert on SLO violations, not on symptoms. A slow DB query is a symptom — a burn rate alert on latency SLO is user impact.
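The burn-rate arithmetic behind those alerts is simple. A sketch with assumed numbers: a 99.9% availability SLO over 30 days, and 14.4x as the fast-burn threshold (a commonly cited value for 1-hour windows; treat exact thresholds as tunable):

```python
# 99.9% availability over 30 days leaves a 0.1% error budget (~43 min of
# full downtime). Burn rate = observed error ratio / budget: 1.0 means the
# budget lasts exactly the window; higher means it burns down faster.

SLO = 0.999
ERROR_BUDGET = 1 - SLO                              # 0.1% of requests may fail
downtime_budget_min = ERROR_BUDGET * 30 * 24 * 60   # ~43.2 min/month

def burn_rate(errors: int, total: int) -> float:
    return (errors / total) / ERROR_BUDGET

# A 1.44% error rate in the last hour consumes the budget ~14.4x too fast,
# a commonly used fast-burn paging threshold for 1-hour windows.
fast_burn = burn_rate(errors=144, total=10_000)
```

The slow-burn tier uses the same formula over a longer window with a lower threshold, catching gradual erosion that never trips the fast-burn alert.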
Step 7 — Failure Modes
Goal: Verify your observability can detect the most likely real-world failures.
Common failure patterns to cover:
DB slowdown → Latency SLI degrades; trace shows slow DB span
Cache miss spike → Latency increases; hit ratio metric drops
Single region down → Availability SLI drops; sliced by region dimension
Dependency timeout → Error rate rises; trace shows timeout on external call
Traffic surge → Saturation metric rises; queue depth or CPU climbs
Key questions:
- What happens to my SLIs if the DB slows down?
- Can I tell the difference between a cache miss spike and a downstream timeout?
- Does a regional failure show up as a global alert, or a localized one?
Step 8 — Scaling & Cost Controls
Goal: Keep telemetry cost and cardinality manageable as the system grows.
Key levers:
Metric cardinality — Avoid high-cardinality label combinations (e.g., user_id as a label)
Trace sampling — Head-based sampling for high volume; tail-based for errors
Log tiering — DEBUG logs to cheap storage; ERROR logs to hot storage
Retention policy — Short window for raw data; long window for aggregated rollupsKey questions:
- Will adding this label to a metric cause a cardinality explosion?
- Can I sample 1% of healthy traces and keep 100% of error traces?
- What is the per-GB cost at 10× current traffic?
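The 1%-healthy / 100%-error sampling policy from the key questions fits in a few lines (a simplified stand-in for what a collector's tail-based sampler decides after a trace completes):

```python
import random

# Keep every error trace, sample healthy traces at 1%. Real tail-based
# samplers make this decision once the whole trace is assembled; this
# sketch captures only the keep/drop policy itself.

def keep_trace(is_error: bool, healthy_rate: float = 0.01) -> bool:
    if is_error:
        return True                         # 100% of error traces kept
    return random.random() < healthy_rate   # ~1% of healthy traces kept
```

This cuts trace volume by roughly 99x on a healthy system while guaranteeing every failure still has a trace to debug from.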
Condensed Interview Answer
When asked “how would you design observability for this system?”:
“I’d start by defining the user experience — what does success look like for the user. Then I’d pick 2–3 SLIs like availability and p95 latency, attach SLOs to them, and derive an error budget. For signals, I’d instrument the Golden Signals — latency, errors, traffic, saturation — broken down by region and endpoint so I can isolate failures quickly. Instrumentation would use metrics for aggregation and alerting, logs for debugging, and traces for latency attribution. Alerting would be SLO-based using burn-rate alerts to avoid noise. Finally, I’d think through likely failure modes to validate the observability covers them, and control cardinality and sampling to keep costs reasonable at scale.”
SRE System Design Framework
A reliability-first structure for designing highly available backends. Use this when the question is about building something that must stay up — not just something that works. Complements RESHADED by going deeper on failure modes, cascading-failure controls, and operational safety.
Quick reference:
1. User Experience What matters most to users?
2. SLIs / SLOs How do we measure success?
3. Request Path How does traffic flow end to end?
4. Core Components What does each layer do and why?
5. HA Design How do we survive instance, AZ, and region failures?
6. Failure Modes What breaks, what is the blast radius, how do we contain it?
7. Cascading Failures Timeouts, retries, circuit breakers, backpressure.
8. Scaling How does the system grow safely?
9. Observability Metrics, logs, traces, alerting.
10. Operations Deploy, rollback, runbooks, game days.
When to use: “Design a highly available X” questions, any question where reliability is the core constraint, the “D — Deep Dive” or “E — Evaluation” steps of RESHADED for HA systems.
Interview tip: Lead with user experience and SLOs before touching infrastructure. Interviewers want to see that you design for reliability outcomes, not just for components.
Step 1 — User Experience
Goal: Anchor the design on what the user actually needs, not on infrastructure first.
- Clarify what the system does
- Identify whether it is read-heavy, write-heavy, latency-sensitive, availability-sensitive, or freshness-sensitive
- Define what “good” looks like for the user
Key questions:
- Who are the users?
- What is the critical path?
- Is the system mostly reads, writes, or both?
- What matters most: latency, availability, correctness, freshness, throughput?
SRE design should start from user-visible reliability, because SLOs and alerting should reflect what customers experience.
Step 2 — SLIs / SLOs
Goal: Make reliability measurable before designing the internals.
- Pick 2–4 user-facing SLIs, then define realistic SLOs
Common SLIs:
Availability → % of requests returning a successful response
Latency → p95 or p99 response time
Freshness → % of responses containing data within an acceptable age
Correctness → % of responses with valid, expected output
Durability → if stateful: data loss rate
Key questions:
- What do we measure to know users are happy?
- What would page the on-call?
- What can degrade without paging?
If you cannot define the SLI/SLO, you do not yet know what the system is optimising for.
Step 3 — Request Path
Goal: Show the end-to-end flow before diving into components.
- Draw the happy path from user to backend and back
- Show both cache-hit and cache-miss paths
Typical flow:
User → DNS / GSLB → CDN / LB → API gateway / ingress → service → cache / DB / dependencies
Key questions:
- How does traffic enter?
- Where is traffic routed globally?
- Where is TLS terminated?
- What happens on a cache hit vs cache miss?
- Where does state live?
The request path gives the interviewer a map of the system before failure analysis begins.
Step 4 — Core Components
Goal: Explain each component by purpose, not by name-dropping.
For each major part, say: why it exists, what problem it solves, and what happens if it fails.
Key questions:
- Why CDN?
- Why cache?
- Why service mesh / ingress?
- Why this datastore?
- Why multi-region?
Don't say: "Redis, Istio, NLB."
Do say: "Regional cache to absorb read traffic and serve stale data during dependency degradation."
Step 5 — High Availability Design
Goal: Show how the system survives common failures at every layer.
Failure layers to address:
Instance / pod failure → Health checks, restarts, redundant replicas
Node failure → Multi-node scheduling, PodDisruptionBudgets
AZ failure → Multi-AZ replicas, cross-AZ load balancing
Region failure → Multi-region active-active or active-passive failover
Dependency failure → Degraded mode, stale serving, fallback paths
Typical HA controls:
- Multi-AZ / multi-region replicas
- Load balancing with health checks
- Regional failover via GSLB
- Read replicas and replicated caches
- Rolling deploys with readiness gates
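The instance-failure layer, health checks plus redundant replicas behind a balancer, can be illustrated with a toy round-robin picker (not a real LB; production balancers add connection draining, outlier detection, and weighting):

```python
import itertools

# Round-robin across replicas, skipping any that fail a health check.
# With replicas spread across AZs, the same logic also covers AZ failure.

class RoundRobinLB:
    def __init__(self, replicas: list[str]):
        self.replicas = replicas
        self._cycle = itertools.cycle(replicas)

    def pick(self, healthy) -> str:
        """Return the next healthy replica; raise if every replica is down."""
        for _ in range(len(self.replicas)):
            replica = next(self._cycle)
            if healthy(replica):
                return replica
        raise RuntimeError("no healthy replicas")

lb = RoundRobinLB(["az1-a", "az1-b", "az2-a"])
down = {"az1-a"}                            # simulate one instance failing
choice = lb.pick(lambda r: r not in down)   # traffic flows to the survivors
```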
Step 6 — Failure Modes
Goal: Show how the system behaves when things go wrong — this is the most important SRE step.
For each likely failure, explain: detection, impact, mitigation, and user experience.
High-signal failure modes:
Regional outage → GSLB reroutes; serves from remaining regions
Dependency latency spike → Timeout triggers fallback; stale cache served
Cache miss storm → DB absorbs load; autoscaling kicks in
Retry storm → Circuit breaker opens; upstream protected
Node exhaustion → Autoscaler adds capacity; requests queue briefly
Bad rollout → Canary SLO alert fires; automated rollback triggered
Network partition → Partition-tolerant path serves cached data
Key questions:
- What if one region dies?
- What if a dependency becomes slow?
- What if cache is unavailable?
- What if traffic spikes suddenly?
- What if a deploy introduces a bad config?
SRE interviews care more about graceful degradation and blast-radius control than the happy path. Large outages typically come from retry amplification, dependency stress, and cascading failures — not a single component simply going down.
Step 7 — Cascading-Failure Controls
Goal: Prevent one degraded dependency from taking down the whole service.
Key reliability controls:
Timeouts → Every outbound call has a deadline; fail fast, do not hang
Retries → Bounded retries with exponential backoff and jitter
Circuit breakers → Open after N failures; stop hammering a degraded dependency
Rate limiting → Protect the service and its dependencies from overload
Backpressure → Signal upstream when the service is at capacity
Stale serving → Return cached or degraded responses rather than errors
Key questions:
- What prevents retry storms?
- What happens when upstream is slow?
- How do we protect the core path?
- Can we serve partial or stale results?
A good design does not just handle failure — it contains failure.
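Two of those controls, per-call deadlines and bounded retries with backoff and jitter, in sketch form (the full-jitter variant shown is one common choice; parameters are illustrative):

```python
import random
import time

# Bounded retries with exponential backoff and full jitter, failing fast once
# the attempt budget is spent. fn is any zero-arg call that raises
# TimeoutError on a missed deadline; the deadline itself is assumed to be
# enforced by the client library making the outbound call.

def call_with_retries(fn, attempts: int = 3, base: float = 0.1, cap: float = 2.0):
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == attempts - 1:
                raise                                # budget spent: fail fast
            backoff = min(cap, base * 2 ** attempt)  # exponential growth, capped
            time.sleep(random.uniform(0, backoff))   # full jitter desynchronizes clients
```

Unbounded retries without jitter are what turn a brief dependency blip into a retry storm. A circuit breaker sits one level up: it counts failures across calls and short-circuits to the fallback path once a threshold trips, so a degraded dependency stops being hammered at all.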
Step 8 — Scaling Strategy
Goal: Explain how the system grows safely under both gradual growth and sudden spikes.
Traffic scaling:
HPA / KEDA → Scale service replicas on RPS, latency, or queue depth
Cluster autoscaler → Add nodes as pod demand grows
CDN + cache → Absorb read traffic before it hits the service layer
Data scaling:
Read replicas → Spread read load across multiple DB instances
Sharding / partitioning → Horizontal data split for write-heavy workloads
Cache tiering → Local in-process cache → regional cache → DB
Key questions:
- What scales horizontally?
- What are the bottlenecks?
- What does autoscaling key off?
- What happens during sudden spikes?
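Sharding reduces to a stable key-to-partition mapping. A toy hash router (plain modulo remaps most keys when the shard count changes; consistent hashing limits that churn, but the routing idea is the same):

```python
import hashlib

# Stable key -> shard mapping: hash the partition key, take it mod the
# shard count. All reads and writes for a key land on the same shard.

NUM_SHARDS = 8

def shard_for(key: str) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

shard = shard_for("user:42")   # deterministic: same key, same shard
```

Choosing the partition key is the real design decision: it must spread load evenly (avoid hot shards) while keeping the data a single request needs on one shard.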
Step 9 — Observability
Goal: Make the system debuggable and operable from day one.
Cover all four layers:
Metrics → Golden Signals (latency, errors, traffic, saturation) by region and endpoint
Logs → Error context, request details, dependency call outcomes
Traces → End-to-end latency attribution across service boundaries
Alerts → SLO burn-rate alerts; page on user impact, not raw thresholds
Key questions:
- What are the user-facing golden signals?
- What dimensions matter: region, endpoint, dependency, status class?
- What should page vs ticket?
- How do we avoid alert fatigue?
SLO-based burn-rate alerting is stronger than threshold alerting because it reflects customer impact and error-budget consumption. See the Observability & Alerting Design Framework for full detail.
Step 10 — Operational Model
Goal: Show that the system can be safely run, not just built.
Deploy safety:
Canary rollout → 1% → 10% → 100% traffic shift with SLO gates at each step
Automated rollback → Roll back automatically on burn-rate alert during canary
Feature flags → Decouple deploy from release; disable at runtime without redeploy
Config validation → Validate config schema at deploy time, not at runtime
Runbooks:
- One runbook per high-signal alert, linked directly from the alert
- Each runbook covers: detection, blast radius, mitigation steps, escalation path
- Keep runbooks short and action-oriented — they are read under pressure
Operational hygiene:
Game days → Quarterly failure drills (region failover, dependency kill, traffic spike)
Chaos testing → Controlled fault injection in staging or production canary
Postmortems → Blameless, focused on system gaps not individuals
On-call rotation → Team-owned with clear escalation to platform for infra failures
If pressed for time, say:
“I’d use a canary deploy strategy with automated rollback triggered by SLO burn-rate violations. Every alert would link to a runbook covering detection, mitigation, and escalation. We’d run quarterly game days to validate regional failover and dependency-degradation scenarios. On-call rotation is team-owned, with a clear escalation path to the platform team for infrastructure failures.”
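The canary gate described above is a loop over traffic stages with an SLO check at each one. A sketch with hypothetical stages and thresholds:

```python
# Canary promotion with an SLO gate per stage: shift traffic in steps and
# roll back the moment the canary breaches the gate. Stages, the gate value,
# and observe_error_rate are all illustrative stand-ins.

STAGES = [0.01, 0.10, 1.00]      # 1% -> 10% -> 100% of traffic
GATE_ERROR_RATE = 0.001          # matches a 99.9% availability SLO

def run_rollout(observe_error_rate):
    """observe_error_rate(stage) -> canary error rate at that traffic share."""
    for stage in STAGES:
        if observe_error_rate(stage) > GATE_ERROR_RATE:
            return ("rollback", stage)   # automated rollback, blast radius = stage
    return ("promoted", 1.00)
```

Starting at 1% caps the blast radius of a bad rollout: even a total canary failure affects only that slice of traffic before the gate fires.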
Condensed Interview Answer
When asked “design a highly available backend for X”:
“I’d start from user experience — what does success look like for the user — then define 2–3 SLIs and attach SLOs so reliability is measurable before we touch infrastructure. I’d sketch the request path first: DNS, GSLB, CDN, API gateway, service, cache, DB. For each component I’d explain its purpose and failure mode, not just its name. HA comes from multi-AZ replicas, regional failover via GSLB, and replicated caches. The most important SRE step is failure mode analysis: what breaks, what is the blast radius, and how does the system degrade gracefully rather than fail hard. I’d add circuit breakers and bounded retries to contain cascading failures. Scaling is HPA on services, sharding on data, and CDN to absorb reads. Observability is golden signals by region and endpoint, with SLO burn-rate alerting. Finally I’d cover the operational model: canary deploys with automated rollback, runbooks linked from alerts, and quarterly game days.”