Frameworks Quiz
Request Path — Typical Flow:
- User — Request originates from the client.
- DNS / GSLB — DNS resolves the hostname; GSLB routes traffic to the nearest healthy region.
- CDN / LB — CDN serves cached content at the edge; load balancer distributes traffic across backend instances.
- API gateway / ingress — TLS is typically terminated here; handles auth, rate limiting, and routing to services.
- Service — Application logic processes the request.
- Cache / DB / dependencies — Service reads from cache (hit) or falls through to the database or downstream dependencies (miss).
| Control | Purpose |
|---|---|
| Canary rollout | 1% → 10% → 100% traffic shift with error rate checks at each step |
| Automated rollback | Roll back automatically on error rate spike during canary |
| Feature flags | Decouple deploy from release; disable at runtime without redeploy |
| Config validation | Validate config schema at deploy time, not at runtime |
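The canary and automated-rollback rows above can be sketched as a small loop. This is a minimal illustration, not a real deployment tool: `run_canary`, the `error_rate_at` callback, and the 1% error threshold are all hypothetical names and values chosen for the example.

```python
from typing import Callable, Sequence


def run_canary(
    steps: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> tuple[str, int]:
    """Walk through the traffic steps and roll back on the first failed check.

    Returns ("promoted", final_percent) or ("rolled_back", failing_percent).
    """
    for percent in steps:
        # A real rollout would shift traffic here, wait, then query metrics.
        observed = error_rate_at(percent)
        if observed > max_error_rate:
            return ("rolled_back", percent)
    return ("promoted", steps[-1])


# Hypothetical error rates observed at each traffic percentage.
healthy = {1: 0.001, 10: 0.002, 100: 0.002}
regressed = {1: 0.001, 10: 0.08, 100: 0.08}

print(run_canary([1, 10, 100], healthy.__getitem__, max_error_rate=0.01))
print(run_canary([1, 10, 100], regressed.__getitem__, max_error_rate=0.01))
```

With the regressed build, the check at the 10% step fails, so the bad release never reaches full traffic.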
Common user-facing SLIs:
- Availability — % of requests returning a successful response
- Latency — p95 or p99 response time
- Freshness — % of responses containing data within an acceptable age
- Error rate — % of requests resulting in an error
- Durability — data loss rate (stateful systems only)
CPU utilization and disk I/O are infrastructure metrics. They are useful for debugging and capacity planning but should not be primary SLIs, because they don’t directly reflect what users experience.
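As a rough sketch of how the availability and latency SLIs above might be computed from request records: the sample data is made up, and counting every non-5xx response as "successful" is an assumption that a real service would need to define for itself.

```python
import math


def availability(status_codes):
    """Percentage of requests that did not return a 5xx status (assumed definition)."""
    ok = sum(1 for code in status_codes if code < 500)
    return 100.0 * ok / len(status_codes)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request log: (status_code, latency_ms)
requests = [
    (200, 40), (200, 55), (200, 60), (500, 900), (200, 45),
    (200, 70), (200, 65), (404, 30), (200, 50), (200, 120),
]
codes = [code for code, _ in requests]
latencies = [ms for _, ms in requests]

print("availability %:", availability(codes))
print("p50 latency ms:", percentile(latencies, 50))
print("p95 latency ms:", percentile(latencies, 95))
```

In this sample the median looks healthy while p95 is dominated by the single slow request, which is why the latency SLI is stated as a high percentile rather than an average.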
| Failure Mode | Mitigation |
|---|---|
| Regional outage | GSLB reroutes; serves from remaining regions |
| Dependency latency spike | Timeout triggers fallback; stale cache served |
| Cache miss storm | DB absorbs load; autoscaling kicks in |
| Retry storm | Circuit breaker opens; upstream protected |
| Network partition | Partition-tolerant path serves cached data |
Incident Response Framework:
- Confirm — Verify the issue is real, current, and user-impacting.
- Scope — Determine blast radius: who, what, where, since when.
- Correlate — Identify recent changes that could explain the issue.
- Stabilize — Pause risky changes and stop the incident from expanding.
- Locate — Narrow the fault domain using golden signals and dependency tracing.
- Mitigate — Take the fastest safe action to reduce user impact.
- Root Cause — Identify both the trigger and the missing safeguard.
- Recover — Confirm the system is truly healthy, not just quieter.
- Prevent — Add fixes, guardrails, and learnings to avoid recurrence.
Traffic scaling:
| Strategy | Purpose |
|---|---|
| HPA / KEDA | Scale service replicas on RPS, latency, or queue depth |
| Cluster autoscaling (ASG / Karpenter) | Add nodes as pod demand grows |
| CDN + cache | Absorb read traffic before it hits the service layer |
Data scaling:
| Strategy | Purpose |
|---|---|
| Read replicas | Spread read load across multiple DB instances |
| Sharding / partitioning | Horizontal data split for write-heavy workloads |
| Cache tiering | Local in-process cache → regional cache → DB |
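The cache tiering row (local in-process cache → regional cache → DB) could look roughly like the following. `TieredCache` is a hypothetical class for illustration; plain dictionaries stand in for the regional cache and the database, and TTLs, eviction, and invalidation are deliberately left out.

```python
class TieredCache:
    """Read-through lookup: local cache, then regional cache, then the DB."""

    def __init__(self, db):
        self.local = {}      # in-process cache (fastest, per instance)
        self.regional = {}   # stand-in for a shared regional cache
        self.db = db         # stand-in for the database
        self.db_reads = 0

    def get(self, key):
        """Return (value, tier_that_served_it), filling faster tiers on the way back."""
        if key in self.local:
            return self.local[key], "local"
        if key in self.regional:
            self.local[key] = self.regional[key]
            return self.local[key], "regional"
        self.db_reads += 1
        value = self.db[key]
        self.regional[key] = value
        self.local[key] = value
        return value, "db"


cache = TieredCache(db={"user:1": "Ada"})
print(cache.get("user:1"))   # first read falls through to the DB
print(cache.get("user:1"))   # second read is served locally
print("db reads:", cache.db_reads)
```

Each tier that serves a hit keeps the request away from the slower tier behind it, so only the first read of a key reaches the database.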
4 Golden Signals (LETS):
- Latency — How long requests take (split p50 / p95 / p99)
- Errors — Rate of failed requests
- Traffic — Request volume (RPS or events/sec)
- Saturation — Resource pressure (CPU, memory, queue depth)
Break each signal down by dimensions — region, endpoint, dependency, customer segment — so that when an SLO fires, you can isolate where the problem is, not just that something is wrong.
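A small sketch of why the dimensional breakdown matters, using made-up request records. `error_rate_by` and the record fields are hypothetical; a metrics backend would normally do this grouping through labels or tags.

```python
from collections import defaultdict


def error_rate_by(records, dimension):
    """Group request records by one dimension and return the error rate per group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        key = record[dimension]
        totals[key] += 1
        if record["error"]:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# Hypothetical records: every error comes from a single region.
records = (
    [{"region": "us-east", "endpoint": "/search", "error": False}] * 4
    + [{"region": "eu-west", "endpoint": "/search", "error": False}] * 2
    + [{"region": "eu-west", "endpoint": "/search", "error": True}] * 2
)

print(error_rate_by(records, "endpoint"))  # aggregate view of the endpoint
print(error_rate_by(records, "region"))    # sliced view isolates the region
```

The per-endpoint view only shows that a quarter of requests fail; the per-region view shows that the failures are confined to one region.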
SRE System Design Framework:
- User Experience — What matters most to users?
- SLIs / SLOs — How do we measure success?
- Request Path — How does traffic flow end to end?
- Core Components — What does each layer do and why?
- HA Design — How do we survive instance, AZ, and region failures?
- Failure Modes — What breaks, what is the blast radius, how do we contain it?
- Cascading Failures — Timeouts, retries, circuit breakers, backpressure.
- Scaling — How does the system grow safely?
- Observability — Metrics, logs, traces, alerting.
- Operations — Deploy, rollback, runbooks, game days.
| Failure Layer | HA Controls |
|---|---|
| Instance / pod failure | Health checks, restarts, redundant replicas |
| Node failure | Multi-node scheduling, PodDisruptionBudgets |
| AZ failure | Multi-AZ replicas, cross-AZ load balancing |
| Region failure | Multi-region active-active or active-passive failover |
| Dependency failure | Degraded mode, stale serving, fallback paths |
| Failure | Signal |
|---|---|
| DB slowdown | Latency SLI degrades; trace shows slow DB span |
| Cache miss spike | Latency increases; hit ratio metric drops |
| Region down | Availability SLI drops; slicing by region isolates the affected region |
| Dependency timeout | Error rate rises; trace shows timeout on external call |
| Traffic surge | Saturation metric rises; queue depth or CPU climbs |
Observability & Alerting Design Framework:
- User Experience — Anchor everything to what users actually feel.
- SLIs — Pick 2–4 measurable indicators of user experience.
- SLOs — Set reliability targets and define error budgets.
- Signals — Golden Signals + dimensions (region, endpoint, path).
- Instrumentation — Metrics (aggregate), logs (debug), traces (latency).
- Alerting — SLO-based burn-rate alerts — page only on user impact.
- Failure Modes — Identify likely failure patterns and verify detectability.
- Scaling & Cost — Control cardinality, sample traces, tier storage.
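To make the SLO, error budget, and burn-rate items above concrete, here is a minimal sketch. The 99.9% SLO, the window sizes, and the 14.4 threshold are assumptions picked for illustration (14.4 is a commonly cited value for a 1-hour window against a 30-day SLO), and `burn_rate` and `should_page` are hypothetical helper names.

```python
def burn_rate(errors: int, total: int, slo: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo          # e.g. 0.001 for a 99.9% SLO
    return (errors / total) / error_budget


def should_page(long_window: float, short_window: float, threshold: float) -> bool:
    """Page only when both windows burn fast: the problem is sustained and still happening."""
    return long_window >= threshold and short_window >= threshold


SLO = 0.999          # assumed target
THRESHOLD = 14.4     # assumed paging threshold

# Hypothetical counts: (errors, total requests) per window.
one_hour = burn_rate(200, 10_000, SLO)
five_min_ongoing = burn_rate(30, 1_000, SLO)
five_min_recovered = burn_rate(0, 1_000, SLO)

print("page while ongoing:", should_page(one_hour, five_min_ongoing, THRESHOLD))
print("page after recovery:", should_page(one_hour, five_min_recovered, THRESHOLD))
```

Requiring both windows to exceed the threshold keeps the alert tied to current user impact: once the short window recovers, the page stops even though the long window still carries the earlier errors.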
| Control | Purpose |
|---|---|
| Timeouts | Every outbound call has a deadline; fail fast, do not hang |
| Retries | Bounded retries with exponential backoff and jitter |
| Circuit breakers | Open after N failures; stop hammering a degraded dependency |
| Rate limiting | Protect the service and its dependencies from overload |
| Backpressure | Signal upstream when the service is at capacity |
| Stale serving | Return cached or degraded responses rather than errors |
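Two of the controls above, bounded retries with exponential backoff and jitter and a circuit breaker that opens after N failures, can be sketched as follows. The class and function names, default values, and the "full jitter" choice are assumptions for illustration, not a reference implementation.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold, reset_after_s, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_backoff(fn, max_attempts=3, base_delay_s=0.1, max_delay_s=2.0,
                       sleep=time.sleep):
    """Bounded retries; the delay cap doubles per attempt and the wait is randomized."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # never retry into an open breaker
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the error
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, cap))  # full jitter spreads out retries
```

A caller would typically combine them as `retry_with_backoff(lambda: breaker.call(fetch))`, where `fetch` is whatever outbound call is being protected. Passing `sleep` and `clock` as parameters is only there to make the sketch testable without real waiting.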