Problem Solving

Incident Response Framework

Use when something is broken and you need to diagnose it systematically without jumping to conclusions.

Quick reference:

1. Confirm    — Verify the issue is real, current, and user-impacting.
2. Scope      — Determine blast radius: who, what, where, since when.
3. Correlate  — Identify recent changes that could explain the issue.
4. Stabilize  — Pause risky changes and stop the incident from expanding.
5. Locate     — Narrow the fault domain using golden signals and dependency tracing.
6. Mitigate   — Take the fastest safe action to reduce user impact.
7. Root Cause — Identify both the trigger and the missing safeguard.
8. Recover    — Confirm the system is truly healthy, not just quieter.
9. Prevent    — Add fixes, guardrails, and learnings to avoid recurrence.

When to use: Incidents, production bugs, mysterious failures, on-call escalations.

Interview tip: Narrate each step out loud. Interviewers want to see your reasoning process, not just the answer.


Step 1 — Confirm

Goal: Ensure the issue is real, current, and user-impacting.

  • Validate via dashboards (latency, errors, traffic, saturation), alerts, synthetic checks, or user reports
  • Determine if the issue is ongoing, intermittent, or already self-resolved
  • Separate signal from noise — filter out false positives before escalating

Key questions:

  • Is this still happening right now?
  • What exactly is degraded: latency, errors, availability, or correctness?
  • Is this real user impact or a noisy signal?

This is a quick golden signal check to confirm the issue exists. Step 5 uses the same signals analytically to locate the fault domain.


Step 2 — Scope

Goal: Understand blast radius and urgency.

  • Identify who is affected: one user, one tenant, one region, one AZ, all customers
  • Establish a timeline of when it started
  • Compare healthy vs unhealthy slices to find the boundary

Scope dimensions:

  • Region / AZ
  • Service / endpoint
  • Version / deploy group
  • Customer segment
  • Dependency path

Key questions:

  • Is this global or isolated?
  • Who is impacted and how severely?
  • When did this begin?

Step 3 — Correlate

Goal: Identify recent changes that could explain the issue.

  • Check recent deploys (app, infra, config)
  • Check feature flag changes, traffic pattern shifts, dependency updates
  • Check certificate rotations, DNS updates, quota changes
  • Build a timeline: incident start vs change events — look for overlap

Key questions:

  • What changed around the time this started?
  • Did this begin right after a deploy or config update?
  • Is the change localized to the affected scope (e.g., only one region)?

Changes often tell you why before you even start locating where.
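
The timeline-overlap check can be sketched in a few lines of Python. The change log, timestamps, and service names here are invented for illustration; the point is the mechanical step of filtering change events into the window just before impact began.

```python
from datetime import datetime, timedelta

def changes_near_incident(incident_start, change_events, window_minutes=30):
    """Return change events that landed shortly before the incident began.

    change_events: list of (timestamp, description) tuples.
    """
    window = timedelta(minutes=window_minutes)
    return [
        (ts, desc)
        for ts, desc in change_events
        if incident_start - window <= ts <= incident_start
    ]

# Hypothetical change log around a 14:00 incident start:
events = [
    (datetime(2024, 5, 1, 13, 10), "infra: node pool resize"),
    (datetime(2024, 5, 1, 13, 45), "deploy: checkout-service v2.3.1"),
    (datetime(2024, 5, 1, 13, 55), "flag: enable new pricing engine"),
]
suspects = changes_near_incident(datetime(2024, 5, 1, 14, 0), events)
# The deploy and the flag flip fall inside the 30-minute window;
# the earlier node pool resize does not.
```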


Step 4 — Stabilize

Goal: Prevent further impact while investigation continues.

  • Pause or roll back recent deploys
  • Freeze config and feature flag changes
  • Isolate affected region or service if needed
  • Page relevant owners and protect system capacity

Key questions:

  • Do I need to stop an ongoing rollout?
  • Is the incident still expanding?
  • Can I reduce risk while investigating?

This step signals maturity in interviews: in real incidents you do not always wait for root cause before acting to contain the damage.


Step 5 — Locate

Goal: Narrow the fault domain using signals and system behavior.

5a — Check golden signals:

  • Latency, Errors, Traffic, Saturation

5b — Interpret signals:

High latency + low CPU  → dependency or network issue
High CPU                → compute bottleneck
High error rate         → failing service or downstream
Traffic drop            → upstream routing or client-side issue
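
The interpretation table can be encoded as a first-hypothesis helper. This is a sketch: the priority ordering between signals is one reasonable choice, not a canonical rule, and real triage combines these readings with scope and correlation data.

```python
def classify_fault(latency_high, cpu_high, error_rate_high, traffic_dropped):
    """Map golden-signal readings to a first hypothesis, per the table above.

    The check order (traffic, errors, latency, CPU) is an assumption made
    for this sketch, not a universal priority.
    """
    if traffic_dropped:
        return "upstream routing or client-side issue"
    if error_rate_high:
        return "failing service or downstream dependency"
    if latency_high and not cpu_high:
        return "dependency or network issue"
    if cpu_high:
        return "compute bottleneck"
    return "no clear signal - widen the search"

# High latency with idle CPU points away from the app's own compute:
hypothesis = classify_fault(latency_high=True, cpu_high=False,
                            error_rate_high=False, traffic_dropped=False)
```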

5c — Isolate by layer (let signals guide you, not the other way around):

Edge          — DNS, CDN, GSLB
Entry         — LB, ingress, API gateway
Platform      — nodes, kube-proxy, service mesh, networking
Service       — pods, app instances, queues
Dependency    — DB, cache, downstream API
Control plane — deploys, feature flags, certs, config changes

5d — Use observability:

  • Metrics → detect the anomaly
  • Logs → find errors and timeouts
  • Traces → identify the slow span

Key questions:

  • Where is the time being spent?
  • Where do healthy and unhealthy paths diverge?
  • Is this infra, application, or dependency?

Signals tell you where; the Correlate step tells you why.


Step 6 — Mitigate

Goal: Reduce user impact with the fastest safe action.

  • Roll back or disable the problematic change
  • Shift traffic to healthy regions
  • Scale out the hot component
  • Fail over dependencies
  • Increase cache TTL or disable expensive non-critical features
  • Drain bad nodes

Key questions:

  • What is the fastest safe action?
  • Can I reduce impact before I have full root cause?
  • Is this action reversible?

Target the smallest safe action that buys time. Stabilize is “hold the line”; Mitigate is “repair the damage.”


Step 7 — Root Cause

Goal: Identify the actual trigger and why the system allowed it.

Two parts:

  • Trigger — What changed or failed first?
  • System gap — Why did this become user impact? Why did safeguards not catch it?

Key questions:

  • What was the first bad event?
  • Why did detection or protection mechanisms fail?
  • Was it capacity, config, dependency, rollout, or design weakness?

Avoid shallow answers:

Shallow:  "DB was slow."
Deep:     "DB was slow because one region had connection pool exhaustion
           after a cache miss spike caused by config drift."

Step 8 — Recover Fully

Goal: Ensure the system is completely healthy, not just quieter.

  • Verify all golden signals return to normal baseline
  • Check backlogs, queue depth, retry storms, cache warm-up
  • Confirm replica health across all regions
  • Verify no partial degradation remains hidden

Key questions:

  • Are we truly healthy or just temporarily quiet?
  • Is there hidden backlog or delayed impact?
  • Are all regions and replicas healthy?

Most candidates stop at Mitigate. Explicitly calling out recovery signals operational maturity.


Step 9 — Prevent Recurrence

Goal: Reduce the likelihood and impact of similar incidents.

Short-term (this sprint):

  • Tune alerting to catch this class of failure earlier (SLO-based preferred)
  • Update runbooks and dashboards
  • Add or improve canary / rollout safety checks

Long-term (this quarter):

  • Add circuit breakers, rate limiting, or retry guardrails
  • Strengthen regional isolation and capacity planning
  • Increase observability coverage for the affected path

Key questions:

  • What would have detected this earlier?
  • What would have reduced blast radius?
  • What should be automated or enforced by policy?
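
The circuit breaker mentioned above can be sketched minimally. This is an illustration of the guardrail idea, not a production implementation: it omits half-open probe limiting, thread safety, and per-endpoint state.

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; allow a retry after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # After the cooldown, let a probe request through (half-open).
        return time.time() - self.opened_at >= self.cooldown_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()

breaker = CircuitBreaker(failure_threshold=3, cooldown_seconds=30)
for _ in range(3):
    breaker.record_failure()
# Three straight failures trip the breaker, shedding load from a sick
# dependency instead of piling on retries.
blocked = not breaker.allow_request()
```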

UPER Problem-Solving Framework

A structured approach to algorithm and coding problems — especially useful when you are stuck.

U — Understand   Restate the problem. Ask clarifying questions. Check edge cases.
P — Plan         Sketch an approach before writing code. Think aloud.
E — Execute      Write clean code following the plan.
R — Review       Test with examples. Check edge cases. Analyze complexity.

When to use: LeetCode-style problems, open-ended coding challenges, any time you feel the urge to start typing immediately.

Interview tip: Spending 5 minutes on Understand + Plan almost always beats diving straight into code.
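
As a worked example, here is UPER applied to the classic two-sum problem (chosen purely for illustration), with each phase narrated as a comment the way you would narrate it aloud.

```python
# U - Understand: given a list of ints and a target, return the indices of
#     two numbers that sum to the target; return [] if no pair exists.
# P - Plan: one pass with a hash map from value -> index; for each number,
#     check whether (target - number) was already seen. O(n) time, O(n) space.
# E - Execute:
def two_sum(nums, target):
    seen = {}  # value -> index where we first saw it
    for i, n in enumerate(nums):
        if target - n in seen:
            return [seen[target - n], i]
        seen[n] = i
    return []

# R - Review: test a normal case, a duplicate-value case, and a no-solution
#     case, then confirm the complexity matches the plan.
```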

Divide and Conquer (Binary Search Mindset)

When facing a complex system or a large search space, split it in half repeatedly to locate the fault or solution faster.

1. Identify the full search space (e.g., the entire call chain).
2. Pick a midpoint and observe behaviour there.
3. Eliminate the half that is clean.
4. Repeat until isolated.

When to use: Debugging a multi-step pipeline, narrowing down a regression, tracing where data gets corrupted.

Example: “The data is wrong by the time it reaches the UI — is it wrong at the API response, at the service layer, or at the DB query? Let me check the API response first.”
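
The halving procedure is the same loop as binary search (and the idea behind git bisect). A sketch, using an invented four-stage pipeline; the key assumption is that once the data goes bad it stays bad at every later stage.

```python
def first_bad_stage(stages, is_corrupted_after):
    """Binary-search a pipeline for the first stage that corrupts the data.

    stages: ordered list of stage names.
    is_corrupted_after(i): True if the data is already wrong after stage i.
    Assumes corruption is monotonic: once introduced, it persists downstream.
    """
    lo, hi = 0, len(stages) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_corrupted_after(mid):
            hi = mid           # fault is at mid or earlier
        else:
            lo = mid + 1       # everything up to mid is clean
    return stages[lo]

# Hypothetical pipeline where corruption starts at the service layer:
pipeline = ["db_query", "service_layer", "api_response", "ui_render"]
bad = first_bad_stage(pipeline, lambda i: i >= 1)
```

Each probe eliminates half the remaining stages, so a 16-step pipeline needs at most 4 checks instead of 16.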

5 Whys

Repeatedly ask “why?” to peel away symptoms and reach the underlying root cause.

Problem:  The deployment failed.
Why 1?    The health check timed out.
Why 2?    The app took too long to start.
Why 3?    It was waiting on a slow database connection.
Why 4?    The connection pool was exhausted.
Why 5?    A recent change removed the connection limit config.
Root cause: Missing connection pool configuration in the new environment.

When to use: Post-mortems, root cause analysis, any time a fix feels shallow (“we just need to restart it”).

Interview tip: Five iterations is a guideline, not a rule. Stop when you reach something actionable and preventable.

Edge Case Checklist

Before finalising any solution, run through this checklist mentally.

□  Empty / null / zero inputs
□  Single element / minimum valid input
□  Maximum input (scale / size limits)
□  Duplicate values
□  Negative numbers or invalid types
□  Already-sorted or reverse-sorted data
□  Off-by-one boundaries
□  Concurrent access / race conditions (for systems)

When to use: After writing any function or designing any API — before saying “I’m done.”
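
The checklist in action, run against a small illustrative function. The function and the test values are made up; the point is that each checklist row becomes one explicit test.

```python
def second_largest(nums):
    """Return the second-largest distinct value, or None if it doesn't exist."""
    distinct = sorted(set(nums), reverse=True)
    return distinct[1] if len(distinct) >= 2 else None

# Walking the checklist:
assert second_largest([]) is None                 # empty input
assert second_largest([5]) is None                # single element
assert second_largest([3, 3, 3]) is None          # duplicates collapse
assert second_largest([-5, -1, -9]) == -5         # negative numbers
assert second_largest([1, 2, 3, 4]) == 3          # already sorted
assert second_largest([4, 3, 2, 1]) == 3          # reverse sorted
```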

Algorithm Pattern Recognition

When you see a problem, use these signals to identify the right technique before writing any code.

Problem Signal                          → Technique
─────────────────────────────────────────────────────────────
Contiguous subarray / substring         → Sliding window
Pair / triplet sum, sorted array        → Two pointers
Shortest path, level-order traversal    → BFS
All paths, connected components, cycles → DFS / recursion
Search in sorted array / rotated array  → Binary search
Search answer space ("minimum max...")  → Binary search on answer
Optimal substructure, overlapping sub   → Dynamic programming
All combinations / permutations         → Backtracking
Top K elements, streaming median        → Heap (min/max)
Intervals (merge, insert, overlap)      → Sort + sweep
Parentheses, undo/redo, monotonic       → Stack
Prefix sums, range queries              → Prefix sum / Fenwick tree
String matching, repeated substrings    → Sliding window or KMP
Graph with weights, shortest path       → Dijkstra / Bellman-Ford

Decision flow:

Is the input sorted (or can you sort it)?  → Two pointers / binary search
Does it ask for all possibilities?         → Backtracking / DFS
Does it ask for the shortest/fewest?       → BFS (unweighted) or Dijkstra (weighted)
Does it ask for max/min of a subarray?     → Sliding window / DP
Does it ask for "number of ways"?          → DP
Does it involve a fixed-size window?       → Sliding window
Does it involve nested intervals?          → Stack or sorting

When to use: Technical screens, LeetCode-style problems — apply before writing a single line of code.

Interview tip: Say the pattern out loud: “This looks like a sliding window problem because we need a contiguous subarray and want to avoid recomputation.” Naming the pattern signals experience even before solving.
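
For instance, the sliding-window row above, applied to the classic "longest substring without repeating characters" problem: the window grows on the right and jumps its left edge past repeats, so each character is processed a constant number of times.

```python
def longest_unique_substring(s):
    """Length of the longest substring with no repeated characters, O(n)."""
    last_seen = {}    # char -> most recent index
    left = best = 0
    for right, ch in enumerate(s):
        if ch in last_seen and last_seen[ch] >= left:
            left = last_seen[ch] + 1   # jump past the previous occurrence
        last_seen[ch] = right
        best = max(best, right - left + 1)
    return best
```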
