Debugging

Systematic Debugging Framework

Avoid the most common debugging mistake: changing things randomly and hoping something works.

1.  Reproduce     Get a reliable, minimal reproduction. Can you trigger it on demand?
2.  Observe       Gather data — logs, metrics, stack traces. Do not guess yet.
3.  Hypothesise   Form one specific hypothesis about the cause.
4.  Test          Design an experiment to confirm or refute it. Change one variable at a time.
5.  Fix           Apply the fix only after the hypothesis is confirmed.
6.  Verify        Confirm the fix works and did not introduce regressions.
7.  Document      Record what happened and why, for future reference.
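Step 1 can often be automated. As a minimal sketch: run a candidate reproduction command repeatedly and measure how reliably the failure triggers (the demo commands below are placeholders for your own reproduction).

```python
import subprocess
import sys

def reproduction_rate(cmd, runs=10):
    """Run a candidate reproduction command `runs` times and return the
    fraction of runs that failed (non-zero exit). A rate near 1.0 means
    the reproduction is reliable; a low rate means the bug is
    intermittent and needs more isolation before you move to step 2."""
    failures = sum(
        subprocess.run(cmd, capture_output=True).returncode != 0
        for _ in range(runs)
    )
    return failures / runs

# Demo: a command that always fails vs. one that always succeeds.
always_fails = [sys.executable, "-c", "raise SystemExit(1)"]
always_works = [sys.executable, "-c", "pass"]
print(reproduction_rate(always_fails, runs=5))  # 1.0
print(reproduction_rate(always_works, runs=5))  # 0.0
```

A reliable reproduction is what makes step 4 possible: without it, you cannot tell whether a change fixed the bug or you simply failed to trigger it.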

When to use: Any bug that is not immediately obvious — especially intermittent or production issues.

Interview tip: If asked to debug a scenario, narrate this loop out loud. Even if you do not reach the answer, demonstrating a disciplined process is what interviewers evaluate.

Rubber Duck Debugging

Explain the problem, step by step, out loud — to a rubber duck, a colleague, or yourself.

1. State what the code is supposed to do.
2. Walk through what it actually does, line by line.
3. The mismatch between 1 and 2 is almost always where the bug is.

Why it works: Explaining forces you to slow down and make implicit assumptions explicit. The act of articulating the problem often reveals the answer.

When to use: When you have been staring at the same bug for 20+ minutes. Also useful before asking a colleague for help — you may solve it in the process.

Layer-by-Layer Debugging (OSI Mindset)

For network or distributed system issues, walk up the stack from the lowest layer to the highest.

Layer 1 — Physical       (Rarely relevant unless on-prem hardware issue)
Layer 2 — Data Link      (Rarely relevant unless debugging bare metal / VPC)
Layer 3 — Network        Can the host route to the target? DNS resolving correctly?
Layer 4 — Transport      TCP connection established? Port open? Firewall rule?
Layer 5 — Session        Auth tokens valid? TLS handshake completing?
Layer 6 — Presentation   Encoding issues? Serialisation mismatches (JSON, protobuf)?
Layer 7 — Application    Is the app logic correct? Are request payloads valid?

When to use: “Service A cannot reach Service B”, latency spikes, connection timeouts, TLS errors.

Practical shortcuts:

# DNS
nslookup <hostname>
dig <hostname>

# Connectivity
ping <host>
curl -v <url>
telnet <host> <port>

# Firewall / routing
traceroute <host>
nc -zv <host> <port>
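The same walk-up can be scripted. A minimal sketch using only Python's standard library, covering layers 3 and 4 for a host/port pair (both arguments are placeholders for your own service); it stops early when a lower layer fails, since higher layers cannot work without it:

```python
import socket

def check_layers(host, port, timeout=3.0):
    """Walk up the stack, lowest layer first, returning a dict of results."""
    results = {}
    # Layer 3 (Network): does the name resolve at all?
    try:
        addr = socket.gethostbyname(host)
        results["l3_dns"] = addr
    except socket.gaierror:
        results["l3_dns"] = None
        return results  # no point checking higher layers
    # Layer 4 (Transport): can we open a TCP connection to the port?
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            results["l4_tcp"] = True
    except OSError:
        results["l4_tcp"] = False
    return results

print(check_layers("localhost", 9))  # port 9 is usually closed
```

Checking layer 7 on top of this is a matter of issuing a real request (e.g. with `curl -v` as above) once the TCP connection is known to work.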

Debugging Checklist for Distributed Systems

□  Is the issue isolated to one instance or all instances?
□  Did anything deploy recently? (Check deployment timestamps vs. incident start)
□  Is the issue correlated with a time-of-day traffic pattern?
□  Are there error spikes in logs? What is the first error in the chain?
□  Are downstream dependencies healthy? (Check their status / dashboards)
□  Is the database connection pool exhausted?
□  Are there resource limits being hit? (CPU, memory, file descriptors, disk)
□  Are there clock skew issues between services? (affects JWT expiry, Kafka offsets)
□  Is the issue in a specific region, AZ, or node?
□  Is a third-party API or external dependency degraded?
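Two of the checks above, error spikes and "the first error in the chain", amount to sorting errors by time: the earliest error is usually the cause, and later errors are often downstream fallout. A sketch, assuming a hypothetical `ISO-timestamp LEVEL message` log format:

```python
from datetime import datetime

def first_error(log_lines):
    """Return (timestamp, message) of the earliest ERROR line, or None."""
    errors = []
    for line in log_lines:
        ts, level, msg = line.split(" ", 2)
        if level == "ERROR":
            errors.append((datetime.fromisoformat(ts), msg))
    return min(errors, default=None)

logs = [
    "2024-05-01T10:00:03 ERROR payment-service timeout",
    "2024-05-01T10:00:01 ERROR db connection refused",
    "2024-05-01T09:59:58 INFO deploy finished",
]
print(first_error(logs)[1])  # db connection refused
```

In a real incident the log format and aggregation will differ, but the principle holds: chase the earliest error, not the loudest one.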

The “Change What Changed” Heuristic

When a system that was working suddenly breaks:

1. Find the exact time the problem started.
2. List everything that changed around that time:
   - Deployments, config changes, feature flags
   - Infrastructure changes (scaling events, node replacements)
   - External dependency updates
   - Traffic pattern changes (marketing campaign, cron job)
   - Certificate / secret rotation
3. For each change, ask: "Could this cause the observed symptom?"
4. Roll back or test the most likely candidate first.
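Steps 1 and 2 can be mechanised when change events are available as data (deploy logs, config history, flag audit trails). A sketch; the two-hour window and the event names are illustrative assumptions:

```python
from datetime import datetime, timedelta

def candidate_changes(incident_start, changes, window=timedelta(hours=2)):
    """Return changes that landed within `window` before the incident,
    most recent first: the most likely candidates to test or roll back.
    `changes` is a list of (timestamp, description) pairs."""
    suspects = [
        (ts, desc)
        for ts, desc in changes
        if timedelta(0) <= incident_start - ts <= window
    ]
    return sorted(suspects, reverse=True)

incident = datetime(2024, 5, 1, 10, 0)
changes = [
    (datetime(2024, 5, 1, 9, 45), "deploy api v2.3.1"),
    (datetime(2024, 5, 1, 9, 10), "feature flag: new-cache on"),
    (datetime(2024, 4, 30, 18, 0), "cert rotation"),  # outside the window
]
for ts, desc in candidate_changes(incident, changes):
    print(ts, desc)
```

Temporal proximity is only a ranking heuristic, which is why step 3 still asks whether each candidate could plausibly cause the observed symptom.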

When to use: Regression bugs, “it was working yesterday” incidents.

Interview tip: This heuristic alone resolves a large share of production incidents, because most outages trace back to a recent change. It is worth stating explicitly in a troubleshooting discussion.
