Debugging
How to answer 'tell me about a tough bug' and 'how would you debug X?' — a repeatable method, not war stories.
Debug with a method, not luck: reproduce it reliably, isolate the smallest failing case, form one hypothesis, test it, repeat — then fix the cause, not the symptom. "Tell me about a tough bug" is really asking whether you debug systematically, so narrate that loop. Below: the universal algorithm, the toolbox, the common bug categories, and a postmortem template.
The Universal Algorithm
Predict the pattern
You receive a fresh bug report. What is the first disciplined step before touching any code?
Reproduce it reliably — if you cannot make the bug happen on demand, you have no feedback loop: every "fix" is a guess.
1. Reproduce: can you make the bug happen reliably?
2. Isolate: minimize the input/conditions that trigger it
3. Hypothesize: what could explain exactly this behavior?
4. Test the hypothesis: change one thing, predict the result
5. Fix the root cause, not the symptom
6. Verify the fix: add a regression test
7. Look for siblings: same root cause elsewhere?
Most engineers skip steps 2 and 7. Don't.
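Step 2 is the one most worth practising. A minimal sketch of automated input shrinking, assuming the failure can be expressed as a predicate over a list; the names shrink and triggers_bug are hypothetical and stand in for whatever reproduces your bug:

```python
from typing import Callable, List


def shrink(items: List[int], fails: Callable[[List[int]], bool]) -> List[int]:
    """Greedily drop chunks of the input while the failure still reproduces."""
    assert fails(items), "start from an input that reproduces the bug"
    chunk = len(items) // 2
    while chunk >= 1:
        i = 0
        while i < len(items):
            candidate = items[:i] + items[i + chunk:]
            if candidate and fails(candidate):
                items = candidate   # still fails without this chunk: leave it out
            else:
                i += chunk          # chunk was needed: keep it and move on
        chunk //= 2
    return items


# Hypothetical bug: the code under test breaks when the input holds both 7 and 13.
def triggers_bug(data: List[int]) -> bool:
    return 7 in data and 13 in data


minimal = shrink(list(range(20)), triggers_bug)
print(minimal)
```

The smaller the failing case, the fewer hypotheses remain plausible in step 3.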
Mindset rules
- Read the actual error. Not the type, the full message + stack.
- Question every assumption. "I know X is true" → prove it with a log/breakpoint.
Predict the pattern
A regression appeared sometime in the last 200 commits. Which technique pinpoints the exact commit that introduced it with the fewest manual steps?
git bisect — it binary-searches the commit history; you only mark each midpoint good or bad, reaching the culprit in O(log n) steps.
- Bisect: git bisect, binary-search through code, comment-out halves.
- Latest change is suspect #1. Recent diff > stale code.
- Two known states + a transition. If A worked and B doesn't, the bug is in what changed.
- The bug is your friend: every reproduction is information.
- Talk to the rubber duck. Explaining out loud reveals the false assumption.
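The bisect idea can be sketched without git at all. The example below is an illustration, not how git implements it: it assumes a linear history where the first commit is good, the last is bad, and every commit after the culprit is also bad. The commit names and the is_bad check are made up.

```python
from typing import Callable, List, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search for the first bad commit between a known-good and a known-bad end."""
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid       # culprit is at mid or earlier
        else:
            good = mid      # culprit is after mid
    return commits[bad]


# Hypothetical history: c000 is known good, c200 is known bad.
commits = [f"c{i:03d}" for i in range(201)]
checked: List[str] = []


def is_bad(commit: str) -> bool:
    checked.append(commit)          # count how many commits we had to test
    return commit >= "c137"         # stand-in for "build it and run the failing test"


culprit = first_bad(commits, is_bad)
print(culprit, len(checked))
```

With 200 commits between the two known states, the search needs at most 8 checks, which is why bisecting beats reading diffs one by one.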
Red flags to watch for
- "It works on my machine" → environment differs (versions, env vars, data, OS, TZ).
- "It's intermittent" → race condition, timing, caching, or external dependency.
- "I just changed one small thing" → that's the change.
- "It can't be the database" → it's the database.
Toolbox
Logging
- Add logs at decision points, not everywhere.
- Log the inputs, the branch taken, and the output.
- Use structured logging ({"user_id": 42, "step": "auth", "result": "ok"}).
- Log levels: ERROR (alert), WARN (suspicious), INFO (lifecycle), DEBUG (firehose).
- For prod issues: turn on DEBUG for one request via correlation ID.
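A minimal sketch of the logging bullets above using only the standard library. The JsonFormatter class, the "fields" key, and the authenticate function are assumptions made for illustration; production setups often use a dedicated structured-logging library instead.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "msg": record.getMessage()}
        payload.update(getattr(record, "fields", {}))   # structured context, if any
        return json.dumps(payload, sort_keys=True)


def make_logger(stream) -> logging.Logger:
    logger = logging.getLogger("auth-demo")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    return logger


def authenticate(logger, user_id: int, password_ok: bool, correlation_id: str) -> str:
    fields = {"correlation_id": correlation_id, "user_id": user_id, "step": "auth"}
    logger.debug("auth attempt", extra={"fields": fields})            # the input
    result = "ok" if password_ok else "denied"                       # the branch taken
    logger.info("auth finished", extra={"fields": {**fields, "result": result}})  # the output
    return result


buffer = io.StringIO()
log = make_logger(buffer)
authenticate(log, user_id=42, password_ok=True, correlation_id="req-123")
lines = [json.loads(line) for line in buffer.getvalue().splitlines()]
print(lines)
```

Because every line carries the same correlation_id, one search retrieves the whole request, across services if the ID is propagated.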
Debuggers
- Python: pdb.set_trace() / breakpoint()
- Node: node inspect, Chrome DevTools
- Browser: debugger;, breakpoints, conditional breakpoints
- Go: dlv
- C/C++: gdb, lldb
- IDE breakpoints with conditions: if user_id == 42
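When no IDE is attached, a conditional breakpoint can be approximated in code. A small sketch, assuming a hypothetical handle_request function and that user 42 is the one whose requests misbehave:

```python
def handle_request(user_id: int, payload: dict) -> dict:
    # Pause only for the one user under investigation; everyone else runs normally.
    if user_id == 42:
        breakpoint()   # drops into pdb by default; PYTHONBREAKPOINT=0 turns it into a no-op
    return {"user_id": user_id, "size": len(payload)}
```

Remove the conditional before merging; it is a temporary probe, not a feature.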
Profilers
- CPU: py-spy, perf, Chrome Performance, pprof
- Memory: memray, valgrind, heaptrack, Chrome Memory
- DB: EXPLAIN ANALYZE, slow query log, pg_stat_statements
- Network: tcpdump, Wireshark, Chrome Network tab
Tracing (distributed)
- OpenTelemetry, Jaeger, Datadog APM
- Correlation IDs propagated through services
- Tells you which span/service is slow
System tools
- strace (Linux) / dtruss (macOS): syscalls a process makes
- lsof -p <pid>: files & sockets open
- netstat -an / ss: open connections
- top / htop: CPU, memory
- iotop: disk I/O
- vmstat, iostat, dstat: system stats
Git as a debugging tool
- git bisect start <bad> <good> then test → git bisect good/bad
- git log -p <file>: change history with diffs
- git blame -L 100,110 <file>: who last touched these lines & why (commit msg)
- git log -S "buggyFn" (pickaxe): find commits that added/removed a string
Common Bug Categories
1. Memory leak
Symptoms: process RAM grows over time, OOM kill eventually.
Causes:
- Long-lived collections that only grow (caches without eviction)
- Event listeners not removed
- Circular references (less common with GC, still possible)
- Holding references to large objects (closures!)
- Native allocations not freed
Debug:
- Heap snapshot at T=0, T=1h, T=4h → diff
- memray / Chrome Memory profiler
- Look for top retainers
- Check growing collections
Fix: weak refs, explicit eviction (LRU), unsubscribe in teardown, scope down captures.
2. Race condition
Symptoms: works in test, fails in prod; works under load, fails under more load; passes alone, fails in parallel.
Causes:
- Read-modify-write without lock
- Time-of-check-to-time-of-use (TOCTOU)
- Async ordering assumptions
- Two requests for "same" resource (double-charge!)
Debug:
- Add log timestamps with thread/coroutine IDs
- Force the bad ordering with sleep() in suspected window
- Stress test with concurrency (ab, wrk, hey)
Fix:
- Locks (and watch for deadlock)
- Atomic ops (compareAndSwap)
- DB row-level lock (SELECT ... FOR UPDATE)
- Unique constraints in DB
- Idempotency keys
- Single-writer queue per resource
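The first fix, a lock around the read-modify-write, looks like this in a minimal in-process sketch. The Counter class is illustrative; across multiple processes or machines the same role is played by a row lock, a unique constraint, or an idempotency key.

```python
import threading


class Counter:
    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        with self._lock:               # read, modify, and write happen as one unit
            current = self.value
            self.value = current + 1


def hammer(counter: Counter, times: int) -> None:
    for _ in range(times):
        counter.increment()


counter = Counter()
threads = [threading.Thread(target=hammer, args=(counter, 10_000)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value)
```

Without the lock, two threads can read the same current value and one increment is lost; that lost update is the same shape of bug as the double-charge.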
3. Deadlock
Symptoms: requests hang, threads stuck, eventual timeout.
Causes: lock A → lock B in one path, lock B → lock A in another.
Debug:
- Thread dump (jstack, py-spy dump, gdb's info threads)
- DB: query for waiting locks (pg_locks, SHOW ENGINE INNODB STATUS)
Fix:
- Always acquire locks in a fixed global order
- Use try-lock with timeout + retry
- Shorten critical sections
- Prefer optimistic concurrency where possible
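A fixed global lock order can be as simple as sorting by a stable key before acquiring. A sketch under the assumption that every account has a unique account_id; the Account class and transfer function are hypothetical.

```python
import threading


class Account:
    def __init__(self, account_id: int, balance: int) -> None:
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src: Account, dst: Account, amount: int) -> None:
    if src is dst:
        return   # nothing to move, and taking the same lock twice would hang
    # Always lock the lower account_id first, whichever direction the money flows.
    first, second = sorted((src, dst), key=lambda acct: acct.account_id)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount


def shuttle(src: Account, dst: Account, times: int) -> None:
    for _ in range(times):
        transfer(src, dst, 1)


a, b = Account(1, 1000), Account(2, 1000)
workers = [
    threading.Thread(target=shuttle, args=(a, b, 5000)),
    threading.Thread(target=shuttle, args=(b, a, 5000)),   # opposite direction
]
for w in workers:
    w.start()
for w in workers:
    w.join(timeout=10)
print(a.balance, b.balance)
```

If transfer locked src then dst instead, the two opposite-direction workers could each hold one lock and wait forever for the other.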
4. Performance regression
Symptoms: p99 latency up after deploy.
Approach:
- What changed? git log <previous-deploy>..HEAD (a tag or commit SHA; use git log --since=<time> to filter by time instead)
- Profile both versions; compare flame graphs
- Check new DB queries with EXPLAIN
- Check new external calls (added an API call to a hot path?)
- N+1 query? Loop calling per-item what could be a batch?
Common causes: missing index, N+1 query, blocking sync call in async code, cache miss, GC pressure (allocations in hot loop).
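The N+1 pattern is easy to see by counting queries. The sketch below uses an in-memory SQLite database as a stand-in for a real one; the orders table and the run helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total INTEGER)")
conn.executemany(
    "INSERT INTO orders (user_id, total) VALUES (?, ?)",
    [(uid, 10 * uid) for uid in range(1, 51) for _ in range(3)],
)

query_count = 0


def run(sql: str, params=()):
    """Execute a query and count it, standing in for a round trip to the database."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


user_ids = list(range(1, 51))

# N+1 shape: one query per user inside a loop.
per_user = {
    uid: run("SELECT SUM(total) FROM orders WHERE user_id = ?", (uid,))[0][0]
    for uid in user_ids
}
n_plus_one_queries = query_count

# Batched shape: one query covering every user.
query_count = 0
placeholders = ",".join("?" for _ in user_ids)
batched = dict(
    run(
        f"SELECT user_id, SUM(total) FROM orders WHERE user_id IN ({placeholders}) GROUP BY user_id",
        user_ids,
    )
)
batched_queries = query_count
print(n_plus_one_queries, batched_queries)
```

Both shapes return the same totals; the difference is 50 round trips versus 1, which is often the entire latency regression on a hot path.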
5. Slow DB query
- Run EXPLAIN ANALYZE; look for Seq Scan on big tables and nested loops with high cost.
- Missing index? Add and re-explain.
- Stale stats? ANALYZE table_name.
- Too much data scanned? Make the WHERE clause more selective, use partial indexes.
- Many small queries vs one big one? Batch via IN (...) or JOIN.
- Lock contention? Check pg_stat_activity.
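The "add an index and re-explain" loop can be rehearsed locally. This sketch swaps in SQLite's EXPLAIN QUERY PLAN for PostgreSQL's EXPLAIN ANALYZE, so the wording of the plan differs from what Postgres prints; the events table and index name are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, kind TEXT)")
conn.executemany(
    "INSERT INTO events (user_id, kind) VALUES (?, ?)",
    [(i % 500, "click") for i in range(10_000)],
)

QUERY = "SELECT * FROM events WHERE user_id = ?"


def plan(sql: str) -> str:
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (42,)).fetchall()
    return " | ".join(row[-1] for row in rows)   # last column holds the readable detail


before = plan(QUERY)                                   # full table scan expected
conn.execute("CREATE INDEX idx_events_user_id ON events (user_id)")
after = plan(QUERY)                                    # index search expected
print("before:", before)
print("after: ", after)
```

The habit carries over to any database: capture the plan, change one thing, capture it again, and confirm the scan turned into an index lookup.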
6. Heisenbug / flaky test
Causes (in order of likelihood):
- Time-dependent: now() in test vs fixture; TZ issues; DST.
- Order-dependent: test depends on previous test's state; use isolated fixtures.
- Concurrency: shared resource, race.
- External dependency: network, time, randomness. Mock it.
- Resource leak: previous test left file handle / DB connection open.
Fix:
- Run test 100x in isolation: pytest test_x.py --count=100 (needs the pytest-repeat plugin) → reproduce locally
- Freeze time: freezegun
- Seed randomness
- Reset DB between tests
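Freezing time and seeding randomness both come down to passing the nondeterministic source in as a parameter. The sketch below does this by hand with the standard library instead of freezegun; is_expired and pick_winner are hypothetical functions under test.

```python
import random
from datetime import datetime, timezone
from typing import Callable, List


def is_expired(expires_at: datetime, now: Callable[[], datetime]) -> bool:
    """Take the clock as a parameter so a test can pin it."""
    return now() >= expires_at


def pick_winner(entrants: List[str], rng: random.Random) -> str:
    """Take the random source as a parameter so a test can seed it."""
    return rng.choice(entrants)


FIXED_NOW = datetime(2024, 3, 10, 12, 0, tzinfo=timezone.utc)


def frozen_clock() -> datetime:
    return FIXED_NOW   # every call sees the same instant


expired = is_expired(datetime(2024, 3, 10, 11, 59, tzinfo=timezone.utc), frozen_clock)
first = pick_winner(["ana", "bo", "cy"], random.Random(1234))
second = pick_winner(["ana", "bo", "cy"], random.Random(1234))
print(expired, first == second)
```

In production code the defaults would be the real clock and an unseeded generator; only the tests pass pinned versions.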
7. Production crash (panic / unhandled exception)
- Capture: error message, stack trace, request ID, user ID, timestamp.
- Find related logs by correlation ID.
- Reproduce locally with same input.
- Fix and add regression test.
- Did it affect other users? Pull list of affected requests for follow-up.
- Postmortem if customer-impacting.
8. Slow API endpoint
Layer by layer:
- Client → server: DNS, TLS, network. Compare from multiple regions.
- LB → app: app health, queue depth, thread pool exhausted?
- App processing: profile the request handler.
- App → DB: query slow? Connection pool exhausted?
- App → cache: miss? Connection slow?
- App → external API: third party slow?
Distributed trace tells you which span dominates.
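Without a tracing system in place, a rough per-layer breakdown can be collected by hand. A sketch only: the span helper and the sleep calls are stand-ins for real work, and a real service would use OpenTelemetry or similar rather than a module-level dict.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def span(name: str):
    """Add the wall-clock time spent inside the block to timings[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + (time.perf_counter() - start)


def handle_request() -> None:
    with span("app"):
        sum(range(1000))          # stand-in for handler logic
    with span("cache"):
        time.sleep(0.001)         # stand-in for a cache lookup
    with span("db"):
        time.sleep(0.05)          # stand-in for a slow query
    with span("external_api"):
        time.sleep(0.005)         # stand-in for a third-party call


handle_request()
slowest = max(timings, key=timings.get)
print(slowest, {name: round(secs, 4) for name, secs in timings.items()})
```

Once one layer clearly dominates, the investigation narrows to that layer's own tools, for example EXPLAIN for the database.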
9. Wrong data / data corruption
- Trace how data got that way: audit log, version history.
- Was it bad on insert, or modified after?
- Look for off-by-one, integer overflow, encoding (UTF-8 vs latin-1), TZ conversions.
- Check all code paths writing to that field, not just the obvious one.
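The encoding case has a recognizable signature worth knowing on sight. A short demonstration of UTF-8 bytes being read as latin-1, using an arbitrary sample string:

```python
original = "café"
stored = original.encode("utf-8")          # bytes as the writer produced them

# Bug: the reader assumes latin-1, so each UTF-8 byte becomes its own character.
misread = stored.decode("latin-1")
print(misread)                             # cafÃ©

# The damage is reversible as long as nothing re-encoded or truncated the text.
repaired = misread.encode("latin-1").decode("utf-8")
print(repaired == original)                # True
```

Seeing "Ã" followed by another symbol where an accented letter should be is a strong hint that the data was fine on insert and decoded wrongly on the way out.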
10. Browser / UI bug
- DevTools: Console for JS errors; Network for failed/slow requests; Elements for CSS.
- Doesn't render: check React DevTools, state, props; is component mounted?
- Renders too often: profile, check useEffect deps; memoize.
- CSS issue: computed styles tab; specificity wars; box model.
- Doesn't reproduce: clear cache, incognito, try another browser, check viewport size.
Common Interview Questions
"Walk me through debugging a tough bug."
Structure: Situation → Hypothesis → Verification → Root cause → Fix → Lesson.
Good answer has:
- A real bug (specific, technical)
- A wrong hypothesis you ruled out → shows scientific method
- The "aha" moment
- The actual root cause, not just the symptom
- What you changed to prevent it next time
Bad answer: "I read the logs, found the issue, fixed it". Too vague.
"How would you debug a slow API?"
Hit the layers (see #8). Mention: tracing, profiling, EXPLAIN, recent changes. Explicitly ask: is it slow for everyone, or one user/region? Always slow, or burst? When did it start?
"Production is down. What do you do?"
Predict the pattern
Production is down and a deploy went out 20 minutes ago. What is the right priority order?
Stabilize first — roll back the deploy (or flip a feature flag) to restore service, then investigate root cause from logs once users are unaffected. Debugging a live broken system under pressure increases MTTR and risks making things worse.
- Acknowledge in incident channel; assign roles (incident commander, comms, investigator).
- Stabilize first, debug second: roll back recent deploy, scale up, flip feature flag.
- Communicate status to users / stakeholders.
- Once stable, investigate root cause.
- Postmortem: blameless, action items.
"Roll back before debugging" is the right answer most of the time.
"How would you debug high memory usage in production?"
- Confirm: which process, how fast growing, when did it start.
- Correlate: traffic spike? Deploy? Bad input?
- Capture heap snapshot (live tool depends on runtime).
- Diff snapshots over time → top growers.
- Identify retention path: who's holding the reference.
- Fix: bound the collection, evict, scope down.
"An async job runs fine standalone but fails in production. Why?"
Likely causes:
- Different env vars / secrets
- Different DB (more data, different schema version)
- Concurrent runs racing (no lock)
- Resource starvation (memory, file handles)
- Timezone (local TZ vs UTC server)
- Network egress / firewall in prod
Ask for: logs, exit code, runtime, when it started failing.
"Tests pass locally, fail in CI. Why?"
- Order-dependence (CI runs in different order or parallel)
- Different OS/file system (case sensitivity, line endings)
- Timezone (CI usually UTC)
- Network calls not mocked (CI may have no or restricted internet access)
- Resource limits (less RAM/CPU)
- Hidden state from other test suites
- Time-of-day (test that uses now())
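The timezone item is worth a concrete picture: the same instant falls on different calendar dates on a UTC CI runner and on a laptop west of UTC. A sketch with a made-up instant and a fixed UTC-8 offset standing in for the developer's local zone:

```python
from datetime import datetime, timedelta, timezone

# One instant: 2024-03-10 02:30 UTC.
instant = datetime(2024, 3, 10, 2, 30, tzinfo=timezone.utc)

utc_date = instant.date()                               # what a UTC CI runner sees

# Hypothetical developer machine at UTC-8 (fixed offset, no DST rules applied).
laptop_tz = timezone(timedelta(hours=-8))
laptop_date = instant.astimezone(laptop_tz).date()      # what the laptop sees

print(utc_date, laptop_date)                            # 2024-03-10 2024-03-09
```

A test asserting "created today" passes on one machine and fails on the other for several hours every day, which is why such failures look time-of-day dependent.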
Postmortem template
# Incident: <title>
Date: ...
Duration: ... (start → mitigated → fully resolved)
Severity: SEV-1/2/3
Impact: users affected, $$, SLO consumed
## Timeline
HH:MM - first alert
HH:MM - engineer ack
HH:MM - root cause identified
HH:MM - mitigation deployed
HH:MM - fully resolved
## Root cause
What broke and why.
## Contributing factors
Things that made it worse or harder to detect.
## What went well
- ...
## What went poorly
- ...
## Action items (each with owner + due date)
- [ ] Add alert for X (owner, date)
- [ ] Fix dashboard Y (owner, date)
- [ ] Regression test Z (owner, date)
Blameless. The system failed, not the person.
Sentence templates that win interviews
- "Before changing anything, I'd want to reproduce the bug reliably."
- "My first hypothesis would be X, I'd test it by ..."
- "I want to be careful not to fix the symptom; the root cause is ..."
- "I'd add a regression test so we know if this comes back."
- "I'd check if this same root cause could affect other features."
- "In prod, I'd roll back first and debug from logs, not from a live system."
Anti-patterns to call out
| Anti-pattern | Why it's bad |
|---|---|
| try/except: pass | Swallows the bug, makes future debugging harder |
| Adding retries to mask a bug | Bug still there, now also non-deterministic |
| Restarting the server "to make it work" | Doesn't fix the leak/state issue |
| Increasing the timeout | Hides the slowness rather than fixing it |
| if buggy_case: do_workaround() | Patches the symptom, root cause spreads |
| Editing prod data directly | No audit trail, hard to reverse |
| Pushing a "small fix" without tests | Often introduces a new bug |
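The first row of the table is the easiest to show side by side. Both functions below are illustrative; the point is the contrast between hiding a failure and recording it before letting it surface.

```python
import logging

logger = logging.getLogger("orders")


def parse_quantity_swallowing(raw: str) -> int:
    try:
        return int(raw)
    except Exception:
        pass                      # anti-pattern: the failure vanishes here
    return 0                      # caller cannot tell a real 0 from garbage input


def parse_quantity(raw: str) -> int:
    try:
        return int(raw)
    except ValueError:
        # Record the offending input with its traceback, then let the caller decide.
        logger.exception("invalid quantity %r", raw)
        raise
```

With the second version, bad input shows up in the logs at the moment it happens instead of as a mysterious zero-quantity order days later.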