
Debugging

How to answer 'tell me about a tough bug' and 'how would you debug X?' — a repeatable method, not war stories.


Debug with a method, not luck: reproduce it reliably, isolate the smallest failing case, form one hypothesis, test it, repeat — then fix the cause, not the symptom. "Tell me about a tough bug" is really asking whether you debug systematically, so narrate that loop. Below: the universal algorithm, the toolbox, the common bug categories, and a postmortem template.


The Universal Algorithm

Predict the pattern

You receive a fresh bug report. What is the first disciplined step before touching any code?

Reproduce it reliably — if you cannot make the bug happen on demand, you have no feedback loop: every "fix" is a guess.

  1. Reproduce: can you make the bug happen reliably?
  2. Isolate: minimize the input/conditions that trigger it
  3. Hypothesize: what could explain exactly this behavior?
  4. Test the hypothesis: change one thing, predict the result
  5. Fix the root cause, not the symptom
  6. Verify the fix: add a regression test
  7. Look for siblings: same root cause elsewhere?

Most engineers skip steps 2 and 7. Don't.


Mindset rules

  • Read the actual error. Not just the exception type: the full message and the stack trace.
  • Question every assumption. "I know X is true" → prove it with a log/breakpoint.

Predict the pattern

A regression appeared sometime in the last 200 commits. Which technique pinpoints the exact commit that introduced it with the fewest manual steps?

git bisect: it binary-searches the commit history. You only mark each midpoint good or bad, so it reaches the culprit in O(log n) steps (about 8 tests for 200 commits).

  • Bisect: git bisect, binary-search through code, comment-out halves.
  • Latest change is suspect #1. Recent diff > stale code.
  • Two known states + a transition. If A worked and B doesn't, the bug is in what changed.
  • The bug is your friend: every reproduction is information.
  • Talk to the rubber duck. Explaining out loud reveals the false assumption.
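
To make the bisect idea concrete, here is a minimal Python sketch of the binary search that git bisect performs over a linear history. The names first_bad_commit and is_bad are hypothetical, and the sketch assumes the oldest commit is good, the newest is bad, and the bug never disappears once introduced.

```python
from typing import Callable, Sequence


def first_bad_commit(
    commits: Sequence[str], is_bad: Callable[[str], bool]
) -> tuple[str, int]:
    """Return (first bad commit, number of tests run).

    Assumes commits are ordered oldest to newest, commits[0] is good,
    commits[-1] is bad, and every commit after the culprit is also bad.
    """
    good, bad = 0, len(commits) - 1  # indexes of a known-good and a known-bad commit
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if is_bad(commits[mid]):
            bad = mid   # the culprit is at mid or earlier
        else:
            good = mid  # the culprit is after mid
    return commits[bad], tests


# Hypothetical history: c000 is known good, c200 is known bad.
history = [f"c{i:03d}" for i in range(201)]
culprit, tests = first_bad_commit(history, lambda c: c >= "c137")
print(culprit, tests)
```

In real use, is_bad would check out the commit and run the failing test; the point of the sketch is that the number of tests grows with log2 of the history length, not with the length itself.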

Red flags to watch for

  • "It works on my machine" → environment differs (versions, env vars, data, OS, TZ).
  • "It's intermittent" → race condition, timing, caching, or external dependency.
  • "I just changed one small thing" → that's the change.
  • "It can't be the database" → it's the database.

Toolbox

Logging

  • Add logs at decision points, not everywhere.
  • Log the inputs, the branch taken, and the output.
  • Use structured logging ({"user_id": 42, "step": "auth", "result": "ok"}).
  • Log levels: ERROR (alert), WARN (suspicious), INFO (lifecycle), DEBUG (firehose).
  • For prod issues: turn on DEBUG for one request via correlation ID.
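
As an illustration of the logging rules above, here is a small sketch using only Python's standard library. The function names (log_event, authenticate) and the field names are hypothetical; a real service would more likely use a JSON formatter or a structured-logging library.

```python
import json
import logging
import sys

logger = logging.getLogger("auth")
logger.setLevel(logging.INFO)  # DEBUG lines are dropped unless the level is lowered
logger.addHandler(logging.StreamHandler(sys.stdout))


def log_event(level: int, correlation_id: str, step: str, **fields) -> str:
    """Emit one JSON log line and return it (returned only to ease testing)."""
    line = json.dumps(
        {"correlation_id": correlation_id, "step": step, **fields}, sort_keys=True
    )
    logger.log(level, line)
    return line


def authenticate(user_id: int, password_ok: bool, correlation_id: str) -> bool:
    # Log the input, the branch taken, and the output at the decision point.
    log_event(logging.DEBUG, correlation_id, "auth", user_id=user_id, phase="input")
    if not password_ok:
        log_event(logging.WARNING, correlation_id, "auth", user_id=user_id, result="denied")
        return False
    log_event(logging.INFO, correlation_id, "auth", user_id=user_id, result="ok")
    return True


authenticate(42, True, correlation_id="req-123")
```

Because every line carries the correlation ID, all logs for one request can be pulled with a single filter.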

Debuggers

  • Python: pdb.set_trace() / breakpoint()
  • Node: node inspect, Chrome DevTools
  • Browser: debugger;, breakpoints, conditional breakpoints
  • Go: dlv
  • C/C++: gdb, lldb
  • IDE breakpoints with conditions: if user_id == 42
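
When no IDE is attached, the same conditional breakpoint can be written in code. A minimal sketch, assuming a hypothetical process_order handler where only user 42 reproduces the bug:

```python
def process_order(user_id: int, amount: float) -> float:
    """Hypothetical handler that applies a 10% discount."""
    if user_id == 42:
        # Conditional breakpoint: drops into the debugger only for the user
        # that reproduces the bug. Remove before committing.
        breakpoint()
    return round(amount * 0.9, 2)
```

breakpoint() calls sys.breakpointhook, which defaults to pdb; setting the environment variable PYTHONBREAKPOINT=0 turns every breakpoint() call into a no-op.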

Profilers

  • CPU: py-spy, perf, Chrome Performance, pprof
  • Memory: memray, valgrind, heaptrack, Chrome Memory
  • DB: EXPLAIN ANALYZE, slow query log, pg_stat_statements
  • Network: tcpdump, Wireshark, Chrome Network tab

Tracing (distributed)

  • OpenTelemetry, Jaeger, Datadog APM
  • Correlation IDs propagated through services
  • Tells you which span/service is slow

System tools

  • strace (Linux) / dtruss (macOS): syscalls a process makes
  • lsof -p <pid>: files & sockets open
  • netstat -an / ss: open connections
  • top / htop: CPU, memory
  • iotop: disk I/O
  • vmstat, iostat, dstat: system stats

Git as a debugging tool

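  • git bisect start <bad> <good>, test each checkout, mark it with git bisect good / git bisect bad; finish with git bisect reset (git bisect run <cmd> automates the loop)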
  • git log -p <file>: change history with diffs
  • git blame -L 100,110 <file>: who last touched these lines & why (commit msg)
  • git log -S "buggyFn" (the "pickaxe"): find commits that added or removed a string

Common Bug Categories

1. Memory leak

Symptoms: process RAM grows over time, OOM kill eventually.

Causes:

  • Long-lived collections that only grow (caches without eviction)
  • Event listeners not removed
  • Circular references (tracing GCs collect cycles, but reference-counted memory such as Swift ARC or C++ shared_ptr leaks them unless one link is weak)
  • Holding references to large objects (closures!)
  • Native allocations not freed

Debug:

  • Heap snapshot at T=0, T=1h, T=4h → diff
  • memray / Chrome Memory profiler
  • Look for top retainers
  • Check growing collections

Fix: weak refs, explicit eviction (LRU), unsubscribe in teardown, scope down captures.
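
One of these fixes, explicit LRU eviction, can be sketched in a few lines. This is a minimal, single-threaded illustration; the class name LRUCache is hypothetical, and for memoizing a function the standard library's functools.lru_cache(maxsize=...) covers the same need.

```python
from collections import OrderedDict


class LRUCache:
    """A cache that can never hold more than max_size entries."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._data: OrderedDict = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry

    def __len__(self) -> int:
        return len(self._data)
```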


2. Race condition

Symptoms: works in test, fails in prod; works under load, fails under more load; passes alone, fails in parallel.

Causes:

  • Read-modify-write without lock
  • Time-of-check-to-time-of-use (TOCTOU)
  • Async ordering assumptions
  • Two requests for "same" resource (double-charge!)

Debug:

  • Add log timestamps with thread/coroutine IDs
  • Force the bad ordering with sleep() in suspected window
  • Stress test with concurrency (ab, wrk, hey)

Fix:

  • Locks (and watch for deadlock)
  • Atomic ops (compareAndSwap)
  • DB row-level lock (SELECT ... FOR UPDATE)
  • Unique constraints in DB
  • Idempotency keys
  • Single-writer queue per resource
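
The sketch below shows both the debugging trick (a sleep() in the suspected window to force the bad ordering) and the simplest fix (a lock around the read-modify-write). The Counter class and run helper are hypothetical names chosen for the example.

```python
import threading
import time


class Counter:
    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def unsafe_increment(self) -> None:
        current = self.value      # read
        time.sleep(0.05)          # widen the suspected window to force the bad ordering
        self.value = current + 1  # write: may overwrite another thread's update

    def safe_increment(self) -> None:
        with self._lock:          # the whole read-modify-write happens under the lock
            current = self.value
            time.sleep(0.05)
            self.value = current + 1


def run(method_name: str, n_threads: int = 4) -> int:
    """Call the named Counter method from n_threads threads; return the final value."""
    counter = Counter()
    threads = [
        threading.Thread(target=getattr(counter, method_name)) for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


print("unsafe:", run("unsafe_increment"))  # usually less than 4: updates were lost
print("safe:  ", run("safe_increment"))    # 4: no update is lost
```

The unsafe result depends on thread scheduling, which is exactly why the sleep is useful: it turns a rare interleaving into a near-certain one.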

3. Deadlock

Symptoms: requests hang, threads stuck, eventual timeout.

Causes: lock A → lock B in one path, lock B → lock A in another.

Debug:

  • Thread dump (jstack, py-spy dump, or in gdb: info threads and thread apply all bt)
  • DB: query for waiting locks (pg_locks, SHOW ENGINE INNODB STATUS)

Fix:

  • Always acquire locks in a fixed global order
  • Use try-lock with timeout + retry
  • Shorten critical sections
  • Prefer optimistic concurrency where possible
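
The first fix, a fixed global lock order, might look like the following sketch. The Account class and transfer function are hypothetical; the sketch assumes the two accounts are always distinct and that account IDs are unique.

```python
import threading


class Account:
    def __init__(self, account_id: int, balance: int) -> None:
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src: Account, dst: Account, amount: int) -> None:
    # Always lock the account with the lower id first, whatever the direction,
    # so two opposite transfers can never each hold the lock the other needs.
    first, second = sorted((src, dst), key=lambda acc: acc.account_id)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount


a, b = Account(1, 1_000), Account(2, 1_000)
workers = [
    threading.Thread(target=lambda: [transfer(a, b, 1) for _ in range(10_000)]),
    threading.Thread(target=lambda: [transfer(b, a, 1) for _ in range(10_000)]),
]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(a.balance, b.balance)
```

If transfer locked src before dst instead, the two workers could each grab one lock and wait forever for the other.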

4. Performance regression

Symptoms: p99 latency up after deploy.

Approach:

  • What changed? git log <last-good-release>..<deployed-release>
  • Profile both versions; compare flame graphs
  • Check new DB queries with EXPLAIN
  • Check new external calls (added an API call to a hot path?)
  • N+1 query? Loop calling per-item what could be a batch?

Common causes: missing index, N+1 query, blocking sync call in async code, cache miss, GC pressure (allocations in hot loop).
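
To show what an N+1 query looks like next to its batched equivalent, here is a self-contained sketch using an in-memory SQLite database as a stand-in for the real one. The schema and the CountingDB wrapper are hypothetical; the wrapper exists only to count round trips.

```python
import sqlite3


class CountingDB:
    """Thin wrapper that counts how many queries are sent (hypothetical helper)."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.count = 0

    def query(self, sql: str, params: tuple = ()) -> list:
        self.count += 1
        return self.conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total INTEGER)")
conn.executemany(
    "INSERT INTO orders (user_id, total) VALUES (?, ?)",
    [(uid, uid * 10) for uid in range(1, 51)],
)
db = CountingDB(conn)


def totals_n_plus_one(user_ids: list[int]) -> dict[int, int]:
    # One query per user: N round trips for N users.
    sql = "SELECT SUM(total) FROM orders WHERE user_id = ?"
    return {uid: db.query(sql, (uid,))[0][0] for uid in user_ids}


def totals_batched(user_ids: list[int]) -> dict[int, int]:
    # One query for all users, grouped inside the database.
    placeholders = ",".join("?" * len(user_ids))
    sql = (
        "SELECT user_id, SUM(total) FROM orders "
        f"WHERE user_id IN ({placeholders}) GROUP BY user_id"
    )
    return dict(db.query(sql, tuple(user_ids)))
```

Both functions return the same totals for users that have orders; the difference is the number of round trips, which is what shows up as latency on a hot path.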


5. Slow DB query

  • Run EXPLAIN ANALYZE, look for Seq Scan on big tables, nested loops with high cost.
  • Missing index? Add and re-explain.
  • Stale stats? ANALYZE table_name.
  • Too much data scanned? Make the WHERE clause more selective, or use partial indexes.
  • Many small queries vs one big one? Batch via IN (...) or JOIN.
  • Lock contention? Check pg_stat_activity.
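
The "explain, add index, re-explain" loop can be tried locally with SQLite, used here only as a stand-in: its EXPLAIN QUERY PLAN is a different command from PostgreSQL's EXPLAIN ANALYZE and prints a different format, but the workflow is the same. The table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total INTEGER)")
conn.executemany(
    "INSERT INTO orders (user_id, total) VALUES (?, ?)",
    [(i % 100, i) for i in range(10_000)],
)


def plan(sql: str) -> str:
    """Return SQLite's query plan for sql as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return "; ".join(row[3] for row in rows)  # column 3 holds the plan detail text


SQL = "SELECT total FROM orders WHERE user_id = 42"

plan_before = plan(SQL)  # expect a full table scan
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
plan_after = plan(SQL)   # expect a search that uses the new index

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan varies between SQLite versions, so compare the shape (scan versus index search) rather than the literal text.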

6. Heisenbug / flaky test

Causes (in order of likelihood):

  1. Time-dependent: now() in test vs fixture; TZ issues; DST.
  2. Order-dependent: test depends on previous test's state; use isolated fixtures.
  3. Concurrency: shared resource, race.
  4. External dependency: network, time, randomness. Mock it.
  5. Resource leak: previous test left file handle / DB connection open.

Fix:

  • Run the test 100x in isolation to reproduce it locally: pytest test_x.py --count=100 (the --count flag comes from the pytest-repeat plugin)
  • Freeze time: freezegun
  • Seed randomness
  • Reset DB between tests
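
Two of these fixes, controlling time and seeding randomness, can be sketched with the standard library alone by passing the clock and the random generator in as parameters. This is an alternative to freezegun, not its API; is_expired and pick_sample are hypothetical functions under test.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


def is_expired(
    created_at: datetime,
    ttl: timedelta,
    now: Callable[[], datetime] = lambda: datetime.now(timezone.utc),
) -> bool:
    # The clock is a parameter, so tests can pin it to a fixed instant.
    return now() - created_at > ttl


def pick_sample(items: list, k: int, rng: Optional[random.Random] = None) -> list:
    # The random source is a parameter, so tests can pass a seeded generator.
    rng = rng or random.Random()
    return rng.sample(items, k)


FIXED_NOW = datetime(2024, 3, 10, 12, 0, tzinfo=timezone.utc)


def test_is_expired_is_deterministic() -> None:
    created = FIXED_NOW - timedelta(hours=2)
    assert is_expired(created, timedelta(hours=1), now=lambda: FIXED_NOW)
    assert not is_expired(created, timedelta(hours=3), now=lambda: FIXED_NOW)


def test_sampling_is_reproducible() -> None:
    items = list(range(100))
    first = pick_sample(items, 5, random.Random(1234))
    second = pick_sample(items, 5, random.Random(1234))
    assert first == second
```

The tests no longer depend on when or where they run, which removes the time-dependent and randomness-dependent causes from the list above.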

7. Production crash (panic / unhandled exception)

  1. Capture: error message, stack trace, request ID, user ID, timestamp.
  2. Find related logs by correlation ID.
  3. Reproduce locally with same input.
  4. Fix and add regression test.
  5. Did it affect other users? Pull list of affected requests for follow-up.
  6. Postmortem if customer-impacting.

8. Slow API endpoint

Layer by layer:

  1. Client → server: DNS, TLS, network. Compare from multiple regions.
  2. LB → app: app health, queue depth, thread pool exhausted?
  3. App processing: profile the request handler.
  4. App → DB: query slow? Connection pool exhausted?
  5. App → cache: miss? Connection slow?
  6. App → external API: third party slow?

Distributed trace tells you which span dominates.
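
Without a tracing system in place, a rough per-layer timing can be sketched with a context manager. This is a single-process approximation of spans, not OpenTelemetry; the span helper and the sleep calls standing in for real work are assumptions made for the example.

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}


@contextmanager
def span(name: str):
    """Accumulate the wall-clock time spent inside the block under name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start


def handle_request() -> None:
    with span("app"):
        time.sleep(0.01)   # stand-in for handler logic
    with span("db"):
        time.sleep(0.05)   # stand-in for a slow query
    with span("cache"):
        time.sleep(0.005)  # stand-in for a cache lookup


handle_request()
slowest = max(timings, key=timings.get)
print(slowest, f"{timings[slowest] * 1000:.0f} ms")
```

Whichever layer dominates is the one to profile next; the others can be ignored until it is fixed.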


9. Wrong data / data corruption

  • Trace how data got that way: audit log, version history.
  • Was it bad on insert, or modified after?
  • Look for off-by-one, integer overflow, encoding (UTF-8 vs latin-1), TZ conversions.
  • Check all code paths writing to that field, not just the obvious one.
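
Two of these culprits are easy to reproduce in isolation, which helps in recognizing them in real data. The sketch below shows UTF-8 bytes read as latin-1, and one instant landing on two different calendar dates; the sample values are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Encoding: UTF-8 bytes decoded as latin-1 do not raise, they silently garble.
stored = "café".encode("utf-8")
wrong = stored.decode("latin-1")  # mojibake: the two bytes of "é" become two characters
right = stored.decode("utf-8")
print(repr(wrong), repr(right))

# Time zones: the same instant falls on different dates depending on the zone.
instant = datetime(2024, 1, 1, 0, 30, tzinfo=timezone.utc)
utc_minus_5 = timezone(timedelta(hours=-5))
local = instant.astimezone(utc_minus_5)
print(instant.date(), local.date())
```

A name stored with doubled accent characters, or a record filed under the previous day, usually traces back to one of these two conversions.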

10. Browser / UI bug

  • DevTools: Console for JS errors; Network for failed/slow requests; Elements for CSS.
  • Doesn't render: check React DevTools, state, props; is component mounted?
  • Renders too often: profile, check useEffect deps; memoize.
  • CSS issue: computed styles tab; specificity wars; box model.
  • Doesn't reproduce: clear cache, incognito, try another browser, check viewport size.

Common Interview Questions

"Walk me through debugging a tough bug."

Structure: Situation → Hypothesis → Verification → Root cause → Fix → Lesson.

Good answer has:

  • A real bug (specific, technical)
  • A wrong hypothesis you ruled out → shows scientific method
  • The "aha" moment
  • The actual root cause, not just the symptom
  • What you changed to prevent it next time

Bad answer: "I read the logs, found the issue, fixed it". Too vague.


"How would you debug a slow API?"

Hit the layers (see #8). Mention: tracing, profiling, EXPLAIN, recent changes. Explicitly ask: is it slow for everyone, or for one user/region? Is it always slow, or only in bursts? When did it start?


"Production is down. What do you do?"

Predict the pattern

Production is down and a deploy went out 20 minutes ago. What is the right priority order?

Stabilize first — roll back the deploy (or flip a feature flag) to restore service, then investigate root cause from logs once users are unaffected. Debugging a live broken system under pressure increases MTTR and risks making things worse.

  1. Acknowledge in incident channel; assign roles (incident commander, comms, investigator).
  2. Stabilize first, debug second: roll back recent deploy, scale up, flip feature flag.
  3. Communicate status to users / stakeholders.
  4. Once stable, investigate root cause.
  5. Postmortem: blameless, action items.

"Roll back before debugging" is the right answer most of the time.


"How would you debug high memory usage in production?"

  • Confirm: which process, how fast growing, when did it start.
  • Correlate: traffic spike? Deploy? Bad input?
  • Capture heap snapshot (live tool depends on runtime).
  • Diff snapshots over time → top growers.
  • Identify retention path: who's holding the reference.
  • Fix: bound the collection, evict, scope down.

"An async job runs fine standalone but fails in production. Why?"

Likely causes:

  • Different env vars / secrets
  • Different DB (more data, different schema version)
  • Concurrent runs racing (no lock)
  • Resource starvation (memory, file handles)
  • Timezone (local TZ vs UTC server)
  • Network egress / firewall in prod

Ask for: logs, exit code, runtime, when it started failing.


"Tests pass locally, fail in CI. Why?"

  • Order-dependence (CI runs in different order or parallel)
  • Different OS/file system (case sensitivity, line endings)
  • Timezone (CI usually UTC)
  • Network calls not mocked (CI often has restricted or no outbound network access)
  • Resource limits (less RAM/CPU)
  • Hidden state from other test suites
  • Time-of-day (test that uses now())

Postmortem template

# Incident: <title>
Date: ...
Duration: ... (start → mitigated → fully resolved)
Severity: SEV-1/2/3
Impact: users affected, $$, SLO consumed

## Timeline
HH:MM - first alert
HH:MM - engineer ack
HH:MM - root cause identified
HH:MM - mitigation deployed
HH:MM - fully resolved

## Root cause
What broke and why.

## Contributing factors
Things that made it worse or harder to detect.

## What went well
- ...

## What went poorly
- ...

## Action items (each with owner + due date)
- [ ] Add alert for X (owner, date)
- [ ] Fix dashboard Y (owner, date)
- [ ] Regression test Z (owner, date)

Blameless. The system failed, not the person.


Sentence templates that win interviews

  • "Before changing anything, I'd want to reproduce the bug reliably."
  • "My first hypothesis would be X, I'd test it by ..."
  • "I want to be careful not to fix the symptom; the root cause is ..."
  • "I'd add a regression test so we know if this comes back."
  • "I'd check if this same root cause could affect other features."
  • "In prod, I'd roll back first and debug from logs, not from a live system."

Anti-patterns to call out

| Anti-pattern | Why it's bad |
| --- | --- |
| try/except: pass | Swallows the bug, makes future debugging harder |
| Adding retries to mask a bug | Bug still there, now also non-deterministic |
| Restarting the server "to make it work" | Doesn't fix the leak/state issue |
| Increasing the timeout | Hides the slowness rather than fixing it |
| if buggy_case: do_workaround() | Patches the symptom, root cause spreads |
| Editing prod data directly | No audit trail, hard to reverse |
| Pushing a "small fix" without tests | Often introduces a new bug |
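
The first anti-pattern is worth seeing next to its replacement. A minimal sketch with hypothetical function names: the first version hides the failure, the second catches only the expected error, records context, and lets the failure surface.

```python
import logging

logger = logging.getLogger(__name__)


def parse_amount_swallowing(raw: str):
    try:
        return int(raw)
    except Exception:
        pass  # anti-pattern: the error vanishes and the caller silently gets None


def parse_amount(raw: str) -> int:
    try:
        return int(raw)
    except ValueError:
        # Catch only what is expected, log the input and traceback, then re-raise.
        logger.exception("could not parse amount %r", raw)
        raise
```

With the first version, the bug shows up later as a None somewhere far from its cause; with the second, the stack trace points at the bad input.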