Debugging: Interview Reference

Debug with a method, not luck: reproduce it reliably, isolate the smallest failing case, form one hypothesis, test it, repeat — then fix the cause, not the symptom. "Tell me about a tough bug" is really asking whether you debug systematically, so narrate that loop. Below: the universal algorithm, the toolbox, the common bug categories, and a postmortem template.

The Universal Algorithm

Predict the pattern

You receive a fresh bug report. What is the first disciplined step before touching any code?

Reproduce it reliably — if you cannot make the bug happen on demand, you have no feedback loop: every "fix" is a guess.

Reproduce: can you make the bug happen reliably?
Isolate: minimize the input/conditions that trigger it
Hypothesize: what could explain exactly this behavior?
Test the hypothesis: change one thing, predict the result
Fix the root cause, not the symptom
Verify the fix: add a regression test
Look for siblings: same root cause elsewhere?

Most engineers skip steps 2 and 7. Don't.

Mindset rules

Read the actual error. Not the type, the full message + stack.
Question every assumption. "I know X is true" → prove it with a log/breakpoint.

Predict the pattern

A regression appeared sometime in the last 200 commits. Which technique pinpoints the exact commit that introduced it with the fewest manual steps?

git bisect — it binary-searches the commit history; you only mark each midpoint good or bad, reaching the culprit in O(log n) steps.

Bisect: git bisect, binary-search through code, comment-out halves.
Latest change is suspect #1. Recent diff > stale code.
Two known states + a transition. If A worked and B doesn't, the bug is in what changed.
The bug is your friend: every reproduction is information.
Talk to the rubber duck. Explaining out loud reveals the false assumption.

Red flags to watch for

"It works on my machine" → environment differs (versions, env vars, data, OS, TZ).
"It's intermittent" → race condition, timing, caching, or external dependency.
"I just changed one small thing" → that's the change.
"It can't be the database" → it's the database.

Toolbox

Logging

Add logs at decision points, not everywhere.
Log the inputs, the branch taken, and the output.
Use structured logging ({"user_id": 42, "step": "auth", "result": "ok"}).
Log levels: ERROR (alert), WARN (suspicious), INFO (lifecycle), DEBUG (firehose).
For prod issues: turn on DEBUG for one request via correlation ID.

Debuggers

Python: pdb.set_trace() / breakpoint()
Node: node inspect, Chrome DevTools
Browser: debugger;, breakpoints, conditional breakpoints
Go: dlv
C/C++: gdb, lldb
IDE breakpoints with conditions: if user_id == 42

Profilers

CPU: py-spy, perf, Chrome Performance, pprof
Memory: memray, valgrind, heaptrack, Chrome Memory
DB: EXPLAIN ANALYZE, slow query log, pg_stat_statements
Network: tcpdump, Wireshark, Chrome Network tab

Tracing (distributed)

OpenTelemetry, Jaeger, Datadog APM
Correlation IDs propagated through services
Tells you which span/service is slow

System tools

strace (Linux) / dtruss (macOS): syscalls a process makes
lsof -p <pid>: files & sockets open
netstat -an / ss: open connections
top / htop: CPU, memory
iotop: disk I/O
vmstat, iostat, dstat: system stats

Git as a debugging tool

git bisect start <bad> <good> then test → git bisect good/bad
git log -p <file>: change history with diffs
git blame -L 100,110 <file>: who last touched these lines & why (commit msg)
git log -S "buggyFn", pickaxe: find commits that added/removed a string

Common Bug Categories

1. Memory leak

Symptoms: process RAM grows over time, OOM kill eventually.

Causes:

Long-lived collections that only grow (caches without eviction)
Event listeners not removed
Circular references (less common with GC, still possible)
Holding references to large objects (closures!)
Native allocations not freed

Debug:

Heap snapshot at T=0, T=1h, T=4h → diff
memray / Chrome Memory profiler
Look for top retainers
Check growing collections

Fix: weak refs, explicit eviction (LRU), unsubscribe in teardown, scope down captures.

2. Race condition

Symptoms: works in test, fails in prod; works under load, fails under more load; passes alone, fails in parallel.

Causes:

Read-modify-write without lock
Time-of-check-to-time-of-use (TOCTOU)
Async ordering assumptions
Two requests for "same" resource (double-charge!)

Debug:

Add log timestamps with thread/coroutine IDs
Force the bad ordering with sleep() in suspected window
Stress test with concurrency (ab, wrk, hey)

Fix:

Locks (and watch for deadlock)
Atomic ops (compareAndSwap)
DB row-level lock (SELECT ... FOR UPDATE)
Unique constraints in DB
Idempotency keys
Single-writer queue per resource

3. Deadlock

Symptoms: requests hang, threads stuck, eventual timeout.

Causes: lock A → lock B in one path, lock B → lock A in another.

Debug:

Thread dump (jstack, py-spy dump, gdb info threads)
DB: query for waiting locks (pg_locks, SHOW ENGINE INNODB STATUS)

Fix:

Always acquire locks in a fixed global order
Use try-lock with timeout + retry
Shorten critical sections
Prefer optimistic concurrency where possible

4. Performance regression

Symptoms: p99 latency up after deploy.

Approach:

What changed? git log <deploy-time>..HEAD
Profile both versions; compare flame graphs
Check new DB queries with EXPLAIN
Check new external calls (added an API call to a hot path?)
N+1 query? Loop calling per-item what could be a batch?

Common causes: missing index, N+1 query, blocking sync call in async code, cache miss, GC pressure (allocations in hot loop).

5. Slow DB query

Run EXPLAIN ANALYZE, look for Seq Scan on big tables, nested loops with high cost.
Missing index? Add and re-explain.
Stale stats? ANALYZE table_name.
Too much data scanned? Add WHERE selectivity, use partial indexes.
Many small queries vs one big one? Batch via IN (...) or JOIN.
Lock contention? Check pg_stat_activity.

6. Heisenbug / flaky test

Causes (in order of likelihood):

Time-dependent: now() in test vs fixture; TZ issues; DST.
Order-dependent: test depends on previous test's state; use isolated fixtures.
Concurrency: shared resource, race.
External dependency: network, time, randomness. Mock it.
Resource leak: previous test left file handle / DB connection open.

Fix:

Run test 100x in isolation: pytest test_x.py --count=100 → reproduce locally
Freeze time: freezegun
Seed randomness
Reset DB between tests

7. Production crash (panic / unhandled exception)

Capture: error message, stack trace, request ID, user ID, timestamp.
Find related logs by correlation ID.
Reproduce locally with same input.
Fix and add regression test.
Did it affect other users? Pull list of affected requests for follow-up.
Postmortem if customer-impacting.

8. Slow API endpoint

Layer by layer:

Client → server: DNS, TLS, network. Compare from multiple regions.
LB → app: app health, queue depth, thread pool exhausted?
App processing: profile the request handler.
App → DB: query slow? Connection pool exhausted?
App → cache: miss? Connection slow?
App → external API: third party slow?

Distributed trace tells you which span dominates.

9. Wrong data / data corruption

Trace how data got that way: audit log, version history.
Was it bad on insert, or modified after?
Look for off-by-one, integer overflow, encoding (UTF-8 vs latin-1), TZ conversions.
Check all code paths writing to that field, not just the obvious one.

10. Browser / UI bug

DevTools: Console for JS errors; Network for failed/slow requests; Elements for CSS.
Doesn't render: check React DevTools, state, props; is component mounted?
Renders too often: profile, check useEffect deps; memoize.
CSS issue: computed styles tab; specificity wars; box model.
Doesn't reproduce: clear cache, incognito, try another browser, check viewport size.

Common Interview Questions

"Walk me through debugging a tough bug."

Structure: Situation → Hypothesis → Verification → Root cause → Fix → Lesson.

Good answer has:

A real bug (specific, technical)
A wrong hypothesis you ruled out → shows scientific method
The "aha" moment
The actual root cause, not just the symptom
What you changed to prevent it next time

Bad answer: "I read the logs, found the issue, fixed it". Too vague.

"How would you debug a slow API?"

Hit the layers (see #8). Mention: tracing, profiling, EXPLAIN, recent changes. Explicitly ask: is it slow for everyone, or one user/region? Always slow, or burst? When did it start?

"Production is down. What do you do?"

Predict the pattern

Production is down and a deploy went out 20 minutes ago. What is the right priority order?

Stabilize first — roll back the deploy (or flip a feature flag) to restore service, then investigate root cause from logs once users are unaffected. Debugging a live broken system under pressure increases MTTR and risks making things worse.

Acknowledge in incident channel; assign roles (incident commander, comms, investigator).
Stabilize first, debug second: roll back recent deploy, scale up, flip feature flag.
Communicate status to users / stakeholders.
Once stable, investigate root cause.
Postmortem: blameless, action items.

"Roll back before debugging" is the right answer most of the time.

"How would you debug high memory usage in production?"

Confirm: which process, how fast growing, when did it start.
Correlate: traffic spike? Deploy? Bad input?
Capture heap snapshot (live tool depends on runtime).
Diff snapshots over time → top growers.
Identify retention path: who's holding the reference.
Fix: bound the collection, evict, scope down.

"An async job runs fine standalone but fails in production. Why?"

Likely causes:

Different env vars / secrets
Different DB (more data, different schema version)
Concurrent runs racing (no lock)
Resource starvation (memory, file handles)
Timezone (local TZ vs UTC server)
Network egress / firewall in prod

Ask for: logs, exit code, runtime, when it started failing.

"Tests pass locally, fail in CI. Why?"

Order-dependence (CI runs in different order or parallel)
Different OS/file system (case sensitivity, line endings)
Timezone (CI usually UTC)
Network calls not mocked (CI has no internet)
Resource limits (less RAM/CPU)
Hidden state from other test suites
Time-of-day (test that uses now())

Postmortem template

# Incident: <title>
Date: ...
Duration: ... (start → mitigated → fully resolved)
Severity: SEV-1/2/3
Impact: users affected, $$, SLO consumed

## Timeline
HH:MM - first alert
HH:MM - engineer ack
HH:MM - root cause identified
HH:MM - mitigation deployed
HH:MM - fully resolved

## Root cause
What broke and why.

## Contributing factors
Things that made it worse or harder to detect.

## What went well
- ...

## What went poorly
- ...

## Action items (each with owner + due date)
- [ ] Add alert for X (owner, date)
- [ ] Fix dashboard Y (owner, date)
- [ ] Regression test Z (owner, date)

Blameless. The system failed, not the person.

Sentence templates that win interviews

"Before changing anything, I'd want to reproduce the bug reliably."
"My first hypothesis would be X, I'd test it by ..."
"I want to be careful not to fix the symptom; the root cause is ..."
"I'd add a regression test so we know if this comes back."
"I'd check if this same root cause could affect other features."
"In prod, I'd roll back first and debug from logs, not from a live system."

Anti-patterns to call out

Anti-pattern	Why it's bad
`try/except: pass`	Swallows the bug, makes future debugging harder
Adding retries to mask a bug	Bug still there, now also non-deterministic
Restarting the server "to make it work"	Doesn't fix the leak/state issue
Increasing the timeout	Hides the slowness rather than fixing it
`if buggy_case: do_workaround()`	Patches the symptom, root cause spreads
Editing prod data directly	No audit trail, hard to reverse
Pushing a "small fix" without tests	Often introduces a new bug