
Microservices

When to split a service (and when not to), the anti-patterns, service discovery, data patterns, and what to actually say in interviews.


Default to a monolith. Split into microservices only when a real seam — independent scaling, a separate deploy cadence, or clear team ownership — earns back the network calls, distributed transactions, and ops overhead you take on. The strongest interview answer leads with that trade-off, not a diagram. Below: when to split, the anti-patterns, service discovery, and the data patterns that actually come up.


What is a Microservice?

A service:

  • Has one bounded responsibility (one business capability).
  • Owns its data (no other service touches its DB).
  • Is deployable independently.
  • Communicates over the network with other services.

What it is NOT:

  • "A small service" (size isn't the point, independence is).
  • "Anything in its own repo."
  • A solution looking for a problem.


When To Split (and when NOT to)

Reasons that justify splitting

  • Team scaling: 50 engineers stepping on each other in one repo.
  • Independent deploy cadence: parts of the system want to release at different rhythms.
  • Different scaling profile: search needs 100× more compute than profiles.
  • Different tech needs: ML pipeline in Python, low-latency edge in Go.
  • Failure isolation: don't let recommendations crash bring down checkout.
  • Compliance / data residency: payment must live in a hardened account/region.

Bad reasons to split

  • "Microservices are best practice."
  • Resume-driven design.
  • Team A doesn't like Team B's code.
  • "We might need to scale this someday."

The default: modular monolith

  • One repo, one deploy.
  • Clear module boundaries inside.
  • Single DB but logically owned schemas/tables per module.
  • Easy to refactor; you can carve out a service later when the seam is real.

Most startups should ship a monolith until they reach roughly 30 engineers or hit a clear pain point. Microservices add network failures, distributed tracing, deployment complexity, and data-consistency problems. Pay that cost only when the benefits outweigh it.


Service Boundaries

The hard part. Get it wrong and you get the distributed monolith: services that can't deploy independently, share a DB, and break together. Worst of both worlds.

Principles

  • Boundaries align with business domains, not technical layers (Domain-Driven Design's "bounded contexts").
  • A service should be deployable without coordinating with another team's deploy.
  • A service should have a clear owner (team responsible end-to-end).
  • Cross-service calls in a single user request should be few. A web request that hits 10 services is fragile and slow.

Signs your boundaries are wrong

  • Shipping a feature requires changing 4 services in lockstep.
  • Two services share a database schema.
  • One service can't function without a synchronous call to another.
  • Teams need a meeting to decide where new code goes.

If your boundaries are wrong, fix them. Merging services back is acceptable; staying with bad ones is expensive.


Communication Styles

Synchronous (HTTP, gRPC)

  • Request/response. Caller blocks.
  • Pros: simple model, immediate feedback.
  • Cons: tight coupling, latency adds up, cascading failures.

Asynchronous (queue, event bus)

  • Producer fires event; consumers process when ready.
  • Pros: decoupled, resilient to bursts, easy to add new consumers.
  • Cons: eventual consistency, debugging harder (trace across queues), ordering issues.

Default

  • Read paths → synchronous (HTTP/gRPC).
  • Side effects, fan-out, slow work → asynchronous (Kafka / SQS / EventBridge).

gRPC vs REST

  • gRPC: binary (protobuf), bidirectional streaming, strict schema, typically smaller payloads and lower serialization overhead than JSON over HTTP.
  • REST: human-readable, browser-friendly, mature tooling.
  • Internal east-west: gRPC fine.
  • External / browser: REST (or GraphQL).

Service Discovery

Services need to find each other:

| Approach | How |
|---|---|
| DNS | Static name; LB behind it; DNS resolves to LB. Simple, slightly stale. |
| Client-side discovery | Service registry (Consul, Eureka). Client picks instance. |
| Server-side discovery | LB looks up registry, routes. (K8s Service works this way internally.) |
| Service mesh (Istio, Linkerd) | Sidecar proxy handles discovery, retries, mTLS, traffic shifting. |

In K8s: just use Service names + DNS. Add a mesh when you actually need its features.


API Gateway

Single entry point for external clients.

Responsibilities:

  • Authentication & rate limiting.
  • Routing to backend services.
  • Aggregation (one client call → multiple backend calls).
  • Protocol translation (HTTPS in → gRPC out).
  • TLS termination.

Examples: AWS API Gateway, Kong, Apigee, Envoy.

Watch out: the gateway can become a bottleneck and a single point of failure. Keep business logic in the services; the gateway should stay thin.


Data Patterns

Database per service

Each service owns its data. Other services must not read its DB directly, only through its API.

Why: ownership. If two services touch the same table, they're effectively one service that pretends otherwise.

Shared DB (anti-pattern)

  • Schema changes are coupled.
  • Noisy-neighbor performance: one service's heavy queries slow down every other service on the same DB.
  • No real boundaries.

Sometimes necessary as a transitional step. Document it as debt.

Saga pattern (distributed transactions)

You can't do BEGIN; do_thing_in_A; do_thing_in_B; COMMIT; across services. Instead:

Choreography: services emit events; others react.

Order placed → InventoryReserved → PaymentCharged → OrderConfirmed
                       ↓                      ↓
                  InventoryFailed     PaymentFailed → InventoryReleased

Orchestration: a central coordinator drives the steps.

Orchestrator → Inventory.reserve
            → Payment.charge
            → if either fails → run compensating actions

Use orchestration for complex flows (clearer); choreography for loosely-coupled events.
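
The orchestration flow above can be sketched in a few lines. This is a minimal in-memory illustration, not a production coordinator: `Step`, `run_saga`, and the stand-in service functions are hypothetical names, and a real orchestrator would also persist its progress and retry compensations until they succeed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # local transaction in one service
    compensate: Callable[[], None]  # undo for that local transaction


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: List[Step] = []
    for step in steps:
        try:
            step.action()
            done.append(step)
        except Exception:
            for finished in reversed(done):
                finished.compensate()
            return False
    return True


# Hypothetical stand-ins for calls to the Inventory and Payment services.
log: List[str] = []


def reserve_inventory() -> None:
    log.append("inventory.reserve")


def release_inventory() -> None:
    log.append("inventory.release")


def charge_payment() -> None:
    raise RuntimeError("card declined")  # simulate a failed charge


def refund_payment() -> None:
    log.append("payment.refund")


ok = run_saga([
    Step("reserve", reserve_inventory, release_inventory),
    Step("charge", charge_payment, refund_payment),
])
```

Only the steps that completed are compensated: the failed charge is never refunded, but the reservation made before it is released.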

Outbox pattern (reliable event publishing)

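Problem: writing to the DB and publishing an event to Kafka are two separate operations, so they are not atomic. A crash between them leaves either a saved row with no event or an event for a row that was never saved. Solution: write the event into an outbox table in the same DB transaction as the business change. A separate process (CDC like Debezium, or a polling worker) reads the outbox and publishes to Kafka. The write is atomic at the DB; events are delivered eventually and at least once, so consumers should tolerate duplicates.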
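
A minimal sketch of the outbox idea, using SQLite from the standard library and a polling relay. The table layout, the `order.placed` topic name, and the `publish` callback are assumptions made for illustration; a real relay would publish to a broker client instead of a Python callable.

```python
import json
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " topic TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " published INTEGER NOT NULL DEFAULT 0)"
    )


def place_order(conn: sqlite3.Connection, order_id: int, item: str) -> None:
    # One transaction: the order row and the outbox row commit together or not at all.
    with conn:
        conn.execute("INSERT INTO orders (id, item) VALUES (?, ?)", (order_id, item))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order.placed", json.dumps({"order_id": order_id, "item": item})),
        )


def relay_once(conn: sqlite3.Connection, publish: Callable[[str, str], None]) -> int:
    """Publish pending outbox rows in order; return how many were pending."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)  # if this raises, the row stays pending and is retried
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If the relay crashes after `publish` but before the `UPDATE`, the same event is published again on the next run. That is why delivery is at least once and consumers need to deduplicate.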

CQRS (Command Query Responsibility Segregation)

Separate write and read models.

  • Writes go to a normalized "command" model.
  • Reads served from optimized "query" projections, built async from events.
  • Good for read-heavy systems; adds complexity.

Event sourcing

Store every state change as an event; current state = replay of events.

  • Pros: full audit, time travel, rebuild projections.
  • Cons: schema evolution painful, big stream stores, harder to learn.

Rarely needed. Use when the problem genuinely is "what did the state look like at time T?" (audit, finance, debugging, compliance).
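
To make "current state = replay of events" concrete, here is a small sketch with a hypothetical account stream. The `Event` shape and the event kinds are assumptions for illustration; the point is that the same fold answers both "what is the balance now?" and "what was it at time T?".

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Event:
    at: int      # logical timestamp of the change
    kind: str    # "deposited" or "withdrawn"
    amount: int


def balance_at(events: Iterable[Event], t: Optional[int] = None) -> int:
    """Rebuild the balance by replaying events, optionally only up to time t."""
    balance = 0
    for event in sorted(events, key=lambda e: e.at):
        if t is not None and event.at > t:
            break
        if event.kind == "deposited":
            balance += event.amount
        elif event.kind == "withdrawn":
            balance -= event.amount
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return balance


history = [
    Event(at=1, kind="deposited", amount=100),
    Event(at=2, kind="withdrawn", amount=30),
    Event(at=3, kind="deposited", amount=5),
]
```

Here `balance_at(history)` replays everything, while `balance_at(history, t=2)` answers the time-travel question by stopping the replay early.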


Resilience Patterns

Retries

  • Network is flaky; retry transient failures.
  • Exponential backoff + jitter to avoid thundering herd on recovery.
  • Cap attempts; otherwise you amplify outages.
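
The three retry rules above can be sketched as follows. This uses "full jitter" (a random delay within an exponentially growing window), which is one common variant rather than the only one. `TransientError` and the function names are hypothetical; `sleep` and `rng` are injectable only so the behavior is easy to test.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delays(max_attempts: int, base: float, cap: float,
                   rng: Callable[[], float]) -> Iterator[float]:
    # One delay before each retry, so max_attempts - 1 delays in total.
    for attempt in range(max_attempts - 1):
        window = min(cap, base * (2 ** attempt))  # exponential growth, capped
        yield rng() * window                      # jitter spreads callers out


def call_with_retries(fn: Callable[[], T], max_attempts: int = 4,
                      base: float = 0.1, cap: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    delays = backoff_delays(max_attempts, base, cap, rng)
    while True:
        try:
            return fn()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # attempts exhausted: surface the failure to the caller
            sleep(delay)
```

Only `TransientError` is retried; any other exception propagates immediately, since retrying a validation error or a non-idempotent write blindly tends to make things worse.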

Timeouts

  • Always set timeouts on outbound calls. No timeout = hung threads = cascading failure.
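  • Callee latency SLO < caller timeout < end-user budget. A timeout shorter than the callee's normal latency cuts off healthy calls; one longer than the user's budget does work nobody is waiting for.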
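
One way to keep outbound timeouts inside the end-user budget is to carry a deadline through the request and derive each call's timeout from what is left. The `Deadline` class below is a hypothetical sketch; the injectable `clock` exists only to make it testable.

```python
import time
from typing import Callable


class Deadline:
    """Tracks how much of a request's total time budget is left."""

    def __init__(self, budget_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_call(self, per_call_cap_s: float) -> float:
        """Timeout for the next outbound call: the cap, or less if the budget is nearly spent."""
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("request budget exhausted; skip the outbound call")
        return min(per_call_cap_s, left)
```

The returned value would be passed as the timeout argument of whatever HTTP or gRPC client is in use.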

Circuit breaker

  • After N failures, "open" the circuit → fail fast, don't even try.
  • After cool-down, "half-open" → let a trickle through; close if success.
  • Prevents a failing dependency from dragging you down.
  • Libraries: Hystrix (legacy), Resilience4j, Polly (.NET).
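
The closed / open / half-open cycle above can be sketched as a small state machine. This is a simplified, single-threaded illustration with hypothetical names, not a replacement for the libraries listed: it counts consecutive failures and, in the half-open state, lets every call through rather than limiting probes to a trickle.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half-open"
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        state = self.state
        if state == "open":
            raise CircuitOpenError("dependency marked unhealthy; failing fast")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # (re)open and restart the cool-down
            raise
        self._failures = 0       # a success closes the circuit again
        self._opened_at = None
        return result
```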

Bulkhead

  • Isolate resource pools: one slow downstream can't exhaust all threads/connections.
  • Example: 50 threads for /checkout + 50 for /search. If search backend hangs, checkout still works.
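
A bulkhead can be as simple as a separate concurrency limit per downstream. The sketch below uses a non-blocking semaphore so that callers are rejected immediately when a pool is full instead of queueing behind a hung dependency; `Bulkhead` and `BulkheadFullError` are hypothetical names.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when a pool has no free slot; the caller should degrade or fail fast."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn: Callable[[], T]) -> T:
        # Non-blocking acquire: never wait behind a slow downstream.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} pool is full")
        try:
            return fn()
        finally:
            self._slots.release()


# Mirrors the example above: separate pools, so search cannot starve checkout.
search_pool = Bulkhead("search", max_concurrent=50)
checkout_pool = Bulkhead("checkout", max_concurrent=50)
```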

Rate limiting

  • Inbound: protect yourself from abuse.
  • Outbound: respect downstream limits.
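
The text does not prescribe an algorithm; a token bucket is one common choice and works for both directions (wrap inbound handlers or outbound clients). The class below is a hypothetical single-process sketch; a limit shared across instances would need a shared store.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_s` tokens per second."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill based on elapsed time, never above capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate_per_s)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

When `allow()` returns `False`, an inbound limiter would typically answer with HTTP 429, while an outbound limiter would delay or shed the call.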

Graceful degradation

  • If recommendations are down, show generic items rather than 500.
  • Identify "what's the minimum viable response?"

Dead Letter Queue (DLQ)

  • Messages that fail repeatedly go to DLQ for inspection.
  • Always configure one; otherwise messages vanish or queue grows forever.
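
Managed queues usually provide this through configuration (a maximum receive count plus a target queue). The in-memory sketch below only illustrates the logic, with hypothetical names: a message that keeps failing is moved aside after `max_attempts` instead of being dropped or retried forever.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List


@dataclass
class Message:
    body: str
    attempts: int = 0
    last_error: str = ""


def drain(queue: Deque[Message], handler: Callable[[str], None],
          dlq: List[Message], max_attempts: int = 3) -> None:
    """Process every message; park repeatedly failing ones in the DLQ."""
    while queue:
        msg = queue.popleft()
        try:
            handler(msg.body)
        except Exception as exc:
            msg.attempts += 1
            msg.last_error = repr(exc)   # keep context for whoever inspects the DLQ
            if msg.attempts >= max_attempts:
                dlq.append(msg)
            else:
                queue.append(msg)        # redeliver later, behind other messages
```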

Idempotency

Caller may retry. Same operation must produce same effect. Use idempotency keys for non-idempotent writes (see api-design.md).
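
A sketch of the idempotency-key idea, with a hypothetical `PaymentService` that keeps its state in memory. In a real service the key and the result would be stored in the same database transaction as the write, typically behind a unique constraint on the key.

```python
from typing import Dict, List, Tuple


class PaymentService:
    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}        # idempotency key -> first response
        self.charges: List[Tuple[str, int]] = []   # side effects actually performed

    def charge(self, idempotency_key: str, customer: str, amount_cents: int) -> dict:
        # A retried request with the same key gets the original response back
        # and does not charge the customer again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((customer, amount_cents))
        receipt = {"charge_id": len(self.charges), "amount_cents": amount_cents}
        self._results[idempotency_key] = receipt
        return receipt
```

The caller generates the key once per logical operation (for example, one per checkout attempt) and reuses it on every retry.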


Observability

In a monolith, a stack trace tells you everything. In microservices, you need to assemble the picture from many services.

Three pillars

  • Logs: structured (JSON), correlation ID per request.
  • Metrics: RED (Rate, Errors, Duration) per endpoint; USE (Utilization, Saturation, Errors) per resource.
  • Traces: distributed tracing (OpenTelemetry → Jaeger / Tempo / Datadog). Each request gets a trace; spans show which service did what when.

Correlation ID

Generated at the edge (or accepted from caller); propagated through every call (HTTP header X-Request-ID / W3C traceparent). Without it, debugging is hell.
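
A framework-free sketch of propagation using `contextvars` from the standard library. The helper names are hypothetical, and the header lookup assumes the incoming headers are a plain dict with exactly the `X-Request-ID` spelling; real frameworks expose case-insensitive header maps.

```python
import contextvars
import json
import uuid
from typing import Dict, Mapping, Optional

_request_id: contextvars.ContextVar = contextvars.ContextVar("request_id", default=None)


def start_request(incoming_headers: Mapping[str, str]) -> str:
    """Accept the caller's ID if present, otherwise generate one at the edge."""
    rid = incoming_headers.get("X-Request-ID") or uuid.uuid4().hex
    _request_id.set(rid)
    return rid


def outbound_headers(extra: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Headers for a downstream call, always carrying the current request ID."""
    headers = dict(extra or {})
    rid = _request_id.get()
    if rid is not None:
        headers["X-Request-ID"] = rid
    return headers


def log_line(message: str, **fields: object) -> str:
    """Structured (JSON) log line tagged with the current request ID."""
    return json.dumps({"request_id": _request_id.get(), "message": message, **fields},
                      sort_keys=True)
```

Every service on the path does the same three things: read the ID on the way in, attach it to logs, and forward it on the way out.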

Health endpoints

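  • /health: liveness; "I'm alive." Keep it cheap and free of dependency checks.
  • /ready: readiness; "I can serve traffic" (deps reachable).
  • Different! K8s uses both: a failed liveness probe restarts the container; a failed readiness probe only removes the pod from Service endpoints.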
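
The difference between the two endpoints can be sketched as two plain functions that return a status code and a body. The function names and the shape of `checks` are assumptions; wiring them to `/health` and `/ready` depends on the web framework in use.

```python
from typing import Callable, Dict, Mapping, Tuple


def liveness() -> Tuple[int, dict]:
    # Deliberately checks nothing external: a slow database should not
    # get this process restarted.
    return 200, {"status": "alive"}


def readiness(checks: Mapping[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Ready only if every dependency check passes; a raising check counts as failed."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    ok = all(results.values())
    return (200 if ok else 503), {"ready": ok, "checks": results}
```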

Deployment

Blue/green

  • Stand up new version (green) alongside old (blue).
  • Switch traffic. Roll back by switching back.
  • Doubles capacity briefly.

Canary

  • Send small % of traffic to new version.
  • Monitor metrics; if good, roll forward; if bad, roll back.
  • Lower risk than full blue/green flip.

Rolling

  • Replace pods/instances one batch at a time.
  • K8s default for Deployments.

Feature flags

  • Decouple deploy from release. Ship dark; enable for cohort; ramp.
  • Tools: LaunchDarkly, Unleash, GrowthBook, in-house.
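
"Enable for cohort; ramp" is often implemented by hashing a stable identifier into a bucket from 0 to 99 and comparing it with the rollout percentage. The sketch below is a hypothetical in-house version, not the API of any of the tools listed.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Stable bucket in [0, 100) for this flag and user."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    # Including the flag name in the hash gives each flag its own cohort,
    # so the same users are not always first in line.
    return bucket(flag, user_id) < rollout_percent
```

Because the bucket is deterministic, a user who sees the feature at 10% keeps seeing it as the rollout ramps to 50% and 100%.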

Schema / Contract Evolution

Two services depend on the same shape. Changing it without breaking the other:

Add a field: non-breaking

Old clients ignore it. Safe.

Remove a field: breaking

  • Mark deprecated.
  • Wait for all callers to stop using it.
  • Then remove.

Rename / change semantics: breaking

  • Add the new field/version alongside the old.
  • Migrate callers.
  • Remove the old.
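
On the consumer side, these rules amount to a "tolerant reader": ignore fields you do not know, and during a rename accept both the old and the new name. The field names below (`user_id` being renamed to `customer_id`, an optional `note`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class Order:
    order_id: str
    customer_id: str
    note: str = ""


def parse_order(payload: Mapping[str, Any]) -> Order:
    # Rename in progress: prefer the new field, fall back to the old one.
    customer = payload.get("customer_id", payload.get("user_id"))
    if customer is None:
        raise ValueError("payload has neither customer_id nor user_id")
    return Order(
        order_id=str(payload["order_id"]),
        customer_id=str(customer),
        note=str(payload.get("note", "")),  # added later; older producers omit it
    )
    # Any other keys in the payload are ignored, so producers can add fields safely.
```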

Tooling

  • Schema registry (Confluent Schema Registry for Kafka): enforces schema compatibility.
  • Consumer-driven contract tests (Pact): consumers define expectations; provider runs them on every build.

Anti-patterns (the hits)

| Anti-pattern | Why it's bad |
|---|---|
| Distributed monolith | Services that must deploy together: the costs of microservices with none of the benefits. |
| Shared database | No real ownership; schema changes are blocked across teams. |
| Synchronous chains | Service A → B → C → D synchronously means latency adds up and any failure breaks everything. |
| Chatty interfaces | One business operation requires N round trips. Combine into one call or use coarse-grained APIs. |
| No timeouts / retries | Failures cascade. Always set timeouts; retry with backoff. |
| Manual deploys | If deploys are scary, you'll batch them → bigger, riskier deploys. CI/CD per service. |
| Service per developer | Conway's Law misapplied: tiny services per team member, integration nightmare. |
| Two-phase commit across services | Don't. Use Saga + compensations. |
| Ignoring observability | You can't debug what you can't see. Tracing + correlation IDs from day one. |
| Premature microservices | 5-person startup, 20 services. Just ship a monolith. |

Conway's Law

"Organizations design systems that mirror their communication structure."

If you want a 3-service system, organize 3 teams. If you want 20 services, you need 20 teams (or you'll get a distributed monolith because the teams must coordinate anyway). Inverse Conway: design the team structure you want first.


Common Interview Questions

"Microservices vs monolith: when?"

Monolith default. Microservices when: team scaling pain, independent deploy cadence is essential, different scaling/tech needs, failure isolation matters. Not as a "best practice."

"How do you handle distributed transactions?"

Avoid where possible (design so ownership is clear). When unavoidable, Saga with compensating actions. Use outbox to atomically commit to DB and emit event. Never use 2PC across services.

"How do services discover each other?"

DNS + LB (simplest), service registry (Consul/Eureka), or platform-native (K8s Services). Add a service mesh when you need its features (mTLS, traffic shaping, retries at infra layer).

"How do you handle a cascading failure?"

Timeouts + circuit breakers + bulkheads + graceful degradation. Identify the failure mode and what to show users when the dependency is down (cache, fallback, partial data).

"How do you do a backward-incompatible API change?"

You don't, directly. Add new version (in field/header/URL), migrate clients, remove old. For events: register new schema, dual-publish during transition, then remove old.

"Sync vs async communication: when which?"

Sync for user-facing reads needing immediate response. Async for side effects, fan-out, slow work, decoupling. A useful pattern: API returns 202 + job ID, work happens async, client polls or gets notified.

"How do you debug a failed request that crossed 5 services?"

Distributed tracing (correlation ID propagated; spans collected in Jaeger/Datadog). Without tracing → manual log correlation by request ID, time window, and praying. Don't ship microservices without tracing.

"Outbox pattern: what and why?"

DB write + event publish must be atomic. Write event to an outbox table in the same DB transaction; a separate process publishes from outbox to the broker. Guarantees at-least-once delivery without distributed transactions, so consumers must handle duplicates.

"What's a saga?"

A multi-step distributed transaction split into local transactions per service, each with a compensating action to undo it on failure. Orchestration (central coordinator) or choreography (services emit/react to events).

"Why is 'database per service' a rule?"

Without it, schema changes need cross-team coordination, queries cross "trust boundaries," and you can't deploy independently. Sharing the DB = your services are actually one service.

"When have you regretted splitting?"

(For senior interviews.) Tell a story where premature splitting led to deployment pain / latency / debugging cost; you merged back; what you learned.

"Service mesh: needed?"

Helpful at scale (mTLS everywhere, traffic shifting, retries at infra layer). Significant ops cost. Most small teams don't need it; reach for it when you have ≥ ~10 services and the cross-cutting concerns hurt.


A Mental Sanity Check

Before adding a service, ask:

  1. Will it really deploy independently?
  2. Does it own its data?
  3. Is the boundary stable? (Will I refactor it in 6 months?)
  4. Who owns it on-call?
  5. Is the latency overhead acceptable?
  6. Do I have the observability to debug it?

If "no" to any: stay in the monolith.


Cheat sheet

| Need | Pattern |
|---|---|
| Decouple services | event bus / queue |
| Reliable event publish | outbox pattern |
| Multi-step business txn | saga (orchestration / choreography) |
| Find services | DNS / registry / mesh |
| Prevent cascade | timeout + circuit breaker + bulkhead |
| Same operation, safe retries | idempotency key |
| Read-heavy with complex writes | CQRS |
| Full audit / replay | event sourcing |
| Debug across services | distributed tracing |
| Roll out safely | canary + feature flag |
| Schema change safely | additive → migrate → remove |
| Avoid two services touching the same table | database per service |