Skip to main content

SQL

The SQL surface area asked in SDE interviews — joins, indexes, window functions — plus 15 classic problems with solutions (PostgreSQL).

Free reference · last reviewed

SDE interviews test a narrow slice of SQL: the join types, when an index helps (and when it's quietly ignored), GROUP BY/HAVING, and window functions. Get those cold and you clear most rounds. Below is exactly that surface area, plus 15 classic problems with worked PostgreSQL solutions.

Dialect: PostgreSQL (notes for MySQL where they differ).


Order of Execution

Mental model: SQL is written in one order but executed in another.

Number FROM/WHERE/GROUP BY/HAVING/SELECT/ORDER BY/LIMIT in execution order

WrittenExecuted
SELECT5
FROM1
WHERE2
GROUP BY3
HAVING4
SELECT5
ORDER BY6
LIMIT7

Implication: aliases in SELECT can't be used in WHERE (still being built), but can in ORDER BY.


Joins

-- INNER: rows matching on both sides
SELECT u.name, o.total
FROM users u
JOIN orders o ON o.user_id = u.id;

-- LEFT: all left rows; right NULL if no match
SELECT u.name, o.total
FROM users u
LEFT JOIN orders o ON o.user_id = u.id;

-- FULL OUTER: union of LEFT + RIGHT
SELECT u.name, o.total
FROM users u
FULL OUTER JOIN orders o ON o.user_id = u.id;

-- SELF JOIN: same table twice (employee → manager)
SELECT e.name AS emp, m.name AS mgr
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;

-- CROSS JOIN: cartesian product (use sparingly)
SELECT a.id, b.id FROM a CROSS JOIN b;

Predict the pattern

Datasetusers: (id=1, name='alice'), (id=2, name='bob'). orders: (id=1, user_id=1, total=50).

SELECT u.name, o.total
FROM users u
LEFT JOIN orders o ON o.user_id = u.id;

What is o.total for the row where u.name = 'bob'?

LEFT JOIN keeps every row from the left table. Bob (id=2) has no matching order, so all right-side columns are filled with NULL — the row is not dropped.

Predict the pattern

Dataset — same as above: users: (id=1 alice), (id=2 bob). orders: one row with user_id=1.

SELECT u.name, o.total
FROM users u
JOIN orders o ON o.user_id = u.id;

How many rows does this INNER JOIN return?

INNER JOIN only keeps rows where the join condition matches on both sides. Bob (id=2) has no order, so he is excluded. Only alice matches → 1 row.

Anti-join / semi-join

-- Users with NO orders
SELECT u.* FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE o.id IS NULL;

-- Or:
SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM orders);
-- Or (safer with NULLs):
SELECT * FROM users u WHERE NOT EXISTS
  (SELECT 1 FROM orders o WHERE o.user_id = u.id);

NOT IN trap: if subquery returns any NULL, the whole NOT IN is unknown → 0 rows. Prefer NOT EXISTS.


Aggregates + GROUP BY / HAVING

SELECT user_id, COUNT(*) AS n_orders, SUM(total) AS revenue
FROM orders
WHERE created_at >= '2025-01-01'
GROUP BY user_id
HAVING SUM(total) > 1000
ORDER BY revenue DESC;

Predict the pattern

Datasetemp table, 3 rows: bonus values are 100, NULL, 200.

SELECT COUNT(bonus) FROM emp;

What does this query return?

COUNT(col) counts non-NULL values only; the row with bonus = NULL is silently skipped. COUNT(*) would return 3. COUNT(bonus) returns 2.

Predict the pattern

Dataset — table t with 3 rows: col = 'a', col = 'x', col = NULL.

SELECT * FROM t WHERE col != 'x';

Rows where col IS NULL are ___ from the result.

Any comparison with NULL yields UNKNOWN (not TRUE), so NULL != 'x' is UNKNOWN, which fails the WHERE filter. Only col = 'a' passes → NULL rows are excluded.

  • WHERE filters rows before grouping.
  • HAVING filters groups after.
  • Every non-aggregated column in SELECT must be in GROUP BY (Postgres strict; MySQL lax → enable ONLY_FULL_GROUP_BY).

Conditional aggregates

SELECT
  COUNT(*) FILTER (WHERE status = 'paid')   AS paid,
  COUNT(*) FILTER (WHERE status = 'failed') AS failed
FROM orders;

-- Portable form:
SELECT
  SUM(CASE WHEN status='paid'   THEN 1 ELSE 0 END) AS paid,
  SUM(CASE WHEN status='failed' THEN 1 ELSE 0 END) AS failed
FROM orders;

Window Functions

The interview-grade tool. OVER (PARTITION BY ... ORDER BY ...).

-- Running total per user
SELECT user_id, order_date, total,
       SUM(total) OVER (PARTITION BY user_id ORDER BY order_date) AS running
FROM orders;

-- Rank orders within user by total
SELECT user_id, total,
       ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY total DESC) AS rn,
       RANK()       OVER (PARTITION BY user_id ORDER BY total DESC) AS rk,
       DENSE_RANK() OVER (PARTITION BY user_id ORDER BY total DESC) AS drk
FROM orders;

-- Previous / next row
SELECT user_id, order_date, total,
       LAG(total)  OVER (PARTITION BY user_id ORDER BY order_date) AS prev_total,
       LEAD(total) OVER (PARTITION BY user_id ORDER BY order_date) AS next_total
FROM orders;
FunctionOutput
ROW_NUMBER()unique sequential rank, no ties
RANK()ties share rank, leaves gaps (1,1,3)
DENSE_RANK()ties share rank, no gaps (1,1,2)
NTILE(k)distribute rows into k buckets
LAG(col, n)value from row n behind
LEAD(col, n)value from row n ahead
FIRST_VALUE, LAST_VALUEfirst/last in window
SUM/AVG/COUNT OVERrunning aggregate

Frame clause

SUM(x) OVER (
  PARTITION BY u
  ORDER BY t
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW   -- running total
)
-- BETWEEN 6 PRECEDING AND CURRENT ROW              -- 7-day moving sum

CTEs (WITH)

WITH recent_orders AS (
  SELECT user_id, SUM(total) AS spent
  FROM orders
  WHERE created_at >= NOW() - INTERVAL '30 days'
  GROUP BY user_id
)
SELECT u.name, r.spent
FROM users u JOIN recent_orders r ON r.user_id = u.id
WHERE r.spent > 500;
  • Improves readability over deep subqueries.
  • Postgres ≥ 12: CTEs no longer always materialize; use MATERIALIZED/NOT MATERIALIZED to control.

Recursive CTE (graph / hierarchy)

-- All descendants of manager id 1
WITH RECURSIVE tree AS (
  SELECT id, name, manager_id, 1 AS depth
  FROM employees WHERE id = 1
  UNION ALL
  SELECT e.id, e.name, e.manager_id, t.depth + 1
  FROM employees e JOIN tree t ON e.manager_id = t.id
)
SELECT * FROM tree;

Use for: org charts, category trees, BOMs, shortest-path on small graphs, "explode" sequences.


Subqueries

TypeExample
ScalarSELECT name, (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) FROM users u
INWHERE id IN (SELECT user_id FROM orders)
EXISTSWHERE EXISTS (SELECT 1 FROM orders WHERE user_id = u.id)
Correlatedsubquery references outer query, runs per row

EXISTS short-circuits → typically faster than IN on large data, and NULL-safe.


Indexes

Types

TypeBest for
B-tree (default)equality + range; ORDER BY
Hashequality only
GINfull-text, JSONB, array contains
GiSTgeometry, ranges
BRINhuge, append-only, naturally ordered tables

Composite index

CREATE INDEX idx_orders_user_date ON orders(user_id, created_at);

Useful for:

  • WHERE user_id = ? AND created_at > ?
  • WHERE user_id = ?
  • Not WHERE created_at > ? (leftmost-prefix rule).

Covering index

CREATE INDEX idx_orders_user_total
ON orders(user_id) INCLUDE (total);
-- Query can be answered from index alone (no table fetch).

Partial index

CREATE INDEX idx_orders_pending
ON orders(created_at) WHERE status = 'pending';

When indexes hurt

  • Write-heavy tables: every index = extra write cost.
  • Low-cardinality columns (e.g. boolean): full scan often wins.
  • Tiny tables.

Index doesn't get used? Common reasons

  • Function on column: WHERE LOWER(email) = ? → use functional index or normalize at write.
  • Implicit cast: WHERE int_col = '5' → planner may not use index.
  • Leading wildcard: LIKE '%abc'.
  • OR across different columns (sometimes).
  • Stats stale: ANALYZE table_name.

EXPLAIN ANALYZE

EXPLAIN ANALYZE
SELECT u.name, SUM(o.total)
FROM users u JOIN orders o ON o.user_id = u.id
GROUP BY u.name;

Look for:

  • Seq Scan on a large table with selective predicate → missing index.
  • High cost discrepancy vs actual time → stale statistics.
  • Nested Loop with huge outer side → consider Hash Join.
  • Rows Removed by Filter large → filter too late, push to index.
  • Sort with Disk mention → bump work_mem or add ORDER BY index.

Normalization

FormRule
1NFatomic values, unique rows
2NF+ no partial dependency on composite PK
3NF+ no transitive dependency
BCNF+ every determinant is a candidate key

In practice: aim for 3NF, then denormalize for performance when joins are too expensive (e.g. precomputed counts, materialized views, search indexes).


Transactions & Isolation

BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE id = 1;
  UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
Isolation levelPhenomena allowed
READ UNCOMMITTEDdirty reads
READ COMMITTEDnon-repeatable reads, phantoms (Postgres default)
REPEATABLE READphantoms (MySQL default; Postgres prevents phantoms here)
SERIALIZABLEnone

Pessimistic lock

SELECT * FROM seats WHERE show_id = 1 AND id = 42 FOR UPDATE;
-- ... reserve ...
COMMIT;

Optimistic concurrency

UPDATE orders SET total = ?, version = version + 1
WHERE id = ? AND version = ?;   -- 0 rows updated => someone beat you

Useful PostgreSQL features

  • RETURNING: get inserted/updated row(s) back.
INSERT INTO users(name) VALUES ('a') RETURNING id;
  • INSERT ... ON CONFLICT ... DO UPDATE (upsert):
INSERT INTO clicks(url, n) VALUES ('/x', 1)
ON CONFLICT (url) DO UPDATE SET n = clicks.n + 1;
  • generate_series: synthetic rows for date spines.
SELECT d::date FROM generate_series('2025-01-01','2025-01-31','1 day') d;
  • JSONB: col->'key', col->>'key' (as text), col @> '{"a":1}' containment.
  • Arrays: int[], text[]; ANY, unnest.
  • CTE with DELETE/UPDATE + RETURNING for "move rows" patterns.

15 Classic SQL Interview Questions

Schema we'll use unless stated:

employees(id, name, dept_id, salary, manager_id, hire_date)
departments(id, name)
orders(id, user_id, total, status, created_at)
users(id, name, email)
logins(user_id, login_date)

Q1. Nth highest salary

-- 2nd highest, distinct
SELECT MAX(salary) FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);

-- General Nth
SELECT salary
FROM (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) rk
      FROM employees) t
WHERE rk = N;

Q2. Employees earning more than their manager

SELECT e.name
FROM employees e
JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;

Q3. Top 3 highest-paid per department

SELECT id, name, dept_id, salary
FROM (
  SELECT *, DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) rk
  FROM employees
) t
WHERE rk <= 3;

Q4. Departments with no employees

SELECT d.name
FROM departments d
LEFT JOIN employees e ON e.dept_id = d.id
WHERE e.id IS NULL;

Q5. Duplicate emails

SELECT email, COUNT(*) AS n
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

Q6. Delete duplicates, keep lowest id

DELETE FROM users a
USING users b
WHERE a.email = b.email AND a.id > b.id;

Q7. Consecutive login streak (≥ 3 days)

Classic "gaps & islands":

WITH grp AS (
  SELECT user_id, login_date,
         login_date - (ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date))::int AS bucket
  FROM logins
)
SELECT user_id, MIN(login_date) AS start_d, MAX(login_date) AS end_d, COUNT(*) AS streak
FROM grp
GROUP BY user_id, bucket
HAVING COUNT(*) >= 3;

Q8. Median salary

SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median
FROM employees;

Q9. Year-over-year revenue growth

WITH yearly AS (
  SELECT EXTRACT(YEAR FROM created_at)::int AS yr, SUM(total) AS rev
  FROM orders GROUP BY 1
)
SELECT yr, rev,
       LAG(rev) OVER (ORDER BY yr) AS prev_rev,
       ROUND(100.0 * (rev - LAG(rev) OVER (ORDER BY yr))
             / NULLIF(LAG(rev) OVER (ORDER BY yr), 0), 2) AS pct_growth
FROM yearly;

Q10. Pivot status counts (one row per user)

SELECT user_id,
  COUNT(*) FILTER (WHERE status='paid')    AS paid,
  COUNT(*) FILTER (WHERE status='pending') AS pending,
  COUNT(*) FILTER (WHERE status='failed')  AS failed
FROM orders
GROUP BY user_id;

Q11. Users who placed an order every month of last year

SELECT user_id
FROM orders
WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01'
GROUP BY user_id
HAVING COUNT(DISTINCT EXTRACT(MONTH FROM created_at)) = 12;

Q12. Active users last 7 days but not in the 7 before

WITH last7  AS (SELECT DISTINCT user_id FROM logins
                WHERE login_date >= CURRENT_DATE - 7),
     prev7  AS (SELECT DISTINCT user_id FROM logins
                WHERE login_date >= CURRENT_DATE - 14 AND login_date < CURRENT_DATE - 7)
SELECT user_id FROM last7 WHERE user_id NOT IN (SELECT user_id FROM prev7);

Q13. Cumulative orders per user

SELECT user_id, created_at, total,
       SUM(total) OVER (PARTITION BY user_id ORDER BY created_at) AS cum_total
FROM orders;

Q14. Self-join: friend recommendations (friends of friends, not already friends)

-- friendships(a, b)
SELECT DISTINCT f1.a AS u, f2.b AS rec
FROM friendships f1
JOIN friendships f2 ON f1.b = f2.a
WHERE f1.a != f2.b
  AND NOT EXISTS (
    SELECT 1 FROM friendships f3
    WHERE f3.a = f1.a AND f3.b = f2.b
  );

Q15. Tree depth (recursive)

WITH RECURSIVE r AS (
  SELECT id, manager_id, 1 AS depth FROM employees WHERE manager_id IS NULL
  UNION ALL
  SELECT e.id, e.manager_id, r.depth + 1
  FROM employees e JOIN r ON e.manager_id = r.id
)
SELECT MAX(depth) AS tree_depth FROM r;

Performance gotchas: quick list

  • SELECT * ships every column over the wire and disables covering indexes. Name columns.
  • OFFSET 100000 LIMIT 20 is slow. Use cursor pagination (WHERE id > last_id ORDER BY id LIMIT 20).
  • COUNT(*) over a huge table is slow in Postgres. Use estimates (pg_class.reltuples) if approximation is fine.
  • Implicit casts kill index usage.
  • ORDER BY ... LIMIT 1 over indexed column is O(log n); without index it sorts everything.
  • WHERE col IN (huge list) may be slower than a temp table join.
  • IN vs EXISTS: EXISTS is safer with NULLs and short-circuits.
  • Transactions held open block writers. Keep them short.
  • Each index slows down writes. Audit unused indexes (pg_stat_user_indexes).

Quick reference

NeedSQL
Distinct valuesSELECT DISTINCT col FROM t
Count of distinctCOUNT(DISTINCT col)
String containsWHERE col ILIKE '%abc%'
Null checkWHERE col IS NULL (never = NULL)
CoalesceCOALESCE(col, 'default')
ConditionalCASE WHEN x THEN a ELSE b END
Castcol::int (PG) / CAST(col AS INT)
Concata || b (PG/Standard), CONCAT(a,b) (MySQL)
Date diff (days)(d1 - d2)::int (PG), DATEDIFF(d1,d2) (MySQL)
Top N per groupROW_NUMBER() OVER (PARTITION BY g ORDER BY x DESC) <= N
Paginationcursor: WHERE id > $last ORDER BY id LIMIT 20