All roadmaps
PySparkEasy → Hard

PySpark for Data Engineers

Real DE pipeline problems from ingestion to analytics — joins, aggregations, window functions, sessionization, and retention. Solvable in PySpark, SQL, or Pandas.

18 problems~10h
1
Classic Word Counteasypyspark

The "hello world" of distributed computing — understand map/reduce and the Spark execution model before anything else.

2
Repeat Customerseasysql

GROUP BY + HAVING COUNT > 1 — identifying returning customers is one of the first metrics every DE pipeline computes.

3
Total Revenue by Regioneasysql

Partitioned aggregation — GROUP BY region + SUM(). The building block for every regional reporting pipeline.

4
Most Popular Product Categoryeasysql

GROUP BY + COUNT + ORDER BY + LIMIT — rank categories by volume. A daily metric in every e-commerce pipeline.

5
Product Return Rateeasysql

Conditional COUNT / total COUNT — ratio metric pattern. Used in quality monitoring pipelines across all industries.

6
Running Total of Daily Saleseasypyspark

Window + cumulative SUM — the first window function in PySpark. Window.orderBy() + F.sum().over().

7
Customer Lifetime Valuemediumsql

Multi-table join + GROUP BY + SUM — the single most common DE business metric. Every pipeline has a CLV job.

8
Rank Employees by Salary per Departmentmediumsql

RANK() / DENSE_RANK() OVER (PARTITION BY dept) — partitioned ranking, used in top-N-per-group pipeline outputs.

9
Month-over-Month Revenue Growthmediumsql

LAG() OVER (ORDER BY month) — access the previous period without a self-join. Core growth metric pattern.

10
Peak Order Hourmediumsql

EXTRACT(hour FROM timestamp) + GROUP BY + COUNT — time bucketing on event timestamps, used in capacity planning.

11
Identify Inactive Customersmediumsql

LEFT JOIN + date comparison — find users with no activity in N days. Standard churn signal in every retention pipeline.

12
Sessionize Clickstream Datamediumpyspark

lag() window + time gap threshold — session boundary detection on event streams. A real DE pipeline problem at every product company.

13
Monthly Cohort Retentionhardsql

Self-join + window function — retention matrix by signup cohort. Asked at Amazon, Google, Meta DE rounds.

14
Year-Over-Year Growth Ratehardsql

LAG() with PARTITION BY year — multi-level window composition. The canonical KPI pipeline metric.

15
Fill Missing Dates in Saleshardsql

Date spine generation + LEFT JOIN — filling gaps in time-series data is one of the trickiest DE pipeline tasks.

16
Inventory Turnover Rate Trendhardsql

Rolling window aggregation + ratio — combines frame clauses with derived metrics. Real supply chain pipeline pattern.

17
Deduplicate Records by Latest Updatehardpyspark

row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) — the standard upsert/dedup pattern in every lakehouse pipeline.

18
User Session Detectionhardpyspark

Cumulative SUM over a gap flag — advanced sessionization assigning session IDs. Asked at Databricks, Netflix, Uber.