PySparkEasy → Hard

PySpark for Data Engineers

Real DE pipeline problems from ingestion to analytics — joins, aggregations, window functions, sessionization, and retention. Solvable in PySpark, SQL, or Pandas.

18 problems~10h

Classic Word Counteasypyspark

The "hello world" of distributed computing — understand map/reduce and the Spark execution model before anything else.

Repeat Customerseasysql

GROUP BY + HAVING COUNT > 1 — identifying returning customers is one of the first metrics every DE pipeline computes.

Total Revenue by Regioneasysql

Partitioned aggregation — GROUP BY region + SUM(). The building block for every regional reporting pipeline.

Most Popular Product Categoryeasysql

GROUP BY + COUNT + ORDER BY + LIMIT — rank categories by volume. A daily metric in every e-commerce pipeline.

Product Return Rateeasysql

Conditional COUNT / total COUNT — ratio metric pattern. Used in quality monitoring pipelines across all industries.

Running Total of Daily Saleseasypyspark

Window + cumulative SUM — the first window function in PySpark. Window.orderBy() + F.sum().over().

Customer Lifetime Valuemediumsql

Multi-table join + GROUP BY + SUM — the single most common DE business metric. Every pipeline has a CLV job.

Rank Employees by Salary per Departmentmediumsql

RANK() / DENSE_RANK() OVER (PARTITION BY dept) — partitioned ranking, used in top-N-per-group pipeline outputs.

Month-over-Month Revenue Growthmediumsql

LAG() OVER (ORDER BY month) — access the previous period without a self-join. Core growth metric pattern.

Peak Order Hourmediumsql

EXTRACT(hour FROM timestamp) + GROUP BY + COUNT — time bucketing on event timestamps, used in capacity planning.

Identify Inactive Customersmediumsql

LEFT JOIN + date comparison — find users with no activity in N days. Standard churn signal in every retention pipeline.

Sessionize Clickstream Datamediumpyspark

lag() window + time gap threshold — session boundary detection on event streams. A real DE pipeline problem at every product company.

Monthly Cohort Retentionhardsql

Self-join + window function — retention matrix by signup cohort. Asked at Amazon, Google, Meta DE rounds.

Year-Over-Year Growth Ratehardsql

LAG() with PARTITION BY year — multi-level window composition. The canonical KPI pipeline metric.

Fill Missing Dates in Saleshardsql

Date spine generation + LEFT JOIN — filling gaps in time-series data is one of the trickiest DE pipeline tasks.

Inventory Turnover Rate Trendhardsql

Rolling window aggregation + ratio — combines frame clauses with derived metrics. Real supply chain pipeline pattern.

Deduplicate Records by Latest Updatehardpyspark

row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) — the standard upsert/dedup pattern in every lakehouse pipeline.

User Session Detectionhardpyspark

Cumulative SUM over a gap flag — advanced sessionization assigning session IDs. Asked at Databricks, Netflix, Uber.