PySpark for Data Engineers
Real DE pipeline problems from ingestion to analytics — joins, aggregations, window functions, sessionization, and retention. Solvable in PySpark, SQL, or Pandas.
The "hello world" of distributed computing — understand map/reduce and the Spark execution model before anything else.
GROUP BY + HAVING COUNT > 1 — identifying returning customers is one of the first metrics every DE pipeline computes.
Partitioned aggregation — GROUP BY region + SUM(). The building block for every regional reporting pipeline.
GROUP BY + COUNT + ORDER BY + LIMIT — rank categories by volume. A daily metric in every e-commerce pipeline.
Conditional COUNT / total COUNT — ratio metric pattern. Used in quality monitoring pipelines across all industries.
Window + cumulative SUM — the first window function in PySpark. Window.orderBy() + F.sum().over().
Multi-table join + GROUP BY + SUM — the single most common DE business metric. Every pipeline has a CLV job.
RANK() / DENSE_RANK() OVER (PARTITION BY dept) — partitioned ranking, used in top-N-per-group pipeline outputs.
LAG() OVER (ORDER BY month) — access the previous period without a self-join. Core growth metric pattern.
EXTRACT(hour FROM timestamp) + GROUP BY + COUNT — time bucketing on event timestamps, used in capacity planning.
LEFT JOIN + date comparison — find users with no activity in N days. Standard churn signal in every retention pipeline.
lag() window + time gap threshold — session boundary detection on event streams. A real DE pipeline problem at every product company.
Self-join + window function — retention matrix by signup cohort. Asked at Amazon, Google, Meta DE rounds.
LAG() with PARTITION BY year — multi-level window composition. The canonical KPI pipeline metric.
Date spine generation + LEFT JOIN — filling gaps in time-series data is one of the trickiest DE pipeline tasks.
Rolling window aggregation + ratio — combines frame clauses with derived metrics. Real supply chain pipeline pattern.
row_number() OVER (PARTITION BY id ORDER BY updated_at DESC) — the standard upsert/dedup pattern in every lakehouse pipeline.
Cumulative SUM over a gap flag — advanced sessionization assigning session IDs. Asked at Databricks, Netflix, Uber.