PySpark: Basic → Hard

PySpark Syntax Revision

A quick-reference practice path covering one problem per core PySpark operation. Run through this the day before an interview to refresh your API muscle memory.

17 problems, ~4h

1. Orders with Customer Names (basic, pyspark)

join() — the most common DataFrame operation. Practice inner join syntax and column selection.

2. Products Never Ordered (basic, pyspark)

Left anti join: the PySpark way to find rows with no match, equivalent to SQL's LEFT JOIN ... WHERE right.key IS NULL.

3. Customers from Specific States (basic, pyspark)

isin() — PySpark's equivalent of SQL IN (...). Refresher on filtering against a list of values.

4. Products Under Budget (basic, pyspark)

filter() / where() — basic boolean condition filtering. The first thing you do in any pipeline.

5. Find Gmail Customers (basic, pyspark)

contains() / like() — PySpark string matching. Know both; interviewers ask which to use and when.

6. Employee Full Name (basic, pyspark)

F.concat() / F.concat_ws() — combining string columns. concat_ws avoids manual separator handling.

7. Uppercase Employee Names (basic, pyspark)

F.upper() — string transformation with withColumn(). Pattern for any column-level string function.

8. Employee Count by Department (basic, pyspark)

groupBy() + count() — the bread-and-butter aggregation. Refresher on groupBy syntax vs pandas.

9. Monthly Sales Totals (basic, pyspark)

groupBy() + agg(F.sum()) — multi-column groupBy with named aggregation using .alias().

10. Find Duplicate Phone Numbers (basic, pyspark)

groupBy() + having: filter after aggregation using .filter(F.col("count") > 1). The DataFrame API has no having() method; the HAVING keyword exists only in Spark SQL.

11. Unique Product Categories (basic, pyspark)

distinct() / dropDuplicates(): know the difference. distinct() deduplicates on the full row; dropDuplicates(cols) deduplicates on a subset of columns.

12. Employees Sorted by Salary and Name (basic, pyspark)

orderBy() with multiple columns and F.desc() — multi-key sort syntax that differs from SQL.

13. Top Expensive Products (basic, pyspark)

orderBy() + limit() — chaining sort and limit, the PySpark equivalent of ORDER BY ... LIMIT.

14. Employee Salary Increase (medium, pyspark)

withColumn() + F.when().otherwise() — conditional column derivation. The PySpark equivalent of CASE WHEN.

15. New vs Returning Customers (medium, pyspark)

union() / unionByName() — stacking DataFrames. Know why unionByName is safer when column order differs.

16. Running Total of Daily Sales (easy, pyspark)

Window + F.sum().over() — window function syntax: Window.orderBy() + rowsBetween/rangeBetween.

17. Deduplicate Records (hard, pyspark)

row_number() window + filter — window-based dedup to keep one row per group. The advanced dedup pattern.