PySpark Syntax Revision
A quick-reference practice path covering one problem per core PySpark operation. Run through this the day before an interview to refresh your API muscle memory.
join() — one of the most common DataFrame operations. Practice inner join syntax and column selection.
Left anti join — the PySpark way to find left-side rows with no match on the right, equivalent to LEFT JOIN ... WHERE right.key IS NULL in SQL.
isin() — PySpark's equivalent of SQL IN (...). Refresher on filtering against a list of values.
filter() / where() — basic boolean condition filtering; where() is an alias for filter(). Often the first step in a pipeline.
contains() / like() — PySpark string matching. Know both; interviewers ask which to use and when.
F.concat() / F.concat_ws() — combining string columns. concat_ws avoids manual separator handling.
F.upper() — string transformation with withColumn(). Pattern for any column-level string function.
groupBy() + count() — the bread-and-butter aggregation. Refresher on groupBy syntax vs pandas.
groupBy() + agg(F.sum()) — multi-column groupBy with named aggregation using .alias().
groupBy() + having — filter after aggregation using .filter(F.col("count") > 1). The DataFrame API has no having() method; the HAVING keyword exists only in Spark SQL.
distinct() / dropDuplicates() — know the difference: distinct() deduplicates on the full row, while dropDuplicates(cols) deduplicates on a subset of columns.
orderBy() with multiple columns and F.desc() — multi-key sort syntax that differs from SQL.
orderBy() + limit() — chaining sort and limit, the PySpark equivalent of ORDER BY ... LIMIT.
withColumn() + F.when().otherwise() — conditional column derivation. The PySpark equivalent of CASE WHEN.
union() / unionByName() — stacking DataFrames. union() matches columns by position, so unionByName() is safer when column order differs.
Window + F.sum().over() — window function syntax: Window.orderBy() + rowsBetween/rangeBetween.
row_number() window + filter — window-based dedup to keep one row per group. The advanced dedup pattern.