Pandas Data Wrangling
The essential Pandas skills for DE and analytics engineering interviews — filtering, grouping, merging, pivoting, time-series, and window functions.
Boolean indexing + sort_values() — the two most fundamental Pandas operations. Start every analysis here.
merge() with how="left" + isnull() filter — Pandas left join to find rows with no match in another table.
Boolean filter on a categorical column — practice filtering on string equality and multiple conditions.
value_counts() — counting distinct categories in a column, the quickest way to profile a column.
groupby() + count() + rename() — aggregation pipeline to build a frequency distribution.
groupby() + agg with multiple functions + sort — combining sum/count aggregations in one groupby call.
groupby() + max/min on dates + timedelta arithmetic — date column operations and per-group date range.
groupby() + filter by date + count + sort + head() — multi-step chained transformation pipeline.
merge() + groupby() + sort — joining two DataFrames then aggregating on the joined result.
groupby() + filter with transform — filtering groups based on aggregate conditions using transform().
groupby() + idxmax() or nlargest — finding the mode in a frequency-encoded dataset.
merge() on two date columns + timedelta comparison — joining on computed date difference.
merge() + isnull().mean() — ratio calculation after a left join. Funnel metric pattern.
merge() + boolean condition on joined columns + mean() — join-then-filter-then-aggregate pipeline.
pivot or conditional groupby + percentage calculation — reshaping data to compare two metrics side by side.
groupby() + nunique() + filter — counting distinct values per group, then filtering on that count.
groupby() + conditional sum using np.where — splitting one column into two based on row parity.
groupby() + rank(method="dense") — top-N per group using dense rank, a key interview pattern.
merge() + groupby() + rank() + filter — multi-step pipeline combining join, aggregation, and ranking.
groupby() + cumcount() to assign row numbers per group + filter on row 3 — ranked row selection pattern.
sort_values() + shift() per group + timedelta comparison — detecting near-duplicate rows within a time window.
rolling(window, min_periods) + mean() — time-series rolling window calculation, a key analytics pattern.
merge() + np.select() for multi-condition column assignment — complex conditional logic across joined tables.
pivot + date arithmetic + sum — computing uptime from start/stop events, a real DE infrastructure metric.
groupby() + cumsum() + conditional allocation — multi-constraint optimization over grouped data.