Nightly ETL & Data Warehouse Pipeline
Design a nightly ETL pipeline for a retail company that ingests data from 8 source systems — ERP, CRM, point of sale, inventory management, logistics, web analytics, returns processing, and supplier portal — into a ce...
Design a nightly ETL pipeline for a retail company that ingests data from 8 source systems — ERP, CRM, point-of-sale, inventory management, logistics, web analytics, returns processing, and supplier portal — into a central data warehouse. The pipeline runs at 22:00 UTC and must complete before 06:00 UTC when 300+ business analysts begin running reports and dashboards. The warehouse currently holds 5 years of historical data (~12TB) and grows by ~50GB per day. Source schemas change without warning two to three times per year. Design this end to end, with emphasis on incremental load strategies, SCD Type 2 handling for slowly changing dimension data, dependency-ordered orchestration across 8 sources, late-arriving data, data quality enforcement, idempotent re-runs, and failure recovery.
How to Approach This Problem
What Makes This Problem Hard A nightly ETL pipeline sounds like a solved problem — extract data, transform it, load it. But interview candidates fail this question because they describe a one off script, not a production system that runs every night for years without breaking. The real complexity lives in three areas that naive designs completely miss. Hard problem 1: SCD Type 2 correctness under re runs Slowly Changing Dimensions require versioned history rows. A customer who changes their email address should have two rows in dim customer: one closed (valid to = yesterday) and one current (valid to = 9999 12 31). If your ETL re runs for the same business date — which it will, due to upstre
Clarifying Questions
Functional Requirements Which of the 8 sources support CDC (Change Data Capture) vs full daily snapshots? CDC enables true incremental loads; full snapshots require diff logic. Which tables are slowly changing dimensions (SCD)? Do we need Type 1 (overwrite), Type 2 (history), or Type 3 (previous value column)? What is the grain of each fact table? (Order level, line item level, daily aggregate?) Do analysts need intraday data, or is daily refresh sufficient? Are there intercompany or cross entity data dependencies (e.g., a product in the ERP must exist before an order can load)? Should deleted source records be soft deleted (add deleted at column) or hard deleted from the warehouse? What is
Envelope Estimation
Scale Estimates Source Data Volumes per Night Source Row Count Avg Row Size Compressed Size ERP (orders + line items) 2M rows 800B ~300MB CRM (customer + contact updates) 500K rows 600B ~80MB POS (transactions) 5M rows 400B ~500MB Inventory (stock movements) 3M rows 300B ~250MB Logistics (shipment events) 1M rows 500B ~150MB Web analytics (clickstream) 800M events 200B ~3GB Returns 200K rows 700B ~40MB Supplier portal 100K rows 1KB ~25MB Total ~812M rows ~4.3GB/night compressed Warehouse Size 5 years × 365 days × 50GB/day (uncompressed) = ~91TB uncompressed With Parquet/columnar compression (5 10×): ~9 18TB in warehouse storage Growth rate: 50GB/day raw → ~5 10GB/day in warehouse after compr
Architecture Walkthrough
The 5 Layer Pipeline The architecture follows a medalion / layered pattern : Raw → Validated → Transformed → Warehouse → Semantic. Each layer is independently re readable, so any transformation can be re run from the previous layer without re extracting from source systems. Layer by Layer Flow Layer Storage What Happens Raw Zone S3/GCS/ADLS Source data landed as is, partitioned by source + date. Never modified after write. Retained 90 days for re processing. DQ Gate In memory Row count check ± 5%, PK uniqueness, schema contract validation. Fail here = alert source team, do not proceed. Transform Layer Spark SCD merges, deduplication, business rule application, surrogate key generation. Outpu
Component Deep Dive
SCD Type 2 Merge: The Core Algorithm SCD Type 2 is where most ETL pipelines have subtle bugs. The correct implementation has four cases: PySpark SCD Type 2 Implementation Idempotency guarantee: If this merge runs twice for the same date, the second run finds the same source data, the same current rows, and the changed rows set is empty (changes were already applied). No new rows are inserted. The merge is a true no op on re run. Incremental Fact Load with Watermark Why update the watermark last? If the Spark job fails after writing partial output but before updating the watermark, the next run re reads from the previous watermark and re processes the same records. The warehouse MERGE/UPSERT
Data Modeling
Star Schema Design Fact Table: fact orders Dimension Table: dim customer (SCD Type 2) Why a partial unique index? UNIQUE WHERE is current = TRUE enforces that there is exactly one current version of each customer while allowing many historical rows. Without this, you'd need application level logic to prevent duplicate current rows, which breaks under concurrent inserts. Dimension Table: dim product (SCD Type 1 + selective Type 2) Not all attributes need versioning. Product price is Type 1 (overwrite) because analysts want current pricing. Product category is Type 2 because category changes affect historical revenue analysis by category. Handling Orphaned Fact Records (Late Dim Arrival) If a