Large-Scale Historical Backfill Pipeline
Your data team discovers that the customer lifetime value metric — used in executive dashboards, ML models, and budget planning — has been calculated with a bug in the attribution logic for the past 3 years. You have...
Your data team discovers that the customer lifetime value metric — used in executive dashboards, ML models, and budget planning — has been calculated with a bug in the attribution logic for the past 3 years. You have 5 years of raw event data stored in S3 in daily Parquet partitions: approximately 15 terabytes, 2 trillion events. The new corrected pipeline also adds 3 new derived features the ML team needs. You must reprocess all 5 years of data (1,825 partitions) within 2 weeks without impacting the live nightly pipeline that runs on the same cluster and reads from the same source data. The backfill must be resumable: if it fails at day 1,200 of 1,825, it should continue from day 1,200, not restart from day 1. You must be able to roll back if the new output is discovered to be wrong after promotion. Design this end to end.
How to Approach This Problem
What Makes This Problem Unique Historical backfill questions test a different skill set from forward looking pipeline design. The interviewer is evaluating whether you can manage risk, isolate blast radius, and maintain operational correctness while running a potentially month long job that touches petabytes of data. The three hard problems are not about Spark performance — they are about correctness under failure and safe promotion. Hard problem 1: Resumability without data corruption A 14 day job running across 1,825 partitions will fail multiple times. The question is: when it resumes, does it re process partitions that already completed, and if so, does that produce duplicate or corrupte
Clarifying Questions
Functional Requirements What is the partition granularity? Daily, hourly, or by business unit? (Affects job count: 5 years × 365 days = 1,825 if daily; 1,825 × 24 = 43,800 if hourly) Does the new logic produce a different schema, or is it schema compatible with the old output? Are there downstream dependencies on the old output that need to be notified or re run after promotion? What is the validation criterion — row count match, statistical distribution check, or spot check on a sample? Is rollback a hard requirement, or is "run the old pipeline again" an acceptable recovery? Should the backfill process all 5 years, or is there a cut off (e.g., only last 3 years for the CLV metric)? Non Fun
Envelope Estimation
Scale Estimates Data Volume Source data: 15TB, 2 trillion events, 5 years of daily partitions = 1,825 partitions Average partition size: 15TB ÷ 1,825 = ~8GB/partition uncompressed; ~800MB–1.5GB compressed Parquet New output size: ~10–20% larger than input (3 new derived features added) → ~16–18TB total backfill output Parallelism Math Parallelism Jobs/day Total Duration Cluster Size 10 concurrent jobs 10 183 days — far too slow 10 × 4 cores = 40 cores 50 concurrent jobs 50 37 days — still too slow 50 × 4 cores = 200 cores 100 concurrent jobs 100 19 days — borderline 100 × 4 cores = 400 cores 150 concurrent jobs 150 13 days — fits 14 day window 150 × 4 cores = 600 cores Target: 150 concurrent
Architecture: The 5-Phase Model
Five Phases of a Safe Historical Backfill Phase 1: Plan (Day 0, 1 hour) The Partition Planner enumerates all 1,825 partition paths, validates that each source partition exists (alert on any missing partitions before starting), and enqueues them in the job queue in most recent first order . Why most recent first? Business stakeholders most urgently need recent months fixed. If the backfill is interrupted at Day 7 of 14, you have the last 3 years corrected, not the first 3 years. Phase 2: Execute (Days 1–13, continuous) The executor pool consumes jobs from the queue. For each job: 1. Check checkpoint: if COMPLETED → skip (idempotent) 2. Detect zombie: if IN PROGRESS with heartbeat 5 minutes ol
Partitioning Strategy
Why Partition Boundary Alignment Is Critical The wrong approach: "Process the first 500GB of data in job 1, the next 500GB in job 2..." — splitting by byte range crosses partition boundaries. Job 1 reads partial data from Day 1, job 2 reads the rest. The downstream aggregation is now split across two jobs with no way to reconcile them. Any business logic that requires all events from a day to be present (daily aggregation, session reconstruction, day level CLV) will produce wrong results. The right approach: Each job processes exactly one partition (one day or one business unit) atomically. The partition is the unit of work, the unit of idempotency, and the unit of output. Handling Skewed Pa
Idempotency: Every Partition Job Must Be Safe to Re-Run
The Idempotency Contract Every partition job must satisfy this contract: "Running the job for partition date X any number of times must produce the same output in the same output path, without any manual intervention between runs." Output Path Versioning Each partition job writes to backfill/v1/dt={date}/run id={run id}/ . The run id ensures that if the job runs twice (retry after failure), the second run writes to the same path and overwrites the first run's output. Writing to a new run id path each time would leave orphaned partial outputs. Checkpoint Store Read Before Write