Customer 360 & Identity Resolution

Design a Customer 360 data platform for a large e-commerce company with 200 million customers. Customer identity is fragmented across six source systems: the main e-commerce database (email + order history), a mobile app (device IDs and app events), a loyalty program (loyalty_id and points balance), offline retail POS systems (store loyalty card swipes), a call center CRM (phone numbers and support tickets), and a third-party demographic enrichment provider (email and phone matching). A single customer may appear in all six systems with completely different identifiers and no shared primary key. Design a batch pipeline that resolves identities across all six systems, creates a unified golden record for each customer, handles PII compliantly under GDPR and CCPA, and refreshes the Customer 360 profile daily. The system must handle identity merges (two profiles that were separate turn out to be the same person) and identity splits (one profile that was merged turns out to be two different people) without corrupting downstream analytics.

How to Approach This Problem

What Makes This Problem Uniquely Hard

Customer 360 and identity resolution is one of the most commonly asked data engineering design questions at e-commerce, retail, and tech companies, and one of the most commonly answered poorly. The typical wrong answer is: "match records on email address." That is a one-line script. The hard problems are what happens when identity matching produces incorrect results, and how you undo them.

Hard problem 1: Identity merges that need to be undone (splits)

Probabilistic matching produces false positives. You will eventually merge two profiles that belong to different people, perhaps a father and son who share a last name and live at the same address. When a

Clarifying Questions

Functional Requirements

What matching keys are available across sources? (email, phone, loyalty_id, device ID, name + address?)
Is deterministic matching (exact match on a shared key) sufficient, or do we need probabilistic matching for records with no shared key?
What is the refresh cadence for the C360 profile: daily batch or near real-time?
Do we need bidirectional linking? Given a C360 profile, can I find all 6 source records? And given a source record, can I find its C360 profile?
What are the downstream consumers of the C360 profile? (Real-time personalization API, ML feature store, marketing segmentation, analytics?)
What is the expected false positive rate tolerance? (Finance teams

Envelope Estimation

Scale Estimates

Record Volume

Source                   Records                    Daily Delta               Avg Record Size
E-commerce DB            200M                       500K new + 1M updates     1KB
Mobile App               180M devices               2M new sessions           200B
Loyalty Program          80M members                100K new                  500B
Retail POS               60M loyalty card holders   200K new swipes           300B
Call Center CRM          30M contacts               50K new                   800B
Third-party enrichment   150M matched records       weekly 10M refresh        600B
Total                    ~700M source records       ~3.85M/day

Identity Graph Size

700M source records = 700M nodes
Expected match rate: 40-60% of customers appear in 2+ sources, giving ~80-120M identity clusters
Average cluster size: 2.5 source records per customer
Match edges: if 40% of 700M records form clusters of 2.5, that's ~280M nodes in clusters ~560M e
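The totals in the estimates above can be checked with a few lines of arithmetic. This is a minimal sketch: the dictionary keys are illustrative names, 1KB is assumed to mean 1,000 bytes, and the weekly 10M enrichment refresh is left out of the daily delta (which is what the ~3.85M/day total implies). The raw-size figures it prints are derived from the per-source numbers, not stated in the guide.

```python
# (records, daily_delta_records, avg_record_bytes) per source, from the estimates above.
# Assumptions: 1KB == 1,000 bytes; the weekly enrichment refresh is not
# counted as daily delta, matching the ~3.85M/day total.
SOURCES = {
    "ecommerce_db": (200_000_000, 500_000 + 1_000_000, 1_000),
    "mobile_app": (180_000_000, 2_000_000, 200),
    "loyalty_program": (80_000_000, 100_000, 500),
    "retail_pos": (60_000_000, 200_000, 300),
    "call_center_crm": (30_000_000, 50_000, 800),
    "third_party_enrichment": (150_000_000, 0, 600),
}
CUSTOMERS = 200_000_000

total_records = sum(records for records, _, _ in SOURCES.values())
daily_delta = sum(delta for _, delta, _ in SOURCES.values())

# Derived sizes: record count multiplied by average record size.
raw_bytes = sum(records * size for records, _, size in SOURCES.values())
daily_bytes = sum(delta * size for _, delta, size in SOURCES.values())

# 40-60% of customers appearing in 2+ sources gives the cluster range.
cluster_range = (CUSTOMERS * 40 // 100, CUSTOMERS * 60 // 100)

print(f"source records: {total_records / 1e6:.0f}M")
print(f"daily delta:    {daily_delta / 1e6:.2f}M records, {daily_bytes / 1e9:.2f} GB")
print(f"raw snapshot:   {raw_bytes / 1e9:.0f} GB")
print(f"clusters:       {cluster_range[0] / 1e6:.0f}M to {cluster_range[1] / 1e6:.0f}M")
```

Even before replication and history are added, the raw snapshot is small enough that the identity graph, not storage, is likely the dominant cost in this design.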

Architecture Walkthrough

The 6-Stage Identity Resolution Pipeline

Stage 1: Ingestion & Normalization

Each source lands in the raw zone. A normalization job standardizes PII fields before tokenization:

Email: lowercase + strip whitespace + remove dots in Gmail usernames
Phone: E.164 format (+1XXXXXXXXXX), which removes country code variations
Name: title case + remove titles (Mr., Dr.) + Unicode normalization

Why normalize before tokenizing? "john.doe@gmail.com" and "johndoe@gmail.com" are the same Gmail address. If you tokenize before normalization, they produce different tokens and will never be matched.

Stage 2: PII Tokenization

After normalization, PII values are replaced with deterministic tokens:

The token vault sto
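The normalize-then-tokenize order described above can be sketched in plain Python. Everything here is illustrative: the function names, the `TITLES` set, and the 10-digit US default in `normalize_phone` are assumptions, and keyed HMAC-SHA256 is only one possible way to get deterministic tokens (the guide's token vault is not modelled). The hard-coded key is a placeholder; a real key would come from a secrets manager.

```python
import hashlib
import hmac
import re
import unicodedata

TITLES = {"mr", "mrs", "ms", "dr"}           # hypothetical title list
TOKEN_KEY = b"demo-key-not-for-production"   # placeholder secret


def normalize_email(raw: str) -> str:
    """Lowercase, strip whitespace, and drop dots in Gmail usernames."""
    local, _, domain = raw.strip().lower().partition("@")
    if domain == "gmail.com":
        local = local.replace(".", "")
    return f"{local}@{domain}"


def normalize_phone(raw: str, default_country_code: str = "1") -> str:
    """Return an E.164-style string. Assumes bare 10-digit numbers are US numbers."""
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        return "+" + digits
    if len(digits) == 10:
        return "+" + default_country_code + digits
    if len(digits) == 11 and digits.startswith(default_country_code):
        return "+" + digits
    raise ValueError(f"cannot normalize phone number: {raw!r}")


def normalize_name(raw: str) -> str:
    """Unicode-normalize, remove titles such as 'Mr.' or 'Dr.', and title-case."""
    parts = unicodedata.normalize("NFKC", raw).split()
    kept = [p for p in parts if p.rstrip(".").lower() not in TITLES]
    return " ".join(p.title() for p in kept)


def tokenize(normalized_value: str) -> str:
    """Deterministic token: the same normalized input always yields the same token."""
    return hmac.new(TOKEN_KEY, normalized_value.encode("utf-8"), hashlib.sha256).hexdigest()


# Both spellings of the Gmail address collapse to one token after normalization.
token_a = tokenize(normalize_email(" John.Doe@gmail.com "))
token_b = tokenize(normalize_email("johndoe@gmail.com"))
print(token_a == token_b)
```

Tokenizing the raw strings instead would produce two unrelated tokens, which is exactly the failure mode the normalization step is there to prevent.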

Component Deep Dive

Entity Resolution: Blocking + Scoring in PySpark

Golden Record Survivorship Rules

Audit trail: Every attribute in the golden record includes source and updated_at metadata. When an analyst asks "why does this customer have phone number X?", the golden record itself contains the answer.
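The audit-trail idea can be illustrated with a small plain-Python sketch in which every surviving attribute carries its own source and updated_at. The survivorship rule used here (most recent value wins, ties broken by a source priority list) is an assumption made for the example, not the guide's rule set, and `SOURCE_PRIORITY`, `AttributeValue`, and `build_golden_record` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical trust order used only to break ties (lower = more trusted).
SOURCE_PRIORITY = {"loyalty": 0, "ecommerce": 1, "crm": 2, "enrichment": 3}


@dataclass(frozen=True)
class AttributeValue:
    value: str
    source: str           # which source system supplied the value
    updated_at: datetime  # when that source last updated it


def _rank(attr: AttributeValue):
    # Assumed rule: newer beats older; on equal timestamps, the more trusted source wins.
    return (attr.updated_at, -SOURCE_PRIORITY.get(attr.source, len(SOURCE_PRIORITY)))


def build_golden_record(source_records):
    """Pick one surviving value per attribute, keeping its provenance alongside it."""
    golden = {}
    for record in source_records:
        for name, value in record["attributes"].items():
            if value is None:
                continue  # missing values never overwrite known ones
            candidate = AttributeValue(value, record["source"], record["updated_at"])
            current = golden.get(name)
            if current is None or _rank(candidate) > _rank(current):
                golden[name] = candidate
    return golden


cluster = [
    {"source": "ecommerce", "updated_at": datetime(2024, 1, 1),
     "attributes": {"phone": "+14155550100", "city": "Oakland"}},
    {"source": "crm", "updated_at": datetime(2024, 3, 1),
     "attributes": {"phone": "+14155550199", "city": None}},
]
golden = build_golden_record(cluster)
print(golden["phone"])  # the value plus the source and timestamp that explain it
```

Because provenance travels with each attribute, the "why does this customer have phone number X?" question is answered by reading the record rather than replaying the pipeline.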

Identity Resolution: Deterministic vs Probabilistic

Two Matching Strategies

Deterministic Matching

Exact match on a shared, reliable identifier. For system-issued keys there are effectively no false positives: if two records share the same loyalty_id, they are the same person. Keys that can be shared, such as a household phone or a device, carry a lower confidence.

Match Rule            Condition                                                Confidence
Email exact match     email_token_A == email_token_B                           1.0 (near certain)
Phone exact match     phone_token_A == phone_token_B                           0.95 (phone can be shared in household)
Loyalty ID match      loyalty_id_A == loyalty_id_B                             1.0 (system issued, unique)
Device ID + account   device_id_A == device_id_B AND account_A != account_B    0.90 (shared device)

Deterministic rules run first. Any pair matched deterministically skips probabilistic scoring.

Probabilistic Matching

For records with no shared key (e.g., POS
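The deterministic rules and their confidences can be expressed as a small function, with matched pairs then grouped into identity clusters. This is a sketch under assumptions: the record field names, the `min_confidence` threshold, and the use of union-find for clustering are illustrative choices, and the all-pairs loop only suits toy inputs (at real scale, candidate pairs would come from a blocking step).

```python
from itertools import combinations


def _same(a, b, key):
    """True when both records carry the key and the values are equal."""
    return a.get(key) is not None and a.get(key) == b.get(key)


def deterministic_match(a, b):
    """Return (rule_name, confidence) for the strongest matching rule, or None."""
    hits = []
    if _same(a, b, "email_token"):
        hits.append(("email_exact", 1.0))
    if _same(a, b, "loyalty_id"):
        hits.append(("loyalty_id", 1.0))
    if _same(a, b, "phone_token"):
        hits.append(("phone_exact", 0.95))
    if _same(a, b, "device_id") and a.get("account_id") != b.get("account_id"):
        hits.append(("device_shared", 0.90))
    return max(hits, key=lambda hit: hit[1], default=None)


def cluster(records, min_confidence=0.95):
    """Group record IDs into clusters with union-find over accepted matches."""
    parent = {r["record_id"]: r["record_id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(records, 2):  # toy-scale only; use blocking in practice
        match = deterministic_match(a, b)
        if match and match[1] >= min_confidence:
            parent[find(a["record_id"])] = find(b["record_id"])

    groups = {}
    for record_id in parent:
        groups.setdefault(find(record_id), set()).add(record_id)
    return list(groups.values())


records = [
    {"record_id": "r1", "email_token": "e1"},
    {"record_id": "r2", "email_token": "e1", "phone_token": "p1"},
    {"record_id": "r3", "phone_token": "p1"},
    {"record_id": "r4", "loyalty_id": "L9"},
]
clusters = cluster(records)
print(sorted(sorted(c) for c in clusters))
```

Note that r1 and r3 share no key directly; they end up in the same cluster only through r2. Transitive links like this are where a single wrong edge can merge two people, so it is worth keeping the rule name and confidence for every accepted pair.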
