
Microsoft Fabric End-to-End Project

Customer Master Data Platform with Data Quality & Deduplication (Microsoft Fabric) — End-to-End

This post documents a complete, production-style build of a Customer Master Data Platform in Microsoft Fabric: Fabric Pipelines (Copy Data) for ingestion from the RandomUser.me API, Fabric Notebooks (PySpark) for transformations, and Gold dimensional modelling with SCD Type 2 in a Fabric Warehouse via a T-SQL notebook.

Source: RandomUser.me API
Load: Daily + Incremental
Architecture: Bronze / Silver / Gold
DQ: Completeness + Uniqueness + Duplicates
Model: dim_person, dim_address, fact_person_snapshot

What you will build

A reusable platform that ingests raw user profiles, standardises and validates them, detects duplicates, applies survivorship rules, and publishes a trusted dimensional model for analytics and Power BI.

Layer | Storage | Content
Bronze | Lakehouse (Files + Delta) | Raw JSON from API (no schema enforcement)
Silver | Lakehouse (Delta tables) | Flattened, cleaned, standardised, DQ checks, dedup candidates
Gold | Warehouse | Dimensional model with SCD Type 2 + analytics facts

Table of contents

  1. Repository & project structure
  2. Fabric workspace, Lakehouse, Warehouse setup
  3. API ingestion with Fabric Pipelines (Copy Data)
  4. Bronze: raw storage layout & naming
  5. Silver: PySpark flattening + DQ rules
  6. Duplicate detection & survivorship
  7. Gold: dimensional modelling + SCD Type 2 (T-SQL)
  8. Incremental loads & watermarking
  9. Masking sensitive fields (safe-by-default)
  10. Power BI semantic model & visuals

API Base URL: https://randomuser.me/api/
Relative URL: ?results=100&page=1&seed=fabric_masterdata&inc=gender,name,location,email,dob,registered,phone,cell,id,picture,nat&format=json
Pagination: page=1..N (loop), stable ordering via seed
Incremental: Watermark using load_date + hash keys

1) Version Control (GitHub) — reproducible, reviewable engineering

I keep all artefacts in a single Git repository to make the build reproducible and easy to review: https://github.com/the-mama/randomuser_project_ms_fabric. This repo includes pipeline configuration notes, notebook sources, SQL scripts, and Power BI model conventions.

Git initialisation (local machine)
mkdir randomuser_project_ms_fabric

cd randomuser_project_ms_fabric
git init
git checkout -b main

# recommended structure

mkdir -p docs notebooks/pyspark notebooks/tsql pipelines sql models powerbi samples

git add .
git commit -m "Initial project structure"

Folder | Purpose
pipelines/ | Pipeline definitions, parameters, environment notes
notebooks/pyspark/ | Bronze→Silver transformations, DQ, dedup logic
notebooks/tsql/ | SCD Type 2 merges into Warehouse
sql/ | DDL, views, stored procedures (if used)
powerbi/ | Semantic model notes + measures
samples/ | Sample API payloads for documentation
Do / Do not
  • Do version control notebooks and SQL scripts (auditable changes).
  • Do keep environment-specific config in parameters (not hard-coded).
  • Do not commit secrets, keys, or tokens. Use Fabric connections and workspace security.

2) Microsoft Fabric Setup — Workspace, Lakehouse, Warehouse

In Fabric, I create one workspace and two core items: a Lakehouse for Bronze/Silver and a Warehouse for Gold dimensional modelling.

Item | Name (example) | Notes
Workspace | ws_randomuser_masterdata | Dev first; later add Test/Prod if needed
Lakehouse | lh_randomuser | Bronze raw files + Silver delta tables
Warehouse | wh_masterdata | Gold dims/facts + semantic layer
Best practice
  • Use consistent naming: lh_ for Lakehouse, wh_ for Warehouse, nb_ for notebooks, pl_ for pipelines.
  • Keep Bronze and Silver in Lakehouse; publish curated Gold to Warehouse for BI performance and governance.

3) Ingestion with Fabric Pipelines — Copy Data from RandomUser.me API

The ingestion is implemented with a Microsoft Fabric Data Pipeline and a Copy Data activity. The API does not require authentication, so the connection is straightforward.

Setting | Value
Connection (Base URL) | https://randomuser.me/api/
Relative URL (example) | ?results=100&page=@{pipeline().parameters.page}&seed=fabric_masterdata&inc=gender,name,location,email,dob,registered,phone,cell,id,nat&format=json
Pagination approach | Loop pages 1..N using a pipeline ForEach
Sink | Lakehouse Files (Bronze) as JSON
File path | Files/bronze/randomuser/load_date=YYYY-MM-DD/page=0001/
File name | randomuser_YYYYMMDD_page0001.json
Pipeline parameters (recommended)
{
  "load_date": "2026-01-24",
  "results_per_page": 100,
  "max_pages": 20,
  "seed": "fabric_masterdata"
}
Pagination pattern in Fabric Pipeline (concept)
1) Set variable: pages = range(1, max_pages+1)

2) ForEach page in pages:

* Copy Data activity
  Source:
  Base URL: https://randomuser.me/api/
  Relative URL:
  ?results=@{pipeline().parameters.results_per_page}
  &page=@{item()}
  &seed=@{pipeline().parameters.seed}
  &inc=gender,name,location,email,dob,registered,phone,cell,id,nat
  &format=json
  Sink:
  Lakehouse Files:
  Files/bronze/randomuser/load_date=@{pipeline().parameters.load_date}/page=@{formatNumber(item(),'0000')}/
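If you want to sanity-check the same pagination and Bronze file layout outside Fabric, here is a minimal local Python sketch of what the Copy Data loop produces. It assumes the requests library, writes to a local Files/... folder mirroring the Lakehouse layout, and is not the pipeline definition itself.
Python (local sketch) — replicate the paged API calls and Bronze layout
# Sketch only: fetch each page with the stable seed and write one raw JSON
# file per page, using the same partition and file-name pattern as Bronze.
import json
import pathlib
import requests

BASE_URL = "https://randomuser.me/api/"
load_date = "2026-01-24"          # mirrors pipeline().parameters.load_date
results_per_page = 100
max_pages = 20
seed = "fabric_masterdata"

for page in range(1, max_pages + 1):
    params = {
        "results": results_per_page,
        "page": page,
        "seed": seed,
        "inc": "gender,name,location,email,dob,registered,phone,cell,id,nat",
        "format": "json",
    }
    resp = requests.get(BASE_URL, params=params, timeout=30)
    resp.raise_for_status()

    out_dir = pathlib.Path(f"Files/bronze/randomuser/load_date={load_date}/page={page:04d}")
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"randomuser_{load_date.replace('-', '')}_page{page:04d}.json"
    out_file.write_text(json.dumps(resp.json()), encoding="utf-8")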
Do / Do not
  • Do use a stable seed for repeatability in dev/test.
  • Do store each page as a separate file for audit and replay.
  • Do not flatten JSON in the pipeline; keep ingestion minimal (raw-first principle).

4) Bronze Layer — Raw JSON (no schema enforcement)

Bronze is intentionally “as-is”. I store the exact payload from RandomUser.me along with operational metadata (load date/time, pipeline run id, page number).

Sample payload snippet (RandomUser.me)
{
  "results": [
    {
      "gender": "male",
      "name": { "title": "Mr", "first": "Aiden", "last": "Smith" },
      "location": {
        "street": { "number": 221, "name": "Baker Street" },
        "city": "London",
        "state": "England",
        "country": "United Kingdom",
        "postcode": "NW1 6XE"
      },
      "email": "aiden.smith@example.com",
      "dob": { "date": "1988-02-12T10:15:30.000Z", "age": 37 },
      "registered": { "date": "2014-06-01T09:10:20.000Z", "age": 11 },
      "phone": "020 7946 0011",
      "cell": "07911 123456",
      "id": { "name": "NINO", "value": "QQ123456C" },
      "nat": "GB"
    }
  ],
  "info": { "seed": "fabric_masterdata", "results": 1, "page": 1, "version": "1.4" }
}
Bronze checklist
  • Partition by load_date and page.
  • Add ingestion metadata columns later in Silver (do not mutate Bronze); a sketch follows below.
  • Keep immutable files for traceability and reprocessing.
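For the metadata point above, one lightweight option when Silver reads Bronze is to capture the originating file path per row. This is a sketch: the source_file column name is my own choice, and the rest of the Silver notebook would carry it through the select.
PySpark — capture source file at read time (sketch)
from pyspark.sql.functions import input_file_name

# Sketch: record which Bronze file each row came from, without touching Bronze itself.
raw = (spark.read.option("multiLine", True)
       .json("Files/bronze/randomuser/load_date=2026-01-24/*/*.json")
       .withColumn("source_file", input_file_name()))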

5) Silver Layer — PySpark flattening, standardisation, data quality

Silver is where I convert semi-structured JSON into structured Delta tables, standardise formats, and compute DQ metrics. This is implemented in a Fabric Notebook (PySpark).

PySpark — Read Bronze JSON and explode results
from pyspark.sql.functions import col, explode, lit, to_timestamp, sha2, concat_ws, upper, trim, regexp_replace

load_date = "2026-01-24"  # in practice: pass as notebook parameter

bronze_path = f"Files/bronze/randomuser/load_date={load_date}/*/*.json"

raw = spark.read.option("multiLine", True).json(bronze_path)

df = raw.select(explode(col("results")).alias("r"), col("info").alias("info"))

silver = df.select(
    lit(load_date).alias("load_date"),
    col("info.seed").alias("api_seed"),
    col("info.page").alias("api_page"),
    col("r.gender").alias("gender"),
    col("r.name.title").alias("name_title"),
    col("r.name.first").alias("first_name"),
    col("r.name.last").alias("last_name"),
    col("r.email").alias("email"),
    to_timestamp(col("r.dob.date")).alias("dob_utc"),
    col("r.dob.age").alias("age"),
    to_timestamp(col("r.registered.date")).alias("registered_utc"),
    col("r.phone").alias("phone"),
    col("r.cell").alias("cell"),
    col("r.nat").alias("nationality"),
    col("r.location.street.number").alias("street_no"),
    col("r.location.street.name").alias("street_name"),
    col("r.location.city").alias("city"),
    col("r.location.state").alias("state"),
    col("r.location.country").alias("country"),
    col("r.location.postcode").cast("string").alias("postcode")
)

# standardise strings
silver = (silver
    .withColumn("email_std", upper(trim(col("email"))))
    .withColumn("postcode_std", upper(regexp_replace(trim(col("postcode")), r"\s+", "")))
)

# create business hash keys (stable)
silver = silver.withColumn(
    "person_nk",
    sha2(concat_ws("||", col("email_std"), col("dob_utc").cast("string"), col("postcode_std")), 256)
)
PySpark — Data Quality metrics (completeness + uniqueness)
from pyspark.sql.functions import count, sum as _sum, when

dq = silver.agg(
    count("*").alias("row_count"),
    (_sum(when(col("email").isNotNull(), 1).otherwise(0)) / count("*") * 100).alias("email_completeness_pct"),
    (_sum(when(col("dob_utc").isNotNull(), 1).otherwise(0)) / count("*") * 100).alias("dob_completeness_pct"),
    (_sum(when(col("postcode").isNotNull(), 1).otherwise(0)) / count("*") * 100).alias("postcode_completeness_pct")
)

# uniqueness of natural key candidate
unique_nk = (silver.groupBy("person_nk").count()
    .agg((_sum(when(col("count") == 1, 1).otherwise(0)) / count("*") * 100)
         .alias("person_nk_uniqueness_pct")))

display(dq)
display(unique_nk)
Write Silver tables (Delta)
silver.write.mode("append").format("delta").saveAsTable("lh_randomuser.silver_person_staging")
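Since the Power BI section later charts DQ trends over time, it is worth persisting these metrics per load as well. A minimal sketch follows; the silver_dq_metrics table name is my own choice, not part of the original pipeline.
PySpark — persist DQ metrics per load (sketch)
from pyspark.sql.functions import lit

# Sketch: dq and unique_nk are both single-row frames, so a cross join yields
# exactly one metrics row per load_date (table name is illustrative).
dq_out = dq.crossJoin(unique_nk).withColumn("load_date", lit(load_date))

dq_out.write.mode("append").format("delta").saveAsTable("lh_randomuser.silver_dq_metrics")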
Do / Do not
  • Do standardise key fields before hashing (email/postcode).
  • Do store both raw and standardised versions (audit + matching).
  • Do not rely on API “id.value” as a primary key (it can be null or inconsistent).

6) Duplicate detection & survivorship rules

The core matching rule is: Email + DOB + Postcode similarity. For this synthetic dataset, I use a deterministic approach: standardise these fields, then generate a person_nk hash key. Any duplicates share the same person_nk.

PySpark — pick the survivor (latest update wins)
from pyspark.sql.window import Window

from pyspark.sql.functions import row_number, desc

w = Window.partitionBy("person_nk").orderBy(desc("registered_utc"), desc("load_date"))

survivors = (silver
    .withColumn("rn", row_number().over(w))
    .filter(col("rn") == 1)
    .drop("rn")
)

survivors.write.mode("append").format("delta").saveAsTable("lh_randomuser.silver_person_survivor")
Survivorship notes
  • Rule used here: latest registered timestamp wins. In real MDM, you may use source priority + confidence scoring.
  • Keep a separate duplicates table for traceability: “survivor_id ↔ duplicate_ids”.
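As a minimal sketch of that duplicates table: the silver_person_duplicates name and its columns are my own, chosen so survivor ↔ duplicate pairs stay queryable per load.
PySpark — duplicates table for traceability (sketch)
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, desc, col

# Sketch: rows ranked 2..N within a person_nk group are the non-survivors;
# keep them with their natural key so each merge decision stays auditable.
w_dup = Window.partitionBy("person_nk").orderBy(desc("registered_utc"), desc("load_date"))

duplicates = (silver
    .withColumn("rn", row_number().over(w_dup))
    .filter(col("rn") > 1)
    .select(col("person_nk").alias("survivor_nk"),
            "email", "registered_utc", "load_date"))

duplicates.write.mode("append").format("delta").saveAsTable("lh_randomuser.silver_person_duplicates")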

7) Gold Layer — Dimensional model + SCD Type 2 (T-SQL Notebook)

Gold lives in the Fabric Warehouse for strong BI performance and governance. I publish: dim_person, dim_address (both SCD2), and fact_person_snapshot.

Table | Type | Grain
dim_person | SCD Type 2 | One row per person version
dim_address | SCD Type 2 | One row per address version
fact_person_snapshot | Fact | One row per load_date per person
T-SQL (Warehouse) — Dim table skeleton (SCD2 columns)
-- dim_person (SCD2)
-- Note: T-SQL has no CREATE TABLE IF NOT EXISTS, so guard with OBJECT_ID.
-- Depending on the Fabric Warehouse T-SQL surface, IDENTITY, enforced PRIMARY KEY
-- constraints and CREATE INDEX may not be supported; adjust to your environment.
IF OBJECT_ID('dbo.dim_person', 'U') IS NULL
BEGIN
    CREATE TABLE dbo.dim_person
    (
        person_sk           BIGINT IDENTITY(1,1) NOT NULL,
        person_nk           VARCHAR(64) NOT NULL,   -- sha256 hex
        email_std           VARCHAR(320) NULL,
        first_name          VARCHAR(100) NULL,
        last_name           VARCHAR(100) NULL,
        gender              VARCHAR(20) NULL,
        dob_date            DATE NULL,
        nationality         VARCHAR(10) NULL,

        -- SCD2
        effective_from      DATETIME2 NOT NULL,
        effective_to        DATETIME2 NOT NULL,
        is_current          BIT NOT NULL,
        record_hash         VARCHAR(64) NOT NULL,   -- hash of tracked attributes

        CONSTRAINT PK_dim_person PRIMARY KEY (person_sk)
    );

    CREATE INDEX IX_dim_person_nk_current ON dbo.dim_person(person_nk, is_current);
END;
T-SQL (Warehouse) — SCD2 MERGE pattern (current row expiry + insert new)
-- Source: survivors from Lakehouse (exposed to Warehouse via shortcuts or staging table)

-- Assume staging table exists in Warehouse: dbo.stg_person_survivor

DECLARE @asOf DATETIME2 = SYSUTCDATETIME();

-- 1) Expire changed current rows
UPDATE tgt
SET tgt.effective_to = DATEADD(SECOND, -1, @asOf),
tgt.is_current = 0
FROM dbo.dim_person tgt
JOIN dbo.stg_person_survivor src
ON tgt.person_nk = src.person_nk
WHERE tgt.is_current = 1
AND tgt.record_hash <> src.record_hash;

-- 2) Insert new rows for:
--    a) new natural keys
--    b) changed rows (after expiring old version)
INSERT INTO dbo.dim_person
(
person_nk, email_std, first_name, last_name, gender, dob_date, nationality,
effective_from, effective_to, is_current, record_hash
)
SELECT
src.person_nk,
src.email_std,
src.first_name,
src.last_name,
src.gender,
CAST(src.dob_utc AS DATE) AS dob_date,
src.nationality,
@asOf AS effective_from,
'9999-12-31' AS effective_to,
1 AS is_current,
src.record_hash
FROM dbo.stg_person_survivor src
LEFT JOIN dbo.dim_person tgt
ON tgt.person_nk = src.person_nk
AND tgt.is_current = 1
WHERE tgt.person_nk IS NULL
OR tgt.record_hash <> src.record_hash;
Do / Do not
  • Do use record_hash to detect attribute changes efficiently.
  • Do keep SCD2 dates in UTC and use a consistent “infinite” end date (9999-12-31).
  • Do not MERGE with complex “WHEN MATCHED THEN UPDATE + INSERT” if your platform has edge-case behaviour; explicit UPDATE then INSERT is safer.

8) Incremental Loads — watermarking and idempotency

RandomUser.me does not provide true CDC, so the incremental design here is daily ingestion + dedup + SCD2. Incremental behaviour is achieved by (1) tracking load_date as a partition, (2) using deterministic natural keys, and (3) only applying changes when record_hash differs.

PySpark — record_hash for change detection
from pyspark.sql.functions import sha2, concat_ws, col

tracked_cols = ["email_std","first_name","last_name","gender","dob_utc","nationality","postcode_std","city","state","country","street_no","street_name"]
silver = silver.withColumn("record_hash", sha2(concat_ws("||", *[col(c).cast("string") for c in tracked_cols]), 256))
Incremental best practice
  • Make your pipeline idempotent: re-running a day should not duplicate Gold rows (a Silver-side sketch follows this list).
  • Use dedup + record_hash so only real changes create new SCD versions.
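One simple way to get that idempotency on the Silver side is delete-then-append for the day being processed, instead of a blind append. This is a sketch reusing the staging table and the silver/load_date variables from section 5.
PySpark — idempotent daily re-run for Silver staging (sketch)
# Sketch: remove any rows already written for this load_date, then append,
# so reprocessing the same day never double-inserts into the staging table.
spark.sql(f"DELETE FROM lh_randomuser.silver_person_staging WHERE load_date = '{load_date}'")

(silver.write
    .mode("append")
    .format("delta")
    .saveAsTable("lh_randomuser.silver_person_staging"))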

9) Masking sensitive fields (safe-by-default)

Even though RandomUser data is synthetic, I still implement masking patterns as if this were a real customer platform: Gold exposes masked values through reporting views, and raw fields are restricted to privileged roles.

T-SQL — masked view example (reporting-safe)
CREATE OR ALTER VIEW dbo.vw_dim_person_masked AS

SELECT
person_sk,
person_nk,
CASE
WHEN email_std IS NULL THEN NULL
ELSE CONCAT(LEFT(email_std, 2), '***', RIGHT(email_std, 6))
END AS email_masked,
first_name,
last_name,
gender,
dob_date,
nationality,
is_current
FROM dbo.dim_person;
Do / Do not
  • Do apply least privilege: analysts consume masked views by default.
  • Do separate raw vs curated access paths.
  • Do not publish raw email/phone to Power BI unless explicitly required and approved.

10) Power BI — semantic model & visuals

With Gold tables in the Warehouse, I build a semantic model and a small set of operational + business dashboards. This is the part stakeholders typically see first, so I include both quality and business outcomes.

Visual | What it proves | Data
DQ Scorecard | Completeness/uniqueness trends over time | Silver DQ metrics table
Duplicates Trend | How many duplicates detected per load | Dedup groups
Current Customer Count | Mastered customers (current versions) | dim_person where is_current=1
Change Volume | SCD2 new versions per day | dim_person effective_from
Geo Distribution | Customer distribution by country/state/city | dim_address
Power BI notes
  • Prefer a star schema: dims relate to facts via surrogate keys.
  • Use incremental refresh only if needed; otherwise keep semantic model simple.
  • Always surface data quality metrics alongside business metrics.
