Learn PySpark in 7 Days – Day 7
Welcome to Day 7 — the most important day of this series. Today, we bring everything together and build a real-world, end-to-end PySpark ETL pipeline using best practices expected from a professional data engineer.
Day 7 Focus:
• End-to-end PySpark ETL pipeline
• Bronze → Silver → Gold architecture
• Performance-aware transformations
• Interview-ready explanation
Real-World Scenario
We receive daily employee data as CSV files. Our task is to:
- Ingest raw data (Bronze)
- Clean and transform data (Silver)
- Create analytics-ready output (Gold)
Step 1: Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("EndToEnd-PySpark-Pipeline") \
    .getOrCreate()
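If you are running this locally while developing, you may also want to reduce the default shuffle partition count. The settings below are illustrative assumptions, not requirements:

# Optional, illustrative local setup - tune the values for your environment
spark = SparkSession.builder \
    .appName("EndToEnd-PySpark-Pipeline") \
    .master("local[*]") \
    .config("spark.sql.shuffle.partitions", "8") \
    .getOrCreate()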
Step 2: Bronze Layer – Read Raw Data
bronze_df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("/data/bronze/employees.csv")

bronze_df.show()
The Bronze layer stores data as received, with minimal transformation.
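In production you would typically replace inferSchema with an explicit schema so that malformed files fail fast. A minimal sketch, assuming the raw file has columns such as emp_id, name, dept, and salary (adjust names and types to your actual data):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema for the raw employee CSV; column names are assumptions
employee_schema = StructType([
    StructField("emp_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("dept", IntegerType(), True),
    StructField("salary", StringType(), True)   # cast to int in the Silver layer
])

bronze_df = spark.read \
    .option("header", "true") \
    .schema(employee_schema) \
    .csv("/data/bronze/employees.csv")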
Step 3: Silver Layer – Clean & Transform
from pyspark.sql.functions import col
silver_df = bronze_df \
    .filter(col("salary").isNotNull()) \
    .withColumn("salary", col("salary").cast("int")) \
    .withColumnRenamed("dept", "dept_id")  # rename so the enrichment join below can match on dept_id

silver_df.show()
- Removed invalid records (null salaries)
- Standardised column names (dept → dept_id to match the lookup table)
- Applied correct data types (salary cast to int)
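You can extend the Silver step with additional data-quality rules. A small sketch, assuming the raw data has an emp_id column to deduplicate on:

# Assumes an emp_id column exists; adjust to your real primary key
silver_df = silver_df \
    .dropDuplicates(["emp_id"]) \
    .filter(col("salary") > 0)   # extra sanity check: drop non-positive salaries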
Step 4: Enrichment (Join with Lookup)
departments = [
    (10, "IT"),
    (20, "HR"),
    (30, "Finance")
]

dept_df = spark.createDataFrame(
    departments, ["dept_id", "dept_name"]
)
silver_enriched_df = silver_df.join(
    dept_df,
    silver_df.dept_id == dept_df.dept_id,
    "left"
)
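Because the department lookup is tiny, you can hint Spark to broadcast it so the join avoids shuffling the larger employee DataFrame. Spark often does this automatically below the broadcast threshold, but the hint makes the intent explicit:

from pyspark.sql.functions import broadcast

# Broadcast the small lookup table so the join runs map-side, without shuffling silver_df
silver_enriched_df = silver_df.join(
    broadcast(dept_df),
    silver_df.dept_id == dept_df.dept_id,
    "left"
)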
Step 5: Gold Layer – Aggregations
from pyspark.sql.functions import avg, count
gold_df = silver_enriched_df.groupBy("dept_name").agg(
    count("*").alias("employee_count"),
    avg("salary").alias("avg_salary")
)

gold_df.show()
The Gold layer is business-ready and optimised for reporting.
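Optionally, you can tidy the output for reporting. The rounding and ordering below are presentation choices, not requirements:

from pyspark.sql.functions import round as spark_round

# Round the average and sort by headcount for a cleaner report
gold_df = gold_df \
    .withColumn("avg_salary", spark_round("avg_salary", 2)) \
    .orderBy(col("employee_count").desc())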
Step 6: Write Output (Parquet)
gold_df.coalesce(1).write \
    .mode("overwrite") \
    .parquet("/data/gold/employee_metrics")
- Parquet for columnar storage, compression, and fast reads
- coalesce(1) to control the output file count (fine for small outputs; avoid forcing a single file on large data)
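If the Gold output grows, forcing a single file becomes a bottleneck. A common alternative, sketched below, is to partition the output by a column downstream queries filter on:

# Partition the output by department so downstream queries can prune files
gold_df.write \
    .mode("overwrite") \
    .partitionBy("dept_name") \
    .parquet("/data/gold/employee_metrics")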
How to Explain This in an Interview
Sample Answer:
“I designed an end-to-end PySpark pipeline using a Bronze–Silver–Gold approach. Raw CSV data was ingested into the Bronze layer with minimal changes. In the Silver layer, I applied schema enforcement, filtering, and enrichment joins. Finally, in the Gold layer, I aggregated the data into analytics-ready metrics and stored it in Parquet for efficient downstream consumption.”
Key Concepts You Mastered in 7 Days
- PySpark DataFrames and Spark SQL
- Joins, aggregations, and window functions
- Performance tuning and partitioning
- End-to-end ETL pipeline design
What’s Next?
- Practice with real datasets
- Learn Delta Lake & MERGE patterns
- Run pipelines on Databricks / Fabric
- Prepare PySpark interview questions
