Posts

PySpark in 7 Days (Day 7)

Learn PySpark in 7 Days – Day 7

Welcome to Day 7, the most important day of this series. Today we bring everything together and build a real-world, end-to-end PySpark ETL pipeline using the best practices expected of a professional data engineer.

Day 7 Focus:
• End-to-end PySpark ETL pipeline
• Bronze → Silver → Gold architecture
• Performance-aware transformations
• Interview-ready explanation

Real-World Scenario
We receive daily employee data as CSV files. Our task is to:
1. Ingest raw data (Bronze)
2. Clean and transform the data (Silver)
3. Create analytics-ready output (Gold)

Step 1: Create a Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("EndToEnd-PySpark-Pipeline") \
    .getOrCreate()

Step 2: Bronze Layer – Read Raw Data

bronze_df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("/data/bronze/employees.cs...
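The preview cuts off during the Bronze read, so the Silver and Gold steps are not shown above. A minimal sketch of how those two layers of such a pipeline might look follows; the column names (name, hire_date, salary, department), the /data/silver and /data/gold output paths, and the specific transformations are illustrative assumptions rather than the post's actual code.

# Silver layer sketch: clean and standardise the Bronze data
# (column names and paths are assumptions for illustration)
from pyspark.sql.functions import col, trim, to_date, count, avg

silver_df = (
    bronze_df
    .dropDuplicates(["emp_id"])                           # remove duplicate employee rows
    .filter(col("emp_id").isNotNull())                    # drop records without a key
    .withColumn("name", trim(col("name")))                # tidy string columns
    .withColumn("hire_date", to_date(col("hire_date")))   # enforce a proper date type
)
silver_df.write.mode("overwrite").parquet("/data/silver/employees")

# Gold layer sketch: analytics-ready aggregate per department
gold_df = (
    silver_df
    .groupBy("department")
    .agg(count("*").alias("emp_count"),
         avg("salary").alias("avg_salary"))
)
gold_df.write.mode("overwrite").parquet("/data/gold/employee_summary")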

PySpark in 7 Days (Day 6)

Learn PySpark in 7 Days – Day 6

Welcome to Day 6. Today we move from writing correct PySpark code to writing efficient and scalable PySpark code. This is where many data engineers fail interviews and production pipelines.

Day 6 Focus:
• repartition vs coalesce
• Understanding shuffle & data skew
• cache vs persist
• Efficient file formats (Parquet)

Why Performance Matters in Spark
Spark can process terabytes of data, but poor partitioning or unnecessary shuffles can make jobs slow and expensive. Performance tuning is a core data engineering skill.

Partitions in Spark (Concept)
• Data in Spark is split into partitions
• Each partition is processed in parallel
• Too few partitions → underutilised cluster
• Too many partitions → overhead

repartition()

df_repart = df.repartition(10)

• Increases or decreases partitions
• Causes a full shuffle
• Use when you need even distribution

coalesce()

df_coalesce = df.coalesce(5)

• Reduces number...
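The preview ends part-way through the coalesce() notes. As a small, self-contained sketch of the remaining Day 6 topics (cache vs persist and writing Parquet); the throwaway DataFrame, app name and /tmp output path here are assumptions chosen purely for illustration:

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("Day6-Performance-Sketch").getOrCreate()
df = spark.range(1_000_000)            # stand-in DataFrame for illustration

# cache(): shorthand for persist() at the default storage level (memory first, spilling to disk)
df.cache()
df.count()                             # the first action materialises the cache
df.unpersist()                         # release the cached blocks

# persist(): lets you choose the storage level explicitly
df.persist(StorageLevel.DISK_ONLY)
df.count()
df.unpersist()

# Parquet: columnar, compressed and splittable, generally preferred over CSV for analytics
df.coalesce(1).write.mode("overwrite").parquet("/tmp/day6_demo_parquet")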

PySpark in 7 Days (Day 5)

Learn PySpark in 7 Days – Day 5

Welcome to Day 5. Today we cover aggregations and window functions, two topics that appear frequently in interviews and are used extensively in real-world analytics and reporting pipelines.

Day 5 Focus:
• groupBy and aggregate functions
• Window function fundamentals
• row_number, rank, dense_rank
• Real interview-style use cases

Sample Data

data = [
    (1, "IT", 60000),
    (2, "IT", 75000),
    (3, "HR", 45000),
    (4, "HR", 50000),
    (5, "Finance", 80000)
]
columns = ["emp_id", "department", "salary"]
df = spark.createDataFrame(data, columns)
df.show()

Aggregations Using groupBy

Basic Aggregations

from pyspark.sql.functions import avg, max, min, count

df.groupBy("department").agg(
    count("*").alias("emp_count"),
    avg("salary").alias("avg_salary"),
    max("salary").al...
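The preview is cut off inside the groupBy aggregation. Since window functions are listed in the Day 5 focus but not reached in the preview, here is a minimal sketch of row_number, rank and dense_rank over the sample df above; the exact window spec is an assumption, not necessarily the post's own example.

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, dense_rank, col

# Rank employees by salary within each department (highest salary first)
w = Window.partitionBy("department").orderBy(col("salary").desc())

df.withColumn("row_number", row_number().over(w)) \
  .withColumn("rank", rank().over(w)) \
  .withColumn("dense_rank", dense_rank().over(w)) \
  .show()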

PySpark in 7 Days (Day 4)

Learn PySpark in 7 Days – Day 4

Welcome to Day 4. Today we cover one of the most tested topics in interviews and one of the most performance-critical operations in Spark: Joins.

Day 4 Focus:
• Join types in PySpark
• Join conditions and syntax
• Broadcast joins
• Join performance basics

Why Joins Matter in Spark
In real-world data engineering, data rarely comes from a single source. Joins combine datasets but are also the biggest cause of performance issues if implemented incorrectly.

Sample DataFrames

employees = [
    (1, "Mahesh", 10),
    (2, "Ravi", 20),
    (3, "Anita", 10),
    (4, "John", 30)
]
departments = [
    (10, "IT"),
    (20, "HR"),
    (30, "Finance")
]
emp_df = spark.createDataFrame(employees, ["emp_id", "name", "dept_id"])
dept_df = spark.createDataFrame(departments, ["dept_id", "dept_name"])

Inner Join

emp_df.j...
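The preview stops at the start of the inner join example. Using the emp_df and dept_df defined above, here is a short sketch of the join types and the broadcast join mentioned in the Day 4 focus; the left join and the broadcast hint are illustrative, not necessarily the post's exact code.

from pyspark.sql.functions import broadcast

# Inner join: only employees whose dept_id has a matching department
inner_df = emp_df.join(dept_df, on="dept_id", how="inner")

# Left join: keep every employee, with nulls where no department matches
left_df = emp_df.join(dept_df, on="dept_id", how="left")

# Broadcast join: ship the small dimension table to every executor
# so the larger table does not need to be shuffled
bcast_df = emp_df.join(broadcast(dept_df), on="dept_id", how="inner")
bcast_df.show()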

PySpark in 7 Days (Day 3)

Learn PySpark in 7 Days – Day 3

Welcome to Day 3. Today is a very important day for SQL professionals, because we focus on Spark SQL and how PySpark allows you to run SQL directly on DataFrames.

Day 3 Focus:
• Spark SQL vs DataFrame API
• Temporary views
• Writing SQL queries on DataFrames
• Aggregations and CASE WHEN logic

Why Spark SQL Is Important
Most data engineers come from a SQL background. Spark SQL allows you to reuse your SQL knowledge while working on distributed big data.
• Easier to read and maintain
• Business-friendly
• Optimised by the Spark Catalyst engine

Creating a Sample DataFrame

data = [
    (1, "IT", 60000),
    (2, "HR", 45000),
    (3, "IT", 75000),
    (4, "Finance", 80000)
]
columns = ["emp_id", "department", "salary"]
df = spark.createDataFrame(data, columns)
df.show()

Creating a Temporary View

df.createOrReplaceTempView("employees")

Temp...
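The preview is truncated just after the temporary view is created. A minimal sketch of running SQL, including the CASE WHEN logic listed in the focus, against that employees view follows; the salary threshold and band names are illustrative assumptions.

result = spark.sql("""
    SELECT department,
           COUNT(*)    AS emp_count,
           AVG(salary) AS avg_salary,
           CASE WHEN AVG(salary) >= 70000 THEN 'High'
                ELSE 'Standard'
           END AS salary_band
    FROM employees
    GROUP BY department
""")
result.show()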

PySpark in 7 Days (Day 2)

Learn PySpark in 7 Days – Day 2

Welcome to Day 2 of the PySpark in 7 Days series. Today is one of the most important days because DataFrames are the core abstraction used in real-world Spark pipelines.

Day 2 Focus:
• Understanding PySpark DataFrames
• Reading CSV and JSON files
• Schema inference vs manual schema
• select, filter, withColumn operations

What Is a PySpark DataFrame?
A PySpark DataFrame is a distributed table-like structure with rows and columns, similar to a SQL table or Pandas DataFrame, but designed to work on large-scale data.
• Immutable (cannot be changed in place)
• Schema-aware
• Optimised using the Spark Catalyst engine

Creating SparkSession (Recap)

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Day2-PySpark") \
    .getOrCreate()

Reading CSV Files

df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv(...
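The preview cuts off inside the CSV read. Since manual schemas and select/filter/withColumn are listed in the Day 2 focus, here is a minimal sketch of those operations; the column names, file path and bonus calculation are assumptions for illustration.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import col

# Manual schema instead of inferSchema: explicit types, no extra pass over the file
schema = StructType([
    StructField("emp_id", IntegerType(), True),
    StructField("department", StringType(), True),
    StructField("salary", IntegerType(), True),
])

df = spark.read.option("header", "true").schema(schema).csv("/data/employees.csv")

# Basic column operations
df.select("emp_id", "salary") \
  .filter(col("salary") > 50000) \
  .withColumn("bonus", col("salary") * 0.10) \
  .show()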

PySpark in 7 Days (Day 1)

Learn PySpark in 7 Days – Day 1

Welcome to Day 1 of the “Learn PySpark in 7 Days” series. This series is designed for data engineers, analysts, and SQL professionals who want to learn PySpark in a structured, practical, and interview-ready way.

Day 1 Focus:
• What is Apache Spark?
• Why PySpark is used in data engineering
• Spark architecture (Driver, Executor, Cluster)
• Installing and starting PySpark
• Your first PySpark program

What is Apache Spark?
Apache Spark is a distributed data processing engine designed to process large volumes of data quickly across multiple machines. Unlike traditional systems, Spark keeps data in memory, which makes it much faster for analytics and transformations.

Why Spark Instead of Traditional Tools?
• Handles massive datasets (GBs to PBs)
• Distributed and fault-tolerant
• Faster than MapReduce
• Supports batch and streaming workloads

What is PySpark?
PySpark is the Python API for Apache Spark. It allows Python de...
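The preview is truncated before the "first PySpark program" promised in the Day 1 focus. A minimal sketch of what such a first program typically looks like; the app name, sample names and ages are made up for illustration.

from pyspark.sql import SparkSession

# Start a local SparkSession and build a tiny DataFrame
spark = SparkSession.builder.appName("Day1-FirstProgram").getOrCreate()

data = [("Mahesh", 30), ("Ravi", 28)]
df = spark.createDataFrame(data, ["name", "age"])
df.show()

spark.stop()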