Learn PySpark in 7 Days – Day 4
Welcome to Day 4. Today we cover one of the most tested topics in interviews and one of the most performance-critical operations in Spark: Joins.
Day 4 Focus:
• Join types in PySpark
• Join conditions and syntax
• Broadcast joins
• Join performance basics
Why Joins Matter in Spark
In real-world data engineering, data rarely comes from a single source. Joins combine these datasets, but they are also one of the biggest causes of performance problems when implemented incorrectly.
Sample DataFrames
from pyspark.sql import SparkSession

# In notebooks or Databricks, `spark` already exists; otherwise create it
spark = SparkSession.builder.getOrCreate()

employees = [
(1, "Mahesh", 10),
(2, "Ravi", 20),
(3, "Anita", 10),
(4, "John", 30)
]
departments = [
(10, "IT"),
(20, "HR"),
(30, "Finance")
]
emp_df = spark.createDataFrame(employees, ["emp_id", "name", "dept_id"])
dept_df = spark.createDataFrame(departments, ["dept_id", "dept_name"])
Inner Join
emp_df.join(
dept_df,
emp_df.dept_id == dept_df.dept_id,
"inner"
).show()
Returns only matching records from both DataFrames.
Left Join
emp_df.join(
dept_df,
emp_df.dept_id == dept_df.dept_id,
"left"
).show()
Returns all records from the left DataFrame and matching records from the right.
Right Join
emp_df.join(
dept_df,
emp_df.dept_id == dept_df.dept_id,
"right"
).show()
Returns all records from the right DataFrame and matching records from the left.
Full Outer Join
emp_df.join(
dept_df,
emp_df.dept_id == dept_df.dept_id,
"outer"
).show()
Returns all records from both DataFrames, with nulls where there is no match.
Handling Duplicate Columns
from pyspark.sql.functions import col

emp_df.alias("e").join(
dept_df.alias("d"),
col("e.dept_id") == col("d.dept_id"),
"inner"
).select(
"e.emp_id",
"e.name",
"d.dept_name"
).show()
Broadcast Join (Very Important)
Use a broadcast join when one table is small enough to fit in each executor's memory and the other is large.
from pyspark.sql.functions import broadcast
emp_df.join(
broadcast(dept_df),
emp_df.dept_id == dept_df.dept_id,
"inner"
).show()
Broadcast joins avoid expensive shuffles by sending a copy of the small table to every executor, so the large table never has to move across the network.
Join Performance Best Practices
- Filter data before joining
- Use broadcast joins for small lookup tables
- Avoid joining on high-cardinality columns unnecessarily
- Select only required columns after join
Interview Tip:
If joins are slow, the first thing to check is data size and shuffle.
Key Concepts You Learned Today
- All major join types in PySpark
- Correct join syntax and conditions
- How broadcast joins work
- Basic join performance optimisation
What’s Coming on Day 5?
Day 5 – Aggregations & Window Functions
- groupBy and aggregations
- Window functions
- Ranking and running totals
- Real interview-style problems
