Learn PySpark in 7 Days – Day 4
Welcome to Day 4. Today we cover one of the most tested topics in interviews and one of the most performance-critical operations in Spark: Joins.
Day 4 Focus:
• Join types in PySpark
• Join conditions and syntax
• Broadcast joins
• Join performance basics
Why Joins Matter in Spark
In real-world data engineering, data rarely comes from a single source. Joins combine these datasets, but they are also one of the biggest causes of performance problems when implemented incorrectly.
Sample DataFrames
from pyspark.sql import SparkSession

# In notebooks or Databricks, `spark` already exists; otherwise create it
spark = SparkSession.builder.getOrCreate()

employees = [
(1, "Mahesh", 10),
(2, "Ravi", 20),
(3, "Anita", 10),
(4, "John", 30)
]
departments = [
(10, "IT"),
(20, "HR"),
(30, "Finance")
]
emp_df = spark.createDataFrame(employees, ["emp_id", "name", "dept_id"])
dept_df = spark.createDataFrame(departments, ["dept_id", "dept_name"])
Inner Join
emp_df.join(
dept_df,
emp_df.dept_id == dept_df.dept_id,
"inner"
).show()
Returns only matching records from both DataFrames.
Left Join
emp_df.join(
dept_df,
emp_df.dept_id == dept_df.dept_id,
"left"
).show()
Returns all records from the left DataFrame and matching records from the right.
Right Join
emp_df.join(
dept_df,
emp_df.dept_id == dept_df.dept_id,
"right"
).show()
Returns all records from the right DataFrame and matching records from the left.
Full Outer Join
emp_df.join(
dept_df,
emp_df.dept_id == dept_df.dept_id,
"outer"
).show()
Returns all records from both DataFrames, with nulls where there is no match.
Handling Duplicate Columns
from pyspark.sql.functions import col

emp_df.alias("e").join(
dept_df.alias("d"),
col("e.dept_id") == col("d.dept_id"),
"inner"
).select(
"e.emp_id",
"e.name",
"d.dept_name"
).show()
Broadcast Join (Very Important)
Use a broadcast join when one table is small enough to fit in each executor's memory and the other is large.
from pyspark.sql.functions import broadcast
emp_df.join(
broadcast(dept_df),
emp_df.dept_id == dept_df.dept_id,
"inner"
).show()
Broadcast joins avoid expensive shuffles by sending a copy of the small table to every executor, so the large table never has to move across the network.
Join Performance Best Practices
- Filter data before joining
- Use broadcast joins for small lookup tables
- Avoid joining on high-cardinality columns unnecessarily
- Select only required columns after join
Interview Tip:
If joins are slow, the first thing to check is data size and shuffle.
Key Concepts You Learned Today
- All major join types in PySpark
- Correct join syntax and conditions
- How broadcast joins work
- Basic join performance optimisation
What’s Coming on Day 5?
Day 5 – Aggregations & Window Functions
- groupBy and aggregations
- Window functions
- Ranking and running totals
- Real interview-style problems
