Learn PySpark in 7 Days – Day 4

Welcome to Day 4. Today we cover one of the most frequently tested interview topics and one of the most performance-critical operations in Spark: joins.

Day 4 Focus:
• Join types in PySpark
• Join conditions and syntax
• Broadcast joins
• Join performance basics


Why Joins Matter in Spark

In real-world data engineering, data rarely comes from a single source. Joins combine datasets, but they are also one of the biggest causes of performance problems when implemented incorrectly.

Sample DataFrames

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Day4Joins").getOrCreate()

employees = [
    (1, "Mahesh", 10),
    (2, "Ravi", 20),
    (3, "Anita", 10),
    (4, "John", 30),
]

departments = [
    (10, "IT"),
    (20, "HR"),
    (30, "Finance"),
]

emp_df = spark.createDataFrame(employees, ["emp_id", "name", "dept_id"])
dept_df = spark.createDataFrame(departments, ["dept_id", "dept_name"])

Inner Join

emp_df.join(
    dept_df,
    emp_df.dept_id == dept_df.dept_id,
    "inner"
).show()

Returns only matching records from both DataFrames.

Left Join

emp_df.join(
    dept_df,
    emp_df.dept_id == dept_df.dept_id,
    "left"
).show()

Returns all records from the left DataFrame and matching records from the right.

Right Join

emp_df.join(
    dept_df,
    emp_df.dept_id == dept_df.dept_id,
    "right"
).show()

Returns all records from the right DataFrame and matching records from the left.

Full Outer Join

emp_df.join(
    dept_df,
    emp_df.dept_id == dept_df.dept_id,
    "outer"
).show()

Returns all records from both DataFrames, filling in nulls where there is no match.

Handling Duplicate Columns

from pyspark.sql.functions import col

emp_df.alias("e").join(
    dept_df.alias("d"),
    col("e.dept_id") == col("d.dept_id"),
    "inner"
).select(
    "e.emp_id", "e.name", "d.dept_name"
).show()

Broadcast Join (Very Important)

Use a broadcast join when one table is small and the other is large.
from pyspark.sql.functions import broadcast

emp_df.join(
    broadcast(dept_df),
    emp_df.dept_id == dept_df.dept_id,
    "inner"
).show()

Broadcast joins avoid expensive shuffles by sending the small table to all executors.

Join Performance Best Practices

  • Filter data before joining
  • Use broadcast joins for small lookup tables
  • Avoid joining on high-cardinality columns unnecessarily
  • Select only required columns after join
Interview Tip:
If joins are slow, the first thing to check is data size and shuffle.

Key Concepts You Learned Today

  • All major join types in PySpark
  • Correct join syntax and conditions
  • How broadcast joins work
  • Basic join performance optimisation

What’s Coming on Day 5?

Day 5 – Aggregations & Window Functions
  • groupBy and aggregations
  • Window functions
  • Ranking and running totals
  • Real interview-style problems
