Learn PySpark in 7 Days – Day 6

Welcome to Day 6. Today we move from writing correct PySpark code to writing efficient and scalable PySpark code. This is where many data engineers stumble, both in interviews and in production pipelines.

Day 6 Focus:
• repartition vs coalesce
• Understanding shuffle & data skew
• cache vs persist
• Efficient file formats (Parquet)

Why Performance Matters in Spark

Spark can process terabytes of data, but poor partitioning or unnecessary shuffles can make jobs slow and expensive. Performance tuning is a core data engineering skill.

Partitions in Spark (Concept)

  • Data in Spark is split into partitions
  • Each partition is processed in parallel
  • Too few partitions → an underutilised cluster (idle cores)
  • Too many partitions → task-scheduling overhead and many tiny tasks
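
You can check how your data is currently split by reading the partition count straight from the DataFrame. A minimal sketch, assuming df is an existing DataFrame:

# Number of partitions backing the DataFrame
print(df.rdd.getNumPartitions())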

repartition()

df_repart = df.repartition(10)
  • Increases or decreases partitions
  • Causes a full shuffle
  • Use when you need even distribution
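
Repartitioning by a key column is a common way to get a key-based, even distribution before a heavy join or aggregation. A hedged sketch, where customer_id is just an assumed example column:

# Full shuffle: rows with the same customer_id land in the same partition
df_by_key = df.repartition(10, "customer_id")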

coalesce()

df_coalesce = df.coalesce(5)
  • Reduces number of partitions
  • No full shuffle
  • Best used before writing output
Interview Tip:
Use repartition when you need to increase partitions or rebalance data evenly; use coalesce when you only need to reduce them cheaply.
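
Because coalesce() avoids a full shuffle, it is a cheap way to cut down the number of output files just before a write. A minimal sketch (the output path is illustrative):

# Merge down to 5 partitions, then write 5 files instead of many small ones
df.coalesce(5).write.mode("overwrite").parquet("/path/output/employees_small")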

What Is a Shuffle?

A shuffle happens when Spark needs to redistribute data across executors, such as during joins, groupBy, or repartition.

  • Network intensive
  • Disk I/O heavy
  • Primary cause of slow jobs
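
For example, a simple groupBy forces Spark to move all rows for each key onto the same executor. The sketch below assumes df has a department column; the number of post-shuffle partitions is controlled by spark.sql.shuffle.partitions (200 by default):

# Control how many partitions the shuffle produces (200 is the default)
spark.conf.set("spark.sql.shuffle.partitions", "200")
# The aggregation triggers a shuffle so matching keys meet on one executor
df.groupBy("department").count().show()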

Data Skew (Common Problem)

Data skew occurs when one partition has significantly more data than others.

Example: One customer_id has millions of records while others have few.

Basic Mitigation Techniques

  • Broadcast small tables
  • Filter early
  • Increase parallelism
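
Broadcasting is the easiest of these to show in code: if the lookup table is small enough to fit on every executor, Spark can skip shuffling the large, skewed side entirely. A minimal sketch with assumed DataFrame names (orders_df large, customers_df small):

from pyspark.sql.functions import broadcast

# Ship the small table to every executor; the large table is not shuffled
result = orders_df.join(broadcast(customers_df), "customer_id")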

cache() vs persist()

cache()

df.cache()
  • Stores DataFrame in memory
  • Default storage level
  • Best for repeated reuse
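
Remember that caching is lazy: nothing is stored until an action runs. A short sketch (the salary column is just an assumed example):

df.cache()
df.count()                            # first action materialises the cache
df.filter(df.salary > 50000).show()   # served from the cached data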

persist()

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)
  • More control over storage
  • Useful when data doesn’t fit in memory
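
A common pattern, building on the snippet above, is persist, run the actions that reuse the DataFrame, then release the storage:

df.persist(StorageLevel.MEMORY_AND_DISK)   # StorageLevel imported above
df.count()        # an action triggers the actual materialisation
# ...reuse df in further transformations and actions...
df.unpersist()    # free memory and disk once you are done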
Best Practice:
Cache or persist only when a DataFrame is reused multiple times; otherwise you pay the storage cost for nothing.

Why Parquet Is Preferred

df.write.mode("overwrite").parquet("/path/output/employees")
  • Columnar format
  • Compressed
  • Faster reads and writes
  • Schema stored with data
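
Reading the data back is just as simple, and because Parquet is columnar Spark only scans the columns you select. A minimal sketch (the column names are assumed):

df_parquet = spark.read.parquet("/path/output/employees")
df_parquet.select("name", "salary").show()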

CSV vs Parquet (Quick Comparison)

  • CSV → Human readable, slow, no schema
  • Parquet → Optimised, compressed, analytics-friendly

Key Concepts You Learned Today

  • How partitions affect performance
  • repartition vs coalesce
  • What shuffle and skew are
  • When to use cache or persist
  • Why Parquet is preferred in production

What’s Coming on Day 7?

Day 7 – End-to-End PySpark Pipeline
  • Reading raw data
  • Transformations
  • Joins and aggregations
  • Writing curated output
  • Interview-ready explanation
