Learn PySpark in 7 Days – Day 6
Welcome to Day 6. Today we move from writing correct PySpark code to writing efficient, scalable PySpark code. This is where many data engineers stumble, both in interviews and in production pipelines.
Day 6 Focus:
• repartition vs coalesce
• Understanding shuffle & data skew
• cache vs persist
• Efficient file formats (Parquet)
Why Performance Matters in Spark
Spark can process terabytes of data, but poor partitioning or unnecessary shuffles can make jobs slow and expensive. Performance tuning is a core data engineering skill.
Partitions in Spark (Concept)
- Data in Spark is split into partitions
- Each partition is processed in parallel
- Too few partitions → underutilised cluster (idle cores)
- Too many partitions → scheduling and task overhead
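A quick way to check is shown below (a minimal sketch; df stands for any DataFrame you already have loaded):
# Inspect how many partitions the DataFrame currently has
num_parts = df.rdd.getNumPartitions()
print(f"Current partitions: {num_parts}")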
repartition()
df_repart = df.repartition(10)
- Increases or decreases partitions
- Causes a full shuffle
- Use when you need even distribution
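You can also repartition by a column so that rows sharing a key land in the same partition (a sketch; the customer_id column is assumed for illustration):
# Hash-partition by customer_id into 10 partitions (still a full shuffle)
df_by_key = df.repartition(10, "customer_id")
print(df_by_key.rdd.getNumPartitions())  # 10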
coalesce()
df_coalesce = df.coalesce(5)
- Reduces number of partitions
- No full shuffle
- Best used before writing output
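A typical pattern is to shrink the partition count right before a write so you don't end up with hundreds of tiny files (a sketch assuming a hypothetical output path):
# Merge down to 5 partitions just before writing to limit the number of output files
df.coalesce(5).write.mode("overwrite").parquet("/path/output/coalesced")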
Interview Tip:
Use repartition to increase partitions, coalesce to reduce them.
What Is a Shuffle?
A shuffle happens when Spark needs to redistribute data across executors, such as during joins, groupBy, or repartition.
- Network intensive
- Disk I/O heavy
- Primary cause of slow jobs
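For example, a groupBy aggregation forces Spark to redistribute rows by key, which shows up as an Exchange in the query plan (a minimal sketch assuming customer_id and amount columns):
from pyspark.sql import functions as F
# groupBy triggers a shuffle: rows with the same customer_id must end up on the same executor
totals = df.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
totals.explain()  # look for "Exchange" in the physical plan - that is the shuffle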
Data Skew (Common Problem)
Data skew occurs when one partition has significantly more data than others.
Example: One customer_id has millions of records while others have few.
Basic Mitigation Techniques
- Broadcast small tables (see the sketch after this list)
- Filter early
- Increase parallelism
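For the first technique, broadcasting the small side of a join avoids shuffling the large table at all (a sketch; big_df and small_df are placeholder names):
from pyspark.sql.functions import broadcast
# Ship the small dimension table to every executor instead of shuffling the large fact table
joined = big_df.join(broadcast(small_df), on="customer_id", how="left")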
cache() vs persist()
cache()
df.cache()
- Stores the DataFrame using the default storage level
- For DataFrames the default is MEMORY_AND_DISK (spills to disk if memory runs out)
- Best for repeated reuse
persist()
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
- More control over storage
- Useful when data doesn’t fit in memory
Best Practice:
Cache only when DataFrame is reused multiple times.
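A typical reuse pattern looks like this (a minimal sketch; the status column and filter condition are just an illustration):
# Cache a DataFrame that several actions will reuse
active = df.filter(df.status == "active").cache()
active.count()                                  # first action materialises the cache
active.groupBy("customer_id").count().show()    # second action reuses the cached data
active.unpersist()                              # release the memory when you are done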
Why Parquet Is Preferred
df.write.mode("overwrite").parquet("/path/output/employees")
- Columnar format
- Compressed
- Much faster reads thanks to column pruning and predicate pushdown
- Schema stored with data
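Reading it back is just as simple, and selecting only the columns you need lets Parquet skip the rest (a sketch reusing the path above; the column names are assumptions):
# Read the Parquet output back; only the selected columns are scanned
emp = spark.read.parquet("/path/output/employees")
emp.select("employee_id", "salary").show(5)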
CSV vs Parquet (Quick Comparison)
- CSV → Human readable, slow, no schema
- Parquet → Optimised, compressed, analytics-friendly
Key Concepts You Learned Today
- How partitions affect performance
- repartition vs coalesce
- What shuffle and skew are
- When to use cache or persist
- Why Parquet is preferred in production
What’s Coming on Day 7?
Day 7 – End-to-End PySpark Pipeline
- Reading raw data
- Transformations
- Joins and aggregations
- Writing curated output
- Interview-ready explanation
