PySpark in 7 Days (bonus2)

PySpark Cheat Sheet – Data Engineer Edition

TL;DR
• SparkSession is the entry point
• Transformations are lazy; actions trigger execution
• Avoid unnecessary shuffles; prefer broadcast joins for small tables
• Use Parquet/Delta, not CSV
• Window functions = “Top N per group”
• repartition increases, coalesce reduces partitions
• Always think: DAG → stages → tasks



1. Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("AppName") \
    .getOrCreate()

2. Reading Data

Format    Example
CSV       spark.read.option("header", "true").csv(path)
JSON      spark.read.json(path)
Parquet   spark.read.parquet(path)
Delta     spark.read.format("delta").load(path)
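
CSV reads treat every column as a string unless you infer or supply a schema. A minimal sketch of passing an explicit schema (the column names here are hypothetical):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Hypothetical schema; adjust field names and types to your file
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("salary", DoubleType(), True),
])

df = spark.read.option("header", "true").schema(schema).csv(path)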

3. Writing Data

df.write.mode("overwrite").parquet(path)
df.write.partitionBy("date").parquet(path)

4. DataFrame Basics

df.show()
df.printSchema()
df.count()
df.columns

5. Select / Filter / withColumn

from pyspark.sql.functions import col

df.select("id", "name")
df.filter(col("salary") > 50000)
df.withColumn("bonus", col("salary") * 0.1)

6. Null Handling

df.dropna()
df.fillna(0)
df.filter(col("col").isNotNull())

7. Joins

Type   Syntax
Inner  df1.join(df2, "id", "inner")
Left   df1.join(df2, "id", "left")
Right  df1.join(df2, "id", "right")
Full   df1.join(df2, "id", "outer")

Broadcast Join

from pyspark.sql.functions import broadcast

df.join(broadcast(dim_df), "id")
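
Spark will also broadcast the smaller side automatically when it is below spark.sql.autoBroadcastJoinThreshold (10 MB by default). A quick sketch of tuning or disabling that behavior:

# Raise the auto-broadcast threshold to 50 MB (value is in bytes)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# Set to -1 to disable automatic broadcast joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)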

8. Aggregations

from pyspark.sql.functions import avg, count, max

df.groupBy("dept").agg(
    count("*"),
    avg("salary"),
    max("salary")
)

9. Window Functions

from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number, rank, dense_rank

w = Window.partitionBy("dept").orderBy(col("salary").desc())
df.withColumn("rn", row_number().over(w))
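
The window above ranks rows by salary within each department. To answer the classic "Top N per group" question, filter on the rank column, e.g. the top 3 earners per department:

# Top 3 salaries per department using the window defined above
top3 = df.withColumn("rn", row_number().over(w)).filter(col("rn") <= 3)
top3.show()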

10. Spark SQL

df.createOrReplaceTempView("employees")

spark.sql("""
    SELECT dept, AVG(salary)
    FROM employees
    GROUP BY dept
""")

11. Partitions & Performance

Operation       Use Case
repartition(n)  Increase / rebalance partitions (shuffle)
coalesce(n)     Reduce partitions (no shuffle)

df.repartition(10)
df.coalesce(1)
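
To see the effect, inspect the partition count before and after. Shuffles produce spark.sql.shuffle.partitions output partitions (200 by default), which is often worth lowering for small datasets; a quick sketch:

# Inspect the current partition count of a DataFrame
print(df.rdd.getNumPartitions())

# Lower the default number of shuffle partitions
spark.conf.set("spark.sql.shuffle.partitions", 64)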

12. Cache & Persist

df.cache()

from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)

13. Execution Plan & Debugging

df.explain()       # physical plan only
df.explain(True)   # parsed, analyzed, optimized and physical plans

14. Delta Lake (Must-Know)

MERGE INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
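
The same upsert can be written with the Delta Lake Python API (requires the delta-spark package; the target path and source_df below are placeholders):

from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/delta/target")   # placeholder target path

(target.alias("t")
    .merge(source_df.alias("s"), "t.id = s.id")       # source_df is a placeholder DataFrame
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())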

15. Structured Streaming (Basics)

df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/chk") \
    .start("/out")
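
The write side above assumes df is a streaming DataFrame. For completeness, a minimal read side, assuming a Delta source (the /in path is a placeholder):

# Read a Delta table as a stream; /in is a placeholder input path
df = spark.readStream.format("delta").load("/in")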

16. Common Interview Triggers

  • Slow job → shuffle or skew (see the salting sketch after this list)
  • Top N per group → window functions
  • Small lookup table → broadcast join
  • Repeated use → cache
  • Analytics storage → Parquet / Delta
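
For the skew case, one common mitigation is key salting: append a random suffix to the hot join key on the large side and replicate the small side once per suffix so the join still matches. A minimal sketch, assuming a skewed key id and a hypothetical salt factor of 8 (big_df and small_df are placeholders):

from pyspark.sql.functions import col, concat_ws, lit, rand, explode, array

SALT = 8  # hypothetical number of salt buckets

# Large, skewed side: append a random salt bucket to the join key
big_salted = big_df.withColumn(
    "salted_id",
    concat_ws("_", col("id").cast("string"), (rand() * SALT).cast("int").cast("string"))
)

# Small side: replicate each row once per salt bucket
small_salted = (
    small_df
    .withColumn("salt", explode(array(*[lit(str(i)) for i in range(SALT)])))
    .withColumn("salted_id", concat_ws("_", col("id").cast("string"), col("salt")))
)

joined = big_salted.join(small_salted, "salted_id")

Note that Spark 3's adaptive query execution (spark.sql.adaptive.skewJoin.enabled) handles many skewed joins automatically, so try that before salting by hand.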
