PySpark Important Last-Minute Notes

If you are preparing for Data Engineering interviews, Spark projects, or need a quick PySpark revision, this post consolidates the most important PySpark concepts in one place.

Best for: Data Engineers, Big Data Developers, Azure/Databricks/Microsoft Fabric users, and anyone doing last-minute interview preparation.




What is PySpark?

PySpark is the Python API for Apache Spark, an open-source distributed computing framework used for large-scale data processing.

Why PySpark?

  • Distributed, in-memory processing
  • Faster than traditional batch systems
  • Scales across clusters
  • Supports SQL, Streaming, and Machine Learning

Common use cases

  • ETL / ELT pipelines
  • Big data analytics
  • Machine learning at scale
  • Real-time and batch processing

Spark Cluster Architecture

A Spark cluster typically consists of:

  • Master node: coordinates the cluster by allocating resources and scheduling tasks
  • Worker nodes: run executors that process tasks in parallel

This architecture enables efficient distributed processing on large datasets.
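As a quick preview of the SparkSession covered next, here is a minimal sketch of pointing an application at a cluster. The master URL is only an example and depends on how your cluster is deployed:

from pyspark.sql import SparkSession

# "local[*]" runs Spark on all local cores; a real cluster would use a URL
# such as "spark://<host>:7077" (standalone) or "yarn".
spark = SparkSession.builder \
    .appName("ClusterDemo") \
    .master("local[*]") \
    .getOrCreate()

print(spark.sparkContext.defaultParallelism)  # cores available for parallel tasks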


SparkSession – Entry Point to PySpark

SparkSession is the unified entry point to Spark features (SQL, streaming, ML, and DataFrame operations).

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MySparkApp") \
    .getOrCreate()

Key points

  • .appName() names the application so it is easy to identify in the Spark UI and logs
  • .getOrCreate() returns the existing session if one is already running, instead of creating a duplicate

PySpark DataFrames

PySpark DataFrames are similar to pandas DataFrames, but they are distributed across the cluster and optimized for big data processing.

Create a DataFrame

df = spark.read.csv("file.csv", header=True, inferSchema=True)
df.show()

Inspect schema

df.printSchema()

Count rows

df.count()

Basic Analytics Operations

Group and aggregate

df.groupBy("gender").agg({"salary_usd": "avg"})
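The dictionary form above cannot rename the output column. A minimal sketch of the same aggregation using the functions module, which lets you add an alias:

from pyspark.sql import functions as F

# Same aggregation, with a readable output column name
df.groupBy("gender").agg(F.avg("salary_usd").alias("avg_salary")).show()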

Common DataFrame methods (PySpark vs SQL)

PySpark       SQL equivalent
select()      SELECT
filter()      WHERE
groupBy()     GROUP BY
agg()         Aggregate functions (AVG, SUM, COUNT, ...)
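For comparison, here is the same query written both ways; the temporary view name ("people") is just a placeholder:

# DataFrame API
df.filter(df.age > 50).groupBy("gender").agg({"salary_usd": "avg"}).show()

# SQL on a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT gender, AVG(salary_usd) FROM people WHERE age > 50 GROUP BY gender").show()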

Filtering and Selecting Data

df.filter(df.age > 50).select("name", "age").show()

Handling Missing (Null) Values

Drop nulls

df.na.drop()
df.na.drop(subset=["column_name"])

Fill nulls

df.na.fill(value=0)
df.na.fill({"column1": 0, "column2": "Unknown"})

Filter out nulls using isNotNull()

df.filter(df["column_name"].isNotNull())

Column Operations

Create a new column

df = df.withColumn("new_column", df["existing_column"] * 2)

Rename a column

df = df.withColumnRenamed("old_column_name", "new_column_name")

Drop a column

df = df.drop("column_name")

Row Operations

Filter rows

df_filtered = df.filter(df["age"] > 30)

Group and aggregate

df_grouped = df.groupBy("category").agg({"value_column": "sum"})

Defining Schema Manually (Recommended)

Schema inference is convenient, but it can misinterpret column data types and requires an extra pass over the data. Defining the schema manually improves data quality and stability.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("age", IntegerType()),
    StructField("education_num", IntegerType()),
    StructField("marital_status", StringType()),
    StructField("occupation", StringType()),
    StructField("income", StringType())
])

df = spark.read.csv("data.csv", schema=schema)
df.printSchema()

Joins in PySpark

Joins combine two DataFrames based on matching keys, similar to SQL joins.

Supported join types

  • Inner Join
  • Left (Outer) Join
  • Right (Outer) Join
  • Full Outer Join

df1.join(df2, df1["id"] == df2["id"], "inner")
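The other join types differ only in the last argument; assuming both DataFrames share an id key:

df1.join(df2, df1["id"] == df2["id"], "left")
df1.join(df2, df1["id"] == df2["id"], "right")
df1.join(df2, df1["id"] == df2["id"], "full")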

Union Operation

Union stacks two DataFrames (they must have the same schema).

df_union = df1.union(df2)
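Note that union() matches columns by position. If the two DataFrames have the same columns in a different order, unionByName() matches them by name instead:

# Matches columns by name rather than position
df_union = df1.unionByName(df2)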

Working with Complex Data Types

Arrays

from pyspark.sql.functions import array, lit
df = df.withColumn("array_col", array(lit(1), lit(2), lit(3)))

Maps

from pyspark.sql.functions import create_map, lit
df = df.withColumn("map_col", create_map(lit("k1"), lit("v1")))

Structs

Structs group related fields into one column (useful for nested data).
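A minimal sketch using the built-in struct() function; the first_name and last_name columns are hypothetical:

from pyspark.sql.functions import struct

# Combine two flat columns into a single nested struct column
df = df.withColumn("full_name", struct(df["first_name"], df["last_name"]))
df.select("full_name.first_name").show()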


User Defined Functions (UDFs)

PySpark UDF

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def to_upper(s):
    # Guard against nulls, which arrive as None in a Python UDF
    return s.upper() if s is not None else None

upper_udf = udf(to_upper, StringType())

df = df.withColumn("upper_name", upper_udf(df["name"]))

pandas UDF (Preferred for large datasets)

from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf("float")
def square(s: pd.Series) -> pd.Series:
    return s * s

df = df.withColumn("squared", square(df["value"]))

PySpark UDF vs pandas UDF

Feature              PySpark UDF   pandas UDF
Performance          Slower        Faster
Best for data size   Small         Large
Execution            Row-wise      Vectorized

Final Takeaways

  • PySpark is designed for distributed big data processing.
  • DataFrames are optimized and support SQL-like operations.
  • Manual schemas improve data quality and prevent type issues.
  • pandas UDFs are typically faster for large datasets.
  • PySpark is essential for Azure Databricks, Microsoft Fabric, and Spark-based interviews.

If this helped you: Bookmark this post for quick revision before interviews.
