PySpark Important Last-Minute Notes

If you are preparing for Data Engineering interviews, Spark projects, or need a quick PySpark revision, this post consolidates the most important PySpark concepts in one place.

Best for: Data Engineers, Big Data Developers, Azure/Databricks/Microsoft Fabric users, and anyone doing last-minute interview preparation.




What is PySpark?

PySpark is the Python API for Apache Spark, an open-source distributed computing framework used for large-scale data processing.

Why PySpark?

  • Distributed, in-memory processing
  • Faster than traditional batch systems
  • Scales across clusters
  • Supports SQL, Streaming, and Machine Learning

Common use cases

  • ETL / ELT pipelines
  • Big data analytics
  • Machine learning at scale
  • Real-time and batch processing

Spark Cluster Architecture

A Spark cluster typically consists of:

  • Master node: manages cluster resources and schedules work across the workers
  • Worker nodes: run executors that execute tasks in parallel

This architecture enables efficient distributed processing on large datasets.
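
As a quick, hands-on check of this idea, you can ask Spark how many parallel slots it sees and split a small collection across them (a minimal sketch, assuming a SparkSession named spark, as created in the next section or provided by a Databricks notebook):

# Default number of partitions Spark uses for parallel operations on this cluster
print(spark.sparkContext.defaultParallelism)

# Distribute a small collection into 4 partitions that the workers process in parallel
rdd = spark.sparkContext.parallelize(range(100), 4)
print(rdd.getNumPartitions())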


SparkSession – Entry Point to PySpark

SparkSession is the unified entry point to Spark features (SQL, streaming, ML, and DataFrame operations).

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MySparkApp") \
    .getOrCreate()

Key points

  • .appName() names the application so it is easy to identify when tracking multiple applications
  • .getOrCreate() reuses an existing session instead of creating a duplicate one that wastes memory

PySpark DataFrames

PySpark DataFrames are similar to pandas DataFrames, but they are distributed across the cluster and optimized for big data processing.

Create a DataFrame

df = spark.read.csv("file.csv", header=True, inferSchema=True)
df.show()

Inspect schema

df.printSchema()

Count rows

df.count()

Basic Analytics Operations

Group and aggregate

df.groupBy("gender").agg({"salary_usd": "avg"})

Common DataFrame methods (PySpark vs SQL)

  • select() → SELECT
  • filter() → WHERE
  • groupBy() → GROUP BY
  • agg() → aggregate functions such as SUM, AVG, COUNT
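
To see the mapping in practice, here is a sketch of the same aggregation written with the DataFrame API and with Spark SQL (assuming df has gender and salary_usd columns as in the example above; the view name employees is just an example):

from pyspark.sql import functions as F

# DataFrame API: SELECT gender, AVG(salary_usd) ... GROUP BY gender
df.groupBy("gender").agg(F.avg("salary_usd").alias("avg_salary")).show()

# SQL equivalent: register the DataFrame as a temporary view and query it
df.createOrReplaceTempView("employees")
spark.sql("""
    SELECT gender, AVG(salary_usd) AS avg_salary
    FROM employees
    GROUP BY gender
""").show()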

Filtering and Selecting Data

df.filter(df.age > 50).select("name", "age").show()

Handling Missing (Null) Values

Drop nulls

df.na.drop()
df.na.drop(subset=["column_name"])

Fill nulls

df.na.fill(value=0)
df.na.fill({"column1": 0, "column2": "Unknown"})

Filter out nulls using isNotNull()

df.filter(df["column_name"].isNotNull())

Column Operations

Create a new column

df = df.withColumn("new_column", df["existing_column"] * 2)

Rename a column

df = df.withColumnRenamed("old_column_name", "new_column_name")

Drop a column

df = df.drop("column_name")

Row Operations

Filter rows

df_filtered = df.filter(df["age"] > 30)

Group and aggregate

df_grouped = df.groupBy("category").agg({"value_column": "sum"})

Defining Schema Manually (Recommended)

Schema inference is convenient, but it can misinterpret column data types and requires an extra pass over the data. Defining the schema manually improves data quality and stability.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("age", IntegerType()),
    StructField("education_num", IntegerType()),
    StructField("marital_status", StringType()),
    StructField("occupation", StringType()),
    StructField("income", StringType())
])

df = spark.read.csv("data.csv", schema=schema)
df.printSchema()

Joins in PySpark

Joins combine two DataFrames based on matching keys, similar to SQL joins.

Supported join types

  • Inner Join
  • Left (Outer) Join
  • Right (Outer) Join
  • Full Outer Join

Example (inner join):

df1.join(df2, df1["id"] == df2["id"], "inner")
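
A left join can also be written by passing the key column name, which keeps a single id column in the result (a minimal sketch, assuming both DataFrames have an id column):

df_left = df1.join(df2, on="id", how="left")
df_left.show()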

Union Operation

Union stacks the rows of two DataFrames; they must have the same schema because columns are matched by position.

df_union = df1.union(df2)
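
A small end-to-end sketch with hypothetical id and name columns:

df1 = spark.createDataFrame([(1, "Alice")], ["id", "name"])
df2 = spark.createDataFrame([(2, "Bob")], ["id", "name"])

# Rows of df2 are stacked under df1; duplicate rows are kept (use distinct() to drop them)
df_union = df1.union(df2)
df_union.show()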

Working with Complex Data Types

Arrays

from pyspark.sql.functions import array, lit

# array() builds an ArrayType column from individual literal values
df = df.withColumn("array_col", array(lit(1), lit(2), lit(3)))

Maps

from pyspark.sql.functions import create_map, lit

# create_map() builds a MapType column from alternating key/value expressions
df = df.withColumn("map_col", create_map(lit("k1"), lit("v1")))

Structs

Structs group related fields into one column (useful for nested data).
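
A minimal sketch using the struct() function, assuming the DataFrame has name and age columns:

from pyspark.sql.functions import struct

# Pack two related columns into one nested struct column
df = df.withColumn("person", struct(df["name"], df["age"]))

# Nested fields can be selected with dot notation
df.select("person.name", "person.age").show()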


User Defined Functions (UDFs)

PySpark UDF

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def to_upper(s):
    # Guard against nulls: Spark passes None for null values
    return s.upper() if s is not None else None

upper_udf = udf(to_upper, StringType())

df = df.withColumn("upper_name", upper_udf(df["name"]))

pandas UDF (Preferred for large datasets)

from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf("float")
def square(s: pd.Series) -> pd.Series:
    return s * s

df = df.withColumn("squared", square(df["value"]))

PySpark UDF vs pandas UDF

  • Performance: PySpark UDF is slower; pandas UDF is faster
  • Best for data size: PySpark UDF suits small data; pandas UDF suits large data
  • Execution: PySpark UDF runs row-wise; pandas UDF runs vectorized

Final Takeaways

  • PySpark is designed for distributed big data processing.
  • DataFrames are optimized and support SQL-like operations.
  • Manual schemas improve data quality and prevent type issues.
  • pandas UDFs are typically faster for large datasets.
  • PySpark is essential for Azure Databricks, Microsoft Fabric, and Spark-based interviews.

If this helped you: Bookmark this post for quick revision before interviews.
