If you are preparing for Data Engineering interviews, Spark projects, or need a quick PySpark revision, this post consolidates the most important PySpark concepts in one place.
Best for: Data Engineers, Big Data Developers, Azure/Databricks/Microsoft Fabric users, and anyone doing last-minute interview preparation.
What is PySpark?
PySpark is the Python API for Apache Spark, an open-source distributed computing framework used for large-scale data processing.
Why PySpark?
- Distributed, in-memory processing
- Often much faster than traditional disk-based batch systems such as Hadoop MapReduce
- Scales across clusters
- Supports SQL, Streaming, and Machine Learning
Common use cases
- ETL / ELT pipelines
- Big data analytics
- Machine learning at scale
- Real-time and batch processing
Spark Cluster Architecture
A Spark cluster typically consists of:
- Master Node: Runs the cluster manager, which allocates resources and schedules work across the cluster
- Worker Nodes: Run executors that process tasks in parallel on partitions of the data
This architecture enables efficient distributed processing on large datasets.
SparkSession – Entry Point to PySpark
SparkSession is the unified entry point to Spark features (SQL, streaming, ML, and DataFrame operations).
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("MySparkApp") \
    .getOrCreate()
Key points
- .appName() names the application, which helps you track it when multiple applications are running.
- .getOrCreate() reuses an existing active session instead of creating a duplicate that wastes memory.
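Building on the cluster architecture above, resource and deployment settings can be passed to the builder as well. A minimal sketch; the master URL and executor values here are placeholders to adapt to your environment (on Databricks and Microsoft Fabric a preconfigured session is usually provided for you):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("MySparkApp")
    .master("local[*]")                      # local machine; use "yarn" or a spark:// URL for a real cluster
    .config("spark.executor.memory", "2g")   # memory per executor on each worker node
    .config("spark.executor.cores", "2")     # cores per executor
    .getOrCreate()
)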
PySpark DataFrames
PySpark DataFrames are similar to Pandas, but distributed and optimized for big data processing.
Create a DataFrame
df = spark.read.csv("file.csv", header=True, inferSchema=True)
df.show()
Inspect schema
df.printSchema()
Count rows
df.count()
Basic Analytics Operations
Group and aggregate
df.groupBy("gender").agg({"salary_usd": "avg"})
Common DataFrame methods (PySpark vs SQL)
| PySpark | SQL Equivalent |
|---|---|
| select() | SELECT |
| filter() | WHERE |
| groupBy() | GROUP BY |
| agg() | Aggregate functions (SUM, AVG, COUNT, ...) |
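Because DataFrames map directly onto these SQL concepts, you can also register a DataFrame as a temporary view and query it with plain SQL. A quick sketch; the view name "people" is arbitrary and the columns follow the earlier aggregation example:

df.createOrReplaceTempView("people")

spark.sql("""
    SELECT gender, AVG(salary_usd) AS avg_salary_usd
    FROM people
    GROUP BY gender
""").show()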
Filtering and Selecting Data
df.filter(df.age > 50).select("name", "age").show()
Handling Missing (Null) Values
Drop nulls
df.na.drop()
df.na.drop(subset=["column_name"])
Fill nulls
df.na.fill(value=0)
df.na.fill({"column1": 0, "column2": "Unknown"})
Filter out nulls using isNotNull()
df.filter(df["column_name"].isNotNull())
Column Operations
Create a new column
df = df.withColumn("new_column", df["existing_column"] * 2)
Rename a column
df = df.withColumnRenamed("old_column_name", "new_column_name")
Drop a column
df = df.drop("column_name")
Row Operations
Filter rows
df_filtered = df.filter(df["age"] > 30)
Group and aggregate
df_grouped = df.groupBy("category").agg({"value_column": "sum"})
Defining Schema Manually (Recommended)
Schema inference is convenient, but it can misinterpret column data types. Manual schema improves data quality and stability.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField("age", IntegerType()),
    StructField("education_num", IntegerType()),
    StructField("marital_status", StringType()),
    StructField("occupation", StringType()),
    StructField("income", StringType())
])
df = spark.read.csv("data.csv", schema=schema)
df.printSchema()
Joins in PySpark
Joins combine two DataFrames based on matching keys, similar to SQL joins.
Supported join types
- Inner Join
- Left (Outer) Join
- Right (Outer) Join
- Full Outer Join
df1.join(df2, df1["id"] == df2["id"], "inner")
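The last argument selects the join type. For example, a left join keeps every row from df1 (passing on="id" instead of an explicit equality condition also avoids a duplicated id column in the result):

# Left join: keep all rows from df1; unmatched rows from df2 become nulls
df_left = df1.join(df2, on="id", how="left")

# Full outer join: keep unmatched rows from both sides
df_full = df1.join(df2, on="id", how="full")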
Union Operation
Union stacks two DataFrames (they must have the same schema).
df_union = df1.union(df2)
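Note that union() matches columns by position, not by name. If the two DataFrames have the same columns in a different order, unionByName() is the safer choice:

# Match columns by name instead of position
df_union = df1.unionByName(df2)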
Working with Complex Data Types
Arrays
from pyspark.sql.functions import array, lit
df = df.withColumn("array_col", array(lit(1), lit(2), lit(3)))
Maps
from pyspark.sql.functions import create_map, lit
df = df.withColumn("map_col", create_map(lit("k1"), lit("v1")))
Structs
Structs group related fields into one column (useful for nested data).
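A minimal sketch, assuming the DataFrame has "city" and "zip" columns (hypothetical names) that we want to nest under a single "address" column:

from pyspark.sql.functions import struct, col

# Combine two flat columns into one nested struct column
df = df.withColumn("address", struct(col("city"), col("zip")))

# Nested fields are read with dot notation
df.select("address.city").show()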
User Defined Functions (UDFs)
PySpark UDF
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def to_upper(s):
    # Nulls reach the UDF as None, so guard against them
    return s.upper() if s is not None else None
upper_udf = udf(to_upper, StringType())
df = df.withColumn("upper_name", upper_udf(df["name"]))
pandas UDF (Preferred for large datasets)
from pyspark.sql.functions import pandas_udf
import pandas as pd
@pandas_udf("float")
def square(s: pd.Series) -> pd.Series:
    return s * s
df = df.withColumn("squared", square(df["value"]))
PySpark UDF vs pandas UDF
| Feature | PySpark UDF | pandas UDF |
|---|---|---|
| Performance | Slower | Faster |
| Best for data size | Small | Large |
| Execution | Row-wise | Vectorized |
Final Takeaways
- PySpark is designed for distributed big data processing.
- DataFrames are optimized and support SQL-like operations.
- Manual schemas improve data quality and prevent type issues.
- pandas UDFs are typically faster for large datasets.
- PySpark is essential for Azure Databricks, Microsoft Fabric, and Spark-based interviews.
If this helped you: Bookmark this post for quick revision before interviews.
