Learn PySpark in 7 Days – Day 2
Welcome to Day 2 of the PySpark in 7 Days series. Today is one of the most important days because DataFrames are the core abstraction used in real-world Spark pipelines.
Day 2 Focus:
• Understanding PySpark DataFrames
• Reading CSV and JSON files
• Schema inference vs manual schema
• select, filter, withColumn operations
What Is a PySpark DataFrame?
A PySpark DataFrame is a distributed table-like structure with rows and columns, similar to a SQL table or Pandas DataFrame, but designed to work on large-scale data.
- Immutable (cannot be changed in place)
- Schema-aware
- Optimised by the Spark Catalyst engine
Creating SparkSession (Recap)
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Day2-PySpark") \
    .getOrCreate()
Reading CSV Files
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("/path/employees.csv")
df.show()
df.printSchema()
Interview Tip:
inferSchema is convenient but expensive for large files, because Spark must scan the data an extra time just to guess the types. In production, prefer explicit schemas.
Reading JSON Files
json_df = spark.read.json("/path/employees.json")
json_df.show()
Defining a Manual Schema
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField("emp_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("department", StringType(), True),
    StructField("salary", IntegerType(), True)
])
df = spark.read \
    .option("header", "true") \
    .schema(schema) \
    .csv("/path/employees.csv")
Common DataFrame Operations
Select Columns
df.select("name", "salary").show()
Filter Rows
df.filter(df.salary > 50000).show()
Add or Modify Column (withColumn)
from pyspark.sql.functions import col
df = df.withColumn("salary_bonus", col("salary") * 1.10)
df.show()
Important:
withColumn always returns a new DataFrame.
Rename Columns
df = df.withColumnRenamed("emp_id", "employee_id")
Key Concepts You Learned Today
- What a PySpark DataFrame is
- Reading CSV and JSON data
- Schema inference vs manual schema
- select, filter, withColumn usage
What’s Coming on Day 3?
Day 3 – PySpark SQL & Expressions
- Spark SQL vs DataFrame API
- createOrReplaceTempView
- SQL queries on DataFrames
- CASE WHEN, aggregations
