Learn PySpark in 7 Days – Day 2
Welcome to Day 2 of the PySpark in 7 Days series. Today is one of the most important days because DataFrames are the core abstraction used in real-world Spark pipelines.
Day 2 Focus:
• Understanding PySpark DataFrames
• Reading CSV and JSON files
• Schema inference vs manual schema
• select, filter, withColumn operations
What Is a PySpark DataFrame?
A PySpark DataFrame is a distributed table-like structure with rows and columns, similar to a SQL table or Pandas DataFrame, but designed to work on large-scale data.
- Immutable (cannot be changed in place)
- Schema-aware
- Optimised by the Spark Catalyst engine
Creating SparkSession (Recap)
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Day2-PySpark") \
    .getOrCreate()
Reading CSV Files
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("/path/employees.csv")
df.show()
df.printSchema()
Interview Tip:
inferSchema is convenient but expensive for large files, because Spark must scan the data an extra time just to guess the types. In production, prefer explicit schemas.
Reading JSON Files
json_df = spark.read.json("/path/employees.json")
json_df.show()
Defining a Manual Schema
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField("emp_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("department", StringType(), True),
    StructField("salary", IntegerType(), True)
])
df = spark.read \
    .option("header", "true") \
    .schema(schema) \
    .csv("/path/employees.csv")
Common DataFrame Operations
Select Columns
df.select("name", "salary").show()
Filter Rows
df.filter(df.salary > 50000).show()
Add or Modify Column (withColumn)
from pyspark.sql.functions import col
df = df.withColumn("salary_bonus", col("salary") * 1.10)
df.show()
Important:
withColumn always returns a new DataFrame.
Rename Columns
df = df.withColumnRenamed("emp_id", "employee_id")
Key Concepts You Learned Today
- What a PySpark DataFrame is
- Reading CSV and JSON data
- Schema inference vs manual schema
- select, filter, withColumn usage
What’s Coming on Day 3?
Day 3 – PySpark SQL & Expressions
- Spark SQL vs DataFrame API
- createOrReplaceTempView
- SQL queries on DataFrames
- CASE WHEN, aggregations
