
Posts

Recent posts

YAML Mastery for Data Engineers: The Configuration Language You Can't Ignore

Published on January 21, 2026 | 8 min read

As a senior data engineer, you've mastered SQL, conquered Python, and tamed distributed systems. But there's one skill that quietly determines whether your pipelines run smoothly or become maintenance nightmares: YAML proficiency. This humble configuration language is the backbone of modern data engineering workflows, and mastering it is non-negotiable.

Why YAML Matters in Data Engineering
YAML (YAML Ain't Markup Language) is a human-readable data serialization format that has become the de facto standard for configuration management. Unlike JSON or XML, YAML prioritizes readability while maintaining powerful data structuring capabilities.

Where You'll Use ...
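As a quick companion to the excerpt above, here is a minimal sketch of reading a pipeline config from YAML in Python with PyYAML. The file name, keys, and defaults are hypothetical examples for illustration only, not taken from the post.

    # Expected pipeline_config.yml (hypothetical):
    #   source:
    #     path: /data/raw/events
    #   ingestion:
    #     batch_size: 500
    import yaml

    with open("pipeline_config.yml") as f:
        config = yaml.safe_load(f)   # safe_load avoids executing arbitrary YAML tags

    source_path = config["source"]["path"]
    batch_size = config["ingestion"].get("batch_size", 1000)  # hypothetical default
    print(source_path, batch_size)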

Terraform for Senior Data Engineers

Senior Data Engineer • DevOps for Data

Terraform for Data Engineers: Why You Must Know It (and How You’ll Use It)

Terraform is not “infra-only.” For modern data platforms (Azure / Databricks / Snowflake / Fabric / AWS), Terraform becomes the safest way to build, version, review, and reproduce environments across Dev → Test → Prod.

Audience: Beginner → Advanced
Outcome: Practical usage + interview-ready
Includes: 10 most-used commands/scripts
Includes: STAR interview Q&A

Contents
TL;DR
What Terraform is (in one minute)
Why it matters for Data Engineers
Where Terraform is useful in a Data Engineer’s life
Practical patterns you should follow
10 most used Terraform commands/scripts
Interview questions + crisp STAR answers
Quick checklist for “Terraform-ready” Data Engineers

TL;DR
Terraform = Infrastr...

PySpark in 7 Days (bonus2)

PySpark Cheat Sheet – Data Engineer Edition

TL;DR
• SparkSession is the entry point
• Transformations are lazy, actions trigger execution
• Avoid shuffles, prefer broadcast joins
• Use Parquet/Delta, not CSV
• Window functions = “Top N per group”
• repartition increases, coalesce reduces partitions
• Always think: DAG → stages → tasks

1. Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("AppName") \
    .getOrCreate()

2. Reading Data
Format    Example
CSV       spark.read.option("header", "true").csv(path)
JSON      spark.read.json(path)
Parquet   spark.read.parquet(path)
Delta     spark.read.format("delta").load(path)

3. Writing Data
df.write.mode("overwrite").parquet(path)
df.write.partitionBy("date").parquet(path)

4. DataFrame Basics
df.show()
df.printSchema()
df.count()
df.columns

5. Select / Filter / withColumn
from pyspark.sql.functions import col
df.s...
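The TL;DR above mentions broadcast joins and “Top N per group” window functions but the excerpt cuts off before showing them. Here is a minimal sketch of both, assuming hypothetical DataFrames (orders_df, country_dim) and column names (country_code, amount) that are not from the post.

    from pyspark.sql import functions as F, Window
    from pyspark.sql.functions import broadcast

    # Broadcast join: ship the small dimension table to every executor
    # so the large fact table is not shuffled.
    enriched_df = orders_df.join(broadcast(country_dim), on="country_code", how="left")

    # Top N per group: rank rows within each country by amount, keep the top 3.
    w = Window.partitionBy("country_code").orderBy(F.col("amount").desc())
    top3_df = (enriched_df
               .withColumn("rn", F.row_number().over(w))
               .filter(F.col("rn") <= 3)
               .drop("rn"))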

PySpark in 7 Days (bonus1)

PySpark – Advanced Topics (Data Engineer Perspective)

This is the final and advanced-level post of the Learn PySpark series. You have already covered Beginner → Intermediate. This post focuses on production-grade, interview-level, real-world PySpark expected from a Senior Data Engineer.

TL;DR – What You Will Learn
• How Spark actually executes your code (DAG, stages, tasks)
• Catalyst Optimizer & Tungsten engine (why Spark is fast)
• Advanced joins, skew handling, and salting
• Incremental processing & watermarking
• Delta Lake, MERGE, SCD patterns
• Structured Streaming fundamentals
• Error handling, idempotency, and production best practices
• How to explain PySpark architecture in interviews

1. How Spark Really Executes Your Code (DAG)
Every PySpark job is converted into a Directed Acyclic Graph (DAG). Understanding this separates coders from data engineers.
Transformations → Lazy (select, filter, join)
Actions → Trigger execution (...
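The TL;DR lists skew handling and salting, which the excerpt does not reach. Below is a minimal key-salting sketch of the general technique, assuming an existing SparkSession named spark and hypothetical DataFrames fact_df and dim_df joined on a hypothetical customer_id key; SALT_BUCKETS is an illustrative choice, not the post's value.

    from pyspark.sql import functions as F

    SALT_BUCKETS = 8  # assumption: tune to the degree of skew

    # Spread hot keys on the large side across several salted keys.
    fact_salted = fact_df.withColumn(
        "salt", (F.rand() * SALT_BUCKETS).cast("int")
    )

    # Replicate each dimension row once per salt value so every salted key matches.
    salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
    dim_salted = dim_df.crossJoin(salts)

    joined = fact_salted.join(dim_salted, on=["customer_id", "salt"]).drop("salt")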

PySpark in 7 Days (Day 7)

Learn PySpark in 7 Days – Day 7

Welcome to Day 7, the most important day of this series. Today, we bring everything together and build a real-world, end-to-end PySpark ETL pipeline using best practices expected from a professional data engineer.

Day 7 Focus:
• End-to-end PySpark ETL pipeline
• Bronze → Silver → Gold architecture
• Performance-aware transformations
• Interview-ready explanation

Real-World Scenario
We receive daily employee data as CSV files. Our task is to:
Ingest raw data (Bronze)
Clean and transform data (Silver)
Create analytics-ready output (Gold)

Step 1: Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("EndToEnd-PySpark-Pipeline") \
    .getOrCreate()

Step 2: Bronze Layer – Read Raw Data
bronze_df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("/data/bronze/employees.cs...
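The excerpt stops at the Bronze read, so here is an illustrative continuation of the Bronze → Silver → Gold pattern it describes, not the post's actual code: the column names (emp_id, dept, salary) and output paths are hypothetical.

    from pyspark.sql import functions as F

    # Silver: deduplicate, enforce types, drop bad records.
    silver_df = (bronze_df
                 .dropDuplicates(["emp_id"])
                 .withColumn("salary", F.col("salary").cast("double"))
                 .filter(F.col("emp_id").isNotNull()))
    silver_df.write.mode("overwrite").parquet("/data/silver/employees")

    # Gold: analytics-ready aggregate, e.g. average salary per department.
    gold_df = silver_df.groupBy("dept").agg(F.avg("salary").alias("avg_salary"))
    gold_df.write.mode("overwrite").parquet("/data/gold/avg_salary_by_dept")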

PySpark in 7 Days (Day 6)

Learn PySpark in 7 Days – Day 6

Welcome to Day 6. Today we move from writing correct PySpark code to writing efficient and scalable PySpark code. This is where many data engineers fail interviews and production pipelines.

Day 6 Focus:
• repartition vs coalesce
• Understanding shuffle & data skew
• cache vs persist
• Efficient file formats (Parquet)

Why Performance Matters in Spark
Spark can process terabytes of data, but poor partitioning or unnecessary shuffles can make jobs slow and expensive. Performance tuning is a core data engineering skill.

Partitions in Spark (Concept)
Data in Spark is split into partitions
Each partition is processed in parallel
Too few partitions → underutilised cluster
Too many partitions → overhead

repartition()
df_repart = df.repartition(10)
Increases or decreases partitions
Causes a full shuffle
Use when you need even distribution

coalesce()
df_coalesce = df.coalesce(5)
Reduces number...
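The Day 6 focus list also covers cache vs persist and Parquet as an efficient file format, which the excerpt does not reach. Here is a short sketch of those two points, assuming a hypothetical DataFrame df and output path, not the post's own example.

    from pyspark import StorageLevel

    # Either cache (MEMORY_AND_DISK by default for DataFrames) ...
    df.cache()
    # ... or pick the storage level explicitly (use one or the other):
    # df.persist(StorageLevel.MEMORY_AND_DISK)

    df.count()  # an action materialises the cached data

    # Prefer columnar formats such as Parquet over CSV for downstream reads.
    df.write.mode("overwrite").parquet("/data/optimised/output")

    df.unpersist()  # release cached blocks when no longer needed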