
Posts

Recent posts

YAML Mastery for Data Engineers: The Configuration Language You Can't Ignore

Published on January 21, 2026 | 8 min read

As a senior data engineer, you've mastered SQL, conquered Python, and tamed distributed systems. But there's one skill that quietly determines whether your pipelines run smoothly or become maintenance nightmares: YAML proficiency. This humble configuration language is the backbone of modern data engineering workflows, and mastering it is non-negotiable.

Why YAML Matters in Data Engineering
YAML (YAML Ain't Markup Language) is a human-readable data serialization format that has become the de facto standard for configuration management. Unlike JSON or XML, YAML prioritizes readability while maintaining powerful data structuring capabilities.

Where You'll Use ...
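As a quick companion to the excerpt above, here is a minimal sketch of reading a pipeline config from YAML in Python with PyYAML. The file name, keys, and defaults are hypothetical examples for illustration only, not taken from the post.

    # Expected pipeline_config.yml (hypothetical):
    #   source:
    #     path: /data/raw/events
    #   ingestion:
    #     batch_size: 500
    import yaml

    with open("pipeline_config.yml") as f:
        config = yaml.safe_load(f)   # safe_load avoids executing arbitrary YAML tags

    source_path = config["source"]["path"]
    batch_size = config["ingestion"].get("batch_size", 1000)  # hypothetical default
    print(source_path, batch_size)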

Terraform for Senior Data Engineers

Senior Data Engineer • DevOps for Data

Terraform for Data Engineers: Why You Must Know It (and How You’ll Use It)

Terraform is not “infra-only.” For modern data platforms (Azure / Databricks / Snowflake / Fabric / AWS), Terraform becomes the safest way to build, version, review, and reproduce environments across Dev → Test → Prod.

Audience: Beginner → Advanced
Outcome: Practical usage + interview-ready
Includes: 10 most-used commands/scripts
Includes: STAR interview Q&A

Contents
TL;DR
What Terraform is (in one minute)
Why it matters for Data Engineers
Where Terraform is useful in a Data Engineer’s life
Practical patterns you should follow
10 most used Terraform commands/scripts
Interview questions + crisp STAR answers
Quick checklist for “Terraform-ready” Data Engineers

TL;DR
Terraform = Infrastr...

PySpark in 7 Days (bonus2)

PySpark Cheat Sheet – Data Engineer Edition

TL;DR
• SparkSession is the entry point
• Transformations are lazy, actions trigger execution
• Avoid shuffles, prefer broadcast joins
• Use Parquet/Delta, not CSV
• Window functions = “Top N per group”
• repartition increases, coalesce reduces partitions
• Always think: DAG → stages → tasks

1. Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("AppName") \
    .getOrCreate()

2. Reading Data
Format    Example
CSV       spark.read.option("header", "true").csv(path)
JSON      spark.read.json(path)
Parquet   spark.read.parquet(path)
Delta     spark.read.format("delta").load(path)

3. Writing Data
df.write.mode("overwrite").parquet(path)
df.write.partitionBy("date").parquet(path)

4. DataFrame Basics
df.show()
df.printSchema()
df.count()
df.columns

5. Select / Filter / withColumn
from pyspark.sql.functions import col
df.s...
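The TL;DR above mentions broadcast joins and “Top N per group” window functions but the excerpt cuts off before showing them. Here is a minimal sketch of both, assuming hypothetical DataFrames (orders_df, country_dim) and column names (country_code, amount) that are not from the post.

    from pyspark.sql import functions as F, Window
    from pyspark.sql.functions import broadcast

    # Broadcast join: ship the small dimension table to every executor
    # so the large fact table is not shuffled.
    enriched_df = orders_df.join(broadcast(country_dim), on="country_code", how="left")

    # Top N per group: rank rows within each country by amount, keep the top 3.
    w = Window.partitionBy("country_code").orderBy(F.col("amount").desc())
    top3_df = (enriched_df
               .withColumn("rn", F.row_number().over(w))
               .filter(F.col("rn") <= 3)
               .drop("rn"))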

PySpark in 7 Days (bonus1)

PySpark – Advanced Topics (Data Engineer Perspective)

This is the final and advanced-level post of the Learn PySpark series. You have already covered Beginner → Intermediate. This post focuses on production-grade, interview-level, real-world PySpark expected from a Senior Data Engineer.

TL;DR – What You Will Learn
• How Spark actually executes your code (DAG, stages, tasks)
• Catalyst Optimizer & Tungsten engine (why Spark is fast)
• Advanced joins, skew handling, and salting
• Incremental processing & watermarking
• Delta Lake, MERGE, SCD patterns
• Structured Streaming fundamentals
• Error handling, idempotency, and production best practices
• How to explain PySpark architecture in interviews

1. How Spark Really Executes Your Code (DAG)
Every PySpark job is converted into a Directed Acyclic Graph (DAG). Understanding this separates coders from data engineers.
Transformations → Lazy (select, filter, join)
Actions → Trigger execution (...
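The TL;DR lists skew handling and salting, which the excerpt does not reach. Below is a minimal key-salting sketch of the general technique, assuming an existing SparkSession named spark and hypothetical DataFrames fact_df and dim_df joined on a hypothetical customer_id key; SALT_BUCKETS is an illustrative choice, not the post's value.

    from pyspark.sql import functions as F

    SALT_BUCKETS = 8  # assumption: tune to the degree of skew

    # Spread hot keys on the large side across several salted keys.
    fact_salted = fact_df.withColumn(
        "salt", (F.rand() * SALT_BUCKETS).cast("int")
    )

    # Replicate each dimension row once per salt value so every salted key matches.
    salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
    dim_salted = dim_df.crossJoin(salts)

    joined = fact_salted.join(dim_salted, on=["customer_id", "salt"]).drop("salt")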

PySpark in 7 Days (Day 7)

Learn PySpark in 7 Days – Day 7

Welcome to Day 7, the most important day of this series. Today, we bring everything together and build a real-world, end-to-end PySpark ETL pipeline using best practices expected from a professional data engineer.

Day 7 Focus:
• End-to-end PySpark ETL pipeline
• Bronze → Silver → Gold architecture
• Performance-aware transformations
• Interview-ready explanation

Real-World Scenario
We receive daily employee data as CSV files. Our task is to:
Ingest raw data (Bronze)
Clean and transform data (Silver)
Create analytics-ready output (Gold)

Step 1: Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("EndToEnd-PySpark-Pipeline") \
    .getOrCreate()

Step 2: Bronze Layer – Read Raw Data
bronze_df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("/data/bronze/employees.cs...
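The excerpt stops at the Bronze read, so here is an illustrative continuation of the Bronze → Silver → Gold pattern it describes, not the post's actual code: the column names (emp_id, dept, salary) and output paths are hypothetical.

    from pyspark.sql import functions as F

    # Silver: deduplicate, enforce types, drop bad records.
    silver_df = (bronze_df
                 .dropDuplicates(["emp_id"])
                 .withColumn("salary", F.col("salary").cast("double"))
                 .filter(F.col("emp_id").isNotNull()))
    silver_df.write.mode("overwrite").parquet("/data/silver/employees")

    # Gold: analytics-ready aggregate, e.g. average salary per department.
    gold_df = silver_df.groupBy("dept").agg(F.avg("salary").alias("avg_salary"))
    gold_df.write.mode("overwrite").parquet("/data/gold/avg_salary_by_dept")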

PySpark in 7 Days (Day 6)

Learn PySpark in 7 Days – Day 6

Welcome to Day 6. Today we move from writing correct PySpark code to writing efficient and scalable PySpark code. This is where many data engineers fail interviews and production pipelines.

Day 6 Focus:
• repartition vs coalesce
• Understanding shuffle & data skew
• cache vs persist
• Efficient file formats (Parquet)

Why Performance Matters in Spark
Spark can process terabytes of data, but poor partitioning or unnecessary shuffles can make jobs slow and expensive. Performance tuning is a core data engineering skill.

Partitions in Spark (Concept)
Data in Spark is split into partitions
Each partition is processed in parallel
Too few partitions → underutilised cluster
Too many partitions → overhead

repartition()
df_repart = df.repartition(10)
Increases or decreases partitions
Causes a full shuffle
Use when you need even distribution

coalesce()
df_coalesce = df.coalesce(5)
Reduces number...
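The Day 6 focus list also covers cache vs persist and Parquet as an efficient file format, which the excerpt does not reach. Here is a short sketch of those two points, assuming a hypothetical DataFrame df and output path, not the post's own example.

    from pyspark import StorageLevel

    # Either cache (MEMORY_AND_DISK by default for DataFrames) ...
    df.cache()
    # ... or pick the storage level explicitly (use one or the other):
    # df.persist(StorageLevel.MEMORY_AND_DISK)

    df.count()  # an action materialises the cached data

    # Prefer columnar formats such as Parquet over CSV for downstream reads.
    df.write.mode("overwrite").parquet("/data/optimised/output")

    df.unpersist()  # release cached blocks when no longer needed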