
Master Databricks Asset Bundles Through Hands-On Practice

15 min read | 100% Practical Guide

Forget theory. Forget abstract examples. This is a hands-on, build-as-you-learn guide to mastering YAML through the lens of Databricks Asset Bundles (DABs). By the end of this post, you'll go from never writing YAML to confidently deploying production-grade data pipelines as code.

🎯 What You'll Build: A complete Databricks workspace configuration including jobs, clusters, notebooks, and permissions—all defined in YAML and deployable with a single command.

Level 0: YAML Basics BEGINNER

The Golden Rules

Rule #1:
YAML uses spaces for indentation, never tabs. Standard is 2 spaces per level.
Rule #2:
YAML is case-sensitive. Name and name are two different keys.
Rule #3:
Indentation = Structure. It defines parent-child relationships.
Rule #4:
Use # for comments. Comment everything!
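
To see all four rules at once, here is a tiny illustrative snippet (the file name and keys are made up for this example):

golden_rules.yml
# Rule 4: comments start with #
job:                  # a parent key
  name: daily_load    # indented 2 spaces, so it belongs to "job" (Rules 1 and 3)
  timeout: 3600
# Rule 2: "Name" and "name" would be two different keys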

Basic Data Types

basic_syntax.yml
# Strings - quotes optional unless special characters present
name: customer_etl_job
description: "Daily ETL: Extract-Transform-Load"

# Numbers
max_retries: 3
timeout_seconds: 3600

# Booleans
enabled: true
debug_mode: false

# Null values
backup_path: null
# or
backup_path: ~
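
One gotcha worth flagging before moving on: some unquoted values are not read as strings by YAML 1.1 parsers (still the default in many tools), so quote anything that must stay text. The snippet below is illustrative and not part of any bundle:

quoting_gotchas.yml
# Unquoted, many YAML 1.1 parsers do NOT read these as strings
country_code: NO        # parsed as the boolean false (the classic "Norway problem")
release: 1.10           # parsed as the number 1.1
# Quote them to keep the text exactly as written
country_code_str: "NO"
release_str: "1.10"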
        

Lists (Arrays)

lists.yml
# Method 1: Dash notation (most common)
environments:
  - dev
  - staging
  - prod

# Method 2: Inline notation
environments: [dev, staging, prod]

# Nested lists
tags:
  - category: finance
    priority: high
  - category: customer
    priority: medium
        

Dictionaries (Key-Value Pairs)

dictionaries.yml
# Simple dictionary
cluster_config:
  spark_version: "13.3.x-scala2.12"
  node_type: "Standard_DS3_v2"
  num_workers: 2

# Nested dictionaries
job_settings:
  schedule:
    cron: "0 2 * * *"
    timezone: "America/New_York"
  notifications:
    on_failure:
      - admin@company.com
    on_success:
      - team@company.com
        

✅ Checkpoint #1: Can you read this?

If you understand that notifications is a dictionary containing on_failure (a list) and that indentation shows schedule belongs to job_settings, you're ready for Level 1!

Level 1: Your First databricks.yml BEGINNER

Understanding the Structure

Every Databricks Asset Bundle starts with a root databricks.yml file. Think of it as the blueprint of your entire workspace.

databricks.yml - Minimal Version
# Bundle metadata
bundle:
  name: customer_analytics

# Target environments
targets:
  dev:
    # Workspace URL for dev environment
    workspace:
      host: https://adb-1234567890123456.7.azuredatabricks.net
        
💡 Pro Tip: The bundle section defines your project. The targets section defines where it deploys (dev, staging, prod).
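
For example, adding a second target is just another key under targets; everything else stays the same (the staging host below is a placeholder):

databricks.yml - Minimal Version, Two Targets
bundle:
  name: customer_analytics

targets:
  dev:
    workspace:
      host: https://adb-1234567890123456.7.azuredatabricks.net
  staging:
    workspace:
      host: https://adb-6543210987654321.7.azuredatabricks.net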

Adding Your First Job

databricks.yml - With a Simple Job
bundle:
  name: customer_analytics

# Define reusable variables
variables:
  warehouse_id:
    description: SQL Warehouse ID
    default: "abc123def456"

targets:
  dev:
    workspace:
      host: https://adb-1234567890123456.7.azuredatabricks.net
    
resources:
  jobs:
    # Job identifier (used internally)
    daily_customer_report:
      # Display name in Databricks UI
      name: "[DEV] Daily Customer Report"
      
      # Single task job
      tasks:
        - task_key: generate_report
          sql_task:
            warehouse_id: ${var.warehouse_id}
            query:
              query: |
                SELECT 
                  customer_id,
                  COUNT(*) as order_count,
                  SUM(amount) as total_spent
                FROM customers
                GROUP BY customer_id
          
      # Schedule (optional)
      schedule:
        quartz_cron_expression: "0 0 8 * * ?"
        timezone_id: "America/New_York"
        
💡 Understanding Multi-line Strings: The | symbol after query: means "preserve newlines". Your SQL will be formatted exactly as written. Use > if you want lines folded into one.
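
A quick side-by-side of the two block styles (values are illustrative):

multiline_strings.yml
# Literal block (|): newlines preserved
query_literal: |
  SELECT *
  FROM customers
# Loads as: "SELECT *\nFROM customers\n"

# Folded block (>): newlines folded into spaces
query_folded: >
  SELECT *
  FROM customers
# Loads as: "SELECT * FROM customers\n"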

Level 2: Variables and Environments INTERMEDIATE

The Power of Variables

Variables make your YAML reusable across environments. Instead of copying configs, you define once and reference everywhere.

databricks.yml - Environment Variables
bundle:
  name: customer_analytics

# Global variables (available to all targets)
variables:
  catalog_name:
    description: Unity Catalog name
  
  schema_name:
    description: Schema for tables
    default: bronze

targets:
  # Development environment
  dev:
    # Override variables per environment
    variables:
      catalog_name: dev_catalog
      schema_name: dev_bronze
    
    workspace:
      host: https://adb-dev.azuredatabricks.net
  
  # Production environment
  prod:
    variables:
      catalog_name: prod_catalog
      schema_name: prod_bronze
    
    workspace:
      host: https://adb-prod.azuredatabricks.net

resources:
  jobs:
    ingest_customers:
      name: "[${bundle.target}] Customer Ingestion"
      
      tasks:
        - task_key: bronze_load
          notebook_task:
            notebook_path: ./notebooks/bronze_ingestion.py
            base_parameters:
              # Reference variables with ${var.variable_name}
              catalog: ${var.catalog_name}
              schema: ${var.schema_name}
              table: customers
          
          new_cluster:
            spark_version: "13.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 2
        
🎯 What Just Happened?
  • ${bundle.target} automatically inserts "dev" or "prod" in the job name
  • ${var.catalog_name} resolves to "dev_catalog" in dev, "prod_catalog" in prod
  • Same YAML, different environments—zero code duplication!
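
Conceptually, deploying to the dev target renders the job above roughly like this (an illustrative sketch of the substitutions, not actual CLI output):

Rendered values in the dev target (illustrative)
name: "[dev] Customer Ingestion"
base_parameters:
  catalog: dev_catalog
  schema: dev_bronze
  table: customers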

Level 3: Multi-Task Jobs & Dependencies INTERMEDIATE

Building a Real ETL Pipeline

Real data pipelines have multiple steps with dependencies. YAML handles this elegantly with the depends_on key.

databricks.yml - Multi-Task Pipeline
resources:
  jobs:
    customer_etl_pipeline:
      name: "[${bundle.target}] Customer ETL Pipeline"
      
      # Job-level settings
      max_concurrent_runs: 1
      timeout_seconds: 7200
      
      tasks:
        # Task 1: Extract raw data
        - task_key: extract_raw
          notebook_task:
            notebook_path: ./notebooks/01_extract.py
            base_parameters:
              source_table: "raw.customers"
              target_table: "${var.catalog_name}.bronze.customers_raw"
          
          new_cluster:
            spark_version: "13.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 2
            spark_conf:
              "spark.databricks.delta.optimizeWrite.enabled": "true"
        
        # Task 2: Clean and validate (depends on Task 1)
        - task_key: clean_data
          depends_on:
            - task_key: extract_raw
          
          notebook_task:
            notebook_path: ./notebooks/02_clean.py
            base_parameters:
              source_table: "${var.catalog_name}.bronze.customers_raw"
              target_table: "${var.catalog_name}.silver.customers_clean"
          
          # Reuse the same cluster from Task 1
          existing_cluster_id: "{{tasks.extract_raw.cluster_instance_id}}"
        
        # Task 3: Business logic transformations
        - task_key: transform_business
          depends_on:
            - task_key: clean_data
          
          sql_task:
            warehouse_id: ${var.warehouse_id}
            query:
              query: |
                CREATE OR REPLACE TABLE ${var.catalog_name}.gold.customer_metrics AS
                SELECT 
                  customer_id,
                  customer_name,
                  COUNT(order_id) as lifetime_orders,
                  SUM(order_amount) as lifetime_value,
                  MAX(order_date) as last_order_date,
                  DATEDIFF(CURRENT_DATE(), MAX(order_date)) as days_since_last_order
                FROM ${var.catalog_name}.silver.customers_clean
                GROUP BY customer_id, customer_name
        
        # Task 4: Data quality checks (runs in parallel with Task 3)
        - task_key: quality_checks
          depends_on:
            - task_key: clean_data
          
          notebook_task:
            notebook_path: ./notebooks/03_quality_checks.py
            base_parameters:
              table_to_check: "${var.catalog_name}.silver.customers_clean"
          
          new_cluster:
            spark_version: "13.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 1
        
        # Task 5: Final notification (waits for all previous tasks)
        - task_key: notify_completion
          depends_on:
            - task_key: transform_business
            - task_key: quality_checks
          
          notebook_task:
            notebook_path: ./notebooks/04_notify.py
          
          existing_cluster_id: "{{tasks.quality_checks.cluster_instance_id}}"
      
      # Email notifications
      email_notifications:
        on_failure:
          - data-engineering@company.com
        on_success:
          - analytics-team@company.com
      
      # Job schedule
      schedule:
        quartz_cron_expression: "0 0 2 * * ?"  # 2 AM daily
        timezone_id: "America/New_York"
        pause_status: UNPAUSED
        

🧠 Challenge: Understand the Flow

Question: In the pipeline above, which tasks run in parallel?

Answer: Tasks 3 (transform_business) and 4 (quality_checks) run in parallel because they both depend only on Task 2 (clean_data) and don't depend on each other. Task 5 waits for both to complete.
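
Stripped down to just task keys and depends_on, the dependency graph of the pipeline above looks like this:

Dependency skeleton (same task keys as above)
tasks:
  - task_key: extract_raw
  - task_key: clean_data
    depends_on:
      - task_key: extract_raw
  - task_key: transform_business
    depends_on:
      - task_key: clean_data
  - task_key: quality_checks
    depends_on:
      - task_key: clean_data
  - task_key: notify_completion
    depends_on:
      - task_key: transform_business
      - task_key: quality_checks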

Level 4: YAML Anchors & Reusability ADVANCED

DRY Principle: Don't Repeat Yourself

YAML anchors (&) let you define something once and reference it multiple times with aliases (*). This is crucial for cluster configs that repeat across jobs.

databricks.yml - Using Anchors
bundle:
  name: customer_analytics

# Define reusable cluster configurations
variables:
  # Small cluster anchor
  small_cluster: &small_cluster
    spark_version: "13.3.x-scala2.12"
    node_type_id: "Standard_DS3_v2"
    num_workers: 2
    spark_conf:
      "spark.databricks.delta.optimizeWrite.enabled": "true"
      "spark.databricks.delta.autoCompact.enabled": "true"
  
  # Large cluster anchor
  large_cluster: &large_cluster
    spark_version: "13.3.x-scala2.12"
    node_type_id: "Standard_DS4_v2"
    num_workers: 8
    spark_conf:
      "spark.databricks.delta.optimizeWrite.enabled": "true"
      "spark.databricks.delta.autoCompact.enabled": "true"
      "spark.sql.adaptive.enabled": "true"

targets:
  dev:
    workspace:
      host: https://adb-dev.azuredatabricks.net

resources:
  jobs:
    # Job 1: Uses small cluster
    quick_report:
      name: "Quick Daily Report"
      tasks:
        - task_key: generate
          notebook_task:
            notebook_path: ./notebooks/quick_report.py
          # Reference the anchor with *
          new_cluster: *small_cluster
    
    # Job 2: Also uses small cluster
    validation_job:
      name: "Data Validation"
      tasks:
        - task_key: validate
          notebook_task:
            notebook_path: ./notebooks/validate.py
          new_cluster: *small_cluster
    
    # Job 3: Uses large cluster for heavy processing
    monthly_aggregation:
      name: "Monthly Aggregation"
      tasks:
        - task_key: aggregate
          notebook_task:
            notebook_path: ./notebooks/monthly_agg.py
          new_cluster: *large_cluster
        

Merging and Overriding with Anchors

You can extend anchors using merge keys (<<:) to inherit and override specific properties.

databricks.yml - Advanced Anchors
variables:
  # Base cluster configuration
  base_cluster: &base_cluster
    spark_version: "13.3.x-scala2.12"
    node_type_id: "Standard_DS3_v2"
    num_workers: 2
    spark_conf:
      "spark.databricks.delta.optimizeWrite.enabled": "true"

resources:
  jobs:
    custom_job:
      name: "Custom Processing Job"
      tasks:
        - task_key: process
          notebook_task:
            notebook_path: ./notebooks/process.py
          new_cluster:
            # Merge base configuration and override specific keys
            <<: *base_cluster
            num_workers: 4  # Override: use 4 workers instead of 2
            spark_conf:
              # Note: the merge is shallow, so redefining spark_conf replaces the
              # whole map; re-list base keys you want to keep, then add new ones
              "spark.databricks.delta.optimizeWrite.enabled": "true"
              "spark.sql.shuffle.partitions": "200"
        

✅ Checkpoint #4: Anchors Mastery

You now understand that &name creates a reusable anchor, *name references it, and <<: *name lets you inherit and override. This eliminates most of the repetition in real-world DAB configs!
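
One caveat: <<: performs a shallow merge, not a deep one. A minimal sketch (keys are made up) of what the parser sees after expansion:

merge_is_shallow.yml
base: &base
  num_workers: 2
  spark_conf:
    "spark.a": "1"

override:
  <<: *base            # brings in num_workers and spark_conf
  spark_conf:          # replaces the whole spark_conf map from base
    "spark.b": "2"     # result: only spark.b remains; spark.a is gone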

Level 5: Complete Production Example PRO

Real-World Databricks Asset Bundle

Let's put it all together: a production-ready configuration with multiple environments, shared resources, and best practices.

databricks.yml - Production Grade
bundle:
  name: customer_data_platform

# Include external YAML files for better organization
include:
  - resources/*.yml

# Global variables
variables:
  catalog_name:
    description: Unity Catalog name
  
  notification_email:
    description: Email for job notifications
    default: data-engineering@company.com
  
  # Referenced by the gold_aggregations sql_task below
  warehouse_id:
    description: SQL Warehouse ID
    default: "abc123def456"
  
  # Cluster configurations as anchors
  standard_cluster: &standard_cluster
    spark_version: "13.3.x-scala2.12"
    node_type_id: "Standard_DS3_v2"
    autoscale:
      min_workers: 2
      max_workers: 8
    spark_conf:
      "spark.databricks.delta.optimizeWrite.enabled": "true"
      "spark.databricks.delta.autoCompact.enabled": "true"
    azure_attributes:
      availability: "ON_DEMAND_AZURE"
      spot_bid_max_price: -1

# Target environments
targets:
  # Development
  dev:
    mode: development
    default: true
    
    variables:
      catalog_name: dev_catalog
      notification_email: dev-team@company.com
    
    workspace:
      host: https://adb-dev-12345.7.azuredatabricks.net
      root_path: ~/.bundle/${bundle.name}/${bundle.target}
  
  # Staging
  staging:
    mode: production
    
    variables:
      catalog_name: staging_catalog
    
    workspace:
      host: https://adb-staging-12345.7.azuredatabricks.net
      root_path: /Shared/.bundle/${bundle.name}/${bundle.target}
  
  # Production
  prod:
    mode: production
    
    variables:
      catalog_name: prod_catalog
    
    workspace:
      host: https://adb-prod-12345.7.azuredatabricks.net
      root_path: /Shared/.bundle/${bundle.name}/${bundle.target}
    
    # Production-specific permissions
    permissions:
      - level: CAN_VIEW
        group_name: "data-analysts"

# Resources
resources:
  jobs:
    # Daily ETL Pipeline
    daily_etl:
      name: "[${bundle.target}] Daily Customer ETL"
      
      job_clusters:
        # Define job-level clusters that tasks can share
        - job_cluster_key: etl_cluster
          new_cluster:
            <<: *standard_cluster
            spark_env_vars:
              ENVIRONMENT: ${bundle.target}
      
      tasks:
        # Bronze layer ingestion
        - task_key: bronze_ingestion
          job_cluster_key: etl_cluster
          
          notebook_task:
            notebook_path: ./pipelines/bronze/ingest_customers.py
            base_parameters:
              catalog: ${var.catalog_name}
              schema: bronze
              source_format: parquet
          
          libraries:
            - pypi:
                package: "pandas==2.0.3"
            - pypi:
                package: "pyarrow==12.0.1"
        
        # Silver layer transformation
        - task_key: silver_transformation
          depends_on:
            - task_key: bronze_ingestion
          
          job_cluster_key: etl_cluster
          
          notebook_task:
            notebook_path: ./pipelines/silver/transform_customers.py
            base_parameters:
              catalog: ${var.catalog_name}
              source_schema: bronze
              target_schema: silver
        
        # Gold layer aggregations
        - task_key: gold_aggregations
          depends_on:
            - task_key: silver_transformation
          
          sql_task:
            warehouse_id: ${var.warehouse_id}
            query:
              query: |
                -- Customer lifetime value
                CREATE OR REPLACE TABLE ${var.catalog_name}.gold.customer_ltv AS
                SELECT 
                  c.customer_id,
                  c.customer_name,
                  c.customer_segment,
                  COUNT(DISTINCT o.order_id) as total_orders,
                  SUM(o.order_amount) as lifetime_value,
                  AVG(o.order_amount) as avg_order_value,
                  MIN(o.order_date) as first_order_date,
                  MAX(o.order_date) as last_order_date,
                  DATEDIFF(CURRENT_DATE(), MAX(o.order_date)) as days_since_last_order,
                  CASE 
                    WHEN DATEDIFF(CURRENT_DATE(), MAX(o.order_date)) <= 30 THEN 'Active'
                    WHEN DATEDIFF(CURRENT_DATE(), MAX(o.order_date)) <= 90 THEN 'At Risk'
                    ELSE 'Churned'
                  END as customer_status
                FROM ${var.catalog_name}.silver.customers c
                LEFT JOIN ${var.catalog_name}.silver.orders o
                  ON c.customer_id = o.customer_id
                GROUP BY c.customer_id, c.customer_name, c.customer_segment
        
        # Data quality validation
        - task_key: quality_validation
          depends_on:
            - task_key: gold_aggregations
          
          job_cluster_key: etl_cluster
          
          notebook_task:
            notebook_path: ./pipelines/validation/quality_checks.py
            base_parameters:
              catalog: ${var.catalog_name}
              tables_to_check: "bronze.customers,silver.customers,gold.customer_ltv"
      
      # Scheduling
      schedule:
        quartz_cron_expression: "0 0 3 * * ?"  # 3 AM daily
        timezone_id: "America/New_York"
        pause_status: UNPAUSED
      
      # Notifications
      email_notifications:
        on_start:
          - ${var.notification_email}
        on_success:
          - ${var.notification_email}
        on_failure:
          - ${var.notification_email}
          - oncall@company.com
        no_alert_for_skipped_runs: true
      
      # Retry policy
      max_retries: 2
      retry_on_timeout: true
      
      # Timeout (2 hours)
      timeout_seconds: 7200
      
      # Tags for cost tracking
      tags:
        Environment: ${bundle.target}
        Team: DataEngineering
        CostCenter: Analytics
        Project: CustomerDataPlatform
    
    # Weekly aggregation job
    weekly_aggregation:
      name: "[${bundle.target}] Weekly Aggregations"
      
      tasks:
        - task_key: weekly_agg
          notebook_task:
            notebook_path: ./pipelines/gold/weekly_aggregations.py
            base_parameters:
              catalog: ${var.catalog_name}
          
          new_cluster:
            <<: *standard_cluster
            num_workers: 4  # Override for larger job
      
      schedule:
        quartz_cron_expression: "0 0 4 ? * SUN"  # Sunday 4 AM
        timezone_id: "America/New_York"
      
      email_notifications:
        on_failure:
          - ${var.notification_email}

  # SQL Warehouses
  sql_warehouses:
    analytics_warehouse:
      name: "[${bundle.target}] Analytics Warehouse"
      cluster_size: "Medium"
      min_num_clusters: 1
      max_num_clusters: 3
      auto_stop_mins: 15
      enable_serverless_compute: true
      
      tags:
        custom_tags:
          - key: Environment
            value: ${bundle.target}
          - key: Team
            value: Analytics
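
Because this bundle declares include: resources/*.yml, resource definitions can also live in separate files that get merged into the bundle. A hypothetical resources/weekly_jobs.yml (file name and job are made up) would contain just its own resources mapping:

resources/weekly_jobs.yml (hypothetical)
resources:
  jobs:
    weekly_cleanup:
      name: "[${bundle.target}] Weekly Cleanup"
      tasks:
        - task_key: cleanup
          notebook_task:
            notebook_path: ./pipelines/maintenance/cleanup.py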
        

Best Practices Summary

Practice      | Why It Matters                                   | Example
Use Variables | Environment-specific configs without duplication | ${var.catalog_name}
Use Anchors   | Reusable cluster configs reduce errors           | &standard_cluster
