Master Databricks Asset Bundles Through Hands-On Practice

15 min read | 100% Practical Guide

Forget theory. Forget abstract examples. This is a hands-on, build-as-you-learn guide to mastering YAML through the lens of Databricks Asset Bundles (DABs). By the end of this post, you'll go from never writing YAML to confidently deploying production-grade data pipelines as code.

🎯 What You'll Build: A complete Databricks workspace configuration including jobs, clusters, notebooks, and permissions—all defined in YAML and deployable with a single command.

Level 0: YAML Basics BEGINNER

The Golden Rules

Rule #1:
YAML uses spaces for indentation, never tabs. Standard is 2 spaces per level.
Rule #2:
YAML is case-sensitive: Name and name are two different keys.
Rule #3:
Indentation = Structure. It defines parent-child relationships (see the sketch just after these rules).
Rule #4:
Use # for comments. Comment everything!
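
Here is Rule #3 in a single glance; a minimal sketch (the URL is a placeholder):

indentation.yml
# 'host' and 'root_path' are children of 'workspace':
# each is indented exactly 2 spaces under it
workspace:
  host: https://adb-example.azuredatabricks.net
  root_path: /Shared/.bundle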

Basic Data Types

basic_syntax.yml
# Strings - quotes optional unless special characters present
name: customer_etl_job
description: "Daily ETL: Extract-Transform-Load"

# Numbers
max_retries: 3
timeout_seconds: 3600

# Booleans
enabled: true
debug_mode: false

# Null values
backup_path: null
# or
backup_path: ~
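
One classic pitfall before moving on: many YAML 1.1 parsers coerce unquoted values into types you may not intend, so quote anything ambiguous. A minimal sketch:

quoting_gotchas.yml
# Unquoted scalars can be coerced by YAML 1.1 parsers
country: no            # parsed as the boolean false, not the string "no"
country_safe: "no"     # quoted: stays the string "no"

patch_version: 1.10    # parsed as the float 1.1
patch_safe: "1.10"     # quoted: stays the string "1.10"

# Rule of thumb: when in doubt, quote it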
        

Lists (Arrays)

lists.yml
# Method 1: Dash notation (most common)
environments:
  - dev
  - staging
  - prod

# Method 2: Inline (flow) notation - an alternative to Method 1
# (a real file would define the environments key only once)
environments: [dev, staging, prod]

# A list of dictionaries (each - item is a key-value mapping)
tags:
  - category: finance
    priority: high
  - category: customer
    priority: medium
        

Dictionaries (Key-Value Pairs)

dictionaries.yml
# Simple dictionary
cluster_config:
  spark_version: "13.3.x-scala2.12"
  node_type_id: "Standard_DS3_v2"
  num_workers: 2

# Nested dictionaries
job_settings:
  schedule:
    cron: "0 2 * * *"
    timezone: "America/New_York"
  notifications:
    on_failure:
      - admin@company.com
    on_success:
      - team@company.com
        

✅ Checkpoint #1: Can you read this?

If you understand that notifications is a dictionary containing on_failure (a list) and that indentation shows schedule belongs to job_settings, you're ready for Level 1!

Level 1: Your First databricks.yml BEGINNER

Understanding the Structure

Every Databricks Asset Bundle starts with a root databricks.yml file. Think of it as the blueprint of your entire workspace.

databricks.yml - Minimal Version
# Bundle metadata
bundle:
  name: customer_analytics

# Target environments
targets:
  dev:
    # Workspace URL for dev environment
    workspace:
      host: https://adb-1234567890123456.7.azuredatabricks.net
        
💡 Pro Tip: The bundle section defines your project. The targets section defines where it deploys (dev, staging, prod). From the bundle root, databricks bundle validate checks your YAML against the schema, and databricks bundle deploy -t dev is the "single command" that deploys it.

Adding Your First Job

databricks.yml - With a Simple Job
bundle:
  name: customer_analytics

# Define reusable variables
variables:
  warehouse_id:
    description: SQL Warehouse ID
    default: "abc123def456"

targets:
  dev:
    workspace:
      host: https://adb-1234567890123456.7.azuredatabricks.net
    
resources:
  jobs:
    # Job identifier (used internally)
    daily_customer_report:
      # Display name in Databricks UI
      name: "[DEV] Daily Customer Report"
      
      # Single task job
      tasks:
        - task_key: generate_report
          sql_task:
            warehouse_id: ${var.warehouse_id}
            query:
              query: |
                SELECT 
                  customer_id,
                  COUNT(*) as order_count,
                  SUM(amount) as total_spent
                FROM customers
                GROUP BY customer_id
          
      # Schedule (optional)
      schedule:
        quartz_cron_expression: "0 0 8 * * ?"
        timezone_id: "America/New_York"
        
💡 Understanding Multi-line Strings: The | symbol after query: means "preserve newlines", so your SQL is kept exactly as written. Use > if you want lines folded into one. One caveat: depending on your CLI version, sql_task may expect a saved query referenced by query_id (or a .sql file via file.path) rather than inline SQL, so run databricks bundle validate to confirm your schema accepts this shape.
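
Here is a side-by-side sketch of the two block-scalar styles:

block_scalars.yml
# '|' (literal): newlines preserved exactly as written
sql_literal: |
  SELECT *
  FROM customers

# '>' (folded): newlines folded into spaces, yielding one long line
description_folded: >
  This description is written across
  several lines but loads as a single string.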

Level 2: Variables and Environments INTERMEDIATE

The Power of Variables

Variables make your YAML reusable across environments. Instead of copying configs, you define once and reference everywhere.

databricks.yml - Environment Variables
bundle:
  name: customer_analytics

# Global variables (available to all targets)
variables:
  catalog_name:
    description: Unity Catalog name
  
  schema_name:
    description: Schema for tables
    default: bronze

targets:
  # Development environment
  dev:
    # Override variables per environment
    variables:
      catalog_name: dev_catalog
      schema_name: dev_bronze
    
    workspace:
      host: https://adb-dev.azuredatabricks.net
  
  # Production environment
  prod:
    variables:
      catalog_name: prod_catalog
      schema_name: prod_bronze
    
    workspace:
      host: https://adb-prod.azuredatabricks.net

resources:
  jobs:
    ingest_customers:
      name: "[${bundle.target}] Customer Ingestion"
      
      tasks:
        - task_key: bronze_load
          notebook_task:
            notebook_path: ./notebooks/bronze_ingestion.py
            base_parameters:
              # Reference variables with ${var.variable_name}
              catalog: ${var.catalog_name}
              schema: ${var.schema_name}
              table: customers
          
          new_cluster:
            spark_version: "13.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 2
        
🎯 What Just Happened?
  • ${bundle.target} automatically inserts "dev" or "prod" in the job name
  • ${var.catalog_name} resolves to "dev_catalog" in dev, "prod_catalog" in prod
  • Same YAML, different environments—zero code duplication!
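
Variables aren't the only override mechanism: a target can also override parts of a resource definition directly. A minimal sketch, reusing the job above (the prod-only schedule is an assumption for illustration):

databricks.yml - Target Overrides (sketch)
targets:
  prod:
    variables:
      catalog_name: prod_catalog
    resources:
      jobs:
        ingest_customers:
          # Prod-only setting: run nightly at 1 AM instead of manually
          schedule:
            quartz_cron_expression: "0 0 1 * * ?"
            timezone_id: "America/New_York"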

Level 3: Multi-Task Jobs & Dependencies INTERMEDIATE

Building a Real ETL Pipeline

Real data pipelines have multiple steps with dependencies. YAML handles this elegantly with the depends_on key.

databricks.yml - Multi-Task Pipeline
resources:
  jobs:
    customer_etl_pipeline:
      name: "[${bundle.target}] Customer ETL Pipeline"
      
      # Job-level settings
      max_concurrent_runs: 1
      timeout_seconds: 7200

      # A shared job cluster that several tasks can reuse
      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            spark_version: "13.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 2
            spark_conf:
              "spark.databricks.delta.optimizeWrite.enabled": "true"
      
      tasks:
        # Task 1: Extract raw data
        - task_key: extract_raw
          notebook_task:
            notebook_path: ./notebooks/01_extract.py
            base_parameters:
              source_table: "raw.customers"
              target_table: "${var.catalog_name}.bronze.customers_raw"
          
          # Runs on the shared job cluster defined at the job level
          job_cluster_key: etl_cluster
        
        # Task 2: Clean and validate (depends on Task 1)
        - task_key: clean_data
          depends_on:
            - task_key: extract_raw
          
          notebook_task:
            notebook_path: ./notebooks/02_clean.py
            base_parameters:
              source_table: "${var.catalog_name}.bronze.customers_raw"
              target_table: "${var.catalog_name}.silver.customers_clean"
          
          # Reuse the same shared job cluster as Task 1
          job_cluster_key: etl_cluster
        
        # Task 3: Business logic transformations
        - task_key: transform_business
          depends_on:
            - task_key: clean_data
          
          sql_task:
            warehouse_id: ${var.warehouse_id}
            query:
              query: |
                CREATE OR REPLACE TABLE ${var.catalog_name}.gold.customer_metrics AS
                SELECT 
                  customer_id,
                  customer_name,
                  COUNT(order_id) as lifetime_orders,
                  SUM(order_amount) as lifetime_value,
                  MAX(order_date) as last_order_date,
                  DATEDIFF(CURRENT_DATE(), MAX(order_date)) as days_since_last_order
                FROM ${var.catalog_name}.silver.customers_clean
                GROUP BY customer_id, customer_name
        
        # Task 4: Data quality checks (runs in parallel with Task 3)
        - task_key: quality_checks
          depends_on:
            - task_key: clean_data
          
          notebook_task:
            notebook_path: ./notebooks/03_quality_checks.py
            base_parameters:
              table_to_check: "${var.catalog_name}.silver.customers_clean"
          
          new_cluster:
            spark_version: "13.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 1
        
        # Task 5: Final notification (waits for all previous tasks)
        - task_key: notify_completion
          depends_on:
            - task_key: transform_business
            - task_key: quality_checks
          
          notebook_task:
            notebook_path: ./notebooks/04_notify.py
          
          job_cluster_key: etl_cluster
      
      # Email notifications
      email_notifications:
        on_failure:
          - data-engineering@company.com
        on_success:
          - analytics-team@company.com
      
      # Job schedule
      schedule:
        quartz_cron_expression: "0 0 2 * * ?"  # 2 AM daily
        timezone_id: "America/New_York"
        pause_status: UNPAUSED
        

🧠 Challenge: Understand the Flow

Question: In the pipeline above, which tasks run in parallel?

Answer: Tasks 3 (transform_business) and 4 (quality_checks) run in parallel because they both depend only on Task 2 (clean_data) and don't depend on each other. Task 5 waits for both to complete.
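
One refinement worth knowing, supported by recent versions of the Jobs API: by default a task is skipped when any of its dependencies fails, so a notification task like Task 5 would never run on failure. The run_if key changes that; a minimal sketch of Task 5 with it (everything except run_if is from the pipeline above):

databricks.yml - run_if (sketch)
        - task_key: notify_completion
          depends_on:
            - task_key: transform_business
            - task_key: quality_checks
          # ALL_DONE: run once all dependencies finish, succeeded or failed
          run_if: ALL_DONE

          notebook_task:
            notebook_path: ./notebooks/04_notify.py

          job_cluster_key: etl_cluster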

Level 4: YAML Anchors & Reusability ADVANCED

DRY Principle: Don't Repeat Yourself

YAML anchors (&) let you define a block once and reference it multiple times with aliases (*). Anchors are resolved by the YAML parser itself, so they work only within a single file. This is crucial for cluster configs that repeat across jobs.

databricks.yml - Using Anchors
bundle:
  name: customer_analytics

# Define reusable cluster configurations. Anchors are plain YAML, but the
# bundle schema expects real variable definitions here, so the anchors are
# attached to complex-variable defaults (complex variables need a recent
# Databricks CLI version).
variables:
  # Small cluster anchor
  small_cluster:
    description: Default small cluster
    type: complex
    default: &small_cluster
      spark_version: "13.3.x-scala2.12"
      node_type_id: "Standard_DS3_v2"
      num_workers: 2
      spark_conf:
        "spark.databricks.delta.optimizeWrite.enabled": "true"
        "spark.databricks.delta.autoCompact.enabled": "true"

  # Large cluster anchor
  large_cluster:
    description: Default large cluster
    type: complex
    default: &large_cluster
      spark_version: "13.3.x-scala2.12"
      node_type_id: "Standard_DS4_v2"
      num_workers: 8
      spark_conf:
        "spark.databricks.delta.optimizeWrite.enabled": "true"
        "spark.databricks.delta.autoCompact.enabled": "true"
        "spark.sql.adaptive.enabled": "true"

targets:
  dev:
    workspace:
      host: https://adb-dev.azuredatabricks.net

resources:
  jobs:
    # Job 1: Uses small cluster
    quick_report:
      name: "Quick Daily Report"
      tasks:
        - task_key: generate
          notebook_task:
            notebook_path: ./notebooks/quick_report.py
          # Reference the anchor with *
          new_cluster: *small_cluster
    
    # Job 2: Also uses small cluster
    validation_job:
      name: "Data Validation"
      tasks:
        - task_key: validate
          notebook_task:
            notebook_path: ./notebooks/validate.py
          new_cluster: *small_cluster
    
    # Job 3: Uses large cluster for heavy processing
    monthly_aggregation:
      name: "Monthly Aggregation"
      tasks:
        - task_key: aggregate
          notebook_task:
            notebook_path: ./notebooks/monthly_agg.py
          new_cluster: *large_cluster
        

Merging and Overriding with Anchors

You can extend anchors using the merge key (<<:) to inherit a mapping and override specific properties. Keep in mind the merge is shallow: any nested mapping you redefine replaces the inherited one wholesale.

databricks.yml - Advanced Anchors
variables:
  # Base cluster configuration (anchor attached to a complex-variable default)
  base_cluster:
    description: Base cluster definition
    type: complex
    default: &base_cluster
      spark_version: "13.3.x-scala2.12"
      node_type_id: "Standard_DS3_v2"
      num_workers: 2
      spark_conf:
        "spark.databricks.delta.optimizeWrite.enabled": "true"

resources:
  jobs:
    custom_job:
      name: "Custom Processing Job"
      tasks:
        - task_key: process
          notebook_task:
            notebook_path: ./notebooks/process.py
          new_cluster:
            # Merge base configuration and override specific keys
            <<: *base_cluster
            num_workers: 4  # Override: use 4 workers instead of 2
            spark_conf:
              # Shallow merge: redefining spark_conf replaces the whole map,
              # which is why the inherited key must be repeated here
              "spark.databricks.delta.optimizeWrite.enabled": "true"
              "spark.sql.shuffle.partitions": "200"
        

✅ Checkpoint #4: Anchors Mastery

You now understand that &name creates a reusable anchor, *name references it, and <<: *name lets you inherit and override. This eliminates most of the repetition in real-world DAB configs!
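
To make the shallow-merge caveat concrete, here is a minimal sketch you can run through any YAML parser:

merge_pitfall.yml
defaults: &defaults
  num_workers: 2
  spark_conf:
    "a": "1"
    "b": "2"

override:
  <<: *defaults    # inherits num_workers: 2
  spark_conf:
    "c": "3"

# Result: override.spark_conf is {"c": "3"} - "a" and "b" are gone,
# because the merge key replaces nested mappings rather than deep-merging them.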

Level 5: Complete Production Example PRO

Real-World Databricks Asset Bundle

Let's put it all together: a production-ready configuration with multiple environments, shared resources, and best practices.

databricks.yml - Production Grade
bundle:
  name: customer_data_platform

# Include external YAML files for better organization
# (note: YAML anchors do NOT work across included files)
include:
  - resources/*.yml

# Global variables
variables:
  catalog_name:
    description: Unity Catalog name

  notification_email:
    description: Email for job notifications
    default: data-engineering@company.com

  warehouse_id:
    description: SQL Warehouse ID used by sql_task steps below
    default: "abc123def456"  # replace with your warehouse ID

  # Cluster configuration as a complex variable hosting the anchor
  # (complex variables need a recent Databricks CLI version)
  standard_cluster:
    description: Standard autoscaling ETL cluster
    type: complex
    default: &standard_cluster
      spark_version: "13.3.x-scala2.12"
      node_type_id: "Standard_DS3_v2"
      autoscale:
        min_workers: 2
        max_workers: 8
      spark_conf:
        "spark.databricks.delta.optimizeWrite.enabled": "true"
        "spark.databricks.delta.autoCompact.enabled": "true"
      azure_attributes:
        availability: "ON_DEMAND_AZURE"
        spot_bid_max_price: -1

# Target environments
targets:
  # Development
  dev:
    mode: development
    default: true
    
    variables:
      catalog_name: dev_catalog
      notification_email: dev-team@company.com
    
    workspace:
      host: https://adb-dev-12345.7.azuredatabricks.net
      root_path: ~/.bundle/${bundle.name}/${bundle.target}
  
  # Staging
  staging:
    mode: production
    
    variables:
      catalog_name: staging_catalog
    
    workspace:
      host: https://adb-staging-12345.7.azuredatabricks.net
      root_path: /Shared/.bundle/${bundle.name}/${bundle.target}
  
  # Production
  prod:
    mode: production
    
    variables:
      catalog_name: prod_catalog
    
    workspace:
      host: https://adb-prod-12345.7.azuredatabricks.net
      root_path: /Shared/.bundle/${bundle.name}/${bundle.target}
    
    # Production-specific permissions
    permissions:
      - level: CAN_VIEW
        group_name: "data-analysts"

# Resources
resources:
  jobs:
    # Daily ETL Pipeline
    daily_etl:
      name: "[${bundle.target}] Daily Customer ETL"
      
      job_clusters:
        # Define job-level clusters that tasks can share
        - job_cluster_key: etl_cluster
          new_cluster:
            <<: *standard_cluster
            spark_env_vars:
              ENVIRONMENT: ${bundle.target}
      
      tasks:
        # Bronze layer ingestion
        - task_key: bronze_ingestion
          job_cluster_key: etl_cluster

          # Retry policy (the Jobs API configures retries per task, not per job)
          max_retries: 2
          retry_on_timeout: true
          
          notebook_task:
            notebook_path: ./pipelines/bronze/ingest_customers.py
            base_parameters:
              catalog: ${var.catalog_name}
              schema: bronze
              source_format: parquet
          
          libraries:
            - pypi:
                package: "pandas==2.0.3"
            - pypi:
                package: "pyarrow==12.0.1"
        
        # Silver layer transformation
        - task_key: silver_transformation
          depends_on:
            - task_key: bronze_ingestion
          
          job_cluster_key: etl_cluster
          
          notebook_task:
            notebook_path: ./pipelines/silver/transform_customers.py
            base_parameters:
              catalog: ${var.catalog_name}
              source_schema: bronze
              target_schema: silver
        
        # Gold layer aggregations
        - task_key: gold_aggregations
          depends_on:
            - task_key: silver_transformation
          
          sql_task:
            warehouse_id: ${var.warehouse_id}
            query:
              query: |
                -- Customer lifetime value
                CREATE OR REPLACE TABLE ${var.catalog_name}.gold.customer_ltv AS
                SELECT 
                  c.customer_id,
                  c.customer_name,
                  c.customer_segment,
                  COUNT(DISTINCT o.order_id) as total_orders,
                  SUM(o.order_amount) as lifetime_value,
                  AVG(o.order_amount) as avg_order_value,
                  MIN(o.order_date) as first_order_date,
                  MAX(o.order_date) as last_order_date,
                  DATEDIFF(CURRENT_DATE(), MAX(o.order_date)) as days_since_last_order,
                  CASE 
                    WHEN DATEDIFF(CURRENT_DATE(), MAX(o.order_date)) <= 30 THEN 'Active'
                    WHEN DATEDIFF(CURRENT_DATE(), MAX(o.order_date)) <= 90 THEN 'At Risk'
                    ELSE 'Churned'
                  END as customer_status
                FROM ${var.catalog_name}.silver.customers c
                LEFT JOIN ${var.catalog_name}.silver.orders o
                  ON c.customer_id = o.customer_id
                GROUP BY c.customer_id, c.customer_name, c.customer_segment
        
        # Data quality validation
        - task_key: quality_validation
          depends_on:
            - task_key: gold_aggregations
          
          job_cluster_key: etl_cluster
          
          notebook_task:
            notebook_path: ./pipelines/validation/quality_checks.py
            base_parameters:
              catalog: ${var.catalog_name}
              tables_to_check: "bronze.customers,silver.customers,gold.customer_ltv"
      
      # Scheduling
      schedule:
        quartz_cron_expression: "0 0 3 * * ?"  # 3 AM daily
        timezone_id: "America/New_York"
        pause_status: UNPAUSED
      
      # Notifications
      email_notifications:
        on_start:
          - ${var.notification_email}
        on_success:
          - ${var.notification_email}
        on_failure:
          - ${var.notification_email}
          - oncall@company.com
        no_alert_for_skipped_runs: true
      
      # Timeout (2 hours)
      timeout_seconds: 7200
      
      # Tags for cost tracking
      tags:
        Environment: ${bundle.target}
        Team: DataEngineering
        CostCenter: Analytics
        Project: CustomerDataPlatform
    
    # Weekly aggregation job
    weekly_aggregation:
      name: "[${bundle.target}] Weekly Aggregations"
      
      tasks:
        - task_key: weekly_agg
          notebook_task:
            notebook_path: ./pipelines/gold/weekly_aggregations.py
            base_parameters:
              catalog: ${var.catalog_name}
          
          new_cluster:
            <<: *standard_cluster
            # Override the autoscale range for this heavier weekly job
            # (num_workers and autoscale are mutually exclusive, so override
            # autoscale rather than adding num_workers on top of it)
            autoscale:
              min_workers: 4
              max_workers: 12
      
      schedule:
        quartz_cron_expression: "0 0 4 ? * SUN"  # Sunday 4 AM
        timezone_id: "America/New_York"
      
      email_notifications:
        on_failure:
          - ${var.notification_email}

  # SQL Warehouses
  # As of this writing, SQL warehouses are not a supported bundle resource
  # type, so create the warehouse outside the bundle (UI, API, or Terraform)
  # and reference it by ID. With recent CLI versions you can also resolve an
  # existing warehouse by name using a lookup variable, replacing the
  # warehouse_id default defined above:
  #
  #   variables:
  #     warehouse_id:
  #       description: Analytics warehouse for SQL tasks
  #       lookup:
  #         warehouse: "Analytics Warehouse"
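
Finally, the include: directive near the top of this file assumes a project layout along these lines (a suggested structure, not a requirement; file names under resources/ are illustrative):

Project layout (suggested)
customer_data_platform/
├── databricks.yml              # the root bundle file from this post
├── resources/                  # extra resource YAML pulled in via include
│   └── jobs_extra.yml
└── pipelines/
    ├── bronze/ingest_customers.py
    ├── silver/transform_customers.py
    ├── gold/weekly_aggregations.py
    └── validation/quality_checks.py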
        

Best Practices Summary

  • Use Variables: environment-specific configs without duplication. Example: ${var.catalog_name}
  • Use Anchors: reusable cluster configs reduce errors. Example: &standard_cluster
