Forget theory. Forget abstract examples. This is a hands-on, build-as-you-learn guide to mastering YAML through the lens of Databricks Asset Bundles (DABs). By the end of this post, you'll go from never writing YAML to confidently deploying production-grade data pipelines as code.
Level 0: YAML Basics BEGINNER
The Golden Rules
- Indentation defines structure: use spaces, never tabs
- YAML is case-sensitive: Name ≠ name
- # starts a comment. Comment everything!
Basic Data Types
# Strings - quotes optional unless special characters present
name: customer_etl_job
description: "Daily ETL: Extract-Transform-Load"
# Numbers
max_retries: 3
timeout_seconds: 3600
# Booleans
enabled: true
debug_mode: false
# Null values
backup_path: null
# or
backup_path: ~
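One gotcha worth flagging before moving on: unquoted values are type-inferred by the parser, so anything that looks like a number or boolean becomes one. The keys below are invented purely to illustrate the point.

# Unquoted values get their type guessed by the parser
timeout: 3600              # number
enabled: yes               # many parsers (YAML 1.1) read this as boolean true
# Quote values that must stay strings
spark_version: "13.3.x-scala2.12"
build_id: "0042"           # quoting preserves the leading zeros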
Lists (Arrays)
# Method 1: Dash notation (most common)
environments:
  - dev
  - staging
  - prod

# Method 2: Inline notation
environments: [dev, staging, prod]

# Nested lists
tags:
  - category: finance
    priority: high
  - category: customer
    priority: medium
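The inline notation from Method 2 works for nested structures too. As a sketch, the same tags list can be written in flow style; it is exactly equivalent, just more compact:

# Inline (flow) equivalent of the nested list above
tags: [{category: finance, priority: high}, {category: customer, priority: medium}]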
Dictionaries (Key-Value Pairs)
# Simple dictionary
cluster_config:
  spark_version: "13.3.x-scala2.12"
  node_type: "Standard_DS3_v2"
  num_workers: 2

# Nested dictionaries
job_settings:
  schedule:
    cron: "0 2 * * *"
    timezone: "America/New_York"
  notifications:
    on_failure:
      - admin@company.com
    on_success:
      - team@company.com
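Dictionaries also have an inline (flow) form, handy for short one-liners. A sketch using the same cluster_config keys:

# Inline (flow) notation for a small dictionary
cluster_config: {spark_version: "13.3.x-scala2.12", node_type: "Standard_DS3_v2", num_workers: 2}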
✅ Checkpoint #1: Can you read this?
If you understand that notifications is a dictionary containing on_failure (a list) and that indentation shows schedule belongs to job_settings, you're ready for Level 1!
Level 1: Your First databricks.yml BEGINNER
Understanding the Structure
Every Databricks Asset Bundle starts with a root databricks.yml file. Think of it as the blueprint of your entire workspace.
# Bundle metadata
bundle:
  name: customer_analytics

# Target environments
targets:
  dev:
    # Workspace URL for dev environment
    workspace:
      host: https://adb-1234567890123456.7.azuredatabricks.net
The bundle section defines your project. The targets section defines where it deploys (dev, staging, prod).
Adding Your First Job
bundle:
  name: customer_analytics

# Define reusable variables
variables:
  warehouse_id:
    description: SQL Warehouse ID
    default: "abc123def456"

targets:
  dev:
    workspace:
      host: https://adb-1234567890123456.7.azuredatabricks.net

resources:
  jobs:
    # Job identifier (used internally)
    daily_customer_report:
      # Display name in Databricks UI
      name: "[DEV] Daily Customer Report"

      # Single task job
      tasks:
        - task_key: generate_report
          sql_task:
            warehouse_id: ${var.warehouse_id}
            query:
              query: |
                SELECT
                  customer_id,
                  COUNT(*) as order_count,
                  SUM(amount) as total_spent
                FROM customers
                GROUP BY customer_id

      # Schedule (optional)
      schedule:
        quartz_cron_expression: "0 0 8 * * ?"
        timezone_id: "America/New_York"
The | symbol after query: tells YAML to preserve newlines, so your SQL is kept exactly as written. Use > instead if you want the lines folded into one.
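A quick side-by-side, using made-up text, to show the difference between the two block styles:

# Literal block (|): newlines preserved exactly
query: |
  SELECT *
  FROM customers

# Folded block (>): newlines folded into spaces, read as one line
description: >
  This description is written across
  several lines but loads as a single line.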
Level 2: Variables and Environments INTERMEDIATE
The Power of Variables
Variables make your YAML reusable across environments. Instead of copying configs, you define once and reference everywhere.
bundle:
  name: customer_analytics

# Global variables (available to all targets)
variables:
  catalog_name:
    description: Unity Catalog name
  schema_name:
    description: Schema for tables
    default: bronze

targets:
  # Development environment
  dev:
    # Override variables per environment
    variables:
      catalog_name: dev_catalog
      schema_name: dev_bronze
    workspace:
      host: https://adb-dev.azuredatabricks.net

  # Production environment
  prod:
    variables:
      catalog_name: prod_catalog
      schema_name: prod_bronze
    workspace:
      host: https://adb-prod.azuredatabricks.net

resources:
  jobs:
    ingest_customers:
      name: "[${bundle.target}] Customer Ingestion"
      tasks:
        - task_key: bronze_load
          notebook_task:
            notebook_path: ./notebooks/bronze_ingestion.py
            base_parameters:
              # Reference variables with ${var.variable_name}
              catalog: ${var.catalog_name}
              schema: ${var.schema_name}
              table: customers
          new_cluster:
            spark_version: "13.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 2
- ${bundle.target} automatically inserts "dev" or "prod" into the job name
- ${var.catalog_name} resolves to "dev_catalog" in dev and "prod_catalog" in prod
- Same YAML, different environments: zero code duplication!
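To make the substitution concrete, here is roughly what the job above resolves to when you deploy the dev target (illustrative values only):

# Effective values for the dev deployment
name: "[dev] Customer Ingestion"   # ${bundle.target} -> dev
base_parameters:
  catalog: dev_catalog             # ${var.catalog_name}
  schema: dev_bronze               # ${var.schema_name}
  table: customers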
Level 3: Multi-Task Jobs & Dependencies INTERMEDIATE
Building a Real ETL Pipeline
Real data pipelines have multiple steps with dependencies. YAML handles this elegantly with the depends_on key.
resources:
  jobs:
    customer_etl_pipeline:
      name: "[${bundle.target}] Customer ETL Pipeline"

      # Job-level settings
      max_concurrent_runs: 1
      timeout_seconds: 7200

      tasks:
        # Task 1: Extract raw data
        - task_key: extract_raw
          notebook_task:
            notebook_path: ./notebooks/01_extract.py
            base_parameters:
              source_table: "raw.customers"
              target_table: "${var.catalog_name}.bronze.customers_raw"
          new_cluster:
            spark_version: "13.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 2
            spark_conf:
              "spark.databricks.delta.optimizeWrite.enabled": "true"

        # Task 2: Clean and validate (depends on Task 1)
        - task_key: clean_data
          depends_on:
            - task_key: extract_raw
          notebook_task:
            notebook_path: ./notebooks/02_clean.py
            base_parameters:
              source_table: "${var.catalog_name}.bronze.customers_raw"
              target_table: "${var.catalog_name}.silver.customers_clean"
          # Reuse the same cluster from Task 1
          existing_cluster_id: "{{tasks.extract_raw.cluster_instance_id}}"

        # Task 3: Business logic transformations
        - task_key: transform_business
          depends_on:
            - task_key: clean_data
          sql_task:
            warehouse_id: ${var.warehouse_id}
            query:
              query: |
                CREATE OR REPLACE TABLE ${var.catalog_name}.gold.customer_metrics AS
                SELECT
                  customer_id,
                  customer_name,
                  COUNT(order_id) as lifetime_orders,
                  SUM(order_amount) as lifetime_value,
                  MAX(order_date) as last_order_date,
                  DATEDIFF(CURRENT_DATE(), MAX(order_date)) as days_since_last_order
                FROM ${var.catalog_name}.silver.customers_clean
                GROUP BY customer_id, customer_name

        # Task 4: Data quality checks (runs in parallel with Task 3)
        - task_key: quality_checks
          depends_on:
            - task_key: clean_data
          notebook_task:
            notebook_path: ./notebooks/03_quality_checks.py
            base_parameters:
              table_to_check: "${var.catalog_name}.silver.customers_clean"
          new_cluster:
            spark_version: "13.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 1

        # Task 5: Final notification (waits for all previous tasks)
        - task_key: notify_completion
          depends_on:
            - task_key: transform_business
            - task_key: quality_checks
          notebook_task:
            notebook_path: ./notebooks/04_notify.py
          existing_cluster_id: "{{tasks.quality_checks.cluster_instance_id}}"

      # Email notifications
      email_notifications:
        on_failure:
          - data-engineering@company.com
        on_success:
          - analytics-team@company.com

      # Job schedule
      schedule:
        quartz_cron_expression: "0 0 2 * * ?" # 2 AM daily
        timezone_id: "America/New_York"
        pause_status: UNPAUSED
🧠 Challenge: Understand the Flow
Question: In the pipeline above, which tasks run in parallel?
Answer: Tasks 3 (transform_business) and 4 (quality_checks) run in parallel because they both depend only on Task 2 (clean_data) and don't depend on each other. Task 5 waits for both to complete.
Level 4: YAML Anchors & Reusability ADVANCED
DRY Principle: Don't Repeat Yourself
YAML anchors (&) let you define something once and reference it multiple times with aliases (*). This is crucial for cluster configs that repeat across jobs.
bundle:
  name: customer_analytics

# Define reusable cluster configurations
variables:
  # Small cluster anchor
  small_cluster: &small_cluster
    spark_version: "13.3.x-scala2.12"
    node_type_id: "Standard_DS3_v2"
    num_workers: 2
    spark_conf:
      "spark.databricks.delta.optimizeWrite.enabled": "true"
      "spark.databricks.delta.autoCompact.enabled": "true"

  # Large cluster anchor
  large_cluster: &large_cluster
    spark_version: "13.3.x-scala2.12"
    node_type_id: "Standard_DS4_v2"
    num_workers: 8
    spark_conf:
      "spark.databricks.delta.optimizeWrite.enabled": "true"
      "spark.databricks.delta.autoCompact.enabled": "true"
      "spark.sql.adaptive.enabled": "true"

targets:
  dev:
    workspace:
      host: https://adb-dev.azuredatabricks.net

resources:
  jobs:
    # Job 1: Uses small cluster
    quick_report:
      name: "Quick Daily Report"
      tasks:
        - task_key: generate
          notebook_task:
            notebook_path: ./notebooks/quick_report.py
          # Reference the anchor with *
          new_cluster: *small_cluster

    # Job 2: Also uses small cluster
    validation_job:
      name: "Data Validation"
      tasks:
        - task_key: validate
          notebook_task:
            notebook_path: ./notebooks/validate.py
          new_cluster: *small_cluster

    # Job 3: Uses large cluster for heavy processing
    monthly_aggregation:
      name: "Monthly Aggregation"
      tasks:
        - task_key: aggregate
          notebook_task:
            notebook_path: ./notebooks/monthly_agg.py
          new_cluster: *large_cluster
Merging and Overriding with Anchors
You can extend anchors using merge keys (<<:) to inherit and override specific properties. Note that the merge is shallow: nested mappings like spark_conf are replaced, not combined, which is why the example below re-lists the base spark_conf key.
variables:
  # Base cluster configuration
  base_cluster: &base_cluster
    spark_version: "13.3.x-scala2.12"
    node_type_id: "Standard_DS3_v2"
    num_workers: 2
    spark_conf:
      "spark.databricks.delta.optimizeWrite.enabled": "true"

resources:
  jobs:
    custom_job:
      name: "Custom Processing Job"
      tasks:
        - task_key: process
          notebook_task:
            notebook_path: ./notebooks/process.py
          new_cluster:
            # Merge base configuration and override specific keys
            <<: *base_cluster
            num_workers: 4 # Override: use 4 workers instead of 2
            spark_conf:
              # Merge is shallow, so re-list the base spark_conf keys and add new ones
              "spark.databricks.delta.optimizeWrite.enabled": "true"
              "spark.sql.shuffle.partitions": "200"
✅ Checkpoint #4: Anchors Mastery
You now understand that &name creates a reusable anchor, *name references it, and <<: *name lets you inherit and override. This eliminates most of the repetition in real-world DAB configs!
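Two extra details that come up in practice: anchors are resolved by the YAML parser, so they only work within the file that defines them, and the merge key also accepts a list of anchors (a YAML 1.1 feature supported by most parsers). A sketch with hypothetical anchor names:

# Anchors must live in the same file where they are referenced
defaults: &defaults
  spark_version: "13.3.x-scala2.12"
  node_type_id: "Standard_DS3_v2"

delta_tuning: &delta_tuning
  spark_conf:
    "spark.databricks.delta.optimizeWrite.enabled": "true"

heavy_cluster:
  # Merge several anchors at once; explicit keys still win
  <<: [*defaults, *delta_tuning]
  num_workers: 8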
Level 5: Complete Production Example PRO
Real-World Databricks Asset Bundle
Let's put it all together: a production-ready configuration with multiple environments, shared resources, and best practices.
bundle:
  name: customer_data_platform

# Include external YAML files for better organization
include:
  - resources/*.yml

# Global variables
variables:
  catalog_name:
    description: Unity Catalog name
  notification_email:
    description: Email for job notifications
    default: data-engineering@company.com

  # Cluster configurations as anchors
  standard_cluster: &standard_cluster
    spark_version: "13.3.x-scala2.12"
    node_type_id: "Standard_DS3_v2"
    autoscale:
      min_workers: 2
      max_workers: 8
    spark_conf:
      "spark.databricks.delta.optimizeWrite.enabled": "true"
      "spark.databricks.delta.autoCompact.enabled": "true"
    azure_attributes:
      availability: "ON_DEMAND_AZURE"
      spot_bid_max_price: -1

# Target environments
targets:
  # Development
  dev:
    mode: development
    default: true
    variables:
      catalog_name: dev_catalog
      notification_email: dev-team@company.com
    workspace:
      host: https://adb-dev-12345.7.azuredatabricks.net
      root_path: ~/.bundle/${bundle.name}/${bundle.target}

  # Staging
  staging:
    mode: production
    variables:
      catalog_name: staging_catalog
    workspace:
      host: https://adb-staging-12345.7.azuredatabricks.net
      root_path: /Shared/.bundle/${bundle.name}/${bundle.target}

  # Production
  prod:
    mode: production
    variables:
      catalog_name: prod_catalog
    workspace:
      host: https://adb-prod-12345.7.azuredatabricks.net
      root_path: /Shared/.bundle/${bundle.name}/${bundle.target}
    # Production-specific permissions
    permissions:
      - level: CAN_VIEW
        group_name: "data-analysts"

# Resources
resources:
  jobs:
    # Daily ETL Pipeline
    daily_etl:
      name: "[${bundle.target}] Daily Customer ETL"

      job_clusters:
        # Define job-level clusters that tasks can share
        - job_cluster_key: etl_cluster
          new_cluster:
            <<: *standard_cluster
            spark_env_vars:
              ENVIRONMENT: ${bundle.target}

      tasks:
        # Bronze layer ingestion
        - task_key: bronze_ingestion
          job_cluster_key: etl_cluster
          notebook_task:
            notebook_path: ./pipelines/bronze/ingest_customers.py
            base_parameters:
              catalog: ${var.catalog_name}
              schema: bronze
              source_format: parquet
          libraries:
            - pypi:
                package: "pandas==2.0.3"
            - pypi:
                package: "pyarrow==12.0.1"

        # Silver layer transformation
        - task_key: silver_transformation
          depends_on:
            - task_key: bronze_ingestion
          job_cluster_key: etl_cluster
          notebook_task:
            notebook_path: ./pipelines/silver/transform_customers.py
            base_parameters:
              catalog: ${var.catalog_name}
              source_schema: bronze
              target_schema: silver

        # Gold layer aggregations
        - task_key: gold_aggregations
          depends_on:
            - task_key: silver_transformation
          sql_task:
            warehouse_id: ${var.warehouse_id}
            query:
              query: |
                -- Customer lifetime value
                CREATE OR REPLACE TABLE ${var.catalog_name}.gold.customer_ltv AS
                SELECT
                  c.customer_id,
                  c.customer_name,
                  c.customer_segment,
                  COUNT(DISTINCT o.order_id) as total_orders,
                  SUM(o.order_amount) as lifetime_value,
                  AVG(o.order_amount) as avg_order_value,
                  MIN(o.order_date) as first_order_date,
                  MAX(o.order_date) as last_order_date,
                  DATEDIFF(CURRENT_DATE(), MAX(o.order_date)) as days_since_last_order,
                  CASE
                    WHEN DATEDIFF(CURRENT_DATE(), MAX(o.order_date)) <= 30 THEN 'Active'
                    WHEN DATEDIFF(CURRENT_DATE(), MAX(o.order_date)) <= 90 THEN 'At Risk'
                    ELSE 'Churned'
                  END as customer_status
                FROM ${var.catalog_name}.silver.customers c
                LEFT JOIN ${var.catalog_name}.silver.orders o
                  ON c.customer_id = o.customer_id
                GROUP BY c.customer_id, c.customer_name, c.customer_segment

        # Data quality validation
        - task_key: quality_validation
          depends_on:
            - task_key: gold_aggregations
          job_cluster_key: etl_cluster
          notebook_task:
            notebook_path: ./pipelines/validation/quality_checks.py
            base_parameters:
              catalog: ${var.catalog_name}
              tables_to_check: "bronze.customers,silver.customers,gold.customer_ltv"

      # Scheduling
      schedule:
        quartz_cron_expression: "0 0 3 * * ?" # 3 AM daily
        timezone_id: "America/New_York"
        pause_status: UNPAUSED

      # Notifications
      email_notifications:
        on_start:
          - ${var.notification_email}
        on_success:
          - ${var.notification_email}
        on_failure:
          - ${var.notification_email}
          - oncall@company.com
        no_alert_for_skipped_runs: true

      # Retry policy
      max_retries: 2
      retry_on_timeout: true

      # Timeout (2 hours)
      timeout_seconds: 7200

      # Tags for cost tracking
      tags:
        Environment: ${bundle.target}
        Team: DataEngineering
        CostCenter: Analytics
        Project: CustomerDataPlatform

    # Weekly aggregation job
    weekly_aggregation:
      name: "[${bundle.target}] Weekly Aggregations"
      tasks:
        - task_key: weekly_agg
          notebook_task:
            notebook_path: ./pipelines/gold/weekly_aggregations.py
            base_parameters:
              catalog: ${var.catalog_name}
          new_cluster:
            <<: *standard_cluster
            num_workers: 4 # Override for larger job
      schedule:
        quartz_cron_expression: "0 0 4 ? * SUN" # Sunday 4 AM
        timezone_id: "America/New_York"
      email_notifications:
        on_failure:
          - ${var.notification_email}

  # SQL Warehouses
  sql_warehouses:
    analytics_warehouse:
      name: "[${bundle.target}] Analytics Warehouse"
      cluster_size: "Medium"
      min_num_clusters: 1
      max_num_clusters: 3
      auto_stop_mins: 15
      enable_serverless_compute: true
      tags:
        custom_tags:
          - key: Environment
            value: ${bundle.target}
          - key: Team
            value: Analytics
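One last piece worth showing: the include: glob at the top pulls in every YAML file under resources/, which keeps the root file small. An included file is just more of the same configuration. Below is a minimal sketch of what a hypothetical resources/alerts_job.yml could look like (the job, notebook path, and schedule are invented for illustration):

# resources/alerts_job.yml (hypothetical file picked up by the include: glob)
resources:
  jobs:
    freshness_alerts:
      name: "[${bundle.target}] Data Freshness Alerts"
      tasks:
        - task_key: check_freshness
          notebook_task:
            notebook_path: ../pipelines/validation/freshness_check.py
            base_parameters:
              catalog: ${var.catalog_name}
      schedule:
        quartz_cron_expression: "0 30 6 * * ?" # 6:30 AM daily
        timezone_id: "America/New_York"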
Best Practices Summary
| Practice | Why It Matters | Example |
|---|---|---|
| Use Variables | Environment-specific configs without duplication | ${var.catalog_name} |
| Use Anchors | Reusable cluster configs reduce errors | &standard_cluster |