MODULE 27 Databricks
0 / 14 topics
Module 27 ยท Overview

Databricks โ€” The Lakehouse Platform

Databricks is the unified analytics platform built on top of Apache Spark. It combines data engineering, data science, machine learning, and SQL analytics โ€” all in one place. This module covers every component you need to use Databricks professionally.

๐Ÿ’ก The Big Picture Analogy
Think of Databricks like a fully equipped kitchen inside a restaurant. Apache Spark is the oven โ€” powerful but raw. Databricks gives you the kitchen, the prep counters, the team coordination tools (workflows), the pantry (Unity Catalog), and the dishwasher (automated cleanup). You can cook much faster and better than if you built everything yourself.
27.1
๐Ÿ—‚๏ธ Workspace
Workspace structure, repos, Git integration
27.2
โšก Clusters
Interactive, job clusters, autoscaling, policies
27.3
๐Ÿš€ Runtime
Runtime versions, Photon engine, optimizations
27.4
๐Ÿ““ Notebooks
Development, widgets, parameters, scheduling
27.5
๐Ÿ”ง Workflows
Jobs, multi-task jobs, dependencies, scheduling
27.6
๐Ÿ›ก๏ธ Unity Catalog
Catalogs, schemas, tables, security, lineage
27.7
๐Ÿ”„ Delta Live Tables
Pipelines, expectations, CDC, quality
27.8
๐Ÿ“ฅ Auto Loader
File discovery, incremental ingestion, schema evolution
27.9
๐Ÿงช MLflow
Experiment tracking, model registry, serving
27.10
๐Ÿงฎ Databricks SQL
Warehouses, dashboards, query optimization
27.11
๐Ÿ”Œ Connect
Local development, remote execution from IDE
27.12
๐Ÿ“ฆ Asset Bundles
CI/CD, deployment automation as code
27.13
โ˜๏ธ Serverless
Serverless jobs, serverless SQL warehouses
27.14
๐ŸŒŠ Lakeflow
Declarative pipelines, workflow management
๐Ÿ“Œ Why Databricks Matters
Databricks is the #1 platform used by enterprise data engineers today. It handles petabyte-scale data processing, supports all cloud providers (AWS, Azure, GCP), and provides a managed Spark experience โ€” meaning no cluster management headaches. Learning Databricks = learning the industry standard.
27.1

Databricks Workspace

The workspace is your home inside Databricks โ€” a collaborative environment where you organize notebooks, jobs, clusters, and data assets. Think of it as your project folder on steroids, shared across your entire team.

๐Ÿ—‚๏ธ
Workspace Structure Foundation โ–ผ
What is the Workspace?

The Databricks workspace is a web-based interface and a hierarchical file system where you store notebooks, libraries, and experiment results. Every user gets a home folder (/Users/your.email@company.com) and teams share a /Shared folder.

๐Ÿ’ก Analogy
The workspace is like Google Drive for your data engineering code. Your personal folder is your "My Drive," the Shared folder is a team drive, and Repos is like syncing with GitHub.
Databricks Workspace โ”œโ”€โ”€ /Users โ”‚ โ”œโ”€โ”€ /alice@company.com โ† Alice's personal notebooks โ”‚ โ””โ”€โ”€ /bob@company.com โ† Bob's personal notebooks โ”œโ”€โ”€ /Shared โ† Team-shared notebooks & folders โ”‚ โ”œโ”€โ”€ /ETL_Pipelines โ”‚ โ””โ”€โ”€ /ML_Models โ””โ”€โ”€ /Repos โ† Git-synced code repositories โ”œโ”€โ”€ /alice@company.com/my-repo โ””โ”€โ”€ /data-team/shared-repo
๐Ÿ“Œ Key Point
In production teams, code lives in Repos (Git-synced), not in personal folders. Personal folders are for experimentation only.
Repositories (Repos)

Databricks Repos connects your workspace directly to a Git provider (GitHub, GitLab, Bitbucket, Azure DevOps). This means you can clone, branch, commit, push, and pull โ€” all from inside the Databricks UI, without touching the command line.

๐Ÿ“‹ Example Workflow
1. Clone your GitHub repo into /Repos/alice@company.com/etl-project
2. Create a branch: feature/add-new-pipeline
3. Edit notebook in Databricks UI
4. Commit + push from Databricks โ†’ GitHub PR raised
5. After merge, pull latest into main branch in workspace
BASH โ€” Databricks CLI to clone repo
# Install Databricks CLI
pip install databricks-cli

# Configure with your workspace
databricks configure --token
# Enter: Host = https://<your-workspace>.azuredatabricks.net
# Enter: Token = your_personal_access_token

# List repos
databricks repos list

# Create a repo (clone from GitHub)
databricks repos create \
  --url https://github.com/myorg/etl-project \
  --provider gitHub \
  --path /Repos/alice@company.com/etl-project
Git Integration

Databricks integrates with Git at two levels: Repos (folder-level, full repo sync) and Notebook Git versioning (single notebook linked to a Git file). For production, always use Repos.

FeatureReposNotebook Git Versioning
ScopeEntire repositorySingle notebook
BranchingFull branch supportLimited
CI/CD IntegrationYes โ€” via DABsManual only
Use CaseProduction codeQuick experiments
PYTHON โ€” Working with Repos via Databricks REST API
import requests

WORKSPACE = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "your_pat_token"

headers = {"Authorization": f"Bearer {TOKEN}"}

# List all repos
resp = requests.get(
    f"{WORKSPACE}/api/2.0/repos",
    headers=headers
)
print(resp.json())

# Pull latest changes (update repo to latest branch HEAD)
repo_id = "12345"
resp = requests.patch(
    f"{WORKSPACE}/api/2.0/repos/{repo_id}",
    headers=headers,
    json={"branch": "main"}
)
print("Repo updated:", resp.json())
27.2

Clusters

A cluster is a set of computers (nodes) that run your Spark code. In Databricks, you never manage the infrastructure manually โ€” you simply configure your cluster and Databricks provisions it on the cloud automatically.

โšก
Interactive Clusters Core โ–ผ
What is an Interactive Cluster?

Interactive clusters (also called All-Purpose clusters) are long-running clusters you attach notebooks to for interactive development. They stay alive until you manually terminate them or auto-termination kicks in.

๐Ÿ’ก Analogy
An interactive cluster is like keeping your computer turned on at your desk all day. Fast to use when you need it โ€” but it costs money the whole time it's on.
๐Ÿ“‹ When to Use
โœ… Development and exploration
โœ… Running ad-hoc queries
โœ… Collaborative notebook work
โŒ Not for production scheduled jobs (too expensive)
JSON โ€” Cluster Config (Databricks REST API)
{
  "cluster_name": "dev-interactive",
  "spark_version": "14.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2,
  "autotermination_minutes": 30,
  "spark_conf": {
    "spark.sql.shuffle.partitions": "200",
    "spark.databricks.io.cache.enabled": "true"
  }
}
โš ๏ธ Cost Warning
Always set autotermination_minutes! Forgetting this is one of the most common causes of unexpected cloud bills. Set it to 30โ€“60 minutes for development clusters.
โš™๏ธ
Job Clusters Production โ–ผ
What is a Job Cluster?

Job clusters are ephemeral โ€” they spin up when a job starts, run the job, and immediately terminate when done. They're the right choice for production pipelines because you only pay for the actual compute time used.

๐Ÿ’ก Analogy
A job cluster is like an Uber. It arrives when you need it, takes you where you're going, and disappears. You're only charged for the ride โ€” not for waiting around.
AspectInteractive ClusterJob Cluster
LifetimeLong-runningEphemeral (per job)
CostHigher (idle time)Lower (pay per run)
Use CaseDev/explorationProduction jobs
Startup timeZero (already on)5โ€“10 min cold start
Multi-userYesSingle job only
JSON โ€” Job Cluster Config inside a Workflow
{
  "job_cluster_key": "etl-job-cluster",
  "new_cluster": {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS4_v2",
    "num_workers": 4,
    "spark_conf": {
      "spark.sql.shuffle.partitions": "400"
    },
    "init_scripts": [{
      "dbfs": {"destination": "dbfs:/init_scripts/install_libs.sh"}
    }]
  }
}
๐Ÿ“
Cluster Policies Governance โ–ผ
What are Cluster Policies?

Cluster policies are administrator-defined templates with guardrails. They restrict what cluster configurations users can choose โ€” enforcing cost control, security standards, and best practices automatically.

๐Ÿ“‹ Real Example
Without policies: A developer creates a 50-node cluster and forgets to set auto-termination โ†’ $5,000 wasted over a weekend.

With policies: Policy enforces max 8 workers, auto-terminate after 60 minutes. Developer can't create anything outside those bounds.
JSON โ€” Cluster Policy Definition
{
  "name": "Dev Policy โ€” Cost Controlled",
  "definition": {
    // Force auto-termination to 60 minutes max
    "autotermination_minutes": {
      "type": "range",
      "maxValue": 60,
      "defaultValue": 30
    },
    // Restrict max workers to 8
    "num_workers": {
      "type": "range",
      "maxValue": 8,
      "defaultValue": 2
    },
    // Force a specific runtime version
    "spark_version": {
      "type": "allowlist",
      "values": ["14.3.x-scala2.12", "13.3.x-scala2.12"]
    }
  }
}
๐Ÿ“ˆ
Autoscaling Performance โ–ผ
How Autoscaling Works

Databricks autoscaling automatically adds or removes worker nodes based on workload. You set a minimum and maximum number of workers, and Databricks scales within that range as needed.

๐Ÿ’ก Analogy
Autoscaling is like a restaurant hiring extra staff on Friday night and sending them home on Tuesday afternoon. The kitchen scales to demand automatically โ€” you don't call each cook individually.
JSON โ€” Autoscaling Cluster Config
{
  "cluster_name": "autoscaling-cluster",
  "spark_version": "14.3.x-scala2.12",
  "node_type_id": "Standard_DS4_v2",
  // Replace num_workers with autoscale block
  "autoscale": {
    "min_workers": 2,
    "max_workers": 10
  },
  "autotermination_minutes": 30
}
๐Ÿ“Œ Enhanced Autoscaling
Databricks has Enhanced Autoscaling (available on Premium tier) which is smarter โ€” it scales based on the actual Spark task queue, not just CPU. It's 2โ€“3x faster at scaling decisions than standard autoscaling.
โ„น๏ธ Autoscaling Limitation
Autoscaling doesn't work well for Structured Streaming jobs โ€” it can cause rebalancing that interrupts micro-batches. For streaming, use a fixed number of workers.
27.3

Databricks Runtime

The Databricks Runtime (DBR) is the software environment that runs on each cluster. It bundles Apache Spark with optimized libraries, security patches, and Databricks-specific enhancements โ€” all pre-configured and tested together.

๐Ÿš€
Runtime Versions Core โ–ผ
Understanding DBR Versions

Databricks releases runtime versions regularly. The version string format is: DBR_VERSION.x-scala2.12. Long-term support (LTS) versions are maintained for 2 years โ€” always prefer LTS in production.

VERSION FORMAT EXPLAINED
# Format: {DBR_version}.x-scala{scala_version}
"14.3.x-scala2.12"
#  โ”‚    โ”‚  โ””โ”€ Scala 2.12 (most common, Scala 2.13 also available)
#  โ”‚    โ””โ”€โ”€โ”€โ”€ .x = latest patch within DBR 14.3
#  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ DBR major.minor version

# ML Runtime (includes ML libraries like TensorFlow, PyTorch)
"14.3.x-cpu-ml-scala2.12"

# GPU Runtime (for deep learning)
"14.3.x-gpu-ml-scala2.12"

# Long Term Support marker
# LTS versions: 12.2 LTS, 13.3 LTS, 14.3 LTS (check Databricks docs for current)
Runtime TypeUse CaseIncluded
StandardGeneral data engineeringSpark, Delta Lake, common Python libs
MLMachine learningStandard + MLflow, scikit-learn, TensorFlow, PyTorch
GPU MLDeep learningML + CUDA, GPU-optimized libraries
PhotonSQL / analytical workloadsStandard + Photon vectorized engine
โšก
Photon Engine Performance โ–ผ
What is Photon?

Photon is Databricks' vectorized query engine written in C++ (not JVM). It replaces Spark's Java-based execution engine for SQL and DataFrame operations, delivering 2โ€“8x faster performance on analytical queries โ€” especially for aggregations, joins, and sorts on large datasets.

๐Ÿ’ก Analogy
Spark's regular engine is like a worker who processes one item at a time. Photon is like a forklift โ€” it processes a whole pallet at once (vectorized = batch of rows at once in CPU registers). Same task, massively faster throughput.
๐Ÿ”ข
Vectorized Execution
Processes thousands of rows per CPU instruction using SIMD (Single Instruction Multiple Data)
๐Ÿ’พ
Native C++ Code
No JVM overhead โ€” eliminates GC pauses that slow down Java-based Spark execution
๐Ÿ”„
Transparent
No code changes needed โ€” Photon automatically accelerates supported operations
๐Ÿ“Š
Best For
GROUP BY, JOIN, ORDER BY, aggregations, Delta reads โ€” SQL-heavy workloads
PYTHON โ€” Check if Photon is enabled
# Photon is enabled at cluster level in the UI
# Check via Spark conf
is_photon = spark.conf.get("spark.databricks.photon.enabled", "false")
print(f"Photon enabled: {is_photon}")

# Run a query โ€” Photon kicks in automatically for SQL
result = spark.sql("""
  SELECT 
    customer_id,
    SUM(amount) as total_amount,
    COUNT(*) as order_count
  FROM orders
  GROUP BY customer_id
  ORDER BY total_amount DESC
  LIMIT 100
""")

# Check explain plan to see "Photon" in the physical plan
result.explain("formatted")
# Look for "PhotonGroupingAgg", "PhotonSort", "PhotonShuffleExchange"
๐Ÿ”ง
Runtime Optimization Features Advanced โ–ผ
Disk Cache (Delta Cache)

Databricks Runtime includes a disk cache (not to be confused with Spark's memory cache). It caches remote data (from S3/ADLS) locally on NVMe SSDs on each worker node. This means repeated reads of the same data files are served from fast local disk instead of slow network storage.

PYTHON โ€” Enable and manage Delta Cache
# Enable disk cache (set in cluster config or spark conf)
spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.conf.set("spark.databricks.io.cache.maxDiskUsage", "50g")
spark.conf.set("spark.databricks.io.cache.maxMetaDataCache", "1g")

# Cache specific tables for fast repeated access
spark.sql("CACHE SELECT * FROM sales.orders")

# View cache status
spark.sql("SHOW CACHE STATUS").show()

# Clear cache if needed
spark.sql("UNCACHE TABLE sales.orders")
๐Ÿ“Œ Delta Cache vs Spark Cache
Delta Cache: Caches raw Parquet files on local disk โ†’ automatic, transparent, persists between queries.
Spark Cache: Caches DataFrame in memory/disk as Spark's internal format โ†’ manual, cleared when cluster restarts.
27.4

Notebooks

Databricks notebooks are interactive, web-based documents that combine live code, visualizations, and markdown text. They support Python, SQL, Scala, and R โ€” and you can switch between languages within a single notebook using magic commands.

๐Ÿ““
Notebook Development & Magic Commands Core โ–ผ
Magic Commands

Magic commands let you switch the language of a cell, run shell commands, or import external files. They start with % and must be the first line in a cell.

DATABRICKS NOTEBOOK โ€” Magic Commands
# %python โ€” run Python (default)
%python
df = spark.read.csv("/data/sales.csv", header=True)
df.show(5)

---
# %sql โ€” run Spark SQL
%sql
SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id

---
# %scala โ€” run Scala
%scala
val df = spark.read.csv("/data/sales.csv")

---
# %sh โ€” run shell commands on driver node
%sh
ls /dbfs/data/
pip show pandas

---
# %fs โ€” Databricks File System (DBFS) commands
%fs
ls /data/
head /data/sample.csv

---
# %md โ€” Markdown for documentation
%md
## Pipeline Step 1: Load Data
This cell loads raw CSV data from DBFS into a Spark DataFrame.

---
# %run โ€” run another notebook (import its definitions)
%run ./utils/common_functions  # relative path
%run /Repos/myrepo/utils/helpers  # absolute path
dbutils โ€” Databricks Utilities

dbutils is a special object available in all Databricks notebooks. It provides utilities for file system operations, secret management, notebook chaining, and widgets โ€” things you can't do with standard Python or Spark alone.

PYTHON โ€” dbutils key utilities
# โ”€โ”€โ”€ File System (dbutils.fs) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
# List files
files = dbutils.fs.ls("dbfs:/data/")
for f in files:
    print(f.name, f.size)

# Copy files
dbutils.fs.cp("dbfs:/raw/file.csv", "dbfs:/processed/file.csv")

# Move files
dbutils.fs.mv("dbfs:/staging/file.csv", "dbfs:/archive/file.csv")

# Delete files
dbutils.fs.rm("dbfs:/temp/file.csv", recurse=True)

# โ”€โ”€โ”€ Secrets (dbutils.secrets) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
# Read a secret from a Databricks secret scope
db_password = dbutils.secrets.get(scope="production", key="db_password")
api_key = dbutils.secrets.get(scope="production", key="api_key")

# Secrets are redacted in notebook output โ€” [REDACTED]
print(db_password)  # prints: [REDACTED]

# โ”€โ”€โ”€ Notebook utilities (dbutils.notebook) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
# Run another notebook and get its return value
result = dbutils.notebook.run(
    "/Repos/myrepo/etl/process_orders",
    timeout_seconds=3600,
    arguments={"date": "2024-01-15", "env": "prod"}
)
print(f"Child notebook returned: {result}")

# Exit the current notebook with a return value
dbutils.notebook.exit("SUCCESS: 1000 rows processed")

# โ”€โ”€โ”€ Widgets (dbutils.widgets) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
# Create input widgets (covered in next subtopic)
dbutils.widgets.text("date", "2024-01-01", "Processing Date")
date_param = dbutils.widgets.get("date")
๐ŸŽ›๏ธ
Widgets & Parameters Interactive โ–ผ
What are Widgets?

Widgets are input controls (text boxes, dropdowns, sliders) that appear at the top of a Databricks notebook. They let you pass parameters to a notebook โ€” either manually via the UI, or programmatically when calling the notebook from a workflow or API.

PYTHON โ€” All Widget Types
# 1. TEXT widget โ€” free-form text input
dbutils.widgets.text(
    name="processing_date",
    defaultValue="2024-01-01",
    label="Processing Date (YYYY-MM-DD)"
)

# 2. DROPDOWN widget โ€” select from a list
dbutils.widgets.dropdown(
    name="environment",
    defaultValue="dev",
    choices=["dev", "staging", "prod"],
    label="Environment"
)

# 3. COMBOBOX โ€” dropdown + allow custom text
dbutils.widgets.combobox(
    name="table_name",
    defaultValue="orders",
    choices=["orders", "customers", "products"],
    label="Table Name"
)

# 4. MULTISELECT โ€” choose multiple options
dbutils.widgets.multiselect(
    name="regions",
    defaultValue="US",
    choices=["US", "EU", "APAC", "LATAM"],
    label="Regions"
)

# โ”€โ”€โ”€ Read widget values โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
date = dbutils.widgets.get("processing_date")
env = dbutils.widgets.get("environment")
table = dbutils.widgets.get("table_name")
regions_str = dbutils.widgets.get("regions")
regions = regions_str.split(",")

print(f"Processing {table} for {env} on {date} for regions: {regions}")

# Use in SQL directly with $ syntax
spark.sql(f"""
  SELECT * FROM {env}.{table}
  WHERE process_date = '{date}'
""")

# Clean up widgets
dbutils.widgets.removeAll()  # remove all widgets
dbutils.widgets.remove("processing_date")  # remove one
Scheduling Notebooks

You can schedule a notebook to run on a cron-like schedule directly from the Databricks UI. However, for production use, always use Workflows (section 27.5) for better monitoring, dependency management, and retry logic.

๐Ÿ“‹ Notebook Scheduling vs Workflows
Notebook scheduling: Quick, simple, single notebook only
Workflows: Multi-task, dependencies, retry policies, alerts โ€” always use in production
27.5

Workflows

Databricks Workflows is the native orchestration engine. You define jobs with multiple tasks (notebooks, Python scripts, SQL queries, Delta Live Table pipelines), set dependencies between them, and schedule the whole workflow to run automatically.

โš™๏ธ
Jobs & Multi-Task Jobs Core โ–ผ
Anatomy of a Databricks Job

A Job is a container for one or more Tasks. Each task runs on a cluster (shared or its own job cluster) and can be a notebook, Python script, SQL query, JAR, or Delta Live Tables pipeline.

JOB: "Daily Sales ETL" โ”œโ”€โ”€ Task 1: "extract_raw_data" [Notebook] depends_on: [] โ”œโ”€โ”€ Task 2: "validate_schema" [Python] depends_on: [Task 1] โ”œโ”€โ”€ Task 3: "transform_to_silver" [Notebook] depends_on: [Task 2] โ”‚ โ”œโ”€โ”€ Task 4a: "load_us_region" [Notebook] depends_on: [Task 3] โ”‚ โ””โ”€โ”€ Task 4b: "load_eu_region" [Notebook] depends_on: [Task 3] โ””โ”€โ”€ Task 5: "send_completion_alert" [Python] depends_on: [Task 4a, Task 4b] Tasks 4a and 4b run IN PARALLEL because they both depend on Task 3. Task 5 waits for BOTH 4a and 4b to complete.
JSON โ€” Multi-Task Job Definition (Databricks API)
{
  "name": "Daily Sales ETL",
  "job_clusters": [{
    "job_cluster_key": "etl-cluster",
    "new_cluster": {
      "spark_version": "14.3.x-scala2.12",
      "node_type_id": "Standard_DS4_v2",
      "num_workers": 4
    }
  }],
  "tasks": [
    {
      "task_key": "extract_raw_data",
      "notebook_task": {
        "notebook_path": "/Repos/myrepo/etl/01_extract",
        "base_parameters": {"date": "{{job.start_time.iso_date}}"}
      },
      "job_cluster_key": "etl-cluster"
    },
    {
      "task_key": "transform_to_silver",
      "depends_on": [{"task_key": "extract_raw_data"}],
      "notebook_task": {
        "notebook_path": "/Repos/myrepo/etl/02_transform"
      },
      "job_cluster_key": "etl-cluster"
    },
    {
      "task_key": "send_alert",
      "depends_on": [{"task_key": "transform_to_silver"}],
      "python_wheel_task": {
        "package_name": "my_etl_package",
        "entry_point": "send_completion_alert"
      },
      "job_cluster_key": "etl-cluster"
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",  // Daily at 6 AM UTC
    "timezone_id": "UTC"
  },
  "email_notifications": {
    "on_failure": ["data-team@company.com"],
    "on_success": ["data-team@company.com"]
  },
  "max_concurrent_runs": 1
}
Task Types

Each task in a Databricks Workflow can be one of several types, each suited for different workloads.

Task TypeUse CaseExample
NotebookInteractive notebooks in workspace/reposData exploration, EDA, quick transforms
Python ScriptStandalone .py files from ReposProduction ETL scripts
Python WheelPackaged Python librariesReusable framework code
SQLSQL queries on a SQL WarehouseTransformations in SQL
JARCompiled Scala/Java Spark jobsHigh-performance Spark jobs
DLT PipelineDelta Live Tables pipelinesStreaming + batch declarative ETL
dbtdbt models natively in WorkflowsSQL-based transformation layers
Dependencies & Scheduling

Tasks declare dependencies via depends_on. Databricks automatically runs tasks in the correct order and parallelizes tasks that don't depend on each other.

PYTHON โ€” Trigger a Workflow via Databricks SDK
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import RunNowRequest

# Initialize Databricks SDK client
w = WorkspaceClient(
    host="https://<your-workspace>.azuredatabricks.net",
    token="your_pat_token"
)

# Trigger an existing job with parameters
run = w.jobs.run_now(
    job_id=123456,
    notebook_params={
        "date": "2024-01-15",
        "env": "prod"
    }
)
print(f"Job run started: run_id={run.run_id}")

# Wait for completion
result = w.jobs.wait_get_run_job_terminated_or_skipped(run_id=run.run_id)
print(f"Job finished with state: {result.state.result_state}")
27.6

Unity Catalog

Unity Catalog is Databricks' unified governance solution. It provides a single place to manage all your data assets (tables, views, volumes, functions), control who can access what, and track data lineage โ€” across all your workspaces and cloud accounts.

๐Ÿ—„๏ธ
Three-Level Namespace: Catalogs, Schemas, Tables Foundation โ–ผ
The Three-Level Hierarchy

Unity Catalog organizes data in a 3-level hierarchy: Catalog โ†’ Schema (Database) โ†’ Table/View. You always reference objects with the full 3-part name: catalog.schema.table.

UNITY CATALOG NAMESPACE โ””โ”€โ”€ Catalog: "production" โ† top level, often = environment or domain โ”œโ”€โ”€ Schema: "sales" โ† like a database โ”‚ โ”œโ”€โ”€ Table: "orders" โ†’ production.sales.orders โ”‚ โ”œโ”€โ”€ Table: "customers" โ†’ production.sales.customers โ”‚ โ””โ”€โ”€ View: "active_customers" โ†’ production.sales.active_customers โ”œโ”€โ”€ Schema: "finance" โ”‚ โ”œโ”€โ”€ Table: "invoices" โ†’ production.finance.invoices โ”‚ โ””โ”€โ”€ Table: "payments" โ†’ production.finance.payments โ””โ”€โ”€ Schema: "_audit" โ””โ”€โ”€ Table: "access_logs" โ†’ production._audit.access_logs Catalog: "dev" โ””โ”€โ”€ Schema: "sales" โ””โ”€โ”€ Table: "orders" โ†’ dev.sales.orders (dev copy)
PYTHON / SQL โ€” Working with Unity Catalog
# โ”€โ”€โ”€ Create catalog, schema, table โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
spark.sql("CREATE CATALOG IF NOT EXISTS production")
spark.sql("CREATE SCHEMA IF NOT EXISTS production.sales")

spark.sql("""
  CREATE TABLE IF NOT EXISTS production.sales.orders (
    order_id     BIGINT,
    customer_id  BIGINT,
    order_date   DATE,
    amount       DECIMAL(10,2),
    status       STRING
  )
  USING DELTA
  COMMENT 'Fact table for all customer orders'
""")

# โ”€โ”€โ”€ Set default catalog/schema for the session โ”€โ”€โ”€โ”€โ”€โ”€
spark.sql("USE CATALOG production")
spark.sql("USE SCHEMA sales")

# Now you can use 1-part names (within sales schema)
df = spark.table("orders")         # = production.sales.orders
df2 = spark.table("customers")

# Always use 3-part names in production for clarity
df = spark.table("production.sales.orders")

# โ”€โ”€โ”€ Read/write with full 3-part names โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
df.write.mode("append").saveAsTable("production.sales.orders")

# โ”€โ”€โ”€ List catalogs/schemas/tables โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
spark.sql("SHOW CATALOGS").show()
spark.sql("SHOW SCHEMAS IN production").show()
spark.sql("SHOW TABLES IN production.sales").show()
Volumes

Volumes are Unity Catalog objects that represent a directory of files on cloud storage (S3, ADLS, GCS). They let you apply Unity Catalog governance (access control, lineage) to files โ€” not just tables.

SQL โ€” Creating and Using Volumes
-- Create a managed volume (Databricks manages the storage location)
CREATE VOLUME production.sales.raw_files;

-- Create an external volume (you specify the storage path)
CREATE EXTERNAL VOLUME production.sales.landing_zone
  LOCATION 's3://my-bucket/landing/'
  COMMENT 'Landing zone for incoming CSV files';

-- Access volume files via /Volumes path
-- Path format: /Volumes/catalog/schema/volume_name/...
PYTHON โ€” Reading from a Volume
# Read CSV from a Unity Catalog Volume
df = spark.read.csv(
    "/Volumes/production/sales/landing_zone/2024-01-15/orders.csv",
    header=True,
    inferSchema=True
)

# Write processed data back to a volume
df.write.parquet(
    "/Volumes/production/sales/processed_files/2024-01-15/"
)

# List files in a volume using dbutils
files = dbutils.fs.ls("/Volumes/production/sales/landing_zone/")
for f in files:
    print(f.name)
๐Ÿ›ก๏ธ
Permissions, Row & Column Security Security โ–ผ
Granting Permissions

Unity Catalog uses a standard SQL GRANT/REVOKE model. You grant privileges on catalog objects to users or groups. Permissions cascade down: granting USE CATALOG lets you see schemas; granting SELECT lets you read tables.

SQL โ€” Permission Management
-- Grant access to a catalog
GRANT USE CATALOG ON CATALOG production TO `data-analysts`;

-- Grant access to a schema
GRANT USE SCHEMA ON SCHEMA production.sales TO `data-analysts`;

-- Grant read access to a specific table
GRANT SELECT ON TABLE production.sales.orders TO `data-analysts`;

-- Grant write access to data engineers
GRANT SELECT, MODIFY ON TABLE production.sales.orders TO `data-engineers`;

-- Grant all privileges to table owner
GRANT ALL PRIVILEGES ON TABLE production.sales.orders TO `etl-service-account`;

-- Revoke access
REVOKE SELECT ON TABLE production.sales.customers FROM `data-analysts`;

-- View grants on a table
SHOW GRANTS ON TABLE production.sales.orders;
Row-Level Security

Row-Level Security (RLS) in Unity Catalog uses Row Filters โ€” functions that automatically filter rows based on the current user's identity or group membership. Users never see rows they're not authorized to see.

SQL โ€” Row Filter (Row-Level Security)
-- 1. Create a function that returns TRUE for rows the user can see
CREATE FUNCTION production.sales.region_filter(region_col STRING)
RETURN
  -- Admins see everything
  IS_ACCOUNT_GROUP_MEMBER('admin')
  OR
  -- Users in 'us-team' group see US rows only
  (IS_ACCOUNT_GROUP_MEMBER('us-team') AND region_col = 'US')
  OR
  -- Users in 'eu-team' group see EU rows only
  (IS_ACCOUNT_GROUP_MEMBER('eu-team') AND region_col = 'EU');

-- 2. Apply the row filter to the table
ALTER TABLE production.sales.orders
  SET ROW FILTER production.sales.region_filter(region);

-- Now when alice (in us-team) runs:
SELECT * FROM production.sales.orders;
-- She automatically sees ONLY rows where region = 'US'
Column-Level Security (Column Masks)

Column Masks hide or transform column values based on who is querying. A non-authorized user sees NULL or masked data in a sensitive column like SSN or credit card number.

SQL โ€” Column Mask (Column-Level Security)
-- 1. Create a masking function for SSN
CREATE FUNCTION production.sales.mask_ssn(ssn STRING)
RETURN
  CASE
    WHEN IS_ACCOUNT_GROUP_MEMBER('pii-access') THEN ssn   -- see real value
    ELSE CONCAT('XXX-XX-', RIGHT(ssn, 4))              -- masked
  END;

-- 2. Apply the mask to the SSN column
ALTER TABLE production.customers
  ALTER COLUMN ssn
  SET MASK production.sales.mask_ssn;

-- User WITHOUT pii-access sees: XXX-XX-1234
-- User WITH pii-access sees:    123-45-1234
27.7

Delta Live Tables (DLT)

Delta Live Tables is a declarative framework for building reliable ETL pipelines. Instead of writing imperative code (step by step), you declare what your tables should look like and define quality expectations โ€” Databricks figures out how to run it, handle failures, and maintain data quality automatically.

๐Ÿ”„
DLT Pipeline Creation & Core Concepts Core โ–ผ
Declarative vs Imperative Pipelines

In traditional Spark ETL, you write imperative code: read data, transform it, write it. In DLT, you write declarative definitions of what your tables should contain. DLT handles orchestration, retries, checkpointing, and lineage automatically.

๐Ÿ’ก Analogy
Traditional ETL: You are a chef who follows a recipe step by step โ€” "first boil water, then add pasta, then drain, then..."

DLT: You are a restaurant manager who says "I need pasta ready at 7pm" โ€” someone else figures out the steps, handles the equipment failures, and retries if the oven breaks.
PYTHON โ€” DLT Pipeline Definition
import dlt
from pyspark.sql import functions as F

# โ”€โ”€โ”€ BRONZE LAYER: Raw ingestion โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
@dlt.table(
    name="bronze_orders",
    comment="Raw orders loaded from CSV landing zone",
    table_properties={"quality": "bronze"}
)
def bronze_orders():
    return (
        spark.read.format("cloudFiles")  # Auto Loader
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("/Volumes/production/sales/landing_zone/")
    )

# โ”€โ”€โ”€ SILVER LAYER: Cleaned & validated โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
@dlt.table(
    name="silver_orders",
    comment="Cleaned and validated orders"
)
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("positive_amount", "amount > 0")
@dlt.expect("valid_status", "status IN ('pending','completed','cancelled')")
def silver_orders():
    return (
        dlt.read("bronze_orders")
        .withColumn("amount", F.col("amount").cast("decimal(10,2)"))
        .withColumn("order_date", F.to_date("order_date"))
        .withColumn("_loaded_at", F.current_timestamp())
        .dropDuplicates(["order_id"])
    )

# โ”€โ”€โ”€ GOLD LAYER: Aggregated business metrics โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
@dlt.table(
    name="gold_daily_sales",
    comment="Daily sales aggregations for reporting"
)
def gold_daily_sales():
    return (
        dlt.read("silver_orders")
        .filter(F.col("status") == "completed")
        .groupBy("order_date")
        .agg(
            F.sum("amount").alias("total_revenue"),
            F.count("order_id").alias("order_count"),
            F.avg("amount").alias("avg_order_value")
        )
    )
Expectations โ€” Data Quality Enforcement

DLT Expectations are data quality rules. You define them as SQL expressions, and DLT enforces them on every record. Three enforcement modes give you flexibility in how to handle bad data.

DecoratorAction on ViolationUse When
@dlt.expectRecord violation as metric, keep rowMonitor quality without losing data
@dlt.expect_or_dropDrop violating rows (quarantine)Bad rows must not reach Silver/Gold
@dlt.expect_or_failFail the entire pipelineCritical constraint โ€” halt on any violation
PYTHON โ€” All Expectation Types
import dlt

@dlt.table()
# WARN: track violations, keep all rows
@dlt.expect("valid_phone", "phone_number RLIKE '^[0-9]{10}$'")
# DROP: drop bad rows silently
@dlt.expect_or_drop("not_null_email", "email IS NOT NULL")
# FAIL: stop the pipeline on any violation
@dlt.expect_or_fail("primary_key_unique", "COUNT(DISTINCT customer_id) = COUNT(*)")
# Multiple expectations on same decorator
@dlt.expect_all({
    "age_valid": "age BETWEEN 0 AND 120",
    "city_not_null": "city IS NOT NULL",
    "country_code_valid": "LENGTH(country_code) = 2"
})
def silver_customers():
    return dlt.read("bronze_customers")
CDC Pipelines with DLT (APPLY CHANGES INTO)

DLT has a built-in CDC (Change Data Capture) API: APPLY CHANGES INTO. It automatically handles upserts and deletes from a CDC source (like Kafka/Debezium events) into a target Delta table โ€” no manual MERGE required.

PYTHON โ€” DLT CDC with APPLY CHANGES INTO
import dlt

# 1. Define source โ€” raw CDC events from Kafka
@dlt.table(name="cdc_events_raw")
def cdc_events_raw():
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")
        .option("subscribe", "orders-cdc")
        .load()
    )

# 2. Apply CDC changes to target table automatically
# DLT handles INSERT, UPDATE, DELETE from the CDC stream
dlt.apply_changes(
    target="silver_orders_cdc",     # target table to create/update
    source="cdc_events_raw",        # source CDC stream
    keys=["order_id"],              # primary key columns
    sequence_by="cdc_timestamp",   # ordering column (latest wins)
    apply_as_deletes=F.expr("op = 'd'"),   # 'd' events are deletes
    except_column_list=["op", "cdc_timestamp"]  # exclude CDC metadata cols
)
# DLT automatically handles SCD Type 1 (upsert) on silver_orders_cdc
27.8

Auto Loader

Auto Loader is Databricks' optimized solution for incrementally ingesting files from cloud storage (S3, ADLS, GCS). It automatically discovers new files as they arrive, processes only new data, handles schema evolution, and scales to millions of files efficiently.

๐Ÿ“ฅ
File Discovery, Incremental Ingestion & Schema Evolution Core โ–ผ
How Auto Loader Works

Auto Loader uses two modes to discover new files: Directory Listing (simple, lists all files and finds new ones) and File Notification (event-driven via S3 Event Notifications โ†’ SQS โ†’ Auto Loader, scales to millions of files). Auto Loader tracks which files have been processed using a checkpoint.

๐Ÿ’ก Analogy
Directory Listing mode: You check your mailbox every hour and process new letters.
File Notification mode: You get a doorbell alert the moment mail arrives. Much more efficient for high-volume scenarios.
PYTHON โ€” Auto Loader Full Example
from pyspark.sql import functions as F

# โ”€โ”€โ”€ Basic Auto Loader Ingestion โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
df = (
    spark.readStream
    .format("cloudFiles")             # Auto Loader format
    .option("cloudFiles.format", "json")  # underlying file format
    .option("cloudFiles.schemaLocation", "dbfs:/checkpoints/orders_schema")
    .load("s3://my-bucket/landing/orders/")
)

# Write to Delta table (streaming)
(
    df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "dbfs:/checkpoints/orders_bronze")
    .trigger(availableNow=True)         # process all available, then stop
    .toTable("production.sales.bronze_orders")
)

# โ”€โ”€โ”€ Auto Loader with Schema Evolution โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
df_evolved = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "dbfs:/schemas/orders")
    # Schema evolution options:
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    # addNewColumns: new columns added automatically
    # rescue: unknown columns saved to _rescued_data
    # failOnNewColumns: pipeline fails if new column found
    .option("cloudFiles.inferColumnTypes", "true")
    .option("header", "true")
    .load("s3://my-bucket/landing/orders/")
    .withColumn("_source_file", F.input_file_name())
    .withColumn("_ingested_at", F.current_timestamp())
)

# โ”€โ”€โ”€ File Notification Mode (for scale) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
df_notify = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.useNotifications", "true")  # event-driven mode
    .option("cloudFiles.region", "us-east-1")
    .option("cloudFiles.schemaLocation", "dbfs:/schemas/events")
    .load("s3://my-bucket/events/")
)
Trigger Modes for Auto Loader

Auto Loader integrates with Structured Streaming trigger modes to control when files are processed.

Trigger ModeBehaviorUse Case
trigger(availableNow=True)Process all new files, then stop cleanlyScheduled batch runs (most common)
trigger(processingTime="5 minutes")Run every 5 minutes continuouslyNear-real-time ingestion
trigger(once=True)Process one micro-batch, stop (legacy)Deprecated, use availableNow
27.9

MLflow

MLflow is an open-source platform for the complete ML lifecycle โ€” experiment tracking, model packaging, model registry, and model serving. It's deeply integrated into Databricks, with the MLflow tracking server hosted automatically in every workspace.

๐Ÿงช
Experiment Tracking Core โ–ผ
What is Experiment Tracking?

Every ML experiment produces metrics (accuracy, loss), parameters (learning rate, n_estimators), and artifacts (trained model files). MLflow automatically logs these so you can compare runs, reproduce results, and audit which model went to production.

๐Ÿ’ก Analogy
MLflow is like a lab notebook for scientists. Without it, you'd train a model, get great results, and then later not remember what settings you used. MLflow logs everything so you can reproduce your best experiment every time.
PYTHON โ€” MLflow Experiment Tracking
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
import pandas as pd

# Set the experiment (creates it if doesn't exist)
mlflow.set_experiment("/Users/alice@company.com/churn_prediction")

# โ”€โ”€โ”€ Manual logging โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
with mlflow.start_run(run_name="rf_100_trees"):
    # Log parameters
    n_estimators = 100
    max_depth = 10
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    mlflow.log_param("feature_count", X_train.shape[1])

    # Train model
    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=42
    )
    model.fit(X_train, y_train)

    # Log metrics
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("precision", precision)

    # Log the trained model
    mlflow.sklearn.log_model(model, artifact_path="model")

    # Log any file as an artifact
    mlflow.log_artifact("/tmp/feature_importance.png")

    print(f"Run ID: {mlflow.active_run().info.run_id}")
    print(f"Accuracy: {accuracy:.4f}")

# โ”€โ”€โ”€ Auto-logging (logs everything automatically!) โ”€โ”€โ”€โ”€โ”€
mlflow.sklearn.autolog()  # enable autologging for sklearn

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=200)
    model.fit(X_train, y_train)
    # MLflow automatically logs: params, metrics, model, feature importance!
๐Ÿ“ฆ
Model Registry & Model Serving Production โ–ผ
Model Registry

The MLflow Model Registry is a centralized store for managing the lifecycle of ML models. You register models from experiments, and promote them through stages: None โ†’ Staging โ†’ Production โ†’ Archived.

PYTHON โ€” Model Registry Workflow
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# 1. Register a model from a run
run_id = "abc123..."
model_uri = f"runs:/{run_id}/model"

registered = mlflow.register_model(
    model_uri=model_uri,
    name="churn_prediction_model"
)
print(f"Registered version: {registered.version}")

# 2. Transition to Staging for testing
client.transition_model_version_stage(
    name="churn_prediction_model",
    version=registered.version,
    stage="Staging"
)

# 3. After testing, promote to Production
client.transition_model_version_stage(
    name="churn_prediction_model",
    version=registered.version,
    stage="Production",
    archive_existing_versions=True  # archive old production model
)

# 4. Load the production model for inference
production_model = mlflow.pyfunc.load_model(
    "models:/churn_prediction_model/Production"
)
predictions = production_model.predict(X_new)

# 5. Use in Spark for batch scoring
import mlflow.pyfunc
batch_scorer = mlflow.pyfunc.spark_udf(
    spark,
    model_uri="models:/churn_prediction_model/Production",
    result_type="double"
)
df_scored = df.withColumn("churn_probability", batch_scorer(*feature_cols))
27.10

Databricks SQL

Databricks SQL (DBSQL) is the analytics and BI layer of the platform. SQL analysts and BI developers use it to run SQL queries against Delta Lake tables, build dashboards, and set up alerts โ€” without needing to know Spark or Python.

๐Ÿ›๏ธ
SQL Warehouses, Dashboards & Query Optimization Core โ–ผ
SQL Warehouses

A SQL Warehouse is a compute endpoint specifically for SQL workloads in Databricks SQL. Unlike a Spark cluster (general-purpose), a SQL Warehouse is optimized for fast, concurrent SQL queries and scales independently from your Spark clusters.

TypeHow it WorksBest For
ClassicTraditional cluster-per-warehouseBasic SQL workloads
ProPhoton-enabled, faster startupProduction BI workloads
ServerlessInstant startup, no cluster managementCost-effective, variable workloads
SQL โ€” Databricks SQL Queries & Features
-- Query optimization: use ANALYZE TABLE for CBO
ANALYZE TABLE production.sales.orders
  COMPUTE STATISTICS FOR ALL COLUMNS;

-- Result caching: identical queries served from cache
-- Automatic in Databricks SQL โ€” no action needed

-- Query history + explain
EXPLAIN SELECT * FROM production.sales.orders
WHERE order_date = '2024-01-15';

-- Create a named saved query (via UI or API)
-- Saved queries can be scheduled and used in dashboards

-- Parameterized query (for dashboards with filters)
SELECT *
FROM production.sales.orders
WHERE status = '{{status}}'          -- UI dropdown parameter
  AND order_date BETWEEN '{{start_date}}' AND '{{end_date}}';

-- Create an Alert: notify when a metric crosses a threshold
-- Example: alert when daily_orders < 100
SELECT COUNT(*) AS daily_orders
FROM production.sales.orders
WHERE order_date = CURRENT_DATE();
27.11

Databricks Connect

Databricks Connect lets you run PySpark code from your local IDE (VS Code, PyCharm, Jupyter) against a remote Databricks cluster. You write code locally, press run, and execution happens on the powerful cluster โ€” combining the comfort of local development with the power of Databricks.

๐Ÿ”Œ
Local Development & Remote Execution Core โ–ผ
How Databricks Connect Works

Databricks Connect v2 (DBR 13+) is built on Spark Connect โ€” a client-server architecture where the Spark driver runs on the Databricks cluster, not your laptop. Your local code sends logical plan instructions over gRPC to the remote driver, which executes them.

LOCAL MACHINE (VS Code / PyCharm / Jupyter) โ””โ”€โ”€ Python script with PySpark code โ””โ”€โ”€ databricks-connect library (client) โ””โ”€โ”€ gRPC over network โ†’ DATABRICKS CLUSTER (remote) โ””โ”€โ”€ Spark Driver (receives gRPC calls) โ””โ”€โ”€ Executes Spark jobs on workers โ””โ”€โ”€ Results stream back to local client
BASH โ€” Setup Databricks Connect
# 1. Install databricks-connect (match DBR version)
pip install databricks-connect==14.3.*

# 2. Configure connection
databricks configure --token

# OR set environment variables
export DATABRICKS_HOST="https://<workspace>.azuredatabricks.net"
export DATABRICKS_TOKEN="dapi..."
export DATABRICKS_CLUSTER_ID="0115-123456-abc123"
PYTHON โ€” Using Databricks Connect in local IDE
from databricks.connect import DatabricksSession

# Connect to remote Databricks cluster
spark = DatabricksSession.builder.getOrCreate()

# Now use spark exactly as you would in a notebook!
# This runs on the remote cluster, not your laptop
df = spark.table("production.sales.orders")
print(f"Row count: {df.count()}")   # executes on cluster
df.show(5)                          # results come back locally

# Transformations work the same
result = (
    df
    .filter("status = 'completed'")
    .groupBy("customer_id")
    .agg({"amount": "sum"})
    .sort("sum(amount)", ascending=False)
)
result.show()

# Explicitly connect with config
spark2 = (
    DatabricksSession.builder
    .remote(
        host="https://<workspace>.azuredatabricks.net",
        token="dapi...",
        cluster_id="0115-123456-abc123"
    )
    .getOrCreate()
)
โš ๏ธ Limitation
Databricks Connect v2 doesn't support RDDs or SparkContext operations (it's DataFrame/SQL only). Also, dbutils is not available locally โ€” use the Databricks SDK for file system operations instead.
27.12

Databricks Asset Bundles (DABs)

Databricks Asset Bundles (DABs) is the official CI/CD framework for Databricks. You define your entire deployment โ€” notebooks, workflows, clusters, DLT pipelines, permissions โ€” as YAML configuration files, and deploy them to any environment (dev/staging/prod) with a single command.

๐Ÿ“ฆ
CI/CD & Deployment Automation DevOps โ–ผ
Bundle Structure

A Databricks Asset Bundle is a project folder containing source code + a databricks.yml configuration file that defines all the Databricks resources to deploy.

my-etl-project/ โ”œโ”€โ”€ databricks.yml โ† Root bundle config โ”œโ”€โ”€ src/ โ”‚ โ”œโ”€โ”€ notebooks/ โ”‚ โ”‚ โ”œโ”€โ”€ 01_extract.py โ”‚ โ”‚ โ””โ”€โ”€ 02_transform.py โ”‚ โ””โ”€โ”€ etl/ โ”‚ โ””โ”€โ”€ pipeline.py โ”œโ”€โ”€ resources/ โ”‚ โ”œโ”€โ”€ jobs/ โ”‚ โ”‚ โ””โ”€โ”€ daily_etl_job.yml โ† Job definition โ”‚ โ””โ”€โ”€ pipelines/ โ”‚ โ””โ”€โ”€ dlt_pipeline.yml โ† DLT pipeline definition โ””โ”€โ”€ tests/ โ””โ”€โ”€ test_transformations.py
YAML โ€” databricks.yml (Root Bundle Config)
bundle:
  name: daily-sales-etl

# Workspace configuration per environment
targets:
  dev:
    workspace:
      host: https://dev-workspace.azuredatabricks.net
    default: true

  staging:
    workspace:
      host: https://staging-workspace.azuredatabricks.net

  prod:
    workspace:
      host: https://prod-workspace.azuredatabricks.net
    mode: production   # adds "prod" prefix to all resource names

# Include resource files
include:
  - resources/*.yml

# Variables (overrideable per target)
variables:
  catalog:
    description: Target catalog name
    default: dev
  env:
    description: Environment name
    default: dev
YAML โ€” resources/jobs/daily_etl_job.yml
resources:
  jobs:
    daily_etl_job:
      name: "Daily Sales ETL [${bundle.target}]"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "UTC"
      
      job_clusters:
        - job_cluster_key: etl-cluster
          new_cluster:
            spark_version: 14.3.x-scala2.12
            node_type_id: Standard_DS4_v2
            num_workers: 4

      tasks:
        - task_key: extract
          job_cluster_key: etl-cluster
          notebook_task:
            notebook_path: ../src/notebooks/01_extract.py
            base_parameters:
              catalog: ${var.catalog}
              env: ${var.env}

        - task_key: transform
          depends_on:
            - task_key: extract
          job_cluster_key: etl-cluster
          notebook_task:
            notebook_path: ../src/notebooks/02_transform.py

      email_notifications:
        on_failure:
          - data-team@company.com
BASH โ€” DABs CLI Commands
# Validate the bundle (check for errors)
databricks bundle validate

# Deploy to dev (default target)
databricks bundle deploy

# Deploy to specific target
databricks bundle deploy --target staging
databricks bundle deploy --target prod

# Run a specific job from the bundle
databricks bundle run daily_etl_job

# Run with variable overrides
databricks bundle run daily_etl_job --var="catalog=production"

# Destroy all deployed resources
databricks bundle destroy --target dev

# GitHub Actions CI/CD workflow
# .github/workflows/deploy.yml
# - on push to main: bundle deploy --target prod
# - on push to feature: bundle deploy --target dev
27.13

Serverless Compute

Serverless compute in Databricks means no cluster management at all. You submit your job or query and Databricks instantly provides compute from a pre-warmed pool โ€” no 5-minute cluster startup wait, and you only pay for the exact time your code runs.

โ˜๏ธ
Serverless Jobs & Serverless SQL Modern โ–ผ
Serverless Jobs

Serverless Jobs run notebooks and Python scripts on serverless compute. The cluster is provisioned in seconds (not minutes) from Databricks' pre-warmed pool of machines. You configure the job โ€” not the cluster.

AspectClassic Job ClusterServerless Jobs
Startup time5โ€“10 minutes~5โ€“30 seconds
Cluster configYou configure node type, workersAutomatic โ€” Databricks manages
CostDBU per VM hourDBU per compute second (finer billing)
AvailabilityAll regionsSelect regions (growing)
JSON โ€” Serverless Job Config
{
  "name": "Serverless ETL Job",
  "tasks": [{
    "task_key": "main_task",
    "notebook_task": {
      "notebook_path": "/Repos/myrepo/etl/main"
    },
    // No job_cluster_key! Use serverless instead:
    "environment_key": "serverless_env"
  }],
  "environments": [{
    "environment_key": "serverless_env",
    "spec": {
      "client": "1",
      "dependencies": [
        "pandas==2.0.0",
        "scikit-learn==1.3.0"
      ]
    }
  }]
}
Serverless SQL Warehouses

Serverless SQL Warehouses start instantly (seconds vs minutes for classic) and scale to zero when idle โ€” you pay only when queries are running. This is the recommended choice for most Databricks SQL workloads today.

๐Ÿ“‹ Real-World Impact
A team runs 50 ad-hoc queries per day. With a Classic Pro warehouse that idles between queries (30 min auto-suspend), they pay for 30 min ร— idle periods. With Serverless, they pay only for the seconds each query runs. Monthly savings: 60โ€“80% for sporadic workloads.
27.14

Lakeflow

Lakeflow is Databricks' next-generation data engineering product that unifies Delta Live Tables (pipeline authoring) with Workflows (orchestration) into a single, declarative, end-to-end data pipeline platform. It's the future direction of ETL on Databricks.

๐ŸŒŠ
Declarative Pipelines & Workflow Management Emerging โ–ผ
What is Lakeflow?

Lakeflow combines three key components:

๐Ÿ”„
Lakeflow Pipelines
The next evolution of Delta Live Tables โ€” declarative, streaming + batch, automatic quality enforcement
๐Ÿ“‹
Lakeflow Jobs
Unified orchestration โ€” run any task type (notebooks, DLT pipelines, dbt, Python) with dependencies
๐Ÿ”
Lakeflow Observability
Built-in lineage, data quality metrics, and pipeline health monitoring via Unity Catalog
๐Ÿ’ก Analogy
Before Lakeflow: Your data pipeline is like a relay race where each runner uses different equipment and there's no unified scoreboard.

With Lakeflow: Everyone runs the same race on the same track with a unified dashboard showing every runner's position, speed, and any falls in real-time.
Lakeflow Declarative Pipelines

Lakeflow Pipelines extend DLT with more powerful features: materialized views that auto-refresh, streaming tables that ingest continuously, and unified mode that seamlessly blends batch and streaming in one pipeline.

PYTHON โ€” Lakeflow / DLT v2 Concepts
import dlt
from pyspark.sql import functions as F

# STREAMING TABLE: continuously ingests new data
# Best for: source data that keeps arriving
@dlt.table()
def orders_streaming():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/prod/landing/orders/")
    )

# MATERIALIZED VIEW: auto-refreshes on trigger
# Best for: aggregations over complete datasets
@dlt.view()
def daily_revenue_mv():
    return (
        dlt.read("orders_streaming")
        .groupBy(F.to_date("event_time").alias("date"))
        .agg(F.sum("amount").alias("revenue"))
    )

# PIPELINE configuration: set update mode
# continuous: always running (streaming)
# triggered: run on schedule (batch-like)
# Set in pipeline UI or databricks.yml:
# continuous: true   โ† for streaming pipelines
# continuous: false  โ† for scheduled batch pipelines
Lakeflow & Unity Catalog Integration

When you use Lakeflow Pipelines with Unity Catalog, you get automatic data lineage โ€” Unity Catalog tracks exactly which source data fed which transformation which produced which output table. You can visualize the entire pipeline graph without any manual instrumentation.

๐Ÿ“Œ What Lakeflow Gives You Automatically
โœ… Column-level lineage (source column โ†’ target column mapping)
โœ… Data quality metrics per table (tracked over time)
โœ… Pipeline event history (what ran, when, what changed)
โœ… Automatic schema documentation in Unity Catalog
โœ… Integrated monitoring in Databricks Workflows UI
Module Summary โ€” What You've Learned

You've now covered the complete Databricks platform. Here's a quick mental model of how everything fits together:

DATABRICKS PLATFORM โ€” HOW IT ALL FITS CODE & DEVELOPMENT โ”œโ”€โ”€ Workspace + Repos โ†’ Where your code lives (Git-synced) โ”œโ”€โ”€ Notebooks โ†’ Interactive development + dbutils โ””โ”€โ”€ Databricks Connect โ†’ Code in your local IDE, run on remote cluster COMPUTE โ”œโ”€โ”€ Interactive Clusters โ†’ Development (shared, long-running) โ”œโ”€โ”€ Job Clusters โ†’ Production (ephemeral, cost-efficient) โ”œโ”€โ”€ Serverless Compute โ†’ Instant startup, fully managed โ””โ”€โ”€ SQL Warehouses โ†’ Optimized for SQL analytics + BI DATA & GOVERNANCE โ”œโ”€โ”€ Unity Catalog โ†’ Catalog.Schema.Table namespace + access control โ”œโ”€โ”€ Delta Lake โ†’ ACID storage layer (under everything) โ””โ”€โ”€ Volumes โ†’ File-level governance in Unity Catalog PIPELINES & ORCHESTRATION โ”œโ”€โ”€ Workflows / Jobs โ†’ Multi-task orchestration with dependencies โ”œโ”€โ”€ Delta Live Tables โ†’ Declarative ETL with quality expectations โ”œโ”€โ”€ Auto Loader โ†’ Incremental file ingestion from cloud storage โ””โ”€โ”€ Lakeflow โ†’ Unified next-gen pipeline platform ML & ANALYTICS โ”œโ”€โ”€ MLflow โ†’ Experiment tracking, model registry, serving โ””โ”€โ”€ Databricks SQL โ†’ BI queries, dashboards, alerts DEPLOYMENT โ””โ”€โ”€ Asset Bundles (DABs) โ†’ CI/CD as code โ€” deploy everything via YAML