CI/CD & Production Deployment
Shipping PySpark pipelines to production reliably requires automated testing, packaging, environment promotion, and infrastructure management. This module covers the complete deployment lifecycle — from a developer's commit to a running production Spark job — using GitHub Actions, GitLab CI, Jenkins, Terraform, and Databricks Asset Bundles.
CI/CD Fundamentals for Data Engineering
This section explains what CI and CD mean for Spark pipeline development and how they differ from traditional software CI/CD.
For Spark pipelines, CI typically means: lint the Python code → run unit tests with a local SparkSession → validate schemas → build the package artifact.
- Continuous Delivery: Code is always ready to deploy; a human approves the final push to prod
- Continuous Deployment: Fully automated — code goes all the way to production without human gates (less common in DE)
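The "validate schemas" step in the CI sequence above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the function name `schema_mismatches` and the `EXPECTED` schema are hypothetical, and the `actual` list stands in for what a PySpark DataFrame exposes as `df.dtypes` (a list of `(column, type)` tuples).

```python
from typing import Dict, List, Tuple


def schema_mismatches(
    actual: List[Tuple[str, str]],
    expected: Dict[str, str],
) -> List[str]:
    """Return human-readable differences between an actual and an expected schema."""
    actual_map = dict(actual)
    problems = []
    # Every expected column must exist with the expected type
    for column, expected_type in expected.items():
        if column not in actual_map:
            problems.append(f"missing column: {column}")
        elif actual_map[column] != expected_type:
            problems.append(
                f"type mismatch for {column}: "
                f"expected {expected_type}, got {actual_map[column]}"
            )
    # Columns nobody asked for are reported as well
    for column in actual_map:
        if column not in expected:
            problems.append(f"unexpected column: {column}")
    return problems


# Hypothetical expected schema for the customer table
EXPECTED = {"id": "bigint", "name": "string", "email": "string"}

# In a real test this list would come from df.dtypes
actual = [("id", "bigint"), ("name", "string"), ("email_address", "string")]
for problem in schema_mismatches(actual, EXPECTED):
    print(problem)
```

A CI test could assert that the returned list is empty, so a renamed or retyped column fails the build before the package is created.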
# .github/workflows/ci.yml (lives in your repo)
# Pushes to main or develop, and pull requests targeting main, trigger this pipeline
name: PySpark CI Pipeline
on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run tests
        run: pytest tests/ -v
Artifact versioning follows semantic versioning: MAJOR.MINOR.PATCH, e.g. 1.4.2.
# Read the version from the latest Git tag (strip the leading "v" from tags like v1.4.2)
VERSION=$(git describe --tags --abbrev=0 | sed 's/^v//')
# Build the wheel file
python setup.py bdist_wheel
# Output: dist/my_pipeline-1.4.2-py3-none-any.whl
# Upload to S3 artifact store
aws s3 cp "dist/my_pipeline-${VERSION}-py3-none-any.whl" \
  s3://my-artifacts/pyspark-pipelines/
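Release tags are usually written with a leading "v" (v1.4.2), while wheel filenames carry the bare version (1.4.2). The sketch below shows one way to derive the wheel filename from a tag and to reject tags that are not MAJOR.MINOR.PATCH. The helper names `version_from_tag` and `wheel_filename` are hypothetical, and the filename pattern assumes a pure-Python wheel as in the example output above.

```python
import re

# Accepts "1.4.2" or "v1.4.2"; anything else is rejected
SEMVER_TAG = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def version_from_tag(tag: str) -> str:
    """Turn a Git tag such as 'v1.4.2' into the bare version '1.4.2'."""
    match = SEMVER_TAG.match(tag.strip())
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH tag: {tag!r}")
    return ".".join(match.groups())


def wheel_filename(package: str, tag: str) -> str:
    """Build the filename of a pure-Python wheel for the given package and tag."""
    return f"{package}-{version_from_tag(tag)}-py3-none-any.whl"


print(wheel_filename("my_pipeline", "v1.4.2"))
# prints my_pipeline-1.4.2-py3-none-any.whl
```

Failing fast on a malformed tag keeps a mistyped release such as "latest" from producing an artifact with an unusable name.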
GitHub Actions
GitHub Actions is one of the most widely used CI/CD tools for open-source and enterprise data engineering. You define workflows in YAML, trigger them on push, pull request, or schedule, then run tests, build packages, and deploy to Databricks or EMR.
# .github/workflows/spark_pipeline_ci.yml
name: Spark Pipeline CI/CD  # Display name in GitHub UI
# --- TRIGGERS ---
on:
  push:
    branches: [main, develop]  # Run on push to these branches
  pull_request:
    branches: [main]  # Run on PRs targeting main
  schedule:
    - cron: '0 6 * * 1'  # Every Monday at 6am UTC
# --- JOBS ---
jobs:
  ci:  # Job ID
    runs-on: ubuntu-latest  # Runner environment
    env:
      JAVA_HOME: /usr/lib/jvm/java-11-openjdk-amd64
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Set up Java (Spark needs Java)
        uses: actions/setup-java@v4
        with:
          distribution: 'temurin'
          java-version: '11'
      - name: Install Python dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt
      - name: Lint with flake8
        run: flake8 src/ --max-line-length=120
      - name: Format check with black
        run: black --check src/
      - name: Run unit tests
        run: pytest tests/ -v --tb=short
      - name: Build wheel package
        run: python setup.py bdist_wheel
      - name: Upload wheel artifact
        uses: actions/upload-artifact@v4
        with:
          name: pipeline-wheel
          path: dist/*.whl
| Trigger | When it fires | Use case |
|---|---|---|
| push | On every commit push | Run tests on every code change |
| pull_request | When a PR is opened/updated | Gate merges on test pass |
| schedule | On a cron schedule | Nightly regression test suite |
| workflow_dispatch | Manual trigger via GitHub UI | Manual deploy to prod |
| release | When a GitHub Release is published | Deploy on official release tag |
Unit tests run against a local SparkSession created with SparkSession.builder.master("local[2]"), so no cluster is needed.
# tests/conftest.py
import pytest
from pyspark.sql import SparkSession
@pytest.fixture(scope="session")
def spark():
    """Create a local SparkSession for the entire test session."""
    spark = (
        SparkSession.builder
        .master("local[2]")  # 2 local threads
        .appName("CITests")
        .config("spark.sql.shuffle.partitions", "2")  # small for tests
        .getOrCreate()
    )
    yield spark
    spark.stop()  # shut the session down once all tests have finished
# tests/test_transforms.py
from src.transforms import clean_customer_data
def test_null_email_removed(spark):
    # Arrange
    data = [(1, "Alice", "alice@test.com"),
            (2, "Bob", None)]  # null email, should be dropped
    df = spark.createDataFrame(data, ["id", "name", "email"])
    # Act
    result = clean_customer_data(df)
    # Assert
    assert result.count() == 1
    assert result.first()["name"] == "Alice"
# Install linting tools
pip install flake8 black
# Check for style errors (E: errors, W: warnings, F: pyflakes)
flake8 src/ --max-line-length=120 --ignore=E203,W503
# Check formatting (--check means "would black change this?" — exits non-zero if yes)
black --check src/
# Auto-format (run locally, not in CI — CI just checks)
black src/
# .flake8 config file in project root
cat .flake8
# [flake8]
# max-line-length = 120
# ignore = E203, W503
# exclude = .git, __pycache__, dist
  # These jobs continue the jobs: section of the workflow
  build-and-publish:
    needs: [ci]  # only runs if the CI job passes
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'  # only on main branch
    steps:
      - uses: actions/checkout@v4
      - name: Build wheel
        run: python setup.py bdist_wheel
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Upload wheel to S3
        run: |
          VERSION=$(python setup.py --version)
          aws s3 cp dist/*.whl \
            "s3://my-artifacts/pyspark-pipelines/v${VERSION}/"
  deploy-databricks:
    needs: [build-and-publish]
    runs-on: ubuntu-latest
    env:
      # The Databricks CLI reads these variables directly, so no separate configure step is needed
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - name: Download the wheel uploaded by the ci job
        uses: actions/download-artifact@v4
        with:
          name: pipeline-wheel
          path: dist
      - name: Install Databricks CLI
        # bundle commands need the newer Databricks CLI; the legacy pip package databricks-cli does not provide them
        uses: databricks/setup-cli@main
      - name: Upload wheel to DBFS
        run: databricks fs cp dist/*.whl dbfs:/libraries/ --overwrite
      - name: Deploy using Asset Bundles
        run: |
          # Reads databricks.yml in the repo root
          # Deploys jobs, pipelines, notebooks defined there
          databricks bundle deploy --target prod
Reference secrets with ${{ secrets.SECRET_NAME }}. Never hardcode a value such as DATABRICKS_TOKEN: "dapi1234abc..." in a workflow file: tokens committed to Git are a critical security vulnerability and will be exposed in all forks and clones.
env:
  # ✅ Correct: reference from the secrets store
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
# Set secrets: GitHub repo → Settings → Secrets and variables → Actions → New repository secret
# They are masked in all logs automatically
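On the Python side, a pipeline entry point can read the secrets that the workflow injects as environment variables and fail early when one is missing. The sketch below is an assumption about how that might look, not part of any library: `load_secrets` is a hypothetical helper, and the variable names are the ones used in the workflow above.

```python
import os
from typing import Dict, Iterable, Mapping


def load_secrets(
    names: Iterable[str],
    env: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Return the requested secrets, raising if any is unset or empty."""
    names = list(names)
    missing = [name for name in names if not env.get(name)]
    if missing:
        # Report only the variable names, never any values
        raise RuntimeError("missing required secrets: " + ", ".join(missing))
    return {name: env[name] for name in names}


# Usage with a stand-in environment; in CI the default os.environ is used
fake_env = {"DATABRICKS_HOST": "https://example.invalid", "DATABRICKS_TOKEN": "not-a-real-token"}
secrets = load_secrets(["DATABRICKS_HOST", "DATABRICKS_TOKEN"], env=fake_env)
print(sorted(secrets))
```

Passing the environment as a parameter keeps the helper testable without touching real credentials.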
GitLab CI
GitLab CI uses a .gitlab-ci.yml file at the repo root. It supports stages, shared runners (cloud) or self-hosted runners, artifact storage, and environment-specific deployments — popular in enterprise data engineering.
# .gitlab-ci.yml: PySpark Pipeline CI/CD
# Define all stages (run top to bottom)
stages:
  - lint
  - test
  - build
  - deploy-dev
  - deploy-prod
# Global variables available to all jobs
variables:
  PYTHON_VERSION: "3.11"
  PYSPARK_VERSION: "3.5.0"
# --- STAGE: lint ---
lint-code:
  stage: lint
  image: python:3.11-slim
  script:
    - pip install flake8 black
    - flake8 src/ --max-line-length=120
    - black --check src/
  only:
    - merge_requests  # Run on every MR
    - main
# --- STAGE: test ---
unit-tests:
  stage: test
  image: python:3.11
  before_script:
    - apt-get update && apt-get install -y default-jdk
    - pip install pyspark==$PYSPARK_VERSION pytest pytest-cov
    - pip install -r requirements.txt
  script:
    # term prints the TOTAL line that the coverage regex parses; xml feeds the report artifact
    - pytest tests/ -v --cov=src --cov-report=term --cov-report=xml
  coverage: '/TOTAL.*\s+(\d+%)$/'  # Parse coverage from output
  artifacts:
    reports:
      coverage_report:
        coverage_format: cobertura
        path: coverage.xml
# --- STAGE: build ---
build-wheel:
  stage: build
  image: python:3.11
  script:
    - pip install wheel setuptools
    - python setup.py bdist_wheel
  artifacts:
    paths:
      - dist/*.whl  # Store the wheel for downstream jobs
    expire_in: 30 days
  only:
    - main
# --- STAGE: deploy-dev ---
deploy-to-dev:
  stage: deploy-dev
  image:
    name: amazon/aws-cli
    entrypoint: [""]  # the image's default entrypoint is "aws", which would break script lines
  script:
    - aws s3 cp dist/*.whl s3://artifacts-dev/pipelines/
    # Assumes the Databricks CLI is also available in this image
    - databricks bundle deploy --target dev
  environment:
    name: development
  only:
    - main
# --- STAGE: deploy-prod ---
deploy-to-prod:
  stage: deploy-prod
  image:
    name: amazon/aws-cli
    entrypoint: [""]
  script:
    - aws s3 cp dist/*.whl s3://artifacts-prod/pipelines/
    # Assumes the Databricks CLI is also available in this image
    - databricks bundle deploy --target prod
  environment:
    name: production
  when: manual  # Requires a human to click "Deploy" in GitLab UI
  only:
    - main
# Install GitLab Runner on your server or EC2 instance
curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh | sudo bash
sudo apt-get install gitlab-runner
# Register it with your GitLab instance
sudo gitlab-runner register \
--url https://gitlab.com \
--registration-token YOUR_TOKEN \
--executor docker \
--docker-image python:3.11 \
--description "spark-runner" \
--tag-list spark,pyspark # Jobs can target this runner by tag
# In .gitlab-ci.yml, use this runner with tags:
# tags:
# - spark
Build outputs are persisted with the artifacts: keyword. Jobs in later stages can download artifacts from earlier stages automatically.
# Build stage creates the artifact
build-wheel:
  stage: build
  script:
    - python setup.py bdist_wheel
  artifacts:
    paths:
      - dist/  # Persist this directory
    expire_in: 7 days  # Auto-delete after 7 days
# Deploy stage automatically has access to dist/ from build stage
deploy-to-dev:
  stage: deploy-dev
  dependencies:
    - build-wheel  # Download artifacts from this job
  script:
    - ls dist/  # Can access the .whl file built above
    - aws s3 cp dist/*.whl s3://artifacts-dev/
deploy-emr:
  stage: deploy-prod
  image:
    name: amazon/aws-cli
    entrypoint: [""]  # the image's default entrypoint is "aws", which would break script lines
  variables:
    AWS_DEFAULT_REGION: us-east-1
  script:
    # Upload PySpark script and wheel to S3
    - aws s3 cp src/main_pipeline.py s3://my-emr-scripts/
    - aws s3 cp dist/*.whl s3://my-emr-scripts/
    # Submit an EMR step (the folded scalar joins these lines into one command)
    - >-
      aws emr add-steps
      --cluster-id "$CLUSTER_ID"
      --steps 'Type=Spark,Name=DailyPipeline,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--py-files,s3://my-emr-scripts/my_pipeline.whl,s3://my-emr-scripts/main_pipeline.py]'
  environment:
    name: production
  when: manual
Jenkins
Jenkins is one of the most mature CI/CD tools and is widely used in large enterprises. A Jenkinsfile (stored in your repo) defines declarative pipelines with stages, error handling, and cloud integrations.
Declarative Pipeline syntax wraps the whole build in a pipeline { } block with sections for agent, environment, stages, and post-build actions. It's more readable than the older Scripted syntax.
// Jenkinsfile: PySpark Pipeline Build & Deploy
pipeline {
agent {
docker {
image 'python:3.11' // Run all stages in this Docker image
args '-v /var/run/docker.sock:/var/run/docker.sock'
}
}
environment {
DATABRICKS_HOST = credentials('databricks-host') // Jenkins credential store
DATABRICKS_TOKEN = credentials('databricks-token')
AWS_CREDENTIALS = credentials('aws-de-credentials') // UsernamePassword binding
}
stages {
stage('Checkout') {
steps {
checkout scm // Pull code from Git
}
}
stage('Install Dependencies') {
steps {
sh '''
pip install --upgrade pip
pip install -r requirements.txt
apt-get install -y default-jdk -q
'''
}
}
stage('Lint') {
steps {
sh 'flake8 src/ --max-line-length=120'
sh 'black --check src/'
}
}
stage('Unit Tests') {
steps {
sh 'pytest tests/ -v --junitxml=test-results.xml'
}
post {
always {
junit 'test-results.xml' // Publish test results in Jenkins UI
}
}
}
stage('Build Wheel') {
steps {
sh 'python setup.py bdist_wheel'
archiveArtifacts artifacts: 'dist/*.whl' // Store in Jenkins
}
}
stage('Deploy to Dev') {
when {
branch 'main' // Only on main branch
}
steps {
sh '''
databricks fs cp dist/*.whl dbfs:/libraries/ --overwrite
databricks jobs run-now --job-id $DEV_JOB_ID
'''
}
}
stage('Deploy to Prod') {
when {
branch 'main'
}
input {
message "Deploy to PRODUCTION?" // Human approval gate
ok "Yes, deploy"
}
steps {
sh 'databricks bundle deploy --target prod'
}
}
}
post {
failure {
emailext subject: "Pipeline FAILED: ${env.JOB_NAME} #${env.BUILD_NUMBER}",
body: "Check Jenkins: ${env.BUILD_URL}",
to: 'data-team@company.com'
}
success {
echo 'Pipeline completed successfully!'
}
}
}
// Use withAWS() from the Pipeline: AWS Steps plugin
stage('Upload to S3') {
steps {
withAWS(credentials: 'aws-de-credentials', region: 'us-east-1') {
s3Upload(
file: 'dist/my_pipeline-1.0.0-py3-none-any.whl',
bucket: 'my-artifacts',
path: 'pyspark-pipelines/'
)
}
}
}
// Trigger an EMR step
stage('Run EMR Job') {
steps {
withAWS(credentials: 'aws-de-credentials', region: 'us-east-1') {
sh """
aws emr add-steps \
--cluster-id j-XXXXXXXXXXX \
--steps Type=Spark,Name=DailyETL,\
ActionOnFailure=CONTINUE,\
Args=[--deploy-mode,cluster,\
s3://my-scripts/main.py]
"""
}
}
}
Packaging
Packaging your PySpark code makes it deployable, versionable, and installable on Spark clusters. The two main formats are Wheel (.whl) files for Python packages and ZIP files for script bundles. Good dependency management prevents "works on my machine" problems.
A wheel (.whl) is a built Python package that can be installed with pip install and distributed to Spark clusters. You need a setup.py or pyproject.toml to define package metadata.
my_pipeline/                  # Project root
├── setup.py                  # or pyproject.toml
├── requirements.txt
├── src/
│   └── my_pipeline/
│       ├── __init__.py
│       ├── transforms.py     # Your PySpark code
│       ├── utils.py
│       └── config.py
└── tests/
    ├── conftest.py
    └── test_transforms.py
# setup.py
from setuptools import setup, find_packages
setup(
name="my_pipeline",
version="1.4.2",
description="Customer data pipeline",
author="Data Engineering Team",
python_requires=">=3.9",
packages=find_packages(where="src"),
package_dir={"": "src"},
install_requires=[
"pyspark==3.5.0",
"delta-spark==3.1.0",
"boto3>=1.34.0",
],
extras_require={
"dev": ["pytest", "black", "flake8"]
}
)
# pyproject.toml — modern Python packaging
[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"
[project]
name = "my_pipeline"
version = "1.4.2"
description = "Customer data pipeline"
requires-python = ">=3.9"
dependencies = [
"pyspark==3.5.0",
"delta-spark==3.1.0",
"boto3>=1.34.0",
]
[project.optional-dependencies]
dev = ["pytest", "black", "flake8"]
[tool.setuptools.packages.find]
where = ["src"]
# Build using setuptools (setup.py way)
python setup.py bdist_wheel
# Output: dist/my_pipeline-1.4.2-py3-none-any.whl
# Build using build module (pyproject.toml way — modern)
pip install build
python -m build
# Output: dist/my_pipeline-1.4.2-py3-none-any.whl
# Install the wheel to verify it works
pip install dist/my_pipeline-1.4.2-py3-none-any.whl
python -c "from my_pipeline.transforms import clean_customer_data; print('OK')"
# Upload wheel to S3 artifact store
aws s3 cp dist/my_pipeline-1.4.2-py3-none-any.whl \
s3://my-artifacts/pyspark-pipelines/
# In spark-submit — distribute wheel to all executors
spark-submit \
--py-files s3://my-artifacts/pyspark-pipelines/my_pipeline-1.4.2-py3-none-any.whl \
main.py
# In Databricks — install as a cluster library
databricks libraries install \
--cluster-id 0101-XXXXX \
--whl s3://my-artifacts/pyspark-pipelines/my_pipeline-1.4.2-py3-none-any.whl
A ZIP archive of your source code can also be shipped to the cluster with --py-files. Simpler than wheels but less clean for complex packages.
# Zip only your source code
cd src/
zip -r ../my_pipeline.zip my_pipeline/
cd ..
# Or zip with installed dependencies (venv approach)
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Zip the installed packages
cd venv/lib/python3.11/site-packages/
zip -r ../../../../dependencies.zip .
cd ../../../../
# Use in spark-submit
spark-submit \
--py-files my_pipeline.zip,dependencies.zip \
main.py
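The same source zip can be built from Python, which is convenient inside a build script. The sketch below is illustrative only: `build_source_zip` and the `demo_pipeline` package are hypothetical names, and the demo package is created in a temporary directory. It also shows why the archive layout matters: the package directory must sit at the root of the zip so that Python can import it once the archive is on the module search path.

```python
import sys
import tempfile
import zipfile
from pathlib import Path


def build_source_zip(src_dir: Path, package: str, zip_path: Path) -> Path:
    """Zip src_dir/package so the package directory sits at the archive root."""
    package_dir = src_dir / package
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(package_dir.rglob("*.py")):
            # Store paths relative to src_dir, e.g. demo_pipeline/utils.py
            archive.write(path, path.relative_to(src_dir).as_posix())
    return zip_path


with tempfile.TemporaryDirectory() as tmp:
    # Create a tiny throwaway package to zip
    src = Path(tmp) / "src"
    (src / "demo_pipeline").mkdir(parents=True)
    (src / "demo_pipeline" / "__init__.py").write_text("")
    (src / "demo_pipeline" / "utils.py").write_text("def double(x):\n    return 2 * x\n")

    archive_path = build_source_zip(src, "demo_pipeline", Path(tmp) / "demo_pipeline.zip")

    # Importing straight from the zip mimics what happens on the executors
    sys.path.insert(0, str(archive_path))
    from demo_pipeline.utils import double
    print(double(21))  # prints 42
    sys.path.remove(str(archive_path))
```

If the zip were built from one directory higher, the import path would become src.demo_pipeline and imports written against demo_pipeline would fail on the cluster.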
--py-files is the spark-submit argument that distributes .py files, .zip files, or .egg files to all executor nodes. Without it, your custom modules are not available on executors.
# Single zip file
spark-submit \
--master yarn \
--deploy-mode cluster \
--py-files my_pipeline.zip \
main.py
# Multiple files/packages
spark-submit \
--master yarn \
--deploy-mode cluster \
--py-files my_pipeline.zip,utils.py,config.py \
main.py
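# With wheel file
# --py-files is documented for .py, .zip and .egg files. A pure-Python wheel is a zip archive,
# so if your cluster does not pick up the .whl, a common workaround is to ship a copy renamed to .zip.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --py-files my_pipeline-1.4.2-py3-none-any.whl \
  main.py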
Unpinned dependencies (boto3 instead of boto3==1.34.0) mean a new version could break your pipeline silently.
# requirements.txt: pin EVERYTHING for production
pyspark==3.5.0
delta-spark==3.1.0
boto3==1.34.58
botocore==1.34.58
s3fs==2024.2.0
pyarrow==15.0.0
pandas==2.2.0
great-expectations==0.18.12
# requirements-dev.txt — for local development only
pytest==8.1.0
black==24.2.0
flake8==7.0.0
pytest-cov==5.0.0
# Generate pinned requirements from current environment
pip freeze > requirements.txt
# Create a virtual environment
python -m venv .venv
# Activate it
source .venv/bin/activate # Linux/Mac
.venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt
# Deactivate when done
deactivate
# .gitignore should exclude .venv/
echo ".venv/" >> .gitignore
pip-compile from pip-tools generates a fully resolved, pinned requirements.txt from a high-level requirements.in file.
# Install pip-tools
pip install pip-tools
# requirements.in — high-level dependencies only
cat requirements.in
# pyspark==3.5.0
# boto3
# delta-spark
# pip-compile resolves all transitive deps and pins them
pip-compile requirements.in
# Generates requirements.txt with fully pinned versions
# e.g. boto3==1.34.58, botocore==1.34.58, etc.
# Install the pinned requirements
pip-sync requirements.txt # Also removes unlisted packages
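A CI check can enforce the "pin everything" rule by scanning requirements.txt for entries without an exact == pin. The following is a minimal sketch under simplifying assumptions: `unpinned_requirements` is a hypothetical helper, and it only understands plain `name==version` lines (it skips pip options such as `-r other.txt` and does not handle environment markers or hashes).

```python
import re
from typing import Iterable, List

# name, optional extras, then an exact == pin
PINNED = re.compile(r"^[A-Za-z0-9._\-\[\],]+==[^=\s]+$")


def unpinned_requirements(lines: Iterable[str]) -> List[str]:
    """Return the requirement lines that are not pinned with ==."""
    unpinned = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


sample = [
    "pyspark==3.5.0",
    "boto3",
    "delta-spark>=3.1.0",
    "# a comment",
    "pandas==2.2.0  # pinned",
]
print(unpinned_requirements(sample))
# prints ['boto3', 'delta-spark>=3.1.0']
```

In CI, a non-empty result could fail the build so that loose pins never reach the production requirements file.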
Environment Promotion
Code doesn't go straight from a developer's laptop to production. It progresses through Dev → QA → UAT → Prod, with automated gates and human approvals ensuring quality at each stage.
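The promotion path can be modelled as a small lookup, which is handy when a deploy script needs to know where a build goes next and whether a human must approve it. This is only a sketch: the names `PROMOTION_ORDER`, `MANUAL_APPROVAL`, `next_environment`, and `needs_approval` are hypothetical, and the approval set follows the stage descriptions in this section (UAT and Prod are gated, Dev and QA deploy automatically).

```python
from typing import Optional

# Order in which a build moves through environments
PROMOTION_ORDER = ["dev", "qa", "uat", "prod"]

# Environments whose deployment waits for a human approval
MANUAL_APPROVAL = {"uat", "prod"}


def next_environment(current: str) -> Optional[str]:
    """Return the environment after `current`, or None if it is the last one."""
    index = PROMOTION_ORDER.index(current)  # raises ValueError for unknown names
    if index + 1 < len(PROMOTION_ORDER):
        return PROMOTION_ORDER[index + 1]
    return None


def needs_approval(target: str) -> bool:
    """Tell whether deploying to `target` requires a manual gate."""
    return target in MANUAL_APPROVAL


for env in PROMOTION_ORDER:
    gate = "manual approval" if needs_approval(env) else "automatic"
    print(f"{env}: {gate}, next: {next_environment(env)}")
```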
Key characteristics:
- Interactive Databricks notebooks or local Spark sessions
- Small sample data (10K rows instead of 10B)
- Shared or personal dev clusters (auto-terminate in 30 min)
- Auto-deploy on every push to the develop branch
  deploy-dev:
    needs: [unit-tests]
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/develop'
    environment: development  # GitHub environment
    env:
      DATABRICKS_HOST: ${{ secrets.DEV_DATABRICKS_HOST }}
      DATABRICKS_TOKEN: ${{ secrets.DEV_DATABRICKS_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to Dev workspace
        run: databricks bundle deploy --target dev
Key characteristics:
- Job clusters (not interactive) — mimics production
- Integration tests: reads real S3 data, writes to QA Delta tables
- Data Quality checks with Great Expectations or Deequ
- Auto-deploy on merge to main after CI passes
# tests/integration/test_pipeline_qa.py
# Run in QA environment with real data connections
import pytest
from pyspark.sql import SparkSession
from my_pipeline.transforms import run_customer_pipeline
@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.appName("QA-Integration").getOrCreate()
def test_pipeline_row_count(spark):
    """Validate output row count within expected range."""
    result = run_customer_pipeline(
        spark,
        input_path="s3://qa-data/customers/2024-01-01/",
        output_path="s3://qa-output/customers/"
    )
    row_count = result.count()
    assert 900_000 < row_count < 1_100_000, \
        f"Unexpected row count: {row_count}"
def test_no_null_customer_ids(spark):
    """Validate no nulls in primary key."""
    result = run_customer_pipeline(
        spark,
        input_path="s3://qa-data/customers/2024-01-01/",
        output_path="s3://qa-output/customers/"  # same call shape as the test above
    )
    null_count = result.filter(result.customer_id.isNull()).count()
    assert null_count == 0, "Found null customer_ids!"
Key characteristics:
- Production-like data (full dataset or 1-month sample)
- Business analysts run their dashboards against UAT data
- Performance testing — does the pipeline finish within the SLA window?
- Requires manual approval gate in CI/CD system
Key characteristics:
- Job clusters (auto-terminate after each run)
- CloudWatch/Datadog monitoring + SNS alerts on failure
- Manual deploy gate — a senior engineer approves every prod push
- Rollback: redeploy previous wheel version from artifact store
# To rollback: redeploy the previous wheel version
# 1. Find the previous version in S3
aws s3 ls s3://my-artifacts/pyspark-pipelines/
# my_pipeline-1.4.1-py3-none-any.whl ← previous good version
# my_pipeline-1.4.2-py3-none-any.whl ← broken version (rollback from this)
# 2. Install the previous version on the cluster
databricks libraries install \
--cluster-id 0101-PROD-XXXXX \
--whl s3://my-artifacts/pyspark-pipelines/my_pipeline-1.4.1-py3-none-any.whl
# 3. Or redeploy the git tag of the previous release
git checkout v1.4.1
databricks bundle deploy --target prod
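Step 1 of the rollback (finding the previous good wheel) can be automated. The sketch below assumes the artifact listing has already been reduced to a list of wheel filenames in the pure-Python naming pattern shown above; `parse_version` and `rollback_target` are hypothetical helpers. Versions are compared numerically, because a plain string sort would place 1.10.0 before 1.9.0.

```python
import re
from typing import List, Optional, Tuple

WHEEL_VERSION = re.compile(r"-(\d+)\.(\d+)\.(\d+)-py3-none-any\.whl$")


def parse_version(filename: str) -> Tuple[int, ...]:
    """Extract (major, minor, patch) from a wheel filename."""
    match = WHEEL_VERSION.search(filename)
    if match is None:
        raise ValueError(f"cannot read a version from {filename!r}")
    return tuple(int(part) for part in match.groups())


def rollback_target(filenames: List[str], broken: str) -> Optional[str]:
    """Return the newest wheel that is older than the broken one, or None."""
    broken_version = parse_version(broken)
    older = [name for name in filenames if parse_version(name) < broken_version]
    return max(older, key=parse_version) if older else None


wheels = [
    "my_pipeline-1.4.1-py3-none-any.whl",
    "my_pipeline-1.4.2-py3-none-any.whl",
]
print(rollback_target(wheels, "my_pipeline-1.4.2-py3-none-any.whl"))
# prints my_pipeline-1.4.1-py3-none-any.whl
```

Returning None when nothing older exists lets the caller stop and alert rather than redeploy the broken version.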
Infrastructure as Code
Instead of clicking through cloud consoles to create clusters, databases, and storage, you define infrastructure in code. This makes it reproducible, version-controlled, and reviewable. Terraform, CloudFormation, and Databricks Asset Bundles are the key tools.
# main.tf — Provision a Databricks job cluster
terraform {
required_providers {
databricks = {
source = "databricks/databricks"
version = "~> 1.40"
}
}
# Remote state in S3 (never use local state in team settings)
backend "s3" {
bucket = "my-terraform-state"
key = "pyspark-pipelines/terraform.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-lock" # Prevents concurrent applies
}
}
provider "databricks" {
host = var.databricks_host
token = var.databricks_token
}
# Create a Databricks Job with a job cluster
resource "databricks_job" "customer_pipeline" {
name = "customer-data-pipeline"
new_cluster {
num_workers = 4
spark_version = "14.3.x-scala2.12"
node_type_id = "i3.xlarge"
spark_conf = {
"spark.sql.shuffle.partitions" = "200"
"spark.databricks.delta.optimizeWrite.enabled" = "true"
}
aws_attributes {
instance_profile_arn = var.instance_profile_arn
availability = "SPOT_WITH_FALLBACK"
}
}
spark_python_task {
python_file = "s3://my-scripts/main_pipeline.py"
parameters = ["--env", "prod"]
}
library {
whl = "s3://my-artifacts/pyspark-pipelines/my_pipeline-1.4.2-py3-none-any.whl"
}
schedule {
quartz_cron_expression = "0 0 6 * * ?" # Every day at 6am UTC
timezone_id = "UTC"
}
email_notifications {
on_failure = ["data-team@company.com"]
}
}
# EMR cluster for PySpark jobs
resource "aws_emr_cluster" "spark_cluster" {
name = "data-pipeline-cluster"
release_label = "emr-6.15.0"
applications = ["Spark", "Hadoop"]
ec2_attributes {
subnet_id = var.private_subnet_id
emr_managed_master_security_group = aws_security_group.emr_master.id
emr_managed_slave_security_group = aws_security_group.emr_slave.id
instance_profile = aws_iam_instance_profile.emr.arn
}
master_instance_group {
instance_type = "m5.xlarge"
}
core_instance_group {
instance_type = "m5.2xlarge"
instance_count = 4
ebs_config {
size = 100
type = "gp2"
volumes_per_instance = 1
}
}
configurations_json = jsonencode([{
Classification = "spark-defaults"
Properties = {
"spark.sql.shuffle.partitions" = "200"
"spark.executor.memory" = "8g"
}
}])
service_role = aws_iam_role.emr_service.arn
auto_termination_policy { idle_timeout = 3600 }
tags = { Environment = var.environment }
}
# S3 bucket for pipeline artifacts
resource "aws_s3_bucket" "artifacts" {
bucket = "my-company-pipeline-artifacts-${var.environment}"
}
resource "aws_s3_bucket_versioning" "artifacts" {
bucket = aws_s3_bucket.artifacts.id
versioning_configuration { status = "Enabled" }
}
# Initialize — download providers, configure backend
terraform init
# Plan — show what will be created/changed/destroyed
terraform plan -out=tfplan
# Apply — create/update infrastructure
terraform apply tfplan
# Destroy — tear down all resources (careful!)
terraform destroy
# Workspaces for dev/staging/prod
terraform workspace new dev
terraform workspace new prod
terraform workspace select prod
terraform apply -var="environment=prod"
Databricks Asset Bundles describe jobs and other workspace resources in a databricks.yml file. Like Terraform but Databricks-specific and integrated with the Databricks CLI.
# databricks.yml: Databricks Asset Bundle definition
bundle:
  name: customer-pipeline
variables:
  env:
    default: dev
targets:
  dev:
    workspace:
      host: https://my-company-dev.azuredatabricks.net
    variables:
      env: dev
  prod:
    workspace:
      host: https://my-company-prod.azuredatabricks.net
    variables:
      env: prod
    mode: production  # Locks down prod settings
resources:
  jobs:
    customer_pipeline:
      name: customer-data-pipeline
      tasks:
        - task_key: ingest
          spark_python_task:
            python_file: src/ingest.py
            parameters: ["--env", "${var.env}"]
          new_cluster:
            num_workers: 4
            spark_version: "14.3.x-scala2.12"
            node_type_id: i3.xlarge
          libraries:
            - whl: ./dist/my_pipeline-1.4.2-py3-none-any.whl
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: UTC
# Validate the bundle definition
databricks bundle validate
# Deploy to dev workspace
databricks bundle deploy --target dev
# Deploy to prod workspace
databricks bundle deploy --target prod
# Run the job immediately after deploy
databricks bundle run --target dev customer_pipeline
# Destroy (delete all bundle resources)
databricks bundle destroy --target dev
# cloudformation/glue-job.yml
AWSTemplateFormatVersion: '2010-09-09'
Description: Glue ETL Job for customer data pipeline
Parameters:
  Environment:
    Type: String
    AllowedValues: [dev, qa, prod]
Resources:
  CustomerPipelineGlueJob:
    Type: AWS::Glue::Job
    Properties:
      Name: !Sub 'customer-pipeline-${Environment}'
      Role: !GetAtt GlueExecutionRole.Arn
      Command:
        Name: glueetl
        ScriptLocation: !Sub 's3://my-scripts/${Environment}/customer_pipeline.py'
        PythonVersion: '3'
      DefaultArguments:
        '--environment': !Ref Environment
        '--extra-py-files': !Sub 's3://my-artifacts/${Environment}/my_pipeline.whl'
      GlueVersion: '4.0'
      NumberOfWorkers: 10
      WorkerType: G.1X
# Deploy with:
# aws cloudformation deploy \
#   --template-file cloudformation/glue-job.yml \
#   --stack-name customer-pipeline-prod \
#   --parameter-overrides Environment=prod
Release Management
Release management defines how code changes are versioned, documented, communicated, and (if needed) rolled back. Good release hygiene prevents chaos when something goes wrong in production at 2am.
- MAJOR — breaking change (schema change, incompatible API, complete pipeline rewrite)
- MINOR — new feature, backward-compatible (new column added, new source added)
- PATCH — bug fix (null handling fix, performance fix, config correction)
| Version | Change type | Example |
|---|---|---|
| 2.0.0 | Breaking change | Changed output schema; downstream dashboards break |
| 1.5.0 | New feature | Added new `country_code` column to output |
| 1.4.3 | Bug fix | Fixed null pointer error when email is missing |
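The bump rules in the table can be expressed as a small function. This is a sketch with a hypothetical helper name, `bump`; it assumes plain MAJOR.MINOR.PATCH strings without pre-release or build suffixes.

```python
def bump(version: str, change: str) -> str:
    """Return the next version for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"  # breaking change resets minor and patch
    if change == "minor":
        return f"{major}.{minor + 1}.0"  # new feature resets patch
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")


print(bump("1.4.2", "patch"))  # prints 1.4.3
print(bump("1.4.3", "minor"))  # prints 1.5.0
print(bump("1.5.0", "major"))  # prints 2.0.0
```

The three calls reproduce the progression in the table: a bug fix, then a new feature, then a breaking change.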
# Create an annotated Git tag for a release
git tag -a v1.5.0 -m "feat: add country_code to customer output"
# Push the tag to GitHub (triggers release workflow)
git push origin v1.5.0
# List all release tags
git tag --list 'v*'
# Get the current version from git tag in CI
VERSION=$(git describe --tags --abbrev=0 | sed 's/v//')
echo "Building version: $VERSION" # e.g. 1.5.0
# Changelog
## [1.5.0] - 2024-03-15
### Added
- New `country_code` column derived from phone number normalization
- Support for reading from MSK (Managed Kafka) as source
### Changed
- Improved join performance using broadcast hint for country dim table
## [1.4.3] - 2024-03-10
### Fixed
- Fixed NullPointerException when customer email field is missing
- Fixed incorrect SLA timestamp in audit table
## [1.4.2] - 2024-03-01
### Changed
- Upgraded PySpark from 3.4.0 to 3.5.0
- Reduced shuffle partitions from 400 to 200 (20% faster)
## [1.4.0] - 2024-02-15
### Added
- SCD Type 2 support for customer dimension
- Data Quality checks using Great Expectations
## [2.0.0] - Breaking
### BREAKING CHANGES
- Removed deprecated `cust_email` column (use `email` instead)
- Changed partition scheme from `created_date` to `event_date`
# .github/workflows/release.yml
name: Create Release
on:
  push:
    tags:
      - 'v*'  # Trigger on any version tag
permissions:
  contents: write  # needed so the workflow token can create the release
jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build wheel
        run: python setup.py bdist_wheel
      - name: Create GitHub Release
        uses: softprops/action-gh-release@v2
        with:
          files: dist/*.whl
          generate_release_notes: true  # Auto-generate from PR titles
          token: ${{ secrets.GITHUB_TOKEN }}
Quiz & Summary
Test your understanding of CI/CD and Production Deployment concepts.
- What does the --py-files argument in spark-submit do?
- Which tool defines resources in a databricks.yml file and deploys them via the CLI?
Key Takeaways
- GitHub Actions: use needs: to chain jobs. Store secrets securely.
- GitLab CI: artifacts: passes build outputs between stages. Use when: manual for prod gates.
- Jenkins: use the input {} block for human approval gates. Send post-build notifications on failure.
- Packaging: setup.py or pyproject.toml defines the package. Pin ALL versions in requirements.txt for production.