MODULE 34 — OVERVIEW

CI/CD & Production Deployment

Shipping PySpark pipelines to production reliably requires automated testing, packaging, environment promotion, and infrastructure management. This module covers the complete deployment lifecycle — from a developer's commit to a running production Spark job — using GitHub Actions, GitLab CI, Jenkins, Terraform, and Databricks Asset Bundles.

🔁

CI/CD Fundamentals

What Continuous Integration and Continuous Delivery means for Spark pipelines and data engineering teams.

⚙️

GitHub Actions

YAML-based workflows for testing, linting, building packages, and deploying to Databricks or EMR on every push.

🦊

GitLab CI

.gitlab-ci.yml pipelines with stages, runners, and artifact storage for full Spark test-build-deploy pipelines.

🔧

Jenkins

Declarative Jenkinsfiles for organizations with existing Jenkins infrastructure running Spark build and deploy stages.

📦

Packaging

Wheel files, ZIP packaging, dependency management with requirements.txt, virtual environments, and pinning.

🌍

Environment Promotion

Promoting code through Dev → QA → UAT → Prod with automated gates, approvals, and rollback strategies.

CI/CD Pipeline for PySpark: Big Picture

Developer Commit
        │
        ▼
┌─────────────────────────────────────────────────────┐
│  CI Stage (Continuous Integration)                  │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────┐   │
│  │ Lint     │  │ Unit     │  │ Build Wheel /    │   │
│  │ flake8   │→ │ Tests    │→ │ ZIP Package      │   │
│  │ black    │  │ pytest   │  │ setuptools       │   │
│  └──────────┘  └──────────┘  └──────────────────┘   │
└─────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────┐
│  CD Stage (Continuous Delivery / Deployment)        │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────┐   │
│  │ Deploy   │  │ Deploy   │  │ Deploy           │   │
│  │ to Dev   │→ │ to QA    │→ │ to Prod          │   │
│  │ auto     │  │ auto     │  │ manual gate      │   │
│  └──────────┘  └──────────┘  └──────────────────┘   │
└─────────────────────────────────────────────────────┘
        │
        ▼
Running Spark Job on Databricks / EMR / Kubernetes

Why CI/CD matters for Data Engineering: Without automation, deploying a Spark job means manually copying files, running tests locally, and hoping nothing breaks in production. CI/CD eliminates this — every commit is automatically tested, packaged, and deployed in a consistent, repeatable way.

34.1

CI/CD Fundamentals for Data Engineering

Understanding what CI and CD mean specifically for Spark pipeline development and how they differ from traditional software CI/CD.

🔁

What CI/CD means for Spark Pipelines

Concept

Continuous Integration (CI)

CI means every time a developer pushes code, an automated system runs all tests, checks code style, and verifies the code compiles/installs correctly. The goal: catch bugs immediately, not weeks later in production.

For Spark pipelines, CI typically means: lint the Python code → run unit tests with a local SparkSession → validate schemas → build the package artifact.

Real example: A data engineer pushes a change to a join logic function. CI automatically runs 20 unit tests in under 2 minutes, finds a regression, and prevents the broken code from reaching even the Dev environment.
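
The "validate schemas" step in the CI sequence above can be as simple as comparing a DataFrame's column types against an expected mapping. The following is a minimal sketch, assuming the input is the list of (name, type) pairs that PySpark's `df.dtypes` returns; the function name `find_schema_problems` and the column contract are hypothetical.

```python
from typing import Dict, List, Tuple


def find_schema_problems(
    actual_dtypes: List[Tuple[str, str]],
    expected: Dict[str, str],
) -> List[str]:
    """Return human-readable mismatches between actual and expected columns."""
    actual = dict(actual_dtypes)
    problems = []
    for name, expected_type in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != expected_type:
            problems.append(
                f"column {name}: expected {expected_type}, got {actual[name]}"
            )
    # Columns that appear in the data but not in the contract
    for name in actual:
        if name not in expected:
            problems.append(f"unexpected column: {name}")
    return problems


# Hypothetical contract for a customer table
EXPECTED = {"id": "bigint", "name": "string", "email": "string"}

# In a real test this list would come from `df.dtypes`
sample_dtypes = [("id", "int"), ("name", "string"), ("signup_date", "date")]
for problem in find_schema_problems(sample_dtypes, EXPECTED):
    print(problem)
```

A unit test can then assert that the returned list is empty, which makes a schema change fail CI with a readable message.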

Concept

Continuous Delivery (CD)

CD means that once CI passes, the code is automatically deployed to one or more environments. It comes in two forms:

Continuous Delivery: code is always ready to deploy, and a human approves the final push to prod.
Continuous Deployment: fully automated; code goes all the way to production without human gates (less common in data engineering).

Concept

Pipeline as Code

The CI/CD pipeline itself is defined in code (YAML files, Jenkinsfiles) and lives in the same Git repository as your Spark code. This means the deployment process is version-controlled, reviewable, and reproducible. If the CI config breaks, you can roll it back just like application code.

YAML — GitHub Actions

# .github/workflows/ci.yml  ← lives in your repo
# Runs on every push to main or develop, and on pull requests targeting main

name: PySpark CI Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run tests
        run: pytest tests/ -v

Concept

Artifact Versioning

Every build produces a versioned artifact (a .whl wheel file or .zip package) that is stored in an artifact registry (PyPI, AWS CodeArtifact, Databricks, S3). This means you can always deploy or rollback to any previous version.

Artifact versioning follows semantic versioning: MAJOR.MINOR.PATCH — e.g. 1.4.2

Shell — Build and tag artifact

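# Read the version from the latest Git tag, dropping a leading "v" (v1.4.2 -> 1.4.2).
# Assumes the tag matches the version declared in setup.py.
VERSION=$(git describe --tags --abbrev=0 | sed 's/^v//')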

# Build the wheel file
python setup.py bdist_wheel

# Output: dist/my_pipeline-1.4.2-py3-none-any.whl
# Upload to S3 artifact store
aws s3 cp dist/my_pipeline-${VERSION}-py3-none-any.whl \
    s3://my-artifacts/pyspark-pipelines/
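
The MAJOR.MINOR.PATCH scheme described above is easy to get wrong when versions are compared as strings ("1.10.0" sorts before "1.9.3"). Here is a small sketch of parsing and bumping versions numerically; the helper names `parse_version` and `bump` are illustrative, not part of any packaging library.

```python
import re
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


_TAG_PATTERN = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_version(tag: str) -> Version:
    """Parse '1.4.2' or 'v1.4.2' into a comparable Version tuple."""
    match = _TAG_PATTERN.match(tag.strip())
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {tag!r}")
    return Version(*(int(part) for part in match.groups()))


def bump(version: Version, part: str) -> Version:
    """Return the next version; lower-order parts reset to zero."""
    if part == "major":
        return Version(version.major + 1, 0, 0)
    if part == "minor":
        return Version(version.major, version.minor + 1, 0)
    if part == "patch":
        return Version(version.major, version.minor, version.patch + 1)
    raise ValueError(f"unknown part: {part!r}")


current = parse_version("v1.4.2")
print(bump(current, "patch"))   # 1.4.3
print(bump(current, "minor"))   # 1.5.0
# Tuples compare numerically, part by part
print(parse_version("1.10.0") > parse_version("1.9.3"))  # True
```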

34.2

GitHub Actions

GitHub Actions is one of the most widely used CI/CD tools for open-source and enterprise data engineering. You define workflows in YAML, trigger them on push, pull request, or schedule, and use them to run tests, build packages, and deploy to Databricks or EMR.

⚙️

Workflow Files (.yml)

Subtopic

Workflow File Structure

A GitHub Actions workflow is a YAML file placed in .github/workflows/. It defines: when to run (triggers), what environment to use (runners), and what to do (jobs and steps).

YAML — Full workflow anatomy

# .github/workflows/spark_pipeline_ci.yml

name: Spark Pipeline CI/CD       # Display name in GitHub UI

# ── TRIGGERS ──────────────────────────────────────
on:
  push:
    branches: [main, develop]      # Run on push to these branches
  pull_request:
    branches: [main]               # Run on PRs targeting main
  schedule:
    - cron: '0 6 * * 1'            # Every Monday at 6am UTC

# ── JOBS ──────────────────────────────────────────
jobs:
  ci:                              # Job ID
    runs-on: ubuntu-latest         # Runner environment
    # JAVA_HOME is exported by the setup-java step below, so it is not hardcoded here

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Set up Java (Spark needs Java)
        uses: actions/setup-java@v4
        with:
          distribution: 'temurin'
          java-version: '11'

      - name: Install Python dependencies
        run: |
          pip install --upgrade pip
          pip install -r requirements.txt

      - name: Lint with flake8
        run: flake8 src/ --max-line-length=120

      - name: Format check with black
        run: black --check src/

      - name: Run unit tests
        run: pytest tests/ -v --tb=short

      - name: Build wheel package
        run: python setup.py bdist_wheel

      - name: Upload wheel artifact
        uses: actions/upload-artifact@v4
        with:
          name: pipeline-wheel
          path: dist/*.whl

Subtopic

Triggers (on:)

Triggers control when a workflow runs. The most important ones for data engineering:

Trigger	When it fires	Use case
`push`	On every commit push	Run tests on every code change
`pull_request`	When a PR is opened/updated	Gate merges on test pass
`schedule`	On a cron schedule	Nightly regression test suite
`workflow_dispatch`	Manual trigger via GitHub UI	Manual deploy to prod
`release`	When a GitHub Release is published	Deploy on official release tag

Subtopic

Spark Unit Test Job

PySpark tests need a local Spark session. The key trick: set PYSPARK_PYTHON and Java correctly in the CI environment. Tests use SparkSession.builder.master("local[2]") so no cluster is needed.

Python — conftest.py (pytest fixtures)

# tests/conftest.py
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    """Create a local SparkSession for the entire test session."""
    spark = (
        SparkSession.builder
        .master("local[2]")       # 2 local threads
        .appName("CITests")
        .config("spark.sql.shuffle.partitions", "2")  # small for tests
        .getOrCreate()
    )
    yield spark
    spark.stop()

# tests/test_transforms.py
from src.transforms import clean_customer_data

def test_null_email_removed(spark):
    # Arrange
    data = [(1, "Alice", "alice@test.com"),
            (2, "Bob",   None)]           # null email — should be dropped
    df = spark.createDataFrame(data, ["id", "name", "email"])

    # Act
    result = clean_customer_data(df)

    # Assert
    assert result.count() == 1
    assert result.first()["name"] == "Alice"
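
The test above imports `clean_customer_data` without showing it. One possible implementation that would satisfy the test is sketched below; it is an assumption about what the function does (drop rows with a null email), not the pipeline's actual code.

```python
# src/transforms.py (hypothetical implementation)
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # Imported for type hints only
    from pyspark.sql import DataFrame


def clean_customer_data(df: "DataFrame") -> "DataFrame":
    """Drop customer rows that have no email address."""
    # DataFrame.dropna(subset=...) removes rows with a null in any listed column
    return df.dropna(subset=["email"])
```

Keeping transformations as plain functions that take and return a DataFrame is what makes them testable with the local `spark` fixture.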

Subtopic

Linting and Formatting (flake8, black)

flake8 checks for Python style errors and unused imports. black checks that code formatting is consistent. Both prevent messy code from entering the codebase.

Shell — Linting commands

# Install linting tools
pip install flake8 black

# Check for style errors (E: errors, W: warnings, F: pyflakes)
flake8 src/ --max-line-length=120 --ignore=E203,W503

# Check formatting (--check means "would black change this?" — exits non-zero if yes)
black --check src/

# Auto-format (run locally, not in CI — CI just checks)
black src/

# .flake8 config file in project root
cat .flake8
# [flake8]
# max-line-length = 120
# ignore = E203, W503
# exclude = .git, __pycache__, dist

Subtopic

Building and Publishing Packages

After tests pass, CI builds a distributable package (wheel file) and uploads it to an artifact store — S3, Databricks DBFS, or AWS CodeArtifact.

YAML — Build and publish job

build-and-publish:
  needs: [ci]                     # only runs if CI job passes
  runs-on: ubuntu-latest
  if: github.ref == 'refs/heads/main'  # only on main branch

  steps:
    - uses: actions/checkout@v4

    - name: Build wheel
      run: python setup.py bdist_wheel

    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v4
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: us-east-1

    - name: Upload wheel to S3
      run: |
        VERSION=$(python setup.py --version)
        aws s3 cp dist/*.whl \
          s3://my-artifacts/pyspark-pipelines/v${VERSION}/

Subtopic

Databricks Deployment via GitHub Actions

The Databricks CLI or Databricks Asset Bundles can be used from GitHub Actions to deploy notebooks, jobs, and libraries directly to a Databricks workspace.

YAML — Deploy to Databricks

deploy-databricks:
  needs: [build-and-publish]
  runs-on: ubuntu-latest
  env:
    # The CLI reads the workspace URL and PAT token from these variables,
    # so no interactive `databricks configure` step is needed
    DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
    DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

  steps:
    - uses: actions/checkout@v4

    # Asset Bundles need the current Databricks CLI; the legacy
    # `pip install databricks-cli` package has no `bundle` command
    - name: Install Databricks CLI
      uses: databricks/setup-cli@main

    # Each job starts on a fresh runner, so fetch the wheel built by the ci job
    - name: Download wheel artifact
      uses: actions/download-artifact@v4
      with:
        name: pipeline-wheel
        path: dist

    - name: Upload wheel to DBFS
      run: databricks fs cp dist/*.whl dbfs:/libraries/ --overwrite

    - name: Deploy using Asset Bundles
      run: |
        databricks bundle deploy --target prod
        # Reads databricks.yml in the repo root
        # Deploys jobs, pipelines, notebooks defined there

Subtopic

Secrets in GitHub Actions

Never hardcode tokens or credentials in YAML files. Store them as GitHub Secrets (Settings → Secrets and variables → Actions) and reference them with ${{ secrets.SECRET_NAME }}.

Never do this: DATABRICKS_TOKEN: "dapi1234abc..." — tokens committed to Git are a critical security vulnerability and will be exposed in all forks and clones.

YAML — Correct secrets usage

env:
  # ✅ Correct — reference from secrets store
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}

# Set secrets: GitHub repo → Settings → Secrets → New repository secret
# They are masked in all logs automatically
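
Inside the pipeline code, secrets arrive as environment variables. A small fail-fast check at startup, sketched here under the assumption that the variable names match the workflow above, gives a clear error when one is missing instead of a confusing authentication failure later. The helper name `load_required_secrets` is hypothetical.

```python
import os
from typing import Dict, Iterable, Mapping


def load_required_secrets(
    names: Iterable[str],
    environ: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Return the named variables, failing fast if any is missing or empty."""
    names = list(names)
    missing = [name for name in names if not environ.get(name)]
    if missing:
        # Report names only; never print secret values
        raise RuntimeError(f"missing required secrets: {', '.join(missing)}")
    return {name: environ[name] for name in names}


# Demo with an explicit mapping instead of the real environment
demo_env = {
    "DATABRICKS_HOST": "https://example.cloud.databricks.com",
    "DATABRICKS_TOKEN": "",
}
try:
    load_required_secrets(["DATABRICKS_HOST", "DATABRICKS_TOKEN"], demo_env)
except RuntimeError as err:
    print(err)
```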

34.3

GitLab CI

GitLab CI uses a .gitlab-ci.yml file at the repo root. It supports stages, shared runners (cloud) or self-hosted runners, artifact storage, and environment-specific deployments — popular in enterprise data engineering.

🦊

.gitlab-ci.yml — Complete Spark Pipeline

Subtopic

Stages and Jobs

GitLab CI organizes work into stages (which run sequentially) containing jobs (which run in parallel within a stage). With the Docker executor, which the examples here assume, each job runs in its own isolated container.

YAML — .gitlab-ci.yml

# .gitlab-ci.yml — PySpark Pipeline CI/CD

# Define all stages (run top to bottom)
stages:
  - lint
  - test
  - build
  - deploy-dev
  - deploy-prod

# Global variables available to all jobs
variables:
  PYTHON_VERSION: "3.11"
  PYSPARK_VERSION: "3.5.0"

# ── STAGE: lint ──────────────────────────────────
lint-code:
  stage: lint
  image: python:3.11-slim
  script:
    - pip install flake8 black
    - flake8 src/ --max-line-length=120
    - black --check src/
  only:
    - merge_requests          # Run on every MR
    - main

# ── STAGE: test ──────────────────────────────────
unit-tests:
  stage: test
  image: python:3.11
  before_script:
    - apt-get update && apt-get install -y default-jdk
    - pip install pyspark==3.5.0 pytest pytest-cov
    - pip install -r requirements.txt
  script:
    # The terminal report is what the coverage regex below parses
    - pytest tests/ -v --cov=src --cov-report=term --cov-report=xml
  coverage: '/TOTAL.*\s+(\d+%)$/'  # Parse coverage from output
  artifacts:
    reports:
      coverage_report:
        coverage_format: cobertura
        path: coverage.xml

# ── STAGE: build ──────────────────────────────────
build-wheel:
  stage: build
  image: python:3.11
  script:
    - pip install wheel setuptools
    - python setup.py bdist_wheel
  artifacts:
    paths:
      - dist/*.whl           # Store the wheel for downstream jobs
    expire_in: 30 days
  only:
    - main

# ── STAGE: deploy-dev ──────────────────────────────
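# Shared setup for both deploy jobs. AWS credentials, DATABRICKS_HOST and
# DATABRICKS_TOKEN are assumed to be defined as masked CI/CD variables.
.deploy-base:
  image: python:3.11
  before_script:
    # The stock amazon/aws-cli image ships no Databricks CLI, so install both tools here
    - pip install awscli
    - curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
  only:
    - main

deploy-to-dev:
  extends: .deploy-base
  stage: deploy-dev
  script:
    - aws s3 cp dist/*.whl s3://artifacts-dev/pipelines/
    - databricks bundle deploy --target dev
  environment:
    name: development

# ── STAGE: deploy-prod ─────────────────────────────
deploy-to-prod:
  extends: .deploy-base
  stage: deploy-prod
  script:
    - aws s3 cp dist/*.whl s3://artifacts-prod/pipelines/
    - databricks bundle deploy --target prod
  environment:
    name: production
  when: manual             # Requires a human to click "Deploy" in GitLab UI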

Subtopic

Runners

Runners are the machines that execute jobs. GitLab.com shared runners are cloud-hosted (free minutes included). Self-hosted runners run on your own infrastructure — useful when you need access to internal networks, Databricks workspaces, or faster machines.

Shell — Register a self-hosted runner

# Install GitLab Runner on your server or EC2 instance
curl -L https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh | sudo bash
sudo apt-get install gitlab-runner

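# Register it with your GitLab instance
# (newer GitLab versions deprecate --registration-token in favor of a runner
#  authentication token created in the UI and passed with --token)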
sudo gitlab-runner register \
  --url https://gitlab.com \
  --registration-token YOUR_TOKEN \
  --executor docker \
  --docker-image python:3.11 \
  --description "spark-runner" \
  --tag-list spark,pyspark         # Jobs can target this runner by tag

# In .gitlab-ci.yml, use this runner with tags:
# tags:
#   - spark

Subtopic

Artifact Storage

GitLab CI can store build artifacts (wheels, test reports, coverage) between jobs using the artifacts: keyword. Jobs in later stages can download artifacts from earlier stages automatically.

YAML — Artifact passing between stages

# Build stage creates the artifact
build-wheel:
  stage: build
  script:
    - python setup.py bdist_wheel
  artifacts:
    paths:
      - dist/                # Persist this directory
    expire_in: 7 days       # Auto-delete after 7 days

# Deploy stage automatically has access to dist/ from build stage
deploy-to-dev:
  stage: deploy-dev
  dependencies:
    - build-wheel            # Download artifacts from this job
  script:
    - ls dist/               # Can access the .whl file built above
    - aws s3 cp dist/*.whl s3://artifacts-dev/

Subtopic

Deployment to Databricks and EMR

GitLab CI can deploy Spark jobs to any target. Use environment-specific variables stored as GitLab CI/CD Variables (masked and protected for production secrets).

YAML — Deploy to EMR via GitLab CI

deploy-emr:
  stage: deploy-prod
  image:
    name: amazon/aws-cli
    entrypoint: [""]         # Default entrypoint is `aws`; GitLab needs a shell
  variables:
    AWS_DEFAULT_REGION: us-east-1
  script:
    # Upload the PySpark script and the wheel to S3 (fixed wheel name for --py-files)
    - aws s3 cp src/main_pipeline.py s3://my-emr-scripts/
    - aws s3 cp dist/*.whl s3://my-emr-scripts/my_pipeline.whl
    # Submit an EMR step. The --steps value must be a single token with no
    # whitespace, so it stays on one line.
    - >
      aws emr add-steps
      --cluster-id "$CLUSTER_ID"
      --steps 'Type=Spark,Name=DailyPipeline,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--py-files,s3://my-emr-scripts/my_pipeline.whl,s3://my-emr-scripts/main_pipeline.py]'
  environment:
    name: production
  when: manual
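
The shorthand string passed to `--steps` is easy to break with a stray space or comma. If the step is generated by a deploy script, a small builder like the sketch below can assemble and validate it. The function `build_spark_step` is a hypothetical helper; the bucket and file names are the ones used in the example above.

```python
from typing import Optional


def build_spark_step(
    name: str,
    script_uri: str,
    py_file: Optional[str] = None,
    action_on_failure: str = "CONTINUE",
) -> str:
    """Build the shorthand value passed to `aws emr add-steps --steps`."""
    args = ["--deploy-mode", "cluster"]
    if py_file is not None:
        args += ["--py-files", py_file]
    args.append(script_uri)

    # Commas, brackets and whitespace act as delimiters in the shorthand
    # syntax, so this simple builder rejects values that contain them
    for value in [name, action_on_failure, *args]:
        if any(ch.isspace() or ch in ",[]" for ch in value):
            raise ValueError(f"unsupported character in {value!r}")

    return (
        f"Type=Spark,Name={name},ActionOnFailure={action_on_failure},"
        f"Args=[{','.join(args)}]"
    )


step = build_spark_step(
    name="DailyPipeline",
    script_uri="s3://my-emr-scripts/main_pipeline.py",
    py_file="s3://my-emr-scripts/my_pipeline.whl",
)
print(step)
```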

34.4

Jenkins

Jenkins is one of the most mature CI/CD tools and is widely used in large enterprises. A Jenkinsfile (stored in your repo) defines declarative pipelines with stages, error handling, and cloud integrations.

🔧

Jenkinsfile — Declarative Pipeline

Subtopic

Declarative Pipeline Structure

A Declarative Jenkinsfile uses a pipeline { } block with sections for agent, environment, stages, and post-build actions. It's more readable than the older Scripted syntax.

Groovy — Jenkinsfile

// Jenkinsfile — PySpark Pipeline Build & Deploy

pipeline {
    agent {
        docker {
            image 'python:3.11'          // Run all stages in this Docker image
            args  '-u root'              // Root is needed for apt-get in the install stage
        }
    }

    environment {
        DATABRICKS_HOST  = credentials('databricks-host')   // Jenkins credential store
        DATABRICKS_TOKEN = credentials('databricks-token')
        AWS_CREDENTIALS  = credentials('aws-de-credentials') // UsernamePassword binding
    }

    stages {
        stage('Checkout') {
            steps {
                checkout scm                // Pull code from Git
            }
        }

        stage('Install Dependencies') {
            steps {
                sh '''
                    apt-get update -q && apt-get install -y -q default-jdk
                    pip install --upgrade pip
                    pip install -r requirements.txt
                    # Current Databricks CLI, needed for `databricks bundle`
                    curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
                '''
            }
        }

        stage('Lint') {
            steps {
                sh 'flake8 src/ --max-line-length=120'
                sh 'black --check src/'
            }
        }

        stage('Unit Tests') {
            steps {
                sh 'pytest tests/ -v --junitxml=test-results.xml'
            }
            post {
                always {
                    junit 'test-results.xml'    // Publish test results in Jenkins UI
                }
            }
        }

        stage('Build Wheel') {
            steps {
                sh 'python setup.py bdist_wheel'
                archiveArtifacts artifacts: 'dist/*.whl'  // Store in Jenkins
            }
        }

        stage('Deploy to Dev') {
            when {
                branch 'main'              // Only on main branch
            }
            steps {
                // DEV_JOB_ID is assumed to be defined as a Jenkins environment variable
                sh '''
                    databricks fs cp dist/*.whl dbfs:/libraries/ --overwrite
                    databricks jobs run-now "$DEV_JOB_ID"
                '''
            }
        }

        stage('Deploy to Prod') {
            when {
                beforeInput true           // Check the branch before prompting for approval
                branch 'main'
            }
            input {
                message "Deploy to PRODUCTION?"  // Human approval gate
                ok "Yes, deploy"
            }
            steps {
                sh 'databricks bundle deploy --target prod'
            }
        }
    }

    post {
        failure {
            emailext subject: "Pipeline FAILED: ${env.JOB_NAME} #${env.BUILD_NUMBER}",
                     body:    "Check Jenkins: ${env.BUILD_URL}",
                     to:      'data-team@company.com'
        }
        success {
            echo 'Pipeline completed successfully!'
        }
    }
}

Subtopic

Integration with Cloud Providers

Jenkins integrates with AWS, Azure, and GCP through plugins. For data engineering the key integrations are AWS credentials binding, S3 artifact upload, and triggering EMR/Glue jobs.

Groovy — AWS integration in Jenkins

// Use withAWS() from the Pipeline: AWS Steps plugin
stage('Upload to S3') {
    steps {
        withAWS(credentials: 'aws-de-credentials', region: 'us-east-1') {
            s3Upload(
                file:   'dist/my_pipeline-1.0.0-py3-none-any.whl',
                bucket: 'my-artifacts',
                path:   'pyspark-pipelines/'
            )
        }
    }
}

// Trigger an EMR step
stage('Run EMR Job') {
    steps {
        withAWS(credentials: 'aws-de-credentials', region: 'us-east-1') {
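            // The --steps value must be a single token: whitespace inside it
            // would split the argument, so it stays on one line
            sh '''
                aws emr add-steps \
                  --cluster-id j-XXXXXXXXXXX \
                  --steps "Type=Spark,Name=DailyETL,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://my-scripts/main.py]"
            '''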
        }
    }
}

Subtopic

Spark Build and Test in Jenkins

Same as GitHub Actions — tests use a local SparkSession. The difference: Jenkins typically runs on a persistent server or agent with Java already installed, making PySpark tests faster to set up.

Best practice: Use a custom Docker image with Java + Python + PySpark pre-installed as your Jenkins agent to avoid re-downloading 500MB of dependencies on every build.

34.5

Packaging

Packaging your PySpark code makes it deployable, versionable, and installable on Spark clusters. The two main formats are Wheel (.whl) files for Python packages and ZIP files for script bundles. Good dependency management prevents "works on my machine" problems.

📦

Wheel Files

Subtopic

Building with setuptools

A wheel file (.whl) is a built distribution of your Python package. It can be installed with pip install and distributed to Spark clusters. You need a setup.py or pyproject.toml to define package metadata.

Text — Project structure

my_pipeline/                  # Project root
├── setup.py                  # or pyproject.toml
├── requirements.txt
├── src/
│   └── my_pipeline/
│       ├── __init__.py
│       ├── transforms.py     # Your PySpark code
│       ├── utils.py
│       └── config.py
└── tests/
    ├── conftest.py
    └── test_transforms.py

Python — setup.py

# setup.py
from setuptools import setup, find_packages

setup(
    name="my_pipeline",
    version="1.4.2",
    description="Customer data pipeline",
    author="Data Engineering Team",
    python_requires=">=3.9",
    packages=find_packages(where="src"),
    package_dir={"": "src"},
    install_requires=[
        "pyspark==3.5.0",
        "delta-spark==3.1.0",
        "boto3>=1.34.0",
    ],
    extras_require={
        "dev": ["pytest", "black", "flake8"]
    }
)

TOML — pyproject.toml (modern way)

# pyproject.toml — modern Python packaging
[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "my_pipeline"
version = "1.4.2"
description = "Customer data pipeline"
requires-python = ">=3.9"
dependencies = [
    "pyspark==3.5.0",
    "delta-spark==3.1.0",
    "boto3>=1.34.0",
]

[project.optional-dependencies]
dev = ["pytest", "black", "flake8"]

[tool.setuptools.packages.find]
where = ["src"]

Shell — Build the wheel

# Build using setuptools directly (legacy: setuptools has deprecated running setup.py as a command)
python setup.py bdist_wheel
# Output: dist/my_pipeline-1.4.2-py3-none-any.whl

# Build using build module (pyproject.toml way — modern)
pip install build
python -m build
# Output: dist/my_pipeline-1.4.2-py3-none-any.whl

# Install the wheel to verify it works
pip install dist/my_pipeline-1.4.2-py3-none-any.whl
python -c "from my_pipeline.transforms import clean_customer_data; print('OK')"

Subtopic

Wheel upload to artifact registry

Built wheels are stored in an artifact registry so any environment can install the exact same version. Common registries: S3 (simple), AWS CodeArtifact (managed), or Databricks DBFS.

Shell — Upload and install wheel on cluster

# Upload wheel to S3 artifact store
aws s3 cp dist/my_pipeline-1.4.2-py3-none-any.whl \
    s3://my-artifacts/pyspark-pipelines/

# In spark-submit — distribute wheel to all executors
spark-submit \
    --py-files s3://my-artifacts/pyspark-pipelines/my_pipeline-1.4.2-py3-none-any.whl \
    main.py

# In Databricks: install as a cluster library
# (legacy databricks-cli syntax shown; the newer CLI takes a --json payload instead)
databricks libraries install \
    --cluster-id 0101-XXXXX \
    --whl s3://my-artifacts/pyspark-pipelines/my_pipeline-1.4.2-py3-none-any.whl

🗜️

ZIP Packaging

Subtopic

Zipping Python Dependencies

A ZIP approach bundles all Python source files (and optionally installed dependencies) into a .zip file. Spark can distribute it to all nodes via --py-files. Simpler than wheels but less clean for complex packages.

Shell — Create ZIP package

# Zip only your source code
cd src/
zip -r ../my_pipeline.zip my_pipeline/
cd ..

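# Or zip with installed dependencies (venv approach)
# Works only for pure-Python packages; compiled ones such as pyarrow
# cannot be imported from a zip and must be installed on the nodes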
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Zip the installed packages
cd venv/lib/python3.11/site-packages/
zip -r ../../../../dependencies.zip .
cd ../../../../

# Use in spark-submit
spark-submit \
    --py-files my_pipeline.zip,dependencies.zip \
    main.py

Subtopic

--py-files in spark-submit

--py-files is the spark-submit argument that distributes .py files, .zip files, or .egg files to all executor nodes. Without it, your custom modules are not available on executors.

Shell — spark-submit with py-files

# Single zip file
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --py-files my_pipeline.zip \
    main.py

# Multiple files/packages
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --py-files my_pipeline.zip,utils.py,config.py \
    main.py

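# With a wheel file: Spark documents only .py, .zip and .egg for --py-files.
# A pure-Python wheel is a zip archive, so a copy with a .zip extension is the safer choice.
cp my_pipeline-1.4.2-py3-none-any.whl my_pipeline-1.4.2.zip
spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --py-files my_pipeline-1.4.2.zip \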
    main.py
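
When spark-submit is launched from a deploy script rather than typed by hand, building the argument list in code avoids quoting mistakes, especially around the comma-separated `--py-files` value. The following is a sketch only; `build_spark_submit` is a hypothetical helper and the defaults mirror the examples above.

```python
import shlex
from typing import List, Sequence


def build_spark_submit(
    main_script: str,
    py_files: Sequence[str] = (),
    master: str = "yarn",
    deploy_mode: str = "cluster",
) -> List[str]:
    """Return the argv list for spark-submit; py_files are joined with commas."""
    argv = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if py_files:
        argv += ["--py-files", ",".join(py_files)]
    argv.append(main_script)
    return argv


command = build_spark_submit("main.py", ["my_pipeline.zip", "utils.py", "config.py"])
print(shlex.join(command))
# To launch it: subprocess.run(command, check=True)
```

Passing the list straight to `subprocess.run` (without `shell=True`) means no shell is involved, so paths with unusual characters need no escaping.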

📋

Dependency Management

Subtopic

requirements.txt and Pinning Versions

Always pin exact versions for production. Unpinned deps (boto3 instead of boto3==1.34.0) mean a new version could break your pipeline silently.

Text — requirements.txt

# requirements.txt — pin EVERYTHING for production
pyspark==3.5.0
delta-spark==3.1.0
boto3==1.34.58
botocore==1.34.58
s3fs==2024.2.0
pyarrow==15.0.0
pandas==2.2.0
great-expectations==0.18.12

# requirements-dev.txt — for local development only
pytest==8.1.0
black==24.2.0
flake8==7.0.0
pytest-cov==5.0.0

# Generate pinned requirements from current environment
pip freeze > requirements.txt
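
The pinning rule can be enforced in CI rather than left to code review. The sketch below flags any requirement line that is not an exact `==` pin. It is deliberately simple (it does not handle environment markers or URL requirements), and the function name `find_unpinned` is hypothetical.

```python
import re
from typing import List

# A pinned line looks like `name==1.2.3`, optionally with extras: `name[extra]==1.2.3`
_PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,-]+\])?==[^=\s]+$")


def find_unpinned(requirements_text: str) -> List[str]:
    """Return requirement lines that are not pinned with `==`."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):      # skip blanks and pip options like -r
            continue
        if not _PINNED.match(line):
            unpinned.append(line)
    return unpinned


sample = """\
pyspark==3.5.0
boto3>=1.34.0   # range, not a pin
pandas
delta-spark==3.1.0
"""
print(find_unpinned(sample))
```

A CI step could read `requirements.txt`, call this function, and exit non-zero when the returned list is not empty.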

Subtopic

Virtual Environments

Always develop inside a virtual environment so your project's dependencies don't conflict with system Python packages.

Shell — Virtual environment setup

# Create a virtual environment
python -m venv .venv

# Activate it
source .venv/bin/activate          # Linux/Mac
.venv\Scripts\activate             # Windows

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt

# Deactivate when done
deactivate

# .gitignore should exclude .venv/
echo ".venv/" >> .gitignore

Subtopic

Dependency Conflict Resolution

Conflicts happen when two packages require different versions of the same library. Key tool: pip-compile from pip-tools generates a fully resolved, pinned requirements.txt from a high-level requirements.in file.

Shell — pip-tools for conflict-free deps

# Install pip-tools
pip install pip-tools

# requirements.in — high-level dependencies only
cat requirements.in
# pyspark==3.5.0
# boto3
# delta-spark

# pip-compile resolves all transitive deps and pins them
pip-compile requirements.in
# Generates requirements.txt with fully pinned versions
# e.g. boto3==1.34.58, botocore==1.34.58, etc.

# Install the pinned requirements
pip-sync requirements.txt          # Also removes unlisted packages

34.6

Environment Promotion

Code doesn't go straight from a developer's laptop to production. It progresses through Dev → QA → UAT → Prod, with automated gates and human approvals ensuring quality at each stage.

DEV

Developer notebooks, interactive clusters, small data samples. Auto-deploy on commit.

QA

Automated tests, integration test clusters, full DQ checks. Auto-deploy after CI pass.

UAT

User acceptance testing, production-like data, performance validation. Requires approval.

PROD

Job clusters, monitoring, alerting. Manual gate + rollback strategy required.
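
The gates summarized above amount to a small decision rule: Dev and QA deploy automatically once checks pass, while UAT and Prod also need a human approval. The sketch below expresses that rule in code. It is a simplified model with hypothetical names, not how any particular CI system implements its gates.

```python
from typing import Optional

# Promotion order and gate type for each environment, as described above
PROMOTION_ORDER = ["dev", "qa", "uat", "prod"]
REQUIRES_APPROVAL = {"dev": False, "qa": False, "uat": True, "prod": True}


def next_environment(current: str) -> Optional[str]:
    """Return the environment that follows `current`, or None after prod."""
    index = PROMOTION_ORDER.index(current)  # raises ValueError for unknown names
    if index + 1 < len(PROMOTION_ORDER):
        return PROMOTION_ORDER[index + 1]
    return None


def can_promote(current: str, checks_passed: bool, approved: bool = False) -> bool:
    """Decide whether a build in `current` may move to the next environment."""
    target = next_environment(current)
    if target is None or not checks_passed:
        return False
    # Gated targets additionally need an explicit human approval
    return approved or not REQUIRES_APPROVAL[target]


print(can_promote("dev", checks_passed=True))                 # True: QA has no approval gate
print(can_promote("qa", checks_passed=True))                  # False: UAT needs approval
print(can_promote("uat", checks_passed=True, approved=True))  # True
```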

🌍

Each Environment in Detail

Environment 1

Dev — Developer Notebooks and Interactive Clusters

Purpose: Fast iteration. Developers write and test code against small data samples using interactive (not job) clusters. No strict process — dev is meant to be experimental.

Key characteristics:

Interactive Databricks notebooks or local Spark sessions
Small sample data (10K rows instead of 10B)
Shared or personal dev clusters (auto-terminate in 30 min)
Auto-deploy on every push to the develop branch

YAML — Dev deploy (GitHub Actions)

deploy-dev:
  needs: [unit-tests]
  runs-on: ubuntu-latest
  if: github.ref == 'refs/heads/develop'
  environment: development          # GitHub environment
  env:
    DATABRICKS_HOST: ${{ secrets.DEV_DATABRICKS_HOST }}
    DATABRICKS_TOKEN: ${{ secrets.DEV_DATABRICKS_TOKEN }}
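  steps:
    - uses: actions/checkout@v4
    - name: Install Databricks CLI   # `databricks bundle` needs the current CLI
      uses: databricks/setup-cli@main
    - name: Deploy to Dev workspace
      run: databricks bundle deploy --target dev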

Environment 2

QA — Automated Tests and Integration Test Clusters

Purpose: Validate that the pipeline works end-to-end with realistic data. Automated integration tests run against a real (but non-production) dataset.

Key characteristics:

Job clusters (not interactive) — mimics production
Integration tests: reads real S3 data, writes to QA Delta tables
Data Quality checks with Great Expectations or Deequ
Auto-deploy on merge to main after CI passes

Python — Integration test example

# tests/integration/test_pipeline_qa.py
# Run in QA environment with real data connections

import pytest
from pyspark.sql import SparkSession
from my_pipeline.transforms import run_customer_pipeline

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.appName("QA-Integration").getOrCreate()

def test_pipeline_row_count(spark):
    """Validate output row count within expected range."""
    result = run_customer_pipeline(
        spark,
        input_path="s3://qa-data/customers/2024-01-01/",
        output_path="s3://qa-output/customers/"
    )
    row_count = result.count()
    assert 900_000 < row_count < 1_100_000, \
        f"Unexpected row count: {row_count}"

def test_no_null_customer_ids(spark):
    """Validate no nulls in primary key."""
    result = run_customer_pipeline(
        spark,
        input_path="s3://qa-data/customers/2024-01-01/",
        output_path="s3://qa-output/customers/"
    )
    null_count = result.filter(result.customer_id.isNull()).count()
    assert null_count == 0, "Found null customer_ids!"

Environment 3

UAT — User Acceptance Testing

Purpose: Business users validate that the data looks correct with production-like volumes. A human must approve before promotion to prod.

Key characteristics:

Production-like data (full dataset or 1-month sample)
Business analysts run their dashboards against UAT data
Performance testing — does the pipeline finish within the SLA window?
Requires manual approval gate in CI/CD system
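
The SLA question in the list above can be turned into an automated check. The sketch below is one possible shape, assuming a hypothetical run_within_sla helper and a `job` callable that triggers the pipeline and blocks until it finishes; neither name comes from the pipeline code.

```python
import time
from typing import Callable


def run_within_sla(job: Callable[[], object], sla_seconds: float) -> tuple[bool, float]:
    """Run `job` once and report whether it finished inside the SLA window.

    `job` is a stand-in for a call that starts the pipeline and waits for it.
    Returns (met_sla, elapsed_seconds).
    """
    start = time.monotonic()          # monotonic clock is unaffected by wall-clock changes
    job()
    elapsed = time.monotonic() - start
    return elapsed <= sla_seconds, elapsed
```

A UAT gate could fail the promotion when the first element of the result is False and log the elapsed time for trend tracking.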

Environment 4

Prod — Production with Monitoring and Rollback

Purpose: The live system. Pipelines run on job clusters (no interactive development), with full monitoring, alerting, and a defined rollback strategy.

Key characteristics:

Job clusters (auto-terminate after each run)
CloudWatch/Datadog monitoring + SNS alerts on failure
Manual deploy gate — a senior engineer approves every prod push
Rollback: redeploy previous wheel version from artifact store

Shell — Rollback to previous version

# To rollback: redeploy the previous wheel version

# 1. Find the previous version in S3
aws s3 ls s3://my-artifacts/pyspark-pipelines/
#   my_pipeline-1.4.1-py3-none-any.whl   ← previous good version
#   my_pipeline-1.4.2-py3-none-any.whl   ← broken version (rollback from this)

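# 2. Install the previous version on the cluster
#    (flags shown follow the legacy databricks-cli; check
#    `databricks libraries install --help` for your CLI version)
databricks libraries install \
    --cluster-id 0101-PROD-XXXXX \
    --whl s3://my-artifacts/pyspark-pipelines/my_pipeline-1.4.1-py3-none-any.whl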

# 3. Or redeploy the git tag of the previous release
git checkout v1.4.1
databricks bundle deploy --target prod
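
Step 1 above relies on someone reading the bucket listing and picking the right file. A minimal sketch of automating that choice, assuming the wheel naming shown in the listing and a hypothetical previous_wheel helper; the file names would come from the `aws s3 ls` output or an equivalent API call.

```python
import re
from typing import Optional

# Matches names such as my_pipeline-1.4.1-py3-none-any.whl (pattern assumed from the listing)
WHEEL_RE = re.compile(r"^my_pipeline-(\d+)\.(\d+)\.(\d+)-py3-none-any\.whl$")


def parse_version(wheel_name: str) -> Optional[tuple]:
    """Extract (major, minor, patch) from a wheel file name, or None if it does not match."""
    match = WHEEL_RE.match(wheel_name)
    return tuple(int(part) for part in match.groups()) if match else None


def previous_wheel(wheel_names: list, current: str) -> Optional[str]:
    """Return the wheel with the highest version strictly below `current` (e.g. "1.4.2")."""
    current_version = tuple(int(part) for part in current.split("."))
    older = []
    for name in wheel_names:
        version = parse_version(name)
        if version is not None and version < current_version:
            older.append((version, name))
    # Tuples compare numerically, so 1.10.0 sorts above 1.9.0
    return max(older)[1] if older else None
```

Comparing integer tuples rather than strings avoids the classic mistake of ranking 1.10.0 below 1.9.0.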

34.7

Infrastructure as Code

Instead of clicking through cloud consoles to create clusters, databases, and storage, you define infrastructure in code. This makes it reproducible, version-controlled, and reviewable. Terraform, CloudFormation, and Databricks Asset Bundles are the key tools.

🏗️

Terraform for Databricks


Subtopic

What is Terraform and Why Use It

Terraform is an infrastructure-as-code (IaC) tool from HashiCorp. You write .tf files declaring what infrastructure you want, and Terraform works out which resources to create, update, or destroy to get there. Applying the same config yields the same infrastructure every time, and terraform plan surfaces drift caused by manual changes, so there is no "I forgot to click that checkbox".

HCL — Terraform Databricks cluster

# main.tf — Provision a Databricks job cluster

terraform {
  required_providers {
    databricks = {
      source  = "databricks/databricks"
      version = "~> 1.40"
    }
  }
  # Remote state in S3 (never use local state in team settings)
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "pyspark-pipelines/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-lock"    # Prevents concurrent applies
  }
}

provider "databricks" {
  host  = var.databricks_host
  token = var.databricks_token
}

# Create a Databricks Job with a job cluster
resource "databricks_job" "customer_pipeline" {
  name = "customer-data-pipeline"

  new_cluster {
    num_workers   = 4
    spark_version = "14.3.x-scala2.12"
    node_type_id  = "i3.xlarge"

    spark_conf = {
      "spark.sql.shuffle.partitions" = "200"
      "spark.databricks.delta.optimizeWrite.enabled" = "true"
    }

    aws_attributes {
      instance_profile_arn = var.instance_profile_arn
      availability          = "SPOT_WITH_FALLBACK"
    }
  }

  spark_python_task {
    python_file = "s3://my-scripts/main_pipeline.py"
    parameters  = ["--env", "prod"]
  }

  library {
    whl = "s3://my-artifacts/pyspark-pipelines/my_pipeline-1.4.2-py3-none-any.whl"
  }

  schedule {
    quartz_cron_expression = "0 0 6 * * ?"   # Every day at 6am UTC
    timezone_id            = "UTC"
  }

  email_notifications {
    on_failure = ["data-team@company.com"]
  }
}
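
The declarative model described in this subtopic boils down to comparing desired state with actual state. The toy sketch below illustrates that idea only; it is not Terraform's algorithm (which also tracks a state file and a dependency graph), and plan_changes and the resource dictionaries are hypothetical.

```python
def plan_changes(desired: dict, actual: dict) -> dict:
    """Compare desired and actual resources, keyed by name, and classify each one."""
    plan = {"create": [], "update": [], "destroy": []}
    for name, config in desired.items():
        if name not in actual:
            plan["create"].append(name)      # declared but not yet provisioned
        elif actual[name] != config:
            plan["update"].append(name)      # provisioned with different settings
    for name in actual:
        if name not in desired:
            plan["destroy"].append(name)     # provisioned but no longer declared
    return plan


desired = {"customer_pipeline": {"num_workers": 4}, "artifacts_bucket": {"versioning": True}}
actual = {"customer_pipeline": {"num_workers": 2}, "old_cluster": {"num_workers": 8}}
print(plan_changes(desired, actual))
```

Running the same comparison twice against an unchanged environment yields an empty plan, which is the property that makes applies repeatable.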

Subtopic

Terraform for EMR

Provision EMR clusters, S3 buckets, IAM roles, and all supporting infrastructure in Terraform so the entire data platform is reproducible.

HCL — Terraform EMR cluster

# EMR cluster for PySpark jobs
resource "aws_emr_cluster" "spark_cluster" {
  name          = "data-pipeline-cluster"
  release_label = "emr-6.15.0"
  applications  = ["Spark", "Hadoop"]

  ec2_attributes {
    subnet_id                         = var.private_subnet_id
    emr_managed_master_security_group = aws_security_group.emr_master.id
    emr_managed_slave_security_group  = aws_security_group.emr_slave.id
    instance_profile                  = aws_iam_instance_profile.emr.arn
  }

  master_instance_group {
    instance_type = "m5.xlarge"
  }

  core_instance_group {
    instance_type  = "m5.2xlarge"
    instance_count = 4

    ebs_config {
      size                 = 100
      type                 = "gp2"
      volumes_per_instance = 1
    }
  }

  configurations_json = jsonencode([{
    Classification = "spark-defaults"
    Properties = {
      "spark.sql.shuffle.partitions" = "200"
      "spark.executor.memory"        = "8g"
    }
  }])

  service_role = aws_iam_role.emr_service.arn
  auto_termination_policy { idle_timeout = 3600 }

  tags = { Environment = var.environment }
}

# S3 bucket for pipeline artifacts
resource "aws_s3_bucket" "artifacts" {
  bucket = "my-company-pipeline-artifacts-${var.environment}"
}
resource "aws_s3_bucket_versioning" "artifacts" {
  bucket = aws_s3_bucket.artifacts.id
  versioning_configuration { status = "Enabled" }
}

Subtopic

Terraform Workflow

The standard Terraform workflow is: init → plan → apply. Always review the plan before applying in production.

terraform init → terraform plan → Review diff → terraform apply → Resources created

Shell — Terraform commands

# Initialize — download providers, configure backend
terraform init

# Plan — show what will be created/changed/destroyed
terraform plan -out=tfplan

# Apply — create/update infrastructure
terraform apply tfplan

# Destroy — tear down all resources (careful!)
terraform destroy

# Workspaces for dev/staging/prod
terraform workspace new dev
terraform workspace new prod
terraform workspace select prod
terraform apply -var="environment=prod"
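
Reviewing the plan can be partly automated in CI. The sketch below assumes the plan has been exported with `terraform show -json tfplan` and flags any resource whose planned actions include a delete; the destructive_changes helper and the sample document are hypothetical, and only the resource_changes fields used here are relied on.

```python
import json


def destructive_changes(plan_json: str) -> list:
    """Return addresses of resources the plan would delete or replace."""
    plan = json.loads(plan_json)
    return [
        change["address"]
        for change in plan.get("resource_changes", [])
        # A replacement shows up as both "delete" and "create" in the actions list
        if "delete" in change["change"]["actions"]
    ]


# Hand-written sample in the shape of the JSON plan output
sample_plan = json.dumps({
    "resource_changes": [
        {"address": "databricks_job.customer_pipeline", "change": {"actions": ["update"]}},
        {"address": "aws_s3_bucket.artifacts", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_emr_cluster.spark_cluster", "change": {"actions": ["no-op"]}},
    ]
})
print(destructive_changes(sample_plan))
```

A prod pipeline could stop and ask for explicit approval whenever this list is non-empty, leaving routine updates to flow through.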

🧱

Databricks Asset Bundles


Subtopic

What are Databricks Asset Bundles (DABs)

Databricks Asset Bundles (DABs) let you define Databricks jobs, pipelines, clusters, and notebooks in a databricks.yml file. They play a role similar to Terraform but are Databricks-specific and deployed through the Databricks CLI.

YAML — databricks.yml

# databricks.yml — Databricks Asset Bundle definition

bundle:
  name: customer-pipeline

variables:
  env:
    default: dev

targets:
  dev:
    workspace:
      host: https://my-company-dev.cloud.databricks.com    # AWS workspace, matching the i3.xlarge node type below
    variables:
      env: dev

  prod:
    workspace:
      host: https://my-company-prod.cloud.databricks.com
    variables:
      env: prod
    mode: production                    # Locks down prod settings

resources:
  jobs:
    customer_pipeline:
      name: customer-data-pipeline
      tasks:
        - task_key: ingest
          spark_python_task:
            python_file: src/ingest.py
            parameters: [--env, "${var.env}"]
          new_cluster:
            num_workers: 4
            spark_version: "14.3.x-scala2.12"
            node_type_id: i3.xlarge
          libraries:
            - whl: ./dist/my_pipeline-1.4.2-py3-none-any.whl

      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: UTC

Shell — Databricks Asset Bundle commands

# Validate the bundle definition
databricks bundle validate

# Deploy to dev workspace
databricks bundle deploy --target dev

# Deploy to prod workspace
databricks bundle deploy --target prod

# Run the job immediately after deploy
databricks bundle run --target dev customer_pipeline

# Destroy (delete all bundle resources)
databricks bundle destroy --target dev
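
The way a target's variables override the bundle defaults, and how `${var.env}` is then filled in, can be modelled in a few lines. This is a simplified illustration of the behavior, not the CLI's implementation; resolve_variables and substitute are hypothetical names, and the bundle is assumed to be already parsed into a dict.

```python
import re


def resolve_variables(bundle: dict, target: str) -> dict:
    """Merge variable defaults with the overrides declared under one target."""
    resolved = {
        name: spec["default"]
        for name, spec in bundle.get("variables", {}).items()
        if "default" in spec
    }
    resolved.update(bundle["targets"][target].get("variables", {}))  # target values win
    return resolved


def substitute(text: str, variables: dict) -> str:
    """Replace each ${var.NAME} reference with its resolved value."""
    return re.sub(r"\$\{var\.(\w+)\}", lambda match: variables[match.group(1)], text)


# Mirrors the variables and targets sections of the databricks.yml above
bundle = {
    "variables": {"env": {"default": "dev"}},
    "targets": {"dev": {"variables": {"env": "dev"}}, "prod": {"variables": {"env": "prod"}}},
}
print(substitute("--env ${var.env}", resolve_variables(bundle, "prod")))
```

With this model, deploying with `--target prod` passes `--env prod` to the ingest task, while a target that sets no variables falls back to the default.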

☁️

CloudFormation


Subtopic

CloudFormation for Data Pipelines

CloudFormation is AWS's native IaC service. Defined in JSON or YAML, it creates AWS resources in a "stack". Prefer Terraform for cross-cloud portability, but CloudFormation integrates tightly with AWS services like Glue, EMR, and CodePipeline.

YAML — CloudFormation Glue Job

# cloudformation/glue-job.yml
AWSTemplateFormatVersion: '2010-09-09'
Description: Glue ETL Job for customer data pipeline

Parameters:
  Environment:
    Type: String
    AllowedValues: [dev, qa, prod]

Resources:
  CustomerPipelineGlueJob:
    Type: AWS::Glue::Job
    Properties:
      Name: !Sub 'customer-pipeline-${Environment}'
      Role: !GetAtt GlueExecutionRole.Arn   # GlueExecutionRole (an AWS::IAM::Role) must also be defined in this template; omitted here
      Command:
        Name: glueetl
        ScriptLocation: !Sub 's3://my-scripts/${Environment}/customer_pipeline.py'
        PythonVersion: '3'
      DefaultArguments:
        '--environment': !Ref Environment
        '--extra-py-files': !Sub 's3://my-artifacts/${Environment}/my_pipeline.whl'
      GlueVersion: '4.0'
      NumberOfWorkers: 10
      WorkerType: G.1X

# Deploy with:
# aws cloudformation deploy \
#   --template-file cloudformation/glue-job.yml \
#   --stack-name customer-pipeline-prod \
#   --parameter-overrides Environment=prod

34.8

Release Management

Release management defines how code changes are versioned, documented, communicated, and (if needed) rolled back. Good release hygiene prevents chaos when something goes wrong in production at 2am.

🏷️

Semantic Versioning for Pipelines


Subtopic

Semantic Versioning (SemVer)

Version numbers follow the pattern MAJOR.MINOR.PATCH. Increment:

MAJOR — breaking change (schema change, incompatible API, complete pipeline rewrite)
MINOR — new feature, backward-compatible (new column added, new source added)
PATCH — bug fix (null handling fix, performance fix, config correction)

Version	Change type	Example
`2.0.0`	Breaking change	Changed output schema — downstream dashboards break
`1.5.0`	New feature	Added new `country_code` column to output
`1.4.3`	Bug fix	Fixed null pointer error when email is missing
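
The increment rules above are mechanical enough to script. A minimal sketch, assuming plain MAJOR.MINOR.PATCH strings without pre-release suffixes and a hypothetical bump_version helper:

```python
def bump_version(version: str, change: str) -> str:
    """Return the next version for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"            # breaking change resets minor and patch
    if change == "minor":
        return f"{major}.{minor + 1}.0"      # new feature resets patch
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")
```

Starting from 1.4.2, a bug fix gives 1.4.3, a new backward-compatible column then gives 1.5.0, and a schema-breaking change gives 2.0.0, matching the rows of the table.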

Shell — Tag a release in Git

# Create an annotated Git tag for a release
git tag -a v1.5.0 -m "feat: add country_code to customer output"

# Push the tag to GitHub (triggers release workflow)
git push origin v1.5.0

# List all release tags
git tag --list 'v*'

# Get the current version from git tag in CI
VERSION=$(git describe --tags --abbrev=0 | sed 's/^v//')
echo "Building version: $VERSION"   # e.g. 1.5.0

Subtopic

Changelog Tracking

A CHANGELOG.md file tracks every release with what changed. It's the single source of truth for "what changed in this version?" — critical when you need to explain a prod incident.

Markdown — CHANGELOG.md

# Changelog

## [2.0.0] - Breaking
### BREAKING CHANGES
- Removed deprecated `cust_email` column (use `email` instead)
- Changed partition scheme from `created_date` to `event_date`

## [1.5.0] - 2024-03-15
### Added
- New `country_code` column derived from phone number normalization
- Support for reading from MSK (Managed Kafka) as source

### Changed
- Improved join performance using broadcast hint for country dim table

## [1.4.3] - 2024-03-10
### Fixed
- Fixed NullPointerException when customer email field is missing
- Fixed incorrect SLA timestamp in audit table

## [1.4.2] - 2024-03-01
### Changed
- Upgraded PySpark from 3.4.0 to 3.5.0
- Reduced shuffle partitions from 400 to 200 (20% faster)

## [1.4.0] - 2024-02-15
### Added
- SCD Type 2 support for customer dimension
- Data Quality checks using Great Expectations

Subtopic

Release Notes

Release notes are the human-readable summary of a CHANGELOG section, published when you create a GitHub Release. They help stakeholders (not just engineers) understand what changed.

YAML — Auto-generate GitHub Release from tag

# .github/workflows/release.yml
name: Create Release

on:
  push:
    tags:
      - 'v*'                  # Trigger on any version tag

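jobs:
  release:
    runs-on: ubuntu-latest
    permissions:
      contents: write          # Needed when the default GITHUB_TOKEN is read-only
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.10'   # Assumption: match your cluster's Python version

      - name: Build wheel
        run: |
          pip install build
          python -m build --wheel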

      - name: Create GitHub Release
        uses: softprops/action-gh-release@v2
        with:
          files: dist/*.whl
          generate_release_notes: true   # Auto-generate from PR titles
          token: ${{ secrets.GITHUB_TOKEN }}
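
If the release notes should come from CHANGELOG.md rather than from PR titles, the section for one version can be pulled out with a small script. This is a sketch that assumes the `## [x.y.z]` heading style shown in the changelog example; changelog_section is a hypothetical helper.

```python
def changelog_section(changelog: str, version: str) -> str:
    """Return the body of the '## [version]' section, without its heading."""
    heading = f"## [{version}]"
    body, inside = [], False
    for line in changelog.splitlines():
        if line.startswith("## ["):
            if inside:
                break                        # reached the next release heading
            inside = line.startswith(heading)
            continue
        if inside:
            body.append(line)
    return "\n".join(body).strip()


sample = """# Changelog

## [1.5.0] - 2024-03-15
### Added
- New `country_code` column derived from phone number normalization

## [1.4.3] - 2024-03-10
### Fixed
- Fixed incorrect SLA timestamp in audit table
"""
print(changelog_section(sample, "1.5.0"))
```

The returned text could be written to a file and passed to the release step in place of the auto-generated notes; an unknown version yields an empty string, which a CI job can treat as a failure.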

Subtopic

Hotfix Process

A hotfix is an emergency fix pushed to production without going through the full Dev → QA → UAT cycle. Reserve it for critical production bugs that cause data loss or pipeline failure.

Hotfix Flow:

Production Bug Detected
        │
        ▼
Create hotfix branch from the PROD tag
    git checkout -b hotfix/1.4.4 v1.4.3
        │
        ▼
Make minimal fix: change ONLY what's broken
        │
        ▼
Run unit tests + targeted integration test
    pytest tests/ -k "test_null_handling"
        │
        ▼
Bump patch version → 1.4.4 and tag
    git tag -a v1.4.4 -m "fix: critical null handling"
        │
        ▼
Fast-track deploy to Prod (skip UAT, inform stakeholders)
    databricks bundle deploy --target prod
        │
        ▼
Merge hotfix branch back to main AND develop
(so the fix is not lost in the next normal release)

MODULE 34 — REVIEW

Quiz & Summary

Test your understanding of CI/CD and Production Deployment concepts.

Q1. In GitHub Actions, what keyword is used to run a job only when a previous job succeeds?

Q2. You push a new wheel file to production and the pipeline immediately fails. What is the FASTEST recovery action?

Q3. What is the correct semantic version increment for a bug fix that does NOT change any APIs or schemas?

Q4. In a PySpark project, what does the --py-files argument in spark-submit do?

Q5. Which Databricks tool defines jobs, clusters, and pipelines in a databricks.yml file and deploys them via the CLI?

MODULE 34 — WHAT YOU LEARNED

Key Takeaways

🔁

CI/CD Fundamentals

CI catches bugs on every commit. CD automates deployment. Pipeline-as-code means your deployment is version-controlled.

⚙️

GitHub Actions

YAML workflows with triggers, jobs, and steps. Run Spark tests with local SparkSession. Use needs: to chain jobs. Store secrets securely.

🦊

GitLab CI

Stages run sequentially; jobs within a stage run in parallel. artifacts: passes build outputs between stages. when: manual for prod gates.

🔧

Jenkins

Declarative Jenkinsfile with pipeline { stages { } }. input {} block for human approval gates. Post-build notifications on failure.

📦

Packaging

Wheel files are the standard. setup.py or pyproject.toml defines the package. Pin ALL versions in requirements.txt for production.

🌍

Environments

Dev → QA → UAT → Prod. Each stage has different automation levels. Prod requires manual gate + rollback strategy. Always version your artifacts.

🏗️

IaC

Terraform for cross-cloud infra. Databricks Asset Bundles for Databricks-native deployment. CloudFormation for AWS-only. Always use remote state in Terraform.

🏷️

Release Management

SemVer: MAJOR.MINOR.PATCH. Maintain CHANGELOG.md. Tag Git releases. For hotfixes: branch from prod tag, minimal fix, fast-track deploy, merge back.

Module 34 Complete! You now have everything you need to build a production-grade CI/CD pipeline for PySpark. Next up: Module 35 — Enterprise Data Engineering Patterns (Medallion Architecture, Data Vault, Dimensional Modeling, Data Mesh, Event-Driven Architecture, CDC Frameworks, Data Contracts).