MODULE 28 — OVERVIEW

Spark on Kubernetes

Kubernetes has become the dominant way to run Spark in modern cloud-native data platforms. This module covers everything from why you'd choose K8s over YARN, to the internals of how Driver and Executor pods work, to production topics like dynamic allocation, pod templates, autoscaling, storage, security, and monitoring.

☸️

Kubernetes Native

No YARN dependency. Spark runs directly as K8s pods on any cluster — EKS, GKE, AKS, or on-prem.

⚡

Elastic Scaling

Spin up executors as pods, release them when done. Pay only for what you use.

🔒

Isolation

Each Spark job runs in its own pods with its own namespaces, RBAC, and network policies.

🌐

Multi-Tenant

Multiple teams share the same K8s cluster with resource quotas per namespace.

MODULE 28 — WHAT YOU WILL LEARN 28.1 Why Kubernetes for Spark → K8s vs YARN, cloud-native benefits 28.2 Architecture → Driver Pod, Executor Pods, how jobs run 28.3 Dynamic Allocation → Auto-scale executors, shuffleTracking 28.4 Spark Operator → SparkApplication CRD, submitting jobs declaratively 28.5 Pod Templates → Customising pods: init containers, sidecars, volumes 28.6 Resource Management → CPU/memory limits, node affinity, taints & tolerations 28.7 Autoscaling → KEDA, HPA, Cluster Autoscaler 28.8 Storage on K8s → PVCs, EmptyDir, S3, NFS 28.9 Security on K8s → RBAC, service accounts, secrets, network policies 28.10 Monitoring → Prometheus, Grafana, Spark UI, EFK logs

28.1

Why Kubernetes for Spark

Understand the motivation for running Spark on Kubernetes — and when to choose it over YARN or a managed service like Databricks or EMR.

⚖️

Kubernetes vs YARN CORE CONCEPT ▼

YARN — The Old Way

YARN (Yet Another Resource Negotiator) was Hadoop's resource manager. When Spark first emerged, YARN was the only cluster manager available in most enterprises. You needed a Hadoop cluster with YARN, HDFS, and a lot of operational overhead just to run Spark.

🏢 Analogy

YARN is like renting a fixed office building. You pay for the whole floor whether or not all desks are occupied. You need a dedicated Hadoop team to manage it. Kubernetes is like a WeWork — you get desk space exactly when you need it, pay per hour, and it works the same in any city (any cloud).

Dimension	YARN	Kubernetes
Dependencies	Requires Hadoop cluster	Cloud-native, no Hadoop needed
Resource model	NodeManager containers	Pods with CPU/memory limits
Multi-tenancy	YARN queues	Namespaces + RBAC + quotas
Cloud portability	Tightly coupled to HDFS	Works on EKS, GKE, AKS, on-prem
Container ecosystem	Limited Docker support	First-class Docker + OCI images
Scaling	Node-based, slower	Pod-level, fast autoscaling
Operational overhead	High (Hadoop ops team)	Medium (K8s ops team)

Resource Isolation

In YARN, multiple Spark jobs on the same cluster compete for containers and can interfere with each other. In Kubernetes, each Spark job runs in its own pods. You can enforce hard CPU and memory limits per pod so one job can never starve another.

📌 Key Benefit

K8s resource limits are enforced at the kernel level (cgroups). If an executor pod tries to use more memory than its limit, Kubernetes kills it with an OOMKilled status — preventing it from impacting other workloads on the node.

Multi-Tenancy

Multiple data engineering teams can share one Kubernetes cluster safely. Each team gets its own namespace with ResourceQuota limits. Team A can't accidentally consume all the CPUs and starve Team B.

yaml — Namespace + ResourceQuota per team

# Namespace for Team A's Spark jobs
apiVersion: v1
kind: Namespace
metadata:
  name: spark-team-a

---
# Enforce resource limits for Team A
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: spark-team-a
spec:
  hard:
    requests.cpu: "40"       # Team A can request max 40 CPU cores
    requests.memory: 160Gi   # Team A can request max 160 GB RAM
    limits.cpu: "80"         # Hard ceiling
    limits.memory: 320Gi

Cloud-Native Benefits

Kubernetes is available as a managed service on every major cloud — EKS (AWS), GKE (Google Cloud), AKS (Azure). You get the same Spark-on-K8s experience everywhere. Your Spark job definition (a YAML file or spark-submit command) works identically across all three clouds.

🐳

Docker images

Ship your Spark version, Python dependencies, and custom JARs as a container image. No cluster bootstrap scripts needed.

♻️

Reproducibility

The exact same Docker image runs in dev, staging, and prod — no "works on my cluster" problems.

💰

Cost efficiency

Pods start and stop in seconds. No idle YARN NodeManagers burning money between jobs.

28.2

Spark on Kubernetes Architecture

How Spark actually runs on Kubernetes — what pods are created, how they communicate, and what their lifecycle looks like.

🎯

Driver Pod DRIVER ▼

Driver Pod Lifecycle

When you run spark-submit --master k8s://..., Spark creates a Driver Pod in Kubernetes. This pod runs the Spark Driver — the brain of your application. The driver is responsible for creating the DAG, requesting executor pods, and coordinating the entire job.

spark-submit

→

K8s API creates Driver Pod

→

Driver starts, requests Executor Pods

→

Job runs

→

Driver Pod exits (Completed)

DRIVER POD INTERNALS ┌──────────────────────────────────────────────┐ │ Driver Pod (spark-myapp-driver) │ │ │ │ ┌─────────────────────────────────────────┐ │ │ │ JVM Process (Spark Driver) │ │ │ │ - SparkContext / SparkSession │ │ │ │ - DAG Scheduler │ │ │ │ - Task Scheduler │ │ │ │ - Block Manager Master │ │ │ └─────────────────────────────────────────┘ │ │ │ │ Port 4040 → Spark UI (via port-forward) │ │ Port 7078 → RPC with executors │ └──────────────────────────────────────────────┘

Driver Pod Configuration

You control the driver pod's resources with spark-submit flags. Setting appropriate CPU and memory for the driver is important — too little and the driver OOMs; too much wastes resources.

bash — spark-submit to Kubernetes

spark-submit \
  --master k8s://https://<k8s-api-server>:6443 \
  --deploy-mode cluster \
  --name my-pyspark-job \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.container.image=myrepo/spark:3.5.0-python \
  \
  # Driver pod resource configuration
  --conf spark.driver.cores=2 \
  --conf spark.driver.memory=4g \
  --conf spark.kubernetes.driver.podTemplateFile=driver-template.yaml \
  \
  # Service account for the driver to call K8s API
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
  \
  # Your PySpark script
  local:///opt/spark/work-dir/my_job.py

📌 deploy-mode cluster

In cluster mode, the driver runs inside Kubernetes as a pod. In client mode, the driver runs on your local machine (where you ran spark-submit). For production, always use cluster mode — the driver pod is managed by Kubernetes and survives network disconnects.

Service Account for the Driver

The Driver Pod needs permission to call the Kubernetes API to create and delete Executor Pods. You give it this permission via a Kubernetes Service Account bound to a Role that allows pod operations.

yaml — ServiceAccount + Role + RoleBinding

# Service account for Spark driver
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-sa
  namespace: spark-jobs

---
# Role allowing driver to manage pods
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-role
  namespace: spark-jobs
rules:
- apiGroups: [""]
  resources: [pods, services, configmaps, persistentvolumeclaims]
  verbs: [create, get, list, watch, delete, patch]

---
# Bind the role to the service account
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-rolebinding
  namespace: spark-jobs
subjects:
- kind: ServiceAccount
  name: spark-sa
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io

⚙️

Executor Pods EXECUTORS ▼

Executor Pod Lifecycle

The Driver Pod requests Executor Pods from the Kubernetes API. Kubernetes schedules them onto available nodes. Each Executor Pod runs a JVM with Spark executor threads. When the job completes (or an executor is no longer needed), Kubernetes deletes the pod.

EXECUTOR POD NAMING CONVENTION spark-<app-name>-exec-1 ← First executor pod spark-<app-name>-exec-2 ← Second executor pod spark-<app-name>-exec-N ← Nth executor pod Each pod runs on a K8s Worker Node: ┌──────────────────────────────────────────────────────┐ │ K8s Worker Node │ │ │ │ ┌────────────────────┐ ┌────────────────────┐ │ │ │ spark-app-exec-1 │ │ spark-app-exec-2 │ │ │ │ Executor JVM │ │ Executor JVM │ │ │ │ - Task threads │ │ - Task threads │ │ │ │ - Block Manager │ │ - Block Manager │ │ │ │ CPU: 2 cores │ │ CPU: 2 cores │ │ │ │ Memory: 4g │ │ Memory: 4g │ │ │ └────────────────────┘ └────────────────────┘ │ └──────────────────────────────────────────────────────┘

Executor Pod Configuration

You configure the number of executors, their CPU, and their memory via spark-submit flags. Kubernetes then schedules pods with those exact resource requests.

bash — Executor Configuration

spark-submit \
  --master k8s://https://<k8s-api-server>:6443 \
  --deploy-mode cluster \
  \
  # Executor configuration
  --num-executors 5 \
  --executor-cores 4 \
  --executor-memory 8g \
  \
  # Memory overhead for non-JVM memory (Python, off-heap)
  --conf spark.executor.memoryOverhead=2g \
  \
  # Image to use for executor pods (can differ from driver)
  --conf spark.kubernetes.executor.container.image=myrepo/spark:3.5.0-python \
  \
  local:///opt/spark/work-dir/my_job.py

⚠️ Memory Overhead

Always set spark.executor.memoryOverhead when running PySpark. Python processes run outside the JVM and consume memory that Kubernetes doesn't see. Without this, pods get OOMKilled by Kubernetes even though Spark thinks it has enough memory. A safe default is max(384m, 10% of executor memory).

Dynamic Allocation with Pods

With dynamic allocation enabled, the Driver Pod creates new executor pods as needed (when tasks are queued) and deletes idle executor pods after they've been idle for a configurable duration. This is very cost-efficient for variable workloads.

python — PySpark with Dynamic Allocation on K8s

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DynamicAllocationDemo") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.minExecutors", "1") \
    .config("spark.dynamicAllocation.maxExecutors", "20") \
    .config("spark.dynamicAllocation.initialExecutors", "2") \
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s") \
    # Required for K8s: shuffle tracking instead of external shuffle service
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true") \
    .getOrCreate()

# At the start of the job: 2 executor pods
# During a heavy shuffle: Spark creates up to 20 executor pods
# After the heavy stage completes: idle pods are removed
df = spark.read.parquet("s3://my-bucket/large-dataset/")
result = df.groupBy("country").count()
result.show()

28.3

Dynamic Allocation on Kubernetes

Dynamic allocation lets Spark scale executor pods up and down automatically based on workload — without requiring an External Shuffle Service (which doesn't work on K8s). It uses shuffle tracking instead.

🔄

External Shuffle Service Alternatives K8s SPECIFIC ▼

The Problem with External Shuffle Service on K8s

In YARN, External Shuffle Service (ESS) runs as a daemon on every NodeManager. When dynamic allocation removes an executor, shuffle data the executor wrote is still accessible via the ESS. On Kubernetes, there are no long-lived node daemons — so ESS doesn't work.

🏢 Analogy

Imagine executor pods as delivery drivers who leave packages in a locker (shuffle data). ESS is a permanent locker at the depot — drivers can leave and the package stays. Without ESS on K8s, when the driver (executor pod) leaves, the locker disappears too. Shuffle Tracking solves this by making sure Spark only removes executor pods after all their shuffle data has been consumed.

shuffleTracking — The K8s Solution

Shuffle Tracking (introduced in Spark 3.0) solves the ESS problem on Kubernetes. When enabled, the Spark driver tracks which executor pods hold shuffle data that other tasks still need. It will not decommission an executor pod until its shuffle data is no longer needed.

python — Dynamic Allocation with Shuffle Tracking

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ShuffleTrackingDemo") \
    \
    # Enable dynamic allocation
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.minExecutors", "2") \
    .config("spark.dynamicAllocation.maxExecutors", "50") \
    \
    # K8s-specific: shuffle tracking replaces External Shuffle Service
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true") \
    \
    # How long an executor must be idle before removal
    .config("spark.dynamicAllocation.executorIdleTimeout", "120s") \
    \
    # How long before a cached-data executor is removed
    .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "300s") \
    \
    .getOrCreate()

# Spark will now scale executor pods between 2 and 50
# based on task queue length
df = spark.read.parquet("s3://bucket/data/")
df.groupBy("category").agg({"amount": "sum"}).write.parquet("s3://bucket/output/")

Decommissioning Executors

When Spark decides to remove an idle executor, it gracefully decommissions it — the executor finishes its current task, migrates shuffle blocks if needed, then signals Kubernetes to delete the pod. This is safer than hard-killing the pod.

python — Executor Decommission Config

# Enable graceful decommission
spark = SparkSession.builder \
    .config("spark.decommission.enabled", "true") \
    \
    # Max time to wait for graceful decommission before force-removing
    .config("spark.storage.decommission.fallbackStorage.path",
            "s3://my-bucket/spark-decommission/") \
    \
    # Shuffle data migration during decommission
    .config("spark.storage.decommission.shuffleBlocks.enabled", "true") \
    .config("spark.storage.decommission.rddBlocks.enabled", "true") \
    .getOrCreate()

28.4

Spark Operator

The Spark Operator is a Kubernetes-native way to submit and manage Spark jobs using Custom Resource Definitions (CRDs). Instead of running spark-submit manually, you declare your Spark job as a YAML file.

🎛️

What is Spark Operator OPERATOR PATTERN ▼

The Kubernetes Operator Pattern

A Kubernetes Operator is a custom controller that extends Kubernetes to manage complex applications. The Spark Operator (from Google, now maintained by the community) adds two new Kubernetes resource types: SparkApplication and ScheduledSparkApplication. Instead of calling spark-submit, you kubectl apply a YAML and Kubernetes handles the rest.

🤖 Analogy

Without the operator, running a Spark job on K8s is like manually booking each item in a restaurant — you call the kitchen, the wait staff, the cashier separately. The Spark Operator is like ordering via a single form — you describe what you want, the operator coordinates everything automatically.

SparkApplication CRD

The SparkApplication is a Custom Resource Definition (CRD) — a new kind of Kubernetes object that describes a Spark job. You define everything in YAML: the Docker image, driver resources, executor resources, the main Python file, arguments, and more.

yaml — SparkApplication CRD example

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: pyspark-pi
  namespace: spark-jobs
spec:
  type: Python            # Python job
  pythonVersion: "3"
  mode: cluster            # cluster mode (driver runs as pod)
  image: myrepo/spark:3.5.0-python
  imagePullPolicy: IfNotPresent
  mainApplicationFile: local:///opt/spark/examples/pi.py

  sparkVersion: "3.5.0"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20

  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.5.0
    serviceAccount: spark-sa

  executor:
    cores: 1
    instances: 2
    memory: "512m"
    labels:
      version: 3.5.0

bash — Submit and Monitor via kubectl

# Submit the Spark job
kubectl apply -f pyspark-pi.yaml

# Check status
kubectl get sparkapplication pyspark-pi -n spark-jobs

# Describe for detailed info
kubectl describe sparkapplication pyspark-pi -n spark-jobs

# Watch logs of the driver pod
kubectl logs -f pyspark-pi-driver -n spark-jobs

# List all Spark apps
kubectl get sparkapplications -n spark-jobs

# Delete/cancel the job
kubectl delete sparkapplication pyspark-pi -n spark-jobs

ScheduledSparkApplication CRD

For recurring jobs (like daily ETL), use ScheduledSparkApplication. It accepts a cron schedule and automatically creates a new SparkApplication at each trigger time — just like a cron job, but for Spark.

yaml — ScheduledSparkApplication (daily ETL)

apiVersion: sparkoperator.k8s.io/v1beta2
kind: ScheduledSparkApplication
metadata:
  name: daily-etl-job
  namespace: spark-jobs
spec:
  schedule: "0 2 * * *"       # Run at 2 AM every day
  concurrencyPolicy: Forbid    # Don't allow overlapping runs
  successfulRunHistoryLimit: 3  # Keep last 3 successful runs
  failedRunHistoryLimit: 1      # Keep last 1 failed run for debugging
  template:
    spec:
      type: Python
      image: myrepo/spark:3.5.0-python
      mainApplicationFile: local:///opt/spark/work-dir/daily_etl.py
      driver:
        cores: 2
        memory: "4g"
        serviceAccount: spark-sa
      executor:
        cores: 4
        instances: 10
        memory: "8g"

Operator Installation

Install the Spark Operator using Helm — the Kubernetes package manager.

bash — Install Spark Operator via Helm

# Add the Spark Operator Helm chart repo
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm repo update

# Install the operator into its own namespace
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --create-namespace \
  --set webhook.enable=true \          # Enable pod template mutation webhook
  --set sparkJobNamespace=spark-jobs   # Watch this namespace for SparkApplications

# Verify it's running
kubectl get pods -n spark-operator
# Output: spark-operator-xxxx   Running

28.5

Pod Templates

Pod templates let you fully customise the Spark Driver and Executor pods — adding init containers, sidecars, custom volume mounts, environment variables, and any Kubernetes pod spec field.

📋

Driver and Executor Pod Templates CUSTOMISATION ▼

What are Pod Templates?

Spark lets you provide a Kubernetes Pod spec template for the driver and executor pods. Spark merges this template with its own auto-generated pod spec. This gives you access to the full Kubernetes pod API — things Spark doesn't natively expose.

📦 Example Use Cases

✅ Mount secrets as environment variables
✅ Add an init container to download a config file before Spark starts
✅ Add a sidecar container for log shipping (Fluentd)
✅ Mount a shared NFS volume for shuffle data
✅ Set node labels for pod scheduling (e.g., GPU nodes for ML)

Executor Pod Template with Volume Mount

yaml — executor-pod-template.yaml

apiVersion: v1
kind: Pod
metadata:
  labels:
    team: data-engineering
    env: production
spec:
  # Node selector: run only on data-plane nodes
  nodeSelector:
    node-role: spark-workers

  # Tolerations: allow running on tainted spark nodes
  tolerations:
  - key: "spark-dedicated"
    operator: "Exists"
    effect: NoSchedule

  containers:
  - name: spark             # Spark's main container (name must be "spark")
    env:
    - name: AWS_REGION
      value: us-east-1
    - name: DB_PASSWORD      # Inject secret as env var
      valueFrom:
        secretKeyRef:
          name: db-credentials
          key: password
    volumeMounts:
    - name: spark-local-dir   # Fast local SSD for shuffle
      mountPath: /tmp/spark-local

  # Sidecar for log shipping
  - name: fluentd
    image: fluent/fluentd:v1.14
    volumeMounts:
    - name: spark-logs
      mountPath: /var/log/spark

  volumes:
  - name: spark-local-dir
    emptyDir: {}             # Fast ephemeral storage on node
  - name: spark-logs
    emptyDir: {}

bash — Reference template in spark-submit

spark-submit \
  --master k8s://https://<k8s-api>:6443 \
  --deploy-mode cluster \
  # Point to executor pod template file
  --conf spark.kubernetes.executor.podTemplateFile=executor-pod-template.yaml \
  # Point to driver pod template file
  --conf spark.kubernetes.driver.podTemplateFile=driver-pod-template.yaml \
  local:///opt/spark/work-dir/my_job.py

Custom Init Containers

An init container runs before the main Spark container starts. This is useful for tasks like downloading a config file from S3, warming up a cache, or running a pre-flight check.

yaml — Init Container in Pod Template

spec:
  initContainers:
  - name: download-config
    image: amazon/aws-cli:latest
    command: ["aws", "s3", "cp",
               "s3://my-config-bucket/app.conf",
               "/config/app.conf"]
    volumeMounts:
    - name: config-volume
      mountPath: /config

  containers:
  - name: spark
    volumeMounts:
    - name: config-volume
      mountPath: /config   # Spark can now read /config/app.conf

  volumes:
  - name: config-volume
    emptyDir: {}

28.6

Resource Management

Kubernetes gives you fine-grained control over how much CPU and memory Spark pods can use, which nodes they run on, and how they're scheduled alongside other workloads.

📊

Resource Quotas — CPU limits and requests RESOURCES ▼

Requests vs Limits

Every Kubernetes pod has two resource settings: requests (what the pod is guaranteed) and limits (what the pod can never exceed).

Setting	What it means	Consequence if exceeded
`requests.cpu`	Guaranteed CPU for scheduling	Scheduler won't place pod on node without this capacity
`limits.cpu`	Maximum CPU pod can use	CPU is throttled (not killed)
`requests.memory`	Guaranteed memory	Pod won't be scheduled without this
`limits.memory`	Maximum memory allowed	Pod is OOMKilled (killed) immediately

python — Setting Executor Resources for K8s

spark = SparkSession.builder \
    .appName("ResourceDemo") \
    \
    # Executor memory = Spark's JVM memory (becomes K8s memory request)
    .config("spark.executor.memory", "8g") \
    \
    # Overhead = Python/off-heap memory (added on top for K8s limit)
    .config("spark.executor.memoryOverhead", "2g") \
    # K8s memory limit = executor.memory + memoryOverhead = 10g
    \
    # CPU request for executor pod
    .config("spark.executor.cores", "4") \
    \
    # Set explicit K8s CPU limit (optional — defaults to request)
    .config("spark.kubernetes.executor.limit.cores", "4") \
    \
    # Same for driver
    .config("spark.driver.memory", "4g") \
    .config("spark.driver.memoryOverhead", "1g") \
    .config("spark.driver.cores", "2") \
    .getOrCreate()

Node Affinity

Node Affinity lets you tell Kubernetes which nodes Spark pods should (or must) run on, based on node labels. This is useful for sending Spark jobs to dedicated compute nodes with more memory, or to nodes with fast SSDs.

yaml — Node Affinity in Pod Template

spec:
  affinity:
    nodeAffinity:
      # MUST run on a node with this label
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-type
            operator: In
            values: [memory-optimized]

      # PREFER to run on us-east-1a (but can run elsewhere if needed)
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: [us-east-1a]

Taints and Tolerations

Taints are applied to nodes to repel pods. Tolerations are applied to pods to allow them to be scheduled on tainted nodes. This is how you create dedicated Spark node pools — nodes that only run Spark executor pods and nothing else.

🏷️ Analogy

A taint is like a "Reserved — Data Engineering Only" sign on a desk. Tolerations are like the badge that lets data engineering pods sit at those desks. Other pods (web servers, apps) without the badge are repelled.

bash + yaml — Dedicated Spark Node Pool

# Step 1: Taint the spark-worker nodes so nothing else runs on them
kubectl taint nodes spark-node-1 spark-node-2 \
  spark-dedicated=true:NoSchedule

# Step 2: Add toleration in the Spark pod template so Spark CAN run there

yaml — Toleration in Pod Template

spec:
  tolerations:
  - key: "spark-dedicated"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

  # Also target spark nodes via nodeSelector
  nodeSelector:
    node-pool: spark-workers

28.7

Autoscaling

Three layers of autoscaling work together to make Spark on Kubernetes elastic: KEDA scales Spark executors based on queue metrics, HPA scales pods based on CPU/memory, and Cluster Autoscaler scales the underlying nodes.

📈

KEDA, HPA, and Cluster Autoscaler AUTOSCALING ▼

KEDA for Spark

KEDA (Kubernetes Event-Driven Autoscaling) can scale Spark executor pods based on external metrics — like Kafka consumer lag or a custom Spark pending-tasks metric. Unlike standard HPA (which uses CPU/memory), KEDA reacts to business-level signals.

yaml — KEDA ScaledJob for Spark on Kafka lag

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: spark-kafka-processor
  namespace: spark-jobs
spec:
  jobTargetRef:
    template:
      spec:
        containers:
        - name: spark-submit
          image: myrepo/spark-submitter:latest
          command: ["spark-submit", "--master", "k8s://...", "..."]

  # Scale based on Kafka consumer lag
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: spark-streaming-group
      topic: events
      lagThreshold: "1000"    # Trigger a new Spark run per 1000 lagged messages

Horizontal Pod Autoscaler (HPA)

HPA scales pods based on CPU or memory utilisation. For Spark streaming jobs where the workload is steady and measurable by CPU, HPA can scale executor replicas up and down automatically.

yaml — HPA for Spark Streaming Executors

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: spark-executor-hpa
  namespace: spark-jobs
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: spark-executor-deployment
  minReplicas: 2
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # Scale when CPU > 70%

Cluster Autoscaler

When all nodes are full and new executor pods can't be scheduled (they stay in Pending), Cluster Autoscaler adds new nodes to the Kubernetes cluster. When nodes are under-utilised, it removes them. This is the lowest-level autoscaling layer.

THREE LAYERS OF AUTOSCALING FOR SPARK ON K8S Layer 1 — KEDA / Spark Dynamic Allocation "I have more tasks to run" → Create more executor pods Layer 2 — Kubernetes Cluster Autoscaler "All nodes are full, executor pod is Pending" → Add a new node to the cluster (takes ~1-3 min on EKS/GKE) → Once node is Ready, pending executor pod gets scheduled Layer 3 — Scale Down "Executor pods are idle" → Spark dynamic allocation removes pods "Nodes have < 50% utilisation" → Cluster Autoscaler removes nodes Result: You pay only for what you use, automatically.

yaml — Node Pool Config for Autoscaling (EKS)

# In AWS EKS, configure a managed node group with autoscaling
# This is typically in Terraform or eksctl config

managedNodeGroups:
- name: spark-workers
  instanceType: r5.4xlarge      # 16 vCPU, 128 GB RAM — memory optimized
  minSize: 2
  maxSize: 20
  desiredCapacity: 2
  labels:
    node-pool: spark-workers
  taints:
  - key: spark-dedicated
    value: "true"
    effect: NO_SCHEDULE

28.8

Storage on Kubernetes

Spark on Kubernetes uses different storage strategies for different needs — fast local storage for shuffle, durable object storage for data, and persistent volumes for state.

💾

Storage Strategies STORAGE ▼

Persistent Volume Claims (PVC)

A PVC is a request for storage in Kubernetes. Spark can use PVCs for streaming checkpoints, shuffled data, or any data that needs to survive pod restarts. PVCs are backed by the cluster's StorageClass (EBS on AWS, Persistent Disk on GCP, etc.).

yaml — PVC for Spark Streaming Checkpoint

# Create a PVC for storing Spark streaming checkpoint
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spark-checkpoint-pvc
  namespace: spark-jobs
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: gp3-ssd          # Fast SSD storage class
  resources:
    requests:
      storage: 100Gi

python — Use PVC for Streaming Checkpoint

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamingWithPVC").getOrCreate()

df = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "events") \
    .load()

# Checkpoint to the PVC mounted at /checkpoint in the driver pod
query = df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/checkpoint/streaming-job") \
    .outputMode("append") \
    .start("s3://my-bucket/delta-output/")

query.awaitTermination()

EmptyDir for Shuffle

EmptyDir is a temporary directory that Kubernetes creates on the node for a pod. It's fast (uses local disk) and is deleted when the pod terminates. It's ideal for Spark shuffle data — you want speed, and you don't need the data after the job finishes.

yaml — EmptyDir for Spark Shuffle in Pod Template

spec:
  containers:
  - name: spark
    env:
    - name: SPARK_LOCAL_DIRS   # Tell Spark to use this for shuffle
      value: /mnt/spark-local
    volumeMounts:
    - name: spark-local
      mountPath: /mnt/spark-local

  volumes:
  - name: spark-local
    emptyDir:
      medium: ""           # Use node's default disk
      sizeLimit: 100Gi     # Limit shuffle size per pod

# For even faster shuffle, use Memory-backed emptyDir:
# medium: Memory  → shuffle goes to RAM (very fast but limited)

S3 as External Storage

For durable data (input, output, checkpoints), S3 (or GCS / Azure Blob) is the standard choice. Spark on Kubernetes accesses S3 via the Hadoop S3A connector, configured through IAM roles attached to the pod's service account (IRSA on EKS).

python — S3 access from Spark on EKS (IRSA)

spark = SparkSession.builder \
    .appName("S3Access") \
    # Use IAM Role for Service Accounts (IRSA) — no hardcoded credentials
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.WebIdentityTokenCredentialsProvider") \
    .config("spark.hadoop.fs.s3a.impl",
            "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.path.style.access", "false") \
    .getOrCreate()

# Read directly from S3 — credentials from pod's IAM role
df = spark.read.parquet("s3a://my-data-bucket/input/sales/")
df.write.mode("overwrite").parquet("s3a://my-data-bucket/output/sales_agg/")

NFS for Shared Data

NFS (Network File System) allows multiple pods to read and write the same shared directory simultaneously. Useful when multiple Spark jobs need access to the same lookup tables or config files stored on a shared volume.

yaml — Shared NFS Volume

volumes:
- name: shared-data
  nfs:
    server: nfs-server.spark-jobs.svc.cluster.local
    path: /shared/lookup-tables

# Mount in executor pods so all executors can read lookup tables
volumeMounts:
- name: shared-data
  mountPath: /data/lookup
  readOnly: true

28.9

Security on Kubernetes

Security for Spark on Kubernetes involves RBAC for access control, service accounts for identity, Kubernetes secrets for credentials, and network policies for traffic isolation.

🔒

RBAC, Service Accounts, Secrets, Network Policies SECURITY ▼

RBAC for Spark

RBAC (Role-Based Access Control) defines what Kubernetes resources a service account can access. For Spark, the driver pod needs permission to create/delete executor pods. You've already seen this in section 28.2 — here's a production-hardened version with least privilege.

yaml — Minimal Spark RBAC (Least Privilege)

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-minimal-role
  namespace: spark-jobs
rules:
# Only allow what Spark actually needs
- apiGroups: [""]
  resources: [pods]
  verbs: [create, delete, get, list, watch, patch]
- apiGroups: [""]
  resources: [pods/log]        # Read executor logs
  verbs: [get, list]
- apiGroups: [""]
  resources: [configmaps]     # For Spark config
  verbs: [create, get, delete]
- apiGroups: [""]
  resources: [services]       # For driver headless service
  verbs: [create, delete, get]
# Do NOT grant: nodes, namespaces, clusterroles, secrets (only if needed)

Secret Management

Never hardcode credentials in Spark config or pod specs. Use Kubernetes Secrets or an external secret manager (AWS Secrets Manager, HashiCorp Vault). Inject secrets as environment variables or mounted files into pods.

yaml — Create K8s Secret for DB Credentials

# Create secret (base64-encoded values)
kubectl create secret generic db-creds \
  --from-literal=url=jdbc:postgresql://db:5432/mydb \
  --from-literal=user=spark_user \
  --from-literal=password=supersecret \
  -n spark-jobs

# Kubernetes stores it encrypted (if KMS encryption is enabled)

yaml — Inject Secret into Spark Pod Template

spec:
  containers:
  - name: spark
    env:
    - name: DB_URL
      valueFrom:
        secretKeyRef:
          name: db-creds
          key: url
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: db-creds
          key: password

python — Read Secret from Environment Variable

import os
from pyspark.sql import SparkSession

# Read DB credentials injected by Kubernetes from the Secret
db_url = os.environ["DB_URL"]
db_password = os.environ["DB_PASSWORD"]

spark = SparkSession.builder.appName("SecureJob").getOrCreate()

# Use credentials to read from JDBC without hardcoding
df = spark.read.format("jdbc") \
    .option("url", db_url) \
    .option("dbtable", "orders") \
    .option("user", "spark_user") \
    .option("password", db_password) \
    .load()

df.show()

Network Policies

Network Policies are Kubernetes firewall rules that control which pods can talk to which other pods. For Spark, you want to: allow driver ↔ executor communication, block executor pods from accessing the internet, and restrict external access to only the Spark UI.

yaml — Network Policy for Spark Jobs

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: spark-network-policy
  namespace: spark-jobs
spec:
  podSelector:
    matchLabels:
      spark-role: executor
  policyTypes: [Ingress, Egress]
  ingress:
  # Allow incoming connections from driver only
  - from:
    - podSelector:
        matchLabels:
          spark-role: driver
  egress:
  # Allow executors to communicate with driver
  - to:
    - podSelector:
        matchLabels:
          spark-role: driver
  # Allow executors to access S3 (via VPC endpoint = within cluster range)
  - to:
    - ipBlock:
        cidr: 10.0.0.0/8   # Internal VPC CIDR only

28.10

Monitoring Spark on Kubernetes

Production Spark on Kubernetes needs observability at three levels: the Spark application (Spark UI), the Kubernetes infrastructure (Prometheus + Grafana), and the logs (EFK stack).

📡

Prometheus, Grafana, Spark UI, EFK Stack OBSERVABILITY ▼

Prometheus + Grafana

Spark exposes metrics in Prometheus format via the PrometheusServlet. A Prometheus server scrapes these metrics, and Grafana visualises them in dashboards. You get executor CPU, memory, shuffle read/write, GC time, and task duration metrics.

python — Enable Prometheus Metrics in Spark

spark = SparkSession.builder \
    .appName("MonitoredSparkJob") \
    \
    # Enable Prometheus metrics endpoint
    .config("spark.ui.prometheus.enabled", "true") \
    \
    # Expose metrics via servlet on driver port 4040
    .config("spark.metrics.conf.*.sink.prometheusServlet.class",
            "org.apache.spark.metrics.sink.PrometheusServlet") \
    .config("spark.metrics.conf.*.sink.prometheusServlet.path",
            "/metrics/prometheus") \
    \
    # Enable executor metrics (reported to driver)
    .config("spark.executor.processTreeMetrics.enabled", "true") \
    .getOrCreate()

yaml — Prometheus ServiceMonitor for Spark Driver

# Tells Prometheus Operator to scrape Spark driver metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: spark-driver-monitor
  namespace: spark-jobs
spec:
  selector:
    matchLabels:
      spark-role: driver
  endpoints:
  - port: spark-ui               # Port 4040
    path: /metrics/prometheus
    interval: 15s

Spark UI via Port-Forward

The Spark UI runs on port 4040 of the Driver Pod. To access it from your laptop, use kubectl port-forward. In production, you can expose it via a Kubernetes Service or Ingress (with authentication).

bash — Access Spark UI

# Get the driver pod name
kubectl get pods -n spark-jobs | grep driver
# Output: my-spark-app-driver   Running

# Forward port 4040 to your local machine
kubectl port-forward pod/my-spark-app-driver 4040:4040 -n spark-jobs

# Now open: http://localhost:4040 in your browser
# → You see the full Spark UI: Jobs, Stages, Executors, SQL, Storage

# Alternative: Expose via a Kubernetes Service
kubectl expose pod my-spark-app-driver \
  --type=ClusterIP \
  --port=4040 \
  --name=spark-ui-service \
  -n spark-jobs

Kubernetes Events

Kubernetes events tell you what happened at the pod level — scheduling decisions, OOM kills, image pull errors. Always check events when a Spark job fails.

bash — Checking Kubernetes Events for Spark Pods

# Events for the whole namespace
kubectl get events -n spark-jobs --sort-by='.lastTimestamp'

# Events for a specific pod
kubectl describe pod my-spark-app-exec-3 -n spark-jobs
# Look for:  OOMKilled → need more executor memory
#            Insufficient memory → need bigger nodes
#            ImagePullBackOff   → Docker image issue
#            Evicted            → node was under memory pressure

# Get all failed pods in namespace
kubectl get pods -n spark-jobs --field-selector=status.phase=Failed

Log Aggregation with EFK Stack

The EFK Stack (Elasticsearch + Fluentd + Kibana) aggregates logs from all Spark pods in a central place. Instead of kubectl logs (which disappears when the pod is deleted), every log line is shipped to Elasticsearch and queryable in Kibana.

EFK LOG PIPELINE FOR SPARK ON KUBERNETES Spark Driver Pod Spark Executor Pods (1..N) └── stdout/stderr + └── stdout/stderr │ │ ▼ ▼ Fluentd DaemonSet ← Reads from /var/log/containers/ on each node │ ▼ Elasticsearch ← Indexed, searchable, retention-managed │ ▼ Kibana Dashboard - Search by app name: "my-spark-app" - Search by log level: ERROR - Filter by time range - Create alerts on ERROR count spikes

python — Structured Logging from PySpark (to stdout → EFK)

import logging
import json
from pyspark.sql import SparkSession

# Structured JSON logging — EFK can parse and index each field
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("SparkJob")

def log_metric(event, **kwargs):
    """Emit a structured log line that EFK will parse."""
    record = {"event": event, **kwargs}
    logger.info(json.dumps(record))

spark = SparkSession.builder.appName("ProductionJob").getOrCreate()

df = spark.read.parquet("s3://bucket/input/")
row_count = df.count()

# This JSON line will be indexed in Elasticsearch by Fluentd
log_metric("data_read",
           table="input",
           row_count=row_count,
           app_id=spark.sparkContext.applicationId)

result = df.groupBy("region").count()
result.write.mode("overwrite").parquet("s3://bucket/output/")

log_metric("job_complete",
           output_table="output",
           rows_written=result.count())

# Kibana query to find all runs of this job:
# event:"job_complete" AND app_id:*

28.11

Module Summary

You've completed Module 28 — Spark on Kubernetes. Here's a consolidated mental model of how everything fits together.

🗺️

Complete Architecture Overview ▼

Full Picture — Spark on Kubernetes

SPARK ON KUBERNETES — COMPLETE PICTURE ┌─────────────────────────────────────────────────────────────────┐ │ SUBMISSION LAYER │ │ spark-submit --master k8s://... OR kubectl apply -f crd │ │ (manual) (Spark Operator) │ └─────────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────┐ │ KUBERNETES CONTROL PLANE │ │ API Server → Scheduler → etcd │ │ RBAC enforced (spark-sa service account) │ └─────────────────────────────────────────────────────────────────┘ │ ┌───────────────┴────────────────┐ ▼ ▼ ┌─────────────────────────┐ ┌───────────────────────────────┐ │ DRIVER POD │ │ EXECUTOR PODS (1..N) │ │ spark-app-driver │────▶│ spark-app-exec-1..N │ │ Runs SparkContext │◀────│ Task execution, shuffle │ │ DAG + Stage scheduling │ │ EmptyDir for shuffle storage │ │ Spark UI on :4040 │ │ Memory limit enforced by K8s │ └─────────────────────────┘ └───────────────────────────────┘ │ │ ▼ ▼ ┌─────────────────────────────────────────────────────────────────┐ │ STORAGE LAYER │ │ S3 / GCS / ADLS → Input data, output data, checkpoints │ │ EmptyDir (node SSD) → Shuffle data (ephemeral) │ │ PVC (EBS/PD) → Streaming checkpoint (durable) │ │ K8s Secrets → DB passwords, API keys │ └─────────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────┐ │ OBSERVABILITY LAYER │ │ Prometheus → Scrapes Spark metrics (CPU, mem, shuffle) │ │ Grafana → Dashboards and alerts │ │ EFK Stack → Log aggregation from all pods │ │ Spark UI → Job DAGs, stage timelines, task details │ └─────────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────┐ │ AUTOSCALING LAYER │ │ Spark Dynamic Alloc → Adds/removes executor pods │ │ KEDA → Event-driven scaling (Kafka lag, etc.) │ │ Cluster Autoscaler → Adds/removes K8s nodes │ └─────────────────────────────────────────────────────────────────┘

Key Topics — Quick Reference

Topic	What you learned
28.1 Why K8s	K8s vs YARN, resource isolation, multi-tenancy, cloud-native portability
28.2 Architecture	Driver Pod lifecycle, Executor Pod lifecycle, service accounts, spark-submit flags
28.3 Dynamic Allocation	No ESS on K8s, shuffleTracking, executor decommission, scale-up/down
28.4 Spark Operator	SparkApplication CRD, ScheduledSparkApplication, Helm install, kubectl workflow
28.5 Pod Templates	Customise pods: init containers, sidecars, volume mounts, env vars from secrets
28.6 Resource Mgmt	CPU/memory requests vs limits, node affinity, taints and tolerations
28.7 Autoscaling	KEDA (event-driven), HPA (CPU-based), Cluster Autoscaler (node-level)
28.8 Storage	PVC for checkpoints, EmptyDir for shuffle, S3 for data, NFS for shared access
28.9 Security	RBAC, service accounts, K8s secrets injection, network policies
28.10 Monitoring	Prometheus metrics, Grafana dashboards, Spark UI port-forward, EFK log stack

Interview Q&A Cheatsheet

🎯 Common Interview Questions

Q: Why use Kubernetes over YARN for Spark?
A: No Hadoop dependency, true resource isolation per pod, cloud-native portability (same YAML on EKS/GKE/AKS), first-class Docker support, fine-grained RBAC.

Q: How does dynamic allocation work on K8s without ESS?
A: Use spark.dynamicAllocation.shuffleTracking.enabled=true. The driver tracks which executor pods hold unconsumed shuffle data and defers deleting them until their data is no longer needed.

Q: What is the Spark Operator?
A: A Kubernetes controller that introduces SparkApplication and ScheduledSparkApplication CRDs. You declare Spark jobs as YAML, and the operator handles spark-submit, monitoring, and restarts.

Q: How do you prevent one Spark job from consuming all cluster resources?
A: Namespaces + ResourceQuota (limits total CPU/memory per namespace), Pod CPU/memory limits, and node taints to dedicate specific nodes per team.

Q: Where does Spark shuffle data go on Kubernetes?
A: EmptyDir volumes on the executor pod's local node disk. When the executor pod is deleted, the shuffle data is gone. This is why shuffleTracking exists — to prevent premature deletion.