MODULE 31 — OVERVIEW

Spark Security

Security in Spark is a layered discipline: who you are (Authentication), what you can do (Authorization), how your data is protected (Encryption & Masking), where traffic flows (Network), and which regulations apply (Compliance). This module covers every layer with real configuration and code.

🪪

Authentication

Kerberos, LDAP, SSL/TLS — prove identity before any cluster access is granted.

🛡️

Authorization

Apache Ranger and Unity Catalog enforce who can read which tables, columns, and rows.

🔒

Data Security

Masking, tokenization, AES column-level encryption — protect sensitive data at rest and in transit.

🌐

Network & Compliance

VPCs, private endpoints, HIPAA / GDPR / PCI-DSS — secure Spark at the infra and regulatory level.

MODULE 31 — WHAT YOU WILL LEARN 31.1 Authentication → Kerberos (kinit, keytab), LDAP, SSL/TLS 31.2 Authorization → Apache Ranger (policies, row-filter, col-mask), Unity Catalog 31.3 Data Security → Masking, Tokenization, AES Encryption, Secrets Management 31.4 Network Security → VPC isolation, private endpoints, network policies, Kubernetes 31.5 Compliance → HIPAA (PHI, audit), GDPR (right to erasure), PCI-DSS (encryption)

Interview tip: Security questions appear regularly in senior data engineer interviews. Know the difference between Authentication vs Authorization, how Ranger column masking works, how to enable RPC encryption, and what GDPR "right to erasure" means for Delta/Iceberg.

31.1 — AUTHENTICATION

Authentication

Authentication = proving who you are. Spark supports three main authentication mechanisms: Kerberos (enterprise Hadoop clusters), LDAP (user directory integration), and SSL/TLS (encrypted RPC and UI). You configure these at the cluster and SparkSession level.

🎫

Kerberos

Enterprise Auth ▼

Concept

What is Kerberos?

Kerberos is a ticket-based authentication protocol used in enterprise Hadoop environments. Instead of passwords being sent over the wire, a Key Distribution Center (KDC) issues encrypted tickets. Spark uses Kerberos when deployed on Kerberized HDFS/YARN clusters.

Kerberos Flow: User / Service │ ▼ (1) Request Ticket Granting Ticket (TGT) KDC (Key Distribution Center) │ ▼ (2) Returns encrypted TGT (valid for N hours) kinit stores TGT in local credential cache │ ▼ (3) Spark uses TGT to request Service Ticket for HDFS/YARN HDFS / YARN accepts Service Ticket → grants access

kinit

kinit — obtaining a Kerberos ticket interactively

kinit is the CLI tool that authenticates a user against the KDC and stores the Ticket Granting Ticket (TGT) in the local credential cache. You run it before submitting Spark jobs in a Kerberized cluster.

bash

# Authenticate interactively (prompts for password)
kinit username@REALM.COM

# Check your current tickets
klist

# Example output:
# Ticket cache: FILE:/tmp/krb5cc_1000
# Default principal: spark_user@CORP.LOCAL
#
# Valid starting       Expires              Service principal
# 06/15/2026 09:00:00  06/15/2026 19:00:00  krbtgt/CORP.LOCAL@CORP.LOCAL

# Destroy tickets when done
kdestroy

keytab

keytab — non-interactive authentication for scheduled jobs

A keytab is an encrypted file containing the service principal's key. It allows Spark jobs to authenticate with Kerberos without human interaction — essential for Airflow-triggered or scheduled Spark jobs. The keytab is created by the Kerberos admin.

bash

# Authenticate using a keytab (no password prompt)
kinit -kt /etc/spark/spark.keytab spark_svc@CORP.LOCAL

# Verify
klist

Spark Config

Configuring Spark with Kerberos

Pass the principal and keytab to spark-submit. Spark's credential renewal thread automatically refreshes tokens before they expire during long-running jobs.

bash — spark-submit

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal spark_svc@CORP.LOCAL \
  --keytab /etc/spark/spark.keytab \
  --conf spark.yarn.access.hadoopFileSystems=hdfs://namenode:8020 \
  my_job.py

python — SparkSession with Kerberos

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("KerberosApp")
    .config("spark.yarn.principal", "spark_svc@CORP.LOCAL")
    .config("spark.yarn.keytab", "/etc/spark/spark.keytab")
    # Token renewal interval (default 75% of token lifetime)
    .config("spark.kerberos.access.hadoopFileSystems", "hdfs://namenode:8020")
    .getOrCreate())

# Spark automatically renews delegation tokens in background
df = spark.read.parquet("hdfs://namenode:8020/data/sales/")
df.show()

Token Renewal: Long-running Spark Streaming jobs must have token renewal configured. Without it, HDFS delegation tokens expire mid-job and the job fails. Set spark.kerberos.renewal.credentials = keytab to enable automatic renewal.

📋

LDAP

Directory Auth ▼

Concept

LDAP Integration with Spark

LDAP (Lightweight Directory Access Protocol) is a directory service protocol used to authenticate users against corporate directories (Active Directory, OpenLDAP). Spark's Thrift Server and Spark History Server support LDAP authentication so that only authorised users can query SQL or view job history.

Where LDAP is used in Spark: Spark itself does not have a built-in LDAP auth layer for the driver. LDAP is configured on Spark Thrift Server (HiveServer2-compatible SQL endpoint) and Spark History Server. Cluster managers (YARN, Kubernetes) handle broader LDAP integration.

Config

Enabling LDAP on Spark Thrift Server

Add these properties to hive-site.xml or pass as Spark configs. The Thrift Server will then require users to authenticate via your LDAP directory before executing SQL queries.

xml — hive-site.xml

<property>
  <name>hive.server2.authentication</name>
  <value>LDAP</value>
</property>

<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>ldap://ldap.corp.local:389</value>
</property>

<property>
  <name>hive.server2.authentication.ldap.baseDN</name>
  <value>dc=corp,dc=local</value>
</property>

<property>
  <name>hive.server2.authentication.ldap.Domain</name>
  <value>corp.local</value>
</property>

bash — connecting beeline to LDAP-protected Thrift Server

beeline -u "jdbc:hive2://spark-thrift:10000/default;AuthMech=3" \
  -n alice@corp.local \
  -p MySecret123

History Server

LDAP Auth on Spark History Server

Protect the Spark History Server so only authorised users can view job DAGs and logs.

conf — spark-defaults.conf

spark.history.ui.acls.enable=true
spark.history.ui.admin.acls=spark-admins
spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter
spark.ui.filters.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params=\
  type=ldap,\
  ldap.url=ldap://ldap.corp.local:389,\
  ldap.base.dn=dc=corp,dc=local

🔑

SSL/TLS

Encryption in Transit ▼

Concept

What SSL/TLS secures in Spark

SSL/TLS in Spark secures three channels: the Spark UI / History Server (HTTPS), RPC communication between Driver ↔ Executors (data shuffle, block transfers), and the Thrift Server (JDBC connections). Without TLS, shuffle data and UI credentials travel in plaintext across the network.

🌐 Spark UI (HTTPS)

Protects the web UI showing job DAGs, environment variables (which may expose secrets).

📡 RPC Encryption

Encrypts Driver ↔ Executor communication — task data, shuffle blocks, broadcasts.

Enabling SSL

Enabling SSL for Spark UI

You need a Java keystore containing the server certificate. Generate one with keytool, then point Spark at it.

bash — generate keystore

# Generate a self-signed certificate keystore
keytool -genkeypair \
  -alias spark \
  -keyalg RSA \
  -keysize 2048 \
  -keystore /etc/spark/spark-keystore.jks \
  -storepass changeit \
  -validity 365 \
  -dname "CN=spark.corp.local, OU=Data, O=Corp, L=City, ST=State, C=IN"

conf — spark-defaults.conf — Spark UI SSL

# Enable HTTPS for Spark UI
spark.ui.enabled=true
spark.ssl.ui.enabled=true
spark.ssl.ui.keyStore=/etc/spark/spark-keystore.jks
spark.ssl.ui.keyStorePassword=changeit
spark.ssl.ui.keyPassword=changeit
spark.ssl.ui.protocol=TLSv1.3

# Enable HTTPS for History Server
spark.ssl.historyServer.enabled=true
spark.ssl.historyServer.keyStore=/etc/spark/spark-keystore.jks
spark.ssl.historyServer.keyStorePassword=changeit

RPC Encryption

RPC Encryption — encrypting Driver ↔ Executor traffic

RPC encryption encrypts all messages between the Driver and Executors — including task submissions, shuffle data, and heartbeats. This is critical in shared clusters where network traffic could be intercepted.

conf — spark-defaults.conf — RPC encryption

# Enable RPC (driver-executor) encryption
spark.authenticate=true
spark.authenticate.secret=your-strong-shared-secret-here

# Encrypt data in transit for all Spark RPC
spark.network.crypto.enabled=true
spark.network.crypto.keyLength=256
spark.network.crypto.keyFactoryAlgorithm=PBKDF2WithHmacSHA256

# Encrypt shuffle data on disk (when spilling)
spark.io.encryption.enabled=true
spark.io.encryption.keySizeBits=256

Performance note: RPC encryption adds CPU overhead (typically 5–15% throughput reduction on shuffle-heavy jobs). It's non-negotiable in regulated environments (HIPAA, PCI) but optional in isolated VPCs where network-level security is sufficient.

Certificate Management

In production, use CA-signed certificates rather than self-signed ones. Distribute the CA's truststore to all cluster nodes so they can verify each other's identity. Automate renewal with tools like cert-manager (Kubernetes) or HashiCorp Vault PKI.

bash — import CA certificate into truststore

# Import your corporate CA certificate into a Java truststore
keytool -importcert \
  -file /etc/ssl/certs/corp-ca.crt \
  -alias corp-ca \
  -keystore /etc/spark/spark-truststore.jks \
  -storepass trustpass \
  -noprompt

# Reference in spark config
# spark.ssl.ui.trustStore=/etc/spark/spark-truststore.jks
# spark.ssl.ui.trustStorePassword=trustpass

31.2 — AUTHORIZATION

Authorization

Authorization = controlling what an authenticated user can do. After Kerberos proves you are Alice, authorization decides: can Alice read table X? Can she see column salary? Apache Ranger and Unity Catalog are the two dominant authorization layers in enterprise Spark.

🏹

Apache Ranger

On-Prem / YARN ▼

Architecture

Ranger Architecture

Apache Ranger provides centralized security administration for the Hadoop ecosystem. It has a web UI to define fine-grained policies, and plugins installed on each service (Hive, HDFS, Spark SQL) that intercept and enforce those policies at runtime.

Ranger Architecture: ┌────────────────────────────────────────────────────┐ │ Ranger Admin (Web UI + REST API) │ │ Define policies: who can access what, when, how │ └─────────────────────┬──────────────────────────────┘ │ sync policies (every 30s) ┌────────────┼────────────────┐ ▼ ▼ ▼ Hive Plugin HDFS Plugin Spark SQL Plugin (intercepts (intercepts (intercepts SQL HiveServer2 HDFS ops) on Thrift Server) queries) Every query passes through the plugin → policy check → allow/deny

Policies

Ranger Policies for Spark SQL

Ranger policies are defined in the Admin UI and specify: Resource (database, table, column), Users/Groups, and Permissions (select, insert, update, drop, create). The Spark/Hive plugin enforces these at query execution time.

json — example Ranger policy (REST API payload)

{
  "name": "sales_analysts_read_policy",
  "service": "hive_spark_service",
  "resources": {
    "database": {"values": ["sales_db"]},
    "table":    {"values": ["orders", "customers"]},
    "column":   {"values": ["*"]}   // all columns
  },
  "policyItems": [{
    "groups":      ["sales_analysts"],
    "accesses":    [{"type": "select", "isAllowed": true}]
  }]
}

In practice: Ranger policies are usually managed via the web UI, not the API. Understand the UI concepts: Resource hierarchy (database → table → column), Policy conditions (IP range, time-based), and Masking policies (for column obfuscation).

Hive Plugin

Hive Plugin with Ranger

The Hive plugin intercepts all Spark SQL / HiveServer2 queries. Every SELECT, INSERT, DROP is checked against Ranger policies before execution. Unauthorized queries are denied with an error.

bash — install Ranger Hive plugin

# Unpack the ranger-hive-plugin archive
cd /opt/ranger-hive-plugin-2.4.0
./enable-hive-plugin.sh

# This copies the plugin JAR into HiveServer2's classpath
# and adds ranger-plugin-audit.xml, ranger-hive-security.xml
# Restart HiveServer2 / Spark Thrift Server after

HDFS Plugin

HDFS Plugin with Ranger

The HDFS plugin enforces path-level access control. Even if a Spark job tries to bypass Hive/SQL and read HDFS directly, Ranger HDFS policies block unauthorized path access.

json — HDFS Ranger policy

{
  "name": "data_lake_read_policy",
  "service": "hdfs_service",
  "resources": {
    "path": {"values": ["/data/gold/*"], "recursive": true}
  },
  "policyItems": [{
    "groups": ["data_scientists"],
    "accesses": [
      {"type": "read",    "isAllowed": true},
      {"type": "execute", "isAllowed": true}
    ]
  }]
}

Row-Level Filtering

Row-Level Filtering in Ranger

Ranger can automatically inject a WHERE clause into any query based on the user's identity. For example, a regional manager in APAC only ever sees rows where region = 'APAC' — the filter is applied transparently, without any changes to the application SQL.

Ranger UI — Row Filter Policy (conceptual)

Policy Name: apac_region_filter
Table: sales_db.orders
For group: apac_managers
Row filter expression: region = 'APAC'

Effect: any query by apac_managers against sales_db.orders
automatically becomes:
  SELECT ... FROM sales_db.orders WHERE region = 'APAC'

Column Masking

Column Masking with Ranger

Ranger's data masking policies transform sensitive column values at query time. The actual data in storage is not changed — Ranger applies masking functions as the data is returned to the user. Masking types include: Redact, Hash, Nullify, Show first/last N chars, Date year-only.

Ranger UI — Column Masking Policy (conceptual)

Policy: pii_masking_policy
Table: customers
Column: email
For groups: analysts (non-PII-authorised)
Masking Type: Hash (SHA256)

Column: phone
Masking Type: Redact  →  output: XXXXXXXXXX

Column: credit_card
Masking Type: Show last 4  →  output: ****-****-****-1234

-- What analysts actually see when they run SELECT *:
-- email:       a3f5d2...  (SHA256 hash)
-- phone:       XXXXXXXXXX
-- credit_card: ****-****-****-1234

🗂️

Unity Catalog Security

Databricks ▼

Overview

Unity Catalog — Unified Governance for Databricks

Unity Catalog (UC) is Databricks' unified governance layer that manages access to all data objects in a single metastore. It supports table/view/volume/function grants, row filters, column masks, and built-in data lineage — all in one place without needing Ranger or a separate tool.

Unity Catalog — 3-Level Namespace: catalog └── schema (database) └── table / view / volume / function Example: main.sales.orders ──── ───── ────── catalog schema table

Permissions

Workspace / Catalog / Table-Level Permissions

UC uses SQL GRANT / REVOKE statements. Permissions are inherited: granting USE CATALOG → USE SCHEMA → SELECT on table gives full read access.

sql — Unity Catalog GRANT statements

-- Grant access to a catalog
GRANT USE CATALOG ON CATALOG main TO `analysts@corp.com`;

-- Grant access to a schema
GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts@corp.com`;

-- Grant SELECT on a specific table
GRANT SELECT ON TABLE main.sales.orders TO `analysts@corp.com`;

-- Grant to a group (preferred)
GRANT SELECT ON TABLE main.sales.orders TO `data_analysts`;

-- Revoke
REVOKE SELECT ON TABLE main.sales.orders FROM `contractors`;

-- View grants on a table
SHOW GRANTS ON TABLE main.sales.orders;

Row Filters

Row Filters in Unity Catalog

UC row filters are implemented as SQL functions attached to a table. When any user queries the table, the filter function is evaluated and only matching rows are returned — entirely transparent to the query.

sql — Unity Catalog row filter

-- Step 1: Create a filter function
-- Returns TRUE for rows the current user is allowed to see
CREATE OR REPLACE FUNCTION main.security.region_filter(region_col STRING)
RETURN
  is_account_group_member('apac_team') AND region_col = 'APAC'
  OR is_account_group_member('emea_team') AND region_col = 'EMEA'
  OR is_account_group_member('data_admins');  -- admins see all

-- Step 2: Attach the filter to the table
ALTER TABLE main.sales.orders
SET ROW FILTER main.security.region_filter ON (region);

-- Now: APAC team member runs SELECT * FROM main.sales.orders
-- They only see rows where region = 'APAC' — filter applied automatically

Column Masks

Column Masks in Unity Catalog

UC column masks are also SQL functions. They define what value a user sees for a given column — full value, hashed value, null, or a custom transformation — based on their group membership.

sql — Unity Catalog column mask

-- Mask function for the email column
CREATE OR REPLACE FUNCTION main.security.mask_email(email STRING)
RETURN
  CASE
    WHEN is_account_group_member('pii_authorised') THEN email  -- see real value
    ELSE regexp_replace(email, '^(.{2}).*(@.*)', '$1***$2')  -- al***@corp.com
  END;

-- Attach mask to table column
ALTER TABLE main.sales.customers
ALTER COLUMN email
SET MASK main.security.mask_email;

-- Analyst (not in pii_authorised) sees: al***@corp.com
-- PII-authorised user sees: alice@corp.com

Lineage & Audit

Data Lineage and Access Audit

Unity Catalog automatically captures column-level lineage (which source columns produced each target column) and access audit logs (who accessed what, when). Audit logs stream to a Delta table in your account.

sql — query Unity Catalog audit logs

-- Unity Catalog audit logs land in the system catalog
SELECT
  event_time,
  user_identity.email   AS user_email,
  request_params.table  AS table_accessed,
  action_name,
  response.status_code
FROM system.access.audit
WHERE action_name = 'commandSubmit'
  AND event_time >= current_timestamp() - INTERVAL 7 DAYS
ORDER BY event_time DESC
LIMIT 100;

31.3 — DATA SECURITY

Data Security

Data security covers what happens to sensitive data values themselves: masking (replace values at display time), tokenization (replace with reversible tokens), encryption (encrypt values at storage), and secrets management (never hardcode credentials). These complement authorization — authorization controls who can query, data security controls what they see.

🎭

Data Masking

PII Protection ▼

Concept

Static vs Dynamic Masking

Static masking permanently replaces sensitive values in a copy of the dataset (used for dev/test environments — production data never leaves the secure zone). Dynamic masking applies transformations at query time without altering stored data — the same table has different views for different users.

Static Masking

Creates a sanitised copy of data. Used for non-prod environments. Data is permanently changed in the copy.

Dynamic Masking

Original data unchanged. Masking applied at query time based on user identity. Ranger / Unity Catalog handle this.

Static Masking in PySpark

Static Data Masking with PySpark

Use PySpark transformations to create a masked copy of sensitive datasets for development environments. This ensures developers never touch real PII.

python — static masking pipeline

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("MaskingPipeline").getOrCreate()

# Source: production customer table (restricted access)
prod_df = spark.table("prod.customers")

# Static masking transformations
masked_df = prod_df.withColumn(
    "email",
    # Replace everything before @ with ***
    F.regexp_replace("email", r"^[^@]+", "***")

).withColumn(
    "phone",
    # Keep last 4 digits only
    F.concat(F.lit("XXX-XXX-"), F.substring("phone", -4, 4))

).withColumn(
    "ssn",
    F.lit("***-**-****")  # always fully masked

).withColumn(
    "date_of_birth",
    # Only keep year (remove month/day)
    F.year("date_of_birth").cast("string")

).withColumn(
    "credit_card",
    # Show last 4 digits
    F.concat(F.lit("****-****-****-"),
             F.substring(F.regexp_replace("credit_card", "-", ""), -4, 4))
)

# Write to dev environment — safe for developers to use
masked_df.write.format("delta").mode("overwrite").saveAsTable("dev.customers_masked")

print("Masked dataset written to dev environment")

Dynamic Masking in Spark SQL

Dynamic Masking in Spark SQL (without Ranger/UC)

For environments where Ranger or Unity Catalog is not available, you can implement dynamic masking via SQL views that apply masking functions and grant analysts access only to the view — not the base table.

sql — view-based dynamic masking

-- Base table (restricted: only data engineers can access)
-- CREATE TABLE customers (id BIGINT, name STRING, email STRING, salary DOUBLE, ...)

-- Masked view (analysts get access to this view only)
CREATE OR REPLACE VIEW customers_masked AS
SELECT
  id,
  name,
  -- mask email
  regexp_replace(email, '^(.{2}).*(@.*)', '$1***$2') AS email,
  -- salary in bands, not exact
  CASE
    WHEN salary < 500000  THEN '<5L'
    WHEN salary < 1000000 THEN '5L-10L'
    ELSE '>10L'
  END AS salary_band,
  created_date  -- non-sensitive, show as-is
FROM customers;

-- Grant analysts access to the view, NOT the table
GRANT SELECT ON VIEW customers_masked TO analysts;

🎫

Tokenization

Reversible Substitution ▼

Concept

What is Tokenization?

Tokenization replaces a sensitive value (e.g. credit card number 4111-1111-1111-1111) with a non-sensitive token (e.g. TOK-8f3a2b). The mapping is stored in a secure token vault. Authorised systems can detokenize (reverse) the token to retrieve the original value. Unlike hashing, tokenization is reversible.

Tokenization vs Masking vs Hashing: Original PAN: 4111-1111-1111-1111 Masking: ****-****-****-1111 (not reversible, display only) Hashing: sha256 → a3f5d2... (not reversible, good for lookup) Tokenization: TOK-8f3a2b9c (reversible via token vault) Use tokenization when: downstream systems must process/transmit the "card number" but must not see the real number (PCI scope reduction).

PII Tokenization

PII Tokenization in PySpark

Implement a simple hash-based token in PySpark for environments where a full token vault is not available. For production PCI or HIPAA workloads, use a dedicated tokenization service (e.g. Protegrity, Voltage, AWS Payment Cryptography).

python — hash-based tokenization

from pyspark.sql import functions as F
import hashlib

# Option 1: Built-in Spark SHA-2 (fast, no UDF overhead)
df_tokenized = df.withColumn(
    "email_token",
    F.sha2(F.concat(F.col("email"), F.lit("MY_HMAC_SECRET")), 256)
).withColumn(
    "phone_token",
    F.sha2(F.concat(F.col("phone"), F.lit("MY_HMAC_SECRET")), 256)
).drop("email", "phone")  # remove originals

# Option 2: Format-Preserving Tokenization via UDF
# (preserves the "look" of a card number — required for PCI)
import hmac, hashlib, struct

def format_preserving_token(value: str, secret: str) -> str:
    """Generates a token that looks like the original (same length, char type)"""
    h = hmac.new(secret.encode(), value.encode(), hashlib.sha256).digest()
    # Convert first 8 bytes to a 16-digit-like decimal string
    num = struct.unpack('>Q', h[:8])[0] % (10**16)
    return f"{num:016d}"  # 16-digit token

fp_token_udf = F.udf(
    lambda v: format_preserving_token(v, "MY_HMAC_SECRET"),
    "string"
)
df_fp = df.withColumn("card_token", fp_token_udf("card_number"))

🔐

Encryption

At Rest & In Transit ▼

AES in Spark

AES Encryption in Spark (Spark 3.3+)

Spark 3.3+ has built-in aes_encrypt() and aes_decrypt() functions. These allow column-level encryption so that even if someone gains access to the Parquet file, they cannot read the sensitive values without the key.

python — AES column-level encryption

from pyspark.sql import functions as F

# AES key — in production, retrieve from AWS Secrets Manager or Vault
# Key length: 16 (AES-128), 24 (AES-192), or 32 bytes (AES-256)
AES_KEY = "0123456789abcdef"  # 16 bytes = AES-128

# ─── Encryption ───────────────────────────────────────────────
df = spark.createDataFrame([
    (1, "alice@corp.com", "4111111111111111"),
    (2, "bob@corp.com",   "5500005555555559"),
], ["id", "email", "card_number"])

encrypted_df = df.withColumn(
    "email_enc",
    F.aes_encrypt(F.col("email"), F.lit(AES_KEY), F.lit("ECB"))
).withColumn(
    "card_enc",
    F.aes_encrypt(F.col("card_number"), F.lit(AES_KEY), F.lit("ECB"))
).drop("email", "card_number")  # remove plaintext

# Write encrypted data to storage
encrypted_df.write.format("delta").mode("overwrite").save("/data/customers_enc")

# ─── Decryption (authorised process only) ─────────────────────
enc_df = spark.read.format("delta").load("/data/customers_enc")

decrypted_df = enc_df.withColumn(
    "email",
    F.aes_decrypt(F.col("email_enc"), F.lit(AES_KEY), F.lit("ECB")).cast("string")
).withColumn(
    "card_number",
    F.aes_decrypt(F.col("card_enc"), F.lit(AES_KEY), F.lit("ECB")).cast("string")
)

decrypted_df.show(truncate=False)

ECB vs GCM mode: ECB mode is simpler but less secure (identical plaintexts produce identical ciphertext). For production PCI/HIPAA data, use GCM mode with an IV: F.aes_encrypt(col, key, lit("GCM"), lit("DEFAULT"), F.expr("rand_bytes(12)")). GCM also provides authentication (detects tampering).

Encryption at Rest

Encryption at Rest (Storage Layer)

For S3 or HDFS, encryption at rest is typically handled at the storage layer, not in Spark code. This means even if someone copies the raw Parquet files, the bytes are encrypted. S3 SSE-KMS is the standard approach.

python — write to S3 with SSE-KMS encryption

# Spark writes to S3 — S3 encrypts files at rest using KMS key
spark.conf.set(
    "spark.hadoop.fs.s3a.server-side-encryption-algorithm",
    "SSE-KMS"
)
spark.conf.set(
    "spark.hadoop.fs.s3a.server-side-encryption.key",
    "arn:aws:kms:ap-south-1:123456789:key/your-kms-key-id"
)

# Now all writes are automatically encrypted at rest
df.write.parquet("s3a://my-secure-bucket/data/customers/")
# Files on S3 are AES-256 encrypted — unreadable without KMS access

🗝️

Secrets Management

Never Hardcode Credentials ▼

Rule #1

Never hardcode credentials

Hardcoding passwords in Spark code is the #1 security mistake. Credentials in code end up in Git history, Spark UI environment tabs, and application logs. Always retrieve secrets from an external secrets store at runtime.

Anti-pattern — never do this:
spark.conf.set("sfPassword", "MyPassword123")
This appears in Spark UI → Environment tab, Spark event logs, and any error stack traces.

Databricks Secrets

Databricks has a built-in secrets backend. Store credentials using the Databricks CLI, then retrieve them in notebooks/jobs with dbutils.secrets.get(). Secrets are redacted from notebook output automatically.

bash — create Databricks secret

# Create a secret scope
databricks secrets create-scope --scope data-pipeline

# Store the secret
databricks secrets put --scope data-pipeline --key snowflake_password

python — retrieve Databricks secret in PySpark

# In Databricks notebook or job
sf_password = dbutils.secrets.get(scope="data-pipeline", key="snowflake_password")
sf_url      = dbutils.secrets.get(scope="data-pipeline", key="snowflake_url")

snowflake_options = {
    "sfURL":      sf_url,
    "sfUser":     "spark_service_user",
    "sfPassword": sf_password,
    "sfDatabase": "SALES_DB",
    "sfSchema":   "PUBLIC",
    "sfWarehouse":"SPARK_WH",
}

df = (spark.read.format("snowflake")
      .options(**snowflake_options)
      .option("dbtable", "orders")
      .load())

AWS Secrets Manager

AWS Secrets Manager with Spark

In EMR or on-prem Spark, retrieve secrets from AWS Secrets Manager using boto3 at job startup. The IAM role running the job must have secretsmanager:GetSecretValue permission.

python — retrieve secret from AWS Secrets Manager

import boto3, json
from pyspark.sql import SparkSession

def get_secret(secret_name: str, region: str = "ap-south-1") -> dict:
    client = boto3.client("secretsmanager", region_name=region)
    resp = client.get_secret_value(SecretId=secret_name)
    return json.loads(resp["SecretString"])

# Retrieve at startup — runs once on the Driver
db_creds = get_secret("prod/spark/postgres_credentials")

spark = SparkSession.builder.appName("SecureJob").getOrCreate()

# Use credentials — never logged, never in config files
jdbc_df = (spark.read
    .format("jdbc")
    .option("url",      f"jdbc:postgresql://{db_creds['host']}:5432/{db_creds['dbname']}")
    .option("dbtable", "public.orders")
    .option("user",     db_creds["username"])
    .option("password", db_creds["password"])
    .load())

jdbc_df.show()

Azure & HashiCorp Vault

Azure Key Vault and HashiCorp Vault

The same principle applies across clouds and on-prem. Azure Key Vault integrates with Synapse and Databricks on Azure. HashiCorp Vault works everywhere and is popular for multi-cloud or on-prem Spark clusters.

python — HashiCorp Vault with hvac

import hvac  # pip install hvac
import os

def get_vault_secret(path: str) -> dict:
    client = hvac.Client(
        url="https://vault.corp.local:8200",
        token=os.environ["VAULT_TOKEN"]  # injected by orchestrator
    )
    resp = client.secrets.kv.v2.read_secret_version(path=path)
    return resp["data"]["data"]

creds = get_vault_secret("secret/data/spark/snowflake")
# creds = {"sfPassword": "...", "sfUser": "...", ...}

31.4 — NETWORK SECURITY

Network Security

Network security in Spark ensures that cluster traffic never traverses public internet and that only authorised components can communicate. This is enforced through VPC design, private endpoints, security groups, and Kubernetes network policies.

🌐

VPC and Subnet Isolation

AWS / Azure / GCP ▼

Design

VPC Isolation for Spark Clusters

All Spark cluster nodes (Driver and Executors) should run in private subnets — subnets with no direct internet route. Outbound internet access (for package installs) goes through a NAT Gateway. S3 and other AWS service access uses VPC Endpoints to stay on the AWS backbone.

VPC Design for Spark: ┌──────────────────────────────────────────────────────────────┐ │ VPC: 10.0.0.0/16 │ │ │ │ ┌─────────────────────┐ ┌──────────────────────────┐ │ │ │ Private Subnet A │ │ Private Subnet B │ │ │ │ 10.0.1.0/24 │ │ 10.0.2.0/24 │ │ │ │ │ │ │ │ │ │ EMR Master / Driver │ │ EMR Core / Executors │ │ │ │ RDS (metadata DB) │ │ (no public IPs) │ │ │ └──────────┬──────────┘ └────────────────────────────┘ │ │ │ │ │ ┌──────────▼──────────┐ ┌──────────────────────────┐ │ │ │ NAT Gateway │ │ S3 VPC Endpoint (Gateway)│ │ │ │ (outbound internet) │ │ Glue VPC Endpoint (Iface)│ │ │ └──────────────────────┘ │ KMS VPC Endpoint (Iface) │ │ │ └──────────────────────────┘ │ └──────────────────────────────────────────────────────────────┘ Key: Spark → S3 traffic stays inside AWS network (no internet hop) via S3 Gateway Endpoint

Security Groups

Security Groups vs NACLs for Spark

Security Groups act as stateful instance-level firewalls. For Spark, configure them to allow Driver ↔ Executor ports, block all inbound from the internet, and allow outbound only to required services.

text — EMR Spark security group rules

Master Security Group (sg-master):
  Inbound:
    - Port 22 (SSH)       from Bastion Host SG only
    - Port 18080 (Spark History UI) from VPN CIDR only
    - All ports            from Worker SG (sg-worker)
  Outbound:
    - All traffic         to Worker SG (sg-worker)
    - HTTPS (443)         to S3 VPC endpoint, Secrets Manager endpoint
    - Port 443            to KMS endpoint

Worker Security Group (sg-worker):
  Inbound:
    - All ports            from Master SG (sg-master)
    - All ports            from Worker SG itself (executor ↔ executor)
  Outbound:
    - All traffic          to Master SG
    - HTTPS (443)          to S3 VPC endpoint

VPC Endpoints

VPC Endpoints — keeping S3/Glue/KMS traffic private

Without VPC endpoints, Spark → S3 traffic exits your VPC to the public internet, then re-enters AWS. S3 Gateway Endpoint routes S3 traffic directly within AWS at no extra cost. Interface endpoints (Glue, KMS, Secrets Manager) run inside your VPC as private ENIs.

terraform — create S3 VPC Gateway Endpoint

# S3 Gateway Endpoint — free, routes S3 traffic within AWS
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.spark_vpc.id
  service_name      = "com.amazonaws.ap-south-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
  tags = { Name = "spark-s3-endpoint" }
}

# Glue Interface Endpoint — for Spark to reach Glue Catalog privately
resource "aws_vpc_endpoint" "glue" {
  vpc_id              = aws_vpc.spark_vpc.id
  service_name        = "com.amazonaws.ap-south-1.glue"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [aws_subnet.private_a.id]
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}

🔌

Private Endpoints for Spark Clusters

No Public IPs ▼

Concept

Why private endpoints matter

In many regulated industries, cluster nodes must not have public IP addresses. Databricks on AWS supports "No Public IP" (NPIP) mode. EMR supports fully private clusters. Kubernetes Spark runs in private node pools. Access is only through VPN or Direct Connect.

terraform — Databricks workspace with no public IPs

resource "databricks_mws_workspaces" "secure_ws" {
  account_id     = var.databricks_account_id
  workspace_name = "secure-data-platform"

  network_id = databricks_mws_networks.private_net.network_id

  # No public IPs on cluster nodes
  custom_tags = {
    no_public_ip = "true"
  }
}

☸️

Network Policies on Kubernetes

EKS / GKE / AKS ▼

K8s Network Policy

Kubernetes Network Policies for Spark

In Spark on Kubernetes, the Driver Pod and Executor Pods run in the cluster. A NetworkPolicy restricts which pods can talk to which — ensuring Spark pods can only communicate with each other and necessary services (S3, Iceberg catalog), not with other workloads in the cluster.

yaml — Kubernetes NetworkPolicy for Spark

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: spark-network-policy
  namespace: spark-jobs
spec:
  # Apply to all Spark pods (driver and executors)
  podSelector:
    matchLabels:
      spark-role: executor  # also apply separate policy for driver

  policyTypes:
    - Ingress
    - Egress

  ingress:
    # Executors only accept traffic from the Driver pod
    - from:
        - podSelector:
            matchLabels:
              spark-role: driver
      ports:
        - port: 7078   # executor port
        - port: 7079   # blockmanager port

  egress:
    # Allow egress to other Spark pods in same namespace
    - to:
        - namespaceSelector:
            matchLabels:
              name: spark-jobs
    # Allow egress to S3 (via VPC endpoint, port 443)
    - ports:
        - port: 443
          protocol: TCP

YARN Firewall

Firewall Rules for YARN Clusters

For on-prem Spark/YARN, iptables or firewalld rules replace Kubernetes NetworkPolicies. Restrict the key Spark ports so only cluster nodes can communicate internally.

text — key Spark ports to control

Driver port:              4040  (Spark UI — restrict to internal VPN)
History Server:           18080 (restrict to internal VPN)
Executor blockmanager:    Random ephemeral (allow within cluster CIDR)
Thrift Server JDBC:       10000 (allow from app servers only)
YARN ResourceManager UI:  8088  (restrict to ops team)
HDFS NameNode:            8020  (allow from cluster nodes only)
Kerberos KDC:             88    (allow from all cluster nodes)

31.5 — COMPLIANCE FRAMEWORKS

Compliance Frameworks

Compliance frameworks define legal and regulatory requirements for how data must be handled. As a data engineer, you need to understand what each framework demands so you can implement the right Spark patterns. HIPAA, GDPR, and PCI-DSS are the three most commonly encountered.

🏥

HIPAA

Healthcare ▼

What is HIPAA?

HIPAA — Health Insurance Portability and Accountability Act

HIPAA applies to any system handling Protected Health Information (PHI) — data that can identify a patient and relates to their health. Examples: diagnosis codes, treatment records, medical images, insurance claims, patient names+dates. Violations result in massive fines.

PHI Example	DE Implication
Patient name + diagnosis	Must be encrypted at rest and in transit
DOB + ZIP code	Together can identify a patient → treat as PHI
Medical images	De-identify before storing in data lake
Insurance claim amounts	Access-controlled; audit logged

PHI Handling

PHI Handling in Spark Pipelines

De-identify PHI as early as possible in the pipeline — ideally in the Bronze → Silver transformation. De-identified data can be used freely; PHI-containing data must remain in the secure zone.

python — PHI de-identification in PySpark

from pyspark.sql import functions as F

# Bronze layer: raw PHI data (restricted access)
raw_claims = spark.table("bronze.medical_claims")

# De-identification for Silver layer (analysts can access)
deidentified = (raw_claims
    # Remove direct identifiers
    .drop("patient_name", "ssn", "address", "phone", "email")

    # Replace patient_id with one-way hash (pseudonymization)
    .withColumn("patient_hash",
                F.sha2(F.concat("patient_id", F.lit("SALT_HIPAA_2026")), 256))
    .drop("patient_id")

    # Date shift: shift all dates by a random patient-specific offset
    # (so date patterns are preserved but exact dates are obscured)
    .withColumn("service_year", F.year("service_date"))  # keep year only
    .drop("service_date", "dob")

    # Generalise ZIP to first 3 digits only (HIPAA safe harbor rule)
    .withColumn("zip3", F.substring("zip_code", 1, 3))
    .drop("zip_code")
)

deidentified.write.format("delta").mode("overwrite").saveAsTable("silver.claims_deidentified")

Access Controls

Access Controls for PHI

HIPAA's "minimum necessary" standard requires that only the minimum data needed for a task is accessed. In Spark, implement this through column-level grants (Unity Catalog / Ranger) and separate Bronze/Silver zones.

sql — HIPAA access control in Unity Catalog

-- PHI zone: only hipaa_authorised_users can access bronze
GRANT SELECT ON TABLE bronze.medical_claims TO `hipaa_authorised_users`;

-- Analysts can access de-identified silver layer
GRANT SELECT ON TABLE silver.claims_deidentified TO `data_analysts`;

-- Log all access to PHI (Unity Catalog audit)
SELECT user_identity.email, action_name, event_time
FROM system.access.audit
WHERE request_params.table = 'medical_claims'
ORDER BY event_time DESC;

Audit Logging

Audit Logging for HIPAA

HIPAA requires audit logs of who accessed PHI, when, and what they did. Implement structured audit logging in Spark pipelines that writes to an immutable audit table (Delta with no-deletes policy).

python — pipeline audit logging

import datetime
from pyspark.sql import Row

def write_audit_log(spark, pipeline_name, user, table_accessed, row_count, status):
    audit_row = [Row(
        audit_id=str(datetime.datetime.utcnow().timestamp()),
        pipeline_name=pipeline_name,
        accessed_by=user,
        table_accessed=table_accessed,
        access_time=str(datetime.datetime.utcnow()),
        rows_accessed=row_count,
        status=status,
        data_classification="PHI"
    )]
    audit_df = spark.createDataFrame(audit_row)
    # Append-only — HIPAA requires immutable audit trail
    audit_df.write.format("delta").mode("append").saveAsTable("audit.phi_access_log")

# Call after each PHI access
write_audit_log(spark,
    pipeline_name="claims_etl",
    user="spark_svc@corp.local",
    table_accessed="bronze.medical_claims",
    row_count=raw_claims.count(),
    status="SUCCESS"
)

🇪🇺

GDPR

EU Data Privacy ▼

What is GDPR?

GDPR — General Data Protection Regulation

GDPR applies to any processing of EU residents' personal data regardless of where your company is located. Key principles for data engineers: data minimisation, purpose limitation, storage limitation, and the right to erasure ("right to be forgotten").

Right to Erasure

Right to Erasure with Delta Lake / Iceberg

GDPR's right to erasure requires that when a user requests deletion of their data, you can actually delete it from all storage locations — including historical data. Delta Lake MERGE + DELETE + VACUUM and Iceberg row-level deletes make this possible in a Spark-based lakehouse.

python — GDPR right to erasure in Delta Lake

from delta.tables import DeltaTable

# User submits erasure request for user_id = 12345
USER_TO_DELETE = 12345

# Step 1: Delete from all Delta tables containing this user's data
tables_with_pii = [
    "gold.customers",
    "silver.orders",
    "silver.clickstream",
    "bronze.raw_events",
]

for table_name in tables_with_pii:
    dt = DeltaTable.forName(spark, table_name)
    dt.delete(f"user_id = {USER_TO_DELETE}")
    print(f"Deleted user {USER_TO_DELETE} from {table_name}")

# Step 2: VACUUM to remove old Parquet files containing the user's data
# Default retention is 7 days — GDPR erasure must VACUUM after that period
# For immediate erasure, you can override retention (with caution):
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

for table_name in tables_with_pii:
    spark.sql(f"VACUUM {table_name} RETAIN 0 HOURS")
    print(f"Vacuumed old files from {table_name}")

# Step 3: Log the erasure request completion
spark.sql(f"""
    INSERT INTO gdpr.erasure_log VALUES (
      {USER_TO_DELETE}, current_timestamp(), 'COMPLETED', 'All tables cleaned'
    )
""")

Time travel caveat: After DELETE, Delta's time travel lets you still see old data within the retention window. VACUUM with RETAIN 0 HOURS removes historical files immediately. Only do this with careful coordination — you lose the ability to time-travel after VACUUM.

Data Minimisation

Data Minimisation in Pipelines

GDPR requires you only collect and retain data that is necessary for the stated purpose. In Spark pipelines, implement column dropping, purpose-based schemas, and retention policies.

python — data minimisation patterns

# Drop PII fields that are not needed for analytics purpose
ANALYTICS_PURPOSE_COLUMNS = [
    "order_id", "product_id", "quantity",
    "amount", "order_date", "country"
    # NOT: customer_name, email, phone, address
]

analytics_df = raw_orders.select(ANALYTICS_PURPOSE_COLUMNS)
analytics_df.write.saveAsTable("gold.order_analytics")

# Implement retention: delete records older than 2 years
DeltaTable.forName(spark, "silver.events").delete(
    "event_date < date_sub(current_date(), 730)"
)

💳

PCI-DSS

Payment Card ▼

What is PCI-DSS?

PCI-DSS — Payment Card Industry Data Security Standard

PCI-DSS applies to any system that stores, processes, or transmits cardholder data (PANs, CVVs, card expiry dates, PINs). The standard has 12 major requirements. For data engineers, the critical ones are: encryption of stored PANs, tokenization to remove PANs from analytics systems, strict access control, and comprehensive audit logging.

PCI Data Element	What you MUST do
Full PAN (card number)	Encrypt at rest (AES-256) OR tokenize — never store in plaintext
CVV/CVC	Never store after authorization — must delete immediately
Expiry date + cardholder name	Encrypt at rest if stored alongside PAN
PIN blocks	Never store in a data lake — transaction processor only

Encryption Requirements

PCI Encryption Requirements in Spark

PCI Requirement 3 mandates protection of stored cardholder data. AES-256 encryption is the standard. Key management must follow strict controls: keys stored in HSMs or KMS, separation of duties, annual key rotation.

python — PCI-compliant PAN handling in Spark

import boto3, json
from pyspark.sql import functions as F

# Step 1: Retrieve AES-256 key from KMS/Vault (never hardcode)
def get_pci_key():
    client = boto3.client("secretsmanager", region_name="ap-south-1")
    secret = client.get_secret_value(SecretId="pci/pan_encryption_key")
    return json.loads(secret["SecretString"])["aes_key"]

AES_KEY = get_pci_key()  # 32-char string = AES-256

# Step 2: Process payment data — tokenize PAN, drop CVV immediately
payment_df = spark.table("landing.payment_events")

pci_safe_df = (payment_df
    # Tokenize PAN: replace with SHA-256 HMAC token (non-reversible)
    # For reversible: use aes_encrypt with GCM mode
    .withColumn("pan_token",
                F.sha2(F.concat(F.col("card_pan"), F.lit(AES_KEY)), 256))

    # Show only last 4 digits for display/logging
    .withColumn("pan_last4",
                F.substring("card_pan", -4, 4))

    # CVV MUST be dropped immediately (PCI Req 3.2.1)
    .drop("card_pan", "cvv", "pin")
)

# Write to PCI-scoped table (separate catalog/schema with strict ACL)
pci_safe_df.write.format("delta").mode("append").saveAsTable("pci_scope.payment_safe")

Access Logging

PCI Access Logging (Requirement 10)

PCI Requirement 10 mandates that all access to system components and cardholder data must be logged. Log who accessed the PCI-scoped tables, when, from where, and what queries were run. Logs must be retained for 12 months (3 months immediately available).

python — PCI audit log writer

import datetime, socket, os

def pci_audit_log(spark, action, table, user=None, rows_affected=0):
    """Write PCI Requirement 10-compliant audit log entry"""
    if user is None:
        user = os.environ.get("SPARK_USER", "unknown")

    audit_record = [{
        "timestamp":     str(datetime.datetime.utcnow()),
        "user_id":       user,
        "action":        action,   # SELECT / INSERT / DELETE
        "object_type":   "TABLE",
        "object_name":   table,
        "rows_affected": rows_affected,
        "origin_ip":    socket.gethostbyname(socket.gethostname()),
        "success":       True,
        "pci_scope":    True
    }]
    audit_df = spark.createDataFrame(audit_record)
    # Immutable append-only audit table (Delta)
    audit_df.write.format("delta").mode("append").saveAsTable("pci_audit.access_log")

# Usage
pci_audit_log(spark, "SELECT", "pci_scope.payment_safe", rows_affected=50000)

MODULE 31 — SUMMARY

Summary & Quick Reference

Complete reference for Spark Security — the key concepts, commands, and patterns from each section.

📋

Module 31 — Complete Quick Reference

▼

31.1 Authentication

Authentication — Key Facts

kinit = interactive TGT keytab = non-interactive auth --principal + --keytab = spark-submit Kerberos args LDAP auth on Thrift Server via hive-site.xml spark.authenticate=true → RPC encryption spark.ssl.ui.enabled=true → HTTPS UI spark.network.crypto.enabled=true → shuffle encryption keytool = manage Java keystores/truststores

31.2 Authorization

Authorization — Key Facts

Ranger = on-prem policy enforcement (Hive, HDFS, Spark) Ranger row filter = auto-inject WHERE clause Ranger col mask = transform at query time Unity Catalog = Databricks 3-level namespace GRANT / REVOKE SQL statements Row filter = SQL function attached to table Column mask = SQL function on column system.access.audit = UC audit log table

31.3 Data Security

Data Security — Key Facts

Static masking = permanent copy, for dev/test Dynamic masking = query-time, same table, Ranger/UC Tokenization = reversible substitution (PCI) aes_encrypt(col, key, "GCM") = column encryption SSE-KMS = S3 encryption at rest dbutils.secrets.get() = Databricks secrets secretsmanager.get_secret_value() = AWS secrets

31.4 Network

Network Security — Key Facts

Private subnets = no public IPs on cluster nodes S3 Gateway Endpoint = free, keeps S3 traffic in AWS Interface Endpoints = Glue, KMS, Secrets Manager Security Groups = stateful, instance-level firewalls K8s NetworkPolicy = restrict pod-to-pod traffic NAT Gateway = outbound internet for patching only

31.5 Compliance

Compliance — Key Facts

Framework	Scope	Key DE Requirement
HIPAA	Healthcare PHI (US)	De-identify PHI, encrypt at rest, immutable audit log
GDPR	EU residents' personal data	Right to erasure (Delta DELETE+VACUUM), data minimisation
PCI-DSS	Payment card data	Tokenize PAN, drop CVV immediately, AES-256 if stored, audit Req 10

Interview Prep