MODULE 31 Spark Security
1 / 7 sections
MODULE 31 β€” OVERVIEW

Spark Security

Security in Spark is a layered discipline: who you are (Authentication), what you can do (Authorization), how your data is protected (Encryption & Masking), where traffic flows (Network), and which regulations apply (Compliance). This module covers every layer with real configuration and code.

πŸͺͺ
Authentication
Kerberos, LDAP, SSL/TLS β€” prove identity before any cluster access is granted.
πŸ›‘οΈ
Authorization
Apache Ranger and Unity Catalog enforce who can read which tables, columns, and rows.
πŸ”’
Data Security
Masking, tokenization, AES column-level encryption β€” protect sensitive data at rest and in transit.
🌐
Network & Compliance
VPCs, private endpoints, HIPAA / GDPR / PCI-DSS β€” secure Spark at the infra and regulatory level.
MODULE 31 β€” WHAT YOU WILL LEARN 31.1 Authentication β†’ Kerberos (kinit, keytab), LDAP, SSL/TLS 31.2 Authorization β†’ Apache Ranger (policies, row-filter, col-mask), Unity Catalog 31.3 Data Security β†’ Masking, Tokenization, AES Encryption, Secrets Management 31.4 Network Security β†’ VPC isolation, private endpoints, network policies, Kubernetes 31.5 Compliance β†’ HIPAA (PHI, audit), GDPR (right to erasure), PCI-DSS (encryption)
Interview tip: Security questions appear regularly in senior data engineer interviews. Know the difference between Authentication vs Authorization, how Ranger column masking works, how to enable RPC encryption, and what GDPR "right to erasure" means for Delta/Iceberg.
31.1 β€” AUTHENTICATION

Authentication

Authentication = proving who you are. Spark supports three main authentication mechanisms: Kerberos (enterprise Hadoop clusters), LDAP (user directory integration), and SSL/TLS (encrypted RPC and UI). You configure these at the cluster and SparkSession level.

🎫
Kerberos
Enterprise Auth β–Ό
Concept
What is Kerberos?
Kerberos is a ticket-based authentication protocol used in enterprise Hadoop environments. Instead of passwords being sent over the wire, a Key Distribution Center (KDC) issues encrypted tickets. Spark uses Kerberos when deployed on Kerberized HDFS/YARN clusters.
Kerberos Flow: User / Service β”‚ β–Ό (1) Request Ticket Granting Ticket (TGT) KDC (Key Distribution Center) β”‚ β–Ό (2) Returns encrypted TGT (valid for N hours) kinit stores TGT in local credential cache β”‚ β–Ό (3) Spark uses TGT to request Service Ticket for HDFS/YARN HDFS / YARN accepts Service Ticket β†’ grants access
kinit
kinit β€” obtaining a Kerberos ticket interactively
kinit is the CLI tool that authenticates a user against the KDC and stores the Ticket Granting Ticket (TGT) in the local credential cache. You run it before submitting Spark jobs in a Kerberized cluster.
bash
# Authenticate interactively (prompts for password)
kinit username@REALM.COM

# Check your current tickets
klist

# Example output:
# Ticket cache: FILE:/tmp/krb5cc_1000
# Default principal: spark_user@CORP.LOCAL
#
# Valid starting       Expires              Service principal
# 06/15/2026 09:00:00  06/15/2026 19:00:00  krbtgt/CORP.LOCAL@CORP.LOCAL

# Destroy tickets when done
kdestroy
keytab
keytab β€” non-interactive authentication for scheduled jobs
A keytab is an encrypted file containing the service principal's key. It allows Spark jobs to authenticate with Kerberos without human interaction β€” essential for Airflow-triggered or scheduled Spark jobs. The keytab is created by the Kerberos admin.
bash
# Authenticate using a keytab (no password prompt)
kinit -kt /etc/spark/spark.keytab spark_svc@CORP.LOCAL

# Verify
klist
Spark Config
Configuring Spark with Kerberos
Pass the principal and keytab to spark-submit. Spark's credential renewal thread automatically refreshes tokens before they expire during long-running jobs.
bash β€” spark-submit
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal spark_svc@CORP.LOCAL \
  --keytab /etc/spark/spark.keytab \
  --conf spark.yarn.access.hadoopFileSystems=hdfs://namenode:8020 \
  my_job.py
python β€” SparkSession with Kerberos
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("KerberosApp")
    .config("spark.yarn.principal", "spark_svc@CORP.LOCAL")
    .config("spark.yarn.keytab", "/etc/spark/spark.keytab")
    # Token renewal interval (default 75% of token lifetime)
    .config("spark.kerberos.access.hadoopFileSystems", "hdfs://namenode:8020")
    .getOrCreate())

# Spark automatically renews delegation tokens in background
df = spark.read.parquet("hdfs://namenode:8020/data/sales/")
df.show()
Token Renewal: Long-running Spark Streaming jobs must have token renewal configured. Without it, HDFS delegation tokens expire mid-job and the job fails. Set spark.kerberos.renewal.credentials = keytab to enable automatic renewal.
πŸ“‹
LDAP
Directory Auth β–Ό
Concept
LDAP Integration with Spark
LDAP (Lightweight Directory Access Protocol) is a directory service protocol used to authenticate users against corporate directories (Active Directory, OpenLDAP). Spark's Thrift Server and Spark History Server support LDAP authentication so that only authorised users can query SQL or view job history.
Where LDAP is used in Spark: Spark itself does not have a built-in LDAP auth layer for the driver. LDAP is configured on Spark Thrift Server (HiveServer2-compatible SQL endpoint) and Spark History Server. Cluster managers (YARN, Kubernetes) handle broader LDAP integration.
Config
Enabling LDAP on Spark Thrift Server
Add these properties to hive-site.xml or pass as Spark configs. The Thrift Server will then require users to authenticate via your LDAP directory before executing SQL queries.
xml β€” hive-site.xml
<property>
  <name>hive.server2.authentication</name>
  <value>LDAP</value>
</property>

<property>
  <name>hive.server2.authentication.ldap.url</name>
  <value>ldap://ldap.corp.local:389</value>
</property>

<property>
  <name>hive.server2.authentication.ldap.baseDN</name>
  <value>dc=corp,dc=local</value>
</property>

<property>
  <name>hive.server2.authentication.ldap.Domain</name>
  <value>corp.local</value>
</property>
bash β€” connecting beeline to LDAP-protected Thrift Server
beeline -u "jdbc:hive2://spark-thrift:10000/default;AuthMech=3" \
  -n alice@corp.local \
  -p MySecret123
History Server
LDAP Auth on Spark History Server
Protect the Spark History Server so only authorised users can view job DAGs and logs.
conf β€” spark-defaults.conf
spark.history.ui.acls.enable=true
spark.history.ui.admin.acls=spark-admins
spark.ui.filters=org.apache.hadoop.security.authentication.server.AuthenticationFilter
spark.ui.filters.org.apache.hadoop.security.authentication.server.AuthenticationFilter.params=\
  type=ldap,\
  ldap.url=ldap://ldap.corp.local:389,\
  ldap.base.dn=dc=corp,dc=local
πŸ”‘
SSL/TLS
Encryption in Transit β–Ό
Concept
What SSL/TLS secures in Spark
SSL/TLS in Spark secures three channels: the Spark UI / History Server (HTTPS), RPC communication between Driver ↔ Executors (data shuffle, block transfers), and the Thrift Server (JDBC connections). Without TLS, shuffle data and UI credentials travel in plaintext across the network.
🌐 Spark UI (HTTPS)
Protects the web UI showing job DAGs, environment variables (which may expose secrets).
πŸ“‘ RPC Encryption
Encrypts Driver ↔ Executor communication β€” task data, shuffle blocks, broadcasts.
Enabling SSL
Enabling SSL for Spark UI
You need a Java keystore containing the server certificate. Generate one with keytool, then point Spark at it.
bash β€” generate keystore
# Generate a self-signed certificate keystore
keytool -genkeypair \
  -alias spark \
  -keyalg RSA \
  -keysize 2048 \
  -keystore /etc/spark/spark-keystore.jks \
  -storepass changeit \
  -validity 365 \
  -dname "CN=spark.corp.local, OU=Data, O=Corp, L=City, ST=State, C=IN"
conf β€” spark-defaults.conf β€” Spark UI SSL
# Enable HTTPS for Spark UI
spark.ui.enabled=true
spark.ssl.ui.enabled=true
spark.ssl.ui.keyStore=/etc/spark/spark-keystore.jks
spark.ssl.ui.keyStorePassword=changeit
spark.ssl.ui.keyPassword=changeit
spark.ssl.ui.protocol=TLSv1.3

# Enable HTTPS for History Server
spark.ssl.historyServer.enabled=true
spark.ssl.historyServer.keyStore=/etc/spark/spark-keystore.jks
spark.ssl.historyServer.keyStorePassword=changeit
RPC Encryption
RPC Encryption β€” encrypting Driver ↔ Executor traffic
RPC encryption encrypts all messages between the Driver and Executors β€” including task submissions, shuffle data, and heartbeats. This is critical in shared clusters where network traffic could be intercepted.
conf β€” spark-defaults.conf β€” RPC encryption
# Enable RPC (driver-executor) encryption
spark.authenticate=true
spark.authenticate.secret=your-strong-shared-secret-here

# Encrypt data in transit for all Spark RPC
spark.network.crypto.enabled=true
spark.network.crypto.keyLength=256
spark.network.crypto.keyFactoryAlgorithm=PBKDF2WithHmacSHA256

# Encrypt shuffle data on disk (when spilling)
spark.io.encryption.enabled=true
spark.io.encryption.keySizeBits=256
Performance note: RPC encryption adds CPU overhead (typically 5–15% throughput reduction on shuffle-heavy jobs). It's non-negotiable in regulated environments (HIPAA, PCI) but optional in isolated VPCs where network-level security is sufficient.
Certificate Management
Certificate Management
In production, use CA-signed certificates rather than self-signed ones. Distribute the CA's truststore to all cluster nodes so they can verify each other's identity. Automate renewal with tools like cert-manager (Kubernetes) or HashiCorp Vault PKI.
bash β€” import CA certificate into truststore
# Import your corporate CA certificate into a Java truststore
keytool -importcert \
  -file /etc/ssl/certs/corp-ca.crt \
  -alias corp-ca \
  -keystore /etc/spark/spark-truststore.jks \
  -storepass trustpass \
  -noprompt

# Reference in spark config
# spark.ssl.ui.trustStore=/etc/spark/spark-truststore.jks
# spark.ssl.ui.trustStorePassword=trustpass
31.2 β€” AUTHORIZATION

Authorization

Authorization = controlling what an authenticated user can do. After Kerberos proves you are Alice, authorization decides: can Alice read table X? Can she see column salary? Apache Ranger and Unity Catalog are the two dominant authorization layers in enterprise Spark.

🏹
Apache Ranger
On-Prem / YARN β–Ό
Architecture
Ranger Architecture
Apache Ranger provides centralized security administration for the Hadoop ecosystem. It has a web UI to define fine-grained policies, and plugins installed on each service (Hive, HDFS, Spark SQL) that intercept and enforce those policies at runtime.
Ranger Architecture: β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Ranger Admin (Web UI + REST API) β”‚ β”‚ Define policies: who can access what, when, how β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ sync policies (every 30s) β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β–Ό β–Ό β–Ό Hive Plugin HDFS Plugin Spark SQL Plugin (intercepts (intercepts (intercepts SQL HiveServer2 HDFS ops) on Thrift Server) queries) Every query passes through the plugin β†’ policy check β†’ allow/deny
Policies
Ranger Policies for Spark SQL
Ranger policies are defined in the Admin UI and specify: Resource (database, table, column), Users/Groups, and Permissions (select, insert, update, drop, create). The Spark/Hive plugin enforces these at query execution time.
json β€” example Ranger policy (REST API payload)
{
  "name": "sales_analysts_read_policy",
  "service": "hive_spark_service",
  "resources": {
    "database": {"values": ["sales_db"]},
    "table":    {"values": ["orders", "customers"]},
    "column":   {"values": ["*"]}   // all columns
  },
  "policyItems": [{
    "groups":      ["sales_analysts"],
    "accesses":    [{"type": "select", "isAllowed": true}]
  }]
}
In practice: Ranger policies are usually managed via the web UI, not the API. Understand the UI concepts: Resource hierarchy (database β†’ table β†’ column), Policy conditions (IP range, time-based), and Masking policies (for column obfuscation).
Hive Plugin
Hive Plugin with Ranger
The Hive plugin intercepts all Spark SQL / HiveServer2 queries. Every SELECT, INSERT, DROP is checked against Ranger policies before execution. Unauthorized queries are denied with an error.
bash β€” install Ranger Hive plugin
# Unpack the ranger-hive-plugin archive
cd /opt/ranger-hive-plugin-2.4.0
./enable-hive-plugin.sh

# This copies the plugin JAR into HiveServer2's classpath
# and adds ranger-plugin-audit.xml, ranger-hive-security.xml
# Restart HiveServer2 / Spark Thrift Server after
HDFS Plugin
HDFS Plugin with Ranger
The HDFS plugin enforces path-level access control. Even if a Spark job tries to bypass Hive/SQL and read HDFS directly, Ranger HDFS policies block unauthorized path access.
json β€” HDFS Ranger policy
{
  "name": "data_lake_read_policy",
  "service": "hdfs_service",
  "resources": {
    "path": {"values": ["/data/gold/*"], "recursive": true}
  },
  "policyItems": [{
    "groups": ["data_scientists"],
    "accesses": [
      {"type": "read",    "isAllowed": true},
      {"type": "execute", "isAllowed": true}
    ]
  }]
}
Row-Level Filtering
Row-Level Filtering in Ranger
Ranger can automatically inject a WHERE clause into any query based on the user's identity. For example, a regional manager in APAC only ever sees rows where region = 'APAC' β€” the filter is applied transparently, without any changes to the application SQL.
Ranger UI β€” Row Filter Policy (conceptual)
Policy Name: apac_region_filter
Table: sales_db.orders
For group: apac_managers
Row filter expression: region = 'APAC'

Effect: any query by apac_managers against sales_db.orders
automatically becomes:
  SELECT ... FROM sales_db.orders WHERE region = 'APAC'
Column Masking
Column Masking with Ranger
Ranger's data masking policies transform sensitive column values at query time. The actual data in storage is not changed β€” Ranger applies masking functions as the data is returned to the user. Masking types include: Redact, Hash, Nullify, Show first/last N chars, Date year-only.
Ranger UI β€” Column Masking Policy (conceptual)
Policy: pii_masking_policy
Table: customers
Column: email
For groups: analysts (non-PII-authorised)
Masking Type: Hash (SHA256)

Column: phone
Masking Type: Redact  β†’  output: XXXXXXXXXX

Column: credit_card
Masking Type: Show last 4  β†’  output: ****-****-****-1234

-- What analysts actually see when they run SELECT *:
-- email:       a3f5d2...  (SHA256 hash)
-- phone:       XXXXXXXXXX
-- credit_card: ****-****-****-1234
πŸ—‚οΈ
Unity Catalog Security
Databricks β–Ό
Overview
Unity Catalog β€” Unified Governance for Databricks
Unity Catalog (UC) is Databricks' unified governance layer that manages access to all data objects in a single metastore. It supports table/view/volume/function grants, row filters, column masks, and built-in data lineage β€” all in one place without needing Ranger or a separate tool.
Unity Catalog β€” 3-Level Namespace: catalog └── schema (database) └── table / view / volume / function Example: main.sales.orders ──── ───── ────── catalog schema table
Permissions
Workspace / Catalog / Table-Level Permissions
UC uses SQL GRANT / REVOKE statements. Permissions are inherited: granting USE CATALOG β†’ USE SCHEMA β†’ SELECT on table gives full read access.
sql β€” Unity Catalog GRANT statements
-- Grant access to a catalog
GRANT USE CATALOG ON CATALOG main TO `analysts@corp.com`;

-- Grant access to a schema
GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts@corp.com`;

-- Grant SELECT on a specific table
GRANT SELECT ON TABLE main.sales.orders TO `analysts@corp.com`;

-- Grant to a group (preferred)
GRANT SELECT ON TABLE main.sales.orders TO `data_analysts`;

-- Revoke
REVOKE SELECT ON TABLE main.sales.orders FROM `contractors`;

-- View grants on a table
SHOW GRANTS ON TABLE main.sales.orders;
Row Filters
Row Filters in Unity Catalog
UC row filters are implemented as SQL functions attached to a table. When any user queries the table, the filter function is evaluated and only matching rows are returned β€” entirely transparent to the query.
sql β€” Unity Catalog row filter
-- Step 1: Create a filter function
-- Returns TRUE for rows the current user is allowed to see
CREATE OR REPLACE FUNCTION main.security.region_filter(region_col STRING)
RETURN
  is_account_group_member('apac_team') AND region_col = 'APAC'
  OR is_account_group_member('emea_team') AND region_col = 'EMEA'
  OR is_account_group_member('data_admins');  -- admins see all

-- Step 2: Attach the filter to the table
ALTER TABLE main.sales.orders
SET ROW FILTER main.security.region_filter ON (region);

-- Now: APAC team member runs SELECT * FROM main.sales.orders
-- They only see rows where region = 'APAC' β€” filter applied automatically
Column Masks
Column Masks in Unity Catalog
UC column masks are also SQL functions. They define what value a user sees for a given column β€” full value, hashed value, null, or a custom transformation β€” based on their group membership.
sql β€” Unity Catalog column mask
-- Mask function for the email column
CREATE OR REPLACE FUNCTION main.security.mask_email(email STRING)
RETURN
  CASE
    WHEN is_account_group_member('pii_authorised') THEN email  -- see real value
    ELSE regexp_replace(email, '^(.{2}).*(@.*)', '$1***$2')  -- al***@corp.com
  END;

-- Attach mask to table column
ALTER TABLE main.sales.customers
ALTER COLUMN email
SET MASK main.security.mask_email;

-- Analyst (not in pii_authorised) sees: al***@corp.com
-- PII-authorised user sees: alice@corp.com
Lineage & Audit
Data Lineage and Access Audit
Unity Catalog automatically captures column-level lineage (which source columns produced each target column) and access audit logs (who accessed what, when). Audit logs stream to a Delta table in your account.
sql β€” query Unity Catalog audit logs
-- Unity Catalog audit logs land in the system catalog
SELECT
  event_time,
  user_identity.email   AS user_email,
  request_params.table  AS table_accessed,
  action_name,
  response.status_code
FROM system.access.audit
WHERE action_name = 'commandSubmit'
  AND event_time >= current_timestamp() - INTERVAL 7 DAYS
ORDER BY event_time DESC
LIMIT 100;
31.3 β€” DATA SECURITY

Data Security

Data security covers what happens to sensitive data values themselves: masking (replace values at display time), tokenization (replace with reversible tokens), encryption (encrypt values at storage), and secrets management (never hardcode credentials). These complement authorization β€” authorization controls who can query, data security controls what they see.

🎭
Data Masking
PII Protection β–Ό
Concept
Static vs Dynamic Masking
Static masking permanently replaces sensitive values in a copy of the dataset (used for dev/test environments β€” production data never leaves the secure zone). Dynamic masking applies transformations at query time without altering stored data β€” the same table has different views for different users.
Static Masking
Creates a sanitised copy of data. Used for non-prod environments. Data is permanently changed in the copy.
Dynamic Masking
Original data unchanged. Masking applied at query time based on user identity. Ranger / Unity Catalog handle this.
Static Masking in PySpark
Static Data Masking with PySpark
Use PySpark transformations to create a masked copy of sensitive datasets for development environments. This ensures developers never touch real PII.
python β€” static masking pipeline
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("MaskingPipeline").getOrCreate()

# Source: production customer table (restricted access)
prod_df = spark.table("prod.customers")

# Static masking transformations
masked_df = prod_df.withColumn(
    "email",
    # Replace everything before @ with ***
    F.regexp_replace("email", r"^[^@]+", "***")

).withColumn(
    "phone",
    # Keep last 4 digits only
    F.concat(F.lit("XXX-XXX-"), F.substring("phone", -4, 4))

).withColumn(
    "ssn",
    F.lit("***-**-****")  # always fully masked

).withColumn(
    "date_of_birth",
    # Only keep year (remove month/day)
    F.year("date_of_birth").cast("string")

).withColumn(
    "credit_card",
    # Show last 4 digits
    F.concat(F.lit("****-****-****-"),
             F.substring(F.regexp_replace("credit_card", "-", ""), -4, 4))
)

# Write to dev environment β€” safe for developers to use
masked_df.write.format("delta").mode("overwrite").saveAsTable("dev.customers_masked")

print("Masked dataset written to dev environment")
Dynamic Masking in Spark SQL
Dynamic Masking in Spark SQL (without Ranger/UC)
For environments where Ranger or Unity Catalog is not available, you can implement dynamic masking via SQL views that apply masking functions and grant analysts access only to the view β€” not the base table.
sql β€” view-based dynamic masking
-- Base table (restricted: only data engineers can access)
-- CREATE TABLE customers (id BIGINT, name STRING, email STRING, salary DOUBLE, ...)

-- Masked view (analysts get access to this view only)
CREATE OR REPLACE VIEW customers_masked AS
SELECT
  id,
  name,
  -- mask email
  regexp_replace(email, '^(.{2}).*(@.*)', '$1***$2') AS email,
  -- salary in bands, not exact
  CASE
    WHEN salary < 500000  THEN '<5L'
    WHEN salary < 1000000 THEN '5L-10L'
    ELSE '>10L'
  END AS salary_band,
  created_date  -- non-sensitive, show as-is
FROM customers;

-- Grant analysts access to the view, NOT the table
GRANT SELECT ON VIEW customers_masked TO analysts;
🎫
Tokenization
Reversible Substitution β–Ό
Concept
What is Tokenization?
Tokenization replaces a sensitive value (e.g. credit card number 4111-1111-1111-1111) with a non-sensitive token (e.g. TOK-8f3a2b). The mapping is stored in a secure token vault. Authorised systems can detokenize (reverse) the token to retrieve the original value. Unlike hashing, tokenization is reversible.
Tokenization vs Masking vs Hashing: Original PAN: 4111-1111-1111-1111 Masking: ****-****-****-1111 (not reversible, display only) Hashing: sha256 β†’ a3f5d2... (not reversible, good for lookup) Tokenization: TOK-8f3a2b9c (reversible via token vault) Use tokenization when: downstream systems must process/transmit the "card number" but must not see the real number (PCI scope reduction).
PII Tokenization
PII Tokenization in PySpark
Implement a simple hash-based token in PySpark for environments where a full token vault is not available. For production PCI or HIPAA workloads, use a dedicated tokenization service (e.g. Protegrity, Voltage, AWS Payment Cryptography).
python β€” hash-based tokenization
from pyspark.sql import functions as F
import hashlib

# Option 1: Built-in Spark SHA-2 (fast, no UDF overhead)
df_tokenized = df.withColumn(
    "email_token",
    F.sha2(F.concat(F.col("email"), F.lit("MY_HMAC_SECRET")), 256)
).withColumn(
    "phone_token",
    F.sha2(F.concat(F.col("phone"), F.lit("MY_HMAC_SECRET")), 256)
).drop("email", "phone")  # remove originals

# Option 2: Format-Preserving Tokenization via UDF
# (preserves the "look" of a card number β€” required for PCI)
import hmac, hashlib, struct

def format_preserving_token(value: str, secret: str) -> str:
    """Generates a token that looks like the original (same length, char type)"""
    h = hmac.new(secret.encode(), value.encode(), hashlib.sha256).digest()
    # Convert first 8 bytes to a 16-digit-like decimal string
    num = struct.unpack('>Q', h[:8])[0] % (10**16)
    return f"{num:016d}"  # 16-digit token

fp_token_udf = F.udf(
    lambda v: format_preserving_token(v, "MY_HMAC_SECRET"),
    "string"
)
df_fp = df.withColumn("card_token", fp_token_udf("card_number"))
πŸ”
Encryption
At Rest & In Transit β–Ό
AES in Spark
AES Encryption in Spark (Spark 3.3+)
Spark 3.3+ has built-in aes_encrypt() and aes_decrypt() functions. These allow column-level encryption so that even if someone gains access to the Parquet file, they cannot read the sensitive values without the key.
python β€” AES column-level encryption
from pyspark.sql import functions as F

# AES key β€” in production, retrieve from AWS Secrets Manager or Vault
# Key length: 16 (AES-128), 24 (AES-192), or 32 bytes (AES-256)
AES_KEY = "0123456789abcdef"  # 16 bytes = AES-128

# ─── Encryption ───────────────────────────────────────────────
df = spark.createDataFrame([
    (1, "alice@corp.com", "4111111111111111"),
    (2, "bob@corp.com",   "5500005555555559"),
], ["id", "email", "card_number"])

encrypted_df = df.withColumn(
    "email_enc",
    F.aes_encrypt(F.col("email"), F.lit(AES_KEY), F.lit("ECB"))
).withColumn(
    "card_enc",
    F.aes_encrypt(F.col("card_number"), F.lit(AES_KEY), F.lit("ECB"))
).drop("email", "card_number")  # remove plaintext

# Write encrypted data to storage
encrypted_df.write.format("delta").mode("overwrite").save("/data/customers_enc")

# ─── Decryption (authorised process only) ─────────────────────
enc_df = spark.read.format("delta").load("/data/customers_enc")

decrypted_df = enc_df.withColumn(
    "email",
    F.aes_decrypt(F.col("email_enc"), F.lit(AES_KEY), F.lit("ECB")).cast("string")
).withColumn(
    "card_number",
    F.aes_decrypt(F.col("card_enc"), F.lit(AES_KEY), F.lit("ECB")).cast("string")
)

decrypted_df.show(truncate=False)
ECB vs GCM mode: ECB mode is simpler but less secure (identical plaintexts produce identical ciphertext). For production PCI/HIPAA data, use GCM mode with an IV: F.aes_encrypt(col, key, lit("GCM"), lit("DEFAULT"), F.expr("rand_bytes(12)")). GCM also provides authentication (detects tampering).
Encryption at Rest
Encryption at Rest (Storage Layer)
For S3 or HDFS, encryption at rest is typically handled at the storage layer, not in Spark code. This means even if someone copies the raw Parquet files, the bytes are encrypted. S3 SSE-KMS is the standard approach.
python β€” write to S3 with SSE-KMS encryption
# Spark writes to S3 β€” S3 encrypts files at rest using KMS key
spark.conf.set(
    "spark.hadoop.fs.s3a.server-side-encryption-algorithm",
    "SSE-KMS"
)
spark.conf.set(
    "spark.hadoop.fs.s3a.server-side-encryption.key",
    "arn:aws:kms:ap-south-1:123456789:key/your-kms-key-id"
)

# Now all writes are automatically encrypted at rest
df.write.parquet("s3a://my-secure-bucket/data/customers/")
# Files on S3 are AES-256 encrypted β€” unreadable without KMS access
πŸ—οΈ
Secrets Management
Never Hardcode Credentials β–Ό
Rule #1
Never hardcode credentials
Hardcoding passwords in Spark code is the #1 security mistake. Credentials in code end up in Git history, Spark UI environment tabs, and application logs. Always retrieve secrets from an external secrets store at runtime.
Anti-pattern β€” never do this:
spark.conf.set("sfPassword", "MyPassword123")
This appears in Spark UI β†’ Environment tab, Spark event logs, and any error stack traces.
Databricks Secrets
Databricks Secrets
Databricks has a built-in secrets backend. Store credentials using the Databricks CLI, then retrieve them in notebooks/jobs with dbutils.secrets.get(). Secrets are redacted from notebook output automatically.
bash β€” create Databricks secret
# Create a secret scope
databricks secrets create-scope --scope data-pipeline

# Store the secret
databricks secrets put --scope data-pipeline --key snowflake_password
python β€” retrieve Databricks secret in PySpark
# In Databricks notebook or job
sf_password = dbutils.secrets.get(scope="data-pipeline", key="snowflake_password")
sf_url      = dbutils.secrets.get(scope="data-pipeline", key="snowflake_url")

snowflake_options = {
    "sfURL":      sf_url,
    "sfUser":     "spark_service_user",
    "sfPassword": sf_password,
    "sfDatabase": "SALES_DB",
    "sfSchema":   "PUBLIC",
    "sfWarehouse":"SPARK_WH",
}

df = (spark.read.format("snowflake")
      .options(**snowflake_options)
      .option("dbtable", "orders")
      .load())
AWS Secrets Manager
AWS Secrets Manager with Spark
In EMR or on-prem Spark, retrieve secrets from AWS Secrets Manager using boto3 at job startup. The IAM role running the job must have secretsmanager:GetSecretValue permission.
python β€” retrieve secret from AWS Secrets Manager
import boto3, json
from pyspark.sql import SparkSession

def get_secret(secret_name: str, region: str = "ap-south-1") -> dict:
    client = boto3.client("secretsmanager", region_name=region)
    resp = client.get_secret_value(SecretId=secret_name)
    return json.loads(resp["SecretString"])

# Retrieve at startup β€” runs once on the Driver
db_creds = get_secret("prod/spark/postgres_credentials")

spark = SparkSession.builder.appName("SecureJob").getOrCreate()

# Use credentials β€” never logged, never in config files
jdbc_df = (spark.read
    .format("jdbc")
    .option("url",      f"jdbc:postgresql://{db_creds['host']}:5432/{db_creds['dbname']}")
    .option("dbtable", "public.orders")
    .option("user",     db_creds["username"])
    .option("password", db_creds["password"])
    .load())

jdbc_df.show()
Azure & HashiCorp Vault
Azure Key Vault and HashiCorp Vault
The same principle applies across clouds and on-prem. Azure Key Vault integrates with Synapse and Databricks on Azure. HashiCorp Vault works everywhere and is popular for multi-cloud or on-prem Spark clusters.
python β€” HashiCorp Vault with hvac
import hvac  # pip install hvac
import os

def get_vault_secret(path: str) -> dict:
    client = hvac.Client(
        url="https://vault.corp.local:8200",
        token=os.environ["VAULT_TOKEN"]  # injected by orchestrator
    )
    resp = client.secrets.kv.v2.read_secret_version(path=path)
    return resp["data"]["data"]

creds = get_vault_secret("secret/data/spark/snowflake")
# creds = {"sfPassword": "...", "sfUser": "...", ...}
31.4 β€” NETWORK SECURITY

Network Security

Network security in Spark ensures that cluster traffic never traverses public internet and that only authorised components can communicate. This is enforced through VPC design, private endpoints, security groups, and Kubernetes network policies.

🌐
VPC and Subnet Isolation
AWS / Azure / GCP β–Ό
Design
VPC Isolation for Spark Clusters
All Spark cluster nodes (Driver and Executors) should run in private subnets β€” subnets with no direct internet route. Outbound internet access (for package installs) goes through a NAT Gateway. S3 and other AWS service access uses VPC Endpoints to stay on the AWS backbone.
VPC Design for Spark: β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ VPC: 10.0.0.0/16 β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ Private Subnet A β”‚ β”‚ Private Subnet B β”‚ β”‚ β”‚ β”‚ 10.0.1.0/24 β”‚ β”‚ 10.0.2.0/24 β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ EMR Master / Driver β”‚ β”‚ EMR Core / Executors β”‚ β”‚ β”‚ β”‚ RDS (metadata DB) β”‚ β”‚ (no public IPs) β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ β”‚ β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚ β”‚ NAT Gateway β”‚ β”‚ S3 VPC Endpoint (Gateway)β”‚ β”‚ β”‚ β”‚ (outbound internet) β”‚ β”‚ Glue VPC Endpoint (Iface)β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ KMS VPC Endpoint (Iface) β”‚ β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ Key: Spark β†’ S3 traffic stays inside AWS network (no internet hop) via S3 Gateway Endpoint
Security Groups
Security Groups vs NACLs for Spark
Security Groups act as stateful instance-level firewalls. For Spark, configure them to allow Driver ↔ Executor ports, block all inbound from the internet, and allow outbound only to required services.
text β€” EMR Spark security group rules
Master Security Group (sg-master):
  Inbound:
    - Port 22 (SSH)       from Bastion Host SG only
    - Port 18080 (Spark History UI) from VPN CIDR only
    - All ports            from Worker SG (sg-worker)
  Outbound:
    - All traffic         to Worker SG (sg-worker)
    - HTTPS (443)         to S3 VPC endpoint, Secrets Manager endpoint
    - Port 443            to KMS endpoint

Worker Security Group (sg-worker):
  Inbound:
    - All ports            from Master SG (sg-master)
    - All ports            from Worker SG itself (executor ↔ executor)
  Outbound:
    - All traffic          to Master SG
    - HTTPS (443)          to S3 VPC endpoint
VPC Endpoints
VPC Endpoints β€” keeping S3/Glue/KMS traffic private
Without VPC endpoints, Spark β†’ S3 traffic exits your VPC to the public internet, then re-enters AWS. S3 Gateway Endpoint routes S3 traffic directly within AWS at no extra cost. Interface endpoints (Glue, KMS, Secrets Manager) run inside your VPC as private ENIs.
terraform β€” create S3 VPC Gateway Endpoint
# S3 Gateway Endpoint β€” free, routes S3 traffic within AWS
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.spark_vpc.id
  service_name      = "com.amazonaws.ap-south-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
  tags = { Name = "spark-s3-endpoint" }
}

# Glue Interface Endpoint β€” for Spark to reach Glue Catalog privately
resource "aws_vpc_endpoint" "glue" {
  vpc_id              = aws_vpc.spark_vpc.id
  service_name        = "com.amazonaws.ap-south-1.glue"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [aws_subnet.private_a.id]
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}
πŸ”Œ
Private Endpoints for Spark Clusters
No Public IPs β–Ό
Concept
Why private endpoints matter
In many regulated industries, cluster nodes must not have public IP addresses. Databricks on AWS supports "No Public IP" (NPIP) mode. EMR supports fully private clusters. Kubernetes Spark runs in private node pools. Access is only through VPN or Direct Connect.
terraform β€” Databricks workspace with no public IPs
resource "databricks_mws_workspaces" "secure_ws" {
  account_id     = var.databricks_account_id
  workspace_name = "secure-data-platform"

  network_id = databricks_mws_networks.private_net.network_id

  # No public IPs on cluster nodes
  custom_tags = {
    no_public_ip = "true"
  }
}
☸️
Network Policies on Kubernetes
EKS / GKE / AKS β–Ό
K8s Network Policy
Kubernetes Network Policies for Spark
In Spark on Kubernetes, the Driver Pod and Executor Pods run in the cluster. A NetworkPolicy restricts which pods can talk to which β€” ensuring Spark pods can only communicate with each other and necessary services (S3, Iceberg catalog), not with other workloads in the cluster.
yaml β€” Kubernetes NetworkPolicy for Spark
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: spark-network-policy
  namespace: spark-jobs
spec:
  # Apply to all Spark pods (driver and executors)
  podSelector:
    matchLabels:
      spark-role: executor  # also apply separate policy for driver

  policyTypes:
    - Ingress
    - Egress

  ingress:
    # Executors only accept traffic from the Driver pod
    - from:
        - podSelector:
            matchLabels:
              spark-role: driver
      ports:
        - port: 7078   # executor port
        - port: 7079   # blockmanager port

  egress:
    # Allow egress to other Spark pods in same namespace
    - to:
        - namespaceSelector:
            matchLabels:
              name: spark-jobs
    # Allow egress to S3 (via VPC endpoint, port 443)
    - ports:
        - port: 443
          protocol: TCP
YARN Firewall
Firewall Rules for YARN Clusters
For on-prem Spark/YARN, iptables or firewalld rules replace Kubernetes NetworkPolicies. Restrict the key Spark ports so only cluster nodes can communicate internally.
text β€” key Spark ports to control
Driver port:              4040  (Spark UI β€” restrict to internal VPN)
History Server:           18080 (restrict to internal VPN)
Executor blockmanager:    Random ephemeral (allow within cluster CIDR)
Thrift Server JDBC:       10000 (allow from app servers only)
YARN ResourceManager UI:  8088  (restrict to ops team)
HDFS NameNode:            8020  (allow from cluster nodes only)
Kerberos KDC:             88    (allow from all cluster nodes)
31.5 β€” COMPLIANCE FRAMEWORKS

Compliance Frameworks

Compliance frameworks define legal and regulatory requirements for how data must be handled. As a data engineer, you need to understand what each framework demands so you can implement the right Spark patterns. HIPAA, GDPR, and PCI-DSS are the three most commonly encountered.

πŸ₯
HIPAA
Healthcare β–Ό
What is HIPAA?
HIPAA β€” Health Insurance Portability and Accountability Act
HIPAA applies to any system handling Protected Health Information (PHI) β€” data that can identify a patient and relates to their health. Examples: diagnosis codes, treatment records, medical images, insurance claims, patient names+dates. Violations result in massive fines.
PHI ExampleDE Implication
Patient name + diagnosisMust be encrypted at rest and in transit
DOB + ZIP codeTogether can identify a patient β†’ treat as PHI
Medical imagesDe-identify before storing in data lake
Insurance claim amountsAccess-controlled; audit logged
PHI Handling
PHI Handling in Spark Pipelines
De-identify PHI as early as possible in the pipeline β€” ideally in the Bronze β†’ Silver transformation. De-identified data can be used freely; PHI-containing data must remain in the secure zone.
python β€” PHI de-identification in PySpark
from pyspark.sql import functions as F

# Bronze layer: raw PHI data (restricted access)
raw_claims = spark.table("bronze.medical_claims")

# De-identification for Silver layer (analysts can access)
deidentified = (raw_claims
    # Remove direct identifiers
    .drop("patient_name", "ssn", "address", "phone", "email")

    # Replace patient_id with one-way hash (pseudonymization)
    .withColumn("patient_hash",
                F.sha2(F.concat("patient_id", F.lit("SALT_HIPAA_2026")), 256))
    .drop("patient_id")

    # Date shift: shift all dates by a random patient-specific offset
    # (so date patterns are preserved but exact dates are obscured)
    .withColumn("service_year", F.year("service_date"))  # keep year only
    .drop("service_date", "dob")

    # Generalise ZIP to first 3 digits only (HIPAA safe harbor rule)
    .withColumn("zip3", F.substring("zip_code", 1, 3))
    .drop("zip_code")
)

deidentified.write.format("delta").mode("overwrite").saveAsTable("silver.claims_deidentified")
Access Controls
Access Controls for PHI
HIPAA's "minimum necessary" standard requires that only the minimum data needed for a task is accessed. In Spark, implement this through column-level grants (Unity Catalog / Ranger) and separate Bronze/Silver zones.
sql β€” HIPAA access control in Unity Catalog
-- PHI zone: only hipaa_authorised_users can access bronze
GRANT SELECT ON TABLE bronze.medical_claims TO `hipaa_authorised_users`;

-- Analysts can access de-identified silver layer
GRANT SELECT ON TABLE silver.claims_deidentified TO `data_analysts`;

-- Log all access to PHI (Unity Catalog audit)
SELECT user_identity.email, action_name, event_time
FROM system.access.audit
WHERE request_params.table = 'medical_claims'
ORDER BY event_time DESC;
Audit Logging
Audit Logging for HIPAA
HIPAA requires audit logs of who accessed PHI, when, and what they did. Implement structured audit logging in Spark pipelines that writes to an immutable audit table (Delta with no-deletes policy).
python β€” pipeline audit logging
import datetime
from pyspark.sql import Row

def write_audit_log(spark, pipeline_name, user, table_accessed, row_count, status):
    audit_row = [Row(
        audit_id=str(datetime.datetime.utcnow().timestamp()),
        pipeline_name=pipeline_name,
        accessed_by=user,
        table_accessed=table_accessed,
        access_time=str(datetime.datetime.utcnow()),
        rows_accessed=row_count,
        status=status,
        data_classification="PHI"
    )]
    audit_df = spark.createDataFrame(audit_row)
    # Append-only β€” HIPAA requires immutable audit trail
    audit_df.write.format("delta").mode("append").saveAsTable("audit.phi_access_log")

# Call after each PHI access
write_audit_log(spark,
    pipeline_name="claims_etl",
    user="spark_svc@corp.local",
    table_accessed="bronze.medical_claims",
    row_count=raw_claims.count(),
    status="SUCCESS"
)
πŸ‡ͺπŸ‡Ί
GDPR
EU Data Privacy β–Ό
What is GDPR?
GDPR β€” General Data Protection Regulation
GDPR applies to any processing of EU residents' personal data regardless of where your company is located. Key principles for data engineers: data minimisation, purpose limitation, storage limitation, and the right to erasure ("right to be forgotten").
Right to Erasure
Right to Erasure with Delta Lake / Iceberg
GDPR's right to erasure requires that when a user requests deletion of their data, you can actually delete it from all storage locations β€” including historical data. Delta Lake MERGE + DELETE + VACUUM and Iceberg row-level deletes make this possible in a Spark-based lakehouse.
python β€” GDPR right to erasure in Delta Lake
from delta.tables import DeltaTable

# User submits erasure request for user_id = 12345
USER_TO_DELETE = 12345

# Step 1: Delete from all Delta tables containing this user's data
tables_with_pii = [
    "gold.customers",
    "silver.orders",
    "silver.clickstream",
    "bronze.raw_events",
]

for table_name in tables_with_pii:
    dt = DeltaTable.forName(spark, table_name)
    dt.delete(f"user_id = {USER_TO_DELETE}")
    print(f"Deleted user {USER_TO_DELETE} from {table_name}")

# Step 2: VACUUM to remove old Parquet files containing the user's data
# Default retention is 7 days β€” GDPR erasure must VACUUM after that period
# For immediate erasure, you can override retention (with caution):
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

for table_name in tables_with_pii:
    spark.sql(f"VACUUM {table_name} RETAIN 0 HOURS")
    print(f"Vacuumed old files from {table_name}")

# Step 3: Log the erasure request completion
spark.sql(f"""
    INSERT INTO gdpr.erasure_log VALUES (
      {USER_TO_DELETE}, current_timestamp(), 'COMPLETED', 'All tables cleaned'
    )
""")
Time travel caveat: After DELETE, Delta's time travel lets you still see old data within the retention window. VACUUM with RETAIN 0 HOURS removes historical files immediately. Only do this with careful coordination β€” you lose the ability to time-travel after VACUUM.
Data Minimisation
Data Minimisation in Pipelines
GDPR requires you only collect and retain data that is necessary for the stated purpose. In Spark pipelines, implement column dropping, purpose-based schemas, and retention policies.
python β€” data minimisation patterns
# Drop PII fields that are not needed for analytics purpose
ANALYTICS_PURPOSE_COLUMNS = [
    "order_id", "product_id", "quantity",
    "amount", "order_date", "country"
    # NOT: customer_name, email, phone, address
]

analytics_df = raw_orders.select(ANALYTICS_PURPOSE_COLUMNS)
analytics_df.write.saveAsTable("gold.order_analytics")

# Implement retention: delete records older than 2 years
DeltaTable.forName(spark, "silver.events").delete(
    "event_date < date_sub(current_date(), 730)"
)
πŸ’³
PCI-DSS
Payment Card β–Ό
What is PCI-DSS?
PCI-DSS β€” Payment Card Industry Data Security Standard
PCI-DSS applies to any system that stores, processes, or transmits cardholder data (PANs, CVVs, card expiry dates, PINs). The standard has 12 major requirements. For data engineers, the critical ones are: encryption of stored PANs, tokenization to remove PANs from analytics systems, strict access control, and comprehensive audit logging.
PCI Data ElementWhat you MUST do
Full PAN (card number)Encrypt at rest (AES-256) OR tokenize β€” never store in plaintext
CVV/CVCNever store after authorization β€” must delete immediately
Expiry date + cardholder nameEncrypt at rest if stored alongside PAN
PIN blocksNever store in a data lake β€” transaction processor only
Encryption Requirements
PCI Encryption Requirements in Spark
PCI Requirement 3 mandates protection of stored cardholder data. AES-256 encryption is the standard. Key management must follow strict controls: keys stored in HSMs or KMS, separation of duties, annual key rotation.
python β€” PCI-compliant PAN handling in Spark
import boto3, json
from pyspark.sql import functions as F

# Step 1: Retrieve AES-256 key from KMS/Vault (never hardcode)
def get_pci_key():
    client = boto3.client("secretsmanager", region_name="ap-south-1")
    secret = client.get_secret_value(SecretId="pci/pan_encryption_key")
    return json.loads(secret["SecretString"])["aes_key"]

AES_KEY = get_pci_key()  # 32-char string = AES-256

# Step 2: Process payment data β€” tokenize PAN, drop CVV immediately
payment_df = spark.table("landing.payment_events")

pci_safe_df = (payment_df
    # Tokenize PAN: replace with SHA-256 HMAC token (non-reversible)
    # For reversible: use aes_encrypt with GCM mode
    .withColumn("pan_token",
                F.sha2(F.concat(F.col("card_pan"), F.lit(AES_KEY)), 256))

    # Show only last 4 digits for display/logging
    .withColumn("pan_last4",
                F.substring("card_pan", -4, 4))

    # CVV MUST be dropped immediately (PCI Req 3.2.1)
    .drop("card_pan", "cvv", "pin")
)

# Write to PCI-scoped table (separate catalog/schema with strict ACL)
pci_safe_df.write.format("delta").mode("append").saveAsTable("pci_scope.payment_safe")
Access Logging
PCI Access Logging (Requirement 10)
PCI Requirement 10 mandates that all access to system components and cardholder data must be logged. Log who accessed the PCI-scoped tables, when, from where, and what queries were run. Logs must be retained for 12 months (3 months immediately available).
python β€” PCI audit log writer
import datetime, socket, os

def pci_audit_log(spark, action, table, user=None, rows_affected=0):
    """Write PCI Requirement 10-compliant audit log entry"""
    if user is None:
        user = os.environ.get("SPARK_USER", "unknown")

    audit_record = [{
        "timestamp":     str(datetime.datetime.utcnow()),
        "user_id":       user,
        "action":        action,   # SELECT / INSERT / DELETE
        "object_type":   "TABLE",
        "object_name":   table,
        "rows_affected": rows_affected,
        "origin_ip":    socket.gethostbyname(socket.gethostname()),
        "success":       True,
        "pci_scope":    True
    }]
    audit_df = spark.createDataFrame(audit_record)
    # Immutable append-only audit table (Delta)
    audit_df.write.format("delta").mode("append").saveAsTable("pci_audit.access_log")

# Usage
pci_audit_log(spark, "SELECT", "pci_scope.payment_safe", rows_affected=50000)
MODULE 31 β€” SUMMARY

Summary & Quick Reference

Complete reference for Spark Security β€” the key concepts, commands, and patterns from each section.

πŸ“‹
Module 31 β€” Complete Quick Reference
β–Ό
31.1 Authentication
Authentication β€” Key Facts
kinit = interactive TGT keytab = non-interactive auth --principal + --keytab = spark-submit Kerberos args LDAP auth on Thrift Server via hive-site.xml spark.authenticate=true β†’ RPC encryption spark.ssl.ui.enabled=true β†’ HTTPS UI spark.network.crypto.enabled=true β†’ shuffle encryption keytool = manage Java keystores/truststores
31.2 Authorization
Authorization β€” Key Facts
Ranger = on-prem policy enforcement (Hive, HDFS, Spark) Ranger row filter = auto-inject WHERE clause Ranger col mask = transform at query time Unity Catalog = Databricks 3-level namespace GRANT / REVOKE SQL statements Row filter = SQL function attached to table Column mask = SQL function on column system.access.audit = UC audit log table
31.3 Data Security
Data Security β€” Key Facts
Static masking = permanent copy, for dev/test Dynamic masking = query-time, same table, Ranger/UC Tokenization = reversible substitution (PCI) aes_encrypt(col, key, "GCM") = column encryption SSE-KMS = S3 encryption at rest dbutils.secrets.get() = Databricks secrets secretsmanager.get_secret_value() = AWS secrets
31.4 Network
Network Security β€” Key Facts
Private subnets = no public IPs on cluster nodes S3 Gateway Endpoint = free, keeps S3 traffic in AWS Interface Endpoints = Glue, KMS, Secrets Manager Security Groups = stateful, instance-level firewalls K8s NetworkPolicy = restrict pod-to-pod traffic NAT Gateway = outbound internet for patching only
31.5 Compliance
Compliance β€” Key Facts
FrameworkScopeKey DE Requirement
HIPAAHealthcare PHI (US)De-identify PHI, encrypt at rest, immutable audit log
GDPREU residents' personal dataRight to erasure (Delta DELETE+VACUUM), data minimisation
PCI-DSSPayment card dataTokenize PAN, drop CVV immediately, AES-256 if stored, audit Req 10
Interview Prep
Top Interview Questions β€” Module 31
  • What is the difference between Authentication and Authorization in Spark?
  • How does Kerberos ticket renewal work in a long-running Spark Streaming job?
  • What is the difference between static masking and dynamic masking?
  • How does Apache Ranger column masking work β€” does it change the stored data?
  • How would you implement GDPR right to erasure in a Delta Lake lakehouse?
  • What is a VPC Endpoint and why is it important for Spark on AWS?
  • What is the difference between Ranger row filter and a view-based approach?
  • How does Unity Catalog row filter differ from Ranger row filter?
  • What encryption mode should you use for PCI cardholder data β€” ECB or GCM? Why?
  • Where does AES key for column encryption come from in a production pipeline?
  • What HIPAA "Safe Harbor" de-identification requires for ZIP codes and dates?
  • How does spark.network.crypto.enabled=true affect Spark shuffle performance?
Security Checklist
Production Spark Security Checklist
Spark Security β€” Production Checklist Authentication: β–‘ Kerberos configured for YARN clusters (principal + keytab in spark-submit) β–‘ Token renewal configured for Streaming jobs β–‘ LDAP authentication on Thrift Server (if SQL endpoint exposed) β–‘ SSL/TLS enabled for Spark UI (spark.ssl.ui.enabled=true) β–‘ RPC encryption enabled (spark.network.crypto.enabled=true) β–‘ I/O encryption enabled (spark.io.encryption.enabled=true) Authorization: β–‘ Ranger/UC policies enforce table/column/row access β–‘ Analysts access masked views or UC-masked tables β–‘ PII-containing tables restricted to PII-authorised groups β–‘ DDL permissions (DROP, CREATE) restricted to data engineers Data Security: β–‘ PII masking in Silver layer (hash, tokenize, or drop) β–‘ No hardcoded credentials in any code or config β–‘ AES encryption for fields that must be stored but not visible β–‘ S3 SSE-KMS enabled for all data lake buckets Network: β–‘ Cluster nodes in private subnets (no public IPs) β–‘ S3 Gateway VPC Endpoint configured β–‘ Glue/KMS/Secrets Manager Interface Endpoints configured β–‘ Security groups restrict inbound to cluster CIDR only β–‘ Spark UI not exposed beyond internal VPN Compliance: β–‘ HIPAA: PHI de-identified before Silver layer β–‘ HIPAA: Immutable audit log for PHI access β–‘ GDPR: Erasure runbook documented and tested β–‘ GDPR: Data retention policies enforced via Delta DELETE β–‘ PCI: PAN tokenized, CVV dropped at ingestion β–‘ PCI: Access logs retained 12 months