MODULE 29 — OVERVIEW

AWS Cloud + Boto3
Data Engineering

This is the largest and most practical module in the entire course. It covers every AWS service a Data Engineer uses daily — from S3 and Glue to EMR, MSK, Lambda, and CloudWatch — plus a complete Boto3 API reference with production-grade code patterns, error handling, and 8 real-world pipeline architectures.

🪣

Storage & Security

S3, IAM, KMS, Secrets Manager, Parameter Store — the foundation every pipeline builds on.

⚙️

Compute & ETL

Glue, EMR, Athena, Redshift, Lake Formation — where your Spark and SQL transformations run.

📨

Messaging & Events

MSK, SQS, SNS, EventBridge, Lambda — the glue between pipeline stages.

🐍

Boto3 Deep Dive

17 complete API sections — every call a DE needs, with error handling, paginators, and waiters.

🏗️

Pipeline Patterns

8 end-to-end real-world pipeline architectures with full boto3 code.

📋

IaC + Governance

Terraform, Data Quality, CDC, Delta, Iceberg, Observability — production-grade practices.

MODULE 29 — COMPLETE TOPIC MAP

STORAGE & SECURITY
29.1 Amazon S3 → Buckets, data lake design, multipart, security, S3 Select
29.2 AWS IAM → Roles, policies, STS, cross-account access
29.3 AWS KMS → Envelope encryption, CMK, S3/Glue/Redshift encryption
29.4 Secrets Manager → Storing & retrieving creds in Python
29.5 Parameter Store → Pipeline config, SecureString, /project/env/key hierarchy

COMPUTE & ETL
29.6 AWS Glue → Data Catalog, Crawlers, ETL Jobs, Data Quality
29.7 AWS EMR → Spark on EMR, Serverless, Autoscaling, Spot, Bootstrap
29.8 Amazon Athena → Serverless SQL, partition pruning, Iceberg, Federated
29.9 Lake Formation → Fine-grained permissions, LF-Tags, cross-account sharing
29.10 Amazon Redshift → COPY/UNLOAD, distribution styles, Spectrum, WLM

MESSAGING & EVENTING
29.11 Amazon MSK → Managed Kafka, Schema Registry, MSK Connect, Spark+MSK
29.12 Amazon DynamoDB → Pipeline metadata store, audit tables, Spark integration
29.13 Amazon RDS → JDBC reads, partitioned Spark reads, read replicas
29.14 EventBridge → Cron scheduling, event-driven pipeline triggers
29.15 Amazon SQS → Queue types, DLQ, consume-process-delete pattern
29.16 Amazon SNS → Fan-out, failure alerts, SNS→SQS pattern
29.17 AWS Lambda → S3 triggers, Glue/EMR launchers, lightweight ETL
29.18 CloudWatch → Custom metrics, alarms, Log Insights, dashboards
29.19 AWS VPC → Subnets, VPC Endpoints, security groups for DE

OPTIMIZATION & GOVERNANCE
29.20 Cost Optimization → Spot, Reserved, S3 lifecycle, Athena cost control
29.21 Data Governance → Lake Formation, lineage, PII, column masking
29.22 Terraform (IaC) → Provision S3, IAM, Glue, EMR, Lambda, VPC as code
29.23 Data Quality Eng. → Great Expectations, Glue DQ, quarantine patterns
29.24 Streaming Pipelines → Kafka+Spark+Delta, watermark, foreachBatch
29.25 CDC Pipelines → DMS, Debezium, MERGE INTO Delta/Iceberg
29.26 Delta & Iceberg → ACID on S3, time travel, MERGE, OPTIMIZE, VACUUM
29.27 Data Modeling → Dimensional, SCD Type 2, Data Vault, dbt
29.28 Spark Perf. Eng. → Partitioning, joins, AQE, caching, compaction
29.29 Observability → Metrics, logs, lineage, SLA, alerting

BOTO3 DEEP DIVE (29.30.1 – 29.30.17)
.1 Fundamentals
.2 Error Handling
.3 Paginators
.4 Waiters
.5 S3 APIs
.6 Glue APIs
.7 Athena APIs
.8 EMR APIs
.9 Lambda APIs
.10 Secrets Mgr
.11 SQS APIs
.12 SNS APIs
.13 DynamoDB APIs
.14 CloudWatch APIs
.15 STS APIs
.16 EventBridge APIs
.17 RDS/Redshift APIs

REAL-WORLD PIPELINE PATTERNS (P1–P8)
P1 File Arrival Batch
P2 Daily Scheduled on EMR
P3 Metadata-Driven Multi
P4 CDC Streaming Pipeline
P5 Athena Query Automation
P6 Cross-Account Data Access
P7 Data Quality Gate
P8 Error Recovery Pipeline

☁️ AWS Focus

This module focuses on the 80% of AWS you will actually use as a Data Engineer — not every AWS service, but every service that matters for building, running, monitoring, and securing data pipelines at scale.

29.1

Amazon S3 — The Data Lake Foundation

S3 (Simple Storage Service) is where almost every modern data pipeline begins and ends. Every Spark job you write on EMR, Databricks, or Glue is ultimately reading from and writing to S3. Understanding S3 deeply — buckets, storage classes, partitioning layout, performance tricks, and security — is non-negotiable for a Data Engineer.

🪣

S3 Fundamentals CORE CONCEPT

Buckets and Objects

S3 stores data as objects inside buckets. A bucket is a globally unique top-level container (like s3://my-company-data-lake). An object is just a file — a Parquet file, a CSV, a JSON blob, an image — identified by a key (its full path inside the bucket). There is no real "folder" structure; S3 is a flat key-value store, but tools like the console and Spark simulate folders using the / character in the key name.

🗄️ Analogy

Think of a bucket as a giant warehouse with no shelves — every box (object) just has a long label (key) stuck on it like raw/sales/2024/01/15/data.parquet. The warehouse robot (S3) can instantly find any box by its label, but it doesn't actually organize boxes into physical shelves — the "folder" look is just how the label is written.

python — basic bucket & object operations

import boto3

s3 = boto3.client("s3")

# An object's "path" is really just its key string
bucket = "my-company-data-lake"
key    = "raw/sales/2024/01/15/transactions.parquet"

# Upload a local file as an object
s3.upload_file("transactions.parquet", bucket, key)

# s3://my-company-data-lake/raw/sales/2024/01/15/transactions.parquet
print(f"s3://{bucket}/{key}")

Prefixes and Delimiters

Since S3 has no real folders, the console and APIs use a prefix (everything before the last /) plus a delimiter (usually /) to group keys and make them look like folders. This is critical when listing objects — using Prefix + Delimiter in list_objects_v2 lets you list only "files in this folder" instead of every object in the entire bucket.

python — list objects under a prefix

response = s3.list_objects_v2(
    Bucket="my-company-data-lake",
    Prefix="raw/sales/2024/01/",   # acts like a "folder"
    Delimiter="/"                  # stops at the next "folder" level
)

# CommonPrefixes lists the "sub-folders" one level down (e.g. raw/sales/2024/01/15/)
for p in response.get("CommonPrefixes", []):
    print(p["Prefix"])
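A single list_objects_v2 call returns at most 1,000 keys, so listing a real partition usually means following pagination. The following is a minimal sketch using boto3's built-in paginator. The helper name iter_keys is hypothetical, and the client is passed in as an argument (for example boto3.client("s3")) rather than created inside the function.

```python
def iter_keys(s3_client, bucket, prefix):
    """Yield every object key under a prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no objects
        for obj in page.get("Contents", []):
            yield obj["Key"]

# Usage (assumes an existing client):
#   for key in iter_keys(s3, "my-company-data-lake", "raw/sales/2024/01/"):
#       print(key)
```

Because the helper is a generator, callers can stop early without fetching the remaining pages.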

Object Metadata and Tagging

Every object has system metadata (size, last modified, ETag, storage class) and can have custom metadata (key-value pairs you attach, like source-system: salesforce). Tags are separate from metadata — they're used for cost allocation, lifecycle rules, and access control, and can be changed without re-uploading the object (metadata changes require a copy).

python — metadata vs tags

# Custom metadata is set at upload time
with open("data.parquet", "rb") as f:   # context manager closes the file handle
    s3.put_object(
        Bucket=bucket, Key=key, Body=f,
        Metadata={"source-system": "salesforce", "pipeline": "sales-etl"}
    )

# Tags can be added/changed anytime — used for cost allocation & lifecycle
s3.put_object_tagging(
    Bucket=bucket, Key=key,
    Tagging={"TagSet": [
        {"Key": "environment", "Value": "production"},
        {"Key": "team", "Value": "data-engineering"}
    ]}
)

Storage Classes

S3 offers multiple storage classes that trade retrieval speed for cost. As a DE, choosing the right class for raw, processed, and archival data is one of the easiest ways to cut storage cost by 50-90%.

Storage Class	Use Case	Retrieval	Relative Cost
Standard	Frequently accessed data (Bronze/Silver layers)	Instant	Highest
Intelligent-Tiering	Unknown/changing access patterns	Instant	Auto-optimized
Standard-IA	Monthly reports, less-frequent reads	Instant	Lower
One Zone-IA	Re-creatable data, single-AZ ok	Instant	Lower still
Glacier Instant Retrieval	Archives accessed quarterly	Instant	Low
Glacier Flexible Retrieval	Compliance archives, rare access	Minutes-hours	Very low
Glacier Deep Archive	7+ year regulatory retention	Within 12 hours (standard), up to 48 hours (bulk)	Cheapest

Lifecycle Policies

A lifecycle policy automatically moves objects between storage classes (or deletes them) based on age, without any pipeline code. A common DE pattern: keep raw data on Standard for the first 30 days, transition it to Standard-IA, move it to Glacier Deep Archive after a year, and expire it after 7 years for compliance.

python — lifecycle policy for raw zone

s3.put_bucket_lifecycle_configuration(
    Bucket="my-company-data-lake",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "raw-zone-tiering",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30,  "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}  # API value for Glacier Deep Archive
            ],
            "Expiration": {"Days": 2555}  # ~7 years
        }]
    }
)

📌 Key Point

Lifecycle transitions are not free: each transition is billed as a request per object, and Standard-IA and One Zone-IA bill a minimum of 128 KB per object and 30 days of storage. Moving tiny files between classes too early can cost more in transition fees than it saves, which is one more reason to solve the small files problem first.

🏛️

Data Lake Design ARCHITECTURE

Folder Structure Conventions

Even though S3 has no real folders, a consistent key-naming convention is essential so Spark, Glue, and Athena can discover and partition data correctly. A typical convention separates the zone, domain/table name, and partition values.

text — typical key layout

s3://my-company-data-lake/
├── bronze/
│   └── sales/orders/
│       └── year=2024/month=01/day=15/part-0001.parquet
├── silver/
│   └── sales/orders_cleaned/
│       └── year=2024/month=01/day=15/part-0001.parquet
└── gold/
    └── sales/daily_revenue/
        └── year=2024/month=01/part-0001.parquet

Hive-Style Partitioning (year=/month=/day=)

Hive-style partitioning encodes both the column name and its value in the path: year=2024/month=01/day=15/. Glue Crawlers and Spark automatically recognize this pattern and turn year, month, day into queryable columns — without you needing to read the file to know its date.

python — writing Hive-style partitions from Spark

df.write \
  .partitionBy("year", "month", "day") \
  .mode("append") \
  .parquet("s3://my-company-data-lake/silver/sales/orders_cleaned/")

# Resulting path: .../orders_cleaned/year=2024/month=01/day=15/part-xxxx.parquet

📌 Why It Matters

When you query WHERE year=2024 AND month=1, Athena/Spark can skip reading every other partition entirely — this is partition pruning, and it's one of the biggest cost and speed wins in a data lake.
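Code that runs outside Spark (a Lambda function, an upload script) sometimes has to build the same Hive-style prefix by hand. The following is a small sketch; the helper name hive_partition_prefix is hypothetical, and the zero-padded month/day format is an assumption chosen to match the layout shown above.

```python
from datetime import date

def hive_partition_prefix(table_prefix, day):
    """Build the Hive-style key prefix for one calendar day."""
    return (
        f"{table_prefix.rstrip('/')}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )

print(hive_partition_prefix("silver/sales/orders_cleaned", date(2024, 1, 15)))
# silver/sales/orders_cleaned/year=2024/month=01/day=15/
```

Keeping this logic in one function helps every writer produce identical partition paths, which is what partition discovery relies on.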

Dynamic Partitions & Partition Discovery

Dynamic partitioning means Spark decides the partition values from the data itself (e.g., each row's year/month column determines where it lands) rather than you hardcoding a single partition. Partition discovery is the process by which Glue Catalog or Spark scans the S3 prefix tree and registers each year=.../month=.../day=... combination as a partition in the metastore so queries can use it.

python — register new partitions after a write

# After writing new partitions, refresh metastore so Athena/Spark SQL sees them
spark.sql("MSCK REPAIR TABLE silver.orders_cleaned")

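# Or, more efficiently with Glue, add only the new partition via boto3
import copy
glue = boto3.client("glue")

# Reuse the table's StorageDescriptor (columns, formats, SerDe) and change only the
# location; a descriptor holding just "Location" lacks the format info readers need
table = glue.get_table(DatabaseName="silver", Name="orders_cleaned")["Table"]
partition_sd = copy.deepcopy(table["StorageDescriptor"])
partition_sd["Location"] = "s3://my-company-data-lake/silver/sales/orders_cleaned/year=2024/month=01/day=15/"

glue.batch_create_partition(
    DatabaseName="silver", TableName="orders_cleaned",
    PartitionInputList=[{"Values": ["2024", "01", "15"], "StorageDescriptor": partition_sd}]
)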

Raw / Bronze / Silver / Gold Zone Layout

This is the medallion architecture applied to S3 zones. Each zone is a separate top-level prefix (often even a separate bucket for stricter access control) representing a stage of data quality.

🟫

Bronze / Raw

Exact copy of source data — untouched, often in original format (CSV/JSON). Source of truth for replay.

⚪

Silver

Cleaned, deduplicated, conformed schema — usually Parquet/Delta, ready for joins.

🟡

Gold

Business-level aggregates and marts — what BI tools and dashboards query directly.

Single-Zone vs Multi-Zone Lake Design

A single-zone lake puts bronze/silver/gold as prefixes inside one bucket — simpler to manage, but harder to apply different access policies per zone. A multi-zone (multi-bucket) design gives each zone its own bucket — company-raw, company-silver, company-gold — enabling stricter IAM policies (e.g., only the ingestion role can write to raw; analysts can only read gold).

Aspect	Single-Zone (one bucket)	Multi-Zone (per-zone buckets)
Access control granularity	Prefix-based IAM policies	Bucket-level IAM, simpler policies
Lifecycle policies	Per-prefix rules	Per-bucket rules, cleaner
Operational simplicity	Fewer buckets to manage	More resources, more Terraform
Common at	Small-medium teams	Large enterprises, multi-account

⚡

Performance OPTIMIZATION

Multipart Upload (Recommended for Files > 100 MB, Required Above 5 GB)

For files larger than ~100 MB, AWS recommends multipart upload — splitting the file into parts (5 MB - 5 GB each) and uploading them in parallel, then telling S3 to assemble them. This is faster, more resilient (a failed part can be retried alone), and required for files over 5 GB.

python — multipart upload steps

# 1. Initiate
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
upload_id = mpu["UploadId"]

parts = []
try:
    # 2. Upload each part (can be done in parallel with ThreadPoolExecutor)
    for i, chunk in enumerate(file_chunks, start=1):
        resp = s3.upload_part(
            Bucket=bucket, Key=key, UploadId=upload_id,
            PartNumber=i, Body=chunk
        )
        parts.append({"PartNumber": i, "ETag": resp["ETag"]})

    # 3. Complete — S3 assembles the object from parts
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": parts}
    )
except Exception:
    # 4. Always abort on failure to avoid storage charges for orphaned parts
    s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
    raise
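The multipart snippet above leaves file_chunks undefined. One way to produce it is a generator that reads fixed-size parts from a binary file object. This is a sketch under assumptions: the name iter_file_chunks and the 8 MB default part size are illustrative choices, not anything S3 prescribes beyond the 5 MB minimum for every part except the last.

```python
import io

MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for every part except the last

def iter_file_chunks(fileobj, part_size=8 * 1024 * 1024):
    """Yield successive byte chunks of part_size from a binary file object."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size must be at least 5 MB")
    while True:
        chunk = fileobj.read(part_size)
        if not chunk:
            break
        yield chunk

# In-memory demonstration: 12 MB of data split into 5 MB parts
demo = io.BytesIO(b"x" * (12 * 1024 * 1024))
print([len(c) for c in iter_file_chunks(demo, part_size=MIN_PART_SIZE)])
# [5242880, 5242880, 2097152]
```

With a real file this would be used as `with open("transactions.parquet", "rb") as f: file_chunks = iter_file_chunks(f)`. S3 allows at most 10,000 parts per upload, so very large files need a proportionally larger part size.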

Parallel Upload with ThreadPoolExecutor + boto3

In practice, you rarely write raw multipart code — boto3's high-level upload_file() / TransferConfig handles multipart and threading automatically. But for custom pipelines (e.g., uploading hundreds of small files), wrapping uploads in a ThreadPoolExecutor dramatically speeds things up since each upload is mostly waiting on network I/O.

python — parallel file uploads

from concurrent.futures import ThreadPoolExecutor
import boto3, glob

s3 = boto3.client("s3")
bucket = "my-company-data-lake"

def upload_one(local_path):
    key = f"bronze/sales/{local_path.split('/')[-1]}"
    s3.upload_file(local_path, bucket, key)
    return key

files = glob.glob("data/*.parquet")

# Upload up to 10 files concurrently
with ThreadPoolExecutor(max_workers=10) as ex:
    results = list(ex.map(upload_one, files))

print(f"Uploaded {len(results)} files")

S3 Request Rate Optimization (Prefix Spread)

Modern S3 scales request rates automatically per prefix, but workloads with extremely high throughput (thousands of requests/sec) still benefit from spreading keys across multiple prefixes rather than writing everything under one hot prefix. Avoid sequential key names like 00001.parquet, 00002.parquet for very high-throughput write patterns — instead use a hash prefix or date-based prefixes that naturally spread the load.

🚦 Analogy

Imagine one cashier (a prefix) serving 10,000 customers a second versus 10 cashiers each serving 1,000. Even though S3's "cashiers" auto-scale today, designing for spread from the start avoids hot-partition issues at extreme scale.
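The hash-prefix idea can be sketched as follows. The function name hashed_key and the two-character width are assumptions for illustration; any stable hash would do.

```python
import hashlib

def hashed_key(key, width=2):
    """Prepend a short, deterministic hex prefix derived from the key itself."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:width]}/{key}"

# With width=2 there are 256 possible leading prefixes (00 through ff)
print(hashed_key("events/00001.parquet"))
print(hashed_key("events/00002.parquet"))
```

The trade-off is that keys which used to share a logical folder no longer share a leading prefix, so prefix listings and prefix-based lifecycle rules stop lining up with the logical layout. For most partitioned data lakes, date-based prefixes are the simpler way to spread load.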

Large File Handling Patterns in Spark + S3

Spark reads S3 files in splits. Very large single files (multi-GB CSVs) can become a bottleneck if they aren't splittable: gzip-compressed CSV is NOT splittable, so one task reads the whole file. Prefer splittable formats: Parquet or ORC (which stay splittable with internal codecs such as Snappy), or plain text that is uncompressed or bzip2-compressed. A Snappy-compressed plain text file, by contrast, is not splittable. Write data as many medium-sized files rather than one giant file.

⚠️ Common Mistake

Writing a 10 GB gzip CSV to S3 and then reading it in Spark — only one task can process that file because gzip isn't splittable, creating a massive bottleneck no matter how many executors you have.

S3 Select — Query Inside Objects Without Full Download

S3 Select lets you run a simple SQL-like filter directly on a CSV/JSON/Parquet object stored in S3, and S3 returns only the matching rows/columns — without downloading the whole object. This is useful for lightweight Lambda functions that need a small slice of a large file.

python — S3 Select on a CSV object

resp = s3.select_object_content(
    Bucket=bucket, Key="bronze/sales/orders.csv",
    ExpressionType="SQL",
    Expression="SELECT s.order_id, s.amount FROM S3Object s WHERE s.amount > 1000",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}}
)

for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())

📌 Note

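S3 Select is a lightweight tool for small lookups. For analytical queries at scale, Athena (built on Trino, supports Parquet/Iceberg, joins, aggregations) is the right tool. AWS has also stopped offering S3 Select to new customers (existing users can continue to use it), so check that it is available in your account before designing around it.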

S3 Transfer Acceleration

Transfer Acceleration routes uploads/downloads through Amazon's global CloudFront edge network instead of going directly to the bucket's region — useful when uploading from locations geographically far from your bucket's region (e.g., uploading from Asia to a US bucket). It's enabled per-bucket and used via a special endpoint (bucket.s3-accelerate.amazonaws.com).
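Enabling acceleration on a bucket is a single API call. A minimal sketch, assuming an existing S3 client is passed in; the helper name enable_transfer_acceleration is hypothetical.

```python
def enable_transfer_acceleration(s3_client, bucket):
    """Turn on Transfer Acceleration for one bucket and return the new status."""
    s3_client.put_bucket_accelerate_configuration(
        Bucket=bucket,
        AccelerateConfiguration={"Status": "Enabled"},
    )
    # Read the setting back to confirm it took effect
    return s3_client.get_bucket_accelerate_configuration(Bucket=bucket).get("Status")
```

Enabling the feature does not change existing clients: transfers only use the accelerate endpoint when the client is configured for it, for example with botocore's Config(s3={"use_accelerate_endpoint": True}). Bucket names containing periods are not supported by Transfer Acceleration.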

Small Files Problem — Causes and Compaction Solutions

The small files problem happens when a pipeline writes thousands of tiny files (a few KB each) instead of fewer, larger ones — often from streaming jobs with frequent micro-batches, or over-partitioned Spark writes. Each small file has metadata overhead (S3 LIST/GET calls, Spark task scheduling overhead), so listing and reading thousands of small files is dramatically slower than reading a handful of large ones.

🔍

Cause: Over-partitioning

Too many partition columns → too many output directories → tiny files per partition.

🔁

Cause: Streaming micro-batches

Each micro-batch writes its own small file set every few seconds/minutes.

🧹

Fix: Compaction job

Periodically read small files, coalesce()/repartition(), rewrite as fewer large files.

📐

Fix: Right-size writes

Use repartition(n) before writing so each output file is the right size.

Optimal File Sizes (128 MB - 1 GB for Parquet)

For Parquet on S3 read by Spark/Athena, the sweet spot is typically 128 MB to 1 GB per file. Files this size align well with Spark's default split size and HDFS-style block sizes, giving each task a meaningful chunk of work without making any single task too slow.

python — controlling output file size

# Estimate target partitions so each output file ≈ 256MB
total_size_mb = 5120   # e.g., 5 GB dataset
target_file_mb = 256
num_files = max(1, total_size_mb // target_file_mb)

df.repartition(num_files).write.mode("overwrite").parquet(
    "s3://my-company-data-lake/silver/sales/orders_cleaned/"
)

🔒

Security PROTECTION

Bucket Policies

A bucket policy is a resource-based JSON policy attached directly to the bucket, controlling who (which principals — users, roles, accounts) can perform which actions (read, write, delete) on the bucket and its objects. Bucket policies are essential for cross-account access and for blanket rules like "deny all unencrypted uploads."

json — bucket policy: deny unencrypted uploads

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyUnencryptedUploads",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:PutObject",
    "Resource": "arn:aws:s3:::my-company-data-lake/*",
    "Condition": {
      "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
    }
  }]
}

IAM Policies for S3

While bucket policies are attached to the resource (the bucket), IAM policies are attached to the principal (a user, role, or group) and define what S3 actions that principal can perform — across any bucket the policy allows. A Glue job's execution role, for example, gets an IAM policy granting s3:GetObject on the raw bucket and s3:PutObject on the silver bucket.

📌 Key Difference

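Bucket policy = "who can access this bucket." IAM policy = "what can this role access." Both are evaluated, and an explicit deny in either always wins. Within a single account, an allow in either policy is enough; across accounts, both the bucket policy and the caller's IAM policy must allow the action.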

Block Public Access Settings

S3 Block Public Access is an account/bucket-level setting (now on by default for new buckets) that overrides any policy or ACL that would make objects public. For data lakes — which almost always contain sensitive business data — this should remain enabled at the account level, with access granted only via IAM roles, never public URLs.

⚠️ Never Do This

Disabling Block Public Access "temporarily" to share a file is a common cause of major data breaches. Use presigned URLs (time-limited) instead for sharing specific objects.
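A presigned URL can be generated with one client call. This is a minimal sketch: the helper name presigned_download_url and the 15-minute default are assumptions, and the S3 client is passed in by the caller.

```python
def presigned_download_url(s3_client, bucket, key, expires_in=900):
    """Return a time-limited GET URL for one object (default 15 minutes)."""
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )

# Usage (assumes an existing client):
#   url = presigned_download_url(s3, "my-company-data-lake", "gold/sales/report.parquet")
```

The URL carries the permissions of the credentials that signed it, so it stops working when it expires or when those credentials do, whichever comes first.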

Encryption — SSE-S3, SSE-KMS, Client-Side

S3 supports encryption at rest in three main flavors:

Method	Key Management	Use Case
SSE-S3	AWS manages keys entirely (AES-256)	Default, simplest, free
SSE-KMS	You manage keys via AWS KMS (CMK)	Audit trail, key rotation, fine-grained access control
Client-Side	You encrypt before upload, S3 stores ciphertext	Maximum control, used for highly regulated data

python — enforce SSE-KMS on upload

s3.put_object(
    Bucket=bucket, Key=key, Body=data,
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="arn:aws:kms:us-east-1:123456789012:key/abcd-1234"
)

🧩

Advanced PRODUCTION

Versioning

When versioning is enabled on a bucket, every PUT to the same key creates a new version instead of overwriting — old versions remain accessible (and deletable separately). This protects against accidental overwrites/deletes, and is required for cross-region replication and certain compliance needs.

python — enable versioning

s3.put_bucket_versioning(
    Bucket="my-company-data-lake",
    VersioningConfiguration={"Status": "Enabled"}
)

⚠️ Cost Implication

Versioning means deleted/overwritten objects still take up storage until explicitly removed via lifecycle rules (NoncurrentVersionExpiration) — without that rule, costs silently grow.
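A lifecycle rule that expires noncurrent versions might be built like this. It is a sketch with assumed defaults (30 days for old versions, 7 days for abandoned multipart uploads), and the helper name version_cleanup_rule is hypothetical.

```python
def version_cleanup_rule(prefix, noncurrent_days=30, abort_upload_days=7):
    """Build one lifecycle rule that removes old versions and orphaned uploads."""
    rule_suffix = prefix.strip("/").replace("/", "-") or "all"
    return {
        "ID": f"cleanup-{rule_suffix}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # Delete versions this many days after they stop being the current one
        "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
        # Remove parts of multipart uploads that were never completed
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": abort_upload_days},
    }

print(version_cleanup_rule("raw/")["ID"])
# cleanup-raw
```

The rule is passed in the Rules list of put_bucket_lifecycle_configuration. That call replaces the bucket's entire lifecycle configuration, so include any existing rules in the same list.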

Cross-Region & Same-Region Replication

Cross-Region Replication (CRR) automatically copies new objects from a bucket in one region to a bucket in another region — used for disaster recovery, data residency requirements, or serving compute in multiple regions. Same-Region Replication (SRR) does the same within a region — common for separating production data into a compliance/audit account.

s3://prod-data-lake (us-east-1) → Replication Rule → s3://dr-data-lake (eu-west-1)

Event Notifications → Lambda / SQS / SNS

S3 can emit an event notification whenever an object is created, deleted, or restored. These events can trigger a Lambda function directly, or be sent to SQS (for buffering/decoupling) or SNS (for fan-out to multiple subscribers). This is the backbone of event-driven ingestion — "file lands → pipeline starts automatically."

python — configure S3 → Lambda trigger

s3.put_bucket_notification_configuration(
    Bucket="my-company-data-lake",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:start-ingest-pipeline",
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": "bronze/sales/"},
                {"Name": "suffix", "Value": ".parquet"}
            ]}}
        }]
    }
)
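On the receiving side, the Lambda function gets a JSON event with one record per object. A minimal handler sketch follows; the function names are hypothetical, and the only processing shown is printing the object location. Object keys arrive URL-encoded in the event, so they are decoded before use.

```python
from urllib.parse import unquote_plus

def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys are URL-encoded in the event (spaces as "+", "=" as "%3D")
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs

def lambda_handler(event, context):
    for bucket, key in extract_s3_objects(event):
        print(f"New object: s3://{bucket}/{key}")
    return {"processed": len(event.get("Records", []))}
```

S3 also needs permission to invoke the function (a resource-based permission on the Lambda) before the notification configuration is accepted.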

S3 Inventory for Auditing Large Buckets

S3 Inventory generates a daily/weekly report (a CSV/ORC/Parquet file) listing all objects in a bucket along with metadata (size, storage class, encryption status, last modified). For buckets with millions of objects, this is far cheaper and faster than calling list_objects_v2 repeatedly — and the report itself can be queried with Athena to audit storage usage, find unencrypted objects, or detect old files for cleanup.

✅ Example Use

A DE team queries the weekly S3 Inventory report with Athena to find every object older than 2 years still on Standard storage class — feeding a cost-optimization report without scanning the live bucket.

29.2

AWS IAM — Identity and Access Management

IAM controls who can do what across every AWS service your pipelines touch. Every Glue job, EMR cluster, Lambda function, and Airflow worker runs as an IAM role — not as "you." Getting IAM right is the difference between a pipeline that works in dev and breaks in prod with AccessDenied, and a pipeline that's secure, auditable, and portable across accounts.

🔑

IAM Fundamentals CORE CONCEPT

Users, Groups, Roles, Policies

IAM has four core building blocks. A user represents a person or application with long-term credentials. A group is a collection of users that share permissions. A role is an identity without permanent credentials — it's assumed temporarily by a user, service, or another account. A policy is the actual JSON document that defines permissions (allow/deny on specific actions and resources), and it's attached to users, groups, or roles.

🏢 Analogy

A user is like an employee with their own permanent ID badge. A role is like a visitor badge at the front desk — anyone can "become" that visitor temporarily, the badge expires automatically, and it only grants access to specific floors (permissions). A policy is the actual list printed on the badge: "Access: Floors 2 and 3 only, Friday until 6 PM."

Concept	Has Long-Term Credentials?	Typical Use in Data Engineering
User	Yes (access key/secret)	Local dev only — avoid in production
Group	N/A (contains users)	Organizing human users by team
Role	No — temporary creds via STS	Glue/EMR/Lambda execution, cross-account access
Policy	N/A (a document)	Attached to roles to grant least-privilege access

Inline vs Managed Policies

A managed policy is a standalone, reusable policy document (AWS-managed like AmazonS3ReadOnlyAccess, or customer-managed) that can be attached to multiple roles. An inline policy is embedded directly inside a single role/user/group — it exists only as part of that identity and is deleted if the identity is deleted.

Aspect	Managed Policy	Inline Policy
Reusability	Attach to many roles	Tied to one identity
Versioning	Has version history	No version history
Best for	Standard permission sets shared across teams	One-off, tightly-scoped exceptions for a single role

python — attach a managed policy vs put an inline policy

import json
import boto3

iam = boto3.client("iam")

# Managed policy — reusable, attach by ARN
iam.attach_role_policy(
    RoleName="glue-etl-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
)

# Inline policy — embedded directly in this one role only
iam.put_role_policy(
    RoleName="glue-etl-role",
    PolicyName="AllowWriteToSilverBucket",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::company-silver/*"
        }]
    })
)

Resource-Based Policies

Most IAM policies are identity-based: attached to a user/role, defining what that identity can do. A resource-based policy is attached to the resource itself (S3 bucket policies, KMS key policies, Lake Formation resource shares, SQS queue policies) and defines who can access that resource, including identities from other AWS accounts. Both types are evaluated together: an explicit deny always wins, and within a single account at least one explicit allow is required (cross-account requests need an allow on both sides).

📌 Key Point

Resource-based policies are the only way to grant access to a principal in a different AWS account without that account assuming a role in yours — though role assumption (covered below) is usually the cleaner pattern for cross-account data access.

⚙️

Important for Data Engineers PRACTICAL

Role Assumption Pattern (EC2 / Glue / EMR / Lambda Assume Roles)

Every AWS compute service that runs your code needs an execution role — a role the service assumes on your behalf to act with specific permissions. You never put access keys inside a Glue job or Lambda function; instead, you attach an IAM role, and AWS automatically injects temporary credentials into the running environment.

Glue Job starts → AWS assumes glue-etl-role → Temporary creds injected → boto3/Spark uses them automatically

json — trust policy allowing Glue to assume this role

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"Service": "glue.amazonaws.com"},
    "Action": "sts:AssumeRole"
  }]
}

✅ Example

Inside a running Glue job, your PySpark code calls boto3.client("s3") with no credentials specified — boto3 automatically finds the temporary credentials injected by the assumed glue-etl-role via the instance metadata / environment.
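Creating such an execution role from Python could look like the sketch below. The helper names are hypothetical, the IAM client is passed in by the caller, and permissions policies still have to be attached separately.

```python
import json

def service_trust_policy(service_principal):
    """Trust policy letting one AWS service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": service_principal},
            "Action": "sts:AssumeRole",
        }],
    }

def create_execution_role(iam_client, role_name, service_principal):
    """Create the role with its trust policy and return the role ARN."""
    response = iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(service_trust_policy(service_principal)),
    )
    return response["Role"]["Arn"]

# Usage (assumes an existing client):
#   create_execution_role(iam, "glue-etl-role", "glue.amazonaws.com")
```

The trust policy only answers "who may assume this role"; what the role can do once assumed comes from the policies attached to it.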

Least Privilege Design for Data Pipelines

Least privilege means granting only the exact permissions a role needs — nothing more. Instead of giving a Glue job's role s3:* on every bucket, scope it precisely: s3:GetObject on the raw bucket prefix it reads, s3:PutObject on the silver bucket prefix it writes, and nothing else.

json — least-privilege Glue ETL policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListRawSalesPrefixOnly",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::company-raw",
      "Condition": {"StringLike": {"s3:prefix": ["sales/*"]}}
    },
    {
      "Sid": "ReadRawSalesOnly",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::company-raw/sales/*"
    },
    {
      "Sid": "WriteSilverSalesOnly",
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::company-silver/sales/*"
    }
  ]
}

⚠️ Anti-Pattern

Granting "Action": "s3:*" and "Resource": "*" to "make it work" is the #1 IAM mistake. A single compromised job credential then has access to every bucket in the account, including production databases' backups.

Cross-Account Access Pattern

Large organizations often separate environments into different AWS accounts (e.g., data-raw-account, data-processing-account, analytics-account) for blast-radius isolation. Cross-account access lets a role in Account B assume a role in Account A. Two things must be in place: Account A's role trust policy explicitly lists Account B (or a specific role in it) as a trusted principal, and Account B's role has an identity policy allowing sts:AssumeRole on Account A's role ARN.

json — Account A role trusts Account B's Glue role

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::222222222222:role/processing-glue-role"},
    "Action": "sts:AssumeRole",
    "Condition": {"StringEquals": {"sts:ExternalId": "shared-secret-id"}}
  }]
}

📌 Note

The ExternalId condition is a security best-practice for third-party cross-account access — it prevents the "confused deputy" problem where another customer of the same third party could trick it into assuming your role.

Service-Linked Roles

A service-linked role is a special pre-defined role that an AWS service creates and manages automatically. You can't edit its permissions, and only that service can assume it. Examples relevant to DEs: the role EMR uses to clean up EC2 resources on your behalf (AWSServiceRoleForEMRCleanup), or the role behind Lake Formation's data access. They simplify setup because AWS defines and maintains the permissions the service needs.

⏱️

Temporary Credentials STS

STS AssumeRole

STS (Security Token Service) issues short-lived credentials (default 1 hour, configurable up to the role's max session duration) when a principal "assumes" a role. sts.assume_role() returns an AccessKeyId, SecretAccessKey, and SessionToken — all three are required together and expire automatically.

python — assume a role and use temporary credentials

sts = boto3.client("sts")

response = sts.assume_role(
    RoleArn="arn:aws:iam::333333333333:role/cross-account-reader",
    RoleSessionName="de-pipeline-session",
    DurationSeconds=3600
)

creds = response["Credentials"]

# Build a new session using the temporary credentials
session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"]
)

s3_other_account = session.client("s3")
s3_other_account.list_objects_v2(Bucket="other-account-bucket")

Cross-Account Role Assumption

This combines the cross-account trust policy (above) with assume_role(): Account B's pipeline calls sts.assume_role() targeting Account A's role ARN. Because Account A's trust policy lists Account B as trusted, STS issues temporary credentials scoped to whatever permissions Account A's role has — letting Account B's Spark job read Account A's S3 bucket or Glue Catalog without ever holding Account A's long-term credentials.
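
Because the temporary credentials expire, a long-running pipeline in Account B usually needs to notice when they are about to lapse and call assume_role() again. The sketch below is one way to do that, assuming the Credentials block returned by STS (its Expiration field is a timezone-aware datetime in boto3); the helper names and the five-minute safety margin are illustrative choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def session_kwargs(creds: dict) -> dict:
    """Map the Credentials block from assume_role onto boto3.Session arguments."""
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


def needs_refresh(creds: dict, now: Optional[datetime] = None,
                  margin: timedelta = timedelta(minutes=5)) -> bool:
    """True when the credentials expire within `margin` (or have already expired)."""
    now = now or datetime.now(timezone.utc)
    return creds["Expiration"] - now <= margin


# Demo with made-up credentials that expire one hour from now
demo = {
    "AccessKeyId": "ASIAEXAMPLE",
    "SecretAccessKey": "example-secret",
    "SessionToken": "example-token",
    "Expiration": datetime.now(timezone.utc) + timedelta(hours=1),
}
print(needs_refresh(demo))                 # fresh credentials
print(sorted(session_kwargs(demo).keys()))
# With real credentials: session = boto3.Session(**session_kwargs(creds))
```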

Session Tags

Session tags are key-value pairs passed during assume_role() that get attached to the resulting temporary session. Policies can then reference aws:PrincipalTag/... in their Condition blocks — enabling attribute-based access control (ABAC). For example, a single shared role can be assumed by multiple teams, but a session tag like team=sales restricts that specific session to only the sales team's S3 prefix.

python — assume role with session tags for ABAC

response = sts.assume_role(
    RoleArn="arn:aws:iam::444444444444:role/shared-team-role",
    RoleSessionName="sales-team-session",
    Tags=[{"Key": "team", "Value": "sales"}]
)
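# A policy statement whose Resource uses the tag as a policy variable, e.g.:
#   "Resource": "arn:aws:s3:::company-data/${aws:PrincipalTag/team}/*"
# scopes this session to company-data/sales/* only.
# The role's trust policy must also allow sts:TagSession for the caller.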

🏷️ Analogy

Session tags are like writing "Sales Dept Only" on a temporary visitor badge — the badge template (role) is the same for everyone, but what's written on it that day determines which doors actually open.
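
To make the substitution concrete, here is a small local simulation of how a ${aws:PrincipalTag/...} variable resolves for a tagged session. It is only an illustration: the function names are hypothetical, and fnmatch is a rough stand-in for IAM's wildcard matching, not the real policy evaluator.

```python
from fnmatch import fnmatchcase


def resolve_policy_variables(resource: str, session_tags: dict) -> str:
    """Substitute ${aws:PrincipalTag/<key>} placeholders with the session's tag values."""
    for key, value in session_tags.items():
        resource = resource.replace("${aws:PrincipalTag/" + key + "}", value)
    return resource


def resource_matches(pattern: str, arn: str) -> bool:
    """Approximate IAM wildcard matching for a resource ARN (illustration only)."""
    return fnmatchcase(arn, pattern)


template = "arn:aws:s3:::company-data/${aws:PrincipalTag/team}/*"
sales_pattern = resolve_policy_variables(template, {"team": "sales"})

print(sales_pattern)
print(resource_matches(sales_pattern, "arn:aws:s3:::company-data/sales/2024/q1.parquet"))
print(resource_matches(sales_pattern, "arn:aws:s3:::company-data/marketing/leads.csv"))
```

The same role and the same policy text yield a different effective resource for each session, which is the point of ABAC.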

29.3

AWS KMS — Key Management Service

KMS is how AWS handles encryption at rest for almost every data service — S3, Glue, Redshift, DynamoDB, Secrets Manager, and more. As a Data Engineer, you don't write cryptography code — you tell AWS which key to use and KMS does all the heavy lifting. Understanding how KMS works lets you build compliant, auditable, encrypted data pipelines without touching low-level crypto APIs.

🔑

Key Concepts — AWS Managed Keys vs Customer Managed Keys (CMK) FOUNDATION ▼

AWS Managed Keys (aws/service-name)

When you enable encryption on an AWS service (e.g. S3, Glue, CloudWatch Logs) without specifying your own key, AWS automatically creates and manages a key on your behalf. These are called AWS managed keys. You can't rotate them manually, can't restrict who uses them beyond service-level controls, and can't share them across accounts. They are free and zero-configuration — great for default encryption, but limited for compliance requirements.

🏦 Analogy

AWS managed keys are like a bank's master safe that you store your box in. The bank controls the safe key — you trust them, but you have no copy and no control over who else has access to the bank's key.

🆓

AWS Managed Key

Auto-created by AWS. Key alias like aws/s3, aws/glue. No monthly fee. Rotated by AWS on a fixed schedule, no manual rotation. No cross-account use. Usage is logged in CloudTrail, but you can't edit the key policy.

🎛️

Customer Managed Key (CMK)

You create it. You control the policy. You can rotate, disable, share, and audit every use. Costs $1/month + API call charges.

🏠

Customer Provided Key

You supply the raw key bytes on every API call (S3 SSE-C). AWS never stores your key. Rare in practice — complex to manage.

Customer Managed Keys (CMK) — Why Data Engineers Care

For production data pipelines, you should always create CMKs for sensitive data. CMKs let you: (1) restrict which IAM roles can decrypt data, (2) audit every key usage via CloudTrail, (3) rotate keys on a schedule, (4) disable a key to immediately revoke access to all encrypted data, (5) share keys cross-account for data mesh architectures.

python — create a CMK with boto3

import boto3

kms = boto3.client("kms", region_name="us-east-1")

# Create a symmetric CMK for encrypting data lake data
response = kms.create_key(
    Description="CMK for production data lake encryption",
    KeyUsage="ENCRYPT_DECRYPT",       # default — symmetric AES-256 key
    KeySpec="SYMMETRIC_DEFAULT",        # AES-256 GCM
    Policy="""
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Enable IAM Root",
          "Effect": "Allow",
          "Principal": {"AWS": "arn:aws:iam::123456789012:root"},
          "Action": "kms:*",
          "Resource": "*"
        },
        {
          "Sid": "Allow Glue and EMR roles to use the key",
          "Effect": "Allow",
          "Principal": {
            "AWS": [
              "arn:aws:iam::123456789012:role/GlueExecutionRole",
              "arn:aws:iam::123456789012:role/EMRJobRole"
            ]
          },
          "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
          "Resource": "*"
        }
      ]
    }
    """
)

key_id  = response["KeyMetadata"]["KeyId"]
key_arn = response["KeyMetadata"]["Arn"]
print(f"Created CMK: {key_arn}")

# Create a human-readable alias so you don't have to remember the UUID
kms.create_alias(
    AliasName="alias/data-lake-cmk",
    TargetKeyId=key_id
)
# Now you can reference it as alias/data-lake-cmk in all service configs

💡 Key Point

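Always create an alias for your CMKs. Referencing alias/data-lake-cmk is far easier to read than a raw key ID (a UUID, or an mrk- prefixed ID for multi-Region keys) in your configs, and the alias can be repointed to a different key if you ever need to rotate manually. Keep in mind that some KMS operations, such as enable_key_rotation and disable_key, accept only a key ID or key ARN, not an alias.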

📦

Envelope Encryption — How It Works INTERNALS ▼

The Problem: You Can't Encrypt 10 GB with a KMS Key Directly

KMS is designed for small payloads (up to 4 KB per API call). You obviously can't send a 10 GB Parquet file to KMS to encrypt it — that would be impossibly slow and expensive. This is why AWS uses envelope encryption — a two-key system where the actual data is encrypted locally with a fast symmetric key, and only that small key is sent to KMS for protection.

✉️ Analogy

Imagine you need to mail a large document safely. You lock it in a briefcase (envelope encryption with a data key) and then put just the briefcase key in a tiny secure envelope (KMS encrypts the data key). Anyone who wants the document must first go to the secure vault (KMS) to get the briefcase key — they can't open the briefcase without it.

Envelope Encryption Step-by-Step

Here is exactly what happens when S3, Glue, or Redshift encrypts your data with a CMK:

ENVELOPE ENCRYPTION FLOW

WRITE (encrypting data):
  1. Service (S3/Glue) calls KMS: GenerateDataKey(CMK_ARN)
  2. KMS returns TWO things:
     a) Plaintext data key (32 bytes, AES-256)
     b) Encrypted data key (blob encrypted under your CMK)
  3. Service uses plaintext data key to encrypt your data locally
     (very fast, no network call)
  4. Service stores encrypted data key ALONGSIDE the data
     (in S3 object metadata, Redshift block header, etc.)
  5. Plaintext data key is DESTROYED from memory

READ (decrypting data):
  1. Service reads encrypted data key from metadata
  2. Service calls KMS: Decrypt(encrypted_data_key)
     → KMS checks caller has kms:Decrypt permission
     → Returns plaintext data key
  3. Service decrypts data locally with plaintext data key
  4. Plaintext data key is DESTROYED from memory

KEY INSIGHT: Your CMK (master key) NEVER leaves KMS. Only small encrypted data keys travel over the network.

python — manual envelope encryption with boto3 (for understanding)

import boto3
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

kms    = boto3.client("kms")
CMK_ID = "alias/data-lake-cmk"

# ─── ENCRYPT ───────────────────────────────────────────────
# Step 1: Ask KMS for a data key
dk_resp = kms.generate_data_key(KeyId=CMK_ID, KeySpec="AES_256")

plaintext_data_key  = dk_resp["Plaintext"]      # 32 bytes — USE then DESTROY
encrypted_data_key  = dk_resp["CiphertextBlob"] # store alongside your data

# Step 2: Encrypt your actual data LOCALLY (fast, no KMS call)
data_to_encrypt = b"sensitive customer data here"
nonce = os.urandom(12)                            # 96-bit nonce for AES-GCM
aesgcm = AESGCM(plaintext_data_key)
ciphertext = aesgcm.encrypt(nonce, data_to_encrypt, None)

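# Step 3: Drop the reference to the plaintext key as soon as you are done.
# (Python's del only removes the name; it does not zero the bytes in memory.)
del plaintext_data_key
# Store: ciphertext + nonce + encrypted_data_key together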

# ─── DECRYPT ───────────────────────────────────────────────
# Step 1: Call KMS to decrypt the encrypted data key
dk_plain = kms.decrypt(CiphertextBlob=encrypted_data_key)["Plaintext"]

# Step 2: Decrypt locally
aesgcm2    = AESGCM(dk_plain)
decrypted  = aesgcm2.decrypt(nonce, ciphertext, None)
print(decrypted)  # b"sensitive customer data here"

# Step 3: Clear plaintext key
del dk_plain
# Note: In practice S3/Glue/Redshift do ALL of this for you automatically!

⚠️ Important

In real data pipelines you never do this manually. S3, Glue, Redshift, and all AWS services implement envelope encryption internally — you just pass the CMK ARN in the service config and everything else is automatic. The code above is for understanding the mechanism.

🪣

S3 + KMS — SSE-KMS Integration DATA LAKE ▼

Server-Side Encryption with KMS (SSE-KMS)

When you upload an object to S3 with SSE-KMS, S3 calls KMS to get a data key, encrypts your object with it, and stores only the encrypted data key in the object's metadata. On download, S3 calls kms:Decrypt on your behalf. The caller needs both an S3 read permission AND a kms:Decrypt permission — this double gate is powerful for access control in a data lake.

python — upload to S3 with SSE-KMS

import boto3

s3     = boto3.client("s3")
CMK_ID = "arn:aws:kms:us-east-1:123456789012:alias/data-lake-cmk"

# Upload with SSE-KMS using your CMK
s3.upload_file(
    Filename="sales_data.parquet",
    Bucket="my-data-lake",
    Key="gold/sales/2024/sales_data.parquet",
    ExtraArgs={
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": CMK_ID
    }
)

# Alternatively via put_object
s3.put_object(
    Bucket="my-data-lake",
    Key="gold/users/users.parquet",
    Body=b"parquet bytes here",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId=CMK_ID
)
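
After an upload it can be useful to confirm that the object really was encrypted with the intended key. One way, sketched below, is to read the ServerSideEncryption and SSEKMSKeyId fields that head_object returns. The helper names are hypothetical, and the check assumes S3 reports the key as a full key ARN.

```python
def encryption_of(s3_client, bucket: str, key: str) -> tuple:
    """Return (algorithm, kms_key_id) that S3 reports for one object."""
    head = s3_client.head_object(Bucket=bucket, Key=key)
    # SSEKMSKeyId is only present when the object is encrypted with a KMS key
    return head.get("ServerSideEncryption"), head.get("SSEKMSKeyId")


def is_encrypted_with(s3_client, bucket: str, key: str, expected_key_arn: str) -> bool:
    """True when the object uses SSE-KMS with exactly the expected key."""
    algorithm, kms_key = encryption_of(s3_client, bucket, key)
    return algorithm == "aws:kms" and kms_key == expected_key_arn


# Typical use (needs AWS access, so it is left as a comment):
#   import boto3
#   ok = is_encrypted_with(boto3.client("s3"), "my-data-lake",
#                          "gold/sales/2024/sales_data.parquet",
#                          "arn:aws:kms:us-east-1:123456789012:key/abcd-1234")
```

Note that the caller needs permission to read the object's metadata for head_object to succeed.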

Enforce SSE-KMS via Bucket Policy (Deny Unencrypted Uploads)

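Best practice is to deny all uploads that don't use SSE-KMS via a bucket policy. This ensures no developer accidentally stores raw unencrypted data in your secure data lake bucket: the upload fails with a 403 AccessDenied. Because the policy compares the request headers as plain strings, writers must send the full key ARN in SSEKMSKeyId (a key ID or alias will not match), and uploads that omit the encryption headers and rely on bucket default encryption are denied as well.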

python — attach deny-unencrypted bucket policy

import json, boto3

s3     = boto3.client("s3")
BUCKET = "my-data-lake"
CMK_ARN = "arn:aws:kms:us-east-1:123456789012:key/abcd-1234"

# Bucket policy that DENIES any upload without SSE-KMS using our CMK
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyNonKMSUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {
                "StringNotEquals": {
                    "s3:x-amz-server-side-encryption": "aws:kms"
                }
            }
        },
        {
            "Sid": "DenyWrongKMSKey",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {
                "StringNotEquals": {
                    "s3:x-amz-server-side-encryption-aws-kms-key-id": CMK_ARN
                }
            }
        }
    ]
}

s3.put_bucket_policy(
    Bucket=BUCKET,
    Policy=json.dumps(policy)
)
print("Bucket policy applied — unencrypted uploads will now be denied.")
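
The two Deny statements can be reasoned about locally before the policy goes live. The function below is a deliberately simplified model of those two StringNotEquals conditions (it is not S3's policy engine, and its name is hypothetical). It relies on the IAM rule that a negated string operator evaluates to true when the condition key is missing from the request.

```python
def put_is_denied(request_headers: dict, required_key_arn: str) -> bool:
    """Model the DenyNonKMSUploads and DenyWrongKMSKey statements for one PutObject."""
    sse = request_headers.get("x-amz-server-side-encryption")
    key = request_headers.get("x-amz-server-side-encryption-aws-kms-key-id")
    # StringNotEquals is also satisfied when the header is absent (None here)
    deny_non_kms = sse != "aws:kms"
    deny_wrong_key = key != required_key_arn
    return deny_non_kms or deny_wrong_key


CMK = "arn:aws:kms:us-east-1:123456789012:key/abcd-1234"

cases = {
    "no encryption headers": {},
    "SSE-S3": {"x-amz-server-side-encryption": "AES256"},
    "SSE-KMS with alias": {
        "x-amz-server-side-encryption": "aws:kms",
        "x-amz-server-side-encryption-aws-kms-key-id": "alias/data-lake-cmk",
    },
    "SSE-KMS with key ARN": {
        "x-amz-server-side-encryption": "aws:kms",
        "x-amz-server-side-encryption-aws-kms-key-id": CMK,
    },
}
for name, headers in cases.items():
    print(f"{name}: {'DENIED' if put_is_denied(headers, CMK) else 'allowed'}")
```

Only the last case passes, which is why pipeline writers should send the full key ARN explicitly.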

📋 Real World

At most regulated companies (finance, healthcare), the infosec team enforces this bucket policy on all data lake buckets as a baseline control. Glue jobs and EMR jobs must be granted kms:GenerateDataKey and kms:Decrypt on the CMK or their writes will fail with AccessDenied.

⚙️

Glue Job Encryption with KMS ETL ENCRYPTION ▼

What Glue Encrypts with KMS

A Glue Security Configuration can encrypt three things with your CMK: (1) the data a job writes to S3, (2) CloudWatch logs emitted by the Glue job, and (3) job bookmarks (state tracking for incremental loads). A Security Configuration is a reusable encryption config you attach to Glue jobs and crawlers. Metadata stored in the Glue Data Catalog is encrypted separately, through the catalog's own encryption settings.

python — create Glue security configuration with KMS

import boto3

glue   = boto3.client("glue")
CMK_ARN = "arn:aws:kms:us-east-1:123456789012:alias/data-lake-cmk"

# Create a security configuration
glue.create_security_configuration(
    Name="data-lake-security-config",
    EncryptionConfiguration={
        "S3Encryption": [
            {
                "S3EncryptionMode": "SSE-KMS",     # encrypt all S3 output
                "KmsKeyArn": CMK_ARN
            }
        ],
        "CloudWatchEncryption": {
            "CloudWatchEncryptionMode": "SSE-KMS",  # encrypt job logs
            "KmsKeyArn": CMK_ARN
        },
        "JobBookmarksEncryption": {
            "JobBookmarksEncryptionMode": "CSE-KMS", # encrypt incremental state
            "KmsKeyArn": CMK_ARN
        }
    }
)

# Attach security configuration when creating / updating a Glue job
glue.create_job(
    Name="sales-etl-job",
    Role="arn:aws:iam::123456789012:role/GlueExecutionRole",
    Command={"Name": "glueetl", "ScriptLocation": "s3://scripts/etl.py", "PythonVersion": "3"},
    SecurityConfiguration="data-lake-security-config",  # ← attach here
    GlueVersion="4.0",
    NumberOfWorkers=10,
    WorkerType="G.1X"
)

⚠️ IAM Permission Required

The Glue execution role must have kms:GenerateDataKey, kms:Decrypt, and kms:DescribeKey on the CMK. If these are missing, the Glue job will fail with AccessDeniedException before it even reads a single row.
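
Encryption of Data Catalog metadata can be switched on with the catalog-level put_data_catalog_encryption_settings call. The sketch below builds the settings structure and applies it; the two helper names are hypothetical, and whether you also want connection passwords encrypted is an assumption made here for illustration.

```python
def catalog_encryption_settings(kms_key_arn: str) -> dict:
    """Settings that encrypt catalog metadata and connection passwords with one key."""
    return {
        "EncryptionAtRest": {
            "CatalogEncryptionMode": "SSE-KMS",
            "SseAwsKmsKeyId": kms_key_arn,
        },
        "ConnectionPasswordEncryption": {
            "ReturnConnectionPasswordEncrypted": True,
            "AwsKmsKeyId": kms_key_arn,
        },
    }


def encrypt_data_catalog(glue_client, kms_key_arn: str) -> None:
    """Apply the settings to the Data Catalog of the caller's account and Region."""
    glue_client.put_data_catalog_encryption_settings(
        DataCatalogEncryptionSettings=catalog_encryption_settings(kms_key_arn)
    )


# Typical use (needs AWS access, so it is left as a comment):
#   import boto3
#   encrypt_data_catalog(boto3.client("glue"),
#                        "arn:aws:kms:us-east-1:123456789012:key/abcd-1234")
```

This is an account-wide setting per Region, so it is normally applied once by a platform team rather than from individual jobs.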

🏢

Redshift Encryption with KMS DATA WAREHOUSE ▼

Enabling KMS Encryption on a Redshift Cluster

Redshift encrypts data at the block level using a hierarchy of keys: your CMK (the root key) encrypts the cluster encryption key (CEK), the CEK encrypts the database encryption key (DEK), and the DEK encrypts the randomly generated keys that protect each data block. The result is that no Redshift data is readable without a live, enabled CMK. You can set encryption at cluster creation time, or enable it later by modifying the cluster, in which case Redshift migrates the data to encrypted storage for you.

python — create encrypted Redshift cluster

import boto3

redshift = boto3.client("redshift")
CMK_ARN  = "arn:aws:kms:us-east-1:123456789012:alias/data-lake-cmk"

redshift.create_cluster(
    ClusterIdentifier="prod-dwh",
    NodeType="ra3.4xlarge",
    NumberOfNodes=4,
    MasterUsername="admin",
    MasterUserPassword="ChangeMe123!",  # use Secrets Manager in prod!
    DBName="analytics",
    Encrypted=True,           # enable encryption
    KmsKeyId=CMK_ARN,          # use our CMK — not AWS managed key
    VpcSecurityGroupIds=["sg-abc123"],
    ClusterSubnetGroupName="redshift-subnet-group"
)
print("Encrypted Redshift cluster creation started.")

💡 Key Point

With a CMK on Redshift, you can disable the key in KMS to instantly render the entire data warehouse unreadable — a powerful "emergency stop" for data breach scenarios. Re-enabling the key restores access. This is a compliance requirement in some regulated industries.
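
A quick audit helps confirm that every cluster in an account is actually encrypted. The sketch below pages through describe_clusters and reports clusters whose Encrypted flag is false; the function name is hypothetical, and the Redshift client is passed in so the logic can be tested without AWS access.

```python
def unencrypted_clusters(redshift_client) -> list:
    """Return the identifiers of Redshift clusters that are not encrypted at rest."""
    found = []
    paginator = redshift_client.get_paginator("describe_clusters")
    for page in paginator.paginate():
        for cluster in page["Clusters"]:
            if not cluster.get("Encrypted", False):
                found.append(cluster["ClusterIdentifier"])
    return found


# Typical use (needs AWS access, so it is left as a comment):
#   import boto3
#   for cluster_id in unencrypted_clusters(boto3.client("redshift")):
#       print("NOT ENCRYPTED:", cluster_id)
```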

🗝️

Secrets Manager — KMS Key Association CREDENTIALS ▼

Using a CMK to Encrypt Secrets

By default, Secrets Manager encrypts secrets with an AWS managed key (aws/secretsmanager). For production pipelines, use your own CMK: you control who can decrypt through the key policy, each decryption shows up as a KMS event in CloudTrail, and you can revoke access by disabling the key.

python — create a secret encrypted with your CMK

import boto3, json

sm      = boto3.client("secretsmanager")
CMK_ARN = "arn:aws:kms:us-east-1:123456789012:alias/data-lake-cmk"

# Create a secret encrypted with our CMK
sm.create_secret(
    Name="prod/postgresql/credentials",
    Description="PostgreSQL DB credentials for ETL pipeline",
    KmsKeyId=CMK_ARN,   # ← associate our CMK here
    SecretString=json.dumps({
        "host":     "prod-db.cluster-xyz.us-east-1.rds.amazonaws.com",
        "port":     5432,
        "username": "etl_user",
        "password": "SuperSecret123!",
        "dbname":   "analytics"
    })
)
print("Secret stored with CMK encryption.")

🔄

Key Rotation — Automatic vs Manual COMPLIANCE ▼

Automatic Key Rotation

KMS can automatically rotate a CMK every 90 days to 2560 days (you choose). When rotation happens, AWS generates new cryptographic material for the key, but keeps the same Key ID and ARN, so your configs don't need to change. Existing data is not re-encrypted: KMS retains every previous version of the key material and automatically uses the right one when it decrypts older ciphertext. This is the recommended approach for most use cases.

python — enable automatic key rotation

import boto3

kms    = boto3.client("kms")
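# The rotation APIs accept only a key ID or key ARN, not an alias,
# so resolve the alias first (describe_key does accept aliases).
KEY_ID = kms.describe_key(KeyId="alias/data-lake-cmk")["KeyMetadata"]["KeyId"]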

# Enable automatic rotation (rotates every 365 days by default)
kms.enable_key_rotation(
    KeyId=KEY_ID,
    RotationPeriodInDays=365   # annual rotation (90–2560 days)
)

# Check rotation status
status = kms.get_key_rotation_status(KeyId=KEY_ID)
print(status["KeyRotationEnabled"])          # True
print(status.get("NextRotationDate"))        # e.g. 2025-01-15T00:00:00Z

# You can also trigger a manual on-demand rotation right now
kms.rotate_key_on_demand(KeyId=KEY_ID)
print("On-demand rotation triggered.")

Manual Rotation — When You Need a Completely New Key

Automatic rotation keeps the same Key ID. But sometimes you need a completely new key, for example when a key is compromised, or when compliance requires a new key for each year's data. In that case you create a new CMK and update your alias to point to it. Old data stays decryptable only while the old key still exists and is enabled, because each ciphertext records which KMS key encrypted it. New data is encrypted with the new key.

python — manual rotation: update alias to new key

import boto3

kms = boto3.client("kms")

# Step 1: Create a brand new CMK
new_key = kms.create_key(
    Description="Data lake CMK v2 — 2025 rotation",
    KeyUsage="ENCRYPT_DECRYPT",
    KeySpec="SYMMETRIC_DEFAULT"
)
new_key_id = new_key["KeyMetadata"]["KeyId"]

# Step 2: Move the alias to the new key
# (all configs referencing "alias/data-lake-cmk" now use the new key)
kms.update_alias(
    AliasName="alias/data-lake-cmk",
    TargetKeyId=new_key_id
)
print(f"Alias now points to new key: {new_key_id}")
# Old data encrypted with the old key is still readable while that key exists.
# New data is encrypted with new_key_id.

# Step 3: Only after ALL old data has been re-encrypted under the new key,
# schedule the old key for deletion (7–30 day waiting period).
# Deleting a key makes anything still encrypted under it permanently unreadable.
OLD_KEY_ID = "mrk-previous-key-id"   # placeholder for the old key's ID
kms.schedule_key_deletion(
    KeyId=OLD_KEY_ID,
    PendingWindowInDays=30   # 30-day window in which the deletion can be cancelled
)
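
Before the old key is deleted, anything still wrapped under it has to be moved to the new key. For data keys you stored yourself (as in the manual envelope-encryption example), KMS can re-wrap the encrypted data key server-side with re_encrypt, so the plaintext key never leaves KMS. The sketch below assumes that pattern; the function name is hypothetical, and objects encrypted by S3 itself would instead need to be copied with the new key.

```python
def rewrap_data_key(kms_client, encrypted_data_key: bytes, new_key_id: str) -> bytes:
    """Re-encrypt an encrypted data key under a different KMS key."""
    response = kms_client.re_encrypt(
        CiphertextBlob=encrypted_data_key,   # blob produced under the old key
        DestinationKeyId=new_key_id,         # key ID, key ARN, or alias of the new key
    )
    return response["CiphertextBlob"]


# Typical use (needs AWS access, so it is left as a comment):
#   import boto3
#   new_blob = rewrap_data_key(boto3.client("kms"), old_blob, "alias/data-lake-cmk")
#   # store new_blob next to the ciphertext in place of old_blob
```

The data itself is untouched; only the small wrapped key changes. The caller needs kms:ReEncryptFrom on the old key and kms:ReEncryptTo on the new one.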

📋

KMS Integration Quick Reference SUMMARY ▼

Which Services Use KMS and How

AWS Service	How KMS is Used	Config Location	IAM Permission Needed
Amazon S3	SSE-KMS on each object upload	Upload ExtraArgs or bucket default encryption	`kms:GenerateDataKey`, `kms:Decrypt`
AWS Glue	Job bookmarks, logs, S3 output (Data Catalog is encrypted separately)	Security Configuration attached to job; Data Catalog settings for the catalog	`kms:GenerateDataKey`, `kms:Decrypt`, `kms:DescribeKey`
Amazon Redshift	Block-level cluster encryption	Cluster creation parameter `KmsKeyId`	IAM role association at cluster level
Secrets Manager	Encrypts secret value at rest	`KmsKeyId` on `create_secret()`	`kms:Decrypt`, `kms:GenerateDataKey`
CloudWatch Logs	Encrypt log group at rest	`associate-kms-key` on log group	`kms:GenerateDataKey`, `kms:Decrypt`
Amazon MSK	Encrypts data at rest on broker	Cluster encryption config at creation	Automatic via MSK service role
Amazon RDS	Storage volume encryption	`KmsKeyId` at DB creation	Automatic via RDS service role

KMS API Calls Every Data Engineer Should Know

create_key() create_alias() update_alias() describe_key() generate_data_key() decrypt() enable_key_rotation() get_key_rotation_status() disable_key() schedule_key_deletion() list_aliases()

python — KMS housekeeping APIs

import boto3

kms = boto3.client("kms")

# List all customer managed keys (list_keys also returns AWS managed keys)
for page in kms.get_paginator("list_keys").paginate():
    for key in page["Keys"]:
        detail = kms.describe_key(KeyId=key["KeyId"])["KeyMetadata"]
        if detail["KeyManager"] == "CUSTOMER":   # only our CMKs
            print(detail["KeyId"], detail["Description"], detail["KeyState"])

# disable_key, enable_key, and get_key_policy need a key ID or key ARN (no aliases)
key_id = kms.describe_key(KeyId="alias/data-lake-cmk")["KeyMetadata"]["KeyId"]

# Disable a key (emergency: data encrypted under it becomes unreadable)
kms.disable_key(KeyId=key_id)

# Re-enable a key
kms.enable_key(KeyId=key_id)

# Check if a key policy allows your Glue role to use it
policy = kms.get_key_policy(
    KeyId=key_id,
    PolicyName="default"
)["Policy"]
print(policy)   # JSON string; inspect it for your role ARN

KMS Costs — What to Know

KMS charges $1/month per CMK plus $0.03 per 10,000 API calls. For a busy data lake with thousands of Glue jobs writing millions of objects to S3, KMS costs can add up. Use S3 Bucket Keys to dramatically reduce KMS API calls: instead of calling KMS for every object, S3 obtains a short-lived bucket-level key and uses it to create data keys for many objects, reducing KMS calls by up to 99%.

python — enable S3 Bucket Key (reduces KMS API costs ~99%)

import boto3

s3     = boto3.client("s3")
BUCKET = "my-data-lake"
CMK_ARN = "arn:aws:kms:us-east-1:123456789012:alias/data-lake-cmk"

# Enable default SSE-KMS with Bucket Key on the bucket
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": CMK_ARN
            },
            "BucketKeyEnabled": True   # ← this is the cost saver
        }]
    }
)
print("Bucket Key enabled — KMS API costs reduced by ~99%.")
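
To see what those prices mean in practice, here is a back-of-the-envelope estimate using the rates quoted above ($1 per key per month, $0.03 per 10,000 requests). The workload of 500 million KMS requests per month is an assumed figure for illustration, the 99% reduction is the best case, and any free tier is ignored.

```python
KEY_FEE_PER_MONTH = 1.00      # USD per customer managed key
REQUEST_FEE_PER_10K = 0.03    # USD per 10,000 KMS API requests


def monthly_kms_cost(num_keys: int, requests_per_month: int,
                     bucket_key_reduction: float = 0.0) -> float:
    """Estimate the monthly KMS bill; bucket_key_reduction is a fraction from 0 to 1."""
    billable = requests_per_month * (1.0 - bucket_key_reduction)
    return num_keys * KEY_FEE_PER_MONTH + billable / 10_000 * REQUEST_FEE_PER_10K


requests = 500_000_000  # assumed workload: 500M KMS requests per month

without_bucket_key = monthly_kms_cost(1, requests)
with_bucket_key = monthly_kms_cost(1, requests, bucket_key_reduction=0.99)

print(f"Without Bucket Key: ${without_bucket_key:,.2f}")   # $1,501.00
print(f"With Bucket Key:    ${with_bucket_key:,.2f}")      # $16.00
```

The key fee is negligible either way; request volume is what drives the bill.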

☁️ 29.3 Summary

AWS managed keys = zero config, free, limited control. CMKs = full control, auditable, cross-account shareable. Envelope encryption = CMK never encrypts data directly; it protects a small data key that encrypts the actual data. S3 SSE-KMS + bucket policy = the standard data lake encryption pattern. Glue Security Config = attach CMK to Glue jobs for encrypted logs, bookmarks, and output. Enable Bucket Keys in S3 to avoid runaway KMS API costs.

29.4

AWS Secrets Manager

Every pipeline needs credentials — database passwords, API keys, Snowflake tokens, Kafka credentials. Hard-coding them is a security disaster. AWS Secrets Manager is the production solution: store secrets centrally, retrieve them in Python at runtime, rotate them automatically, and audit every access via CloudTrail. It integrates natively with Glue, Lambda, Airflow, and any boto3-based pipeline.

🗝️

Storing Database Credentials, API Keys, Tokens FOUNDATION ▼

What You Store in Secrets Manager

Secrets Manager stores any sensitive string — most commonly a JSON object with multiple fields like {"username": "db_user", "password": "db_pass", "host": "db.example.com"}. You can also store plain strings (e.g. a raw API token). Every secret has a name (the lookup key), a value (the secret payload), and optional metadata like a description, tags, and KMS key.

🏦 Analogy

Secrets Manager is like a bank vault with individual numbered safety deposit boxes. Each box (secret name) has a lock (KMS encryption). Only people with the right key card (IAM permissions) can open a box, and the bank logs every access on camera (CloudTrail). You never carry the contents around — you visit the vault each time you need them.

⚠️ Never Do This in Production

DB_PASSWORD = "my_super_secret_123" — hardcoded credentials in code or environment variables that get committed to git. This is how data breaches happen. Always fetch from Secrets Manager at runtime.

python — create a secret with boto3

import boto3, json

sm = boto3.client("secretsmanager", region_name="us-east-1")

# Store a database credential as JSON — the standard pattern
secret_value = json.dumps({
    "username": "pipeline_user",
    "password": "S3cr3tP@ssw0rd!",
    "host":     "prod-rds.cluster-abc.us-east-1.rds.amazonaws.com",
    "port":     5432,
    "dbname":  "analytics"
})

response = sm.create_secret(
    Name="prod/rds/pipeline-user",          # hierarchical name
    Description="RDS credentials for the nightly ETL pipeline",
    SecretString=secret_value,
    KmsKeyId="alias/data-lake-cmk",          # encrypt with CMK (not default key)
    Tags=[
        {"Key": "environment", "Value": "production"},
        {"Key": "team",        "Value": "data-engineering"}
    ]
)

print(f"Secret ARN: {response['ARN']}")
# arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/rds/pipeline-user-AbCdEf

Secret Naming Convention — Hierarchical Paths

Like Parameter Store, Secrets Manager supports slash-delimited hierarchical names. This is not just cosmetic — IAM policies can grant access to entire subtrees like prod/rds/* or prod/*, enabling fine-grained role-based access where the staging Glue role can only read staging/* secrets and the production role can only read prod/*.

🗂️

Environment/Service/Name

prod/rds/pipeline-user
prod/snowflake/etl-role
staging/kafka/schema-registry

📦

Project/Env/Service

datalake/prod/redshift
datalake/prod/s3-keys
datalake/staging/rds

🔑

IAM Policy Pattern

Grant prod-glue-role access to prod/* only. Grant staging-emr-role access to staging/* only. Zero cross-env risk.
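
As an illustration of that pattern, the helper below builds an identity policy that lets a role read only the secrets under one prefix. The function name is hypothetical and the account ID is a placeholder. If the secrets are encrypted with a CMK, the role also needs kms:Decrypt on that key, which is not shown here.

```python
import json


def secrets_read_policy(region: str, account_id: str, prefix: str) -> dict:
    """Allow reading every secret whose name starts with '<prefix>/'."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
            ],
            "Resource": f"arn:aws:secretsmanager:{region}:{account_id}:secret:{prefix}/*",
        }],
    }


prod_policy = secrets_read_policy("us-east-1", "123456789012", "prod")
print(json.dumps(prod_policy, indent=2))
# Attach this document to prod-glue-role; build a second one with prefix="staging"
# for staging-emr-role.
```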

🐍

Retrieving Secrets in Python — get_secret_value() MOST USED API ▼

The Standard Pattern — get_secret_value() + json.loads()

This is the most important boto3 pattern in this section — you will write this in every pipeline that connects to a database, Kafka cluster, or any external API. The response has a SecretString field (for text secrets) — always json.loads() it to extract individual fields like host, port, password.

python — the canonical secrets retrieval pattern every DE must know

import boto3, json
from botocore.exceptions import ClientError

def get_secret(secret_name: str, region: str = "us-east-1") -> dict:
    """Retrieve a JSON secret from AWS Secrets Manager."""
    sm = boto3.client("secretsmanager", region_name=region)
    try:
        response = sm.get_secret_value(SecretId=secret_name)
    except ClientError as e:
        code = e.response["Error"]["Code"]
        if code == "ResourceNotFoundException":
            raise ValueError(f"Secret '{secret_name}' does not exist.") from e
        elif code == "AccessDeniedException":
            raise PermissionError(f"IAM role lacks permission to read '{secret_name}'.") from e
        else:
            raise   # re-raise unexpected errors

    # Text secrets arrive in SecretString; binary secrets arrive in SecretBinary instead
    if "SecretString" not in response:
        raise ValueError(f"Secret '{secret_name}' is binary, not a JSON string.")
    return json.loads(response["SecretString"])


# ── Usage — RDS connection ───────────────────────────────────
creds = get_secret("prod/rds/pipeline-user")

# creds is now a plain dict — unpack what you need
DB_HOST = creds["host"]
DB_PORT = creds["port"]
DB_USER = creds["username"]
DB_PASS = creds["password"]
DB_NAME = creds["dbname"]

# Use them to build a JDBC URL for Spark (assumes an existing SparkSession named spark)
jdbc_url = f"jdbc:postgresql://{DB_HOST}:{DB_PORT}/{DB_NAME}"

df = spark.read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("user", DB_USER) \
    .option("password", DB_PASS) \
    .option("dbtable", "public.orders") \
    .load()

Using Secrets in AWS Glue Jobs

In a Glue job, your script runs inside the Glue job environment under the Glue execution role. As long as that role has secretsmanager:GetSecretValue on the secret ARN, calling get_secret_value() inside the Glue script requires no extra configuration. There is no need to pass credentials as job parameters; just call the API.

python — inside a Glue ETL job script

# glue_etl_job.py: this script runs inside the Glue job
import sys, json, boto3
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc   = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# ── Fetch Snowflake creds from Secrets Manager at runtime ──
sm = boto3.client("secretsmanager")
sf_creds = json.loads(
    sm.get_secret_value(SecretId="prod/snowflake/etl-role")["SecretString"]
)

# ── Use them in the Spark Snowflake connector ──
sf_options = {
    "sfURL":      sf_creds["sfURL"],
    "sfUser":     sf_creds["sfUser"],
    "sfPassword": sf_creds["sfPassword"],
    "sfDatabase": "ANALYTICS",
    "sfWarehouse":"ETL_WH",
    "sfSchema":   "SILVER",
}

df = spark.read \
    .format("net.snowflake.spark.snowflake") \
    .options(**sf_options) \
    .option("dbtable", "orders") \
    .load()

# creds never appear in logs, git, or job parameters — they are fetched live

📌 Security Best Practice

Never pass credentials as Glue job arguments (--db_password=mypass) — those appear in the Glue console, CloudTrail logs, and are visible to anyone who can describe the job. Always use Secrets Manager.

Using Secrets in AWS Lambda

Lambda is a natural consumer of Secrets Manager. The key best practice in Lambda is to cache the secret outside the handler function — Lambda reuses the same container for multiple invocations, so fetching the secret once and caching it in a module-level variable avoids a Secrets Manager API call on every single invocation.

python — Lambda with secret caching (production pattern)

import json, boto3

sm = boto3.client("secretsmanager")

# ── Module-level cache: fetched once per execution environment (on cold start) ──
_DB_CREDS = None

def get_db_creds():
    global _DB_CREDS
    if _DB_CREDS is None:
        raw = sm.get_secret_value(SecretId="prod/rds/pipeline-user")
        _DB_CREDS = json.loads(raw["SecretString"])
    return _DB_CREDS

def lambda_handler(event, context):
    creds = get_db_creds()   # uses cached value after first call
    # ... use creds to connect to RDS and process the event ...
    return {"statusCode": 200, "body": "OK"}

# First invocation: hits the Secrets Manager API (one network round trip)
# Subsequent invocations on the same container: read from _DB_CREDS (no API call)
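
One caveat with a plain module-level cache is that a long-lived container keeps serving the old password after the secret is rotated. A common mitigation, sketched below under that assumption, is to give the cache a time-to-live and to drop it when a connection attempt fails. The CachedSecret class, its 5-minute default TTL, and the injectable clock are illustrative choices, not an AWS API.

```python
import json
import time


class CachedSecret:
    """Cache one JSON secret in memory and refetch it after ttl_seconds."""

    def __init__(self, sm_client, secret_id: str, ttl_seconds: float = 300.0,
                 clock=time.monotonic):
        self._client = sm_client
        self._secret_id = secret_id
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._fetched_at = 0.0

    def get(self) -> dict:
        now = self._clock()
        if self._value is None or now - self._fetched_at >= self._ttl:
            raw = self._client.get_secret_value(SecretId=self._secret_id)
            self._value = json.loads(raw["SecretString"])
            self._fetched_at = now
        return self._value

    def invalidate(self) -> None:
        """Force a refetch on the next get(), e.g. after an authentication failure."""
        self._value = None


# Typical use at module level in a Lambda (left as a comment, needs AWS access):
#   import boto3
#   DB_CREDS = CachedSecret(boto3.client("secretsmanager"), "prod/rds/pipeline-user")
#   def lambda_handler(event, context):
#       creds = DB_CREDS.get()
```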

Using Secrets in Airflow

Airflow has a built-in Secrets Manager backend. When configured, Airflow automatically fetches connections and variables from Secrets Manager instead of its metadata database. This means you manage all your production credentials in one place (Secrets Manager) rather than in the Airflow UI — much more secure and auditable.

python — airflow.cfg / environment config for Secrets Manager backend

# In airflow.cfg or MWAA environment variables:
# [secrets]
# backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
# backend_kwargs = {"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables"}

# Then a connection stored as:  airflow/connections/my_rds_conn
# and a variable stored as:     airflow/variables/my_s3_bucket
# are automatically available in DAGs as:
#   BaseHook.get_connection("my_rds_conn")
#   Variable.get("my_s3_bucket")

# Manual retrieval inside a DAG task (if not using the backend):
import boto3, json

def my_task(**context):
    sm = boto3.client("secretsmanager")
    creds = json.loads(
        sm.get_secret_value(SecretId="prod/rds/pipeline-user")["SecretString"]
    )
    # build JDBC URL and run Spark job...

🔄

Automatic Secret Rotation SECURITY ▼

How Automatic Rotation Works

Secrets Manager can automatically rotate a secret on a schedule (e.g. every 30 days). Under the hood, it invokes an AWS Lambda function four times, once per step: (1) createSecret generates a new password and stores it as a new secret version labeled AWSPENDING, (2) setSecret applies that password in the target service (e.g. RDS), (3) testSecret checks that the new credentials work, and (4) finishSecret moves the AWSCURRENT label to the new version. Your pipeline code doesn't need to change: next time it calls get_secret_value(), it gets the new credentials automatically.

🔐 SM schedules rotation

→

λ Lambda stages new password (AWSPENDING)

→

λ Lambda sets new password in RDS

→

λ Lambda tests new creds

→

✅ Rotation complete

python — enable automatic rotation with boto3

import boto3

sm = boto3.client("secretsmanager")

# Enable rotation — AWS has built-in Lambda functions for RDS, Redshift, etc.
sm.rotate_secret(
    SecretId="prod/rds/pipeline-user",
    RotationLambdaARN="arn:aws:lambda:us-east-1:123456789012:function:SecretsManagerRDSRotation",
    RotationRules={
        "AutomaticallyAfterDays": 30   # rotate every 30 days
    }
)

# For supported services (RDS, Aurora, Redshift, DocumentDB), AWS provides
# pre-built rotation Lambda functions in the Serverless Application Repository.
# For custom services (Snowflake, APIs), you write your own rotation Lambda.

# Check rotation status
secret_meta = sm.describe_secret(SecretId="prod/rds/pipeline-user")
print("Rotation enabled:", secret_meta.get("RotationEnabled"))
print("Last rotated:",    secret_meta.get("LastRotatedDate"))
print("Next rotation:",   secret_meta.get("NextRotationDate"))

The Versioning System — AWSCURRENT vs AWSPREVIOUS

Secrets Manager keeps multiple versions of a secret using staging labels. During rotation, the new version is staged as AWSPENDING, then promoted to AWSCURRENT when tests pass. The old version is moved to AWSPREVIOUS and kept temporarily. Your pipeline always fetches AWSCURRENT by default, but can explicitly request AWSPREVIOUS if needed for rollback.

python — fetch a specific version of a secret

import boto3, json

sm = boto3.client("secretsmanager")

# Default: fetches AWSCURRENT (latest valid secret)
current = json.loads(sm.get_secret_value(SecretId="prod/rds/pipeline-user")["SecretString"])

# Explicitly fetch the previous version (rollback scenario)
previous = json.loads(sm.get_secret_value(
    SecretId="prod/rds/pipeline-user",
    VersionStage="AWSPREVIOUS"
)["SecretString"])

# List all versions of a secret to see the labels
versions = sm.list_secret_version_ids(SecretId="prod/rds/pipeline-user")
for v in versions["Versions"]:
    print(v["VersionId"], v.get("VersionStages"))
# e.g:
# abc123  ['AWSCURRENT']
# def456  ['AWSPREVIOUS']
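
The label movement described above can be modelled in a few lines of plain Python. This is a simplified sketch that makes no AWS calls: `promote_pending` is a hypothetical name, and in the real service the AWSPENDING label may remain attached to the promoted version.

```python
def promote_pending(stages: dict, new_version_id: str) -> dict:
    """Return a new VersionId -> labels mapping after a successful rotation."""
    updated = {}
    for version_id, labels in stages.items():
        if version_id == new_version_id:
            continue                                 # handled after the loop
        if "AWSCURRENT" in labels:
            updated[version_id] = ["AWSPREVIOUS"]    # old current becomes previous
        # a version that only held AWSPREVIOUS loses its label and drops out
    updated[new_version_id] = ["AWSCURRENT"]
    return updated


before = {"abc123": ["AWSCURRENT"], "def456": ["AWSPREVIOUS"], "ghi789": ["AWSPENDING"]}
after = promote_pending(before, "ghi789")
print(after)
# {'abc123': ['AWSPREVIOUS'], 'ghi789': ['AWSCURRENT']}
```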

📋 Real World

During rotation, there's a brief window where both old and new passwords exist. If your pipeline runs mid-rotation and gets an authentication error, it should catch the error, call get_secret_value() again (to get the freshly rotated current value), and retry the connection once before failing.

Writing a Custom Rotation Lambda

For services not supported out of the box (Snowflake, Kafka, external APIs), you write a Lambda with four specific handler cases that Secrets Manager calls in sequence: createSecret, setSecret, testSecret, and finishSecret.

python — skeleton of a custom rotation Lambda

import boto3, json, string, secrets

sm = boto3.client("secretsmanager")

def lambda_handler(event, context):
    step       = event["Step"]
    secret_id  = event["SecretId"]
    token      = event["ClientRequestToken"]   # new version ID

    if step == "createSecret":
        # Generate a new password and store as AWSPENDING version
        new_pass = "".join(secrets.choice(string.ascii_letters + string.digits) for _ in range(32))
        current  = json.loads(sm.get_secret_value(SecretId=secret_id)["SecretString"])
        current["password"] = new_pass
        sm.put_secret_value(
            SecretId=secret_id, ClientRequestToken=token,
            SecretString=json.dumps(current), VersionStages=["AWSPENDING"]
        )

    elif step == "setSecret":
        # Apply the new password to the actual service (e.g. Snowflake ALTER USER)
        pending = json.loads(sm.get_secret_value(
            SecretId=secret_id, VersionStage="AWSPENDING")["SecretString"])
        # ... call Snowflake / RDS / API to set the new password ...

    elif step == "testSecret":
        # Verify the pending credentials actually work
        pending = json.loads(sm.get_secret_value(
            SecretId=secret_id, VersionStage="AWSPENDING")["SecretString"])
        # ... try connecting with pending creds; raise exception if it fails ...

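    elif step == "finishSecret":
        # Promote AWSPENDING → AWSCURRENT. RemoveFromVersionId must be the ID of
        # the version that currently holds AWSCURRENT, so look it up first.
        stages = sm.describe_secret(SecretId=secret_id)["VersionIdsToStages"]
        current_id = next(vid for vid, labels in stages.items() if "AWSCURRENT" in labels)
        if current_id != token:   # skip if a retried invocation already promoted it
            sm.update_secret_version_stage(
                SecretId=secret_id, VersionStage="AWSCURRENT",
                MoveToVersionId=token, RemoveFromVersionId=current_id
            )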

⚖️

Secrets Manager vs Parameter Store — When to Use Which DECISION GUIDE ▼

Side-by-Side Comparison

Both services store sensitive config, but they have different design centers. Secrets Manager is purpose-built for credentials that rotate and where every access must be audited. Parameter Store is cheaper and better for static config that rarely changes.

Feature	Secrets Manager	Parameter Store
Primary Use	Rotating credentials (DB, API keys)	Pipeline config, non-rotating values
Cost	$0.40/secret/month + $0.05 per 10K API calls	Free (Standard); $0.05/parameter/month + $0.05 per 10K API calls (Advanced)
Automatic Rotation	✅ Built-in with Lambda integration	❌ No — you'd build it manually
Versioning	✅ Versions tracked with staging labels (AWSCURRENT, AWSPREVIOUS, AWSPENDING)	✅ Numbered version history, up to 100 versions per parameter
Secret Size	Up to 64 KB (65,536 bytes)	4 KB (Standard), 8 KB (Advanced)
Cross-Account	✅ Resource policy enables cross-account access	⚠️ Advanced tier only, shared through AWS RAM
Audit Logging	Every GetSecretValue call in CloudTrail	Every GetParameter call in CloudTrail
API	`secretsmanager.*`	`ssm.get_parameter()`
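
To make the Secrets Manager pricing row concrete, here is a small worked calculation using the list prices quoted in the table. `secrets_manager_monthly_cost` is a hypothetical helper, and actual prices may differ by Region or change over time.

```python
def secrets_manager_monthly_cost(num_secrets: int, api_calls: int) -> float:
    """Estimate the monthly cost in USD: $0.40 per secret + $0.05 per 10,000 API calls."""
    storage = 0.40 * num_secrets
    requests = 0.05 * (api_calls / 10_000)
    return round(storage + requests, 2)


# 25 secrets read 1,000,000 times in a month:
#   storage  = 25 * 0.40                 = 10.00
#   requests = 1,000,000 / 10,000 * 0.05 =  5.00
print(secrets_manager_monthly_cost(25, 1_000_000))   # 15.0
```

The request charge is why caching matters: a Lambda that fetches the secret on every invocation pays for each call.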

Decision Rule — Simple Mental Model

🗝️

Use Secrets Manager for:

Database passwords · API tokens · Snowflake credentials · Kafka SASL passwords · anything that rotates or is shared cross-account

⚙️

Use Parameter Store for:

S3 bucket names · environment flags · Glue job config · feature flags · pipeline schedules · anything non-sensitive or rarely changing

💡 Rule of Thumb

If it changes regularly or leaking it causes a security breach → Secrets Manager. If it's just pipeline config that happens to be slightly sensitive → Parameter Store SecureString (cheaper). If it's completely non-sensitive → Parameter Store Standard (free).
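
The rule of thumb can be written down as a small decision function. This is only a sketch of the guidance above: `choose_store` and its three sensitivity levels ("none", "low", "high") are assumptions made for illustration.

```python
def choose_store(sensitivity: str, rotates: bool = False, cross_account: bool = False) -> str:
    """Map the rule of thumb to a storage choice."""
    if sensitivity not in ("none", "low", "high"):
        raise ValueError(f"unknown sensitivity: {sensitivity!r}")
    if rotates or cross_account or sensitivity == "high":
        return "Secrets Manager"
    if sensitivity == "low":
        return "Parameter Store SecureString"
    return "Parameter Store String"


print(choose_store("high", rotates=True))   # Secrets Manager       (database password)
print(choose_store("low"))                  # Parameter Store SecureString (internal API key)
print(choose_store("none"))                 # Parameter Store String (S3 bucket name)
```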

📋

Secrets Manager API — Full Reference for Data Engineers API REFERENCE ▼

All Key Operations with Code

python — complete Secrets Manager API reference

import boto3, json

sm = boto3.client("secretsmanager", region_name="us-east-1")

# ── 1. CREATE a secret ───────────────────────────────────────
sm.create_secret(
    Name="prod/kafka/sasl-credentials",
    SecretString=json.dumps({"username": "kafka-user", "password": "s3cr3t"}),
    KmsKeyId="alias/data-lake-cmk"
)

# ── 2. GET a secret (the most common call) ───────────────────
creds = json.loads(
    sm.get_secret_value(SecretId="prod/kafka/sasl-credentials")["SecretString"]
)

# ── 3. UPDATE (rotate/change) a secret value ─────────────────
sm.put_secret_value(
    SecretId="prod/kafka/sasl-credentials",
    SecretString=json.dumps({"username": "kafka-user", "password": "n3wP@ss!"})
)

# ── 4. DESCRIBE a secret (metadata, rotation status) ─────────
meta = sm.describe_secret(SecretId="prod/kafka/sasl-credentials")
print(meta.get("RotationEnabled", False))   # may be absent if rotation was never configured
print(meta["LastChangedDate"])              # datetime
print(meta.get("Tags", []))                 # list of {Key, Value}; may be absent when untagged

# ── 5. LIST secrets (with paginator) ─────────────────────────
paginator = sm.get_paginator("list_secrets")
for page in paginator.paginate(
    Filters=[{"Key": "tag-key", "Values": ["environment"]}]
):
    for secret in page["SecretList"]:
        print(secret["Name"], secret["LastChangedDate"])

# ── 6. DELETE a secret ──────────────────────────────────────
sm.delete_secret(
    SecretId="prod/kafka/sasl-credentials",
    RecoveryWindowInDays=30   # 30-day window before permanent deletion
    # ForceDeleteWithoutRecovery=True  ← immediate deletion (no recovery!)
)

# ── 7. RESTORE a deleted secret (within recovery window) ─────
sm.restore_secret(SecretId="prod/kafka/sasl-credentials")

# ── 8. TAG a secret ─────────────────────────────────────────
sm.tag_resource(
    SecretId="prod/kafka/sasl-credentials",
    Tags=[{"Key": "cost-center", "Value": "data-platform"}]
)

# ── 9. REPLICATE to another region (DR pattern) ──────────────
sm.replicate_secret_to_regions(
    SecretId="prod/rds/pipeline-user",
    AddReplicaRegions=[{"Region": "us-west-2", "KmsKeyId": "alias/data-lake-cmk-west"}]
)
# Now your pipelines in us-west-2 read from the local replica — faster & resilient

IAM Policy Pattern — Least Privilege for Secrets

Grant each role access to only its own secrets using IAM resource ARN patterns. A Glue job for the sales pipeline should never be able to read the finance pipeline's Redshift credentials.

json — IAM policy for a Glue role scoped to only prod/sales/* secrets

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret"
      ],
      "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/sales/*"
    },
    {
      "Effect": "Allow",
      "Action": ["kms:Decrypt"],
      "Resource": "arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab"
    }
  ]
}
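
If you manage many pipelines, generating these policies from a template keeps the scoping consistent. The sketch below builds a read-only policy document as a Python dict. `secret_read_policy` is a hypothetical helper, and the key ARN passed in is a placeholder; the KMS statement should reference a key ARN rather than an alias.

```python
import json


def secret_read_policy(region: str, account_id: str, prefix: str, kms_key_arn: str) -> dict:
    """Build a read-only policy for secrets whose names start with `prefix`."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["secretsmanager:GetSecretValue", "secretsmanager:DescribeSecret"],
                "Resource": f"arn:aws:secretsmanager:{region}:{account_id}:secret:{prefix}/*",
            },
            {
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],   # reading a secret only needs decrypt
                "Resource": kms_key_arn,
            },
        ],
    }


policy = secret_read_policy(
    "us-east-1", "123456789012", "prod/sales",
    "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID",   # placeholder key ARN
)
print(json.dumps(policy, indent=2))
```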

☁️ 29.4 Summary

Never hardcode credentials — always fetch from Secrets Manager at runtime. The canonical pattern is get_secret_value(SecretId=name) → json.loads(SecretString). Use automatic rotation for database passwords — AWS has pre-built Lambda rotators for RDS, Redshift, and DocumentDB. Cache the secret at module level in Lambda to avoid per-invocation API calls. Choose Secrets Manager over Parameter Store when credentials rotate, need cross-account sharing, or require per-access auditing. Always scope IAM to prod/service/* patterns — never grant wildcard access to all secrets.

29.5

AWS Systems Manager — Parameter Store

Parameter Store is the lightweight, cost-effective sibling of Secrets Manager. While Secrets Manager is best for rotating credentials, Parameter Store is perfect for pipeline configuration — environment flags, S3 bucket names, Glue job parameters, feature toggles, and anything that drives pipeline behaviour but doesn't need auto-rotation. It integrates natively with Glue, Lambda, EMR, and any boto3 code for free (Standard tier).

📊

Standard vs Advanced Parameters FOUNDATION ▼

What Is Parameter Store?

AWS Systems Manager Parameter Store is a centralized key-value configuration store. Instead of hard-coding S3 bucket names, database hosts, or feature flags into your Glue job code, you store them in Parameter Store and fetch them at runtime. This means changing a config value doesn't require redeploying code — you update the parameter and the next job run picks it up automatically.

📋 Analogy

Think of Parameter Store as a shared Google Sheet that all your pipeline jobs can read from. Instead of each job having its own hardcoded config, every job calls ssm.get_parameter(Name="/myproject/prod/s3_bucket") and gets the latest value. Change the sheet (parameter) once, and all jobs see the update next time they run.

Standard vs Advanced — Side by Side

There are two tiers. Standard is free and sufficient for most data engineering use cases. Advanced unlocks larger values (up to 8 KB), parameter policies (such as auto-expiry), and a higher per-account parameter limit. Higher API throughput is a separate account-level setting that applies to both tiers.

Feature	Standard	Advanced
Cost	Free	$0.05 per advanced parameter/month
Value Size	Up to 4 KB	Up to 8 KB
Parameters per Account	10,000	100,000
Parameter Policies	❌ Not available	✅ Expiration, notification, no-change alerts
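Throughput	40 transactions/sec by default	40 transactions/sec by default; higher throughput is a separate, paid account-level setting rather than a tier feature
SecureString	✅ Supported (with KMS)	✅ Supported
Best For	Pipeline config, feature flags, S3 paths	Large configs (over 4 KB), expiry and no-change policies, accounts with more than 10,000 parameters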

💡 Key Point

For most data engineering use cases, Standard is enough and free. Only upgrade to Advanced when you need parameter policies (auto-expiry alerts) or your value exceeds 4 KB, which is rare because JSON configs are usually well under this.

Parameter Types — String, StringList, SecureString

Parameter Store has three value types. String is a plain text value. StringList is a comma-separated list of strings. SecureString is a KMS-encrypted value — the AWS equivalent of a secret for values that are mildly sensitive but don't need rotation.

📝

String

Plain text.
s3://my-data-lake
us-east-1
glue-job-v2
Non-sensitive config.

📋

StringList

Comma-separated values.
us-east-1,eu-west-1
bronze,silver,gold
Multi-value configs.

🔒

SecureString

KMS-encrypted.
Mildly sensitive values.
Pipeline environment keys.
Internal API URLs. Not for rotating passwords — use Secrets Manager for those.

python — create all three parameter types

import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

# ── String: plain non-sensitive config ──────────────────────────
ssm.put_parameter(
    Name="/datalake/prod/s3_bucket",
    Value="my-company-data-lake-prod",
    Type="String",
    Description="Main data lake S3 bucket for production",
    Overwrite=True
)

# ── StringList: comma-separated list of values ──────────────────
ssm.put_parameter(
    Name="/datalake/prod/active_regions",
    Value="us-east-1,eu-west-1,ap-southeast-1",
    Type="StringList",
    Description="Regions where the pipeline runs",
    Overwrite=True
)

# ── SecureString: encrypted with KMS (for mildly sensitive config)
ssm.put_parameter(
    Name="/datalake/prod/internal_api_key",
    Value="int-api-key-abc123",
    Type="SecureString",
    KeyId="alias/data-lake-cmk",   # optional: use CMK (default: AWS-managed key)
    Description="Internal monitoring API key",
    Overwrite=True
)

print("All parameters stored.")
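
When you read these back, every value arrives as a string, including StringList, which comes back as one comma-separated string. A small helper can turn it into a Python list. This is a sketch: `parameter_value` is a hypothetical name, and its input is the `Parameter` dict returned by `get_parameter()`.

```python
def parameter_value(param: dict):
    """Return a list for StringList parameters and a plain str otherwise."""
    value = param["Value"]
    if param.get("Type") == "StringList":
        return [item.strip() for item in value.split(",") if item.strip()]
    return value


regions = parameter_value({
    "Name": "/datalake/prod/active_regions",
    "Type": "StringList",
    "Value": "us-east-1,eu-west-1,ap-southeast-1",
})
print(regions)   # ['us-east-1', 'eu-west-1', 'ap-southeast-1']
```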

🔒

SecureString Parameters (KMS-backed) SECURITY ▼

How SecureString Works

A SecureString parameter is stored encrypted using KMS. When you call get_parameter(WithDecryption=True), SSM calls KMS to decrypt the value and returns it in plaintext to your code. If you call it without WithDecryption=True, you get back the raw encrypted blob — which is useless. The caller's IAM role needs both ssm:GetParameter AND kms:Decrypt on the key used to encrypt it.

🔐 Analogy

A SecureString is like a note written in invisible ink. Anyone can see the paper (the parameter exists), but only people with the special UV lamp (kms:Decrypt permission) can read the actual content. Passing WithDecryption=True is you using the UV lamp.

python — reading SecureString (always use WithDecryption=True)

import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

# ✅ CORRECT — WithDecryption=True decrypts the SecureString
response = ssm.get_parameter(
    Name="/datalake/prod/internal_api_key",
    WithDecryption=True
)
api_key = response["Parameter"]["Value"]
print(api_key)   # "int-api-key-abc123" — actual value

# ❌ WRONG — without WithDecryption you get an encrypted blob
bad_response = ssm.get_parameter(
    Name="/datalake/prod/internal_api_key"
    # WithDecryption defaults to False!
)
print(bad_response["Parameter"]["Value"])
# AQICAHh7... (encrypted gibberish — DO NOT use this)

⚠️ Common Bug

Forgetting WithDecryption=True on a SecureString is one of the most common boto3 bugs. Your code will run without error but get an encrypted blob as the value — causing downstream connection failures that look mysterious. Always set it explicitly.

IAM Policy for SecureString Access

A Glue job or Lambda that reads SecureString parameters needs permissions at both layers: SSM to read the parameter and KMS to decrypt it. Here's the minimal IAM policy to grant both:

json — IAM policy for reading SecureString parameters

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadSSMParameters",
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameter",
        "ssm:GetParameters",
        "ssm:GetParametersByPath"
      ],
      "Resource": "arn:aws:ssm:us-east-1:123456789012:parameter/datalake/prod/*"
    },
    {
      "Sid": "DecryptSSMSecureString",
      "Effect": "Allow",
      "Action": "kms:Decrypt",
      "Resource": "arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab"
    }
  ]
}

⚙️

Using Parameter Store for Pipeline Config PRODUCTION PATTERN ▼

Why Pipeline Config Belongs in Parameter Store

Data pipelines have dozens of configurable values: which S3 bucket to write to, how many Spark partitions to use, whether a feature flag is enabled, the name of the Glue database, the Kafka topic to read from, the DQ threshold percentage. Storing these in Parameter Store means you can change pipeline behaviour without redeploying code — a huge operational advantage in production.

✅ Real-World Example

Your ETL job reads from 10 different source tables. Instead of hardcoding the list in code, you store it in Parameter Store as /pipeline/prod/source_tables (StringList). When the business adds an 11th table, you update the parameter — the next job run automatically includes it, with zero code change or deployment.

Bulk Config Load Pattern — get_parameters_by_path()

Instead of making one API call per parameter, use get_parameters_by_path() to fetch everything under a path prefix. Each response returns at most 10 parameters plus a NextToken, so use the paginator to walk all pages. This is the production pattern for loading all pipeline config at startup.

python — load all pipeline config at once with get_parameters_by_path()

import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

def load_pipeline_config(path_prefix: str) -> dict:
    """Load all parameters under a path prefix as a flat dict."""
    config = {}
    paginator = ssm.get_paginator("get_parameters_by_path")

    for page in paginator.paginate(
        Path=path_prefix,
        Recursive=True,          # include all nested sub-paths
        WithDecryption=True     # decrypt SecureStrings automatically
    ):
        for param in page["Parameters"]:
            # Strip the path prefix to get the short key name
            # (slice instead of str.replace so a repeated prefix later in the name is left alone)
            short_name = param["Name"][len(path_prefix):].lstrip("/")
            config[short_name] = param["Value"]

    return config


# ── Usage in a Glue job ───────────────────────────────────────────
cfg = load_pipeline_config("/datalake/prod")

print(cfg)
# {
#   "s3_bucket":         "my-company-data-lake-prod",
#   "glue_database":     "prod_analytics",
#   "spark_partitions":  "200",
#   "dq_threshold_pct":  "95",
#   "kafka_topic":       "prod.sales.events",
#   "active_regions":    "us-east-1,eu-west-1,ap-southeast-1",
#   "internal_api_key":  "int-api-key-abc123"   ← SecureString, auto-decrypted
# }

# Now use config values naturally
s3_bucket  = cfg["s3_bucket"]
partitions = int(cfg["spark_partitions"])
dq_pct     = float(cfg["dq_threshold_pct"])
regions    = cfg["active_regions"].split(",")   # StringList → Python list

print(f"Writing to: {s3_bucket}")
print(f"Partitions: {partitions}, DQ threshold: {dq_pct}%")
print(f"Active regions: {regions}")

Fetching a Single Parameter — get_parameter()

For fetching individual parameters in the middle of a job (e.g., checking a feature flag before an optional step), use get_parameter(). Always use WithDecryption=True even for String types — it's a no-op for non-SecureStrings but saves bugs if the type ever changes.

python — single parameter reads with error handling

import boto3
from botocore.exceptions import ClientError

ssm = boto3.client("ssm")

def get_param(name: str, default=None):
    """Get a single parameter, returning default if it doesn't exist."""
    try:
        resp = ssm.get_parameter(Name=name, WithDecryption=True)
        return resp["Parameter"]["Value"]
    except ClientError as e:
        if e.response["Error"]["Code"] == "ParameterNotFound":
            return default
        raise   # re-raise unexpected errors


# ── Feature flag check in pipeline ───────────────────────────────
run_dq_checks = get_param("/datalake/prod/feature/run_dq_checks", default="true")

if run_dq_checks.lower() == "true":
    print("Running data quality checks...")
    # ... run checks ...
else:
    print("DQ checks disabled via feature flag — skipping.")

# ── Batch fetch multiple specific parameters ──────────────────────
resp = ssm.get_parameters(
    Names=[
        "/datalake/prod/s3_bucket",
        "/datalake/prod/glue_database",
        "/datalake/prod/kafka_topic"
    ],
    WithDecryption=True
)
params = {p["Name"].split("/")[-1]: p["Value"] for p in resp["Parameters"]}
# {"s3_bucket": "...", "glue_database": "...", "kafka_topic": "..."}

# Check for invalid/missing names
if resp.get("InvalidParameters"):
    print(f"WARNING: Parameters not found: {resp['InvalidParameters']}")

🗂️

Hierarchy Naming Convention — /project/env/key BEST PRACTICE ▼

Why Hierarchy Matters

Parameter Store allows slash-delimited hierarchical names — not just for organisation but because get_parameters_by_path() lets you fetch all parameters under a prefix in one call. More importantly, IAM policies can scope access to a subtree — your staging Glue role can access /datalake/staging/* but never /datalake/prod/*. This is how you enforce environment isolation without complex per-parameter policies.

📁 Analogy

Parameter Store paths are like file system paths. Just as you can ls /project/prod/ to see all prod configs, you can call get_parameters_by_path(Path="/project/prod") to list production parameters under that prefix. And just as you can set folder permissions, IAM policies can restrict access to entire subtrees.

Recommended Naming Patterns

Here are the most common naming schemes used in production data engineering teams — pick one and apply it consistently across your entire organisation:

PARAMETER STORE NAMING PATTERNS

Pattern 1: /project/environment/component/key (recommended)
/datalake/prod/s3/raw_bucket → "my-data-lake-raw"
/datalake/prod/s3/silver_bucket → "my-data-lake-silver"
/datalake/prod/glue/database_name → "prod_analytics"
/datalake/prod/glue/job_dpu → "10"
/datalake/prod/spark/shuffle_partitions → "400"
/datalake/prod/kafka/topic → "prod.sales.events"
/datalake/prod/kafka/consumer_group → "etl-pipeline-cg"
/datalake/prod/dq/threshold_pct → "95"
/datalake/prod/feature/run_dq_checks → "true"
/datalake/prod/feature/send_alerts → "true"
/datalake/staging/s3/raw_bucket → "my-data-lake-raw-staging"
/datalake/staging/spark/shuffle_partitions → "100"

Pattern 2: /team/environment/key (simpler)
/data-eng/prod/s3_bucket → "my-data-lake"
/data-eng/prod/glue_db → "prod_analytics"
/data-eng/staging/s3_bucket → "my-data-lake-staging"

Pattern 3: /pipeline-name/environment/key (per-pipeline)
/customer-360/prod/source_table → "customers"
/customer-360/prod/target_schema → "gold"
/customer-360/staging/max_rows → "100000"

📌 Rule of Thumb

Always put environment second in the hierarchy (/project/env/...), not last. This lets you write IAM policies like Resource: "arn:...parameter/datalake/prod/*" that grant or deny access to all prod parameters but no staging ones.
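
A short sketch shows why the position of the environment segment matters. Both helpers are hypothetical; the ARN format follows the IAM policy example earlier in this section.

```python
def parameter_subtree_arn(region: str, account_id: str, project: str, env: str) -> str:
    """ARN pattern covering every parameter under /project/env/."""
    return f"arn:aws:ssm:{region}:{account_id}:parameter/{project}/{env}/*"


def is_in_environment(name: str, project: str, env: str) -> bool:
    """True if a parameter name falls inside the /project/env/ subtree."""
    return name.startswith(f"/{project}/{env}/")


print(parameter_subtree_arn("us-east-1", "123456789012", "datalake", "prod"))
# arn:aws:ssm:us-east-1:123456789012:parameter/datalake/prod/*

# Environment second: one prefix covers every prod parameter.
print(is_in_environment("/datalake/prod/s3/raw_bucket", "datalake", "prod"))   # True
# Environment last: the same prefix no longer matches, so no single wildcard isolates prod.
print(is_in_environment("/datalake/s3/raw_bucket/prod", "datalake", "prod"))   # False
```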

Practical Naming Conventions for Common Config Types

Config Type	Recommended Path	Type	Example Value
S3 bucket name	`/proj/prod/s3/bucket_name`	String	`my-lake-prod`
Glue database	`/proj/prod/glue/database`	String	`prod_analytics`
Spark partitions	`/proj/prod/spark/shuffle_partitions`	String	`200`
Kafka topic	`/proj/prod/kafka/topic`	String	`prod.sales`
Source table list	`/proj/prod/etl/source_tables`	StringList	`orders,customers,products`
Feature flag	`/proj/prod/feature/dq_enabled`	String	`true`
DQ threshold	`/proj/prod/dq/threshold_pct`	String	`95.0`
Internal API key	`/proj/prod/api/monitoring_key`	SecureString	`key-xyz-123` (encrypted)
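
Because every value in the table comes back as a string, it can help to convert types in one place rather than scattering int() and float() calls through the job. The following is a sketch under assumed names: `SCHEMA`, `coerce_config`, `to_bool`, and `to_list` are illustrative, and the keys mirror the paths in the table with the /proj/prod/ prefix stripped.

```python
def to_bool(text: str) -> bool:
    normalized = text.strip().lower()
    if normalized in ("true", "1", "yes"):
        return True
    if normalized in ("false", "0", "no"):
        return False
    raise ValueError(f"not a boolean: {text!r}")


def to_list(text: str) -> list:
    return [item.strip() for item in text.split(",") if item.strip()]


# Converter per config key; anything not listed stays a string.
SCHEMA = {
    "spark/shuffle_partitions": int,
    "dq/threshold_pct": float,
    "feature/dq_enabled": to_bool,
    "etl/source_tables": to_list,
}


def coerce_config(raw: dict, schema: dict) -> dict:
    """Apply the converter registered for each key."""
    return {key: schema.get(key, str)(value) for key, value in raw.items()}


raw = {
    "s3/bucket_name": "my-lake-prod",
    "spark/shuffle_partitions": "200",
    "dq/threshold_pct": "95.0",
    "feature/dq_enabled": "true",
    "etl/source_tables": "orders,customers,products",
}
typed = coerce_config(raw, SCHEMA)
print(typed["spark/shuffle_partitions"] + 1)   # 201
```

A bad value such as "twohundred" then fails at startup with a ValueError instead of surfacing later inside a Spark stage.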

🐍

Complete Parameter Store API Reference API REFERENCE ▼

All Key SSM Operations for Data Engineers

python — complete Parameter Store API reference

import boto3
from botocore.exceptions import ClientError

ssm = boto3.client("ssm", region_name="us-east-1")

# ── 1. CREATE or UPDATE a parameter (Overwrite=True = upsert) ────
ssm.put_parameter(
    Name="/datalake/prod/spark/shuffle_partitions",
    Value="200",
    Type="String",
    Description="Spark shuffle partition count for production",
    Overwrite=True
)

# ── 2. GET a single parameter ─────────────────────────────────────
resp  = ssm.get_parameter(
    Name="/datalake/prod/spark/shuffle_partitions",
    WithDecryption=True   # always set True; no-op for String type
)
value = resp["Parameter"]["Value"]        # "200"
ptype = resp["Parameter"]["Type"]         # "String"
ver   = resp["Parameter"]["Version"]      # version number (increments on update)

# ── 3. GET multiple specific parameters in one API call (max 10 names) ──
multi = ssm.get_parameters(
    Names=[
        "/datalake/prod/s3/raw_bucket",
        "/datalake/prod/glue/database",
        "/datalake/prod/kafka/topic"
    ],
    WithDecryption=True
)
# Build a dict: strip full path, keep just the last key segment
params = {p["Name"].rsplit("/", 1)[-1]: p["Value"] for p in multi["Parameters"]}
# Check if any requested names were invalid/missing
if multi["InvalidParameters"]:
    raise ValueError(f"Missing parameters: {multi['InvalidParameters']}")

# ── 4. GET ALL parameters under a path (bulk config load) ─────────
paginator = ssm.get_paginator("get_parameters_by_path")
all_params = {}
for page in paginator.paginate(
    Path="/datalake/prod",
    Recursive=True,
    WithDecryption=True
):
    for p in page["Parameters"]:
        all_params[p["Name"]] = p["Value"]

# ── 5. GET a specific version of a parameter ──────────────────────
# Useful for rollback: fetch a known-good previous version
history_resp = ssm.get_parameter_history(
    Name="/datalake/prod/spark/shuffle_partitions",
    WithDecryption=True
)
for h in history_resp["Parameters"]:
    print(h["Version"], h["Value"], h["LastModifiedDate"])

# Fetch a specific version directly
v2 = ssm.get_parameter(
    Name="/datalake/prod/spark/shuffle_partitions:2",  # :version suffix
    WithDecryption=True
)["Parameter"]["Value"]

# ── 6. DELETE a parameter ─────────────────────────────────────────
ssm.delete_parameter(Name="/datalake/staging/temp_flag")

# Delete multiple at once (up to 10)
ssm.delete_parameters(
    Names=[
        "/datalake/staging/old_flag",
        "/datalake/staging/deprecated_key"
    ]
)

# ── 7. DESCRIBE (list parameters with metadata) ───────────────────
desc_paginator = ssm.get_paginator("describe_parameters")
for page in desc_paginator.paginate(
    ParameterFilters=[{
        "Key": "Path",
        "Option": "Recursive",
        "Values": ["/datalake/prod"]
    }]
):
    for p in page["Parameters"]:
        print(p["Name"], p["Type"], p.get("Description", ""))

# ── 8. ADD TAGS to a parameter ────────────────────────────────────
ssm.add_tags_to_resource(
    ResourceType="Parameter",
    ResourceId="/datalake/prod/spark/shuffle_partitions",
    Tags=[
        {"Key": "team",  "Value": "data-engineering"},
        {"Key": "owner", "Value": "pipeline-team"}
    ]
)
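
get_parameters() accepts at most 10 names per request, so longer lists have to be split into batches. A minimal sketch, assuming a client object with the same `get_parameters` signature as boto3's SSM client: `get_many`, `chunked`, and `FakeSSM` are hypothetical names, and the fake exists only so the example runs without AWS access.

```python
from typing import Iterator, List


def chunked(items: List[str], size: int = 10) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def get_many(ssm_client, names: List[str]) -> dict:
    """Fetch any number of parameters by name, 10 per request."""
    found, missing = {}, []
    for batch in chunked(names, 10):
        resp = ssm_client.get_parameters(Names=batch, WithDecryption=True)
        found.update({p["Name"]: p["Value"] for p in resp["Parameters"]})
        missing.extend(resp.get("InvalidParameters", []))
    if missing:
        raise KeyError(f"Missing parameters: {missing}")
    return found


class FakeSSM:
    """In-memory stand-in for boto3.client("ssm"); records request sizes."""

    def __init__(self, store: dict):
        self.store = store
        self.batch_sizes = []

    def get_parameters(self, Names, WithDecryption=False):
        self.batch_sizes.append(len(Names))
        return {
            "Parameters": [{"Name": n, "Value": self.store[n]} for n in Names if n in self.store],
            "InvalidParameters": [n for n in Names if n not in self.store],
        }


store = {f"/datalake/prod/key_{i}": str(i) for i in range(23)}
fake = FakeSSM(store)
values = get_many(fake, list(store))
print(len(values), fake.batch_sizes)   # 23 [10, 10, 3]
```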

Glue Job Integration — Loading Config at Job Start

This is the canonical production pattern for a Glue job that reads all its configuration from Parameter Store at startup:

python — complete Glue job using Parameter Store for config

import sys, boto3, json
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# ── Step 1: Get job arguments (just the env) ──────────────────────
args = getResolvedOptions(sys.argv, ["JOB_NAME", "ENV"])
env  = args["ENV"]   # "prod" or "staging" — passed at job launch

# ── Step 2: Load ALL config for this environment in one call ──────
ssm = boto3.client("ssm")

def load_config(env: str) -> dict:
    paginator = ssm.get_paginator("get_parameters_by_path")
    cfg = {}
    for page in paginator.paginate(
        Path=f"/datalake/{env}",
        Recursive=True,
        WithDecryption=True
    ):
        for p in page["Parameters"]:
            short = p["Name"].replace(f"/datalake/{env}/", "")
            cfg[short] = p["Value"]
    return cfg

cfg = load_config(env)

# ── Step 3: Use config values in the job ─────────────────────────
RAW_BUCKET   = cfg["s3/raw_bucket"]
SILVER_BUCKET = cfg["s3/silver_bucket"]
GLUE_DB      = cfg["glue/database"]
PARTITIONS   = int(cfg["spark/shuffle_partitions"])
DQ_ENABLED   = cfg.get("feature/dq_enabled", "true").lower() == "true"

print(f"[{env.upper()}] Loading from {RAW_BUCKET} → {SILVER_BUCKET}")
print(f"Shuffle partitions: {PARTITIONS}, DQ: {DQ_ENABLED}")

# ── Step 4: Spark job runs with config ────────────────────────────
sc    = SparkContext()
glue  = GlueContext(sc)
spark = glue.spark_session
spark.conf.set("spark.sql.shuffle.partitions", PARTITIONS)

df = spark.read.parquet(f"s3://{RAW_BUCKET}/bronze/sales/")
# ... transformations ...
df.write.mode("overwrite").parquet(f"s3://{SILVER_BUCKET}/silver/sales/")

✅ Why This Pattern Is Powerful

The same Glue job code runs in staging and production — just pass ENV=staging or ENV=prod as a job argument. All configuration differences are in Parameter Store, not in the code. Promoting from staging to prod means updating parameters, not redeploying code.
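
Since the job now depends entirely on Parameter Store, a missing parameter in a new environment would otherwise show up as a KeyError halfway through startup. A fail-fast check is a cheap safeguard. This sketch assumes the key names used in the job above; `REQUIRED_KEYS`, `missing_keys`, and `validate_config` are hypothetical.

```python
REQUIRED_KEYS = (
    "s3/raw_bucket",
    "s3/silver_bucket",
    "glue/database",
    "spark/shuffle_partitions",
)


def missing_keys(cfg: dict, required=REQUIRED_KEYS) -> list:
    """Return required keys that are absent or blank, in declaration order."""
    return [key for key in required if not cfg.get(key, "").strip()]


def validate_config(cfg: dict, required=REQUIRED_KEYS) -> None:
    """Raise one error listing every missing key, before any Spark work starts."""
    missing = missing_keys(cfg, required)
    if missing:
        raise RuntimeError("Missing required config keys: " + ", ".join(missing))


partial_cfg = {"s3/raw_bucket": "my-data-lake-raw", "glue/database": " "}
print(missing_keys(partial_cfg))
# ['s3/silver_bucket', 'glue/database', 'spark/shuffle_partitions']
```

Calling validate_config(cfg) right after load_config(env) reports all gaps at once, which is handy when bootstrapping a new environment.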

📋

Advanced Tier — Parameter Policies ADVANCED ▼

What Are Parameter Policies?

Advanced-tier parameters support parameter policies: lifecycle rules that can automatically delete a parameter at a set time, emit an Amazon EventBridge event shortly before it expires, or emit an event if it hasn't been updated within a certain period. You can route those events to SNS for alerting. This is useful for API keys or internal tokens that need to be refreshed periodically (but where you manage rotation manually rather than via Secrets Manager).

python — Advanced parameter with expiration policy

import boto3, json

ssm = boto3.client("ssm")

# ── Advanced parameter that expires on a fixed date ──────────────
# (requires Tier="Advanced")
ssm.put_parameter(
    Name="/datalake/prod/api/partner_token",
    Value="token-abc-xyz-123",
    Type="SecureString",
    Tier="Advanced",          # required for parameter policies
    Overwrite=True,
    Policies=json.dumps([
        {
            "Type": "Expiration",           # auto-delete after this date
            "Version": "1.0",
            "Attributes": {
                "Timestamp": "2025-12-31T23:59:59.000Z"
            }
        },
        {
            "Type": "ExpirationNotification",  # alert 14 days before expiry
            "Version": "1.0",
            "Attributes": {
                "Before":  "14",
                "Unit":    "Days"
            }
        },
        {
            "Type": "NoChangeNotification",     # alert if not updated for 30 days
            "Version": "1.0",
            "Attributes": {
                "After": "30",
                "Unit":  "Days"
            }
        }
    ])
)
print("Advanced parameter with policies created.")

💡 When to Use Parameter Policies

Use parameter policies for manually rotated tokens where you want a reminder that a rotation is overdue — like partner API keys that a third party emails you every 90 days. If the token can be automatically rotated, use Secrets Manager with a rotation Lambda instead.
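
For the 90-day partner-token case, the expiry timestamp is easier to compute than to hard-code. The sketch below builds the Policies JSON string for `put_parameter()`. `build_policies` is a hypothetical helper, and it assumes `now` is a timezone-aware UTC datetime.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_policies(expire_in_days: int, notify_before_days: int,
                   now: Optional[datetime] = None) -> str:
    """Return a Policies JSON string with an expiry and an advance notification."""
    now = now or datetime.now(timezone.utc)
    expires_at = now + timedelta(days=expire_in_days)
    policies = [
        {
            "Type": "Expiration",
            "Version": "1.0",
            "Attributes": {"Timestamp": expires_at.strftime("%Y-%m-%dT%H:%M:%S.000Z")},
        },
        {
            "Type": "ExpirationNotification",
            "Version": "1.0",
            "Attributes": {"Before": str(notify_before_days), "Unit": "Days"},
        },
    ]
    return json.dumps(policies)


# Fixed "now" so the output is reproducible; omit it in real use.
policies_json = build_policies(90, 14, now=datetime(2030, 1, 1, tzinfo=timezone.utc))
print(json.loads(policies_json)[0]["Attributes"]["Timestamp"])   # 2030-04-01T00:00:00.000Z
```

The returned string can be passed as the Policies argument alongside Tier="Advanced".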

📋

Parameter Store API — Quick Reference Pills SUMMARY ▼

Core APIs

put_parameter() get_parameter() get_parameters() get_parameters_by_path() get_parameter_history() describe_parameters() delete_parameter() delete_parameters() add_tags_to_resource()

Parameter Store vs Secrets Manager — Final Decision Table

Scenario	Use	Why
Database password that rotates every 90 days	Secrets Manager	Built-in auto-rotation with Lambda
S3 bucket name for the pipeline	Parameter Store String	Non-sensitive, free, simple
Kafka SASL password	Secrets Manager	Sensitive, needs rotation and audit trail
Spark shuffle.partitions value	Parameter Store String	Just config, completely non-sensitive
Internal monitoring API key (mildly sensitive)	Parameter Store SecureString	Sensitive enough to encrypt, but doesn't rotate
List of source tables to process	Parameter Store StringList	Non-sensitive, easy to update
Snowflake private key for key-pair auth	Secrets Manager	Highly sensitive, cross-account sharing possible
Feature flag (true/false)	Parameter Store String	Free, instant to update, simple to read

☁️ 29.5 Summary

Standard tier is free and covers 99% of DE use cases. Three types: String (plain text), StringList (comma-separated), SecureString (KMS-encrypted). Always use hierarchical paths (/project/env/component/key) to enable bulk loading and IAM scoping. get_parameters_by_path() loads all pipeline config in one paginated call — use it at job startup. Always set WithDecryption=True on get calls or SecureStrings return an encrypted blob. Use Parameter Store for config; use Secrets Manager for rotating credentials.

29.6

AWS Glue — Serverless ETL & Data Catalog

AWS Glue is the central ETL and metadata hub for most AWS data lakes. It provides a fully managed Apache Spark environment (ETL jobs), a central metadata store (the Data Catalog), automated schema discovery (Crawlers), and a built-in data quality framework. As a data engineer you'll use Glue daily: running Spark jobs without managing clusters, registering table schemas that Athena and Redshift Spectrum can query, and tracking incremental data loads via job bookmarks.

📚

Data Catalog — Databases, Tables, Partitions FOUNDATION ▼

What Is the Glue Data Catalog?

The Glue Data Catalog is a fully managed metadata store, one per AWS account per Region, that works as a Hive Metastore as a service. It stores schema definitions (databases, tables, columns, data types), partition metadata, and table locations (S3 paths). Athena, Redshift Spectrum, Glue jobs, and Spark on EMR all look up table schemas from the same catalog, making it the single source of truth for your data lake. You never re-define schemas in each tool separately.

📖 Analogy

Think of the Glue Catalog as a library's card catalogue. Every book (data file) is on a shelf in S3, but the card catalogue tells every tool — Athena, Redshift Spectrum, EMR — what columns a table has, where it lives, and how it's partitioned. Without the catalogue, every tool would have to "read the book from scratch" on every query.

GLUE DATA CATALOG STRUCTURE

Catalog (one per AWS account per Region)
├─ Database: "prod_analytics"
│  ├─ Table: "sales_transactions"
│  │  ├─ Location: s3://my-lake/silver/sales/
│  │  ├─ Format: Parquet
│  │  ├─ Columns: [txn_id, customer_id, amount, txn_date]
│  │  └─ Partitions: [year=2024/month=01, ...]
│  ├─ Table: "customers"
│  │  ├─ Location: s3://my-lake/silver/customers/
│  │  └─ Columns: [customer_id, name, email, region]
│  └─ Table: "product_catalog"
│     └─ Location: s3://my-lake/silver/products/
└─ Database: "raw_landing"
   └─ Table: "raw_orders" (Bronze, unvalidated)

🔑 Key Point

The Glue Catalog is region-scoped but account-wide. One catalog per region per account — shared by Athena, EMR, Glue jobs, Redshift Spectrum, and any Spark session that's configured to use it. Lake Formation adds fine-grained permissions on top of this same catalog.

Databases and Tables

A Glue database is just a namespace: a logical grouping of tables. A Glue table is a schema definition: column names and types, table location, SerDe (serialization library), input/output formats, and partition keys. The actual data stays in S3; the catalog stores only metadata. Managed tables (rare in Glue) follow Hive semantics, where the query engine owns the data under the database location and may delete it when the table is dropped; external tables (standard practice) point to your own S3 prefix and leave the data untouched when the table is dropped.

python — create database and table in Glue Catalog via boto3

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# ── 1. Create a database ─────────────────────────────────────────
glue.create_database(DatabaseInput={
    "Name": "prod_analytics",
    "Description": "Production analytics — silver and gold tables",
    "LocationUri": "s3://my-data-lake/silver/"
})

# ── 2. Create an external Parquet table ──────────────────────────
glue.create_table(
    DatabaseName="prod_analytics",
    TableInput={
        "Name": "sales_transactions",
        "Description": "Cleaned and validated sales transactions",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "txn_id",      "Type": "string"},
                {"Name": "customer_id",  "Type": "bigint"},
                {"Name": "amount",       "Type": "decimal(18,2)"},
                {"Name": "txn_date",     "Type": "date"},
                {"Name": "product_id",   "Type": "string"}
            ],
            "Location": "s3://my-data-lake/silver/sales/",
            "InputFormat":  "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
            "Compressed": True
        },
        "PartitionKeys": [
            {"Name": "year",  "Type": "string"},
            {"Name": "month", "Type": "string"}
        ],
        "TableType": "EXTERNAL_TABLE"
    }
)
print("Table 'sales_transactions' registered in Glue Catalog.")

Schema Versioning and Evolution

Every time a Glue Crawler (or update_table()) changes a table's schema, Glue stores the previous schema as a version. You can retrieve any historical version and compare schemas — useful when a bug introduced a wrong column type. Schema evolution works by adding columns (safe — old files are read without the new column) or changing compatible types (e.g. int → bigint). Dropping or renaming columns is a breaking change that can confuse existing queries.

python — add a new column (schema evolution) via update_table()

import boto3

glue = boto3.client("glue")

# 1. Get the current table definition
resp  = glue.get_table(DatabaseName="prod_analytics", Name="sales_transactions")
table = resp["Table"]

# 2. Build the TableInput (must strip read-only fields AWS adds)
table_input = {
    k: v for k, v in table.items()
    if k not in ["DatabaseName", "CreateTime", "UpdateTime",
                  "CreatedBy", "IsRegisteredWithLakeFormation",
                  "CatalogId", "VersionId"]
}

# 3. Add the new column
table_input["StorageDescriptor"]["Columns"].append(
    {"Name": "discount_pct", "Type": "double", "Comment": "Applied discount percentage"}
)

# 4. Push the update — Glue saves the old schema as a version
glue.update_table(DatabaseName="prod_analytics", TableInput=table_input)
print("Column 'discount_pct' added. Old schema saved as a version.")

# 5. List schema versions to see history
versions = glue.get_table_versions(
    DatabaseName="prod_analytics", TableName="sales_transactions"
)
for v in versions["TableVersions"]:
    ncols = len(v["Table"]["StorageDescriptor"]["Columns"])
    print(f"  Version {v['VersionId']}: {ncols} columns ({v['Table']['UpdateTime']})")
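
Listing versions shows that the schema changed, but not how. A small comparison helper makes the difference between two versions explicit. This is a minimal sketch: `diff_columns` is a hypothetical name, and it only assumes the column-list shape that Glue returns under StorageDescriptor["Columns"].

```python
def diff_columns(old_cols, new_cols):
    """Compare two Glue column lists of the form [{"Name": ..., "Type": ...}, ...]."""
    old = {c["Name"]: c["Type"] for c in old_cols}
    new = {c["Name"]: c["Type"] for c in new_cols}
    return {
        "added":   sorted(n for n in new if n not in old),
        "removed": sorted(n for n in old if n not in new),
        # columns present in both versions whose type changed: name -> (old, new)
        "retyped": {n: (old[n], new[n]) for n in old if n in new and old[n] != new[n]},
    }


# Example with two hand-written versions of the same table
v1 = [{"Name": "txn_id", "Type": "string"},
      {"Name": "customer_id", "Type": "int"}]
v2 = [{"Name": "txn_id", "Type": "string"},
      {"Name": "customer_id", "Type": "bigint"},
      {"Name": "discount_pct", "Type": "double"}]
changes = diff_columns(v1, v2)
```

In a real job the two lists would come from two entries of get_table_versions()["TableVersions"]. Anything under "removed" or an incompatible entry under "retyped" is a candidate breaking change worth blocking before it reaches production.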

Glue Catalog as Hive Metastore Replacement

Any Spark session on EMR or a self-managed cluster can use the Glue Catalog as its Hive Metastore, with no separate Hive installation needed. You configure SparkSession with two settings: one selects Hive as Spark's catalog implementation, and one swaps the default metastore client for the Glue Data Catalog client factory. After that, spark.sql("SHOW TABLES IN prod_analytics") queries exactly the same tables that Athena sees.

python — configure Spark (EMR) to use Glue Catalog as Hive Metastore

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("GlueCatalogExample")
    # Tell Spark to use Glue Catalog instead of local Hive Metastore
    .config("spark.sql.catalogImplementation", "hive")
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)

# Now Spark SQL sees the same tables as Athena
spark.sql("SHOW DATABASES").show()
spark.sql("USE prod_analytics")
spark.sql("SHOW TABLES").show()

# Read directly using catalog table name — no path needed!
df = spark.sql("SELECT * FROM sales_transactions WHERE year='2024'")
df.show(5)

Partition Management

When you write new partitioned data to S3, the Glue Catalog doesn't know about the new partitions automatically — you must register them. The fastest way is batch_create_partition() (up to 100 partitions per call) or running MSCK REPAIR TABLE via Athena (slow for large tables). Always register new partitions at the end of your ETL job — otherwise Athena queries won't see today's data.

python — batch register new partitions after writing data

import boto3
from datetime import date, timedelta

glue = boto3.client("glue")
DB   = "prod_analytics"
TBL  = "sales_transactions"
BASE = "s3://my-data-lake/silver/sales"

# Collect the distinct (year, month) partitions touched in the last 7 days.
# The table is partitioned by year and month only, so several days map to
# the same partition; de-duplicate before calling the API.
today = date.today()
recent = sorted({
    (str(d.year), f"{d.month:02d}")
    for d in (today - timedelta(days=i) for i in range(7))
})

partitions = []
for yr, mo in recent:
    partitions.append({
        "Values": [yr, mo],               # must match PartitionKeys order
        "StorageDescriptor": {
            "Location": f"{BASE}/year={yr}/month={mo}/",
            "InputFormat":  "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {"SerializationLibrary":
                "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"},
            "Compressed": True
        }
    })

# batch_create_partition accepts up to 100 partitions per call
for chunk in [partitions[i:i+100] for i in range(0, len(partitions), 100)]:
    resp = glue.batch_create_partition(
        DatabaseName=DB, TableName=TBL,
        PartitionInputList=chunk
    )
    if resp.get("Errors"):
        for err in resp["Errors"]:
            # AlreadyExistsException is fine: the partition is already registered
            if err["ErrorDetail"]["ErrorCode"] != "AlreadyExistsException":
                print(f"ERROR: {err}")

print(f"Submitted {len(partitions)} partition(s) to the Glue Catalog.")

✅ Production Tip

Always call batch_create_partition() with error suppression for AlreadyExistsException. Your job may retry on failure, and trying to re-register an existing partition should never cause the job to fail.
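
When the job does not know in advance which partitions it wrote, the partition values can be derived from the S3 object keys it produced. The sketch below assumes Hive-style `key=value` path segments; `partition_values_from_keys` is a hypothetical helper, not a Glue API.

```python
def partition_values_from_keys(keys, partition_keys):
    """Extract distinct Hive-style partition values from S3 object keys.

    Returns a sorted list of tuples ordered like partition_keys, ready to be
    used as the "Values" of each PartitionInput.
    """
    seen = set()
    for key in keys:
        # Turn "year=2024/month=01/file.parquet" into {"year": "2024", "month": "01"}
        parts = dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)
        if all(k in parts for k in partition_keys):
            seen.add(tuple(parts[k] for k in partition_keys))
    return sorted(seen)


written = [
    "silver/sales/year=2024/month=01/part-000.parquet",
    "silver/sales/year=2024/month=01/part-001.parquet",
    "silver/sales/year=2024/month=02/part-000.parquet",
    "silver/sales/_SUCCESS",
]
values = partition_values_from_keys(written, ["year", "month"])
```

Each tuple maps to one PartitionInput, so two files in the same month produce a single registration.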

🕷️

Glue Crawlers — Automated Schema Discovery AUTOMATION ▼

What Crawlers Do

A Glue Crawler inspects a data source — S3 prefix, JDBC database, Redshift, DynamoDB — reads sample files, infers column names and data types, and writes (or updates) table definitions in the Glue Catalog automatically. Instead of manually defining every table schema, you point a crawler at your S3 prefix and it builds the catalog entry for you. Crawlers also detect new partitions and schema changes in subsequent runs.

🔎 Analogy

A crawler is like a library cataloguing robot. You dump a new batch of books (Parquet files) on the loading dock (S3 prefix), and the robot scans every book, reads the table of contents (schema), and creates a card catalogue entry automatically. On its next run it detects any new books and updates the catalogue.

HOW A GLUE CRAWLER WORKS

S3 Prefix: s3://my-lake/silver/sales/
├── year=2024/month=01/data-001.parquet   ← Crawler reads sample rows
├── year=2024/month=02/data-001.parquet   ← Detects same schema
└── year=2024/month=03/data-001.parquet

Crawler Output:
  → Database: prod_analytics
  → Table: sales (inferred from prefix name)
  → Columns: txn_id string, amount decimal, txn_date date ...
  → PartitionKeys: year string, month string
  → Format: Parquet + Snappy
  → Location: s3://my-lake/silver/sales/

Schema Discovery from S3, RDS, Redshift

Crawlers support multiple source types. For S3 they sample Parquet/ORC/Avro/CSV/JSON files. For RDS/Redshift they read the JDBC schema directly — column definitions come straight from the database, not file inference. You can have one crawler cover multiple data sources (S3 + RDS) in a single run, writing all discovered tables to the same target database in the Catalog.

python — create and run a Glue Crawler for an S3 prefix

import boto3, time

glue = boto3.client("glue")

# ── Create the crawler ───────────────────────────────────────────
glue.create_crawler(
    Name="silver-sales-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="prod_analytics",         # target Catalog database
    Targets={
        "S3Targets": [{
            "Path": "s3://my-data-lake/silver/sales/",
            "Exclusions": ["**/_temporary/**", "**/_spark_metadata/**"]
        }]
    },
    TablePrefix="silver_",                  # tables created as silver_sales, etc.
    SchemaChangePolicy={
        # Incremental crawls (CRAWL_NEW_FOLDERS_ONLY) require LOG for both behaviors
        "UpdateBehavior": "LOG",                # log schema changes, don't apply them
        "DeleteBehavior": "LOG"                 # log deleted tables, don't auto-delete
    },
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},  # incremental
    Schedule="cron(0 6 * * ? *)"            # daily at 06:00 UTC
)

# ── Start the crawler on-demand (outside of schedule) ───────────
glue.start_crawler(Name="silver-sales-crawler")

# ── Poll until READY (no built-in waiter, so poll manually) ─────
def wait_for_crawler(name: str, poll_sec: int = 15, timeout: int = 600):
    start = time.time()
    while time.time() - start < timeout:
        crawler = glue.get_crawler(Name=name)["Crawler"]
        state = crawler["State"]              # RUNNING → STOPPING → READY
        print(f"  Crawler state: {state}")
        if state == "READY":
            # READY only means idle; LastCrawl reports how the run ended
            last = crawler.get("LastCrawl", {}).get("Status")
            if last != "SUCCEEDED":
                raise RuntimeError(f"Crawler ended with status {last}")
            print("✅ Crawler finished.")
            return
        time.sleep(poll_sec)
    raise TimeoutError(f"Crawler did not finish within {timeout}s")

wait_for_crawler("silver-sales-crawler")

Incremental Crawling

By default, a crawler re-scans all files on every run (RecrawlBehavior: "CRAWL_EVERYTHING"), which is wasteful for large data lakes where only one new partition arrives daily. Setting RecrawlBehavior: "CRAWL_NEW_FOLDERS_ONLY" makes the crawler skip already-catalogued folders and process only the S3 prefixes added since the last run. On mature data lakes this can cut crawler runtime dramatically, because the work scales with the new data rather than with the whole lake.

💡 When to Use Each Mode

CRAWL_NEW_FOLDERS_ONLY — daily runs after ETL writes new partitions. Fast. Use 99% of the time.
CRAWL_EVERYTHING — monthly/weekly full refresh to catch schema drift or file format changes. Slow but thorough.
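
One way to combine the two modes is to keep a single crawler and switch its policies before each run. The sketch below only builds the keyword arguments; `crawler_policies` and `use_incremental` are hypothetical helpers, the Sunday refresh is an arbitrary choice, and the LOG/LOG schema change policy for incremental runs reflects my understanding of the API's requirement for CRAWL_NEW_FOLDERS_ONLY.

```python
from datetime import date


def crawler_policies(incremental):
    """Return RecrawlPolicy and SchemaChangePolicy kwargs for a crawl mode."""
    if incremental:
        return {
            "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
            # Assumption: incremental crawls only accept LOG for both behaviors
            "SchemaChangePolicy": {"UpdateBehavior": "LOG",
                                   "DeleteBehavior": "LOG"},
        }
    return {
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_EVERYTHING"},
        "SchemaChangePolicy": {"UpdateBehavior": "UPDATE_IN_DATABASE",
                               "DeleteBehavior": "LOG"},
    }


def use_incremental(run_date, full_refresh_weekday=6):
    """Incremental on most days, full refresh once a week (6 = Sunday)."""
    return run_date.weekday() != full_refresh_weekday


weekday_policies = crawler_policies(use_incremental(date(2024, 1, 3)))  # Wednesday
sunday_policies = crawler_policies(use_incremental(date(2024, 1, 7)))   # Sunday
```

With a real client the result would be applied as glue.update_crawler(Name="silver-sales-crawler", **policies) before calling start_crawler().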

Partition Detection and Custom Classifiers

Crawlers automatically detect Hive-style partitions (year=2024/month=01/) and register them. For non-standard formats (custom CSV with unusual delimiters, nested JSON, free-form log lines), you attach a Custom Classifier: a Grok pattern, JSON path expression, XML row tag, or CSV definition that tells the crawler how to read the file and what schema it has.

python — create a custom classifier for pipe-delimited CSV

import boto3

glue = boto3.client("glue")

# Custom CSV classifier — pipe-delimited, quoted, with header
glue.create_classifier(
    CsvClassifier={
        "Name": "pipe-delimited-csv",
        "Delimiter": "|",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",    # first row is header
        "AllowSingleColumn": False
    }
)

# Attach to a crawler
glue.create_crawler(
    Name="legacy-csv-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="raw_landing",
    Targets={"S3Targets": [{"Path": "s3://my-lake/raw/legacy-exports/"}]},
    Classifiers=["pipe-delimited-csv"]   # applies before built-in classifiers
)

⚙️

ETL Jobs — Spark, Python Shell, Job Bookmarks MOST USED ▼

Spark ETL Jobs — The Main Event

Glue's primary job type is a Spark ETL job — Apache Spark running on AWS-managed infrastructure. You provide a PySpark script (uploaded to S3), configure DPUs (Data Processing Units — each DPU = 4 vCPUs + 16 GB RAM), and Glue manages the cluster for you. There's no EC2 to manage, no YARN to configure. Glue also provides the GlueContext wrapper which adds Catalog integration, job bookmarks, and DynamicFrame support on top of the standard SparkContext.

⚡

Spark ETL Job

Full Spark cluster, PySpark or Scala. Best for complex transformations, joins, aggregations on large datasets (>1 GB).

🐍

Python Shell Job

Single-node Python. Best for lightweight tasks: API calls, metadata updates, triggering downstream jobs, small file processing.

🎨

Glue Studio (Visual)

Drag-and-drop no-code/low-code ETL designer. Generates PySpark code. Good for simple pipelines; complex logic still needs code.

python — minimal Glue Spark ETL job structure (the required boilerplate)

import sys
import boto3
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# ── 1. Parse job arguments ───────────────────────────────────────
args = getResolvedOptions(sys.argv, [
    "JOB_NAME",        # always required for job bookmarks
    "source_bucket",
    "target_bucket",
    "env"
])

# ── 2. Initialize Glue and Spark contexts ────────────────────────
sc    = SparkContext()
glue  = GlueContext(sc)
spark = glue.spark_session
job   = Job(glue)
job.init(args["JOB_NAME"], args)          # init required for bookmarks

# ── 3. Read data (standard Spark or DynamicFrame) ────────────────
df = spark.read.parquet(
    f"s3://{args['source_bucket']}/bronze/orders/"
)

# ── 4. Transform ─────────────────────────────────────────────────
df_clean = (df
    .dropDuplicates(["order_id"])
    .filter(F.col("amount") > 0)
    .withColumn("load_date", F.current_date())
)

# ── 5. Write ─────────────────────────────────────────────────────
df_clean.write.mode("overwrite").parquet(
    f"s3://{args['target_bucket']}/silver/orders/"
)

# ── 6. Commit job (required — marks bookmark checkpoint) ─────────
job.commit()
print("Job committed successfully.")

Job Parameters and --conf Overrides

Glue jobs accept user-defined parameters (key-value strings prefixed with --) that your script reads via getResolvedOptions(). You also pass Spark configuration overrides via --conf arguments — for example overriding shuffle partitions, executor memory, or enabling dynamic partition overwrite. Parameters set at job creation are defaults; you can override them per-run via start_job_run(Arguments={...}).

python — launch a Glue job with custom arguments and Spark conf overrides

import boto3

glue = boto3.client("glue")

response = glue.start_job_run(
    JobName="silver-orders-etl",
    Arguments={
        # User-defined parameters, read by getResolvedOptions()
        "--source_bucket": "my-lake-raw",
        "--target_bucket": "my-lake-silver",
        "--env":           "prod",

        # Spark configuration overrides for this run.
        # Glue takes a single --conf key, so extra settings are chained
        # inside the value with " --conf " (a widely used workaround).
        "--conf": (
            "spark.sql.shuffle.partitions=200"
            " --conf spark.sql.sources.partitionOverwriteMode=dynamic"
            " --conf spark.serializer=org.apache.spark.serializer.KryoSerializer"
        ),

        # Glue-specific options
        "--enable-metrics":                   "",      # push metrics to CloudWatch
        "--enable-continuous-cloudwatch-log": "true",  # real-time log streaming
        "--job-bookmark-option":              "job-bookmark-enable"
    },
    # Glue 2.0+ jobs size capacity by worker type and count, not MaxCapacity
    WorkerType="G.1X",      # 1 DPU per worker (4 vCPUs + 16 GB RAM)
    NumberOfWorkers=10      # 10 DPUs total (40 vCPUs + 160 GB RAM)
)

run_id = response["JobRunId"]
print(f"Started job run: {run_id}")
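
Building the Arguments dictionary by hand gets error-prone once several Spark settings are involved. The sketch below shows one way to assemble it; `build_glue_arguments` is a hypothetical helper, and chaining settings with " --conf " inside a single value is a commonly used workaround rather than documented behavior, so treat it as an assumption to verify on your Glue version.

```python
def build_glue_arguments(params, spark_conf=None):
    """Build the Arguments dict for start_job_run().

    params:     user-defined job parameters, without the leading "--"
    spark_conf: Spark settings to pass through the single --conf key
    """
    args = {f"--{name}": str(value) for name, value in params.items()}
    if spark_conf:
        pairs = [f"{key}={value}" for key, value in spark_conf.items()]
        # Only one --conf key is possible in a dict, so chain the rest in the value
        args["--conf"] = " --conf ".join(pairs)
    return args


arguments = build_glue_arguments(
    {"source_bucket": "my-lake-raw", "env": "prod"},
    {"spark.sql.shuffle.partitions": 200,
     "spark.sql.sources.partitionOverwriteMode": "dynamic"},
)
```

The result is passed as start_job_run(JobName=..., Arguments=arguments). All values are converted to strings because Glue job arguments are string key-value pairs.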

Job Bookmarks — Incremental Processing

The Glue Job Bookmark is the most powerful and most misunderstood Glue feature. When enabled, Glue tracks which S3 files (or JDBC rows) the job has already processed. On the next run, it automatically skips previously processed data and reads only new files — giving you incremental ETL without any watermark code. The bookmark state is stored in AWS-managed storage and linked to the job name + run ID chain.

📖 Analogy

A job bookmark is like a real bookmark in a book. Each time your job runs, it opens the book exactly where it left off — no need to re-read chapters you've already processed. If you delete the bookmark (reset it), the next run re-reads the whole book from page 1.

python — Glue job script with bookmark enabled (reads only NEW files)

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args  = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc    = SparkContext()
glue  = GlueContext(sc)
job   = Job(glue)
job.init(args["JOB_NAME"], args)       # ← bookmark state restored here

# Read using DynamicFrame — bookmark tracking works automatically
# Glue will only read files it hasn't seen in previous successful runs
raw_dyf = glue.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-lake/raw/landing/"],
        "recurse": True
    },
    format="parquet",
    transformation_ctx="raw_data"      # ← name used as bookmark key
)

print(f"New records this run: {raw_dyf.count()}")

# ... transform and write ...

job.commit()   # ← bookmark state saved — next run picks up from here

⚠️ Bookmark Gotchas

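Bookmarks work on S3 file modification timestamps; they don't understand Hive partitions or business dates. If you overwrite an existing file, its timestamp changes, so the next run treats it as new and processes it again, which can create duplicates downstream. Bookmarks are also tied to the job name; rename a job and the bookmark history is lost. Before a full reload, reset the bookmark with glue.reset_job_bookmark(JobName=...) or the aws glue reset-job-bookmark CLI command; --job-bookmark-option itself only accepts job-bookmark-enable, job-bookmark-disable, and job-bookmark-pause.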
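
The timestamp behavior is easier to reason about with a toy model. The sketch below is a deliberate simplification, not Glue's actual implementation (which tracks more state than a single timestamp); `files_to_process` is a hypothetical function that selects objects modified after the last committed bookmark.

```python
from datetime import datetime


def files_to_process(objects, last_bookmark):
    """Toy bookmark: pick keys whose last-modified time is after the bookmark.

    objects:       iterable of (key, last_modified) pairs
    last_bookmark: datetime of the last committed run, or None on the first run
    """
    return sorted(
        key for key, modified in objects
        if last_bookmark is None or modified > last_bookmark
    )


# Run 1: nothing processed yet, so every file is read
run1_objects = [("a.parquet", datetime(2024, 1, 1)),
                ("b.parquet", datetime(2024, 1, 2))]
run1 = files_to_process(run1_objects, None)

# Before run 2: a.parquet is overwritten and c.parquet arrives
bookmark = datetime(2024, 1, 2)
run2_objects = [("a.parquet", datetime(2024, 1, 3)),   # rewritten in place
                ("b.parquet", datetime(2024, 1, 2)),   # untouched
                ("c.parquet", datetime(2024, 1, 3))]   # new
run2 = files_to_process(run2_objects, bookmark)
```

In this model the rewritten a.parquet is picked up again in run 2 alongside the new file, which is why in-place overwrites of source files need downstream de-duplication or an idempotent write.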

Partition Management in Glue Jobs

When a Glue Spark job writes partitioned Parquet to S3, the new partition folders appear in S3 but the Glue Catalog doesn't know about them yet. You must register the partitions at the end of the job using either batch_create_partition() (boto3) or spark.sql("MSCK REPAIR TABLE prod_analytics.sales_transactions") — the latter is slower but simpler for many partitions.

python — register new partitions via Spark SQL inside a Glue job

# After writing partitioned data in the Glue job, repair the table
# (this scans S3 and adds any missing partitions to the Glue Catalog)
spark.sql("MSCK REPAIR TABLE prod_analytics.sales_transactions")
print("Partitions registered in Glue Catalog.")

# OR — more efficient for daily jobs: only add today's partition
from datetime import date
today = date.today()
spark.sql(f"""
    ALTER TABLE prod_analytics.sales_transactions
    ADD IF NOT EXISTS PARTITION (year='{today.year}', month='{today.month:02d}')
    LOCATION 's3://my-lake/silver/sales/year={today.year}/month={today.month:02d}/'
""")

Schema Evolution Handling in Glue Jobs

In production, source schemas change — new columns appear, types widen. Glue's DynamicFrame handles schema evolution better than a plain Spark DataFrame because it reads all data into a flexible "dynamic" structure first. For DataFrame-based pipelines, use mergeSchema=True when reading Parquet and handle missing columns explicitly with withColumn defaults.

python — handle schema evolution gracefully in a Glue Spark job

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Read with mergeSchema — handles files with different column sets
df = spark.read.option("mergeSchema", True).parquet(
    "s3://my-lake/raw/sales/"
)

# Safely handle columns that may not exist in older files
if "discount_pct" not in df.columns:
    df = df.withColumn("discount_pct", F.lit(0.0).cast(DoubleType()))

# Cast to expected types in case source widened (e.g. int → bigint)
df = df.withColumn("order_id", F.col("order_id").cast("bigint"))

print(f"Schema after evolution handling: {df.schema.simpleString()}")

Glue Job Metrics — DPU Optimization

Each DPU costs money, so you want enough capacity to run fast without over-provisioning. Glue emits job-level metrics to CloudWatch: glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors reports how many executors the job needed at peak to satisfy its load. If that peak is 3 executors but you allocated 10 DPUs (supporting 9 executors + 1 driver), you're over-provisioned. Profile first with a small allocation such as 2 DPUs, then scale based on actual usage.

DPU Count	vCPUs	RAM	Typical Use Case
2 DPU	8	32 GB	Dev/test, small files (<500 MB)
5 DPU	20	80 GB	Medium daily ETL (1–10 GB)
10 DPU	40	160 GB	Large daily ETL (10–100 GB)
20+ DPU	80+	320+ GB	Full loads, large joins, >100 GB data
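
The right-sizing arithmetic can be sketched as follows. Both helpers are hypothetical, the 20% headroom is an arbitrary assumption, and the price per DPU-hour is a placeholder you should replace with the current rate for your Region. The model assumes one executor per DPU plus one DPU for the driver, matching the "9 executors + 1 driver" example above.

```python
import math


def recommended_workers(peak_needed_executors, headroom=0.2):
    """Suggest a DPU/worker count from the observed peak executor demand."""
    executors = math.ceil(peak_needed_executors * (1 + headroom))
    # +1 for the driver; Spark jobs need at least 2 DPUs
    return max(2, executors + 1)


def run_cost(dpus, runtime_minutes, price_per_dpu_hour):
    """Approximate cost of one run: DPUs x hours x price per DPU-hour."""
    return round(dpus * (runtime_minutes / 60) * price_per_dpu_hour, 2)


PRICE = 0.44  # placeholder price per DPU-hour, not an authoritative figure

suggested = recommended_workers(peak_needed_executors=3)
cost_now = run_cost(10, runtime_minutes=30, price_per_dpu_hour=PRICE)
cost_right_sized = run_cost(suggested, runtime_minutes=30, price_per_dpu_hour=PRICE)
```

The comparison assumes runtime stays the same after downsizing, which holds only when the extra executors were idle. Re-check the metric after each change.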

🔌

Glue Connections — JDBC, VPC Private Resources CONNECTIVITY ▼

JDBC Connections (RDS, Redshift)

To read from or write to RDS, Redshift, or any JDBC source, Glue uses a Connection — a named config object that stores the JDBC URL, credentials reference (Secrets Manager), and VPC network settings. When you run a Glue job that uses a Connection, Glue provisions an elastic network interface in your VPC subnet so the Spark executors can reach the private RDS/Redshift endpoint without traffic leaving AWS.

python — create a Glue JDBC connection for RDS PostgreSQL

import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "prod-rds-postgres",
        "Description": "Production RDS PostgreSQL (analytics database)",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://prod-rds.cluster-abc.us-east-1.rds.amazonaws.com:5432/analytics",
            # Reference the secret by name instead of embedding credentials.
            # The secret is expected to hold "username" and "password" keys.
            "SECRET_ID": "prod/rds/glue-user"
        },
        "PhysicalConnectionRequirements": {
            "SubnetId":               "subnet-0abc123",        # private subnet in your VPC
            "SecurityGroupIdList":     ["sg-0xyz789"],           # SG allowing outbound to RDS
            "AvailabilityZone":        "us-east-1a"
        }
    }
)
print("Glue Connection 'prod-rds-postgres' created.")

python — use the JDBC connection inside a Glue Spark job

# Inside the Glue ETL job script: read from RDS via the JDBC connection
customers_dyf = glue.create_dynamic_frame.from_catalog(
    database="prod_analytics",
    table_name="raw_customers",                  # Glue Catalog table pointing to RDS
    additional_options={
        "jobBookmarkKeys": ["updated_at"],
        "jobBookmarkKeysSortOrder": "asc"
    },
    transformation_ctx="raw_customers_src"       # required for bookmarks to track this source
)

# OR: direct JDBC read with partitioning for parallelism
# (db_user and db_pass are assumed to be loaded from Secrets Manager earlier in the script)
customers_df = (spark.read
    .format("jdbc")
    .option("url",             "jdbc:postgresql://prod-rds:5432/analytics")
    .option("dbtable",         "public.customers")
    .option("user",            db_user)
    .option("password",        db_pass)
    .option("partitionColumn", "customer_id")   # enables parallel reads
    .option("lowerBound",      "1")
    .option("upperBound",      "10000000")
    .option("numPartitions",   "10")            # 10 parallel JDBC tasks
    .option("driver",          "org.postgresql.Driver")
    .load()
)
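
It helps to see what the four partitioning options turn into. Spark splits the range between lowerBound and upperBound into numPartitions strides and issues one query per stride. The sketch below approximates that logic with integer arithmetic; `jdbc_partition_predicates` is a hypothetical function written for illustration, and newer Spark versions compute the stride slightly differently.

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Approximate the WHERE clauses Spark generates for a partitioned JDBC read."""
    if num_partitions < 2:
        return []  # a single unpartitioned query reads everything
    stride = upper // num_partitions - lower // num_partitions
    predicates, current = [], lower
    for i in range(num_partitions):
        lower_clause = f"{column} >= {current}" if i > 0 else None
        current += stride
        upper_clause = f"{column} < {current}" if i < num_partitions - 1 else None
        if lower_clause is None:
            # First partition is open-ended below and also collects NULLs
            predicates.append(f"{upper_clause} OR {column} IS NULL")
        elif upper_clause is None:
            # Last partition is open-ended above
            predicates.append(lower_clause)
        else:
            predicates.append(f"{lower_clause} AND {upper_clause}")
    return predicates


predicates = jdbc_partition_predicates("customer_id", 1, 10_000_000, 10)
```

The first and last predicates are open-ended, so lowerBound and upperBound never filter rows; they only decide where the splits fall. If most customer_id values lie outside the bounds, the edge partitions carry most of the data and the read is skewed.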

VPC Connection for Private Resources

When your source is in a private subnet (no internet access), you need a Glue Connection with the correct subnet and security group configuration. Glue creates an ENI (Elastic Network Interface) in your VPC, which means your Glue executors have a private IP in your VPC and can reach RDS, Redshift, or Kafka on private IP addresses — exactly as if the Spark cluster were physically inside your network.

🔧 Security Group Rule Required

The Glue Connection's security group must have a self-referencing inbound rule (allow all traffic from itself). This is because Glue executors communicate with each other through the ENI — without the self-referencing rule, executor-to-executor shuffle traffic is blocked and jobs hang.

GLUE VPC CONNECTION FLOW

Your VPC (us-east-1)
└── Private Subnet (10.0.1.0/24)
    ├── RDS PostgreSQL: 10.0.1.50:5432
    └── Glue ENI: 10.0.1.100        ← Glue injects this dynamically

AWS-managed Glue Spark Cluster
└── Executors reach 10.0.1.50:5432 via the injected ENI
    ↑ Traffic never leaves the VPC: no NAT, no internet gateway needed

Security Group "glue-connection-sg":
  Inbound:  All traffic from sg-glue-connection-sg (self-reference, REQUIRED)
  Outbound: TCP 5432 to RDS security group
            TCP 5439 to Redshift security group
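
The self-referencing rule is simply an ingress permission whose source is the security group itself. A minimal sketch of the permission structure follows; `self_referencing_ingress` is a hypothetical helper and the group ID is a made-up placeholder.

```python
def self_referencing_ingress(sg_id):
    """Build the IpPermissions list for an allow-all rule sourced from the group itself."""
    return [{
        "IpProtocol": "-1",                        # all protocols and ports
        "UserIdGroupPairs": [{"GroupId": sg_id}],  # source is the same group
    }]


SG_ID = "sg-0123456789abcdef0"  # placeholder ID for illustration
permissions = self_referencing_ingress(SG_ID)
```

With a real EC2 client this would be applied as ec2.authorize_security_group_ingress(GroupId=SG_ID, IpPermissions=permissions).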

✅

Glue Data Quality — Built-In DQ Framework DATA QUALITY ▼

Glue Data Quality Rules

AWS Glue has a native Data Quality service — no third-party library needed. You define rulesets (collections of rules) using DQDL (Data Quality Definition Language), a simple English-like syntax. Rules run against a Glue table or DynamicFrame and produce a score (0–1). You can fail the pipeline if the score drops below your threshold, or just log results to an audit table.

🔢

Completeness

IsComplete "customer_id"
Checks for nulls in a column.

🔑

Uniqueness

IsUnique "txn_id"
Checks for duplicate values.

📅

Freshness

DataFreshness "load_dt" <= 24 hours
Checks recency of data.

📊

Accuracy

ColumnValues "amount" > 0
Checks value range rules.

Rule Types with Examples

dqdl — DQDL ruleset definition (Glue Data Quality language)

# DQDL — Data Quality Definition Language
# This is a string you pass to Glue, not Python

Rules = [
    # Completeness — no nulls in critical columns
    IsComplete "customer_id",
    IsComplete "txn_id",
    IsComplete "amount",

    # Uniqueness — no duplicate transaction IDs
    IsUnique "txn_id",

    # Value ranges — business rules
    ColumnValues "amount" between 0.01 and 1000000,
    ColumnValues "discount_pct" between 0 and 1,

    # Regex matching: basic format checks
    ColumnValues "email" matches "^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z]{2,}$",

    # Allowed values: status must come from a fixed set
    ColumnValues "status" in ["PENDING","COMPLETED","CANCELLED","REFUNDED"],

    # Row count — detect data drops
    RowCount >= 1000,

    # Completeness ratio — allow up to 2% nulls in optional fields
    Completeness "notes" >= 0.98,

    # Data freshness — data should be from last 48 hours
    DataFreshness "created_at" <= 48 hours
]
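
Because a ruleset is just a string, it can be generated from table metadata instead of being written by hand for every table. The sketch below builds a basic ruleset from lists of column names; `build_dqdl` is a hypothetical helper and covers only the IsComplete, IsUnique, and RowCount rule types shown above.

```python
def build_dqdl(required, unique=(), min_rows=None):
    """Generate a DQDL ruleset string from simple column-level expectations."""
    rules = [f'IsComplete "{col}"' for col in required]
    rules += [f'IsUnique "{col}"' for col in unique]
    if min_rows is not None:
        rules.append(f"RowCount >= {min_rows}")
    body = ",\n".join(f"    {rule}" for rule in rules)
    return f"Rules = [\n{body}\n]"


ruleset = build_dqdl(
    required=["customer_id", "txn_id"],
    unique=["txn_id"],
    min_rows=1000,
)
print(ruleset)
```

The generated string has the same shape as a hand-written ruleset and can be passed as the Ruleset argument when creating or updating a Glue Data Quality ruleset.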

Creating and Running a Ruleset via boto3

python — create ruleset, run evaluation, check results, write audit to DynamoDB

import boto3, time, json

glue    = boto3.client("glue")
dynamo  = boto3.resource("dynamodb")
tbl     = dynamo.Table("pipeline-dq-audit")

DQDL_RULES = """
Rules = [
    IsComplete "customer_id",
    IsComplete "txn_id",
    IsUnique "txn_id",
    ColumnValues "amount" between 0.01 and 1000000,
    RowCount >= 1000
]
"""

# ── 1. Create the ruleset (idempotent) ──────────────────────────
try:
    glue.create_data_quality_ruleset(
        Name="sales-transactions-dq",
        Ruleset=DQDL_RULES,
        TargetTable={
            "TableName":     "sales_transactions",
            "DatabaseName":  "prod_analytics"
        }
    )
except glue.exceptions.AlreadyExistsException:
    glue.update_data_quality_ruleset(
        Name="sales-transactions-dq", Ruleset=DQDL_RULES
    )

# ── 2. Run the evaluation ────────────────────────────────────────
run_resp = glue.start_data_quality_ruleset_evaluation_run(
    DataSource={"GlueTable": {
        "TableName": "sales_transactions", "DatabaseName": "prod_analytics"
    }},
    Role="arn:aws:iam::123456789012:role/glue-dq-role",
    RulesetNames=["sales-transactions-dq"],
    NumberOfWorkers=5
)
run_id = run_resp["RunId"]

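# ── 3. Poll until finished ───────────────────────────────────────
TERMINAL = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}
for _ in range(40):
    run_detail = glue.get_data_quality_ruleset_evaluation_run(RunId=run_id)
    status = run_detail["Status"]
    print(f"DQ run status: {status}")
    if status in TERMINAL:
        break
    time.sleep(15)

# Separate "the evaluation itself broke" from "the data scored badly"
if status != "SUCCEEDED":
    raise RuntimeError(f"DQ evaluation run {run_id} ended with status {status}")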

# ── 4. Check rule results ────────────────────────────────────────
results = run_detail.get("ResultIds", [])
overall_pass = True
dq_score     = 0.0

if results:
    result_detail = glue.get_data_quality_result(ResultId=results[0])
    rule_results  = result_detail["RuleResults"]
    score         = result_detail.get("Score", 0.0)
    dq_score      = score

    print(f"\nOverall DQ Score: {score:.2%}")
    for r in rule_results:
        icon = "✅" if r["Result"] == "PASS" else "❌"
        print(f"  {icon} {r['Name']}: {r['Result']} — {r.get('Description','')}")
        if r["Result"] == "FAIL":
            overall_pass = False

# ── 5. Write DQ results to DynamoDB audit table ──────────────────
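from datetime import datetime, timezone
from decimal import Decimal

tbl.put_item(Item={
    "run_id":   run_id,
    "table":    "sales_transactions",
    # timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
    "ts":       datetime.now(timezone.utc).isoformat(),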
    "score":    Decimal(str(round(dq_score, 4))),
    "passed":   overall_pass
})

# ── 6. Fail the pipeline if score below threshold ────────────────
DQ_THRESHOLD = 0.95
if dq_score < DQ_THRESHOLD:
    raise ValueError(
        f"DQ score {dq_score:.2%} below threshold {DQ_THRESHOLD:.0%}. Pipeline halted."
    )

Integrating DQ Checks in the Glue Job Pipeline

The cleanest pattern is to run DQ checks inside the Glue Spark job after writing data to S3 but before registering partitions in the Catalog or sending success alerts. If DQ fails, the partition is never registered — downstream queries simply don't see the bad data, and the job fails with a clear error that triggers an alert.

Read Source → Transform → Write to S3 → Run DQ Checks → DQ Pass? → Register Partition → SNS Alert ✅

⚠️ DQ Fail Path

If DQ fails: do NOT register the partition → write failure record to DynamoDB audit table → publish SNS failure alert → raise exception so Glue marks the job as FAILED → CloudWatch alarm fires → on-call engineer gets paged.
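The pass and fail paths above can be sketched as one small gate function. This is a minimal sketch, not the course's own implementation: `finish_dq_stage` and its three callbacks (`register_partition`, `write_audit`, `publish_alert`) are hypothetical names standing in for your Glue Catalog, DynamoDB, and SNS calls, and the gate uses only the score threshold, as in the evaluation script earlier in this section.

```python
from typing import Callable


def finish_dq_stage(
    score: float,
    failed_rules: list[str],
    threshold: float,
    register_partition: Callable[[], None],   # hypothetical: wraps the Catalog call
    write_audit: Callable[[dict], None],      # hypothetical: wraps DynamoDB put_item
    publish_alert: Callable[[str], None],     # hypothetical: wraps SNS publish
) -> None:
    """Register the partition only when DQ passes; otherwise audit, alert, and raise."""
    passed = score >= threshold
    # The audit record is written on both paths so every run is traceable.
    write_audit({"score": score, "passed": passed, "failed_rules": failed_rules})

    if passed:
        register_partition()
        publish_alert(f"DQ passed with score {score:.2%}")
        return

    # Fail path: the partition is never registered, so readers never see the data.
    publish_alert(f"DQ failed: score {score:.2%}, failed rules: {failed_rules}")
    # Raising is what makes Glue mark the job run as FAILED.
    raise RuntimeError(f"DQ score {score:.2%} is below threshold {threshold:.0%}")
```

Passing the side effects in as callables keeps the ordering logic testable without touching AWS.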

🐍

Glue boto3 API — Complete Quick Reference API REFERENCE

Job Lifecycle — Create, Run, Poll, Stop

python — complete Glue job management with polling

import boto3, time
from botocore.exceptions import ClientError

glue = boto3.client("glue")

# ── CREATE a Glue Spark job ───────────────────────────────────────
glue.create_job(
    Name="silver-orders-etl",
    Role="arn:aws:iam::123456789012:role/glue-etl-role",
    Command={
        "Name":          "glueetl",               # Spark job type
        "ScriptLocation": "s3://my-scripts/etl/silver_orders.py",
        "PythonVersion":  "3"
    },
    GlueVersion="4.0",                         # Spark 3.3 + Python 3.10
    WorkerType="G.1X",                          # 1 DPU per worker
    NumberOfWorkers=10,                         # Glue 2.0+ jobs are sized by workers, not MaxCapacity
    Timeout=60,                                 # minutes before auto-kill
    DefaultArguments={
        "--job-bookmark-option": "job-bookmark-enable",
        "--enable-metrics":      "",
        "--env":                 "prod"
    },
    Connections={"Connections": ["prod-rds-postgres"]}
)

# ── START a job run ───────────────────────────────────────────────
resp   = glue.start_job_run(JobName="silver-orders-etl")
run_id = resp["JobRunId"]
print(f"Started: {run_id}")

# ── POLL until terminal state ─────────────────────────────────────
TERMINAL = {"SUCCEEDED", "FAILED", "ERROR", "TIMEOUT", "STOPPED", "EXPIRED"}
while True:
    run = glue.get_job_run(JobName="silver-orders-etl", RunId=run_id)
    state = run["JobRun"]["JobRunState"]
    print(f"  State: {state}")
    if state in TERMINAL:
        break
    time.sleep(30)

if state != "SUCCEEDED":
    error = run["JobRun"].get("ErrorMessage", "unknown")
    raise RuntimeError(f"Glue job {run_id} {state}: {error}")

# ── LIST run history with paginator ──────────────────────────────
paginator = glue.get_paginator("get_job_runs")
for page in paginator.paginate(JobName="silver-orders-etl"):
    for run in page["JobRuns"]:
        print(f"  {run['JobRunId']}: {run['JobRunState']} — {run.get('CompletedOn','')}")

# ── STOP a running job ────────────────────────────────────────────
glue.batch_stop_job_run(
    JobName="silver-orders-etl",
    JobRunIds=[run_id]
)
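The `while True` polling loop above never gives up if a run hangs. A hedged alternative is a small reusable helper with a deadline. `wait_for_terminal_state` is a hypothetical name, not a boto3 function; it only assumes you can supply a zero-argument callable that returns the current state string.

```python
import time
from typing import Callable, Collection


def wait_for_terminal_state(
    fetch_state: Callable[[], str],
    terminal_states: Collection[str],
    timeout_s: float = 3600.0,
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll fetch_state() until it returns a terminal state or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        state = fetch_state()
        if state in terminal_states:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"still {state!r} after {timeout_s:.0f}s")
        sleep(interval_s)


# Usage with the Glue client from the block above (not executed here):
# state = wait_for_terminal_state(
#     lambda: glue.get_job_run(JobName="silver-orders-etl", RunId=run_id)["JobRun"]["JobRunState"],
#     TERMINAL,
# )
```

The `sleep` and `clock` parameters exist only so the helper can be unit-tested without real waiting.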

All Key Glue APIs — Pills Reference

Databases: create_database() · get_database() · delete_database()
Tables: create_table() · get_table() · update_table() · delete_table() · get_tables() (paginator)
Partitions: create_partition() · batch_create_partition() · get_partitions() (paginator) · delete_partition()
Crawlers: create_crawler() · start_crawler() · get_crawler() · stop_crawler()
Jobs: create_job() · start_job_run() · get_job_run() · get_job_runs() (paginator) · batch_stop_job_run()
Data Quality: create_data_quality_ruleset() · start_data_quality_ruleset_evaluation_run() · get_data_quality_ruleset_evaluation_run() · get_data_quality_result()

☁️ 29.6 Summary

Data Catalog: the central schema registry for your entire data lake; Athena, EMR, and Redshift Spectrum all read from it.
Crawlers: auto-discover schemas; use CRAWL_NEW_FOLDERS_ONLY for daily incremental runs.
Spark ETL Jobs: managed Spark with no cluster management; pass params via -- arguments and override Spark conf inline.
Job Bookmarks: incremental processing without watermark code, but understand their file-timestamp-based mechanics.
Connections: JDBC + VPC config for private RDS/Redshift access; always add a self-referencing SG rule.
Glue Data Quality: write DQDL rulesets, run evaluations after every ETL load, fail the pipeline on a score drop, and write results to a DynamoDB audit table.

29.7

AWS EMR — Spark in the Cloud

EMR (Elastic MapReduce) is AWS's managed big-data platform. It lets you run Apache Spark (and Hadoop, Hive, Presto, etc.) on auto-provisioned EC2 clusters — or completely serverlessly — without managing JVMs, YARN configs, or OS patches yourself. As a Data Engineer, EMR is where your PySpark jobs run at production scale on AWS.

🏗️

EMR Architecture — Master, Core, Task Nodes CORE CONCEPT

Three Node Types

Every EMR cluster is made up of three kinds of nodes, each playing a distinct role:

Node Type	Role	What Runs Here	Can You Lose It?
Master Node	Brain of the cluster	YARN ResourceManager, HDFS NameNode, Spark Driver (client mode), EMR control daemons	No — cluster dies
Core Nodes	Workers + HDFS storage	YARN NodeManager, HDFS DataNode, Spark Executors	No — HDFS data loss
Task Nodes	Extra compute only	YARN NodeManager, Spark Executors — no HDFS	Yes — safe for Spot

🏭 Analogy

The master node is the factory manager who knows where every worker is and assigns tasks. Core nodes are permanent workers who also store raw materials (HDFS data). Task nodes are temp contractors brought in for a big rush — if they quit suddenly, no raw materials are lost, so it's safe to use cheap Spot pricing for them.

⚡ Real-World Tip

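In practice, if you use S3 as your storage layer (which you almost always should on AWS), core nodes hold no business data; their HDFS carries only scratch data such as YARN logs and intermediate files. That makes core nodes far more tolerant of Spot and can cut cluster cost dramatically. Losing a core node can still kill a YARN application master or Spark driver running on it, so many teams keep a small on-demand core group and put the bulk of their capacity on Spot task nodes. The master node should always stay on-demand.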

Cluster vs EMR Serverless vs EMR on EKS

AWS offers three ways to run Spark with EMR — each with a different trade-off between control and operational overhead:

🖥️

EMR on EC2 (Classic)

Full cluster you manage. Choose instance types, configure YARN, use bootstrap actions. Best when you need maximum control or custom libraries.

⚡

EMR Serverless

No cluster to manage. Submit jobs, AWS allocates resources, bills you per vCPU-second. Best for variable workloads where cluster management is a burden.

☸️

EMR on EKS

Runs Spark as Kubernetes pods inside your existing EKS cluster. Best when your org is already Kubernetes-first and wants unified infra.

Architecture Diagram

EMR CLUSTER ARCHITECTURE

┌─────────────────────────────────────────────────────────────────┐
│                           EMR CLUSTER                           │
│                                                                 │
│  ┌──────────────────┐                                           │
│  │   MASTER NODE    │  ← YARN ResourceManager                   │
│  │  (m5.xlarge×1)   │  ← Spark Driver (client mode)             │
│  │                  │  ← EMR control daemons                    │
│  └────────┬─────────┘                                           │
│           │ dispatches tasks                                    │
│    ┌──────┴──────┐                                              │
│    ▼             ▼                                              │
│  ┌──────────┐ ┌──────────┐   ← CORE NODES (on-demand)           │
│  │core-1    │ │core-2    │     YARN NodeManager + Executors     │
│  │r5.4x×2   │ │r5.4x×2   │     HDFS DataNode (if used)          │
│  └──────────┘ └──────────┘                                      │
│                                                                 │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐  ← TASK NODES (Spot)    │
│  │task-1    │ │task-2    │ │task-3    │    Executors only       │
│  │r5.4x×1   │ │r5.4x×1   │ │r5.4x×1   │    No HDFS              │
│  └──────────┘ └──────────┘ └──────────┘    Safe to terminate    │
└─────────────────────────────────────────────────────────────────┘

All nodes read/write to S3 via EMRFS (EMR File System).
S3 acts as the persistent storage layer and survives cluster termination.

✨

Spark on EMR — Cluster Mode vs Client Mode MOST IMPORTANT

Client Mode vs Cluster Mode

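When you submit a Spark job to EMR, you choose where the Spark Driver runs: inside a YARN container on one of the cluster's worker nodes (cluster mode) or on the machine doing the submission (client mode). For an EMR step, the submitting machine is the master node.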

Deploy Mode	Driver Location	Stdout/Logs	Use When
cluster	Runs inside YARN on a cluster node	Must fetch from YARN logs	Production — submitting and disconnecting
client	Runs on the machine submitting the job	Streams to your terminal	Dev/debug — interactive feedback needed

🔑 Production Rule

Always use --deploy-mode cluster in production. In client mode, if the machine submitting the job dies (e.g. your laptop or Airflow worker), the Driver dies and your job fails. Cluster mode keeps the Driver inside YARN, isolated from the submitter.

Submitting a Spark Step via AWS Console / CLI

bash — add a Spark step via AWS CLI

# Submit a PySpark job to a running EMR cluster
aws emr add-steps \
  --cluster-id j-2AXXXXXXGAPLF \
  --steps '[{
    "Name": "Daily Sales ETL",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
      "Jar": "command-runner.jar",
      "Args": [
        "spark-submit",
        "--deploy-mode", "cluster",
        "--conf", "spark.executor.memory=8g",
        "--conf", "spark.executor.cores=4",
        "--conf", "spark.sql.shuffle.partitions=200",
        "s3://my-code-bucket/scripts/daily_sales_etl.py",
        "--date", "2024-01-15",
        "--env",  "prod"
      ]
    }
  }]'

📦 What is command-runner.jar?

command-runner.jar is a special EMR jar that lets an EMR step run a command on the master node. When you pass spark-submit as the first Arg, EMR runs the actual spark-submit binary on the cluster with the remaining Args. It's the standard way to submit Spark steps on EMR, and you'll see it in most real pipelines.
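The same step can be built in Python and sent with boto3's `add_job_flow_steps`. The sketch below assumes a hypothetical helper named `build_spark_step`; only the dictionary shape (`Name`, `ActionOnFailure`, `HadoopJarStep`) comes from the EMR API, and the helper always uses cluster deploy mode, following the production rule above.

```python
def build_spark_step(name, script_uri, script_args=(), spark_conf=None,
                     action_on_failure="CONTINUE"):
    """Return an EMR step dict that runs spark-submit through command-runner.jar."""
    args = ["spark-submit", "--deploy-mode", "cluster"]
    # Each Spark setting becomes its own "--conf key=value" pair.
    for key, value in (spark_conf or {}).items():
        args += ["--conf", f"{key}={value}"]
    args.append(script_uri)          # the script must come after all spark-submit options
    args.extend(script_args)         # everything after the script is passed to the script
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


step = build_spark_step(
    "Daily Sales ETL",
    "s3://my-code-bucket/scripts/daily_sales_etl.py",
    script_args=["--date", "2024-01-15", "--env", "prod"],
    spark_conf={"spark.executor.memory": "8g", "spark.executor.cores": "4"},
)

# Submitting it (not executed here; needs AWS credentials and a running cluster):
# import boto3
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-2AXXXXXXGAPLF", Steps=[step])
```

Note that `add_job_flow_steps` identifies the cluster with `JobFlowId`, while most other EMR calls use `ClusterId`.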

EMRFS — S3 as Your Persistent Storage Layer

On regular Hadoop, data lives in HDFS, which means it's inside the cluster and disappears when the cluster terminates. EMR adds EMRFS (EMR File System), a connector that exposes S3 to Spark as a Hadoop-compatible filesystem, so S3 can take over from HDFS as the persistent store. Your PySpark code reads s3://... paths the same way it would read local or HDFS paths. Data persists on S3 even after the cluster is terminated.

python — PySpark on EMR reads from S3 transparently

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SalesETL").getOrCreate()

# On EMR, s3:// paths work exactly like local paths
# EMRFS handles the translation transparently
df = spark.read.parquet("s3://my-data-lake/bronze/sales/year=2024/")

# Alias the aggregate so the output column is not named "sum(revenue)"
result = df.groupBy("region").agg(F.sum("revenue").alias("total_revenue"))

# Write results back to S3 — persists after cluster terminates
result.write.mode("overwrite").parquet(
    "s3://my-data-lake/gold/sales_by_region/year=2024/"
)

spark.stop()

Configuring Spark on EMR — spark-defaults and Classification Overrides

You can tune Spark settings at cluster launch time using EMR's Configurations — structured JSON that overrides config files like spark-defaults.conf:

json — EMR cluster configuration overrides (spark-defaults)

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.sql.shuffle.partitions":        "400",
      "spark.default.parallelism":           "400",
      "spark.executor.memory":              "8g",
      "spark.executor.cores":               "4",
      "spark.driver.memory":                "4g",
      "spark.dynamicAllocation.enabled":    "true",
      "spark.sql.adaptive.enabled":         "true",
      "spark.serializer":                   "org.apache.spark.serializer.KryoSerializer"
    }
  },
  {
    "Classification": "spark-env",
    "Configurations": [{
      "Classification": "export",
      "Properties": {
        "PYSPARK_PYTHON": "/usr/bin/python3"
      }
    }]
  }
]
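When you pass these overrides through boto3 instead of JSON, every property value must be a string, which is easy to get wrong with numbers and booleans. A minimal sketch of a converter follows; `emr_classification` is a hypothetical helper name, and only the `Classification` and `Properties` keys come from the EMR configuration format shown above.

```python
def emr_classification(classification: str, properties: dict) -> dict:
    """Build one EMR configuration entry, converting every value to a string."""

    def as_text(value) -> str:
        # Spark expects lowercase "true"/"false", not Python's "True"/"False".
        if isinstance(value, bool):
            return "true" if value else "false"
        return str(value)

    return {
        "Classification": classification,
        "Properties": {key: as_text(value) for key, value in properties.items()},
    }


spark_defaults = emr_classification("spark-defaults", {
    "spark.sql.shuffle.partitions": 400,
    "spark.executor.memory": "8g",
    "spark.sql.adaptive.enabled": True,
})
# The result can go into run_job_flow(Configurations=[spark_defaults], ...)
```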

⚡

EMR Serverless — No Cluster Management MODERN PATTERN

What is EMR Serverless and When to Use It

With classic EMR on EC2, you provision a cluster, wait for it to start (3–8 mins), run jobs, then terminate it. EMR Serverless eliminates all of that — you create an Application (a logical container), then submit job runs. AWS automatically allocates vCPUs and memory for each run, scales them during execution, and releases them when the job ends. You pay only for actual vCPU-seconds and GB-seconds used.

Aspect	EMR on EC2 (Classic)	EMR Serverless
Cluster startup	3–8 minutes	~30 seconds
Node management	You choose instance types	Fully managed by AWS
Pricing	Pay per EC2 instance-hour	Pay per vCPU-second + GB-second
Custom libs / bootstrap	Full control via bootstrap	Via custom image or --py-files
Best for	Predictable, large batch workloads	Variable, on-demand Spark jobs
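To make the pricing row concrete, here is a small arithmetic sketch of how a per-second bill adds up. The function name `serverless_job_cost` is hypothetical and the rates in the example are made-up placeholders, not AWS prices; look up the current rates for your region before relying on any number.

```python
def serverless_job_cost(vcpu_seconds: float, gb_seconds: float,
                        price_per_vcpu_hour: float, price_per_gb_hour: float) -> float:
    """Cost of one job run billed on vCPU-seconds and GB-seconds actually used."""
    vcpu_hours = vcpu_seconds / 3600
    gb_hours = gb_seconds / 3600
    return vcpu_hours * price_per_vcpu_hour + gb_hours * price_per_gb_hour


# Example: 16 vCPUs and 64 GB in use for 30 minutes (1800 seconds).
example_cost = serverless_job_cost(
    vcpu_seconds=16 * 1800,
    gb_seconds=64 * 1800,
    price_per_vcpu_hour=0.05,    # placeholder rate, NOT a real AWS price
    price_per_gb_hour=0.005,     # placeholder rate, NOT a real AWS price
)
print(f"Estimated cost with placeholder rates: ${example_cost:.2f}")
```

The point of the calculation is that nothing is billed between runs, which is where an always-on cluster loses money on spiky workloads.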

EMR Serverless — Create Application and Submit a Job

python — EMR Serverless full workflow with boto3

import boto3, time

emr = boto3.client("emr-serverless", region_name="us-east-1")

# ── Step 1: Create an Application (one-time setup) ──────────────
app = emr.create_application(
    name="spark-etl-app",
    releaseLabel="emr-6.15.0",         # EMR runtime version
    type="SPARK",
    autoStartConfiguration={"enabled": True},
    autoStopConfiguration={
        "enabled": True,
        "idleTimeoutMinutes": 15          # auto-stop if idle 15 min
    },
    maximumCapacity={                        # cost guard-rail
        "cpu":    "200 vCPU",
        "memory": "1000 GB"
    }
)
app_id = app["applicationId"]
print(f"Application created: {app_id}")

# ── Step 2: Start the Application ───────────────────────────────
emr.start_application(applicationId=app_id)
# Wait until it's in STARTED state
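while True:
    state = emr.get_application(applicationId=app_id)["application"]["state"]
    if state == "STARTED":
        break
    if state in ("STOPPED", "TERMINATED"):
        # Bail out instead of looping forever if the application never starts
        raise RuntimeError(f"Application {app_id} entered {state} instead of STARTED")
    time.sleep(5)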

# ── Step 3: Submit a Job Run ─────────────────────────────────────
job = emr.start_job_run(
    applicationId=app_id,
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-code-bucket/scripts/daily_etl.py",
            "entryPointArguments": ["--date", "2024-01-15"],
            "sparkSubmitParameters": (
                "--conf spark.executor.cores=4 "
                "--conf spark.executor.memory=8g "
                "--conf spark.sql.shuffle.partitions=200"
            )
        }
    },
    configurationOverrides={
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://my-emr-logs/serverless/"
            }
        }
    },
    name="daily-etl-2024-01-15"
)
job_run_id = job["jobRunId"]
print(f"Job submitted: {job_run_id}")

# ── Step 4: Poll until complete ──────────────────────────────────
while True:
    run = emr.get_job_run(applicationId=app_id, jobRunId=job_run_id)
    state = run["jobRun"]["state"]
    print(f"State: {state}")
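    if state in ("SUCCESS", "FAILED", "CANCELLED"):
        break
    time.sleep(20)

# Surface failures to the caller instead of only printing them
if state != "SUCCESS":
    details = run["jobRun"].get("stateDetails", "no details")
    raise RuntimeError(f"Job run {job_run_id} ended {state}: {details}")
print(f"Final state: {state}")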

📈

Autoscaling — Managed Scaling & Instance Group Scaling COST CONTROL

Managed Scaling (Recommended)

EMR's Managed Scaling continuously monitors YARN metrics and automatically adds or removes instances to match workload demand. You set a min and max, and EMR does the rest — no custom CloudWatch alarms or scale-in policies needed.

python — enable managed scaling on a cluster

import boto3

emr = boto3.client("emr")

emr.put_managed_scaling_policy(
    ClusterId="j-2AXXXXXXGAPLF",
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType":            "Instances",
            "MinimumCapacityUnits": 2,    # minimum 2 core/task nodes
            "MaximumCapacityUnits": 20,   # scale up to 20 nodes
            "MaximumOnDemandCapacityUnits": 5, # max on-demand (rest Spot)
            "MaximumCoreCapacityUnits": 5  # core nodes stay small
        }
    }
)

💡 Best Practice

Set MaximumOnDemandCapacityUnits low (e.g. 5) and let the rest scale with Spot instances. This caps your on-demand cost while allowing burst capacity at Spot pricing (~70% cheaper).
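A small builder can catch inconsistent limits before the API call is made. This is a sketch under assumptions: `managed_scaling_policy` is a hypothetical helper, and the sanity checks are this example's own rules (minimum not above maximum, caps not above the maximum), not a statement of what the EMR service validates.

```python
def managed_scaling_policy(min_units, max_units, max_on_demand=None, max_core=None,
                           unit_type="Instances"):
    """Build the ManagedScalingPolicy dict for put_managed_scaling_policy()."""
    if min_units > max_units:
        raise ValueError("min_units must not exceed max_units")
    limits = {
        "UnitType": unit_type,
        "MinimumCapacityUnits": min_units,
        "MaximumCapacityUnits": max_units,
    }
    optional_caps = (
        ("MaximumOnDemandCapacityUnits", max_on_demand),  # capacity above this is Spot
        ("MaximumCoreCapacityUnits", max_core),           # capacity above this is task nodes
    )
    for key, value in optional_caps:
        if value is None:
            continue
        if not 0 <= value <= max_units:
            raise ValueError(f"{key} must be between 0 and max_units")
        limits[key] = value
    return {"ComputeLimits": limits}


policy = managed_scaling_policy(2, 20, max_on_demand=5, max_core=5)
# emr.put_managed_scaling_policy(ClusterId="j-2AXXXXXXGAPLF", ManagedScalingPolicy=policy)
```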

Instance Group vs Instance Fleet

EMR offers two ways to define node pools. Instance Groups use a single instance type per group (simpler). Instance Fleets let you specify multiple instance types, and EMR picks whichever has Spot availability (more resilient, better Spot fulfillment).

✅ Production Pattern

Use Instance Fleets for task nodes in production — specify 5–6 similar instance types (e.g. r5.4xlarge, r5a.4xlarge, r4.4xlarge, m5.8xlarge) so EMR can always find Spot capacity somewhere.

💰

Spot Instances — Cost Reduction for Task Nodes COST SAVING

What are Spot Instances and Why Use Them

AWS Spot instances are spare EC2 capacity sold at up to 90% discount versus on-demand pricing. The catch: AWS can reclaim them with 2-minute notice if capacity is needed elsewhere. For task nodes (no HDFS data), this is perfectly safe — if a task node is reclaimed, YARN simply reschedules those tasks onto surviving nodes. You lose some compute time but no data.

python — run_job_flow with Spot task nodes

import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="production-spark-cluster",
    ReleaseLabel="emr-7.1.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceFleets": [
            {
                "Name": "MasterFleet",
                "InstanceFleetType": "MASTER",
                "TargetOnDemandCapacity": 1,  # Master always on-demand
                "InstanceTypeConfigs": [
                    {"InstanceType": "m5.xlarge"}
                ]
            },
            {
                "Name": "CoreFleet",
                "InstanceFleetType": "CORE",
                "TargetOnDemandCapacity": 2,   # Core on-demand
                "InstanceTypeConfigs": [
                    {"InstanceType": "r5.4xlarge", "WeightedCapacity": 1},
                    {"InstanceType": "r5a.4xlarge", "WeightedCapacity": 1}
                ]
            },
            {
                "Name": "TaskFleet",
                "InstanceFleetType": "TASK",
                "TargetSpotCapacity": 10,  # Task nodes on Spot!
                "InstanceTypeConfigs": [
                    # Multiple types = better Spot availability
                    {"InstanceType": "r5.4xlarge",  "WeightedCapacity": 1},
                    {"InstanceType": "r5a.4xlarge", "WeightedCapacity": 1},
                    {"InstanceType": "r4.4xlarge",  "WeightedCapacity": 1},
                    {"InstanceType": "m5.8xlarge",  "WeightedCapacity": 2}
                ],
                "LaunchSpecifications": {
                    "SpotSpecification": {
                        "TimeoutDurationMinutes": 5,
                        "TimeoutAction": "SWITCH_TO_ON_DEMAND"  # fallback
                    }
                }
            }
        ],
        "Ec2SubnetIds": ["subnet-abc123", "subnet-def456"],  # EMR picks one subnet (one AZ) from this list
        "KeepJobFlowAliveWhenNoSteps": True,  # no Steps here, so stay up; the idle policy below shuts it down
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    AutoTerminationPolicy={"IdleTimeout": 3600}  # terminate if idle 1 hr
)

cluster_id = response["JobFlowId"]
print(f"Cluster started: {cluster_id}")
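The `WeightedCapacity` values in the fleet above decide how many instances are needed to reach a target. A quick arithmetic sketch (the function name `instances_to_fill` is hypothetical): each instance contributes its weight in units, so the count is the target divided by the weight, rounded up, and the provisioned units can overshoot the target when the weight does not divide it evenly.

```python
import math


def instances_to_fill(target_units: int, weighted_capacity: int) -> tuple[int, int]:
    """Return (instance_count, units_provided) if the fleet used only this instance type."""
    if target_units < 0 or weighted_capacity < 1:
        raise ValueError("target_units must be >= 0 and weighted_capacity >= 1")
    count = math.ceil(target_units / weighted_capacity)
    return count, count * weighted_capacity


# TargetSpotCapacity=10 from the task fleet above:
for instance_type, weight in [("r5.4xlarge", 1), ("m5.8xlarge", 2)]:
    count, units = instances_to_fill(10, weight)
    print(f"{instance_type}: {count} instances provide {units} units")
```

In a real fleet EMR mixes instance types, so treat this as a per-type upper bound rather than a prediction of what gets launched.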

Spot Interruption Handling

When AWS reclaims a Spot instance, it gives a 2-minute interruption notice. YARN detects the node disappearing and automatically reschedules tasks that were running on it. To minimize disruption:

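Enable Spark's external shuffle service (on by default on EMR): shuffle files are served by a service on each node's NodeManager rather than by the executor, so they survive when an executor exits, for example when dynamic allocation removes it. They do not survive the loss of the node itself; after a Spot reclaim, Spark recomputes the lost shuffle output.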
Set spark.stage.maxConsecutiveAttempts higher (default 4) if stages frequently fail due to Spot interruptions.
Use multiple Spot instance types in Instance Fleets — if one type's pool is exhausted, EMR tries another.
Set TimeoutAction: SWITCH_TO_ON_DEMAND so the fleet falls back to on-demand if Spot is unavailable at launch.

🚀

Bootstrap Actions — Custom Setup Before Spark Starts CUSTOMIZATION

What are Bootstrap Actions

Bootstrap Actions are shell scripts that run on every node in the cluster before YARN and Spark start. They're your chance to install Python packages, set OS-level config, download files from S3, or configure the environment. Think of it like yum install + pip install (EMR nodes run Amazon Linux) that runs on every machine in the cluster at launch time.

✅ Common Bootstrap Use Cases

Installing Python libraries (e.g. pandas, boto3, pyarrow, great-expectations) · Mounting EFS or NFS · Copying config files from S3 · Installing system packages · Setting environment variables

Writing a Bootstrap Script

bash — bootstrap.sh (uploaded to S3, referenced at cluster launch)

#!/bin/bash
# Bootstrap script — runs on every node before EMR services start
set -ex  # exit on error, print each command

# Update pip
sudo pip3 install --upgrade pip

# Install Python dependencies your PySpark job needs
sudo pip3 install \
    pyarrow==14.0.1 \
    great-expectations==0.18.0 \
    boto3==1.34.0 \
    tenacity==8.2.3 \
    psycopg2-binary==2.9.9

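# Download a shared config file from S3
# (bootstrap actions run as the hadoop user, so create the directory with sudo first)
sudo mkdir -p /etc/pipeline
sudo aws s3 cp s3://my-config-bucket/pipeline.yaml /etc/pipeline/pipeline.yaml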

# Set environment variable for all processes
echo 'export PIPELINE_ENV=prod' | sudo tee -a /etc/environment

echo "Bootstrap complete!"

python — referencing bootstrap in run_job_flow()

import boto3

# First, upload your bootstrap script to S3
s3 = boto3.client("s3")
s3.upload_file("bootstrap.sh", "my-code-bucket", "bootstrap/bootstrap.sh")

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="cluster-with-bootstrap",
    ReleaseLabel="emr-7.1.0",
    Applications=[{"Name": "Spark"}],
    BootstrapActions=[
        {
            "Name": "Install Python Dependencies",
            "ScriptBootstrapAction": {
                "Path": "s3://my-code-bucket/bootstrap/bootstrap.sh",
                "Args": []  # optional args passed to the script
            }
        }
    ],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core",   "InstanceRole": "CORE",
             "InstanceType": "r5.4xlarge", "InstanceCount": 4}
        ],
        "Ec2SubnetId": "subnet-abc123",
        "KeepJobFlowAliveWhenNoSteps": False
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-emr-logs/"
)

print(f"Cluster: {response['JobFlowId']}")

🔬

EMR Studio — Interactive Notebooks for Debugging DEV TOOL

What is EMR Studio

EMR Studio is a managed Jupyter-based IDE hosted by AWS. You connect it to a running EMR cluster (or EMR Serverless application) and run PySpark code interactively — exactly like a local Jupyter notebook, but with the full cluster compute behind it. It's invaluable for:

Exploring large datasets on S3 interactively
Debugging Spark jobs — checking DataFrames mid-transformation
Profiling slow queries with the built-in Spark UI link
Prototyping logic before packaging it into a production script

🔑 Key Capability

EMR Studio notebooks have a built-in Spark UI tab — you can see DAGs, stage timings, and task metrics for every cell you run, making it the best debugging environment for Spark on AWS.

💵

EMR Cost Optimization Patterns PRODUCTION

The Golden Rules of EMR Cost Control

Strategy	How to Implement	Typical Saving
Auto-terminate clusters	Set `KeepJobFlowAliveWhenNoSteps=False` and `AutoTerminationPolicy`	Eliminates idle cluster cost
Spot task nodes	Instance Fleets with `TargetSpotCapacity` on task fleet	50–80% off task node cost
Right-size instances	Monitor YARN container utilization, pick r5/r6 for memory-heavy Spark	20–40% off wasted capacity
Reserved Instances	Use 1-year RIs for core nodes in always-on clusters	~30% vs on-demand
EMR Serverless	For variable/infrequent jobs — pay only during job execution	Eliminates idle time entirely

Complete Production EMR Launch Pattern

python — production-grade EMR cluster + step + auto-terminate

import boto3, time

emr = boto3.client("emr", region_name="us-east-1")

# ── Launch cluster with embedded steps ──────────────────────────
response = emr.run_job_flow(
    Name="daily-etl-2024-01-15",
    ReleaseLabel="emr-7.1.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {
                "Name": "Master", "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge", "InstanceCount": 1,
                "Market": "ON_DEMAND"
            },
            {
                "Name": "Core", "InstanceRole": "CORE",
                "InstanceType": "r5.4xlarge", "InstanceCount": 4,
                "Market": "ON_DEMAND"
            },
            {
                "Name": "Task", "InstanceRole": "TASK",
                "InstanceType": "r5.4xlarge", "InstanceCount": 8,
                "Market": "SPOT",
                "BidPrice": "0.50"  # max Spot bid
            }
        ],
        "Ec2SubnetId": "subnet-abc123",
        "KeepJobFlowAliveWhenNoSteps": False,  # ← auto-terminate!
        "TerminationProtected": False
    },
    Steps=[
        {
            "Name": "Bronze ETL",
            "ActionOnFailure": "TERMINATE_CLUSTER",  # fail fast
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit", "--deploy-mode", "cluster",
                    "--conf", "spark.sql.shuffle.partitions=400",
                    "s3://my-code/scripts/bronze_etl.py",
                    "--date", "2024-01-15"
                ]
            }
        },
        {
            "Name": "Silver Transform",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit", "--deploy-mode", "cluster",
                    "s3://my-code/scripts/silver_transform.py",
                    "--date", "2024-01-15"
                ]
            }
        }
    ],
    BootstrapActions=[{
        "Name": "Install libs",
        "ScriptBootstrapAction": {
            "Path": "s3://my-code/bootstrap/setup.sh"
        }
    }],
    Configurations=[{
        "Classification": "spark-defaults",
        "Properties": {
            "spark.sql.adaptive.enabled": "true",
            "spark.dynamicAllocation.enabled": "true"
        }
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-emr-logs/clusters/",
    AutoTerminationPolicy={"IdleTimeout": 3600},
    Tags=[
        {"Key": "Project",     "Value": "DataLake"},
        {"Key": "Environment", "Value": "prod"},
        {"Key": "CostCenter",   "Value": "DE-Team"}  # for billing reports
    ]
)

cluster_id = response["JobFlowId"]
print(f"Cluster launched: {cluster_id}")
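print("Will auto-terminate after steps complete.")

# ── Poll until cluster terminates ───────────────────────────────
while True:
    cluster = emr.describe_cluster(ClusterId=cluster_id)
    state   = cluster["Cluster"]["Status"]["State"]
    print(f"  Cluster state: {state}")
    if state in ("TERMINATED", "TERMINATED_WITH_ERRORS"):
        break
    time.sleep(30)

# A failed step with TERMINATE_CLUSTER ends the cluster in TERMINATED_WITH_ERRORS
if state == "TERMINATED_WITH_ERRORS":
    reason = cluster["Cluster"]["Status"].get("StateChangeReason", {}).get("Message", "unknown")
    raise RuntimeError(f"Cluster {cluster_id} terminated with errors: {reason}")
print("Done.")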
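Once the cluster has terminated, the step list tells you which step actually failed. The sketch below uses the real `list_steps` call and its `Marker` pagination; the helper name `failed_steps` and the choice of which states count as failures are this example's assumptions. The client is passed in, so create it with `boto3.client("emr")` in real use.

```python
def failed_steps(emr_client, cluster_id: str) -> list[tuple[str, str, str]]:
    """Return (step name, state, failure message) for every step that did not complete."""
    bad_states = {"FAILED", "CANCELLED", "INTERRUPTED"}
    failures = []
    kwargs = {"ClusterId": cluster_id}
    while True:
        page = emr_client.list_steps(**kwargs)
        for step in page["Steps"]:
            status = step["Status"]
            if status["State"] in bad_states:
                # FailureDetails is only present when EMR recorded a cause.
                message = status.get("FailureDetails", {}).get("Message", "")
                failures.append((step["Name"], status["State"], message))
        marker = page.get("Marker")
        if not marker:               # no Marker means this was the last page
            return failures
        kwargs["Marker"] = marker


# Usage (not executed here):
# import boto3
# for name, state, message in failed_steps(boto3.client("emr"), cluster_id):
#     print(f"{name}: {state} {message}")
```

With `ActionOnFailure` set to `TERMINATE_CLUSTER`, later steps typically show up as cancelled, so the first failed entry in run order is usually the root cause.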

📋

EMR Quick Reference — Decision Guide SUMMARY

What to Choose When

EMR DECISION GUIDE

Q1: How predictable is your workload?
    Always running (24/7 cluster)?  → EMR on EC2 + Reserved Instances
    Scheduled daily batch jobs?     → EMR on EC2, auto-terminate + Spot tasks
    Variable / on-demand jobs?      → EMR Serverless

Q2: Do you need custom libraries / OS packages?
    Yes (specific Python versions, native libs)?  → EMR on EC2 + Bootstrap
    No (standard Spark, pip install okay)?        → EMR Serverless

Q3: Are you already on Kubernetes?
    Yes?  → Consider EMR on EKS for unified infra
    No?   → EMR on EC2 or EMR Serverless

Q4: Cost sensitivity?
    Max savings?  → Spot Instance Fleets (task nodes) + auto-terminate
    Simplest?     → EMR Serverless (no idle cost, no cluster mgmt)

Q5: Is data in HDFS or S3?
    S3 (the usual choice on AWS)?  → Task nodes on Spot, core nodes can tolerate Spot; keep the master on-demand
    HDFS?                          → Core nodes must be on-demand

Essential EMR boto3 Calls — Cheat Sheet

Operation	boto3 Call	Key Parameter
Launch cluster	`emr.run_job_flow()`	`Instances`, `Steps`, `BootstrapActions`
Add steps	`emr.add_job_flow_steps()`	`JobFlowId`, `Steps`
Check cluster state	`emr.describe_cluster()`	`ClusterId` → `Status.State`
Check step state	`emr.describe_step()`	`ClusterId`, `StepId` → `Status.State`
List clusters	`emr.list_clusters()` + paginator	`ClusterStates=["RUNNING"]`
Terminate cluster	`emr.terminate_job_flows()`	`JobFlowIds=[cluster_id]`
Enable managed scaling	`emr.put_managed_scaling_policy()`	`ManagedScalingPolicy`
Serverless — submit job	`emr_serverless.start_job_run()`	`applicationId`, `jobDriver`
Serverless — poll state	`emr_serverless.get_job_run()`	`applicationId`, `jobRunId`

29.8

Amazon Athena — Serverless SQL on S3

Athena is AWS's serverless, interactive query service. You point it at files on S3 — Parquet, ORC, JSON, CSV — and run standard SQL against them with no cluster to manage, no data to load, and no infrastructure to provision. You pay only for the bytes scanned. For Data Engineers, Athena is the fastest way to query and validate data in your lake, run ad-hoc analytics, and automate SQL-based pipelines via boto3.

🔍

Athena Query Engine — Trino-Based Serverless SQL CORE CONCEPT

How Athena Works Under the Hood

Athena is built on top of Trino (formerly PrestoSQL) — a massively parallel SQL engine. When you submit a query, Athena spins up a fleet of compute workers, reads the relevant S3 files in parallel, processes the query, writes results to an S3 output location, and then releases all compute. You never see any of this — it's fully managed and scales automatically with query size.

HOW ATHENA EXECUTES A QUERY

Your SQL Query
      │
      ▼
┌─────────────────────────────────────────────────────────┐
│  ATHENA QUERY ENGINE (Trino-based)                      │
│                                                         │
│  1. Parse SQL → query plan                              │
│  2. Look up table schema in Glue Data Catalog           │
│  3. List matching S3 partitions (partition pruning)     │
│  4. Spin up parallel workers → each reads S3 files      │
│  5. Merge results, apply ORDER BY / LIMIT               │
│  6. Write ResultSet to S3 output bucket                 │
└─────────────────────────────────────────────────────────┘
      │
      ▼
s3://my-bucket/athena-results/query-execution-id.csv
(you fetch results via get_query_results API)

PRICING: $5 per TB of data scanned
Parquet + partition pruning → scan only what you need → cheap queries

💡 The Cost/Speed Secret

The two things that matter most for Athena cost and speed are: (1) use columnar formats like Parquet or ORC — Athena only reads the columns your SQL touches, not the whole row; (2) use partitioning — Athena skips entire S3 prefixes that don't match your WHERE clause. Together these can reduce scanned data by 99%, cutting both cost and query time.
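The effect on the bill is easy to estimate from bytes scanned. A rough sketch, using the $5 per TB figure quoted above: the function name `athena_query_cost` is hypothetical, treating 1 TB as 1024^4 bytes is an assumption of this example, and the 10 MB floor reflects Athena's per-query minimum charge.

```python
TB = 1024 ** 4                      # assumption: binary terabyte
MIN_BILLABLE_BYTES = 10 * 1024 ** 2  # Athena bills at least 10 MB per query


def athena_query_cost(bytes_scanned: int, price_per_tb: float = 5.0) -> float:
    """Estimate the cost in USD of one Athena query from the bytes it scanned."""
    billable = max(bytes_scanned, MIN_BILLABLE_BYTES)
    return billable / TB * price_per_tb


full_scan = athena_query_cost(1 * TB)           # unpartitioned scan of a 1 TB table
pruned    = athena_query_cost(10 * 1024 ** 3)   # same query touching one 10 GB partition
print(f"Full scan: ${full_scan:.2f}   Pruned: ${pruned:.4f}")
```

After a real query finishes, the scanned byte count is available in the `Statistics.DataScannedInBytes` field of `get_query_execution`, so you can feed actual numbers into the same formula.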

Querying S3 Data with Athena — The Basic Flow

To query S3 data, you register it as a table in the Glue Data Catalog (or use a Glue Crawler to auto-discover it). Then Athena can query it with standard SQL. Here's what a full setup looks like:

sql — create external table pointing to S3 Parquet data

-- Run this DDL in Athena console or via start_query_execution API
CREATE EXTERNAL TABLE IF NOT EXISTS sales_db.transactions (
    transaction_id  STRING,
    customer_id     STRING,
    product_id      STRING,
    amount          DOUBLE,
    status          STRING,
    created_at      TIMESTAMP
)
PARTITIONED BY (
    year  STRING,
    month STRING,
    day   STRING
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/silver/transactions/'
TBLPROPERTIES (
    'parquet.compress' = 'SNAPPY',
    'projection.enabled' = 'false'
);

-- Load partition metadata (tells Athena about existing partitions)
MSCK REPAIR TABLE sales_db.transactions;

-- Now query with partition pruning (only scans year=2024/month=01/day=15)
SELECT
    customer_id,
    SUM(amount) AS total_spend
FROM sales_db.transactions
WHERE
    year  = '2024'
    AND month = '01'
    AND day   = '15'
    AND status = 'COMPLETED'
GROUP BY customer_id
ORDER BY total_spend DESC
LIMIT 100;
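Running the same statements from a pipeline means calling `start_query_execution` and polling `get_query_execution` until the query reaches a final state. The sketch below is a minimal version under assumptions: `run_athena_query` is a hypothetical helper name, the client is passed in (create it with `boto3.client("athena")`), and the output location is a placeholder bucket.

```python
import time


def run_athena_query(athena, sql, database, output_location,
                     poll_seconds=2.0, sleep=time.sleep):
    """Run one query, wait for it to finish, and return (query_id, bytes_scanned)."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
        state = execution["Status"]["State"]
        if state == "SUCCEEDED":
            scanned = execution.get("Statistics", {}).get("DataScannedInBytes", 0)
            return query_id, scanned
        if state in ("FAILED", "CANCELLED"):
            reason = execution["Status"].get("StateChangeReason", "no reason given")
            raise RuntimeError(f"Athena query {query_id} {state}: {reason}")
        sleep(poll_seconds)   # QUEUED or RUNNING: keep waiting


# Usage (not executed here):
# import boto3
# qid, scanned = run_athena_query(
#     boto3.client("athena"),
#     "MSCK REPAIR TABLE sales_db.transactions",
#     database="sales_db",
#     output_location="s3://my-bucket/athena-results/",   # placeholder bucket
# )
```

Returning the scanned byte count makes it easy to log cost per query alongside the result.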

✂️

Partition Pruning — Critical for Cost and Speed MOST IMPORTANT

What is Partition Pruning

Partition pruning means Athena skips S3 prefixes entirely when your WHERE clause filters on partition columns. Instead of listing and reading every file in the table, Athena only reads files under the matching partition paths. This is the single biggest cost and performance lever in Athena.

📚 Analogy

Imagine a library with books organized by year → month → day on separate shelves. If you want books from January 2024, you go directly to the "2024 → 01" shelf. You never touch the other shelves. Partition pruning is Athena doing the same thing — it goes straight to the matching S3 "shelf" (prefix) and ignores everything else.

sql — partition pruning in action (WITH vs WITHOUT)

-- ❌ BAD: Full table scan — reads ALL partitions regardless of date
SELECT * FROM sales_db.transactions
WHERE created_at >= TIMESTAMP '2024-01-15 00:00:00';
-- Athena scans ALL files → slow + expensive
-- created_at is a data column, not a partition column

-- ✅ GOOD: Partition pruning — only reads year=2024/month=01/day=15
SELECT * FROM sales_db.transactions
WHERE year = '2024' AND month = '01' AND day = '15'
  AND created_at >= TIMESTAMP '2024-01-15 00:00:00';
-- Athena skips all other partitions → fast + cheap

-- ✅ ALSO GOOD: Range on partition columns
SELECT * FROM sales_db.transactions
WHERE year = '2024' AND month IN ('01', '02', '03');
-- Scans only Q1 2024 data
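Date ranges that cross a month boundary are awkward to express by hand with three string partition columns. One option is to generate the predicate in code. This is a sketch for short ranges only (the clause grows by one term per day); `partition_predicate` is a hypothetical helper name.

```python
from datetime import date, timedelta


def partition_predicate(start: date, end: date) -> str:
    """Return a WHERE fragment that selects every day in [start, end]."""
    if start > end:
        raise ValueError("start must not be after end")
    clauses = []
    current = start
    while current <= end:
        clauses.append(
            f"(year = '{current.year:04d}' "
            f"AND month = '{current.month:02d}' "
            f"AND day = '{current.day:02d}')"
        )
        current += timedelta(days=1)
    return "(" + " OR ".join(clauses) + ")"


predicate = partition_predicate(date(2024, 1, 30), date(2024, 2, 1))
print(predicate)
```

Because every term filters only on partition columns, Athena can still prune to the three matching prefixes.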

Partition Projection — Skip MSCK REPAIR Entirely

With large tables, MSCK REPAIR TABLE (which scans S3 to discover partitions) can take minutes or even time out. Partition Projection is an Athena feature where you declare the partition schema mathematically — Athena computes valid partition paths on the fly without any metadata lookup. This makes partition-heavy tables query-ready instantly, even with years of daily data.

sql — enable partition projection on a date-partitioned table

CREATE EXTERNAL TABLE sales_db.events (
    event_id   STRING,
    user_id    STRING,
    event_type STRING,
    payload    STRING
)
PARTITIONED BY (dt STRING)   -- single date partition column
STORED AS PARQUET
LOCATION 's3://my-data-lake/events/'
TBLPROPERTIES (
    -- Enable partition projection
    'projection.enabled'               = 'true',
    'projection.dt.type'               = 'date',
    'projection.dt.format'             = 'yyyy-MM-dd',
    'projection.dt.range'              = '2023-01-01,NOW',
    'projection.dt.interval'           = '1',
    'projection.dt.interval.unit'      = 'DAYS',

    -- Tell Athena how to build the S3 path from partition value
    'storage.location.template'        = 's3://my-data-lake/events/dt=${dt}/'
);

-- No MSCK REPAIR needed! Just query directly:
SELECT event_type, COUNT(*) AS cnt
FROM  sales_db.events
WHERE dt BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY 1
ORDER BY 2 DESC;
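To build intuition for what "computing partitions on the fly" means, the sketch below mimics the idea in plain Python: it expands a date range into values and substitutes each one into the location template. It is an illustration only, not Athena's actual implementation, and `projected_locations` is a hypothetical name.

```python
from datetime import date, timedelta

# Same template as 'storage.location.template' in the DDL above.
TEMPLATE = "s3://my-data-lake/events/dt=${dt}/"


def projected_locations(start: date, end: date, template: str = TEMPLATE):
    """List the S3 prefixes a daily date projection would cover."""
    locations = []
    current = start
    while current <= end:
        dt = current.strftime("%Y-%m-%d")   # corresponds to format 'yyyy-MM-dd'
        locations.append(template.replace("${dt}", dt))
        current += timedelta(days=1)
    return locations


paths = projected_locations(date(2024, 1, 1), date(2024, 1, 31))
print(len(paths), paths[0], paths[-1])
```

No catalog lookup is involved: the WHERE clause plus the table properties are enough to derive the prefixes to read.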

File Size and Format — The Other Cost Lever

Even with partition pruning, small files are expensive because Athena pays a fixed overhead to list, open, and read the footer of every file. The sweet spot is 128 MB – 1 GB Parquet files with Snappy compression. This is why compaction (Delta Lake's OPTIMIZE, or a manual Spark rewrite) matters so much for Athena workloads.

Format	Typical Cost	Query Speed	Recommendation
CSV (uncompressed)	Very High	Slow	Never use for large tables
JSON (uncompressed)	High	Slow	Raw landing only
Parquet + Snappy	Low	Fast	✅ Default choice
ORC + Zlib	Very Low	Fast	✅ Good alternative
Parquet + Zstd	Lowest	Fastest	✅ Best for large tables
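A compaction job needs to know how many output files to write. A rough sketch, assuming a 256 MB target (any value inside the 128 MB – 1 GB range above would do) and a hypothetical helper name `plan_compaction`:

```python
import math

MB = 1024 ** 2


def plan_compaction(file_sizes_bytes, target_bytes=256 * MB):
    """Return how many output files a partition should be rewritten into."""
    total = sum(file_sizes_bytes)
    return max(1, math.ceil(total / target_bytes))


small_files = [2 * MB] * 1000            # 1,000 files of 2 MB each
n_out = plan_compaction(small_files)
print(f"{len(small_files)} files -> {n_out} files")
```

In a Spark job, the returned number would typically be passed to repartition() before writing the partition back.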

👥

Workgroups — Cost Control Per Team GOVERNANCE ▼

What Workgroups Do

Workgroups let you isolate and control Athena usage by team, project, or environment. Each workgroup can have its own: output S3 location, data scan limit per query (cost guard-rail), query history, and IAM-controlled access. Without workgroups, a single runaway query from any user could scan a petabyte and generate a huge bill.

python — create a workgroup with data scan limit

import boto3

athena = boto3.client("athena", region_name="us-east-1")

athena.create_work_group(
    Name="de-team-prod",
    Description="Data Engineering team production workgroup",
    Configuration={
        "ResultConfiguration": {
            "OutputLocation": "s3://my-athena-results/de-team-prod/",
            "EncryptionConfiguration": {
                "EncryptionOption": "SSE_KMS",
                "KmsKey": "alias/data-lake-cmk"
            }
        },
        "EnforceWorkGroupConfiguration": True,  # users can't override below
        "BytesScannedCutoffPerQuery": 10 * 1024**3,  # 10 GB max per query
        "PublishCloudWatchMetricsEnabled": True,  # send metrics to CW
        "RequesterPaysEnabled": False
    },
    Tags=[{"Key": "Team", "Value": "DataEngineering"}]
)
print("Workgroup created.")

🔑 Production Pattern

Create separate workgroups for dev, staging, and prod environments, each with a different scan limit (e.g. 1 GB for dev, 100 GB for prod pipelines). This prevents a dev query from accidentally scanning the full production dataset and costing hundreds of dollars.
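The per-environment pattern can be expressed as a small config builder. This sketch only constructs the keyword arguments for create_work_group and does not call AWS. The dev and prod limits follow the text above; the staging limit, the bucket name, and the helper name `workgroup_kwargs` are assumptions for illustration.

```python
GB = 1024 ** 3

# dev and prod values follow the text above; staging is an assumed value.
SCAN_LIMITS = {"dev": 1 * GB, "staging": 10 * GB, "prod": 100 * GB}


def workgroup_kwargs(team: str, env: str, results_bucket: str) -> dict:
    """Build keyword arguments for athena.create_work_group()."""
    name = f"{team}-{env}"
    return {
        "Name": name,
        "Configuration": {
            "ResultConfiguration": {
                "OutputLocation": f"s3://{results_bucket}/{name}/"
            },
            "EnforceWorkGroupConfiguration": True,
            "BytesScannedCutoffPerQuery": SCAN_LIMITS[env],
            "PublishCloudWatchMetricsEnabled": True,
        },
    }


configs = [workgroup_kwargs("de-team", env, "my-athena-results") for env in SCAN_LIMITS]
for cfg in configs:
    print(cfg["Name"], cfg["Configuration"]["BytesScannedCutoffPerQuery"])
```

Each dict can then be passed as `athena.create_work_group(**cfg)`.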

Query Result Caching

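Athena can reuse query results for up to 7 days. If the exact same query string is submitted again within the reuse window, Athena returns the previous result without scanning S3, so the repeat run incurs no scan charge. This is especially valuable for dashboard queries that run the same aggregations repeatedly. Result reuse is requested per query through the ResultReuseConfiguration parameter of start_query_execution (it is not a workgroup setting) and requires Athena engine version 3.

python — enable result reuse (caching) for a query

import boto3

athena = boto3.client("athena")

# Result reuse is opted into per query, not per workgroup
athena.start_query_execution(
    QueryString="SELECT customer_id, SUM(amount) FROM sales_db.transactions GROUP BY customer_id",
    WorkGroup="de-team-prod",
    ResultReuseConfiguration={
        "ResultReuseByAgeConfiguration": {
            "Enabled":           True,
            "MaxAgeInMinutes":   60 * 24  # reuse results up to 24 hours old
        }
    }
)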
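To confirm that a run was actually served from a previous result, you can inspect the statistics returned by get_query_execution. The sketch below works on a response dictionary so it can be tried without AWS access; the sample response is hand-written, and `reused_previous_result` is a hypothetical helper name.

```python
def reused_previous_result(query_execution_response: dict) -> bool:
    """Return True when Athena served the query from a previous result."""
    stats = query_execution_response["QueryExecution"].get("Statistics", {})
    return bool(
        stats.get("ResultReuseInformation", {}).get("ReusedPreviousResult", False)
    )


# Hand-written sample shaped like a get_query_execution() response.
sample = {
    "QueryExecution": {
        "Status": {"State": "SUCCEEDED"},
        "Statistics": {
            "DataScannedInBytes": 0,
            "ResultReuseInformation": {"ReusedPreviousResult": True},
        },
    }
}
print(reused_previous_result(sample))
```

Logging this flag next to DataScannedInBytes makes it easy to see how much scan cost reuse is saving.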

🧊

Athena with Iceberg Tables — ACID SQL on S3 MODERN PATTERN ▼

Why Iceberg + Athena is Powerful

Standard Athena external tables are effectively append-only: you can INSERT INTO them, but you can't UPDATE or DELETE rows. Apache Iceberg changes this. Athena has native Iceberg support, which means you can run ACID DML (INSERT, UPDATE, DELETE, MERGE INTO) directly on S3 data via SQL, without Spark. This is ideal for lightweight CDC landing, small corrective updates, and data deletion for GDPR compliance.

sql — create and use an Iceberg table in Athena

-- Create an Iceberg table (stored on S3, managed via Glue Catalog)
CREATE TABLE sales_db.customers_iceberg (
    customer_id  STRING,
    name         STRING,
    email        STRING,
    country      STRING,
    updated_at   TIMESTAMP
)
LOCATION 's3://my-data-lake/iceberg/customers/'
TBLPROPERTIES (
    'table_type' = 'ICEBERG',
    'format'     = 'parquet',
    'write_compression' = 'snappy'
);

-- Standard INSERT
INSERT INTO sales_db.customers_iceberg
VALUES ('C001', 'Alice', 'alice@example.com', 'IN', NOW());

-- MERGE INTO — upsert from a staging table
MERGE INTO sales_db.customers_iceberg t
USING      sales_db.customers_staging s
ON         t.customer_id = s.customer_id
WHEN MATCHED THEN
    UPDATE SET name = s.name, email = s.email, updated_at = s.updated_at
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, country, updated_at)
    VALUES (s.customer_id, s.name, s.email, s.country, s.updated_at);

-- DELETE rows (e.g. GDPR right-to-erasure)
DELETE FROM sales_db.customers_iceberg
WHERE customer_id = 'C001';

-- Time travel — query historical snapshot
SELECT * FROM sales_db.customers_iceberg
FOR TIMESTAMP AS OF TIMESTAMP '2024-01-15 12:00:00 UTC';
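For erasure requests that arrive in batches, the statements can be generated in code. One caveat worth knowing: an Iceberg DELETE writes a new snapshot, and older snapshots can still reference the removed rows until they expire. Athena provides a VACUUM statement for Iceberg tables for that cleanup, and how soon old data is physically removed depends on the table's snapshot retention settings. The helper name `erasure_statements` is hypothetical.

```python
from typing import List


def erasure_statements(table: str, customer_ids: List[str]) -> List[str]:
    """Build DELETE and VACUUM statements for a batch of customer IDs."""
    if not customer_ids:
        raise ValueError("customer_ids must not be empty")
    # Double any single quotes so the IDs are safe inside SQL string literals.
    quoted = ", ".join("'" + cid.replace("'", "''") + "'" for cid in customer_ids)
    return [
        f"DELETE FROM {table} WHERE customer_id IN ({quoted})",
        f"VACUUM {table}",
    ]


stmts = erasure_statements("sales_db.customers_iceberg", ["C001", "C002"])
for s in stmts:
    print(s)
```

Each statement would be submitted as its own query, in order.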

🔗

Athena Federated Queries — Query RDS and DynamoDB from Athena ADVANCED ▼

What is Federated Query

Athena Federated Query lets you run SQL that joins S3 data with data in other services such as RDS (PostgreSQL/MySQL), DynamoDB, Redshift, and OpenSearch, all in a single query. It uses Lambda-based Data Source Connectors that translate Athena queries into calls against the target service.

🗄️

RDS Connector

Query PostgreSQL or MySQL tables directly from Athena. Join with S3 data in the same SQL query.

⚡

DynamoDB Connector

Query DynamoDB tables with SQL. Useful for joining pipeline metadata (stored in DynamoDB) with S3 data.

🔴

Redshift Connector

Query Redshift tables without a JDBC connection. Join Redshift fact tables with S3 dimension tables.

sql — federated query: join S3 data with RDS table

-- After installing the RDS connector Lambda and registering it as a catalog:
-- "my_rds_catalog" points to an RDS PostgreSQL instance
-- "AwsDataCatalog" points to Glue Catalog (S3 data)

SELECT
    t.transaction_id,
    t.amount,
    c.name        AS customer_name,
    c.email       AS customer_email,
    c.country
FROM
    -- S3 data via Glue Catalog
    AwsDataCatalog.sales_db.transactions t

    -- RDS PostgreSQL data via federated connector
    JOIN my_rds_catalog.public.customers c
      ON t.customer_id = c.customer_id
WHERE
    t.year   = '2024'
    AND t.month  = '01'
    AND c.country = 'IN';

⚠️ Performance Warning

Federated queries are convenient but slow for large datasets — they pull data via Lambda, which has throughput limits. Use them for enrichment joins where the external table is small (e.g. a customer dimension in RDS). Never use federated queries to scan large RDS tables — extract those to S3 first.

📝

Named Queries — Saved SQL Templates PRODUCTIVITY ▼

What Named Queries Are

Named queries are saved SQL statements stored in Athena (per workgroup). They appear in the Athena console as saved queries — useful for standard DQ checks, daily reports, or DDL templates that multiple team members run. You can also create and retrieve them via boto3 to build query libraries in code.

python — create and retrieve named queries via boto3

import boto3

athena = boto3.client("athena")

# Save a frequently-used DQ check as a named query
response = athena.create_named_query(
    Name="daily_null_check",
    Description="Check null counts in transactions table for a given date",
    Database="sales_db",
    QueryString="""
        SELECT
            COUNT(*)                                      AS total_rows,
            COUNT(*) - COUNT(transaction_id)              AS null_transaction_id,
            COUNT(*) - COUNT(customer_id)                 AS null_customer_id,
            COUNT(*) - COUNT(amount)                      AS null_amount
        FROM transactions
        WHERE year = '2024' AND month = '01' AND day = '15'
    """,
    WorkGroup="de-team-prod"
)
named_query_id = response["NamedQueryId"]
print(f"Saved named query: {named_query_id}")

# Retrieve it later to get the SQL string
nq = athena.get_named_query(NamedQueryId=named_query_id)
sql = nq["NamedQuery"]["QueryString"]
print(f"SQL: {sql}")
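The saved query above hard-codes one date. A simple way to reuse it daily is to store placeholders and fill them in before submitting. This sketch uses plain string formatting with values derived from a date object (so nothing user-supplied reaches the SQL); the template is a shortened version of the check above, and `render_for_date` is a hypothetical helper name.

```python
from datetime import date

# Shortened version of the DQ check above, with placeholders for the date.
NULL_CHECK_TEMPLATE = """
SELECT
    COUNT(*)                         AS total_rows,
    COUNT(*) - COUNT(transaction_id) AS null_transaction_id
FROM transactions
WHERE year = '{year}' AND month = '{month}' AND day = '{day}'
"""


def render_for_date(template: str, d: date) -> str:
    """Fill the partition placeholders from a date object."""
    return template.format(
        year=f"{d.year:04d}", month=f"{d.month:02d}", day=f"{d.day:02d}"
    )


rendered = render_for_date(NULL_CHECK_TEMPLATE, date(2024, 3, 7))
print(rendered)
```

The rendered string is what you would pass as QueryString to start_query_execution.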

🐍

Full boto3 Athena Pattern — Every DE Must Know This PRODUCTION PATTERN ▼

The Complete Athena Automation Pattern

Athena is asynchronous — you submit a query, get back a QueryExecutionId, then poll until it's done, then fetch results. This 4-step pattern is what every production Athena automation looks like:

1. start_query_execution() → 2. poll get_query_execution() → 3. SUCCEEDED? → 4. paginate get_query_results()

python — complete Athena automation: submit → poll → parse results

import boto3, time

athena = boto3.client("athena", region_name="us-east-1")

# ─────────────────────────────────────────────────────────────────
# Step 1: Submit the query
# ─────────────────────────────────────────────────────────────────
response = athena.start_query_execution(
    QueryString="""
        SELECT
            region,
            SUM(revenue)   AS total_revenue,
            COUNT(*)       AS num_orders
        FROM sales_db.transactions
        WHERE year = '2024' AND month = '01'
        GROUP BY region
        ORDER BY total_revenue DESC
    """,
    QueryExecutionContext={
        "Database": "sales_db",
        "Catalog":  "AwsDataCatalog"
    },
    ResultConfiguration={
        "OutputLocation": "s3://my-athena-results/de-team-prod/"
    },
    WorkGroup="de-team-prod"
)

query_execution_id = response["QueryExecutionId"]
print(f"Query submitted: {query_execution_id}")

# ─────────────────────────────────────────────────────────────────
# Step 2: Poll until SUCCEEDED or FAILED
# ─────────────────────────────────────────────────────────────────
def wait_for_query(athena_client, qeid, poll_interval=2):
    """Block until the query finishes; raise on FAILED or CANCELLED."""
    while True:
        result = athena_client.get_query_execution(QueryExecutionId=qeid)
        status = result["QueryExecution"]["Status"]
        state  = status["State"]
        print(f"  State: {state}")
        if state == "FAILED":
            reason = status.get("StateChangeReason", "Unknown")
            raise RuntimeError(f"Athena query FAILED: {reason}")
        if state == "CANCELLED":
            raise RuntimeError("Athena query was CANCELLED.")
        if state == "SUCCEEDED":
            stats = result["QueryExecution"]["Statistics"]
            scanned_mb = stats["DataScannedInBytes"] / 1024**2
            print(f"  ✅ SUCCEEDED — Scanned: {scanned_mb:.1f} MB")
            return
        time.sleep(poll_interval)

wait_for_query(athena, query_execution_id)

# ─────────────────────────────────────────────────────────────────
# Step 3: Fetch and parse results with paginator
# ─────────────────────────────────────────────────────────────────
def fetch_results(athena_client, qeid):
    paginator = athena_client.get_paginator("get_query_results")
    pages = paginator.paginate(QueryExecutionId=qeid)

    rows = []
    headers = None

    for page in pages:
        result_set = page["ResultSet"]
        if headers is None:
            # First row in first page is the header row
            headers = [
                col["VarCharValue"]
                for col in result_set["Rows"][0]["Data"]
            ]
            data_rows = result_set["Rows"][1:]  # skip header
        else:
            data_rows = result_set["Rows"]

        for row in data_rows:
            values = [
                cell.get("VarCharValue", None)
                for cell in row["Data"]
            ]
            rows.append(dict(zip(headers, values)))

    return rows

results = fetch_results(athena, query_execution_id)

# Step 4: Use results — print or convert to DataFrame
for row in results:
    print(row)
# → {'region': 'South', 'total_revenue': '1234567.89', 'num_orders': '4521'}
# → {'region': 'North', 'total_revenue': '987654.32', 'num_orders': '3200'}

# Convert to Pandas DataFrame if needed
import pandas as pd
df = pd.DataFrame(results)
df["total_revenue"] = df["total_revenue"].astype(float)
print(df)

☁️ Note on Results

All Athena result values come back as strings (VarCharValue), so you must cast numeric columns yourself after fetching. The type information is not lost, though: each page also carries ResultSet.ResultSetMetadata.ColumnInfo, which lists every column's name and Athena type and can drive the casts.
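A sketch of metadata-driven casting follows. It takes rows shaped like the output of fetch_results plus a ColumnInfo list (found under ResultSet.ResultSetMetadata in each get_query_results page). The type mapping is deliberately partial and is an assumption of this example; `cast_rows` and `CASTERS` are hypothetical names.

```python
from decimal import Decimal

# Partial mapping from Athena type names to Python converters (an assumption
# for this sketch; extend it for the types your tables use).
CASTERS = {
    "integer": int, "bigint": int, "smallint": int, "tinyint": int,
    "double": float, "float": float, "real": float,
    "decimal": Decimal,
    "boolean": lambda v: v.lower() == "true",
}


def cast_rows(rows, column_info):
    """Cast string cells to Python types; NULL cells (None) stay None."""
    types = {c["Name"]: c["Type"] for c in column_info}
    typed = []
    for row in rows:
        typed.append({
            name: None if value is None
            else CASTERS.get(types.get(name, "varchar"), str)(value)
            for name, value in row.items()
        })
    return typed


column_info = [
    {"Name": "region", "Type": "varchar"},
    {"Name": "total_revenue", "Type": "double"},
    {"Name": "num_orders", "Type": "bigint"},
]
rows = [{"region": "South", "total_revenue": "1234567.89", "num_orders": "4521"},
        {"region": "North", "total_revenue": None, "num_orders": "3200"}]
typed_rows = cast_rows(rows, column_info)
print(typed_rows)
```

Unknown types fall back to str, so the function never fails on a column it does not recognise.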

Stopping a Running Query

python — cancel a running Athena query

import boto3

athena = boto3.client("athena")

# Cancel a running query (e.g. if it's scanning too much data)
athena.stop_query_execution(
    QueryExecutionId="abc12345-1234-1234-1234-abc123456789"
)
print("Query cancellation requested.")
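Polling and cancellation combine naturally: if a query runs longer than you are willing to wait, cancel it instead of leaving it running. The sketch below takes the client as a parameter so it can be exercised with a stand-in object. `wait_or_cancel` and `FakeAthena` are hypothetical names, and the fake exists only for a dry run.

```python
import time


def wait_or_cancel(athena_client, qeid, timeout_s=300, poll_s=2.0, sleep=time.sleep):
    """Poll a query; cancel it and raise TimeoutError after timeout_s seconds."""
    waited = 0.0
    while True:
        resp = athena_client.get_query_execution(QueryExecutionId=qeid)
        state = resp["QueryExecution"]["Status"]["State"]
        if state in {"SUCCEEDED", "FAILED", "CANCELLED"}:
            return state
        if waited >= timeout_s:
            athena_client.stop_query_execution(QueryExecutionId=qeid)
            raise TimeoutError(f"Query {qeid} cancelled after {timeout_s}s")
        sleep(poll_s)
        waited += poll_s


class FakeAthena:
    """Stand-in client for a dry run: always reports RUNNING."""

    def __init__(self):
        self.stopped = []

    def get_query_execution(self, QueryExecutionId):
        return {"QueryExecution": {"Status": {"State": "RUNNING"}}}

    def stop_query_execution(self, QueryExecutionId):
        self.stopped.append(QueryExecutionId)


fake = FakeAthena()
timed_out = False
try:
    wait_or_cancel(fake, "q-123", timeout_s=4, poll_s=2, sleep=lambda s: None)
except TimeoutError as exc:
    timed_out = True
    print(exc, "| stopped:", fake.stopped)
```

With a real client you would pass the boto3 Athena client and leave `sleep` at its default.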

📋

Athena Quick Reference — Cheat Sheet SUMMARY ▼

Essential boto3 Athena API Calls

Operation	boto3 Call	Key Parameters
Run a query	`start_query_execution()`	`QueryString`, `ResultConfiguration.OutputLocation`, `WorkGroup`
Check query state	`get_query_execution()`	`QueryExecutionId` → `Status.State`
Fetch results	`get_query_results()` + paginator	`QueryExecutionId`
Cancel a query	`stop_query_execution()`	`QueryExecutionId`
List past queries	`list_query_executions()` + paginator	`WorkGroup`
Save a query	`create_named_query()`	`Name`, `QueryString`, `WorkGroup`
Load a saved query	`get_named_query()`	`NamedQueryId`
Create workgroup	`create_work_group()`	`BytesScannedCutoffPerQuery`

Athena State Machine

ATHENA QUERY STATE MACHINE

  start_query_execution()
          │
          ▼
       QUEUED          (waiting for compute)
          │
          ▼
       RUNNING         (actively scanning S3)
          │
    ┌─────┴──────┐
    ▼            ▼
SUCCEEDED     FAILED / CANCELLED
(get results) (check StateChangeReason)

Top 5 Athena Best Practices for Data Engineers

1. Always filter on partition columns in WHERE clauses — this is the #1 cost and speed lever.
2. Use Parquet or ORC format, never CSV or JSON for production tables.
3. Keep files 128 MB – 1 GB — compact small files with Spark or Delta OPTIMIZE before Athena queries them.
4. Use Workgroups with scan byte limits — a single runaway query can cost hundreds of dollars.
5. Enable Partition Projection for large time-series tables — eliminates MSCK REPAIR and makes partition discovery instant.

29.9

AWS Lake Formation — Fine-Grained Data Governance

Lake Formation is AWS's centralised data lake governance service. Before Lake Formation, controlling who could access which tables, columns, or rows in your data lake required a patchwork of S3 bucket policies and IAM policies that quickly became unmanageable. Lake Formation gives you a single place to grant and revoke table-level, column-level, and row-level permissions across your entire Glue Catalog — enforced for Athena, Glue, EMR, and Redshift Spectrum automatically.

🏞️

Why Lake Formation — The Problem It Solves CORE CONCEPT ▼

The Problem Before Lake Formation

Without Lake Formation, controlling data lake access looks like this: you give a team an IAM role that grants S3 read access to a specific prefix. But S3 permissions are bucket/prefix-level — you can't say "this team can read the customers table but only the name and region columns, not email or phone". For column-level or row-level access, you need custom views, masking logic in every query tool, or separate physical copies of data per team — all unmaintainable at scale.

🏢 Analogy

S3 IAM policies are like giving someone a keycard to an entire floor. Lake Formation is like giving them a keycard to a specific office, and within that office, only certain filing cabinets, and within those cabinets, only certain folders — and the system enforces this automatically no matter which door they use (Athena, Glue, EMR).

What Lake Formation Controls

🗄️

Database Level

Grant CREATE TABLE, ALTER, DROP on entire Glue databases. Controls who can define new tables.

📋

Table Level

Grant SELECT, INSERT, DELETE on specific tables. The most common permission type.

📊

Column Level

Grant SELECT on specific columns only. Used to hide PII columns from teams that don't need them.

🔍

Row Level (Filters)

Restrict which rows a principal sees based on a filter expression — e.g. only rows WHERE country = 'IN'.

Lake Formation vs IAM — When to Use Which

Scenario	Use IAM	Use Lake Formation
Control S3 bucket access broadly	✅ Yes	Not designed for this
Table-level access in Glue Catalog	Possible but complex	✅ Preferred
Column-level access control	Not possible	✅ Yes — column grants
Row-level filtering	Not possible	✅ Yes — row filters
Cross-account data sharing	Complex RAM setup	✅ Built-in cross-account
Tag-based access policies	IAM attribute-based access control (ABAC)	✅ LF-Tags (simpler)

⚠️ Important

Lake Formation works on top of IAM — it doesn't replace it. The effective permission is the intersection: both IAM and Lake Formation must allow the action. Lake Formation adds finer-grained control on top of IAM, it doesn't bypass it.
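The intersection rule can be written as a one-line truth function. This is a toy model for intuition only (real evaluation involves many more inputs), and `can_read_table` is a hypothetical name.

```python
def can_read_table(iam_allows_api: bool, lf_grants_select: bool) -> bool:
    """Both layers must allow the request for it to succeed."""
    return iam_allows_api and lf_grants_select


cases = {
    "IAM allow + LF grant":   can_read_table(True, True),
    "IAM allow, no LF grant": can_read_table(True, False),
    "LF grant, IAM deny":     can_read_table(False, True),
}
for label, allowed in cases.items():
    print(f"{label}: {'allowed' if allowed else 'denied'}")
```

Only the first case succeeds, which is why a missing Lake Formation grant and a missing IAM permission both surface as access errors.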

📁

Registering S3 Locations — Handing Control to Lake Formation SETUP ▼

What Registering an S3 Location Means

Before Lake Formation can govern data in an S3 bucket, you must register that S3 path with Lake Formation. Registration lets Lake Formation manage access to that path on behalf of query engines. After registration, services like Athena and Glue obtain temporary credentials from Lake Formation, which reaches S3 through the role given at registration (here the service-linked role), rather than through the user's own IAM role directly touching S3. This is how Lake Formation enforces its column and row filters: it mediates the data access at the service level.

python — register an S3 location with Lake Formation

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Register the S3 path — Lake Formation now controls access here
lf.register_resource(
    ResourceArn="arn:aws:s3:::my-data-lake",
    UseServiceLinkedRole=True  # Lake Formation uses its own IAM role to access S3
)
print("S3 location registered with Lake Formation.")

# List all registered locations
response = lf.list_resources()
for r in response["ResourceInfoList"]:
    print(r["ResourceArn"], r["RoleArn"])

🔐

Permissions Model — Table, Column, Row Level MOST IMPORTANT ▼

Granting Table-Level Permissions

The most common Lake Formation operation: grant a role the ability to SELECT from a specific table in the Glue Catalog. Once granted, that role can query the table via Athena or read it in a Glue job without needing direct S3 IAM permissions. Its IAM policy still has to allow lakeformation:GetDataAccess and the Athena and Glue Catalog API calls it makes.

python — grant SELECT on a table to an IAM role

import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on the transactions table to the Data Analyst role
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier":
            "arn:aws:iam::123456789012:role/DataAnalystRole"
    },
    Resource={
        "Table": {
            "CatalogId":  "123456789012",
            "DatabaseName": "sales_db",
            "Name":         "transactions"
        }
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=[]   # cannot re-grant to others
)
print("Table permission granted.")

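# Grant SELECT on ALL tables in a database (TableWildcard).
# Note: a grant on the "Database" resource only covers database-level
# permissions such as CREATE_TABLE, ALTER, and DROP, not SELECT on tables.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier":
            "arn:aws:iam::123456789012:role/DataEngineerRole"
    },
    Resource={
        "Table": {
            "CatalogId":     "123456789012",
            "DatabaseName":  "sales_db",
            "TableWildcard": {}   # every table in sales_db
        }
    },
    Permissions=["SELECT"]
)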

# Revoke a permission
lf.revoke_permissions(
    Principal={
        "DataLakePrincipalIdentifier":
            "arn:aws:iam::123456789012:role/DataAnalystRole"
    },
    Resource={
        "Table": {
            "CatalogId": "123456789012",
            "DatabaseName": "sales_db",
            "Name": "transactions"
        }
    },
    Permissions=["SELECT"]
)

Column-Level Permissions — Hide PII Columns

Column-level security is one of Lake Formation's most valuable features for PII protection. You can grant SELECT on specific columns only: the principal sees the table in Athena, but only the granted columns are visible, and a query that references an excluded column fails.

python — grant SELECT on specific columns only (hide PII)

import boto3

lf = boto3.client("lakeformation")

# customers table has: customer_id, name, email, phone, country, segment
# Analyst role should NOT see email or phone (PII)

lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier":
            "arn:aws:iam::123456789012:role/DataAnalystRole"
    },
    Resource={
        "TableWithColumns": {
            "CatalogId":    "123456789012",
            "DatabaseName": "sales_db",
            "Name":         "customers",
            "ColumnNames": [
                "customer_id",
                "name",
                "country",
                "segment"
                # email and phone are NOT listed → access denied
            ]
        }
    },
    Permissions=["SELECT"]
)
print("Column-level permission granted (email and phone excluded).")

# Alternatively, use ColumnWildcard with Excluded columns
# (grant all columns EXCEPT the listed ones)
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier":
            "arn:aws:iam::123456789012:role/DataAnalystRole"
    },
    Resource={
        "TableWithColumns": {
            "CatalogId":    "123456789012",
            "DatabaseName": "sales_db",
            "Name":         "customers",
            "ColumnWildcard": {
                "ExcludedColumnNames": ["email", "phone"]
            }
        }
    },
    Permissions=["SELECT"]
)
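The two grant styles expose the same columns today but can diverge when the schema changes. The sketch below models that in plain Python, assuming the wildcard is evaluated against the table's current schema. `visible_columns` is a hypothetical helper and `loyalty_tier` is an invented column used only to show the difference.

```python
ALL_COLUMNS = ["customer_id", "name", "email", "phone", "country", "segment"]


def visible_columns(all_columns, column_names=None, excluded=None):
    """Resolve a TableWithColumns-style grant to the columns it exposes."""
    if column_names is not None:
        return [c for c in all_columns if c in column_names]
    excluded = set(excluded or [])
    return [c for c in all_columns if c not in excluded]


allow_list = ["customer_id", "name", "country", "segment"]
by_list = visible_columns(ALL_COLUMNS, column_names=allow_list)
by_wildcard = visible_columns(ALL_COLUMNS, excluded=["email", "phone"])
print(by_list == by_wildcard, by_wildcard)

# After a new column is added, the explicit list hides it; the wildcard exposes it.
later = ALL_COLUMNS + ["loyalty_tier"]
print(visible_columns(later, column_names=allow_list))
print(visible_columns(later, excluded=["email", "phone"]))
```

For PII-heavy tables the explicit ColumnNames list is the safer default, because a newly added sensitive column stays hidden until someone grants it.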

Row-Level Security — Data Filters

Row-level security lets you restrict which rows a principal can see. You create a Data Filter with a filter expression (a SQL WHERE clause fragment), then attach it to a table grant. When the principal queries the table, Lake Formation automatically applies the filter — they only see rows that match the expression.

python — create a row filter and apply it to a grant

import boto3

lf = boto3.client("lakeformation")

# Step 1: Create a row filter (a named WHERE clause)
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": "123456789012",
        "DatabaseName":   "sales_db",
        "TableName":      "transactions",
        "Name":           "india_only_filter",  # filter name
        "RowFilter": {
            "FilterExpression": "country = 'IN'"  # SQL WHERE clause
        },
        # Optionally restrict columns too in the same filter
        "ColumnWildcard": {}  # all columns allowed in this filter
    }
)
print("Row filter created.")

# Step 2: Grant the filter to an IAM role
# The India Analytics team can only see rows where country = 'IN'
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier":
            "arn:aws:iam::123456789012:role/IndiaAnalyticsRole"
    },
    Resource={
        "DataCellsFilter": {
            "TableCatalogId": "123456789012",
            "DatabaseName":   "sales_db",
            "TableName":      "transactions",
            "Name":           "india_only_filter"
        }
    },
    Permissions=["SELECT"]
)
print("Row-level filter grant applied.")
# Now when IndiaAnalyticsRole queries transactions in Athena,
# they ONLY see rows where country = 'IN' — automatically enforced.

✅ Real-World Use Case

A global retail company has transaction data for 50 countries. Each regional team (India, USA, UK) should only see their own country's data. Instead of creating 50 separate physical tables, you create one table + 50 row filters, one per country. Each regional role gets its country's filter applied — same table, isolated views, zero data duplication.
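Creating one filter per country is repetitive, so it is a natural candidate for a loop. This sketch only builds the keyword arguments for create_data_cells_filter and does not call AWS. The filter naming scheme and the helper name `country_filter_kwargs` are assumptions of this example.

```python
def country_filter_kwargs(catalog_id: str, database: str, table: str, country: str) -> dict:
    """Build keyword arguments for lf.create_data_cells_filter()."""
    if not (len(country) == 2 and country.isalpha()):
        raise ValueError(f"expected a two-letter country code, got {country!r}")
    code = country.upper()
    return {
        "TableData": {
            "TableCatalogId": catalog_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": f"{code.lower()}_only_filter",
            "RowFilter": {"FilterExpression": f"country = '{code}'"},
            "ColumnWildcard": {},   # all columns allowed
        }
    }


filters = [
    country_filter_kwargs("123456789012", "sales_db", "transactions", c)
    for c in ["IN", "US", "GB"]
]
for f in filters:
    print(f["TableData"]["Name"], "->", f["TableData"]["RowFilter"]["FilterExpression"])
```

Each dict would be passed as `lf.create_data_cells_filter(**kwargs)`, followed by a grant_permissions call for the matching regional role.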

🏷️

LF-Tags — Attribute-Based Access Control at Scale ADVANCED ▼

Why LF-Tags Exist — The Scale Problem

If you have 500 tables in your lake and need to grant 20 teams access to different subsets, managing individual table grants becomes unmanageable — that's potentially 10,000 grant statements to maintain. LF-Tags (Lake Formation Tag-Based Access Control) solve this with attribute-based access. You tag tables and columns with key-value labels, then grant access to a tag expression rather than specific resources. When you add a new table with the right tags, it's automatically included in existing grants.

🏷️ Analogy

Instead of giving your employee a key for each of 500 file cabinets, you tag each cabinet as department=finance or sensitivity=public and give the employee a badge that works on all cabinets with department=finance AND sensitivity=public. When you add a new cabinet, just tag it — the employee's badge automatically works on it.

Creating and Assigning LF-Tags

python — create LF-Tags and assign them to tables

import boto3

lf = boto3.client("lakeformation")

# ── Step 1: Create LF-Tag keys and their allowed values ─────────
lf.create_lf_tag(
    TagKey="sensitivity",
    TagValues=["public", "internal", "confidential", "restricted"]
)

lf.create_lf_tag(
    TagKey="domain",
    TagValues=["sales", "finance", "marketing", "hr"]
)
print("LF-Tags created.")

# ── Step 2: Assign tags to a database ───────────────────────────
lf.add_lf_tags_to_resource(
    Resource={
        "Database": {
            "CatalogId": "123456789012",
            "Name": "sales_db"
        }
    },
    LFTags=[
        {"TagKey": "domain",      "TagValues": ["sales"]},
        {"TagKey": "sensitivity", "TagValues": ["internal"]}
    ]
)

# ── Step 3: Assign a more restrictive tag to a specific table ────
lf.add_lf_tags_to_resource(
    Resource={
        "Table": {
            "CatalogId":    "123456789012",
            "DatabaseName": "sales_db",
            "Name":         "customers"  # has PII
        }
    },
    LFTags=[
        {"TagKey": "sensitivity", "TagValues": ["confidential"]}
    ]
)

# ── Step 4: Grant access via tag expression ──────────────────────
# Sales team can access all tables tagged domain=sales AND sensitivity in (public, internal)
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier":
            "arn:aws:iam::123456789012:role/SalesAnalyticsRole"
    },
    Resource={
        "LFTagPolicy": {
            "CatalogId":    "123456789012",
            "ResourceType": "TABLE",
            "Expression": [
                {"TagKey": "domain",      "TagValues": ["sales"]},
                {"TagKey": "sensitivity", "TagValues": ["public", "internal"]}
            ]
        }
    },
    Permissions=["SELECT"]
)
print("Tag-based grant applied.")
# SalesAnalyticsRole can now query any table tagged
# domain=sales AND (sensitivity=public OR sensitivity=internal)
# The 'customers' table (tagged confidential) is automatically excluded.
# Any NEW table added to sales_db with these tags is automatically accessible.
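The matching rule (AND across tag keys, OR across the values listed for one key, with table-level tags overriding inherited database tags) can be modelled in a few lines. This is a simplified illustration of the semantics described above, not Lake Formation's implementation; `effective_tags` and `matches` are hypothetical names.

```python
def effective_tags(database_tags: dict, table_tags: dict) -> dict:
    """Table-level tags override tags inherited from the database."""
    return {**database_tags, **table_tags}


def matches(expression: list, tags: dict) -> bool:
    """AND across tag keys, OR across the values listed for each key."""
    return all(tags.get(e["TagKey"]) in e["TagValues"] for e in expression)


expression = [
    {"TagKey": "domain",      "TagValues": ["sales"]},
    {"TagKey": "sensitivity", "TagValues": ["public", "internal"]},
]
db_tags = {"domain": "sales", "sensitivity": "internal"}

transactions = effective_tags(db_tags, {})
customers = effective_tags(db_tags, {"sensitivity": "confidential"})
print("transactions:", matches(expression, transactions))
print("customers:", matches(expression, customers))
```

The transactions table inherits both database tags and matches; the customers table overrides sensitivity and drops out of the grant.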

🌐

Cross-Account Data Sharing ENTERPRISE PATTERN ▼

Sharing a Table Across AWS Accounts

Lake Formation makes cross-account data sharing straightforward. Account A (the data producer) grants Lake Formation permissions to Account B's IAM principal. Account B then creates a resource link in their own Glue Catalog that points to Account A's table. Teams in Account B can query Account A's data via Athena as if it were a local table — without any S3 cross-account policy complexity.

CROSS-ACCOUNT DATA SHARING PATTERN

ACCOUNT A (Data Producer — 111111111111)
├── S3 bucket: s3://account-a-lake/
├── Glue Catalog: sales_db.transactions (registered with Lake Formation)
└── Lake Formation grant:
      GRANT SELECT ON sales_db.transactions
      TO arn:aws:iam::222222222222:role/ConsumerRole
        │
        │  Lake Formation cross-account grant
        ▼
ACCOUNT B (Data Consumer — 222222222222)
├── Accepts the grant (RAM resource share)
├── Creates a resource link in their own Glue Catalog:
│     sales_db_shared.transactions → Account A: sales_db.transactions
└── Queries via Athena:
      SELECT * FROM sales_db_shared.transactions
      (data physically stays in Account A's S3)
      (Lake Formation enforces column/row permissions)

python — Account A: grant cross-account Lake Formation permission

import boto3

# Run in ACCOUNT A (data producer)
lf_producer = boto3.client("lakeformation", region_name="us-east-1")

# Grant SELECT on a specific table to Account B's consumer role
lf_producer.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier":
            "arn:aws:iam::222222222222:role/ConsumerRole"
    },
    Resource={
        "Table": {
            "CatalogId":    "111111111111",
            "DatabaseName": "sales_db",
            "Name":         "transactions"
        }
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=["SELECT"]  # allow B to re-share
)
print("Cross-account grant sent to Account B.")

# ────────────────────────────────────────────────────────────────
# In ACCOUNT B: create a resource link to access the shared table
# (this client must be created with Account B credentials, and the
#  local database "shared_data" must already exist)
# ────────────────────────────────────────────────────────────────
glue_consumer = boto3.client("glue", region_name="us-east-1")

# Create a local resource link pointing to Account A's table
glue_consumer.create_table(
    DatabaseName="shared_data",
    TableInput={
        "Name": "transactions_from_a",  # local alias
        "TargetTable": {
            "CatalogId":    "111111111111",  # Account A's ID
            "DatabaseName": "sales_db",
            "Name":         "transactions"
        }
    }
)
print("Resource link created. Account B can now query Account A's table.")

🔗

Integration with Glue Catalog & Query Engines HOW IT WORKS ▼

How Lake Formation Integrates with Services

Lake Formation permissions are enforced by the AWS services that integrate with the Glue Data Catalog. Once grants are in place, Athena and Glue apply them without per-service access rules; EMR clusters need Lake Formation integration enabled when they are launched. The table below summarises how each service enforces the grants:

Service	How Lake Formation Enforces Permissions
Amazon Athena	Before executing a query, Athena checks Lake Formation permissions. Columns/rows excluded by grants are invisible in query results.
AWS Glue Jobs	Glue ETL jobs running with a role that has Lake Formation grants can only read the allowed columns/rows from the data source.
Amazon Redshift Spectrum	Spectrum queries on Glue Catalog tables respect Lake Formation column and row permissions.
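Amazon EMR	EMR integrates with Lake Formation natively (for example through runtime roles), so Spark SQL reads are filtered by Lake Formation grants. Apache Ranger is a separate authorization option for EMR, not the Lake Formation enforcement path.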

Setting Up a Data Lake Admin

python — set a Lake Formation data lake administrator

import boto3

lf = boto3.client("lakeformation")

# Designate a role as a Data Lake Administrator
# (admins can grant any permission to anyone)
lf.put_data_lake_settings(
    DataLakeSettings={
        "DataLakeAdmins": [
            {"DataLakePrincipalIdentifier":
                "arn:aws:iam::123456789012:role/DataLakeAdminRole"},
            {"DataLakePrincipalIdentifier":
                "arn:aws:iam::123456789012:user/de-lead"}
        ],
        "CreateDatabaseDefaultPermissions": [],  # no implicit grants
        "CreateTableDefaultPermissions":    [],  # no implicit grants
        "TrustedResourceOwners": []
    }
)
print("Data lake admins configured.")

# List all permissions currently in effect
paginator = lf.get_paginator("list_permissions")
for page in paginator.paginate():
    for perm in page["PrincipalResourcePermissions"]:
        principal = perm["Principal"]["DataLakePrincipalIdentifier"]
        resource  = perm["Resource"]
        perms     = perm["Permissions"]
        print(f"  {principal} → {perms} on {resource}")
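The flat listing above gets long on a busy catalog. As a minimal sketch, the pages returned by the list_permissions paginator can be grouped per principal. The helper name summarize_permissions and the sample page are illustrative assumptions, not part of boto3; the sample only mimics the response shape used in the loop above.

```python
from collections import defaultdict

def summarize_permissions(pages):
    """Group Lake Formation permissions by principal.

    `pages` is any iterable shaped like the output of
    lf.get_paginator("list_permissions").paginate().
    """
    summary = defaultdict(set)
    for page in pages:
        for entry in page.get("PrincipalResourcePermissions", []):
            principal = entry["Principal"]["DataLakePrincipalIdentifier"]
            # The first top-level key names the resource type ("Database", "Table", ...)
            resource_type = next(iter(entry["Resource"]), "Unknown")
            for perm in entry["Permissions"]:
                summary[principal].add((resource_type, perm))
    return {p: sorted(v) for p, v in summary.items()}

# Hypothetical sample page (same shape as a real response)
sample_pages = [{
    "PrincipalResourcePermissions": [
        {"Principal": {"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/Analyst"},
         "Resource": {"Table": {"DatabaseName": "sales_db", "Name": "transactions"}},
         "Permissions": ["SELECT", "DESCRIBE"]},
        {"Principal": {"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/Analyst"},
         "Resource": {"Database": {"Name": "sales_db"}},
         "Permissions": ["DESCRIBE"]},
    ]
}]

permission_summary = summarize_permissions(sample_pages)
for principal, grants in permission_summary.items():
    print(principal, grants)
```

With AWS access, the same function would take `paginator.paginate()` directly in place of `sample_pages`.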

📋

Lake Formation Quick Reference SUMMARY ▼

Permission Hierarchy at a Glance

LAKE FORMATION PERMISSION HIERARCHY

MOST BROAD                                               MOST SPECIFIC
──────────────────────────────────────────────────────────────────────
Database Level      Table Level            Column/Row Level
────────────────    ───────────────────    ──────────────────────────
CREATE_DATABASE     SELECT on table        ColumnNames: [col1]
ALTER               INSERT on table        ColumnWildcard + Exclude
DROP                DELETE on table        DataCellsFilter (row)
DESCRIBE            DESCRIBE on table      (combines both)
CREATE_TABLE
──────────────────────────────────────────────────────────────────────
Principle: Grant the MINIMUM permission needed (least privilege)

LF-TAGS (attribute-based access) sit ACROSS all levels:
  Tag a database → all tables inherit the tag (unless overridden)
  Grant via tag expression → auto-includes future tagged resources

Key boto3 Calls

Operation	boto3 Call
Register S3 location	`lf.register_resource()`
Grant table/column/row permission	`lf.grant_permissions()`
Revoke permission	`lf.revoke_permissions()`
List all permissions	`lf.list_permissions()` + paginator
Create LF-Tag	`lf.create_lf_tag()`
Assign tag to resource	`lf.add_lf_tags_to_resource()`
Grant via tag expression	`lf.grant_permissions()` with `LFTagPolicy` resource
Create row filter	`lf.create_data_cells_filter()`
Set data lake admins	`lf.put_data_lake_settings()`
Cross-account share	`lf.grant_permissions()` with cross-account principal ARN

29.10 — AMAZON REDSHIFT

Amazon Redshift

Redshift is AWS's fully managed, petabyte-scale cloud data warehouse. As a data engineer, you'll use it as the serving layer for analytics — loading transformed data from S3 via COPY, exporting results via UNLOAD, optimising query performance with distribution and sort keys, and federating queries to S3 with Redshift Spectrum. Every topic below is something you'll encounter in a real production warehouse.

🏗️

Redshift Architecture — Leader Node, Compute Nodes, RA3 FOUNDATION ▼

Leader Node

The leader node is the entry point for every query. It receives SQL from your client (BI tool, psql, JDBC), builds an execution plan, divides the work across compute nodes, aggregates partial results, and returns the final answer. You never directly interact with compute nodes — all connections go through the leader node on port 5439. The leader node does not store any actual table data — it only stores query plans and cluster metadata.

🏭 Analogy

The leader node is like a factory manager. It receives the order (SQL query), breaks it into tasks, assigns those tasks to workers (compute nodes), collects all partial results, and delivers the final product to the customer.

Compute Nodes

Compute nodes store slices of data and execute the actual query tasks. Each compute node is divided into node slices — on a ds2.xlarge node there are 2 slices; on a dc2.8xlarge there are 16. Data is distributed across slices, and each slice processes its portion of a query in parallel. More nodes = more parallelism = faster queries on large data.

⚡

dc2 Nodes (Dense Compute)

SSD-backed. Best for high-performance queries on datasets under ~1 TB per node. Fixed storage — you pay for compute AND storage together.

📦

ds2 Nodes (Dense Storage)

HDD-backed. Legacy option for very large datasets (multi-petabyte). Largely superseded by RA3.

🌟

RA3 Nodes (Managed Storage)

SSD cache + S3 backend. Compute and storage scale independently. The recommended modern choice.

RA3 Nodes — Managed Storage (The Modern Standard)

RA3 nodes decouple compute and storage. The local SSD is a high-speed cache — Redshift automatically keeps hot data on local SSD and cold data in S3-backed managed storage. You can scale compute nodes up or down without migrating data, and you pay for compute and storage separately. This makes RA3 the recommended node type for all new Redshift clusters.

📌 Key Rule

For any new Redshift cluster, default to RA3. It gives you: (1) independent compute/storage scaling, (2) the same managed-storage model that Redshift Serverless uses, (3) no data migration when resizing. Consider dc2 only for legacy workloads or very small clusters where SSD-local performance matters more than flexibility.

REDSHIFT CLUSTER ARCHITECTURE

   Client (BI Tool / Spark / psql)
                 │  port 5439 (PostgreSQL protocol)
                 ▼
┌─────────────────────────────────┐
│           LEADER NODE           │
│  - SQL parsing                  │
│  - Query planning               │
│  - Task distribution            │
│  - Result aggregation           │
└────────────────┬────────────────┘
                 │  internal network
        ┌────────┴────────┐
        ▼                 ▼
  ┌──────────┐      ┌──────────┐
  │ Compute  │      │ Compute  │   ...N nodes
  │ Node 1   │      │ Node 2   │
  │ Slice 0  │      │ Slice 0  │
  │ Slice 1  │      │ Slice 1  │
  │ [Data]   │      │ [Data]   │
  └──────────┘      └──────────┘
        │                 │
        └────────┬────────┘
                 ▼
   RA3: S3-backed Managed Storage
   (cold data automatically offloaded)

📥

COPY Command — Loading Data from S3 (Most Important) MOST USED ▼

Why COPY — Not INSERT

The COPY command is the fastest and most efficient way to load data into Redshift. It loads in parallel — each compute node slice reads directly from S3 independently. A single COPY loading 1 TB from 200 Parquet files is orders of magnitude faster than equivalent INSERTs. Never use INSERT for bulk loads — always COPY.

🚚 Analogy

INSERT is like delivering 1 million packages one by one to a warehouse. COPY is like having 16 trucks deliver to 16 different warehouse docks simultaneously. Same cargo, 16× the throughput.

COPY from S3 — Parquet (Preferred Format)

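Parquet is the recommended format for loading into Redshift: columnar, compressed, and schema-embedded. With Parquet you don't specify a delimiter or parsing options, because Redshift reads the file schema from the Parquet metadata. The target table must already exist, and COPY matches Parquet columns to table columns by position, so column order and data types need to be compatible.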

sql — COPY from S3 Parquet (recommended)

-- Load all Parquet files under a prefix in parallel
COPY analytics.orders
FROM 's3://my-data-lake/silver/orders/year=2024/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS PARQUET;

-- Load the exact files listed in a manifest
-- (for Parquet, each manifest entry must include meta.content_length)
COPY analytics.orders
FROM 's3://my-data-lake/manifests/orders_2024_load.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS PARQUET
MANIFEST;
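A manifest is a small JSON document listing the exact objects to load. The sketch below builds one in the `entries` format; for Parquet (and ORC) loads each entry also carries `meta.content_length`, the object size in bytes. The function name, file keys, and sizes are hypothetical. In a pipeline the sizes would come from an S3 listing and the JSON would be uploaded to the manifest path before running COPY.

```python
import json

def build_copy_manifest(objects, mandatory=True):
    """Build a Redshift COPY manifest.

    `objects` is a list of (s3_url, size_in_bytes) tuples.
    """
    entries = []
    for url, size in objects:
        if not url.startswith("s3://"):
            raise ValueError(f"not an S3 URL: {url}")
        entries.append({
            "url": url,
            "mandatory": mandatory,            # fail the COPY if the file is missing
            "meta": {"content_length": size},  # needed for Parquet/ORC loads
        })
    return {"entries": entries}

# Hypothetical files and sizes
files = [
    ("s3://my-data-lake/silver/orders/year=2024/part-0000.parquet", 134217728),
    ("s3://my-data-lake/silver/orders/year=2024/part-0001.parquet", 130023424),
]
manifest = build_copy_manifest(files)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Listing files explicitly means a re-run loads exactly the same objects, even if new files have since landed under the prefix.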

📌 Parallelism Tip

For maximum COPY throughput, ensure your S3 prefix has at least as many files as total node slices (e.g. 16 slices → 16+ files). If you have 1 large file, only 1 slice works — parallelism is wasted. Use Spark to write partitioned output before loading into Redshift.
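The file-count rule can be turned into a quick calculation. This sketch assumes a target file size of 128 MB, which is an illustrative choice rather than a Redshift requirement, and rounds the file count up to a multiple of the total slice count so every slice gets the same number of files. The function name is hypothetical.

```python
import math

def recommended_file_count(num_nodes, slices_per_node, total_bytes,
                           target_file_bytes=128 * 1024**2):
    """Return a file count that is a multiple of the total slice count."""
    total_slices = num_nodes * slices_per_node
    # Files needed to stay near the target size (at least one)
    by_size = max(1, math.ceil(total_bytes / target_file_bytes))
    # Round up to the next multiple of total_slices
    return math.ceil(by_size / total_slices) * total_slices

# 4 nodes x 4 slices = 16 slices, loading 10 GiB
file_count = recommended_file_count(4, 4, 10 * 1024**3)
print(file_count)
```

In Spark, the result could feed `df.repartition(file_count)` before the Parquet write, so the number of output files matches the cluster layout.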

COPY from S3 — CSV

CSV COPY requires specifying the delimiter, quote character, and whether to ignore a header row. Use IGNOREHEADER 1 when the CSV has a column header on row 1. Always use GZIP-compressed CSV to reduce S3 transfer time.

sql — COPY from S3 CSV

COPY analytics.customers
FROM 's3://my-data-lake/raw/customers/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS CSV
DELIMITER ','
QUOTE '"'
IGNOREHEADER 1
GZIP
TIMEFORMAT 'auto'
DATEFORMAT  'auto'
TRUNCATECOLUMNS       -- silently truncate strings that exceed column length
MAXERROR 10;          -- allow up to 10 bad rows before failing

-- Check what was rejected (stl_load_errors is Redshift's error log)
SELECT filename, err_reason, raw_line
FROM   stl_load_errors
ORDER BY starttime DESC
LIMIT  20;

Triggering COPY from Python (boto3 / psycopg2)

In a data pipeline you'll run COPY programmatically, either via psycopg2 (a PostgreSQL driver that connects straight to the cluster endpoint) or via the Redshift Data API (boto3, no driver required). The Data API is preferred in serverless environments like Lambda and Glue because it is an HTTPS API: the caller needs no database driver, no persistent connection, and no network path into the cluster's VPC.

python — run COPY via Redshift Data API (boto3 — no driver needed)

import boto3, time

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

copy_sql = """
COPY analytics.orders
FROM 's3://my-data-lake/silver/orders/year=2024/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS PARQUET;
"""

# Submit the COPY statement
resp = redshift_data.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="analytics",
    DbUser="etl_user",
    Sql=copy_sql
)
stmt_id = resp["Id"]

# Poll until the statement reaches a terminal state
while True:
    desc = redshift_data.describe_statement(Id=stmt_id)
    status = desc["Status"]
    print(f"COPY status: {status}")
    if status in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(5)

if status == "FINISHED":
    print("✅ COPY completed successfully")
else:
    # "Error" carries the failure reason for FAILED statements
    raise RuntimeError(f"COPY {status}: {desc.get('Error')}")
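The polling loop above runs forever if the statement never reaches a terminal state. A minimal sketch of a reusable waiter with a timeout follows. It takes the client as an argument, so it works with a real `redshift-data` client and with the stand-in `FakeDataClient` used here to make the example runnable without AWS access. The names `wait_for_statement` and `FakeDataClient` are hypothetical.

```python
import time

TERMINAL_STATES = {"FINISHED", "FAILED", "ABORTED"}

def wait_for_statement(client, statement_id, timeout_s=600, poll_s=5,
                       sleep=time.sleep):
    """Poll describe_statement until a terminal state or timeout."""
    waited = 0
    while True:
        desc = client.describe_statement(Id=statement_id)
        status = desc["Status"]
        if status == "FINISHED":
            return desc
        if status in TERMINAL_STATES:
            raise RuntimeError(
                f"Statement {statement_id} {status}: {desc.get('Error')}")
        if waited >= timeout_s:
            raise TimeoutError(
                f"Statement {statement_id} still {status} after {waited}s")
        sleep(poll_s)
        waited += poll_s

class FakeDataClient:
    """Stand-in that replays a fixed sequence of statuses."""
    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_statement(self, Id):
        return {"Id": Id, "Status": next(self._statuses)}

fake = FakeDataClient(["SUBMITTED", "STARTED", "FINISHED"])
result = wait_for_statement(fake, "stmt-123", sleep=lambda s: None)
print(result["Status"])
```

Injecting `sleep` keeps unit tests instant; in production the default `time.sleep` applies.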

Spark Redshift Connector — Write DataFrame directly to Redshift

The spark-redshift connector (open-source, also available as Databricks built-in) lets you write a PySpark DataFrame directly to Redshift. Under the hood it: (1) writes the DataFrame to S3 as Avro/Parquet, (2) runs a COPY command from that S3 path into Redshift. It uses S3 as a staging area, so you need an IAM role that grants both S3 write and Redshift COPY permissions.

python — PySpark write to Redshift via spark-redshift connector

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("RedshiftWriter") \
    .config("spark.jars.packages",
            "io.github.spark-redshift-community:spark-redshift_2.12:6.2.0-spark_3.5") \
    .getOrCreate()

# Assumption: the password arrives via an environment variable here;
# in production, fetch it from Secrets Manager instead
db_password = os.environ["REDSHIFT_PASSWORD"]

df = spark.read.parquet("s3://my-data-lake/silver/orders/")

# Write DataFrame → S3 staging → Redshift COPY (all automatic)
# Parentheses replace backslash continuations, which cannot be followed by comments
(df.write
   .format("io.github.spark_redshift_community.spark.redshift")
   .option("url",            "jdbc:redshift://my-cluster.abc.us-east-1.redshift.amazonaws.com:5439/analytics")
   .option("dbtable",        "analytics.orders")
   .option("tempdir",        "s3://my-temp-bucket/redshift-staging/")
   .option("aws_iam_role",   "arn:aws:iam::123456789012:role/RedshiftS3ReadRole")
   .option("user",           "etl_user")
   .option("password",       db_password)
   .mode("append")           # or "overwrite" to replace the table's contents
   .save())

print("DataFrame written to Redshift via COPY")

💡 Real Pipeline Example

Spark reads raw JSON from S3 → transforms to clean DataFrame → writes to s3://temp/staging/ as Parquet → connector issues COPY analytics.orders FROM 's3://temp/staging/' FORMAT AS PARQUET → data lands in Redshift. The connector does not delete its staging files, so put an S3 lifecycle rule on the temp bucket to expire them.

📤

UNLOAD — Exporting Query Results to S3 EXPORT ▼

What is UNLOAD and When to Use It

UNLOAD exports the results of a SELECT query to S3. By default it writes in parallel, producing one or more files per slice. Use it to: export data for downstream consumers (reporting, ML training), create data extracts for sharing, or move aggregated results to S3 for Athena queries. Because the default output is a set of files rather than a single file, use a manifest to track what was written.

🏭 Analogy

COPY is the loading dock (S3 → Redshift). UNLOAD is the shipping department (Redshift → S3). Same parallel approach — all slices write simultaneously.

UNLOAD to Parquet (Recommended)

sql — UNLOAD query results to S3 as Parquet

-- Export aggregated sales summary to S3 as compressed Parquet
-- Single quotes inside the quoted query are escaped by doubling them ('')
UNLOAD ('
  SELECT  order_date,
          product_category,
          SUM(amount)    AS total_amount,
          COUNT(*)       AS order_count
  FROM    analytics.orders
  WHERE   order_date >= ''2024-01-01''
  GROUP BY 1, 2
')
TO 's3://my-data-lake/exports/sales_summary/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'   -- despite the name, this role must allow s3:PutObject on the target prefix
FORMAT AS PARQUET
ALLOWOVERWRITE    -- overwrite existing files
PARALLEL ON       -- write one file per slice (default, fast)
MANIFEST;         -- write a manifest listing all output files

-- Output: s3://my-data-lake/exports/sales_summary/0000_part_00.parquet
--         s3://my-data-lake/exports/sales_summary/0001_part_00.parquet
--         s3://my-data-lake/exports/sales_summary/manifest (manifest JSON)
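Because the SELECT is passed to UNLOAD as a quoted string, every single quote inside it has to be doubled. When the statement is generated from Python, that escaping is easy to get wrong by hand, so a small builder can do it. This is a sketch with a hypothetical function name; it is meant for trusted, pipeline-authored SQL and is not a defence against SQL injection.

```python
def build_unload_sql(select_sql, s3_prefix, iam_role_arn):
    """Wrap a SELECT in an UNLOAD ... FORMAT AS PARQUET statement."""
    # Drop a trailing semicolon, then double every single quote
    escaped = select_sql.strip().rstrip(";").replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET\n"
        "ALLOWOVERWRITE\n"
        "MANIFEST;"
    )

unload_sql = build_unload_sql(
    "SELECT order_date, SUM(amount) FROM analytics.orders "
    "WHERE order_date >= '2024-01-01' GROUP BY 1;",
    "s3://my-data-lake/exports/sales_summary/",
    "arn:aws:iam::123456789012:role/RedshiftS3ReadRole",
)
print(unload_sql)
```

The resulting string can be submitted through the Redshift Data API in the same way as the COPY statement shown earlier.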

UNLOAD to CSV + Single File

sql — UNLOAD to a single CSV file (PARALLEL OFF)

-- Export to a single CSV (slow on large tables — only use for small result sets)
UNLOAD ('SELECT * FROM analytics.dim_product')
TO 's3://my-data-lake/exports/dim_product/product_export_'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
DELIMITER ','
ADDQUOTES          -- wrap string values in quotes
HEADER             -- add column header row
GZIP               -- compress output
PARALLEL OFF       -- write serially to ONE file (splits only if output exceeds 6.2 GB)
ALLOWOVERWRITE;

-- Output: s3://my-data-lake/exports/dim_product/product_export_000.gz

⚠️ Warning

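PARALLEL OFF makes UNLOAD write serially instead of from all slices at once, so it is dramatically slower on large tables. It also yields a single file only while the output stays under the 6.2 GB per-file limit; beyond that, additional files are created. Only use it when downstream systems require exactly one file (some legacy ETL tools). For everything else, use PARALLEL ON (the default).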

⚖️

Distribution Styles — KEY, EVEN, ALL PERFORMANCE ▼

Why Distribution Style Matters

Redshift distributes rows of each table across node slices. The distribution style controls which rows go to which slice. Choosing the right distribution style eliminates or minimises data redistribution during joins — the most expensive operation in a distributed query. If two large tables are joined on a column and their rows are on the same slice, the join is local (fast). If rows are on different slices, Redshift must redistribute data over the network (slow).

EVEN — Round-Robin Distribution

Rows are distributed round-robin across slices regardless of content. Every slice gets an equal number of rows, so there is no skew. Use EVEN for tables that are not frequently joined on a consistent key, or when you don't know what the join pattern will be. It's a safe choice but doesn't optimise any specific join. Note that if you omit DISTSTYLE entirely, current Redshift versions apply AUTO (Redshift picks a style based on table size), not EVEN.

sql — EVEN distribution

CREATE TABLE analytics.pipeline_audit (
    run_id        VARCHAR(64),
    pipeline_name VARCHAR(128),
    status        VARCHAR(20),
    row_count     BIGINT,
    run_date      DATE
)
DISTSTYLE EVEN;   -- round-robin; good for tables not joined on a key

KEY — Hash Distribution (Best for Large Table Joins)

Rows with the same distribution key value always go to the same slice. If two large tables are both distributed on the same join column (e.g. customer_id), their matching rows are co-located on the same slice — the join requires no network redistribution. This is the most impactful distribution choice for star schema fact-dimension joins.

sql — KEY distribution for fact + dimension co-location

-- Fact table distributed on customer_id
CREATE TABLE analytics.fact_orders (
    order_id    BIGINT,
    customer_id BIGINT,
    product_id  BIGINT,
    order_date  DATE,
    amount      DECIMAL(18,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)        -- distribute by customer_id
SORTKEY (order_date);        -- sort within each slice by date

-- Dimension table also distributed on customer_id
CREATE TABLE analytics.dim_customer (
    customer_id   BIGINT,
    customer_name VARCHAR(200),
    country       VARCHAR(50)
)
DISTSTYLE KEY
DISTKEY (customer_id);   -- same key → co-located with fact_orders

-- This join requires ZERO network redistribution → very fast
SELECT  c.country, SUM(o.amount)
FROM    analytics.fact_orders o
JOIN    analytics.dim_customer c USING (customer_id)
GROUP BY 1;

⚠️ Skew Risk

If your DISTKEY column has a highly skewed distribution (e.g. 80% of rows have customer_id = 1), one slice will be overloaded and queries will be slow despite co-location. After loading, check the skew_rows column in svv_table_info (rows on the fullest slice divided by rows on the emptiest slice); a value far above 1 indicates skew.
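The effect of a skewed key can be simulated without a cluster. This toy model hashes each key to a slice and reports the ratio of rows on the fullest slice to rows on the emptiest. Redshift's real hash function is internal, so SHA-256 is used here purely as a stand-in, and the function names and row counts are illustrative assumptions.

```python
import hashlib
from collections import Counter

def slice_for_key(key, num_slices):
    """Map a distribution key to a slice (SHA-256 is a stand-in hash)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slices

def row_skew(keys, num_slices):
    """Rows on the fullest slice divided by rows on the emptiest slice."""
    counts = Counter(slice_for_key(k, num_slices) for k in keys)
    per_slice = [counts.get(s, 0) for s in range(num_slices)]
    emptiest = max(min(per_slice), 1)  # avoid division by zero
    return max(per_slice) / emptiest

NUM_SLICES = 8
balanced_keys = range(100_000)                       # every customer_id distinct
skewed_keys = [1] * 80_000 + list(range(2, 20_002))  # 80% of rows share one key

balanced_skew = row_skew(balanced_keys, NUM_SLICES)
skewed_skew = row_skew(skewed_keys, NUM_SLICES)
print(f"balanced: {balanced_skew:.2f}  skewed: {skewed_skew:.2f}")
```

With distinct keys the ratio stays close to 1, while the dominant key pushes one slice far above the rest. That overloaded slice is the one every query ends up waiting for.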

ALL — Broadcast Small Dimension Tables

A full copy of the table is placed on every node. This means joining a large fact table to a small dimension table is always local, because the dimension rows are already present on every node. Use ALL only for small, rarely-updated dimension tables (under a few million rows). The downside is that storage and write cost are multiplied by the number of nodes, since every load, insert, or update must be applied to every copy.

sql — ALL distribution for small dimensions

-- Small lookup table — copy to ALL nodes for fast local joins
CREATE TABLE analytics.dim_country (
    country_code CHAR(2),
    country_name VARCHAR(100),
    region       VARCHAR(50)
)
DISTSTYLE ALL;   -- full copy on every node; tiny table, updated rarely

Style	Use When	Join Performance	Write Speed
EVEN	No dominant join key; staging tables	Medium (may redistribute)	Fast
KEY	Large fact tables joined on a consistent column	Best (co-located)	Fast
ALL	Small dimensions (<5M rows, rarely updated)	Best (local broadcast)	Slow

🗂️

Sort Keys — Compound vs Interleaved QUERY SPEED ▼

What Sort Keys Do

Redshift stores rows on disk in sort key order within each slice. When a query filters on the sort key column, Redshift uses zone maps — per-block min/max metadata — to skip entire 1 MB blocks that can't contain matching rows. This is called zone map pruning and it means a well-sorted table can answer a filtered query by reading 1% of the data instead of 100%.

📚 Analogy

Sort keys are like a book index. If you want records from January 2024 and the table is sorted by date, Redshift jumps directly to the January blocks and reads only those. Without sort keys it must read every page.
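Zone map pruning is easy to model. The sketch below splits a column into fixed-size blocks, records the min and max of each block, and counts how many blocks a range filter has to read, first with the rows stored in sort order and then with the same rows shuffled. The block size of 100 rows and the function names are assumptions for illustration; real Redshift blocks are 1 MB.

```python
import random

BLOCK_ROWS = 100  # rows per block in this toy model

def build_zone_map(values):
    """Split values into blocks and record (min, max) per block."""
    blocks = [values[i:i + BLOCK_ROWS]
              for i in range(0, len(values), BLOCK_ROWS)]
    return [(min(b), max(b)) for b in blocks]

def blocks_to_read(zone_map, low, high):
    """Count blocks whose [min, max] overlaps the filter low <= v <= high."""
    return sum(1 for bmin, bmax in zone_map if bmin <= high and bmax >= low)

values = list(range(10_000))         # e.g. order_date encoded as day numbers
sorted_map = build_zone_map(values)  # table stored in sort key order

shuffled = values[:]
random.Random(42).shuffle(shuffled)  # same rows, no useful ordering
unsorted_map = build_zone_map(shuffled)

sorted_reads = blocks_to_read(sorted_map, 2000, 2099)
unsorted_reads = blocks_to_read(unsorted_map, 2000, 2099)
print(f"sorted:   {sorted_reads} of {len(sorted_map)} blocks")
print(f"unsorted: {unsorted_reads} of {len(unsorted_map)} blocks")
```

In the sorted layout the filter touches a single block. In the shuffled layout almost every block spans the whole value range, so the zone maps cannot rule anything out.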

Compound Sort Key (Most Common)

A compound sort key creates a multi-column sort order — similar to an ORDER BY clause. The first column in the sort key is the primary sort, and subsequent columns only help when the query also filters on earlier columns. Best for tables that are consistently filtered on the same leading columns (e.g. always filter by order_date first, then optionally region).

sql — compound sort key (most common choice)

CREATE TABLE analytics.fact_orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    region      VARCHAR(50),
    amount      DECIMAL(18,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
COMPOUND SORTKEY (order_date, region);
-- Queries filtering on order_date benefit most
-- Queries filtering on order_date AND region benefit even more
-- Queries filtering on ONLY region (not order_date) get no benefit

Interleaved Sort Key

An interleaved sort key gives equal weight to all sort key columns — a query filtering on any one of them benefits equally. The trade-off is slower VACUUM and COPY performance because the interleaved z-ordering index is expensive to maintain. Use interleaved only when queries have diverse filter combinations with no dominant leading column. In practice, compound is almost always preferred.

sql — interleaved sort key (rare — only for diverse filter patterns)

CREATE TABLE analytics.fact_events (
    event_id    BIGINT,
    user_id     BIGINT,
    event_date  DATE,
    event_type  VARCHAR(50)
)
INTERLEAVED SORTKEY (user_id, event_date, event_type);
-- Any single-column filter benefits equally
-- But VACUUM (and VACUUM REINDEX) is noticeably slower on interleaved tables: use sparingly

VACUUM and ANALYZE — Maintenance Commands

After large loads or deletes, rows may be unsorted (Redshift appends new rows at the end regardless of sort key). VACUUM re-sorts rows and reclaims space from deleted rows. ANALYZE updates table statistics used by the query planner. Run both after major data loads in production.

sql — VACUUM and ANALYZE after data load

-- Re-sort unsorted rows and reclaim space from deletes
VACUUM analytics.fact_orders;

-- Re-sort rows only, without reclaiming space from deleted rows
VACUUM SORT ONLY analytics.fact_orders;

-- Reclaim space from deleted rows only, without re-sorting
VACUUM DELETE ONLY analytics.fact_orders;

-- Update query planner statistics (always run after VACUUM)
ANALYZE analytics.fact_orders;

-- Check unsorted percentage before deciding to VACUUM
SELECT "table", unsorted, stats_off, size
FROM   svv_table_info
WHERE  "table" = 'fact_orders';

📌 Production Tip

Redshift runs automatic VACUUM in the background when the cluster is idle. In practice you rarely need to run VACUUM manually unless you have a large batch delete or a critical query that needs re-sorted rows immediately.

🔭

Redshift Spectrum — Query S3 from Redshift S3 FEDERATION ▼

What is Redshift Spectrum

Redshift Spectrum allows Redshift to query data directly in S3 without loading it into Redshift storage. You define an external table that points to an S3 path (via Glue Catalog), then query it with regular SQL from Redshift. Spectrum uses a separate fleet of Spectrum nodes that scan S3 in parallel, apply filters and partial aggregations in that layer, and return the reduced results to Redshift for final joins and aggregation. You pay per terabyte scanned.

🔭 Analogy

Redshift Spectrum is like having a Redshift telescope that can see beyond its own storage. Your data stays in the S3 data lake, but you can query it with the same SQL you use for internal Redshift tables — and even join S3 data with internal Redshift tables in the same query.

REDSHIFT SPECTRUM ARCHITECTURE

Redshift SQL Client
        │  SELECT ... FROM external_schema.s3_table
        ▼
Leader Node → Query Planner
        │
        ├── Internal Redshift tables → Compute Nodes (local)
        │
        └── External S3 tables → Spectrum Nodes (separate fleet)
                    │
                    ▼
            S3 Data Lake (Parquet/ORC)
            Glue Catalog (schema metadata)
                    │
            Partition pruning at S3 level
            Column pruning at S3 level
                    │
            Results returned to Redshift
            for final join + aggregation

Setting Up Spectrum — External Schema + External Table

sql — create external schema pointing to Glue Catalog

-- Step 1: Create an external schema backed by Glue Catalog
CREATE EXTERNAL SCHEMA ext_silver
FROM DATA CATALOG
DATABASE 'silver_db'                   -- Glue Catalog database name
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Step 2: Now query Glue Catalog tables from Redshift directly
-- (tables defined in Glue appear automatically in ext_silver schema)
SELECT  COUNT(*), order_date
FROM    ext_silver.orders          -- this table lives in S3, not Redshift!
WHERE   year = '2024'             -- partition pruning — only reads 2024 data
GROUP BY order_date;

-- Step 3: Join S3 data with internal Redshift table in ONE query
SELECT  c.customer_name,
        SUM(o.amount) AS total_spend
FROM    ext_silver.orders     o    -- S3 via Spectrum
JOIN    analytics.dim_customer c    -- internal Redshift table
  ON  o.customer_id = c.customer_id
GROUP BY 1
ORDER BY 2 DESC
LIMIT    20;

📌 When to Use Spectrum vs COPY

Use COPY for hot, frequently-queried data that needs sub-second latency. Use Spectrum for cold historical data (>90 days old) where you want to avoid the cost of storing it in Redshift but still need occasional SQL access. A common pattern is: keep the last 90 days in Redshift (fast), archive everything else to S3 and query via Spectrum (cheap).
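Since Spectrum bills by data scanned, the savings from partition and column pruning can be estimated with simple arithmetic. The sketch below assumes a price of 5 USD per TB and 1 TB = 1024^4 bytes; both are illustrative assumptions, so check current regional pricing. The function names are hypothetical.

```python
TB = 1024**4

def scanned_bytes(table_bytes, partition_fraction=1.0, column_fraction=1.0):
    """Bytes left to scan after partition pruning and column pruning."""
    return table_bytes * partition_fraction * column_fraction

def spectrum_scan_cost(bytes_scanned, price_per_tb=5.00):
    """Estimated USD cost of one query (price_per_tb is an assumption)."""
    return bytes_scanned / TB * price_per_tb

# Full scan of a 10 TB external table
full_scan = spectrum_scan_cost(scanned_bytes(10 * TB))

# Same table: read 1 of 12 monthly partitions and 3 of 30 equally sized columns
pruned = spectrum_scan_cost(scanned_bytes(10 * TB, 1 / 12, 3 / 30))

print(f"full scan: ${full_scan:.2f}  pruned: ${pruned:.2f}")
```

The gap between the two numbers is the practical argument for storing Spectrum data as partitioned Parquet rather than as CSV under a single prefix.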

🚦

WLM — Workload Management, Concurrency Scaling CONCURRENCY ▼

What is WLM

Workload Management (WLM) controls how Redshift allocates memory and query slots across different types of queries. Without WLM, a single long-running ETL query could consume all cluster memory and block all dashboard queries. WLM lets you create queues with dedicated memory percentages and concurrency limits so ETL and BI queries don't starve each other.

Automatic WLM (Recommended — Default)

Automatic WLM lets Redshift dynamically allocate memory to queries based on their size and priority. Redshift classifies queries as short, medium, or long and allocates more memory to complex queries automatically. This is the recommended mode for most clusters — you don't need to manually tune queue memory percentages.

sql — check WLM queue assignment and query priority

-- See the queries WLM is currently tracking (queued or running)
SELECT * FROM stv_wlm_query_state;

-- Check queue wait times (is a queue a bottleneck?)
-- total_queue_time is recorded in microseconds
SELECT  service_class,
        COUNT(*) AS query_count,
        AVG(total_queue_time) / 1000000.0 AS avg_wait_seconds
FROM    stl_wlm_query
WHERE   queue_start_time > GETDATE() - INTERVAL '1 hour'
GROUP BY service_class;

-- Route this session's queries to the queue that matches this query group;
-- with automatic WLM, that queue's priority then applies
SET query_group TO 'etl_low_priority';

Manual WLM Queues (For Strict SLA Control)

In manual WLM you define queues with explicit memory % and concurrency limits. A typical production setup has: a BI queue (high concurrency, low memory each — fast dashboard queries), an ETL queue (low concurrency, high memory each — large transformations), and a default queue for everything else.

MANUAL WLM QUEUE EXAMPLE

Queue 1: "BI Queries"  → 30% memory, 15 concurrent slots
  → Matched by: user group "bi_users", query group "bi"
  → Fast SELECT queries from Tableau/QuickSight

Queue 2: "ETL Jobs"    → 60% memory, 3 concurrent slots
  → Matched by: user group "etl_roles", query group "etl"
  → Large COPY, INSERT INTO SELECT, CREATE TABLE AS

Queue 3: "Default"     → 10% memory, 5 concurrent slots
  → All other queries that don't match above queues

Concurrency Scaling

Concurrency Scaling automatically adds transient Redshift clusters when the main cluster has queued queries. The additional capacity handles burst demand and is billed per second. It's transparent to users: queries simply spend less time queued during peak load. A cluster earns up to one hour of free Concurrency Scaling credits for every 24 hours the main cluster runs.

📌 Production Pattern

Enable Concurrency Scaling on your BI queue only. ETL jobs can wait in queue — they're batch workloads and a few extra minutes don't matter. Dashboard queries have human users waiting — those benefit most from burst capacity.
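The three-queue layout and the BI-only Concurrency Scaling advice above can be expressed as a `wlm_json_configuration` parameter value. The sketch below builds and validates that JSON. The property names (`user_group`, `query_group`, `query_concurrency`, `memory_percent_to_use`, `concurrency_scaling`) follow the WLM JSON format as commonly documented and should be verified against the current Redshift documentation; the helper name and parameter group name are hypothetical.

```python
import json

def build_manual_wlm_config(queues):
    """Validate queue memory and return the wlm_json_configuration string."""
    total_memory = sum(q["memory_percent_to_use"] for q in queues)
    if total_memory > 100:
        raise ValueError(f"memory adds up to {total_memory}%, must be <= 100")
    return json.dumps(queues)

queues = [
    {   # BI queue: many small queries, allowed to burst
        "user_group": ["bi_users"],
        "query_group": ["bi"],
        "query_concurrency": 15,
        "memory_percent_to_use": 30,
        "concurrency_scaling": "auto",
    },
    {   # ETL queue: few large queries, waits in queue instead of bursting
        "user_group": ["etl_roles"],
        "query_group": ["etl"],
        "query_concurrency": 3,
        "memory_percent_to_use": 60,
        "concurrency_scaling": "off",
    },
    {   # Default queue: everything that matches no other queue
        "query_concurrency": 5,
        "memory_percent_to_use": 10,
    },
]

wlm_json = build_manual_wlm_config(queues)
print(wlm_json)

# With AWS access, the value would be applied roughly like this
# (the parameter group name "prod-wlm" is hypothetical):
# boto3.client("redshift").modify_cluster_parameter_group(
#     ParameterGroupName="prod-wlm",
#     Parameters=[{"ParameterName": "wlm_json_configuration",
#                  "ParameterValue": wlm_json}],
# )
```

Validating the memory total before the API call catches the most common configuration mistake locally, without waiting for the parameter group change to be rejected.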

Materialized Views — Pre-computed Aggregations

A materialized view stores the result of a query physically on disk. Instead of re-computing a complex aggregation on every dashboard refresh, the MV pre-computes it and users query the pre-built result. Use materialized views for expensive recurring aggregations that don't need real-time freshness.

sql — materialized view for pre-computed daily sales summary

-- Create a materialized view of daily sales (expensive to compute live)
CREATE MATERIALIZED VIEW analytics.mv_daily_sales AS
SELECT  order_date,
        region,
        product_category,
        SUM(amount)    AS total_amount,
        COUNT(DISTINCT customer_id) AS unique_customers,
        COUNT(*)       AS order_count
FROM    analytics.fact_orders
GROUP BY 1, 2, 3;

-- Dashboard queries now hit the pre-built MV — instant results
SELECT * FROM analytics.mv_daily_sales
WHERE  order_date >= CURRENT_DATE - 30;

-- Refresh the MV after each ETL load (or on a schedule)
REFRESH MATERIALIZED VIEW analytics.mv_daily_sales;

-- Auto-refresh option (Redshift refreshes automatically when base table changes)
ALTER MATERIALIZED VIEW analytics.mv_daily_sales AUTO REFRESH YES;

🏭

Production Redshift Patterns for Data Engineers PRODUCTION ▼

Incremental Load Pattern — Delete + COPY

The simplest and most reliable incremental load for daily partitioned data: delete the target partition's rows, then COPY the fresh partition from S3. Because both statements run inside one transaction, the load is atomic: if COPY fails, the DELETE is rolled back. Use DELETE here rather than TRUNCATE, because TRUNCATE commits immediately in Redshift and cannot be rolled back.

sql — daily incremental load: delete partition + COPY

BEGIN;

-- Delete today's partition from target (idempotent — safe to re-run)
DELETE FROM analytics.fact_orders
WHERE  order_date = '2024-06-15';

-- Load fresh data from S3 (today's Parquet partition)
COPY analytics.fact_orders
FROM 's3://my-data-lake/silver/orders/year=2024/month=06/day=15/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS PARQUET;

COMMIT;   -- atomic: both succeed or both rolled back

UPSERT Pattern via Staging Table

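Older Redshift versions have no native UPSERT; current versions support a MERGE statement, but the staging-table pattern is still widely used. The classic pattern is: COPY new data into a staging table → DELETE matching rows from the target → INSERT from staging. The DELETE and INSERT run in one transaction, so readers never see a half-applied update.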

sql — upsert via staging table (classic Redshift pattern)

-- Step 1: COPY incoming data into a staging table
CREATE TEMP TABLE stg_orders (LIKE analytics.fact_orders);

COPY stg_orders
FROM 's3://my-data-lake/silver/orders/incremental/2024-06-15/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS PARQUET;

BEGIN;

-- Step 2: Delete existing rows that match incoming keys
DELETE FROM analytics.fact_orders
USING  stg_orders
WHERE  analytics.fact_orders.order_id = stg_orders.order_id;

-- Step 3: Insert all rows from staging (includes new + updated)
INSERT INTO analytics.fact_orders
SELECT * FROM stg_orders;

COMMIT;

-- Staging table auto-dropped when session ends (TEMP table)
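When this pattern is repeated for many tables, the statements are usually generated rather than hand-written. The sketch below returns the steps above as an ordered list of SQL strings. The function name and default staging table name are hypothetical, and the identifiers are interpolated directly, so it is only suitable for trusted, pipeline-defined table names.

```python
def build_upsert_statements(target, key_column, s3_prefix, iam_role_arn,
                            staging="stg_upsert"):
    """Return the staging-table upsert as an ordered list of SQL strings."""
    return [
        f"CREATE TEMP TABLE {staging} (LIKE {target});",
        f"COPY {staging} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' FORMAT AS PARQUET;",
        f"DELETE FROM {target} USING {staging} "
        f"WHERE {target}.{key_column} = {staging}.{key_column};",
        f"INSERT INTO {target} SELECT * FROM {staging};",
    ]

statements = build_upsert_statements(
    target="analytics.fact_orders",
    key_column="order_id",
    s3_prefix="s3://my-data-lake/silver/orders/incremental/2024-06-15/",
    iam_role_arn="arn:aws:iam::123456789012:role/RedshiftS3ReadRole",
)
for sql in statements:
    print(sql)
```

One way to run the list is the Data API's `batch_execute_statement(Sqls=statements, ...)`, which executes the statements in order as a single transaction, so no explicit BEGIN/COMMIT is needed in that case.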

Full Pipeline: Spark → S3 → Redshift

Source DB / Kafka

→

Spark Transform

→

S3 Silver (Parquet)

→

Redshift COPY

→

BI / Reporting

python — full pipeline: Spark write + boto3 Redshift COPY

import boto3, time, json
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, current_date

spark = SparkSession.builder.appName("OrdersPipeline").getOrCreate()

# ── 1. Read from source and transform ──────────────────────────
df = spark.read.parquet("s3://my-lake/bronze/orders/date=2024-06-15/")
df_clean = df.filter(df.amount > 0) \
             .withColumn("load_date", current_date()) \
             .dropDuplicates(["order_id"])

# ── 2. Write to S3 Silver as Parquet ───────────────────────────
s3_path = "s3://my-lake/silver/orders/date=2024-06-15/"
df_clean.write.mode("overwrite").parquet(s3_path)
print(f"Wrote {df_clean.count()} rows to S3")

# ── 3. Load from S3 into Redshift via Data API ─────────────────
sm = boto3.client("secretsmanager")
creds = json.loads(sm.get_secret_value(SecretId="prod/redshift/etl")["SecretString"])

rdclient = boto3.client("redshift-data", region_name="us-east-1")

# execute_statement is meant for a single SQL statement, so the
# DELETE + COPY pair goes through batch_execute_statement, which runs
# the list in order as one transaction (no explicit BEGIN/COMMIT needed)
delete_sql = "DELETE FROM analytics.fact_orders WHERE order_date = '2024-06-15';"
copy_sql = f"""
COPY analytics.fact_orders
FROM '{s3_path}'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS PARQUET;
"""

resp = rdclient.batch_execute_statement(
    ClusterIdentifier="prod-redshift",
    Database="analytics",
    DbUser=creds["username"],
    Sqls=[delete_sql, copy_sql]
)
stmt_id = resp["Id"]

# ── 4. Poll until complete (up to 10 minutes) ──────────────────
for _ in range(60):
    desc = rdclient.describe_statement(Id=stmt_id)
    status = desc["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(10)

print(f"Redshift load status: {status}")
if status != "FINISHED":
    # Covers FAILED, ABORTED, and a load still running when polling gave up
    raise RuntimeError(f"Redshift load did not finish ({status}): {desc.get('Error')}")

29.11 — AMAZON MSK

Amazon MSK
Managed Streaming for Apache Kafka

Amazon MSK is AWS's fully managed Apache Kafka service. As a data engineer you'll use MSK as the central event backbone — ingesting real-time data from producers, feeding Spark Structured Streaming consumers, and connecting to S3 via MSK Connect. Every topic below maps directly to what you configure and debug in production streaming pipelines.

📚

Kafka Fundamentals — Brokers, Topics, Partitions, Consumer Groups FOUNDATION ▼

Brokers

A Kafka broker is a single server in the Kafka cluster. It stores topic partitions on disk and serves read/write requests from producers and consumers. In MSK, AWS manages the broker fleet — you choose the number of brokers and the instance type. A typical production MSK cluster has 3 brokers (one per Availability Zone) for HA. Each broker hosts a subset of all topic partitions.

🏢 Analogy

Brokers are like post offices in different cities. Each post office holds certain mailboxes (partitions). Mail (messages) arrives at the right post office based on the mailbox address (partition key hash).

Topics and Partitions

A topic is a named category of messages — e.g. prod.orders, prod.clicks. Each topic is split into partitions — ordered, immutable sequences of messages. Partitions are the unit of parallelism: more partitions = more consumers reading in parallel = higher throughput. Each partition lives on exactly one broker (its leader) but is replicated to other brokers for fault tolerance.

📂

Partition = Ordered Log

Messages within a partition are strictly ordered by offset (0, 1, 2, …). Order is only guaranteed within a partition — not across partitions of the same topic.

⚡

Parallelism

In Spark Structured Streaming, each Kafka partition maps to one Spark task. 12 partitions = 12 parallel Spark tasks reading simultaneously.

🔑

Partition Key

Producer assigns a message key (e.g. customer_id). Kafka hashes the key to determine which partition it goes to — same key always goes to same partition = ordering per key.
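The key-to-partition mapping can be illustrated in a few lines of Python. This is a minimal sketch, not Kafka's actual partitioner: Kafka's default partitioner hashes the key bytes with murmur2, while this sketch substitutes `zlib.crc32` from the standard library as a stand-in hash. The function name `partition_for` is hypothetical.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index.

    Assumption: crc32 stands in for Kafka's murmur2 hash, for illustration only.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# The same key always maps to the same partition, which is what gives
# per-key ordering.
for customer_id in ["cust-42", "cust-7", "cust-42", "cust-42"]:
    print(customer_id, "-> partition", partition_for(customer_id, 12))
```

Because the result depends on the partition count, adding partitions to an existing topic changes where keys land, so per-key ordering only holds while the partition count stays fixed.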

Offsets and Retention

Every message in a partition has a unique, monotonically increasing offset (0, 1, 2, …). Consumers track their position by offset — they can replay from any offset, read only new messages, or start from the beginning. Kafka retains messages for a configurable period (default 7 days) or until a size limit — after that, old messages are deleted. Retention must be long enough to cover your recovery window.

KAFKA TOPIC STRUCTURE

Topic: prod.orders (3 partitions, replication factor 3)

  Partition 0: [offset 0] [offset 1] [offset 2] [offset 3] → newest
               Leader: Broker 1 | Replicas: Broker 2, Broker 3
  Partition 1: [offset 0] [offset 1] [offset 2] → newest
               Leader: Broker 2 | Replicas: Broker 1, Broker 3
  Partition 2: [offset 0] [offset 1] → newest
               Leader: Broker 3 | Replicas: Broker 1, Broker 2

Consumer Group "spark-etl-cg":
  Task 0 → reads Partition 0 (last committed offset: 3)
  Task 1 → reads Partition 1 (last committed offset: 2)
  Task 2 → reads Partition 2 (last committed offset: 1)

Consumer Groups

A consumer group is a set of consumers that collectively read a topic. Kafka assigns each partition to exactly one consumer in the group — no two consumers in the same group read the same partition simultaneously. This enables parallel processing without duplicate reads. Different consumer groups can independently read the same topic — each group has its own offset tracking. Your Spark Structured Streaming job is one consumer group; a separate Flink job can be another.

📌 Rule of Thumb

The maximum useful parallelism for a consumer group equals the number of partitions. If you have 12 partitions and 15 consumers, 3 consumers will be idle. If you have 12 partitions and 6 consumers, each consumer handles 2 partitions. Match Kafka partition count to your desired Spark parallelism.
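The rule of thumb can be checked with a small simulation. The sketch below uses a plain round-robin assignment as a simplified stand-in for Kafka's real partition assignors; the function name `assign_partitions` and the consumer names are hypothetical.

```python
def assign_partitions(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Round-robin assignment: each partition goes to exactly one consumer."""
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

for n in (6, 15):
    members = [f"consumer-{i}" for i in range(n)]
    plan = assign_partitions(12, members)
    idle = [c for c, parts in plan.items() if not parts]
    busiest = max(len(parts) for parts in plan.values())
    print(f"{n} consumers: {len(idle)} idle, busiest owns {busiest} partition(s)")
```

With 12 partitions, 6 consumers own 2 partitions each, while 15 consumers leave 3 with nothing to read.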

Replication Factor and ISR

The replication factor controls how many broker copies each partition has. Replication factor 3 means every partition has 1 leader + 2 followers. The In-Sync Replicas (ISR) set is the leader plus the followers that are fully caught up with it. With acks=all, a message is acknowledged to the producer only after every ISR member has written it. Combined with min.insync.replicas=2, an acknowledged message exists on at least two brokers, so a single broker crash immediately after acknowledgement loses no data.

python — create an MSK topic with the kafka-python admin client

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(
    bootstrap_servers="b-1.my-msk.abc.kafka.us-east-1.amazonaws.com:9092,"
                      "b-2.my-msk.abc.kafka.us-east-1.amazonaws.com:9092",
    client_id="topic-creator"
)

topic = NewTopic(
    name="prod.orders",
    num_partitions=12,         # 12 partitions = 12 parallel Spark tasks
    replication_factor=3,      # 1 leader + 2 followers per partition
    topic_configs={
        "retention.ms":    "604800000",  # 7 days in ms
        "compression.type":"lz4",        # broker-side compression
        "min.insync.replicas": "2"         # need 2 ISR to accept writes
    }
)

admin.create_topics([topic])
print("Topic prod.orders created with 12 partitions")

🏗️

MSK Architecture — Broker Sizing, Scaling, MSK Serverless AWS MSK ▼

MSK Provisioned — Broker Sizing

MSK Provisioned gives you dedicated Kafka brokers on EC2-backed instances. You choose the instance type and number of brokers. Key sizing considerations: throughput (MB/s in + out × replication factor) determines network requirements; storage (retention period × daily data volume) determines EBS volume size; partition count drives CPU and memory requirements for ZooKeeper / KRaft metadata.

Instance Type	Network	Use Case
`kafka.t3.small`	Up to 5 Gbps	Dev/test only — not for production
`kafka.m5.large`	Up to 10 Gbps	Low-throughput production (<50 MB/s)
`kafka.m5.4xlarge`	Up to 10 Gbps	Medium production (50–300 MB/s)
`kafka.m5.16xlarge`	20 Gbps	High-throughput production (>300 MB/s)
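The sizing considerations above can be turned into a back-of-the-envelope estimate. This is a rough sketch under stated assumptions: traffic and partitions are spread evenly across brokers, every consumer group reads the full stream once, and storage is multiplied by the replication factor with 30% headroom. The function name, parameters, and the headroom figure are all illustrative choices, not AWS guidance.

```python
import math

def size_cluster(ingest_mb_s: float, retention_days: int,
                 replication_factor: int = 3, brokers: int = 3,
                 consumer_groups: int = 1, headroom: float = 0.3) -> dict:
    """Rough per-broker storage and network estimate for a provisioned cluster."""
    daily_gb = ingest_mb_s * 86_400 / 1_000                 # MB/s -> GB/day
    stored_gb = daily_gb * retention_days * replication_factor
    storage_gb_per_broker = math.ceil(stored_gb * (1 + headroom) / brokers)
    # Inbound: producer traffic plus replica copies arriving from other brokers.
    in_mb_s = ingest_mb_s * replication_factor / brokers
    # Outbound: replication to followers plus one full read per consumer group.
    out_mb_s = ingest_mb_s * (replication_factor - 1 + consumer_groups) / brokers
    return {
        "storage_gb_per_broker": storage_gb_per_broker,
        "network_in_mb_s_per_broker": in_mb_s,
        "network_out_mb_s_per_broker": out_mb_s,
    }

# 20 MB/s of producer traffic, 7 days retention, 3 brokers, one consumer group
print(size_cluster(ingest_mb_s=20, retention_days=7))
```

Treat the result as a starting point for load testing rather than a final answer; uneven key distribution and consumer replays both push real numbers higher.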

MSK Serverless

MSK Serverless automatically scales capacity based on traffic — you don't choose broker count or instance type. Pay per partition-hour and GB throughput. Ideal for variable or unpredictable workloads where you don't want to over-provision brokers. Limitation: does not support all Kafka configurations (e.g. log compaction is limited) and has higher per-unit cost at sustained high throughput than provisioned.

✅

Use MSK Serverless When

Traffic is spiky or unpredictable. You want zero cluster management. Dev/staging environments. Event-driven pipelines with variable load.

⚡

Use MSK Provisioned When

High sustained throughput (>100 MB/s). Need full Kafka configuration control. Predictable workload where right-sizing is cost-effective.

📋

Schema Registry — Avro, JSON, Protobuf + Compatibility Modes SCHEMA GOVERNANCE ▼

Why Schema Registry

Without a schema registry, every producer and consumer must agree on the message format out-of-band. When the producer changes the format, for example by adding a field, consumers that still parse the old layout can break. A Schema Registry is a central repository for message schemas. Producers register a schema and embed only a small schema ID in each message. Consumers fetch the schema by ID and deserialise correctly, even across schema versions. This decouples producers from consumers and enables schema evolution without downtime.

📚 Analogy

The Schema Registry is like a shared dictionary. Producers write messages using the dictionary's definitions. Consumers look up definitions in the same dictionary. When producers update a definition (add a field), consumers can handle old and new messages gracefully — as long as the change is backward compatible.

Compatibility Modes

The Schema Registry enforces compatibility rules when a new schema version is registered. Choose the mode based on your deployment strategy:

Mode	Rule	Producer/Consumer Upgrade Order	Use Case
BACKWARD	New schema can read data written with old schema	Upgrade consumers first, then producers	Most common: add optional fields or delete fields
FORWARD	Old schema can read data written with new schema	Upgrade producers first, then consumers	Add fields or delete optional fields
FULL	Both backward AND forward compatible	Either order	Strictest: add or delete optional fields only (fields with defaults)
NONE	No compatibility check	Any order (risky)	Dev only — never production
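The BACKWARD rule can be made concrete with a simplified check. The sketch below is not how a schema registry implements compatibility: it only tests the single rule "every field the new schema adds must have a default" and ignores type changes and promotions. The function and variable names are hypothetical.

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """True if a reader on new_schema can decode records written with old_schema.

    Simplified rule: every field that new_schema adds must carry a default.
    """
    old_fields = {f["name"] for f in old_schema["fields"]}
    return all(
        "default" in f
        for f in new_schema["fields"]
        if f["name"] not in old_fields
    )

v1 = {"type": "record", "name": "Order", "fields": [
    {"name": "order_id", "type": "long"},
    {"name": "amount",   "type": "double"},
]}
# v2_ok adds an optional field (has a default); v2_bad adds a required one.
v2_ok = dict(v1, fields=v1["fields"] + [
    {"name": "region", "type": ["null", "string"], "default": None}])
v2_bad = dict(v1, fields=v1["fields"] + [
    {"name": "region", "type": "string"}])

print(is_backward_compatible(v1, v2_ok))    # adding an optional field
print(is_backward_compatible(v1, v2_bad))   # adding a required field
print(is_backward_compatible(v2_bad, v1))   # deleting a field
```

Adding the optional field and deleting a field both pass, while adding a required field fails because the new reader has no value to fill in for old records.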

Avro Producer and Consumer with AWS Glue Schema Registry

AWS provides a Glue Schema Registry that integrates with MSK. Producers register schemas in Glue and embed the schema version ID in each message. Consumers look up the schema from Glue to deserialise. Below is the full pattern using the AWS Glue Schema Registry serialiser.

python — Avro producer with AWS Glue Schema Registry

import boto3
from kafka import KafkaProducer
from aws_schema_registry import SchemaRegistryClient
from aws_schema_registry.avro import AvroSchema
from aws_schema_registry.adapter.kafka import KafkaSerializer  # kafka-python adapter

# ── Define the Avro schema ─────────────────────────────────────
ORDER_SCHEMA = AvroSchema("""{
  "type": "record",
  "name": "Order",
  "namespace": "com.mycompany.events",
  "fields": [
    {"name": "order_id",    "type": "long"},
    {"name": "customer_id", "type": "long"},
    {"name": "amount",      "type": "double"},
    {"name": "order_date",  "type": "string"},
    {"name": "region",      "type": ["null", "string"], "default": null}
  ]
}""")

# ── Create Glue Schema Registry client ───────────────────────
glue_client = boto3.client("glue", region_name="us-east-1")
registry_client = SchemaRegistryClient(glue_client, registry_name="prod-registry")
serializer = KafkaSerializer(registry_client)

# ── Kafka producer with Avro serialiser ──────────────────────
# Port 9092 is plaintext and shown for brevity; use a TLS/IAM port in production.
producer = KafkaProducer(
    bootstrap_servers="b-1.my-msk.abc.kafka.us-east-1.amazonaws.com:9092",
    value_serializer=serializer   # expects each value as a (data, schema) tuple
)

order_event = {
    "order_id":    1001,
    "customer_id": 42,
    "amount":      299.99,
    "order_date":  "2024-06-15",
    "region":      "us-east"
}

# The value must be a (data, schema) tuple when using KafkaSerializer
producer.send(
    "prod.orders",
    key=str(order_event["order_id"]).encode(),
    value=(order_event, ORDER_SCHEMA)
)
producer.flush()
print("Order event published with Avro schema")

🔒

MSK Security — IAM Auth, TLS, SASL/SCRAM SECURITY ▼

IAM Authentication (Recommended for AWS Services)

IAM authentication lets Kafka clients authenticate using their AWS IAM role or user credentials — no separate Kafka username/password needed. The Kafka client signs requests with AWS Signature V4. This is the recommended authentication method for Lambda, Glue, EMR, and EKS workloads because they already have IAM roles.

python — Kafka producer with MSK IAM authentication

from kafka import KafkaProducer
from aws_msk_iam_sasl_signer import MSKAuthTokenProvider

try:
    from kafka.sasl.oauth import AbstractTokenProvider       # kafka-python 2.1+
except ImportError:
    from kafka.oauth.abstract import AbstractTokenProvider   # older kafka-python

class MSKTokenProvider(AbstractTokenProvider):
    """kafka-python calls token() whenever it needs a fresh IAM auth token."""
    def token(self):
        auth_token, _expiry_ms = MSKAuthTokenProvider.generate_auth_token("us-east-1")
        return auth_token

producer = KafkaProducer(
    bootstrap_servers=[
        "b-1.my-msk.abc.kafka.us-east-1.amazonaws.com:9098",   # IAM port = 9098
        "b-2.my-msk.abc.kafka.us-east-1.amazonaws.com:9098"
    ],
    security_protocol="SASL_SSL",
    sasl_mechanism="OAUTHBEARER",
    sasl_oauth_token_provider=MSKTokenProvider(),
    value_serializer=lambda v: v.encode("utf-8")
)

producer.send("prod.orders", value='{"order_id": 1001}')
producer.flush()
print("Message sent with IAM auth")

TLS Encryption in Transit

MSK encrypts data in transit using TLS by default. Clients connect on port 9094 (TLS), 9096 (SASL/SCRAM + TLS), or 9098 (IAM + TLS). Enable TLS-only mode on your MSK cluster to reject plaintext connections. Download the Amazon CA certificate for client trust store configuration.

python — Kafka consumer with TLS + SASL/SCRAM authentication

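import json
import boto3
from kafka import KafkaConsumer

# SCRAM credentials are stored in Secrets Manager. MSK requires the secret
# name to start with "AmazonMSK_"; the name below is an example.
sm = boto3.client("secretsmanager", region_name="us-east-1")
scram = json.loads(
    sm.get_secret_value(SecretId="AmazonMSK_prod_kafka_user")["SecretString"]
)

consumer = KafkaConsumer(
    "prod.orders",
    bootstrap_servers=[
        "b-1.my-msk.abc.kafka.us-east-1.amazonaws.com:9096"  # SASL_SSL port
    ],
    security_protocol="SASL_SSL",
    sasl_mechanism="SCRAM-SHA-512",
    sasl_plain_username=scram["username"],
    sasl_plain_password=scram["password"],
    ssl_cafile="/etc/kafka/amazon-root-ca.pem",   # AWS CA certificate
    group_id="spark-etl-cg",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8"))
)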

for msg in consumer:
    print(f"Partition={msg.partition} Offset={msg.offset} Value={msg.value}")

🔑

Port 9092

Plaintext — no encryption, no auth. Never use in production.

🔐

Port 9094

TLS only — encrypted, mutual TLS optional. Good for internal VPC clients.

🛡️

Port 9096

SASL_SSL (SCRAM) — username/password + TLS. Good for non-AWS clients.

☁️

Port 9098

IAM auth + TLS — best for AWS services (Lambda, Glue, EMR). No separate credentials.

📊

Consumer Lag Monitoring — CloudWatch, Burrow, Kafka UI MONITORING ▼

What is Consumer Lag

Consumer lag is the difference between the latest message offset in a partition (the log end offset) and the last offset the consumer group has committed. Lag = 0 means the consumer is fully caught up. Lag = 10,000 means the consumer is 10,000 messages behind. Rising lag is the most important signal of a struggling consumer — your Spark job is processing slower than messages are arriving.

🏃 Analogy

Consumer lag is like the gap between a runner and the finish line that keeps moving. If the finish line moves faster than the runner, the gap grows — the consumer will never catch up without scaling. If the runner is faster, the gap shrinks to zero.
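The definition translates directly into arithmetic. The sketch below computes per-partition and total lag from two offset maps; the offsets are made-up sample values, and treating a partition with no committed offset as "committed at 0" is an assumption of this example.

```python
def consumer_lag(log_end_offsets: dict[int, int],
                 committed_offsets: dict[int, int]) -> dict[int, int]:
    """Per-partition lag = log end offset - last committed offset."""
    return {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in log_end_offsets.items()
    }

end_offsets = {0: 120_500, 1: 98_200, 2: 101_000}   # newest offset per partition
committed   = {0: 120_500, 1: 95_000, 2: 91_000}    # consumer group position

lag = consumer_lag(end_offsets, committed)
print(lag)                 # lag per partition
print(sum(lag.values()))   # total across partitions, comparable to SumOffsetLag
```

Partition 0 is fully caught up, while partition 2 carries most of the backlog; a skew like this usually points at a hot key rather than an undersized consumer.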

CloudWatch MSK Metrics for Consumer Lag

python — check consumer lag via CloudWatch MSK metrics

import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch", region_name="us-east-1")
now = datetime.now(timezone.utc)   # timezone-aware; datetime.utcnow() is deprecated

# MSK publishes SumOffsetLag metric per consumer group + topic
response = cw.get_metric_statistics(
    Namespace="AWS/Kafka",
    MetricName="SumOffsetLag",
    Dimensions=[
        {"Name": "Cluster Name",     "Value": "prod-msk-cluster"},
        {"Name": "Consumer Group",   "Value": "spark-etl-cg"},
        {"Name": "Topic",            "Value": "prod.orders"}
    ],
    StartTime=now - timedelta(minutes=30),
    EndTime=now,
    Period=60,
    Statistics=["Maximum"]
)

for dp in sorted(response["Datapoints"], key=lambda x: x["Timestamp"]):
    print(f"{dp['Timestamp'].strftime('%H:%M')}  lag={dp['Maximum']:,.0f}")

# ── Create a CloudWatch alarm if lag exceeds 100k messages ─────
cw.put_metric_alarm(
    AlarmName="MSK-spark-etl-lag-high",
    MetricName="SumOffsetLag",
    Namespace="AWS/Kafka",
    Dimensions=[
        {"Name": "Cluster Name",   "Value": "prod-msk-cluster"},
        {"Name": "Consumer Group", "Value": "spark-etl-cg"},
        {"Name": "Topic",          "Value": "prod.orders"}
    ],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=2,
    Threshold=100000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:data-eng-alerts"],
    TreatMissingData="notBreaching"
)

⚡

Spark + MSK Integration — readStream, writeStream MOST IMPORTANT ▼

Reading from MSK with Spark Structured Streaming

Spark Structured Streaming reads from MSK (Kafka) using readStream with the kafka format. Each Kafka partition maps to one Spark task. The value column comes as raw bytes — you must cast/parse it. Checkpoint location stores the committed offsets so the job can resume from where it left off after a restart.

python — Spark Structured Streaming from MSK (IAM auth)

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, current_timestamp
from pyspark.sql.types import StructType, StructField, LongType, DoubleType, StringType

spark = SparkSession.builder \
    .appName("MSKOrdersConsumer") \
    .config("spark.sql.shuffle.partitions", "12") \
    .getOrCreate()

# ── Schema for the JSON payload inside Kafka value ─────────────
order_schema = StructType([
    StructField("order_id",    LongType(),   nullable=False),
    StructField("customer_id", LongType(),   nullable=False),
    StructField("amount",      DoubleType(), nullable=True),
    StructField("order_date",  StringType(), nullable=True),
    StructField("region",      StringType(), nullable=True)
])

MSK_BROKERS = ("b-1.my-msk.abc.kafka.us-east-1.amazonaws.com:9098,"
               "b-2.my-msk.abc.kafka.us-east-1.amazonaws.com:9098")

# ── Read stream from MSK ───────────────────────────────────────
# maxOffsetsPerTrigger caps the records read per micro-batch (backpressure control).
# Note: a trailing comment after a "\" line continuation is a syntax error,
# so comments for individual options live up here.
raw_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers",           MSK_BROKERS) \
    .option("kafka.security.protocol",            "SASL_SSL") \
    .option("kafka.sasl.mechanism",               "AWS_MSK_IAM") \
    .option("kafka.sasl.jaas.config",
            "software.amazon.msk.auth.iam.IAMLoginModule required;") \
    .option("kafka.sasl.client.callback.handler.class",
            "software.amazon.msk.auth.iam.IAMClientCallbackHandler") \
    .option("subscribe",             "prod.orders") \
    .option("startingOffsets",       "latest") \
    .option("maxOffsetsPerTrigger",  50000) \
    .load()

# ── Parse JSON value column ────────────────────────────────────
orders = raw_stream \
    .select(
        col("key").cast("string").alias("msg_key"),
        from_json(col("value").cast("string"), order_schema).alias("data"),
        col("partition"),
        col("offset"),
        col("timestamp").alias("kafka_timestamp")
    ) \
    .select("data.*", "partition", "offset", "kafka_timestamp") \
    .withColumn("processed_at", current_timestamp())

# ── Write to Delta Lake (S3) ───────────────────────────────────
query = orders.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://my-lake/checkpoints/orders-msk/") \
    .option("path",              "s3://my-lake/bronze/orders/") \
    .trigger(processingTime="60 seconds") \
    .start()

query.awaitTermination()

Writing Back to MSK from Spark

Use writeStream with kafka format to publish enriched or transformed events back to an MSK topic. The DataFrame must have a value column (and optionally key and topic columns). Convert complex types to JSON string with to_json().

python — write enriched stream back to MSK topic

from pyspark.sql.functions import to_json, struct

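# Enrich the orders stream: join with a static customer dimension.
# The orders schema carries customer_id (it has no product_id);
# the dim_customer path is an example.
dim_customer = spark.read.parquet("s3://my-lake/silver/dim_customer/")
enriched = orders.join(dim_customer, "customer_id", "left")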

# Prepare Kafka output: must have 'value' column as bytes/string
kafka_output = enriched.select(
    col("order_id").cast("string").alias("key"),
    to_json(struct("*")).alias("value")
)

# Write enriched events to a new MSK topic
kafka_output.writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", MSK_BROKERS) \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.mechanism",    "AWS_MSK_IAM") \
    .option("kafka.sasl.jaas.config",
            "software.amazon.msk.auth.iam.IAMLoginModule required;") \
    .option("kafka.sasl.client.callback.handler.class",
            "software.amazon.msk.auth.iam.IAMClientCallbackHandler") \
    .option("topic",             "prod.orders-enriched") \
    .option("checkpointLocation", "s3://my-lake/checkpoints/orders-enriched/") \
    .outputMode("append") \
    .trigger(processingTime="60 seconds") \
    .start() \
    .awaitTermination()

🔌

MSK Connect — S3 Sink Connector, JDBC Source Connector MANAGED CONNECTORS ▼

What is MSK Connect

MSK Connect is a fully managed service that runs Kafka Connect connectors without you managing workers. You define a connector configuration, upload the connector plugin, and MSK Connect runs it for you — auto-scaling, monitoring, and patching included. The two most important connectors for data engineers are: S3 Sink (Kafka → S3) and JDBC Source (database → Kafka).

S3 Sink Connector — Kafka → S3 Automatically

The Confluent S3 Sink Connector reads messages from Kafka topics and writes them to S3 as Parquet, Avro, or JSON — automatically, without writing any code. Use it to archive all Kafka events to your data lake for batch processing.

json — S3 Sink Connector configuration

{
  "connector.class":                  "io.confluent.connect.s3.S3SinkConnector",
  "tasks.max":                        "6",
  "topics":                           "prod.orders,prod.clicks",
  "s3.region":                        "us-east-1",
  "s3.bucket.name":                   "my-data-lake",
  "s3.part.size":                     "67108864",
  "topics.dir":                       "bronze",
  "flush.size":                       "100000",
  "rotate.interval.ms":               "300000",
  "storage.class":                    "io.confluent.connect.s3.storage.S3Storage",
  "format.class":                     "io.confluent.connect.s3.format.parquet.ParquetFormat",
  "parquet.codec":                    "snappy",
  "partitioner.class":                "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
  "partition.duration.ms":            "3600000",
  "path.format":                      "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
  "locale":                           "en-US",
  "timezone":                         "UTC",
  "timestamp.extractor":              "RecordField",
  "timestamp.field":                  "order_date",
  "schema.compatibility":             "FULL"
}
// Writes: s3://my-data-lake/bronze/prod.orders/year=2024/month=06/day=15/hour=10/
//         prod.orders+0+0000000000.snappy.parquet   (<topic>+<partition>+<start offset>)
// TimeBasedPartitioner requires partition.duration.ms, locale, and timezone.
// A file is committed after 100k records, or once record timestamps span more than
// 5 minutes (rotate.interval.ms uses the extracted timestamp, not wall-clock time).
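To see how the time-based layout maps a record to an S3 prefix, the sketch below rebuilds the year/month/day/hour path in Python. It mirrors the path.format shown in the connector config but is not the connector's code; the function name `s3_prefix` is hypothetical and UTC is assumed as the partitioning timezone.

```python
from datetime import datetime, timezone

def s3_prefix(topics_dir: str, topic: str, record_ts: datetime) -> str:
    """Build the object prefix a time-based partitioner would use for one record."""
    ts = record_ts.astimezone(timezone.utc)   # assumption: partitions are cut in UTC
    return (f"{topics_dir}/{topic}/"
            f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/")

event_time = datetime(2024, 6, 15, 10, 42, tzinfo=timezone.utc)
print(s3_prefix("bronze", "prod.orders", event_time))
```

The `key=value` folder names are what let Glue crawlers and Athena register year, month, day, and hour as partition columns.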

JDBC Source Connector — RDS → Kafka (CDC)

The JDBC Source Connector polls a relational database table on a schedule and publishes new/changed rows to Kafka. Use it for lightweight CDC when full log-based CDC (Debezium) is overkill. Configure mode=timestamp+incrementing to capture rows updated since the last poll using an updated_at column and auto-increment primary key.

json — JDBC Source Connector configuration (RDS PostgreSQL → MSK)

{
  "connector.class":              "io.confluent.connect.jdbc.JdbcSourceConnector",
  "tasks.max":                    "3",
  "connection.url":               "jdbc:postgresql://my-rds.abc.us-east-1.rds.amazonaws.com:5432/mydb",
  "connection.user":              "kafka_reader",
  "connection.password":          "${file:/opt/kafka/secrets.properties:db.password}",
  "mode":                         "timestamp+incrementing",
  "timestamp.column.name":        "updated_at",
  "incrementing.column.name":     "order_id",
  "table.whitelist":              "public.orders,public.customers",
  "topic.prefix":                 "jdbc.rds.",
  "poll.interval.ms":             "60000",
  "batch.max.rows":               "5000",
  "numeric.mapping":              "best_fit"
}
// Publishes to topics: jdbc.rds.orders, jdbc.rds.customers
// Every 60 seconds polls for rows where updated_at > last_poll_time

📌 JDBC vs Debezium for CDC

JDBC Source Connector polls on a schedule (minute-level freshness, no deletes detected). Debezium reads the database transaction log (second-level freshness, captures deletes as tombstones). Use JDBC for simple cases; use Debezium when you need sub-minute latency or must capture deletes.
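The polling behaviour and its blind spot for deletes can be reproduced with an in-memory SQLite table. This is a simplified model of timestamp+incrementing tracking, not the connector's actual query: it remembers the last seen (updated_at, order_id) pair and asks for anything newer. The table contents and the `poll` helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2024-06-15T10:00:00"),
    (2, 20.0, "2024-06-15T10:00:00"),
    (3, 30.0, "2024-06-15T10:01:00"),
])

def poll(conn, last_ts: str, last_id: int) -> list[tuple]:
    """Return rows changed since the stored (timestamp, id) position, oldest first."""
    return conn.execute(
        """SELECT order_id, amount, updated_at FROM orders
           WHERE updated_at > ? OR (updated_at = ? AND order_id > ?)
           ORDER BY updated_at, order_id""",
        (last_ts, last_ts, last_id),
    ).fetchall()

first = poll(conn, "", 0)                        # initial poll returns every row
last_ts, last_id = first[-1][2], first[-1][0]    # remember the position

# Between polls: one row is updated, one row is deleted.
conn.execute("UPDATE orders SET amount = 25.0, updated_at = '2024-06-15T10:05:00' WHERE order_id = 2")
conn.execute("DELETE FROM orders WHERE order_id = 1")

second = poll(conn, last_ts, last_id)
print(second)   # contains the updated row only; the delete is invisible
```

The second poll picks up the update to order 2 but has no way to notice that order 1 is gone, which is the gap log-based CDC closes.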

29.12 — AMAZON DYNAMODB

Amazon DynamoDB

DynamoDB is AWS's fully managed, serverless NoSQL database built for single-digit millisecond performance at any scale. For Data Engineers it is not a primary data warehouse — it is the operational backbone of pipelines: storing job audit records, pipeline control tables, checkpoint state, metadata configs, and watermarks. It requires zero server management and scales to millions of requests per second automatically.

🗄️

Fundamentals — Partition Key, Sort Key & Capacity Modes CORE ▼

Partition Key Design

Every DynamoDB table requires a partition key (also called a hash key). DynamoDB hashes this value and uses it to decide which physical partition stores the item. A good partition key has high cardinality — many distinct values — so items spread evenly across partitions and you avoid "hot partitions" that throttle reads/writes.

🗺️ Analogy

Think of the partition key like the first letter of a surname in a filing cabinet. If you name all files "A", they all land in one overloaded drawer. A better key distributes filing evenly — like using a run_id UUID instead of a fixed job_name string.

python — creating a table with a partition key

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Partition key only — simple primary key
dynamodb.create_table(
    TableName="pipeline_audit",
    KeySchema=[
        {"AttributeName": "run_id", "KeyType": "HASH"}   # partition key
    ],
    AttributeDefinitions=[
        {"AttributeName": "run_id", "AttributeType": "S"}   # S = String
    ],
    BillingMode="PAY_PER_REQUEST"   # on-demand — no capacity planning needed
)

Sort Key Design

An optional sort key (range key) combined with the partition key forms a composite primary key. Items with the same partition key are stored together and sorted by the sort key — enabling efficient range queries like "all runs for pipeline X between date A and date B". This is the most useful design for pipeline audit tables where you want to query all runs of a given pipeline.

python — composite key: pipeline_id + run_timestamp

# Composite primary key — very useful for pipeline audit tables
dynamodb.create_table(
    TableName="pipeline_runs",
    KeySchema=[
        {"AttributeName": "pipeline_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "run_timestamp", "KeyType": "RANGE"}  # sort key
    ],
    AttributeDefinitions=[
        {"AttributeName": "pipeline_id",   "AttributeType": "S"},
        {"AttributeName": "run_timestamp", "AttributeType": "S"}
    ],
    BillingMode="PAY_PER_REQUEST"
)

# Now you can query: "all runs for sales-etl, sorted newest-first"
# → KeyConditionExpression: pipeline_id = "sales-etl"
# → ScanIndexForward = False  (descending by sort key)

📌 DE Pattern

Use pipeline_id as the partition key and run_timestamp (ISO 8601 string like 2024-01-15T08:00:00Z) as the sort key. This lets you instantly fetch the last N runs of any pipeline with a single query() call and ScanIndexForward=False.
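A minimal sketch of that "last N runs" query follows. The function takes the DynamoDB client as a parameter so the example can run without AWS access; `FakeDynamoClient` is an in-memory stand-in written for this example, not part of boto3. The `query` parameters themselves (KeyConditionExpression, ExpressionAttributeValues, ScanIndexForward, Limit) are the real low-level client arguments.

```python
def fetch_last_runs(client, pipeline_id: str, limit: int = 5) -> list[dict]:
    """Return the newest `limit` runs for one pipeline, newest first."""
    resp = client.query(
        TableName="pipeline_runs",
        KeyConditionExpression="pipeline_id = :p",
        ExpressionAttributeValues={":p": {"S": pipeline_id}},
        ScanIndexForward=False,   # descending by sort key (run_timestamp)
        Limit=limit,
    )
    return resp["Items"]

class FakeDynamoClient:
    """In-memory stand-in so the sketch runs without AWS credentials."""
    def __init__(self, items):
        self.items = items

    def query(self, **kwargs):
        pid = kwargs["ExpressionAttributeValues"][":p"]["S"]
        rows = [i for i in self.items if i["pipeline_id"]["S"] == pid]
        rows.sort(key=lambda i: i["run_timestamp"]["S"],
                  reverse=not kwargs.get("ScanIndexForward", True))
        return {"Items": rows[: kwargs.get("Limit", len(rows))]}

sample_items = [
    {"pipeline_id": {"S": "sales-etl"}, "run_timestamp": {"S": "2024-01-13T08:00:00Z"}},
    {"pipeline_id": {"S": "sales-etl"}, "run_timestamp": {"S": "2024-01-15T08:00:00Z"}},
    {"pipeline_id": {"S": "sales-etl"}, "run_timestamp": {"S": "2024-01-14T08:00:00Z"}},
    {"pipeline_id": {"S": "hr-etl"},    "run_timestamp": {"S": "2024-01-15T09:00:00Z"}},
]
runs = fetch_last_runs(FakeDynamoClient(sample_items), "sales-etl", limit=2)
print([r["run_timestamp"]["S"] for r in runs])
```

In a real job you would pass `boto3.client("dynamodb")` instead of the fake. ISO 8601 timestamps sort correctly as strings, which is why a string sort key is enough here.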

Capacity Modes — On-Demand vs Provisioned

DynamoDB offers two billing and scaling models. For most DE use cases (pipeline metadata, audit tables with bursty writes at job completion), On-Demand (PAY_PER_REQUEST) is the right choice: no capacity planning, throttling only under extreme spikes or hot keys, and cost proportional to actual usage.

Mode	How It Works	Best For	Cost Model
On-Demand `PAY_PER_REQUEST`	AWS scales capacity automatically with traffic	Bursty, unpredictable workloads such as pipeline audit writes	Pay per read/write request
Provisioned	You specify RCU and WCU; AWS holds that capacity reserved	Steady, predictable throughput at high volume	Pay per provisioned capacity unit per hour

python — RCU / WCU explained

# RCU = Read Capacity Unit  → 1 strongly consistent read per second for an item up to 4 KB
#                             (or 2 eventually consistent reads per second)
# WCU = Write Capacity Unit → 1 write per second for an item up to 1 KB
# On-Demand skips all this: AWS handles it automatically

# Provisioned example — only if you have steady predictable load
dynamodb.create_table(
    TableName="high_volume_events",
    KeySchema=[{"AttributeName": "event_id", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "event_id", "AttributeType": "S"}],
    BillingMode="PROVISIONED",
    ProvisionedThroughput={
        "ReadCapacityUnits":  100,
        "WriteCapacityUnits": 50
    }
)
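Choosing values for ReadCapacityUnits and WriteCapacityUnits is simple arithmetic once item size and request rate are known. The helper names below are hypothetical, and the sketch covers standard reads and writes only (no transactions).

```python
import math

def required_rcu(reads_per_sec: int, item_kb: float, strongly_consistent: bool = True) -> int:
    """RCUs needed: reads are metered in 4 KB steps per item."""
    units_per_read = math.ceil(item_kb / 4)
    total = reads_per_sec * units_per_read
    # Eventually consistent reads cost half as much.
    return total if strongly_consistent else math.ceil(total / 2)

def required_wcu(writes_per_sec: int, item_kb: float) -> int:
    """WCUs needed: writes are metered in 1 KB steps per item."""
    return writes_per_sec * math.ceil(item_kb)

# 100 reads/s of 6 KB items, 50 writes/s of 2.5 KB items
print(required_rcu(100, 6))                             # strongly consistent
print(required_rcu(100, 6, strongly_consistent=False))  # eventually consistent
print(required_wcu(50, 2.5))
```

Item size is rounded up per request, so a 6 KB item costs 2 RCUs per strongly consistent read and a 2.5 KB item costs 3 WCUs per write. Keeping audit items small pays off directly.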

🏗️

Data Engineering Use Cases PATTERNS ▼

Metadata Tables for Pipeline State Tracking

DynamoDB is the ideal store for pipeline config metadata — a table that describes every active pipeline (its source, target, load type, schedule, watermark column). The ETL orchestrator reads this table at startup and drives execution dynamically. This pattern is called metadata-driven ETL.

python — reading pipeline config from DynamoDB at runtime

import boto3
from boto3.dynamodb.conditions import Key

# Use the resource (higher-level) API for cleaner item access
ddb = boto3.resource("dynamodb", region_name="us-east-1")
table = ddb.Table("pipeline_config")

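# Fetch config for a specific pipeline by its primary key
response = table.get_item(Key={"pipeline_id": "sales-orders-etl"})
config = response.get("Item")   # "Item" is absent when the key does not exist
if config is None:
    raise KeyError("No pipeline_config entry for sales-orders-etl")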

print(config["source_table"])     # → "raw.orders"
print(config["target_table"])     # → "silver.orders_cleaned"
print(config["load_type"])        # → "incremental"
print(config["watermark_column"]) # → "updated_at"
print(config["is_active"])        # → True

📋 Typical pipeline_config Item

A config item might contain: pipeline_id, source_system (e.g. "salesforce"), source_table, target_table, load_type (full/incremental), watermark_column, schedule, is_active (bool), owner_team. Any field can be updated in DynamoDB without changing pipeline code.

Job Audit Tables — run_id, status, start_time, end_time, row_count

Every pipeline run should write an audit record to DynamoDB — both at the start (status: RUNNING) and end (status: SUCCEEDED or FAILED). This gives you a complete operational log of every run, queryable without spinning up a database server. DynamoDB's low-latency writes make this a negligible overhead even in tight pipelines.

python — writing audit records at pipeline start and end

import boto3, uuid
from datetime import datetime, timezone

ddb   = boto3.resource("dynamodb")
table = ddb.Table("pipeline_audit")

run_id     = str(uuid.uuid4())
pipeline   = "sales-orders-etl"
start_time = datetime.now(timezone.utc).isoformat()

# ① Write RUNNING record at pipeline start
table.put_item(Item={
    "run_id":      run_id,
    "pipeline_id": pipeline,
    "status":      "RUNNING",
    "start_time":  start_time,
    "end_time":    None,
    "rows_read":   0,
    "rows_written":0,
    "error_msg":   None
})

# … run your ETL logic here …
rows_written = 142_830

# ② Update to SUCCEEDED at pipeline end
table.update_item(
    Key={"run_id": run_id},
    UpdateExpression="SET #s = :s, end_time = :et, rows_written = :rw",
    ExpressionAttributeNames={"#s": "status"},   # "status" is a reserved word
    ExpressionAttributeValues={
        ":s":  "SUCCEEDED",
        ":et": datetime.now(timezone.utc).isoformat(),
        ":rw": rows_written
    }
)

Lightweight Config Storage

For pipeline-level configuration that changes between environments (dev/staging/prod) or between runs — like S3 bucket names, JDBC URLs, batch sizes, or feature flags — DynamoDB is a fast, cheap, serverless alternative to hardcoding values or reading from S3. A single get_item() takes under 5ms and costs a fraction of a cent.

python — config lookup pattern

import boto3

def get_config(env: str, key: str) -> str:
    """Fetch a single config value for a given environment."""
    ddb   = boto3.resource("dynamodb")
    table = ddb.Table("etl_config")
    # etl_config uses a composite key: env (partition key) + config_key (sort key)
    resp  = table.get_item(Key={"env": env, "config_key": key})
    return resp["Item"]["config_value"]

# Usage in your Glue / EMR job
s3_output = get_config("prod", "silver_bucket")     # → "s3://my-co-silver"
batch_size = get_config("prod", "jdbc_batch_size")   # → "50000"

Pipeline Control Table Pattern

A control table stores the current watermark (last successfully processed timestamp or offset) for each pipeline. Before each run, the pipeline reads the watermark to know where to start; after a successful run, it updates the watermark. This enables safe, idempotent incremental processing — if a run fails, the watermark is not updated so the next run re-processes from the last safe point.

python — watermark read → process → update pattern

import boto3
from datetime import datetime, timezone

ddb       = boto3.resource("dynamodb")
wm_table  = ddb.Table("pipeline_watermarks")
pipeline  = "sales-orders-etl"

# ① Read last watermark before ETL run
resp = wm_table.get_item(Key={"pipeline_id": pipeline})
last_wm = resp["Item"]["last_watermark"]   # e.g. "2024-01-14T23:59:59Z"
print(f"Reading records updated after {last_wm}")

# ② Run incremental ETL: read only new rows from source
# df = spark.read.jdbc(...).filter(f"updated_at > '{last_wm}'")

# ③ After successful ETL, advance the watermark to the newest updated_at
#    value actually processed (hard-coded here for illustration)
new_wm = "2024-01-15T23:59:59Z"
wm_table.update_item(
    Key={"pipeline_id": pipeline},
    UpdateExpression="SET last_watermark = :wm, updated_at = :ts",
    ExpressionAttributeValues={
        ":wm": new_wm,
        ":ts": datetime.now(timezone.utc).isoformat()
    }
)
print(f"Watermark updated to {new_wm}")

⚠️ Critical — Update Watermark Only on Success

Never update the watermark inside a try block that might partially succeed. Only call update_item after all downstream writes (S3, Delta, Redshift) have been confirmed. If the pipeline fails mid-run, the unchanged watermark ensures the next run starts from the last safe point.
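The control flow behind this rule can be sketched without any AWS calls. In the example below a plain dict stands in for the pipeline_watermarks table, and `run_incremental`, `failing_etl`, and `working_etl` are hypothetical names; the point is only where the watermark write sits relative to the ETL step.

```python
def run_incremental(store: dict, pipeline: str, process, new_watermark: str) -> bool:
    """Advance the watermark only after `process` returns without raising."""
    last_wm = store[pipeline]
    try:
        process(last_wm)   # read, transform, and confirm every downstream write
    except Exception as exc:
        print(f"{pipeline} failed, watermark stays at {last_wm}: {exc}")
        return False
    store[pipeline] = new_watermark   # reached only on success
    return True

watermarks = {"sales-orders-etl": "2024-01-14T23:59:59Z"}   # stand-in for the DynamoDB table

def failing_etl(since: str) -> None:
    raise RuntimeError("Redshift COPY failed")

def working_etl(since: str) -> None:
    print(f"processing rows updated after {since}")

run_incremental(watermarks, "sales-orders-etl", failing_etl, "2024-01-15T23:59:59Z")
print(watermarks)   # unchanged after the failed run
run_incremental(watermarks, "sales-orders-etl", working_etl, "2024-01-15T23:59:59Z")
print(watermarks)   # advanced after the successful run
```

The watermark update sits outside the try block, after the ETL call, so a partial failure can never move it forward.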

⚡

Spark Integration — Reading & Writing DynamoDB SPARK ▼

Reading DynamoDB from Spark (EMR Connector)

On EMR, AWS ships the EMR DynamoDB Connector — a Hadoop InputFormat that lets Spark read a DynamoDB table as an RDD/DataFrame. It handles parallel scanning across multiple Spark tasks, segment-level reads, and throughput throttle management. This is useful when you need to join pipeline metadata or config stored in DynamoDB with large datasets in S3.

python — reading DynamoDB table from EMR Spark job

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DynamoDB-Read").getOrCreate()
sc    = spark.sparkContext

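# EMR DynamoDB connector: a Hadoop InputFormat, read through sc.hadoopRDD.
# Class and property names follow the emr-dynamodb-connector; verify them
# against the connector version installed on your cluster.
ddb_conf = {
    "dynamodb.input.tableName": "pipeline_config",
    "dynamodb.endpoint":        "dynamodb.us-east-1.amazonaws.com",
    "dynamodb.regionid":        "us-east-1",
    "dynamodb.servicename":     "dynamodb",
    "dynamodb.splits":          "4",   # number of parallel segments
}
rdd = sc.hadoopRDD(
    inputFormatClass="org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.dynamodb.DynamoDBItemWritable",
    conf=ddb_conf,
)

# Alternative for smaller tables: scan with boto3 inside a single Spark task
def scan_dynamo_partition(_):
    import boto3
    table  = boto3.resource("dynamodb").Table("pipeline_config")
    kwargs = {}
    while True:                          # scan() returns at most 1 MB per call
        page = table.scan(**kwargs)
        yield from page["Items"]
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

# For small config/metadata tables this is the simplest approach
config_items = sc.parallelize([1]).flatMap(scan_dynamo_partition).collect()
config_df    = spark.createDataFrame(config_items)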

📌 When to Read DynamoDB from Spark

Most of the time you don't need to read DynamoDB from Spark — the pipeline driver (Python script / Glue job) reads config via boto3 before launching Spark. Only read DynamoDB from within a Spark job if you need to broadcast a large config table as a lookup during joins.

Writing Pipeline Results to DynamoDB

The recommended pattern for writing Spark results to DynamoDB is not a direct connector — it's using foreachPartition to batch-write items from each Spark partition using boto3 inside the executor. This is far more efficient than writing one item per row, since DynamoDB's batch_write_item supports up to 25 items per call.

python — writing Spark DataFrame rows to DynamoDB via foreachPartition

import boto3

def write_partition_to_dynamo(rows):
    """Called once per Spark partition; batch-writes its rows to DynamoDB."""
    # Build the resource inside the function so it is created on the executor;
    # boto3 resources cannot be pickled and shipped from the driver.
    ddb   = boto3.resource("dynamodb", region_name="us-east-1")
    table = ddb.Table("pipeline_audit")

    # batch_writer() buffers puts, sends them in batches of up to 25 items,
    # resends unprocessed items, and flushes the remainder on exit.
    with table.batch_writer() as bw:
        for row in rows:
            # Float columns must be converted to Decimal before this call
            bw.put_item(Item=row.asDict())

# Write a small summary DataFrame (e.g. per-pipeline row counts) to DynamoDB
summary_df.foreachPartition(write_partition_to_dynamo)

💡 Why batch_writer()?

boto3's batch_writer() context manager automatically buffers puts into batches of up to 25 items and resends any unprocessed items, which is far safer than manually calling batch_write_item() and handling UnprocessedItems yourself. Wrapping it in your own 25-item batching loop adds nothing.

🔑

Key DynamoDB Concepts Every DE Must Know

Item Structure — Schemaless but Typed

DynamoDB is schemaless: you only define the primary key attributes at table creation. Every other attribute is defined per item and can vary between items. But each attribute value is typed: S (String), N (Number), B (Binary), BOOL (Boolean), NULL, L (List), M (Map), SS (String Set), NS (Number Set), BS (Binary Set). The resource API handles Python → DynamoDB type conversion automatically.

python — putting a rich audit item with mixed types

from decimal import Decimal  # DynamoDB resource uses Decimal for numbers

table.put_item(Item={
    "run_id":       "run-abc-123",          # S (partition key)
    "pipeline_id":  "sales-orders-etl",    # S
    "status":       "SUCCEEDED",           # S
    "rows_written": 142830,                # N (int is accepted; float is not)
    "dq_score":     Decimal("98.5"),       # N
    "is_backfill":  False,                 # BOOL
    "tags": ["sales", "incremental"],     # L — List
    "meta": {                              # M — Map (nested dict)
        "source_table": "raw.orders",
        "target_table": "silver.orders"
    }
})

⚠️ Use Decimal — Not float

The boto3 DynamoDB resource API rejects Python float with a TypeError, so pass non-integer numbers as Decimal("98.5"); Python int is accepted as-is. Numbers read back from a table always come back as Decimal. The client API uses DynamoDB's low-level type notation ({"N": "98.5"}), which avoids this issue.
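
When items come from Spark rows or parsed JSON, floats tend to be buried in nested structures. The helper below is a minimal sketch of a recursive conversion; to_dynamo is a hypothetical name, not part of boto3, and it assumes the values are plain Python scalars, dicts, and lists.

```python
from decimal import Decimal
from typing import Any


def to_dynamo(value: Any) -> Any:
    """Recursively replace floats with Decimal so the resource API accepts them."""
    if isinstance(value, float):
        # Go through str(): Decimal(0.1) would carry binary floating-point noise
        return Decimal(str(value))
    if isinstance(value, dict):
        return {key: to_dynamo(val) for key, val in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_dynamo(val) for val in value]
    return value   # str, int, bool, None, and Decimal pass through unchanged


item = to_dynamo({
    "run_id":       "run-abc-123",
    "rows_written": 142830,               # int stays int
    "dq_score":     98.5,                 # float becomes Decimal("98.5")
    "stats":        {"null_rate": 0.1},   # nested float is converted too
    "tags":         ["sales", "incremental"],
})
```

The converted dict can be passed directly as Item= to put_item or to a batch_writer.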

Conditional Writes — Optimistic Locking

DynamoDB supports conditional expressions that make a write succeed only if a condition on the current item is true. For pipeline state machines (e.g., only update status to FAILED if current status is RUNNING), this prevents race conditions when multiple processes might update the same record.

python — conditional update: only mark FAILED if currently RUNNING

from botocore.exceptions import ClientError

try:
    table.update_item(
        Key={"run_id": run_id},
        UpdateExpression="SET #s = :failed, error_msg = :err",
        # Condition: only succeed if the item currently has status = RUNNING
        ConditionExpression="#s = :running",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={
            ":failed":  "FAILED",
            ":running": "RUNNING",
            ":err":     "Out of memory on executor 3"
        }
    )
except ClientError as e:
    if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
        print("Run was already marked SUCCEEDED or FAILED — skipping")
    else:
        raise
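
The same guard applies to every state transition, not only RUNNING → FAILED. As a sketch, the hypothetical helper below builds the update_item arguments for "move status from X to Y only if it is currently X". It assumes the same run_id key as the example above and touches no AWS API itself.

```python
from typing import Any, Dict


def transition_kwargs(run_id: str, expected: str, new: str) -> Dict[str, Any]:
    """Build update_item arguments that move status from `expected` to `new`."""
    return {
        "Key": {"run_id": run_id},
        "UpdateExpression": "SET #s = :new",
        # The write is rejected unless the stored status still equals `expected`
        "ConditionExpression": "#s = :expected",
        # "status" is a DynamoDB reserved word, hence the #s placeholder
        "ExpressionAttributeNames": {"#s": "status"},
        "ExpressionAttributeValues": {":new": new, ":expected": expected},
    }


kwargs = transition_kwargs("run-abc-123", expected="RUNNING", new="SUCCEEDED")
```

Usage would be table.update_item(**kwargs), wrapped in the same ConditionalCheckFailedException handling as above.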

Query vs Scan — Always Prefer Query

A query() uses the partition key to read only items in one partition — fast, cheap, O(result size). A scan() reads every item in the table — slow, expensive, O(table size), and consumes capacity proportional to the full table. Always design your table's primary key so that your most common access pattern maps to a query() not a scan().

python — query vs scan for pipeline runs

from boto3.dynamodb.conditions import Key, Attr

# ✅ QUERY — fast, targeted: all runs for a specific pipeline, newest first
resp = table.query(
    KeyConditionExpression=Key("pipeline_id").eq("sales-orders-etl"),
    ScanIndexForward=False,    # descending by sort key (run_timestamp)
    Limit=10                    # last 10 runs only
)
runs = resp["Items"]

# ❌ SCAN — reads entire table, never use in production for lookups
resp = table.scan(
    FilterExpression=Attr("status").eq("FAILED")  # filter happens AFTER reading all items!
)
failed_runs = resp["Items"]   # costs as if you read everything

📌 When scan() Is Acceptable

A scan() is fine for small tables (under ~1,000 items) like a config table or a pipeline registry with a few dozen rows. For audit tables with millions of rows per month — always design for query().
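
Even on a small table, a single scan() or query() call returns at most 1 MB of data; further pages are signalled by LastEvaluatedKey in the response. The sketch below shows the pagination loop. scan_all is a hypothetical helper, and fake_scan is a stand-in for table.scan so the example runs without AWS access.

```python
from typing import Any, Callable, Dict, Iterator


def scan_all(scan: Callable[..., Dict[str, Any]], **kwargs: Any) -> Iterator[Dict[str, Any]]:
    """Yield every item from a paginated scan (or query) callable."""
    while True:
        page = scan(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return                                # no more pages
        kwargs["ExclusiveStartKey"] = last_key    # resume where the last page ended


def fake_scan(**kwargs: Any) -> Dict[str, Any]:
    """Two-page stand-in for table.scan, for illustration only."""
    if "ExclusiveStartKey" not in kwargs:
        return {"Items": [{"k": "a"}, {"k": "b"}], "LastEvaluatedKey": {"k": "b"}}
    return {"Items": [{"k": "c"}]}


items = list(scan_all(fake_scan))
```

With a real table the call would be scan_all(table.scan) or scan_all(table.query, KeyConditionExpression=...).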

Global Secondary Indexes (GSI) for Alternate Access Patterns

If you need to query your audit table by a different key (for example, your primary key is pipeline_id (partition) + run_timestamp (sort), but you also want to query all FAILED runs regardless of pipeline), add a Global Secondary Index (GSI) with status as the partition key. A GSI is a separate, automatically maintained projection of your table with a different key structure.

python — adding a GSI at table creation for status-based queries

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")

table = dynamodb.create_table(
    TableName="pipeline_audit",
    KeySchema=[
        {"AttributeName": "pipeline_id",   "KeyType": "HASH"},
        {"AttributeName": "run_timestamp", "KeyType": "RANGE"}
    ],
    AttributeDefinitions=[
        {"AttributeName": "pipeline_id",   "AttributeType": "S"},
        {"AttributeName": "run_timestamp", "AttributeType": "S"},
        {"AttributeName": "status",        "AttributeType": "S"}
    ],
    GlobalSecondaryIndexes=[{
        "IndexName": "status-index",
        "KeySchema": [
            {"AttributeName": "status",        "KeyType": "HASH"},
            {"AttributeName": "run_timestamp", "KeyType": "RANGE"}
        ],
        "Projection": {"ProjectionType": "ALL"}
        # No BillingMode here: it is a table-level setting and is not a valid key inside a GSI
    }],
    BillingMode="PAY_PER_REQUEST"
)
table.wait_until_exists()   # create_table is asynchronous

# Now query the GSI for all FAILED runs today
resp = table.query(
    IndexName="status-index",
    KeyConditionExpression=(
        Key("status").eq("FAILED") &
        Key("run_timestamp").begins_with("2024-01-15")
    )
)

📋

Complete Pipeline Audit Pattern — Production Design

Recommended Audit Table Schema

Every production DE team uses some form of pipeline audit table. Here is a recommended schema that covers the common operational needs: queryable by pipeline and time range through the primary key, and by status through a GSI.

pipeline_audit TABLE DESIGN
══════════════════════════════════════════════
PRIMARY KEY
  partition key : pipeline_id   (S) → groups all runs of a pipeline
  sort key      : run_timestamp (S) → ISO 8601, enables range queries

ATTRIBUTES (schemaless, defined per item)
  run_id          S → UUID, globally unique run identifier
  status          S → RUNNING | SUCCEEDED | FAILED | PARTIAL
  trigger_type    S → SCHEDULED | MANUAL | RETRY | BACKFILL
  start_time      S → ISO 8601 UTC
  end_time        S → ISO 8601 UTC (null if still running)
  duration_secs   N → wall clock duration
  rows_read       N → records read from source
  rows_written    N → records written to target
  rows_rejected   N → records sent to DLQ/quarantine
  dq_score        N → data quality score 0-100
  error_msg       S → last error message (null if success)
  error_type      S → OOM | TIMEOUT | DQ_FAILURE | UPSTREAM
  s3_output_path  S → s3://... path of output files
  batch_id        S → idempotency key for exactly-once

GLOBAL SECONDARY INDEXES
  status-index → status (HASH) + run_timestamp (RANGE)
               → query all FAILED runs across all pipelines

COMMON ACCESS PATTERNS
  ① Last 10 runs for a pipeline → query(pipeline_id=X, limit=10, desc)
  ② All runs today              → query(pipeline_id=X, begins_with date)
  ③ All FAILED runs this hour   → query GSI status-index(FAILED, begins_with)
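
Access patterns ② and ③ work because ISO 8601 strings in a single timezone sort chronologically, so a date or hour is simply a string prefix of the sort key. A minimal sketch of building those prefixes follows; day_prefix and hour_prefix are hypothetical helper names, and it assumes run_timestamp is written with datetime.isoformat() in UTC, as the audit class in this section does.

```python
from datetime import datetime, timezone


def day_prefix(ts: datetime) -> str:
    """Prefix matching every run_timestamp on the same UTC day."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


def hour_prefix(ts: datetime) -> str:
    """Prefix matching every run_timestamp in the same UTC hour."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")


run_ts   = datetime(2024, 1, 15, 8, 30, tzinfo=timezone.utc)
sort_key = run_ts.isoformat()          # "2024-01-15T08:30:00+00:00"

today     = day_prefix(run_ts)         # "2024-01-15"
this_hour = hour_prefix(run_ts)        # "2024-01-15T08"
```

Either prefix can be passed to Key("run_timestamp").begins_with(...) in a query against the table or the status-index GSI.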

Full Audit Writer Helper — Production Code

Encapsulate the audit table logic into a reusable class that your Glue jobs, Lambda functions, and EMR scripts can all import from a shared library. This ensures consistent audit records across every pipeline in your platform.

python — reusable PipelineAudit class

import boto3, uuid
from decimal import Decimal
from datetime import datetime, timezone

class PipelineAudit:
    """Reusable audit writer for all DE pipelines."""

    def __init__(self, pipeline_id: str, table_name: str = "pipeline_audit"):
        ddb = boto3.resource("dynamodb")
        self.table       = ddb.Table(table_name)
        self.pipeline_id = pipeline_id
        self.run_id      = str(uuid.uuid4())
        self.start_ts    = datetime.now(timezone.utc)

    def start(self, trigger_type: str = "SCHEDULED"):
        """Write RUNNING record at pipeline start."""
        ts = self.start_ts.isoformat()
        self.table.put_item(Item={
            "pipeline_id":   self.pipeline_id,
            "run_timestamp": ts,
            "run_id":        self.run_id,
            "status":        "RUNNING",
            "trigger_type":  trigger_type,
            "start_time":    ts,
        })
        return self

    def succeed(self, rows_read=0, rows_written=0, rows_rejected=0, dq_score=100):
        """Update to SUCCEEDED with metrics."""
        end_ts  = datetime.now(timezone.utc)
        # total_seconds() covers runs longer than a day; .seconds would wrap at 24 h
        duration = int((end_ts - self.start_ts).total_seconds())
        self.table.update_item(
            Key={"pipeline_id": self.pipeline_id, "run_timestamp": self.start_ts.isoformat()},
            UpdateExpression=("SET #s=:s, end_time=:et, duration_secs=:d,"
                              "rows_read=:rr, rows_written=:rw, rows_rejected=:rej, dq_score=:dq"),
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={
                ":s":   "SUCCEEDED",
                ":et":  end_ts.isoformat(),
                ":d":   Decimal(str(duration)),
                ":rr":  Decimal(str(rows_read)),
                ":rw":  Decimal(str(rows_written)),
                ":rej": Decimal(str(rows_rejected)),
                ":dq":  Decimal(str(dq_score))
            }
        )

    def fail(self, error_msg: str, error_type: str = "UNKNOWN"):
        """Update to FAILED with error details."""
        self.table.update_item(
            Key={"pipeline_id": self.pipeline_id, "run_timestamp": self.start_ts.isoformat()},
            UpdateExpression="SET #s=:s, end_time=:et, error_msg=:em, error_type=:et2",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={
                ":s":   "FAILED",
                ":et":  datetime.now(timezone.utc).isoformat(),
                ":em":  error_msg[:1000],   # truncate long stack traces
                ":et2": error_type
            }
        )


# ─── Usage in any pipeline ───────────────────────────────────────────
audit = PipelineAudit("sales-orders-etl").start("SCHEDULED")
try:
    # … your ETL code …
    rows = 142_830
    audit.succeed(rows_read=rows, rows_written=rows, dq_score=99)
except Exception as e:
    audit.fail(error_msg=str(e), error_type="RUNTIME")
    raise

29.13 — AMAZON RDS

Amazon RDS

Amazon RDS (Relational Database Service) is AWS's managed relational database — you pick the engine, AWS handles patching, backups, failover, and scaling. For Data Engineers, RDS is almost always a source system, not a destination. Your job is to efficiently extract data from RDS into your data lake (S3 / Delta / Iceberg) using Spark JDBC reads, Glue crawlers, or CDC tools — while being careful not to hammer the production database with full table scans.

🛢️

Engines — PostgreSQL & MySQL

PostgreSQL (Most Common in DE)

PostgreSQL is the default choice for most modern data-engineering-adjacent applications. It supports JSONB columns, arrays, window functions, and logical replication (which Debezium and DMS use for CDC). If your source system runs Postgres, you have the richest set of extraction options available.

🔌

JDBC Driver

org.postgresql.Driver
JAR: postgresql-42.x.x.jar
URL: jdbc:postgresql://host:5432/dbname

📡

CDC Support

Logical replication slots — enables Debezium and AWS DMS to stream every INSERT/UPDATE/DELETE as an event.

🔄

Incremental Read

Use updated_at or id column as watermark for timestamp-based or offset-based incrementals.

MySQL

MySQL is widely used in OLTP applications (especially older stacks and e-commerce). Its binary log (binlog) is the CDC source — DMS and Debezium both read it. Spark JDBC reads work identically to Postgres, just with a different driver and URL format.

text — JDBC URL formats

# PostgreSQL
jdbc:postgresql://mydb.abc123.us-east-1.rds.amazonaws.com:5432/sales_db

# MySQL
jdbc:mysql://mydb.abc123.us-east-1.rds.amazonaws.com:3306/sales_db

# Aurora PostgreSQL (same driver as PostgreSQL)
jdbc:postgresql://mycluster.cluster-abc123.us-east-1.rds.amazonaws.com:5432/sales_db

⚡

Spark JDBC Reads — The Core DE Skill

Basic JDBC Read — Single Partition (Never Use on Large Tables)

The simplest JDBC read uses a single connection: Spark sends one query and receives all results through one task. This is fine for small lookup tables (under ~1 million rows), but on large tables it becomes a bottleneck and can cause executor out-of-memory errors, because all data flows through one partition.

python — basic single-partition JDBC read

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("RDS-Extract") \
    .config("spark.jars", "/opt/jars/postgresql-42.6.0.jar") \
    .getOrCreate()

# ⚠️ Single partition — only for small tables
df = spark.read \
    .format("jdbc") \
    .option("url",      "jdbc:postgresql://mydb.abc123.us-east-1.rds.amazonaws.com:5432/sales_db") \
    .option("dbtable",  "public.orders") \
    .option("user",     "etl_reader") \
    .option("password", "secret") \
    .load()

df.printSchema()
df.show(5)

⚠️ Never Hardcode Credentials

Always fetch the password from AWS Secrets Manager at runtime — never put it in code or config files. See the Secrets Manager section for the standard pattern.

Partitioned JDBC Read — partitionColumn, lowerBound, upperBound, numPartitions

This is the most important JDBC pattern for large tables. Spark splits the read into numPartitions parallel queries, each fetching a range of the partitionColumn. The queries run concurrently, one per Spark task, as far as the cluster has free cores, which dramatically reduces extraction time. Each task opens its own JDBC connection, so up to numPartitions connections hit RDS at the same time.

🏭 Analogy

Instead of one person reading a 10,000-page book from cover to cover, you assign 20 people each to read 500 pages simultaneously. Spark does the same: it splits the table's ID range into 20 equal slices and reads each slice in parallel.

python — partitioned JDBC read for large tables

import boto3, json

# ① Fetch credentials from Secrets Manager (always)
sm     = boto3.client("secretsmanager")
secret = json.loads(sm.get_secret_value(SecretId="prod/rds/sales-db")["SecretString"])

jdbc_url = f"jdbc:postgresql://{secret['host']}:{secret['port']}/{secret['dbname']}"

# ② Find the actual min/max of the partition column first
bounds_df = spark.read \
    .format("jdbc") \
    .option("url",     jdbc_url) \
    .option("dbtable", "(SELECT MIN(order_id) AS mn, MAX(order_id) AS mx FROM public.orders) t") \
    .option("user",    secret["username"]) \
    .option("password",secret["password"]) \
    .load()

row = bounds_df.first()          # one round-trip; calling collect() twice runs the query twice
mn, mx = row["mn"], row["mx"]    # e.g. 1 and 10_000_000

# ③ Partitioned read: Spark issues numPartitions parallel JDBC queries.
# Parentheses allow inline comments, which backslash continuations do not.
df = (
    spark.read
    .format("jdbc")
    .option("url",             jdbc_url)
    .option("dbtable",         "public.orders")
    .option("user",            secret["username"])
    .option("password",        secret["password"])
    .option("partitionColumn", "order_id")   # numeric, date, or timestamp column
    .option("lowerBound",      str(mn))      # min value of column
    .option("upperBound",      str(mx))      # max value of column
    .option("numPartitions",   "20")         # 20 parallel queries → up to 20 RDS connections
    .option("fetchsize",       "10000")      # rows fetched per JDBC round-trip
    .load()
)

print(f"Partitions: {df.rdd.getNumPartitions()}")  # → 20
print(f"Row count:  {df.count()}")

📌 How Spark Splits the Range

With lowerBound=1, upperBound=10_000_000, numPartitions=20, Spark generates 20 WHERE clauses along these lines (exact boundaries can shift slightly between Spark versions):
Partition 0: WHERE order_id < 500_001 OR order_id IS NULL
Partition 1: WHERE order_id >= 500_001 AND order_id < 1_000_001
…
Partition 19: WHERE order_id >= 9_500_001 (catches everything above upperBound too)

⚠️ lowerBound and upperBound are only used to calculate split ranges — they do NOT filter data. Rows outside these bounds are still read (in the first/last partition).
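
The split arithmetic is easy to reproduce by hand. The function below is a simplified model, not Spark's actual code: it uses the plain integer-stride formula (upper // n − lower // n), whereas newer Spark releases adjust the boundaries slightly. jdbc_partition_predicates is a hypothetical name introduced for illustration.

```python
from typing import List


def jdbc_partition_predicates(column: str, lower: int, upper: int,
                              num_partitions: int) -> List[str]:
    """Approximate the WHERE clauses of a partitioned JDBC read."""
    if num_partitions < 2:
        raise ValueError("a partitioned read needs at least 2 partitions")
    stride = upper // num_partitions - lower // num_partitions
    predicates = []
    boundary = lower
    for i in range(num_partitions):
        lo = f"{column} >= {boundary}" if i > 0 else None
        boundary += stride
        hi = f"{column} < {boundary}" if i < num_partitions - 1 else None
        if lo is None:
            # First partition is open below and also picks up NULL keys
            predicates.append(f"{hi} OR {column} IS NULL")
        elif hi is None:
            # Last partition is open above: rows beyond upperBound land here
            predicates.append(lo)
        else:
            predicates.append(f"{lo} AND {hi}")
    return predicates


preds = jdbc_partition_predicates("order_id", 1, 10_000_000, 20)
first, second, last = preds[0], preds[1], preds[-1]
```

Because the first and last predicates are open-ended, bounds that are too narrow do not lose rows; they only make the outer partitions larger than the rest.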

Partitioned Read with Custom WHERE — Incremental Extraction

For incremental loads, you don't want the full table — just rows updated since the last run. Combine a custom SQL query as the dbtable (wrapped in a subquery alias) with the partitioned read options. This pushes the WHERE updated_at > last_watermark filter down to RDS, so Spark only fetches new rows.

python — incremental partitioned JDBC read with watermark

import boto3, json
from boto3.dynamodb.conditions import Key

# ① Read last watermark from DynamoDB control table
ddb      = boto3.resource("dynamodb")
wm_table = ddb.Table("pipeline_watermarks")
last_wm  = wm_table.get_item(Key={"pipeline_id": "orders-incremental"})["Item"]["last_watermark"]
# last_wm = "2024-01-14 23:59:59"

# ② Build a subquery — only new/updated rows
query = f"""(
    SELECT order_id, customer_id, amount, status, updated_at
    FROM public.orders
    WHERE updated_at > '{last_wm}'
) incremental_orders"""

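# ③ Bounds from the filtered rows. Table-wide bounds (1 to 99999999) would
#    put most new rows, which cluster at high IDs, into one or two partitions.
def jdbc_reader(dbtable):
    return (
        spark.read.format("jdbc")
        .option("url",      jdbc_url)
        .option("dbtable",  dbtable)
        .option("user",     secret["username"])
        .option("password", secret["password"])
    )

bounds = jdbc_reader(
    f"(SELECT MIN(order_id) AS mn, MAX(order_id) AS mx FROM {query}) b"
).load().first()

# ④ Partitioned incremental read (skipped when there is nothing new)
if bounds["mn"] is None:
    print("No new/updated rows")
else:
    df = (
        jdbc_reader(query)
        .option("partitionColumn", "order_id")
        .option("lowerBound",      str(bounds["mn"]))
        .option("upperBound",      str(bounds["mx"]))
        .option("numPartitions",   "10")
        .load()
    )
    print(f"New/updated rows: {df.count()}")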
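
Because the watermark is interpolated into SQL text, it is worth validating it before building the subquery. The sketch below assumes the watermark is stored as "YYYY-MM-DD HH:MM:SS" (the format shown in the comment above) and that the table name is a trusted constant; incremental_subquery is a hypothetical helper name.

```python
from datetime import datetime

WATERMARK_FORMAT = "%Y-%m-%d %H:%M:%S"   # assumed storage format of last_watermark


def incremental_subquery(table: str, watermark: str, alias: str = "incr") -> str:
    """Return a parenthesised subquery usable as the JDBC `dbtable` option."""
    # strptime raises ValueError for anything that is not a plain timestamp,
    # so a corrupted control-table value cannot inject SQL below.
    parsed = datetime.strptime(watermark, WATERMARK_FORMAT)
    safe_wm = parsed.strftime(WATERMARK_FORMAT)
    return f"(SELECT * FROM {table} WHERE updated_at > '{safe_wm}') {alias}"


subquery = incremental_subquery("public.orders", "2024-01-14 23:59:59")

try:
    incremental_subquery("public.orders", "2024-01-14 23:59:59'; DROP TABLE orders; --")
    rejected = False
except ValueError:
    rejected = True   # malformed watermark is refused before any SQL is built
```

The returned string drops straight into .option("dbtable", subquery).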

Predicate Pushdown

Predicate pushdown means Spark sends filter conditions down to the database so RDS executes them — returning only matching rows instead of the full table. For JDBC sources, Spark pushes simple filters (=, >, <, IN, IS NULL) automatically. You can verify pushdown is happening by checking the Spark UI's SQL tab — look for PushedFilters in the scan node.

python — verifying predicate pushdown

# Spark automatically pushes this filter to RDS — RDS runs:
# SELECT * FROM orders WHERE status = 'COMPLETED'
df_filtered = df.filter("status = 'COMPLETED'")

# Verify pushdown in the physical plan
df_filtered.explain("extended")
# Look for: PushedFilters: [IsNotNull(status), EqualTo(status,COMPLETED)]

# For filters Spark cannot push down, put them in the dbtable subquery instead:
# .option("dbtable", "(SELECT * FROM orders WHERE status='COMPLETED') t")

fetchsize — Tuning JDBC Round-Trips

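The fetchsize option controls how many rows Spark retrieves from RDS in each network round-trip. The default depends on the JDBC driver: some drivers fetch very few rows at a time (Oracle's defaults to 10), while the PostgreSQL driver by default buffers the entire result set in memory, a common cause of executor OOM on large partitions. Setting fetchsize to 10,000–50,000 cuts the number of round-trips by orders of magnitude, bounds the memory each task needs, and can shorten extraction time considerably on large tables.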

python — recommended JDBC performance options

df = (
    spark.read
    .format("jdbc")
    .option("url",             jdbc_url)
    .option("dbtable",         "public.orders")
    .option("user",            secret["username"])
    .option("password",        secret["password"])
    .option("partitionColumn", "order_id")
    .option("lowerBound",      "1")
    .option("upperBound",      "10000000")
    .option("numPartitions",   "20")
    .option("fetchsize",       "50000")   # ← key performance option
    .option("driver",          "org.postgresql.Driver")
    .load()
)

Option	What It Controls	Recommended Value
`numPartitions`	Parallel JDBC connections to RDS	10–50 (don't exceed RDS max connections)
`fetchsize`	Rows per JDBC network round-trip	10,000–50,000
`partitionColumn`	Column used to split range	Numeric, indexed, low-skew (primary key is ideal)
`lowerBound`	Range start (not a filter)	Actual MIN of partitionColumn
`upperBound`	Range end (not a filter)	Actual MAX of partitionColumn
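
The effect of fetchsize is simple arithmetic: round-trips per task ≈ rows in the partition ÷ fetchsize. A quick sketch, assuming rows are spread evenly across partitions (jdbc_round_trips is a hypothetical helper, not a Spark API):

```python
import math


def jdbc_round_trips(rows: int, fetchsize: int, num_partitions: int = 1) -> int:
    """Network round-trips each task needs, assuming evenly sized partitions."""
    rows_per_partition = math.ceil(rows / num_partitions)
    return math.ceil(rows_per_partition / fetchsize)


# 10 million rows over 20 partitions = 500,000 rows per task
slow = jdbc_round_trips(10_000_000, fetchsize=10,     num_partitions=20)   # 50_000
fast = jdbc_round_trips(10_000_000, fetchsize=50_000, num_partitions=20)   # 10
```

Each round-trip pays network latency, so going from 50,000 to 10 trips per task is where the time saving comes from.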

🏗️

Architecture — Read Replicas & Multi-AZ

Read Replicas for Reporting Workloads

A read replica is a continuously synced copy of the primary RDS instance that accepts read-only queries. This is the most important RDS concept for Data Engineers: never run your Spark JDBC extracts against the production primary database. Heavy parallel JDBC reads (20 connections doing full table scans) can degrade application performance or cause connection pool exhaustion. Always point your ETL at a read replica.

📚 Analogy

The primary RDS instance is the original library book. A read replica is a photocopy of it. Your Spark job can highlight, mark, and read the photocopy all day — the original library book remains pristine and available for the application.

python — always use the read replica endpoint for ETL

# ❌ Primary endpoint — never use for heavy ETL reads
# jdbc:postgresql://mydb.abc123.us-east-1.rds.amazonaws.com:5432/sales_db

# ✅ Read replica endpoint — use this for all Spark JDBC extracts
jdbc_url_replica = "jdbc:postgresql://mydb-replica.abc123.us-east-1.rds.amazonaws.com:5432/sales_db"

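df = (
    spark.read
    .format("jdbc")
    .option("url",             jdbc_url_replica)   # ← replica, not primary
    .option("dbtable",         "public.orders")
    .option("user",            secret["username"])
    .option("password",        secret["password"])
    # numPartitions only parallelises the read together with the three options below
    .option("partitionColumn", "order_id")
    .option("lowerBound",      "1")
    .option("upperBound",      "10000000")
    .option("numPartitions",   "20")
    .load()
)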

📌 Replication Lag

Read replicas have a small replication lag — typically milliseconds to seconds, occasionally minutes under heavy write load. For daily batch ETL this is irrelevant. For near-real-time pipelines where you need data within seconds of it being written, use CDC (Debezium/DMS) from the primary's replication slot instead.

Multi-AZ for High Availability

Multi-AZ is an RDS feature that keeps a synchronous standby replica in a different Availability Zone. If the primary fails, RDS automatically promotes the standby — typical failover is 60–120 seconds. As a DE, Multi-AZ is mostly invisible to you (your JDBC connection reconnects after failover), but you need to understand it for production RDS sizing discussions and for explaining why your pipeline might see a brief connection error during an RDS maintenance window.

Feature	Read Replica	Multi-AZ Standby
Purpose	Read scaling + ETL offload	High availability / failover
Accepts reads?	Yes — separate endpoint	No — standby only
Replication type	Asynchronous	Synchronous
Failover	Manual promotion required	Automatic — 60–120s
DE relevance	Point ETL here	Transparent — just handle reconnect
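
"Just handle reconnect" usually means retrying with backoff. The sketch below is one minimal way to do that. with_retries is a hypothetical helper, and ConnectionError stands in for whatever exception your driver actually raises (for example psycopg2.OperationalError); the sleep function is injectable so the example runs instantly.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], attempts: int = 5, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying with exponential backoff on ConnectionError."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts:
                raise                                   # out of retries
            sleep(base_delay * 2 ** (attempt - 1))      # 1x, 2x, 4x, ... base_delay


calls = {"n": 0}


def flaky_query() -> str:
    """Fails twice, as a connection might during a failover, then succeeds."""
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ConnectionError("connection reset during failover")
    return "ok"


delays: List[float] = []
result = with_retries(flaky_query, sleep=delays.append)
```

Pick attempts and base_delay so the total wait covers the 60–120 second failover window; for example, base_delay=10 with 5 attempts waits up to 10 + 20 + 40 + 80 = 150 seconds.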

🕷️

Glue JDBC Crawler & DE Patterns

Source for Glue JDBC Crawlers

AWS Glue Crawlers can connect directly to an RDS instance via JDBC and automatically discover all tables and their schemas, registering them in the Glue Data Catalog. Once catalogued, your Glue ETL jobs and Athena can reference these tables by name without you writing schema definitions manually.

python — creating a Glue JDBC crawler for RDS via boto3

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# First, create a Glue Connection for the RDS instance
glue.create_connection(
    ConnectionInput={
        "Name": "rds-sales-db-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://mydb-replica.abc123.us-east-1.rds.amazonaws.com:5432/sales_db",
            # Glue reads the username and password from this Secrets Manager secret.
            # CloudFormation-style {{resolve:secretsmanager:...}} strings are not
            # expanded by the API and would be stored as a literal password.
            "SECRET_ID": "prod/rds/sales-db"
        },
        "PhysicalConnectionRequirements": {
            "SubnetId":               "subnet-0abc123",   # must be in same VPC as RDS
            "SecurityGroupIdList":    ["sg-0abc123"],
            "AvailabilityZone":       "us-east-1a"
        }
    }
)

# Create a crawler to scan the public schema of the RDS database
glue.create_crawler(
    Name="rds-sales-db-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # account IDs are 12 digits
    DatabaseName="raw_rds_sales",   # Glue Catalog database to write to
    Targets={"JdbcTargets": [{
        "ConnectionName": "rds-sales-db-connection",
        "Path": "sales_db/public/%"   # db/schema/table — % wildcard = all tables
    }]},
    Schedule="cron(0 1 * * ? *)"   # run nightly at 01:00 UTC
)

# Start the crawler
glue.start_crawler(Name="rds-sales-db-crawler")

Using RDS as a Metadata / Audit Database for Pipelines

For teams that are more comfortable with SQL than DynamoDB, RDS PostgreSQL (typically Aurora Serverless for cost) can serve as the pipeline metadata and audit database. You store run history, watermarks, control tables, and data quality results in relational tables — and query them with standard SQL joins. The trade-off vs DynamoDB: RDS requires connection management (pool size, VPC placement) and is not serverless in the same zero-management sense.

python — writing audit records to RDS using psycopg2

import boto3, json, uuid
import psycopg2

run_id = str(uuid.uuid4())

# ① Fetch credentials
sm     = boto3.client("secretsmanager")
secret = json.loads(sm.get_secret_value(SecretId="prod/rds/metadata-db")["SecretString"])

# ② Connect to RDS PostgreSQL
conn = psycopg2.connect(
    host=     secret["host"],
    port=     secret["port"],
    dbname=   secret["dbname"],
    user=     secret["username"],
    password= secret["password"],
    sslmode=  "require"   # always enforce SSL on RDS
)

# ③ Insert audit record. `with conn` commits on success and rolls back on
#    error, but does not close the connection, hence the finally block.
try:
    with conn, conn.cursor() as cursor:
        cursor.execute("""
            INSERT INTO pipeline_audit
                (run_id, pipeline_id, status, start_time, rows_written, error_msg)
            VALUES (%s, %s, %s, NOW(), %s, %s)
        """, (run_id, "sales-orders-etl", "SUCCEEDED", 142830, None))
finally:
    conn.close()

Complete RDS → S3 Extraction Pipeline

This is the standard end-to-end pattern for a daily incremental extract from RDS to the Bronze layer of your data lake on S3:

DAILY INCREMENTAL RDS → S3 BRONZE PATTERN
══════════════════════════════════════════════
① DynamoDB (or RDS metadata DB)
   → read last_watermark for this pipeline
② Spark JDBC (pointing at READ REPLICA)
   → incremental query: WHERE updated_at > last_watermark
   → partitioned read: numPartitions=20, fetchsize=50000
   → returns DataFrame of new/changed rows
③ Spark Transform (Bronze, minimal)
   → add audit columns: load_timestamp, batch_id, pipeline_name
   → cast types if needed
   → NO business logic at Bronze
④ Write to S3 (Bronze zone)
   → format: Parquet or Delta
   → partitionBy: year, month, day
   → mode: append
   → path: s3://company-bronze/rds/sales/orders/year=2024/month=01/day=15/
⑤ Update Glue Catalog
   → MSCK REPAIR TABLE or batch_create_partition()
⑥ Update watermark in DynamoDB / RDS metadata DB
   → set last_watermark = MAX(updated_at) from the extracted batch
⑦ Write audit record
   → run_id, status=SUCCEEDED, rows_written, duration

python — full incremental RDS → Bronze S3 pipeline

import boto3, json, uuid
from datetime import datetime, timezone
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, lit

spark = SparkSession.builder.appName("RDS-Bronze-Extract").getOrCreate()

# ── Config ─────────────────────────────────────────────────────────
PIPELINE    = "rds-orders-incremental"
S3_OUTPUT   = "s3://company-bronze/rds/sales/orders/"
RUN_ID      = str(uuid.uuid4())

# ── Credentials ────────────────────────────────────────────────────
sm     = boto3.client("secretsmanager")
secret = json.loads(sm.get_secret_value(SecretId="prod/rds/sales-db")["SecretString"])
jdbc_url = f"jdbc:postgresql://{secret['replica_host']}:5432/{secret['dbname']}"

# ── Watermark ──────────────────────────────────────────────────────
ddb     = boto3.resource("dynamodb")
wm_tbl  = ddb.Table("pipeline_watermarks")
last_wm = wm_tbl.get_item(Key={"pipeline_id": PIPELINE})["Item"]["last_watermark"]

# ── Extract ────────────────────────────────────────────────────────
query = f"(SELECT * FROM public.orders WHERE updated_at > '{last_wm}') orders_delta"

df = spark.read.format("jdbc") \
    .option("url",            jdbc_url) \
    .option("dbtable",        query) \
    .option("user",           secret["username"]) \
    .option("password",       secret["password"]) \
    .option("partitionColumn", "order_id") \
    .option("lowerBound",      "1") \
    .option("upperBound",      "99999999") \
    .option("numPartitions",   "20") \
    .option("fetchsize",       "50000") \
    .load()

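df.persist()   # count, write, and max() below would otherwise each re-query RDS
rows = df.count()

# ── Bronze Audit Columns ───────────────────────────────────────────
from pyspark.sql.functions import date_format   # for the partition columns

df_bronze = df \
    .withColumn("_load_timestamp", current_timestamp()) \
    .withColumn("_batch_id",        lit(RUN_ID)) \
    .withColumn("_pipeline_name",   lit(PIPELINE)) \
    .withColumn("year",  date_format("_load_timestamp", "yyyy")) \
    .withColumn("month", date_format("_load_timestamp", "MM")) \
    .withColumn("day",   date_format("_load_timestamp", "dd"))

# ── Write to S3 Bronze ─────────────────────────────────────────────
df_bronze.write \
    .partitionBy("year", "month", "day") \
    .mode("append") \
    .parquet(S3_OUTPUT)

# ── Update Watermark (only after the write, only if rows arrived) ──
if rows > 0:
    new_wm = df.agg({"updated_at": "max"}).collect()[0][0]
    wm_tbl.update_item(
        Key={"pipeline_id": PIPELINE},
        UpdateExpression="SET last_watermark = :wm",
        ExpressionAttributeValues={":wm": str(new_wm)}
    )
df.unpersist()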

print(f"✅ Extracted {rows:,} rows → {S3_OUTPUT}")
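
Step ⑤ (registering the new partition in the Glue Catalog) needs the exact S3 location that partitionBy produced. A small sketch of deriving that Hive-style path from a date follows; bronze_partition_path is a hypothetical helper, and it assumes the zero-padded year=/month=/day= layout shown in the pattern above.

```python
from datetime import date


def bronze_partition_path(base: str, day: date) -> str:
    """Build the Hive-style partition location for one load date."""
    return f"{base.rstrip('/')}/year={day:%Y}/month={day:%m}/day={day:%d}/"


path = bronze_partition_path("s3://company-bronze/rds/sales/orders/", date(2024, 1, 15))
```

The resulting string is what you would supply as the partition's storage location when calling batch_create_partition().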

29.14 — AMAZON EVENTBRIDGE

Amazon EventBridge

EventBridge is AWS's serverless event bus — it routes events from sources (AWS services, your own application code, SaaS tools) to targets (Lambda, SQS, SNS, Glue, Step Functions, and more) based on rules you define. For Data Engineers it serves two critical roles: cron-style pipeline scheduling (replacing simple time-based triggers) and event-driven pipeline triggers (reacting to S3 file arrivals, DMS task completions, Glue job state changes, and custom application events).

🚌

Core Concepts — Events, Rules, Targets & Event Buses

Events

An event is a JSON object describing something that happened — a state change, a file arrival, a job completion, or a custom business event your code publishes. Every event has a standard envelope with source, detail-type, detail (the payload), time, and account/region. EventBridge receives events and routes them to matching rules.

json — anatomy of an EventBridge event

{
  "version":     "0",
  "id":          "abc-123-def-456",
  "source":      "com.mycompany.data-platform",   // who sent it
  "detail-type": "PipelineCompleted",             // what kind of event
  "time":        "2024-01-15T08:30:00Z",
  "account":     "123456789012",
  "region":      "us-east-1",
  "detail": {                                      // your custom payload
    "pipeline_id":   "sales-orders-etl",
    "status":        "SUCCEEDED",
    "rows_written":  142830,
    "output_path":   "s3://company-silver/orders/year=2024/month=01/day=15/"
  }
}

Rules — Schedule & Event Pattern

A rule is the routing logic — it matches incoming events against a pattern (or a schedule) and sends matching events to one or more targets. There are two types of rules: Schedule rules (time-based — run at 06:00 UTC every day) and Event pattern rules (content-based — trigger when a Glue job state changes to FAILED).

📬 Analogy

EventBridge is a smart post office. Events are letters. Rules are sorting instructions: "any letter from sender X with subject Y goes to mailbox Z." The post office doesn't care about the letter's content beyond matching — it just delivers.

json — event pattern rule: match Glue job state FAILED

{
  "source": ["aws.glue"],
  "detail-type": ["Glue Job State Change"],
  "detail": {
    "state":   ["FAILED", "ERROR", "TIMEOUT"],
    "jobName": ["sales-orders-etl", "customer-dim-etl"]
  }
}

Targets

A target is where EventBridge sends the matched event. One rule can have up to 5 targets — so a single event can simultaneously trigger a Lambda, send a message to SQS, and notify an SNS topic. Common DE targets:

AWS Lambda · Amazon SQS · Amazon SNS · AWS Glue (workflow) · Step Functions · Amazon ECS Task · Another EventBridge Bus · Kinesis Data Stream

Event Buses — Default vs Custom

Every AWS account has a default event bus that receives all AWS service events (S3, Glue, EMR, DMS state changes, etc.). You can also create custom event buses for your own application events — isolating them from AWS service events and enabling cross-account event routing. For pipeline platforms, a custom bus named something like data-platform-events keeps your events organized and separate.

python — creating a custom event bus

import boto3

eb = boto3.client("events", region_name="us-east-1")

# Create a custom event bus for all data platform events
eb.create_event_bus(Name="data-platform-events")

# Default bus is always named "default"
# Custom buses are referenced by ARN or name in rules and put_events

⏰

Scheduling — Cron & Rate-Based Rules SCHEDULING

Cron Expressions

EventBridge supports cron expressions for precise scheduling — run at exactly 06:00 UTC every weekday, or at midnight on the first of every month. EventBridge cron uses a 6-field format: cron(minutes hours day-of-month month day-of-week year). Note: EventBridge does not support seconds-level granularity — minimum is 1 minute.

text — common cron expressions for DE pipelines

# Format: cron(minutes hours day-of-month month day-of-week year)

cron(0 6 * * ? *)          # Daily at 06:00 UTC
cron(0 1 * * ? *)          # Daily at 01:00 UTC (typical nightly batch)
cron(0 0 1 * ? *)          # Monthly: 1st of every month at 00:00 UTC
cron(0 6 ? * MON-FRI *)    # Weekdays only at 06:00 UTC
cron(0 */4 * * ? *)        # Every 4 hours, on the hour
cron(0 8 ? * MON *)        # Every Monday at 08:00 UTC (weekly batch)

# Note: use ? in day-of-month OR day-of-week (not both)
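The day-of-month / day-of-week rule is the easiest part of the format to get wrong, so it can help to build the expression in code. The helper below is a minimal sketch; the function name `eventbridge_cron` is hypothetical, and the check encodes the reading that exactly one of the two day fields must be `?`.

```python
def eventbridge_cron(minute="0", hour="0", day_of_month="*", month="*",
                     day_of_week="?", year="*"):
    """Build an EventBridge cron() expression from its six fields."""
    dom, dow = str(day_of_month), str(day_of_week)
    # EventBridge does not allow both day fields to carry a value (or *):
    # one of them has to be the "?" placeholder.
    if (dom == "?") == (dow == "?"):
        raise ValueError("exactly one of day_of_month / day_of_week must be '?'")
    fields = [str(minute), str(hour), dom, str(month), dow, str(year)]
    return "cron(" + " ".join(fields) + ")"


nightly  = eventbridge_cron(0, 1)
weekdays = eventbridge_cron(0, 6, day_of_month="?", day_of_week="MON-FRI")
```

The resulting string can be passed directly as `ScheduleExpression` to `put_rule()`.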

Rate Expressions

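Rate expressions are simpler than cron: they just say "run every N minutes/hours/days." Use them when you need a regular interval rather than a specific clock time. The interval is counted from the moment the rule is created or enabled, not from a clock boundary, so rate(1 day) does not mean "at midnight." Use the singular unit when N is 1 (rate(1 hour)) and the plural otherwise (rate(12 hours)).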

text — rate expressions

rate(5 minutes)    # every 5 minutes — lightweight polling pipeline
rate(1 hour)       # every hour
rate(1 day)        # every day (from rule creation time, not midnight)
rate(12 hours)     # twice a day

Creating a Scheduled Rule via boto3

The full pattern: create a rule with a schedule, create a target (e.g. Lambda that triggers a Glue job), and add the permission for EventBridge to invoke the Lambda.

python — scheduled rule → Lambda target

import boto3, json

eb     = boto3.client("events")
lam    = boto3.client("lambda")

RULE_NAME      = "nightly-sales-etl-trigger"
LAMBDA_ARN     = "arn:aws:lambda:us-east-1:123456789012:function:trigger-sales-etl"
LAMBDA_FUNC    = "trigger-sales-etl"

# ① Create the schedule rule
rule_resp = eb.put_rule(
    Name=           RULE_NAME,
    ScheduleExpression= "cron(0 1 * * ? *)",  # every night at 01:00 UTC
    State=          "ENABLED",
    Description=    "Triggers the nightly sales orders ETL pipeline"
)
rule_arn = rule_resp["RuleArn"]

# ② Add Lambda as target — pass pipeline config in the input JSON
eb.put_targets(
    Rule=RULE_NAME,
    Targets=[{
        "Id":    "sales-etl-lambda-target",
        "Arn":   LAMBDA_ARN,
        "Input": json.dumps({           # static JSON sent to Lambda as event
            "pipeline_id":  "sales-orders-etl",
            "trigger_type": "SCHEDULED",
            "environment":  "prod"
        })
    }]
)

# ③ Grant EventBridge permission to invoke the Lambda
lam.add_permission(
    FunctionName=  LAMBDA_FUNC,
    StatementId=   "allow-eventbridge-invoke",
    Action=        "lambda:InvokeFunction",
    Principal=     "events.amazonaws.com",
    SourceArn=     rule_arn
)

print(f"✅ Scheduled rule created: {rule_arn}")
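The rule above delivers the static `Input` JSON to the Lambda as its event. A minimal sketch of the receiving handler follows. It assumes the Glue job is named after `pipeline_id` and that the job accepts `--trigger-type` and `--environment` arguments; both are illustrative assumptions, not something the rule enforces. The optional `glue` parameter exists only so the function can be exercised without AWS access.

```python
def handler(event, context, glue=None):
    """Lambda entry point for the scheduled rule; `event` is the static Input JSON."""
    if glue is None:
        import boto3  # imported lazily so the module loads without AWS access
        glue = boto3.client("glue")

    pipeline_id = event["pipeline_id"]
    run = glue.start_job_run(
        JobName=pipeline_id,  # assumption: the Glue job shares the pipeline_id name
        Arguments={
            "--trigger-type": event.get("trigger_type", "SCHEDULED"),
            "--environment":  event.get("environment", "dev"),
        },
    )
    return {"pipeline_id": pipeline_id, "job_run_id": run["JobRunId"]}
```

Lambda calls the handler with two positional arguments, so the extra keyword parameter is ignored in production and the real Glue client is created on first use.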

⚡

Event-Driven Pipeline Triggers PATTERNS

S3 → EventBridge → Lambda → Glue

The most common event-driven DE pattern: a file lands in S3, EventBridge detects it, triggers a Lambda, which starts a Glue job to process the file. This eliminates polling — your pipeline reacts within seconds of file arrival rather than waiting for a fixed schedule.

S3 FILE ARRIVAL → EVENTBRIDGE → LAMBDA → GLUE PATTERN
═══════════════════════════════════════════════════════

① Enable S3 EventBridge notifications on the bucket
   (one checkbox in the console, or putBucketNotificationConfiguration)

② EventBridge receives event from S3:
     source:      "aws.s3"
     detail-type: "Object Created"
     detail:
       bucket.name: "company-raw-landing"
       object.key:  "feeds/salesforce/2024/01/15/accounts.csv"

③ EventBridge rule matches on:
     source             = ["aws.s3"]
     detail.bucket.name = ["company-raw-landing"]
     detail.object.key  = [{"prefix": "feeds/salesforce/"}]

④ Target: Lambda function
     → Lambda reads event, extracts bucket + key
     → validates file (head_object: check size > 0)
     → starts Glue job with --source-path argument

⑤ Glue job:
     → reads the file from S3
     → transforms → writes Parquet to Silver zone
     → updates Glue Catalog partitions
     → writes audit record to DynamoDB

python — EventBridge rule for S3 Object Created events

import boto3, json

eb = boto3.client("events")

# Rule: match S3 ObjectCreated events for a specific prefix
eb.put_rule(
    Name=        "s3-salesforce-feed-arrival",
    EventPattern= json.dumps({
        "source":      ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": ["company-raw-landing"]},
            "object": {"key": [{"prefix": "feeds/salesforce/"}]}
        }
    }),
    State=       "ENABLED",
    Description= "Trigger Glue when Salesforce feed lands in S3"
)

# Lambda handler that EventBridge calls — triggers Glue
# (This goes in your Lambda function code, not boto3 call)
LAMBDA_CODE = """
import boto3, json

def handler(event, context):
    glue   = boto3.client("glue")
    detail = event["detail"]
    bucket = detail["bucket"]["name"]
    key    = detail["object"]["key"]

    glue.start_job_run(
        JobName="salesforce-accounts-etl",
        Arguments={
            "--source-bucket": bucket,
            "--source-key":    key,
            "--trigger-type":  "S3_EVENT"
        }
    )
    return {"status": "Glue job started", "key": key}
"""

print("Lambda code ready — deploy as trigger-salesforce-etl function")

📌 Enable EventBridge Notifications on the S3 Bucket First

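S3 does not send events to EventBridge by default. You must explicitly enable it: in the S3 console → bucket → Properties → Amazon EventBridge → Enable. Or via boto3: s3.put_bucket_notification_configuration(Bucket=..., NotificationConfiguration={"EventBridgeConfiguration": {}}). Note that this call replaces the bucket's entire notification configuration, so include any existing SQS, SNS, or Lambda notifications in the same request.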
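Step ④ of the pattern calls for validating the file before starting Glue. One way to sketch that check is shown below. The function name `handle_s3_arrival` and the injected `s3` and `glue` clients are illustrative choices; the `--source-path` argument follows the pattern description, and the job name is the one used in the handler above.

```python
def handle_s3_arrival(event, s3, glue, job_name="salesforce-accounts-etl"):
    """Validate the arrived object, then start the Glue job for it."""
    detail = event["detail"]
    bucket = detail["bucket"]["name"]
    key    = detail["object"]["key"]

    # head_object returns metadata only, so the file body is never downloaded
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    if size == 0:
        # Empty files are skipped instead of wasting a Glue run
        return {"status": "skipped", "reason": "empty file", "key": key}

    run = glue.start_job_run(
        JobName=job_name,
        Arguments={
            "--source-path":  f"s3://{bucket}/{key}",
            "--trigger-type": "S3_EVENT",
        },
    )
    return {"status": "started", "job_run_id": run["JobRunId"], "key": key, "size": size}
```

Inside a Lambda you would call it as `handle_s3_arrival(event, boto3.client("s3"), boto3.client("glue"))`.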

S3 → EventBridge → Airflow Trigger

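If your orchestrator is Airflow, you can trigger a DAG run on file arrival by routing the EventBridge event to a Lambda that calls the Airflow REST API. This combines event-driven file detection with full Airflow orchestration for the downstream pipeline. The handler below uses basic authentication, which fits a self-managed Airflow deployment with the basic-auth API backend enabled. Amazon MWAA does not accept username/password calls like this; it authenticates through IAM instead (for example with a short-lived token from the MWAA create_cli_token API). The requests library is not part of the Lambda Python runtime, so package it with the function or attach it as a layer.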

python — Lambda handler: EventBridge → Airflow DAG trigger

import boto3, json
import requests   # not bundled with the Lambda runtime; ship it with the function

def handler(event, context):
    # Extract file details from EventBridge S3 event
    detail = event["detail"]
    s3_key = detail["object"]["key"]

    # Fetch Airflow credentials from Secrets Manager
    sm     = boto3.client("secretsmanager")
    secret = json.loads(sm.get_secret_value(SecretId="prod/airflow/api-creds")["SecretString"])

    # Trigger the Airflow DAG via REST API
    airflow_url = secret["webserver_url"]
    dag_id      = "salesforce_feed_processor"

    resp = requests.post(
        f"{airflow_url}/api/v1/dags/{dag_id}/dagRuns",
        auth=(secret["username"], secret["password"]),
        json={"conf": {"s3_key": s3_key, "trigger": "s3_event"}},
        timeout=10
    )
    resp.raise_for_status()
    return {"dag_run_id": resp.json()["dag_run_id"], "s3_key": s3_key}
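EventBridge may invoke a target more than once for the same event, so the handler above can create duplicate DAG runs. One hedged way to guard against that is to send a deterministic `dag_run_id`, which the Airflow stable REST API rejects with a 409 conflict when it already exists. The helper name `build_dag_run_payload` and the ID format are assumptions for illustration; `etag` is taken from the `detail.object.etag` field of the S3 event.

```python
import hashlib


def build_dag_run_payload(bucket, key, etag=""):
    """Build a dagRuns request body whose run ID is stable for a given S3 object."""
    # Same bucket/key/etag always hashes to the same ID, so a redelivered
    # event maps onto the run that already exists.
    digest = hashlib.sha256(f"{bucket}/{key}/{etag}".encode("utf-8")).hexdigest()[:16]
    return {
        "dag_run_id": f"s3_event__{digest}",
        "conf": {"s3_bucket": bucket, "s3_key": key, "trigger": "s3_event"},
    }


payload = build_dag_run_payload("company-raw-landing", "feeds/salesforce/accounts.csv", "abc123")
```

Pass the result as `json=payload` in the `requests.post()` call and treat an HTTP 409 response as "already triggered" rather than as a failure.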

Glue Job State Change → SNS Alert

AWS Glue automatically emits state-change events to EventBridge when a job transitions to SUCCEEDED, FAILED, TIMEOUT, or STOPPED. You can create a rule that routes FAILED events directly to SNS — giving you instant alerting without any polling code.

python — Glue FAILED → SNS alert rule

import boto3, json

eb  = boto3.client("events")
sns = boto3.client("sns")

SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:de-pipeline-failures"

# Rule: any Glue job in FAILED or TIMEOUT state
eb.put_rule(
    Name=         "glue-job-failure-alert",
    EventPattern= json.dumps({
        "source":      ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail":      {"state": ["FAILED", "TIMEOUT"]}
    }),
    State=        "ENABLED"
)

# Target: SNS topic. EventBridge sends the full event JSON as the message.
# SNS targets are authorized by the topic's resource policy (set below),
# so no RoleArn is passed on the target.
eb.put_targets(
    Rule=    "glue-job-failure-alert",
    Targets= [{
        "Id":  "sns-failure-alert",
        "Arn": SNS_TOPIC_ARN
    }]
)

# Allow EventBridge to publish to this SNS topic.
# Note: this call replaces the topic's entire access policy, so merge in any
# existing statements you still need.
sns.set_topic_attributes(
    TopicArn=       SNS_TOPIC_ARN,
    AttributeName=  "Policy",
    AttributeValue= json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Sid":       "AllowEventBridgePublish",
            "Effect":    "Allow",
            "Principal": {"Service": "events.amazonaws.com"},
            "Action":    "sns:Publish",
            "Resource":  SNS_TOPIC_ARN
        }]
    })
)
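Sending the full event JSON makes for a noisy alert. A target can carry an `InputTransformer` that extracts fields from the event and renders a short message instead. The sketch below builds such a target definition; the helper name `glue_failure_target` and the message wording are illustrative, while the JSON paths refer to the `jobName`, `state`, and `jobRunId` fields of the Glue state-change event.

```python
def glue_failure_target(topic_arn, target_id="sns-failure-alert"):
    """Return a put_targets() entry that sends a one-line summary to SNS."""
    return {
        "Id":  target_id,
        "Arn": topic_arn,
        "InputTransformer": {
            # Map placeholder names to JSON paths inside the incoming event
            "InputPathsMap": {
                "job":   "$.detail.jobName",
                "state": "$.detail.state",
                "run":   "$.detail.jobRunId",
                "time":  "$.time",
            },
            # A string template must itself be wrapped in double quotes
            "InputTemplate": '"Glue job <job> ended in state <state> (run <run>) at <time>"',
        },
    }


target = glue_failure_target("arn:aws:sns:us-east-1:123456789012:de-pipeline-failures")
```

Use it as `eb.put_targets(Rule="glue-job-failure-alert", Targets=[target])` in place of the plain target.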

📤

Custom Events — put_events() from Pipeline Code CUSTOM EVENTS

Publishing Custom Pipeline Events

put_events() lets your pipeline code publish its own events to EventBridge — enabling downstream pipelines to react automatically. For example, when the Bronze orders ETL completes, it publishes a BronzeLayerReady event, which triggers the Silver transformation pipeline. This creates a fully event-driven, decoupled pipeline chain without any polling or hardcoded dependencies.

python — publishing a custom pipeline completion event

import boto3, json
from datetime import datetime, timezone

eb = boto3.client("events")

def publish_pipeline_event(pipeline_id: str, status: str,
                             rows_written: int, output_path: str):
    """Publish a pipeline completion event to EventBridge."""
    resp = eb.put_events(
        Entries=[{
            "EventBusName": "data-platform-events",   # custom bus
            "Source":       "com.mycompany.data-platform",
            "DetailType":   "PipelineStateChange",
            "Detail": json.dumps({
                "pipeline_id":  pipeline_id,
                "status":       status,
                "rows_written": rows_written,
                "output_path":  output_path,
                "timestamp":    datetime.now(timezone.utc).isoformat(),
                "layer":        "bronze"
            })
        }]
    )

    failed = resp["FailedEntryCount"]
    if failed > 0:
        raise RuntimeError(f"EventBridge put_events failed: {resp['Entries']}")
    return resp["Entries"][0]["EventId"]


# At end of Bronze ETL pipeline — publish event to trigger Silver
event_id = publish_pipeline_event(
    pipeline_id=  "rds-orders-bronze",
    status=       "SUCCEEDED",
    rows_written= 142830,
    output_path=  "s3://company-bronze/rds/sales/orders/year=2024/month=01/day=15/"
)
print(f"✅ Event published: {event_id}")

Event-Driven Pipeline Chain Pattern

By combining put_events() with EventBridge rules, you can chain pipeline stages so each stage automatically triggers the next on success — with no orchestrator polling between them. This is the foundation of a reactive data platform.

EVENT-DRIVEN PIPELINE CHAIN
══════════════════════════════════════════════════════

Stage 1: Bronze ETL (scheduled nightly by EventBridge cron)
  → reads RDS via JDBC
  → writes Parquet to S3 Bronze
  → put_events("PipelineStateChange", pipeline_id="rds-orders-bronze", status="SUCCEEDED")

EventBridge Rule 1: match {pipeline_id: "rds-orders-bronze", status: "SUCCEEDED"}
  → Target: Lambda trigger-silver-etl → starts Glue job "orders-silver-transform"

Stage 2: Silver ETL (Glue job)
  → reads Bronze Parquet
  → cleans/deduplicates
  → writes Delta to Silver
  → put_events("PipelineStateChange", pipeline_id="orders-silver", status="SUCCEEDED")

EventBridge Rule 2: match {pipeline_id: "orders-silver", status: "SUCCEEDED"}
  → Target: Lambda trigger-gold-etl → starts Glue job "orders-gold-aggregation"

Stage 3: Gold ETL (Glue job)
  → reads Silver Delta
  → aggregates
  → writes Gold tables
  → put_events("PipelineStateChange", pipeline_id="orders-gold", status="SUCCEEDED")

EventBridge Rule 3: match {pipeline_id: "orders-gold", status: "SUCCEEDED"}
  → Target: SNS → email "Gold layer ready for BI"
  → Target: Lambda → invalidate Redshift Spectrum cache

Any stage FAILED:
  → EventBridge Rule (status=FAILED) → SNS → PagerDuty alert
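Each "EventBridge Rule N" in the chain is an event-pattern rule on the custom bus. A minimal sketch of creating one is shown below. The helper names are hypothetical, and `eb` is expected to be a boto3 `events` client. The rule has to live on the same bus the events are published to (`data-platform-events` here), otherwise it never sees them.

```python
import json


def chain_rule_pattern(pipeline_id, status="SUCCEEDED"):
    """Event pattern matching one pipeline's PipelineStateChange events."""
    return {
        "source":      ["com.mycompany.data-platform"],
        "detail-type": ["PipelineStateChange"],
        "detail":      {"pipeline_id": [pipeline_id], "status": [status]},
    }


def create_chain_rule(eb, rule_name, pipeline_id, target_arn,
                      bus="data-platform-events"):
    """Create the rule on the custom bus and attach a single target."""
    eb.put_rule(
        Name=rule_name,
        EventBusName=bus,
        EventPattern=json.dumps(chain_rule_pattern(pipeline_id)),
        State="ENABLED",
    )
    eb.put_targets(
        Rule=rule_name,
        EventBusName=bus,
        Targets=[{"Id": f"{rule_name}-target", "Arn": target_arn}],
    )
```

For Rule 1 in the diagram this would be `create_chain_rule(eb, "bronze-to-silver", "rds-orders-bronze", silver_lambda_arn)`, where the rule name and the Lambda ARN variable are placeholders. A Lambda target still needs the `add_permission()` grant shown in the scheduling section.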

put_events() Limits & Batching

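put_events() accepts up to 10 entries per call. The 256 KB size limit applies to the request as a whole (the combined size of all entries), so ten large events may not fit in a single call even though the count is within the limit. If you need to publish more than 10 events at once (e.g. one event per table in a metadata-driven pipeline with 50 tables), batch them into groups of 10 and check FailedEntryCount on every response, because put_events() can return successfully while individual entries fail.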

python — batching put_events for many pipeline events

import boto3, json
from datetime import datetime, timezone

eb = boto3.client("events")

def publish_events_batch(events: list):
    """Publish a list of events, batching into groups of 10."""
    failures = []
    for i in range(0, len(events), 10):   # chunk into 10s
        batch = events[i:i+10]
        resp  = eb.put_events(Entries=batch)
        if resp["FailedEntryCount"] > 0:
            failures.extend([
                e for e in resp["Entries"] if "ErrorCode" in e
            ])
    if failures:
        raise RuntimeError(f"Failed to publish {len(failures)} events: {failures}")

# Build one event per completed pipeline table
pipeline_results = [
    {"table": "orders",   "rows": 142830},
    {"table": "customers", "rows": 85200},
    # ... up to 50 tables
]

entries = [{
    "EventBusName": "data-platform-events",
    "Source":       "com.mycompany.data-platform",
    "DetailType":   "TableLoadComplete",
    "Detail":       json.dumps({
        "table_name": r["table"],
        "rows":       r["rows"],
        "timestamp":  datetime.now(timezone.utc).isoformat()
    })
} for r in pipeline_results]

publish_events_batch(entries)
print(f"✅ Published {len(entries)} table completion events")
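Chunking by count alone can still exceed the size limit when payloads are large. The sketch below batches by both count and bytes. The size estimate follows the documented approach of summing the UTF-8 lengths of `Source`, `DetailType`, `Detail`, and `Resources`, plus 14 bytes when `Time` is set; the function names and the default limits are assumptions to verify against current quotas.

```python
def entry_size(entry):
    """Approximate the size EventBridge counts for one put_events entry."""
    size = 14 if entry.get("Time") is not None else 0
    size += len(entry["Source"].encode("utf-8"))
    size += len(entry["DetailType"].encode("utf-8"))
    size += len(entry.get("Detail", "").encode("utf-8"))
    size += sum(len(r.encode("utf-8")) for r in entry.get("Resources", []))
    return size


def batch_entries(entries, max_count=10, max_bytes=256 * 1024):
    """Yield lists of entries that respect both the count and the byte limit."""
    batch, batch_bytes = [], 0
    for entry in entries:
        size = entry_size(entry)
        if size > max_bytes:
            raise ValueError(f"single entry of {size} bytes exceeds the limit")
        # Close the current batch if adding this entry would break a limit
        if batch and (len(batch) >= max_count or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(entry)
        batch_bytes += size
    if batch:
        yield batch
```

In `publish_events_batch()` the fixed `range(0, len(events), 10)` loop could then be replaced by `for batch in batch_entries(events):`.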

🔍

Event Patterns, Filtering & Management APIs DEEP DIVE

Event Pattern Matching — All Filter Types

EventBridge event patterns support rich filtering beyond simple equality — prefix matching, suffix matching, numeric ranges, existence checks, and anything-but (negation). This lets you route events with surgical precision.

json — event pattern filter types

{
  "source": ["com.mycompany.data-platform"],
  "detail": {

    // Exact match
    "status": ["SUCCEEDED"],

    // Multiple values — OR logic
    "layer": ["bronze", "silver"],

    // Prefix match — key starts with "orders"
    "table_name": [{ "prefix": "orders" }],

    // Anything-but — NOT these values
    "environment": [{ "anything-but": ["dev", "test"] }],

    // Exists — only match if field is present
    "output_path": [{ "exists": true }],

    // Numeric range — rows_written between 1000 and 10_000_000
    "rows_written": [{ "numeric": [">", 1000, "<=", 10000000] }]
  }
}
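To make the matching semantics concrete (AND across fields, OR within each value list), here is a simplified local matcher for the operators shown above. It is a teaching sketch, not EventBridge's implementation: it ignores array-valued event fields and the other operators EventBridge supports. For real validation, the boto3 `events` client offers `test_event_pattern()`.

```python
import operator

_NUMERIC_OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
                ">": operator.gt, ">=": operator.ge}


def _condition_matches(cond, present, value):
    """Evaluate one entry of a pattern's value list against one event field."""
    if not isinstance(cond, dict):
        return present and value == cond           # plain literal: exact match
    if "exists" in cond:
        return cond["exists"] == present
    if not present:
        return False                               # remaining operators need the field
    if "prefix" in cond:
        return isinstance(value, str) and value.startswith(cond["prefix"])
    if "suffix" in cond:
        return isinstance(value, str) and value.endswith(cond["suffix"])
    if "anything-but" in cond:
        banned = cond["anything-but"]
        banned = banned if isinstance(banned, list) else [banned]
        return value not in banned
    if "numeric" in cond:
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        spec = cond["numeric"]                     # e.g. [">", 1000, "<=", 10000000]
        return all(_NUMERIC_OPS[op](value, bound)
                   for op, bound in zip(spec[0::2], spec[1::2]))
    return False


def matches(pattern, event):
    """True if every field named in `pattern` is satisfied by `event`."""
    for key, expected in pattern.items():
        if isinstance(expected, dict):             # nested object: recurse
            child = event.get(key)
            if not isinstance(child, dict) or not matches(expected, child):
                return False
        elif not any(_condition_matches(c, key in event, event.get(key))
                     for c in expected):           # value list: any alternative
            return False
    return True
```

Feeding it the pattern above (with the `//` comments removed) and a sample event is a quick way to sanity-check routing logic in unit tests before deploying a rule.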

Management APIs — list, delete, disable rules

The full lifecycle of EventBridge rules — listing active rules, temporarily disabling them during maintenance windows, removing stale rules, and listing their targets — is managed via boto3.

python — EventBridge rule lifecycle management

import boto3

eb = boto3.client("events")

# List all rules (paginate for > 100 rules)
paginator = eb.get_paginator("list_rules")
for page in paginator.paginate(EventBusName="data-platform-events"):
    for rule in page["Rules"]:
        print(f"{rule['Name']:40s} {rule['State']:10s} {rule.get('ScheduleExpression','')}")

# Disable a rule during a maintenance window
eb.disable_rule(Name="nightly-sales-etl-trigger", EventBusName="default")

# Re-enable after maintenance
eb.enable_rule(Name="nightly-sales-etl-trigger", EventBusName="default")

# List targets for a rule
targets = eb.list_targets_by_rule(Rule="nightly-sales-etl-trigger")
for t in targets["Targets"]:
    print(f"  Target: {t['Id']} → {t['Arn']}")

# Remove a target (required before deleting the rule)
eb.remove_targets(
    Rule=    "nightly-sales-etl-trigger",
    Ids=     ["sales-etl-lambda-target"]
)

# Delete the rule (only after removing all targets)
eb.delete_rule(Name="nightly-sales-etl-trigger")

⚠️ Delete Order Matters

You must remove all targets first before calling delete_rule(). Calling delete_rule() on a rule that still has targets will raise a ValidationException. Always: remove_targets() → then delete_rule().
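The remove-then-delete order can be wrapped in one helper so it is never skipped. The sketch below takes the `events` client as a parameter; the function name `delete_rule_safely` is hypothetical.

```python
def delete_rule_safely(eb, rule_name, bus="default"):
    """Remove every target from a rule, then delete the rule. Returns removed IDs."""
    ids, token = [], None
    while True:
        kwargs = {"Rule": rule_name, "EventBusName": bus}
        if token:
            kwargs["NextToken"] = token
        page = eb.list_targets_by_rule(**kwargs)
        ids.extend(t["Id"] for t in page["Targets"])
        token = page.get("NextToken")
        if not token:
            break

    if ids:
        resp = eb.remove_targets(Rule=rule_name, EventBusName=bus, Ids=ids)
        # remove_targets reports per-target failures in the response body
        if resp.get("FailedEntryCount", 0) > 0:
            raise RuntimeError(f"Could not remove targets: {resp.get('FailedEntries')}")

    eb.delete_rule(Name=rule_name, EventBusName=bus)
    return ids
```

Usage: `delete_rule_safely(boto3.client("events"), "nightly-sales-etl-trigger")`.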

EventBridge vs CloudWatch Events

You may see references to CloudWatch Events in older AWS documentation and Terraform resources. EventBridge is CloudWatch Events — it was renamed and expanded in 2019. The same API, same underlying service. All new development should use the EventBridge name and console; CloudWatch Events still works but points to the same service.

Feature	EventBridge (current)	CloudWatch Events (old name)
AWS service events	✅	✅
Custom event buses	✅	❌
SaaS partner events	✅	❌
Schema registry	✅	❌
Same API calls?	Yes — identical boto3 client("events")

29.15 — AMAZON SQS

Amazon SQS — Simple Queue Service

SQS is AWS's fully managed message queue service. For Data Engineers, it acts as a buffer and decoupler between pipeline stages — absorbing bursts of messages so your downstream processors (Lambda, Glue, Spark) don't get overwhelmed. It's the backbone of event-driven architectures: file arrivals land in SQS, pipelines consume from SQS, failures get routed to a Dead Letter Queue (DLQ). Understanding SQS deeply — visibility timeouts, polling strategies, DLQ design, and the consume-process-delete pattern — is essential for reliable production pipelines.

📥

Queue Types — Standard vs FIFO CORE CONCEPT

Standard Queue — At-Least-Once, Best-Effort Ordering

A Standard Queue gives you nearly unlimited throughput — it can handle thousands of messages per second. The trade-off is that it offers at-least-once delivery (a message might be delivered more than once) and best-effort ordering (messages may arrive out of order). For most data pipeline use cases — triggering Glue jobs, notifying Lambda of file arrivals, buffering pipeline events — Standard Queue is the right choice. Your consumer code simply needs to be idempotent (processing the same message twice produces the same result).

📬 Analogy

A Standard Queue is like a busy post office that guarantees every letter is delivered, but occasionally makes a duplicate copy of a letter "just to be safe." Letters may also arrive slightly out of order. Your job as the recipient (consumer) is to handle receiving the same letter twice without doing the same action twice.

🚀

Throughput

Virtually unlimited — thousands of messages/second with no configuration needed.

📦

Delivery

At-least-once — same message may be delivered more than once. Build idempotent consumers.

🔢

Ordering

Best-effort — messages usually arrive in order, but no guarantee. Do not rely on order.

💲

Cost

$0.40 per million requests. Free tier: 1 million requests/month. Extremely cheap.
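Because a Standard Queue can deliver the same message twice, the consumer has to recognize repeats. The sketch below shows the idea with an in-memory set keyed on the SQS `MessageId`. The class name is hypothetical, and the set is a stand-in assumption: a real pipeline would need a durable store shared across workers (for example a DynamoDB conditional write).

```python
class IdempotentProcessor:
    """Runs process_fn at most once per MessageId seen by this instance."""

    def __init__(self, process_fn, seen=None):
        self._process = process_fn
        self._seen = set() if seen is None else seen

    def handle(self, message):
        """Return True if the message was processed, False if it was a duplicate."""
        msg_id = message["MessageId"]
        if msg_id in self._seen:
            return False                # duplicate delivery: safe to delete, no work
        self._process(message["Body"])
        self._seen.add(msg_id)          # record only after successful processing
        return True


processed = []
worker = IdempotentProcessor(processed.append)
worker.handle({"MessageId": "m-1", "Body": "orders.csv"})
worker.handle({"MessageId": "m-1", "Body": "orders.csv"})   # redelivery is ignored
```

Recording the ID only after `process_fn` succeeds means a crash mid-processing leaves the message eligible for retry.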

FIFO Queue — Exactly-Once, Strict Ordering

A FIFO Queue (First-In-First-Out) guarantees exactly-once processing and strict message ordering within a message group. It prevents duplicates through a deduplication ID — if you send two messages with the same deduplication ID within a 5-minute window, the second is discarded. FIFO queues have a throughput limit of 3,000 messages/second with batching (300 without). Use FIFO when order matters — for example, processing database CDC events where an INSERT must be processed before the UPDATE of the same row.

💡 When to Use FIFO

Use FIFO when: (1) order matters — CDC events, bank transactions, state machine transitions; (2) exactly-once matters — no tolerance for duplicate processing even with idempotent consumers. For everything else in DE pipelines, Standard Queue is preferred due to higher throughput.

Feature	Standard Queue	FIFO Queue
Delivery guarantee	At-least-once	Exactly-once
Ordering	Best-effort	Strict (per group)
Throughput	Unlimited	3,000 msg/sec (batched)
Deduplication	No	Yes (5-min window)
Queue name suffix	Any name	Must end in `.fifo`
Use case in DE	File triggers, pipeline events, alerts	CDC events, ordered state transitions
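For the CDC use case, ordering and deduplication are controlled by two extra parameters on `send_message()`: `MessageGroupId` and `MessageDeduplicationId`. The sketch below derives both from a CDC event. The event fields (`table`, `pk`, `op`, `lsn`) and the helper name are hypothetical; grouping by table and primary key is one reasonable choice, since it keeps changes to the same row in order while letting different rows proceed in parallel.

```python
import hashlib
import json


def fifo_send_params(queue_url, cdc_event):
    """Build send_message() kwargs for a FIFO queue from a CDC event dict."""
    body = json.dumps(cdc_event, sort_keys=True)   # stable serialization
    return {
        "QueueUrl":    queue_url,
        "MessageBody": body,
        # Ordering is guaranteed only within a message group
        "MessageGroupId": f"{cdc_event['table']}#{cdc_event['pk']}",
        # Identical bodies within the dedup window are dropped by SQS
        "MessageDeduplicationId": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }


params = fifo_send_params(
    "https://sqs.us-east-1.amazonaws.com/123456789012/cdc-events.fifo",
    {"table": "orders", "pk": 1001, "op": "UPDATE", "lsn": 88213},
)
```

Send it with `sqs.send_message(**params)`. If the queue has content-based deduplication enabled, the explicit deduplication ID can be omitted.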

Dead Letter Queue (DLQ) — Handling Poison Messages

A Dead Letter Queue is just another SQS queue that receives messages that have been received too many times without being successfully processed. You configure a maxReceiveCount in the main queue's redrive policy; once a message's receive count exceeds that value without the message being deleted, SQS moves it to the DLQ. This prevents a "poison message" (a malformed record that keeps crashing your consumer) from being retried forever and blocking other processing. The DLQ must be the same type as its source queue: a FIFO queue needs a FIFO DLQ.

🏥 Analogy

The DLQ is like a hospital's triage room for "incurable" patients. If a patient (message) keeps coming back to the ER (consumer) without getting better (being successfully processed) 3 or 5 times, they're moved to a special ward (DLQ) for deeper investigation by a specialist (your ops team or a separate repair pipeline).

python — creating a main queue with DLQ using boto3

import boto3, json

sqs = boto3.client("sqs", region_name="us-east-1")

# ① Create the Dead Letter Queue first
dlq_resp = sqs.create_queue(
    QueueName="pipeline-events-dlq",
    Attributes={
        "MessageRetentionPeriod": "1209600"   # 14 days — max retention for DLQ
    }
)
dlq_url = dlq_resp["QueueUrl"]

# Get the DLQ ARN (needed for redrive policy)
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url,
    AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# ② Create the main queue with a redrive policy pointing to the DLQ
main_resp = sqs.create_queue(
    QueueName="pipeline-events",
    Attributes={
        "VisibilityTimeout":    "300",         # 5 min — must be > job runtime
        "MessageRetentionPeriod":"86400",        # 1 day
        "ReceiveMessageWaitTimeSeconds": "20", # long polling (cost-saving)
        # After 3 failed attempts, move message to DLQ
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount":     3
        })
    }
)
main_queue_url = main_resp["QueueUrl"]
print(f"Main queue: {main_queue_url}")
print(f"DLQ:        {dlq_url}")
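Once the underlying bug is fixed, messages parked in the DLQ usually need to go back to the main queue. A minimal manual replay loop is sketched below; the function name `replay_dlq` and its `max_messages` safety cap are illustrative. Newer SDK versions also expose a managed redrive through `start_message_move_task()`, which may be preferable where available.

```python
def replay_dlq(sqs, dlq_url, main_url, max_messages=100):
    """Move up to max_messages from the DLQ back to the main queue."""
    moved = 0
    while moved < max_messages:
        resp = sqs.receive_message(
            QueueUrl=dlq_url,
            MaxNumberOfMessages=min(10, max_messages - moved),
            WaitTimeSeconds=1,
            MessageAttributeNames=["All"],
        )
        messages = resp.get("Messages", [])
        if not messages:
            break                                   # DLQ drained
        for msg in messages:
            params = {"QueueUrl": main_url, "MessageBody": msg["Body"]}
            if msg.get("MessageAttributes"):
                params["MessageAttributes"] = msg["MessageAttributes"]
            sqs.send_message(**params)
            # Delete from the DLQ only after the re-send succeeded
            sqs.delete_message(QueueUrl=dlq_url, ReceiptHandle=msg["ReceiptHandle"])
            moved += 1
    return moved
```

Called as `replay_dlq(sqs, dlq_url, main_queue_url)`, it returns the number of messages moved. Replayed messages start with a fresh receive count on the main queue.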

⚙️

Key Concepts — Visibility Timeout, Retention, Long Polling MUST KNOW

Visibility Timeout — The Most Important SQS Setting

When a consumer receives a message, SQS makes that message invisible to all other consumers for the duration of the visibility timeout. The consumer has until the timeout expires to finish processing and delete the message. If it fails to delete within the timeout window, SQS makes the message visible again so another consumer (or the same one on retry) can pick it up. This is the core mechanism behind SQS's at-least-once delivery guarantee.

⚠️ Most Common Mistake

Setting the visibility timeout shorter than your job runtime. If your Glue job takes 10 minutes but the visibility timeout is 5 minutes, SQS re-delivers the message while the first job is still running — resulting in two parallel jobs processing the same file. Always set visibility timeout to at least 1.5× your expected processing time.

python — extending visibility timeout for long-running jobs

import boto3, threading

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events"

# Receive a message (long polling)
resp = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=1,
    WaitTimeSeconds=20,          # long poll — wait up to 20s for a message
    VisibilityTimeout=300         # initially invisible for 5 minutes
)

messages = resp.get("Messages", [])
if not messages:
    print("No messages available")
else:
    msg    = messages[0]
    handle = msg["ReceiptHandle"]  # needed to delete or extend visibility
    body   = msg["Body"]

    # ── Heartbeat: extend visibility timeout every 4 minutes ───────────
    # Useful for jobs that might run longer than the initial timeout
    def extend_visibility(queue_url, receipt_handle, stop_event):
        while not stop_event.is_set():
            stop_event.wait(240)   # wait 4 minutes
            if not stop_event.is_set():
                sqs.change_message_visibility(
                    QueueUrl=      queue_url,
                    ReceiptHandle= receipt_handle,
                    VisibilityTimeout= 300    # reset to 5 more minutes
                )
                print("Visibility timeout extended by 5 minutes")

    stop_event = threading.Event()
    heartbeat  = threading.Thread(
        target=extend_visibility,
        args=(QUEUE_URL, handle, stop_event),
        daemon=True
    )
    heartbeat.start()

    try:
        # ── Process the message (e.g. start a Glue job) ────────────────
        print(f"Processing: {body}")
        # ... do actual work here ...

        # ── On success: DELETE the message from the queue ──────────────
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=handle)
        print("✅ Message processed and deleted")
    except Exception as e:
        print(f"❌ Processing failed: {e} — message will become visible again")
        # Do NOT delete — let SQS re-deliver (up to maxReceiveCount times)
    finally:
        stop_event.set()   # stop the heartbeat thread

Message Retention Period

SQS stores messages for a configurable retention period — from 1 minute to 14 days (default is 4 days). If a message is not consumed and deleted within the retention period, SQS permanently deletes it. For your DLQ, set the maximum 14-day retention so your ops team has ample time to investigate failures and replay messages. For the main queue, 1–4 days is usually enough.

📅

Main Queue

1–4 days retention. Messages processed quickly. Don't need long retention.

🗄️

DLQ Retention

Always set to 14 days. Gives team time to investigate, fix, and replay failed messages.

📏

Message Size

Max 256 KB per message. For large payloads, store data in S3 and send only the S3 key in SQS.
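The "store in S3, send the key" approach is often called the claim-check pattern. A minimal sketch follows, assuming injected `sqs` and `s3` clients; the function name, the `sqs-payloads/` prefix, and the pointer format are illustrative choices, and the size threshold mirrors the 256 KB figure above.

```python
import json
import uuid

MAX_SQS_BYTES = 256 * 1024   # message size limit quoted above


def send_with_claim_check(sqs, s3, queue_url, bucket, payload,
                          key_prefix="sqs-payloads/"):
    """Send payload inline if it fits, otherwise park it in S3 and send a pointer."""
    body = json.dumps(payload)
    if len(body.encode("utf-8")) <= MAX_SQS_BYTES:
        sqs.send_message(QueueUrl=queue_url, MessageBody=body)
        return "inline"

    # Too large for SQS: write the payload to S3 and enqueue only its location
    key = f"{key_prefix}{uuid.uuid4()}.json"
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    pointer = json.dumps({"s3_bucket": bucket, "s3_key": key})
    sqs.send_message(QueueUrl=queue_url, MessageBody=pointer)
    return "s3"
```

The consumer checks for `s3_key` in the body and fetches the object with `get_object()` when it is present. Message attributes also count toward the size limit, so leave some headroom if you attach many of them.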

Long Polling vs Short Polling

Short polling (default) immediately returns — even if the queue is empty — and you are charged for the API call. Long polling (WaitTimeSeconds=20) waits up to 20 seconds for a message to arrive before returning an empty response. Long polling dramatically reduces costs (fewer empty API calls) and reduces latency for message consumers. Always use long polling in production — set WaitTimeSeconds to 20 at the queue level or on each receive_message() call.

python — setting long polling at the queue level (recommended)

import boto3

sqs = boto3.client("sqs")

# Enable long polling on an existing queue
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events",
    Attributes={
        "ReceiveMessageWaitTimeSeconds": "20"   # 20 seconds = maximum long poll
    }
)

# Now every receive_message() call on this queue will automatically long-poll
# (up to 20 seconds) without needing to set WaitTimeSeconds each time

# Check current queue settings
attrs = sqs.get_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events",
    AttributeNames=["All"]
)["Attributes"]

print(f"VisibilityTimeout:         {attrs['VisibilityTimeout']}s")
print(f"MessageRetentionPeriod:    {attrs['MessageRetentionPeriod']}s")
print(f"WaitTimeSeconds (polling): {attrs['ReceiveMessageWaitTimeSeconds']}s")
print(f"ApproxMessages in queue:   {attrs['ApproximateNumberOfMessages']}")
print(f"ApproxMessages in-flight:  {attrs['ApproximateNumberOfMessagesNotVisible']}")

Message Attributes — Metadata on Messages

Message attributes are key-value metadata attached to a message — separate from the message body. They let consumers route or filter messages without parsing the body. For example, you can attach source_table=orders and environment=prod as attributes, and consumers can check these before deciding whether to process the message. SQS supports up to 10 message attributes per message, with String, Number, and Binary types.

python — sending a message with attributes

import boto3, json

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events"

sqs.send_message(
    QueueUrl=    QUEUE_URL,
    MessageBody= json.dumps({
        "s3_bucket": "company-raw",
        "s3_key":    "uploads/orders/2024-01-15/orders.csv",
        "file_size": 15728640     # 15 MB
    }),
    MessageAttributes={
        "source_table": {
            "DataType":    "String",
            "StringValue": "orders"
        },
        "environment": {
            "DataType":    "String",
            "StringValue": "prod"
        },
        "file_count": {
            "DataType":    "Number",
            "StringValue": "1"
        }
    }
)

# Receive with attributes
resp = sqs.receive_message(
    QueueUrl=              QUEUE_URL,
    MaxNumberOfMessages=   1,
    WaitTimeSeconds=       20,
    MessageAttributeNames= ["All"]   # request all attributes
)

for msg in resp.get("Messages", []):
    env = msg.get("MessageAttributes", {}).get("environment", {}).get("StringValue")
    tbl = msg.get("MessageAttributes", {}).get("source_table", {}).get("StringValue")
    print(f"env={env} table={tbl} body={msg['Body']}")

🔄

Consume-Process-Delete Pattern — The Core SQS Loop MOST IMPORTANT

The Pattern Explained

The core SQS consumer pattern is always: receive → process → delete on success only. Never delete before processing. Never skip the delete on success. This three-step pattern, combined with the visibility timeout, is what gives SQS its reliability guarantee — if your process crashes between receive and delete, SQS automatically re-delivers the message after the timeout expires.

CONSUME-PROCESS-DELETE PATTERN
══════════════════════════════════════════
┌─────────────────────────────────────┐
│           SQS MAIN QUEUE            │
│  [msg1] [msg2] [msg3] [msg4] ...    │
└──────────────┬──────────────────────┘
               │ receive_message(), up to 10 msgs
               ▼
┌─────────────────────────────────────┐
│  CONSUMER (Lambda / Glue / EC2)     │
│                                     │
│  msg is INVISIBLE during processing │
│  (visibility timeout ticking...)    │
└──────────┬──────────────┬───────────┘
           │ SUCCESS      │ FAILURE
           ▼              ▼
  delete_message()     visibility timeout expires
  (msg gone forever)   msg becomes VISIBLE again
                          │
                          ▼ (after maxReceiveCount attempts)
                ┌───────────────────┐
                │    DEAD LETTER    │
                │       QUEUE       │
                │  (investigate &   │
                │   replay/fix)     │
                └───────────────────┘

Production Consumer Loop — Full Code

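This is the production-grade pattern for a long-running Python worker (for example an EC2 or container process, or a Glue Python shell job) that continuously polls SQS and triggers a Glue job for each message. A Lambda consumer would normally rely on an SQS event source mapping rather than running its own polling loop: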

python — production SQS consumer that triggers Glue jobs

import boto3, json, logging, time
from botocore.exceptions import ClientError

logger  = logging.getLogger(__name__)
sqs     = boto3.client("sqs")
glue    = boto3.client("glue")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events"

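def process_message(body: dict) -> None:
    """Trigger a Glue job for the S3 file described in the SQS message."""
    # Both keys must be present in the message body. If the producer sends
    # source_table only as a message attribute, read it from
    # msg["MessageAttributes"] in the caller and pass it in instead.
    s3_key = body["s3_key"]
    table  = body["source_table"]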

    logger.info(f"Starting Glue job for {table} file: {s3_key}")
    glue.start_job_run(
        JobName=  f"ingest-{table}",
        Arguments={
            "--s3_key":     s3_key,
            "--table_name": table
        }
    )

def consumer_loop(max_iterations=None):
    """Main polling loop. Runs indefinitely unless max_iterations is set."""
    iteration = 0

    while True:
        # Compare against None so that max_iterations=0 exits immediately
        if max_iterations is not None and iteration >= max_iterations:
            break

        try:
            # ① RECEIVE — long poll, up to 10 messages at once
            resp = sqs.receive_message(
                QueueUrl=             QUEUE_URL,
                MaxNumberOfMessages=  10,   # batch up to 10
                WaitTimeSeconds=      20,   # long poll — ALWAYS use this
                VisibilityTimeout=    600,  # 10 min: must exceed the time needed to process the whole batch
                MessageAttributeNames=["All"]
            )

            messages = resp.get("Messages", [])
            if not messages:
                logger.debug("Queue empty — polling again")
                iteration += 1
                continue

            for msg in messages:
                receipt_handle = msg["ReceiptHandle"]
                try:
                    # ② PROCESS
                    body = json.loads(msg["Body"])
                    process_message(body)

                    # ③ DELETE — only on success
                    sqs.delete_message(
                        QueueUrl=      QUEUE_URL,
                        ReceiptHandle= receipt_handle
                    )
                    logger.info(f"✅ Message processed and deleted: {body.get('s3_key')}")

                except ClientError as e:
                    code = e.response["Error"]["Code"]
                    logger.error(f"❌ AWS error processing message: {code} — will retry")
                    # Do NOT delete — visibility timeout expires → SQS re-delivers

                except Exception as e:
                    logger.error(f"❌ Unexpected error: {e} — will retry (maxReceiveCount times)")
                    # Do NOT delete — let SQS handle retry → eventually goes to DLQ

        except ClientError as e:
            logger.error(f"SQS receive failed: {e} — sleeping 30s")
            time.sleep(30)

        iteration += 1

if __name__ == "__main__":
    consumer_loop()
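When processing time is hard to predict, a fixed VisibilityTimeout is either too short (duplicate deliveries) or too long (slow retries). One common alternative is a heartbeat that keeps extending the timeout while work is in progress. The sketch below is an illustration under assumptions: process_with_heartbeat and its parameters are hypothetical names, and sqs is any object exposing boto3's change_message_visibility() and delete_message().

```python
import threading


def process_with_heartbeat(sqs, queue_url, message, work,
                           extend_by=300, interval=120.0):
    """Hypothetical helper: run work(message) while a background thread
    keeps the message invisible, then delete it only on success."""
    stop = threading.Event()

    def heartbeat():
        # Event.wait() returns False on timeout and True once stop is set
        while not stop.wait(interval):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=message["ReceiptHandle"],
                VisibilityTimeout=extend_by,  # seconds counted from now
            )

    thread = threading.Thread(target=heartbeat, daemon=True)
    thread.start()
    try:
        result = work(message)
    finally:
        stop.set()
        thread.join()

    # Reached only if work() did not raise, so failures stay in the queue
    sqs.delete_message(QueueUrl=queue_url,
                       ReceiptHandle=message["ReceiptHandle"])
    return result
```

The total time a message can stay invisible is still capped at 12 hours from the moment it was received, so this does not replace a DLQ for jobs that hang.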

Batch Delete — delete_message_batch()

When processing messages in batches of 10, use delete_message_batch() to delete all successfully processed messages in a single API call instead of 10 separate calls. This reduces API costs and latency. Always check Failed in the response — partial batch failures are possible.

python — batch delete after processing 10 messages

import boto3, json

sqs       = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events"

# Receive up to 10 messages
resp     = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
messages = resp.get("Messages", [])

successful_handles = []  # collect receipt handles of successfully processed messages

for i, msg in enumerate(messages):
    try:
        body = json.loads(msg["Body"])
        # process body...
        successful_handles.append({
            "Id":            str(i),            # unique ID within this batch
            "ReceiptHandle": msg["ReceiptHandle"]
        })
    except Exception as e:
        print(f"Failed to process message {i}: {e} — leaving in queue for retry")

# Batch delete all successfully processed messages in ONE API call
if successful_handles:
    delete_resp = sqs.delete_message_batch(
        QueueUrl= QUEUE_URL,
        Entries=  successful_handles
    )
    if delete_resp.get("Failed"):
        print(f"⚠️ Some deletes failed: {delete_resp['Failed']}")
    else:
        print(f"✅ Batch deleted {len(successful_handles)} messages")

🏗️

Data Engineering Patterns with SQS PRODUCTION PATTERNS

Pattern 1 — S3 File Arrival → SQS → Lambda → Glue

The most common data engineering SQS pattern: when a file lands in S3, an S3 event notification sends a message to SQS. A Lambda function consumes the queue (with an SQS event source mapping, the Lambda service long-polls the queue and invokes the function with batches of messages), validates the file, and starts a Glue ETL job. The queue acts as a buffer: if Lambda is throttled or Glue is at its concurrent job limit, messages wait safely in SQS instead of being lost.

FILE ARRIVAL PIPELINE PATTERN
══════════════════════════════════════════
Partner uploads file
        │
        ▼
S3 bucket (raw zone)
  s3://company-raw/orders/2024-01-15/orders.csv
        │
        │ S3 Event Notification (on PUT)
        ▼
SQS queue: "file-arrival-queue"
  [{"bucket":"company-raw","key":"orders/2024-01-15/orders.csv"}]
        │
        │ Lambda polls every ~20s (long polling)
        ▼
Lambda "file-trigger"
  ① head_object(): verify file exists & check size
  ② start_job_run("orders-ingest", {--s3_key: ...})
  ③ delete_message(): only after Glue job started
        │
        ▼
Glue ETL job "orders-ingest"
  → reads CSV → transforms → writes Parquet to Silver
  → updates audit table in DynamoDB
        │
        ▼
SNS notification → email/Slack on success or failure

python — S3 event notification configuration to send to SQS

import boto3, json

s3  = boto3.client("s3")
sqs = boto3.client("sqs")

BUCKET    = "company-raw"
QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:file-arrival-queue"

# ① Attach a resource policy to SQS allowing S3 to send messages to it
#    (this overwrites any existing queue policy, so merge statements if one exists)
sqs.set_queue_attributes(
    QueueUrl=  "https://sqs.us-east-1.amazonaws.com/123456789012/file-arrival-queue",
    Attributes={
        "Policy": json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Sid":       "AllowS3SendMessage",
                "Effect":    "Allow",
                "Principal": {"Service": "s3.amazonaws.com"},
                "Action":    "SQS:SendMessage",
                "Resource":  QUEUE_ARN,
                "Condition": {"ArnLike": {"aws:SourceArn": f"arn:aws:s3:::{BUCKET}"}}
            }]
        })
    }
)

# ② Configure S3 to send notifications to SQS on object creation
#    (this call replaces the bucket's entire notification configuration)
s3.put_bucket_notification_configuration(
    Bucket= BUCKET,
    NotificationConfiguration={
        "QueueConfigurations": [{
            "QueueArn": QUEUE_ARN,
            "Events":   ["s3:ObjectCreated:*"],
            "Filter": {
                "Key": {"FilterRules": [
                    {"Name": "prefix", "Value": "orders/"},   # only orders/ prefix
                    {"Name": "suffix", "Value": ".csv"}          # only .csv files
                ]}
            }
        }]
    }
)
print("S3 → SQS notification configured")
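On the consumer side, the message S3 places on the queue is a JSON document with a Records list, not a flat bucket/key pair, and object keys arrive URL-encoded. The following sketch shows one way to extract the objects; the function name s3_objects_from_sqs_body is hypothetical, and the sample body in the usage comment is illustrative.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sqs_body(body: str) -> list:
    """Hypothetical helper: list the S3 objects named in one notification."""
    event = json.loads(body)

    # S3 sends a one-off test message when the notification is configured
    if event.get("Event") == "s3:TestEvent":
        return []

    objects = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        objects.append({
            "bucket": s3_info["bucket"]["name"],
            # Keys are URL-encoded in the event (spaces arrive as "+")
            "key": unquote_plus(s3_info["object"]["key"]),
            "size": s3_info["object"].get("size"),
        })
    return objects


# Usage inside a consumer loop (msg comes from sqs.receive_message):
# for obj in s3_objects_from_sqs_body(msg["Body"]):
#     print(obj["bucket"], obj["key"])
```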

Pattern 2 — Decoupling Pipeline Stages

SQS decouples fast producers from slow consumers. If your Bronze ETL processes 100 tables per hour but the Silver transformation can only handle 20 tables per hour, an SQS queue between them absorbs the difference. Bronze finishes and enqueues all 100 completion events; Silver processes them at its own pace over 5 hours without dropping any. This is the classic producer-consumer decoupling pattern.

DECOUPLED BRONZE → SILVER PIPELINE
══════════════════════════════════════════
PRODUCER (fast)
  Bronze ETL: processes 100 tables in 1 hour
  → For each table: sqs.send_message({table: "orders", path: "s3://..."})
  → Enqueues 100 messages quickly, then exits

  ↓↓↓↓↓↓↓↓↓↓ (messages safely buffered in SQS) ↓↓↓↓↓↓↓↓↓↓

SQS queue: "bronze-ready-queue"
  [orders] [customers] [products] [... 97 more ...]

CONSUMER (slow)
  Silver transformer: processes 20 tables/hour
  → polls queue, processes 20/hr over 5 hours
  → No data loss: all 100 tables eventually processed
  → delete_message() after each successful Silver write
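The 5-hour figure follows from dividing the backlog by the consumer's rate. A small sketch of that arithmetic is below (drain_hours is a hypothetical helper); it can be used to sanity-check that the backlog drains well inside the queue's retention period, here assumed to be 4 days.

```python
def drain_hours(backlog: int, consume_per_hour: float,
                produce_per_hour: float = 0.0) -> float:
    """Hours needed to empty the queue at the given net consumption rate."""
    net_rate = consume_per_hour - produce_per_hour
    if net_rate <= 0:
        raise ValueError("consumer never catches up; the backlog keeps growing")
    return backlog / net_rate


# Numbers from the Bronze → Silver example: 100 queued tables, 20 per hour
hours = drain_hours(backlog=100, consume_per_hour=20)
print(hours)  # 5.0

# Assumed retention of 4 days; messages older than this are dropped by SQS
retention_hours = 4 * 24
print(hours < retention_hours)  # True
```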

Pattern 3 — DLQ Monitoring and Replay

A DLQ is only useful if you monitor it and act on the messages in it. Set up a CloudWatch alarm on the DLQ's ApproximateNumberOfMessagesVisible metric (the CloudWatch counterpart of the ApproximateNumberOfMessages queue attribute); any message in the DLQ means a pipeline step failed. After fixing the root cause, use DLQ redrive (Start DLQ redrive in the console, or the StartMessageMoveTask API) to move messages back to the main queue for reprocessing.

python — monitor DLQ depth and alarm on failures

import boto3

sqs = boto3.client("sqs")
cw  = boto3.client("cloudwatch")

DLQ_URL  = "https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events-dlq"
SNS_ARN  = "arn:aws:sns:us-east-1:123456789012:data-eng-alerts"

# Check DLQ depth right now
attrs = sqs.get_queue_attributes(
    QueueUrl=       DLQ_URL,
    AttributeNames= ["ApproximateNumberOfMessages",
                     "ApproximateNumberOfMessagesNotVisible"]
)["Attributes"]

dlq_depth     = int(attrs["ApproximateNumberOfMessages"])
in_flight     = int(attrs["ApproximateNumberOfMessagesNotVisible"])
print(f"DLQ depth: {dlq_depth} visible, {in_flight} in-flight")

# Create CloudWatch alarm: alert if DLQ has ANY messages
cw.put_metric_alarm(
    AlarmName=          "pipeline-events-DLQ-not-empty",
    AlarmDescription=   "Messages in DLQ — pipeline processing failures detected",
    MetricName=         "ApproximateNumberOfMessagesVisible",
    Namespace=          "AWS/SQS",
    Dimensions=[{"Name": "QueueName", "Value": "pipeline-events-dlq"}],
    Statistic=          "Sum",
    Period=             60,           # 1-minute evaluation
    EvaluationPeriods=  1,
    Threshold=          0,            # alarm if ANY message appears
    ComparisonOperator= "GreaterThanThreshold",
    AlarmActions=       [SNS_ARN],
    TreatMissingData=   "notBreaching"
)
print("DLQ alarm created — will alert on first failure")
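Once the root cause is fixed, the redrive itself can also be started from boto3 through start_message_move_task(). The sketch below is a minimal illustration: redrive_dlq is a hypothetical wrapper name, and sqs is assumed to be a boto3 SQS client with permission to start move tasks on the DLQ.

```python
def redrive_dlq(sqs, dlq_url: str, max_per_second=None) -> str:
    """Hypothetical wrapper: move all DLQ messages back to their source queue."""
    dlq_arn = sqs.get_queue_attributes(
        QueueUrl=dlq_url,
        AttributeNames=["QueueArn"],
    )["Attributes"]["QueueArn"]

    kwargs = {"SourceArn": dlq_arn}
    if max_per_second is not None:
        # Optional throttle so the replay does not overwhelm the consumer
        kwargs["MaxNumberOfMessagesPerSecond"] = max_per_second

    # DestinationArn is omitted, so messages return to their original queue
    return sqs.start_message_move_task(**kwargs)["TaskHandle"]


# Usage (assumes boto3 is configured):
# import boto3
# handle = redrive_dlq(boto3.client("sqs"), DLQ_URL, max_per_second=10)
```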

send_message_batch() — Bulk Sending

When you need to enqueue many messages at once (e.g. one message per table at the start of a metadata-driven pipeline run), use send_message_batch() to send up to 10 messages per API call instead of 10 separate calls. Always check Failed in the response — retry any failed entries.

python — batch send: enqueue one message per table

import boto3, json

sqs       = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events"

# Build one message per table (extend this list with the rest of your tables)
tables = ["orders", "customers", "products", "returns", "inventory"]

def send_in_batches(queue_url: str, table_list: list):
    """Send one SQS message per table, batched in groups of 10."""
    failed_tables = []
    for i in range(0, len(table_list), 10):
        batch = table_list[i:i+10]
        entries = [
            {
                "Id":          str(j),
                "MessageBody": json.dumps({
                    "table_name": t,
                    "run_date":   "2024-01-15",
                    "pipeline":   "daily-incremental"
                })
            }
            for j, t in enumerate(batch)
        ]
        resp = sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
        if resp.get("Failed"):
            failed = [entries[int(f["Id"])]["MessageBody"] for f in resp["Failed"]]
            failed_tables.extend(failed)
            print(f"⚠️ {len(resp['Failed'])} sends failed in this batch")
    print(f"✅ Enqueued {len(table_list) - len(failed_tables)} tables")
    return failed_tables

failed = send_in_batches(QUEUE_URL, tables)
if failed:
    print(f"Retry these: {failed}")
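The function above reports failed entries but leaves the retry to the caller. One possible retry strategy is sketched here: resend only the entries listed in Failed, with exponential backoff, and give up immediately on entries flagged SenderFault because those will not succeed on retry. The name send_batch_with_retry and its defaults are assumptions for illustration.

```python
import time


def send_batch_with_retry(sqs, queue_url, entries,
                          max_attempts=3, base_delay=0.5):
    """Hypothetical helper: send up to 10 entries, retrying partial failures.

    Returns the entries that could not be sent (empty list on full success).
    """
    pending = list(entries)
    rejected = []  # sender faults, e.g. malformed entries

    for attempt in range(max_attempts):
        if not pending:
            break
        if attempt:
            time.sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...

        resp = sqs.send_message_batch(QueueUrl=queue_url, Entries=pending)
        failed = {f["Id"]: f for f in resp.get("Failed", [])}

        rejected += [e for e in pending
                     if e["Id"] in failed and failed[e["Id"]].get("SenderFault")]
        pending = [e for e in pending
                   if e["Id"] in failed and not failed[e["Id"]].get("SenderFault")]

    return rejected + pending
```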

📋

SQS Quick Reference — Key Settings & Limits REFERENCE

Production Configuration Checklist

Setting	Recommended Value	Why
VisibilityTimeout	1.5× job runtime	Prevents re-delivery while processing is still running
WaitTimeSeconds	20 (long polling)	Reduces empty polls → lower cost, lower latency
MessageRetentionPeriod (main)	86400–345600 (1–4 days)	Enough buffer for delayed consumers
MessageRetentionPeriod (DLQ)	1209600 (14 days)	Maximum time to investigate failures
maxReceiveCount (redrive)	3–5	Retry transient failures; avoid infinite loops
MaxNumberOfMessages	10	Always batch receive for cost efficiency
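The queue-level rows of this checklist map onto the Attributes argument of create_queue(), where every value must be a string. The sketch below builds that dict from the checklist's rules of thumb; queue_attributes is a hypothetical helper, and the DLQ ARN in the usage comment is a placeholder.

```python
import json


def queue_attributes(job_runtime_sec: int, dlq_arn: str,
                     retention_days: int = 4,
                     max_receive_count: int = 5) -> dict:
    """Hypothetical helper: SQS queue attributes following the checklist."""
    # 1.5x the job runtime, capped at the 12-hour SQS maximum
    visibility = min(int(job_runtime_sec * 1.5), 12 * 60 * 60)
    return {
        "VisibilityTimeout": str(visibility),
        "ReceiveMessageWaitTimeSeconds": "20",  # long polling by default
        "MessageRetentionPeriod": str(retention_days * 86400),
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receive_count,
        }),
    }


# Usage (assumes boto3 is configured and the DLQ already exists):
# sqs.create_queue(
#     QueueName="pipeline-events",
#     Attributes=queue_attributes(400, "arn:aws:sqs:us-east-1:123456789012:pipeline-events-dlq"),
# )
```

MaxNumberOfMessages and the per-call WaitTimeSeconds are receive_message() parameters, so they are set by the consumer rather than on the queue.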

SQS Limits Every DE Should Know

📏

Max Message Size

256 KB. For larger payloads, store the payload in S3 and send only the S3 key in the SQS message. The Amazon SQS Extended Client Library implements this pattern.

📦

Batch Size

Max 10 messages per receive/send/delete batch call. Always use batch to reduce API costs.

⏱️

Max Visibility Timeout

12 hours. Use change_message_visibility() to extend for very long-running jobs.

💾

In-Flight Messages

Standard: 120,000 max in-flight. FIFO: 20,000. In-flight = received but not yet deleted.
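The "store in S3, send the key" approach for oversized payloads can be sketched in a few lines. This is an illustration under assumptions, not the Extended Client Library itself: send_with_claim_check, the sqs-payloads/ prefix, and the pointer format are all hypothetical, and sqs and s3 are boto3 clients passed in by the caller.

```python
import json
import uuid

MAX_SQS_BYTES = 256 * 1024  # message size limit quoted above


def send_with_claim_check(sqs, s3, queue_url, bucket, payload: dict,
                          prefix="sqs-payloads/", max_bytes=MAX_SQS_BYTES):
    """Hypothetical helper: send small payloads inline, large ones via S3."""
    body = json.dumps(payload)

    if len(body.encode("utf-8")) > max_bytes:
        # Too large for SQS: park the payload in S3 and send a pointer
        key = f"{prefix}{uuid.uuid4()}.json"
        s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
        body = json.dumps({"s3_bucket": bucket, "s3_key": key})

    return sqs.send_message(QueueUrl=queue_url, MessageBody=body)
```

The consumer has to recognise the pointer format and fetch the object before processing; message attributes also count toward the size limit, so leave some headroom if you attach many of them.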

Complete SQS API Summary for Data Engineers

create_queue(), get_queue_url(), delete_queue(), send_message(), send_message_batch(), receive_message(), delete_message(), delete_message_batch(), change_message_visibility(), get_queue_attributes(), set_queue_attributes(), list_queues()

☁️ The Golden Rule of SQS

Always: receive → process → delete on success only. Never delete before confirming success. Never skip the delete on success. This one rule is what makes SQS reliable for production pipelines. Everything else (visibility timeout, DLQ, long polling) is configuration around this core pattern.

29.16

Amazon SNS — Simple Notification Service

SNS is AWS's managed pub/sub messaging service. You publish one message to a topic and SNS fans it out to every subscriber simultaneously — email inboxes, SQS queues, Lambda functions, HTTPS endpoints, SMS, and more. For Data Engineers SNS is the standard way to alert on pipeline failures, trigger multiple downstream consumers from a single event, and integrate CloudWatch alarms with on-call tooling.

📣

Topics, Subscriptions, and the Fan-Out Model CORE CONCEPT

What is a Topic?

An SNS topic is a named communication channel. Producers publish messages to a topic; SNS immediately delivers copies to every active subscriber. Topics are regional and identified by an ARN: arn:aws:sns:us-east-1:123456789012:pipeline-alerts.

📡 Analogy

Think of SNS as a radio broadcast tower. You speak once into the microphone (publish one message) and every radio tuned to that frequency (subscriber) hears it simultaneously. You don't need to know who's listening or how many listeners there are — the tower handles delivery.

📋

Standard Topic

At-least-once delivery. Best-effort ordering. Supports all subscription types. Unlimited throughput. Use for most pipeline alerts.

🔢

FIFO Topic

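Strict ordering within a message group, plus message deduplication. Subscribers must be SQS queues (use SQS FIFO queues to preserve ordering end to end); email, SMS, HTTPS, and Lambda endpoints are not supported. Use when order matters, e.g. CDC event ordering.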

Subscription Types

A single topic can have many subscriptions of different types simultaneously. When you publish one message, SNS delivers to all of them in parallel — this is the fan-out pattern.

Protocol	Endpoint	Data Engineering Use Case
email	email address	Alert on-call engineer on pipeline failure
sqs	SQS queue ARN	Fan-out to multiple processing queues
lambda	Lambda function ARN	Trigger automated remediation on alert
https	Webhook URL	Post alert to Slack / PagerDuty / Teams
sms	Phone number	Critical on-call SMS for SLA breaches

✅ Real-World Fan-Out

A single SNS topic prod-pipeline-failures has three subscriptions: (1) SQS queue consumed by a retry Lambda, (2) email to the data engineering team, (3) HTTPS endpoint to PagerDuty. One publish triggers all three simultaneously.

Architecture Diagram — Fan-Out

SNS FAN-OUT ARCHITECTURE
┌──────────────────────────────────────────────────────────────┐
│ PRODUCER                                                     │
│ Glue job fails → boto3 sns.publish(TopicArn=..., ...)        │
└───────────────────┬──────────────────────────────────────────┘
                    │ one publish()
                    ▼
             ┌─────────────────────┐
             │      SNS TOPIC      │
             │     prod-alerts     │
             │     (Standard)      │
             └──────┬──────────────┘
      ┌─────────────┼──────────────┐
      ▼             ▼              ▼
┌────────────┐ ┌──────────┐ ┌──────────────┐
│ SQS Queue  │ │  Email   │ │    HTTPS     │
│ retry-dlq  │ │ team@co  │ │  PagerDuty   │
│            │ │          │ │   webhook    │
└─────┬──────┘ └──────────┘ └──────────────┘
      │
      ▼
Lambda retries the failed job

🔀

SNS → SQS Fan-Out Pattern KEY PATTERN

Why SNS + SQS Together?

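SNS is a push service: it delivers each message as it arrives and retries for a limited period according to the subscription's delivery policy. If a subscriber stays unavailable past those retries, the message is discarded unless the subscription has its own dead-letter queue. By subscribing SQS queues to SNS topics you get the best of both worlds: SNS handles the fan-out, and SQS provides durable buffering so downstream consumers can process at their own pace and survive outages.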

📌 The Golden Rule

SNS = delivery fan-out. SQS = durable buffer. Always pair them for production pipeline event handling. Never consume SNS directly in a Lambda that does heavy processing — put an SQS queue in between so failures can be retried from the queue.

python — create topic, subscribe SQS queue, fan-out setup

import boto3, json

sns = boto3.client("sns", region_name="us-east-1")
sqs = boto3.client("sqs", region_name="us-east-1")

# ── 1. Create the SNS topic ───────────────────────────────────────
topic_resp = sns.create_topic(
    Name="prod-pipeline-alerts",
    Attributes={
        "DisplayName": "Prod Pipeline Alerts"
    },
    Tags=[{"Key": "Env", "Value": "prod"}]
)
topic_arn = topic_resp["TopicArn"]
print(f"Topic ARN: {topic_arn}")

# ── 2. Subscribe an SQS queue to the topic ───────────────────────
queue_url = sqs.get_queue_url(QueueName="pipeline-retry-queue")["QueueUrl"]
queue_arn  = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

sub_resp = sns.subscribe(
    TopicArn=topic_arn,
    Protocol="sqs",
    Endpoint=queue_arn,
    Attributes={
        "RawMessageDelivery": "true"   # skip the SNS JSON wrapper
    }
)
print(f"SQS subscription ARN: {sub_resp['SubscriptionArn']}")

# ── 3. Subscribe an email address ─────────────────────────────────
# Note: email subscriptions require manual confirmation via inbox link
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="email",
    Endpoint="data-engineering@company.com"
)
print("Email subscription created — confirm via inbox link")

# ── 4. Update SQS queue policy to allow SNS to send ──────────────
# (Without this, SNS cannot write to the SQS queue)
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect":    "Allow",
        "Principal": {"Service": "sns.amazonaws.com"},
        "Action":    "sqs:SendMessage",
        "Resource":  queue_arn,
        "Condition": {
            "ArnEquals": {"aws:SourceArn": topic_arn}
        }
    }]
}
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={"Policy": json.dumps(policy)}
)
print("SQS policy updated — SNS can now deliver messages")

⚠️ Must Update SQS Queue Policy

By default, SQS denies all cross-service writes. If you forget to add the sns.amazonaws.com allow policy to the SQS queue, messages from SNS will be silently dropped — no error on the publish side.
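On the consuming side, what lands in the queue depends on RawMessageDelivery. With it set to "true" the SQS body is exactly what was published; without it, the body is an SNS JSON envelope whose Message field holds the published text. The sketch below handles both cases, assuming the publisher sends JSON; unwrap_sns is a hypothetical helper name.

```python
import json


def unwrap_sns(sqs_body: str) -> dict:
    """Hypothetical helper: return the published payload from an SQS body,
    whether or not raw message delivery is enabled on the subscription."""
    doc = json.loads(sqs_body)

    # Envelope case: SNS wraps the payload and stores it as a JSON string
    if isinstance(doc, dict) and doc.get("Type") == "Notification" and "Message" in doc:
        return json.loads(doc["Message"])

    # Raw delivery case: the body already is the published payload
    return doc


# Usage inside a consumer loop (msg comes from sqs.receive_message):
# payload = unwrap_sns(msg["Body"])
# print(payload["pipeline"], payload["status"])
```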

🚨

Data Engineering Use Cases — Pipeline Alerts DE PATTERNS

Pipeline Failure Alert — The Standard Pattern

Every production pipeline should publish to an SNS topic on failure. This is far better than hardcoding email logic in your Spark code because SNS decouples the pipeline from the notification mechanism — you can add Slack, PagerDuty, or SMS alerts later without touching pipeline code.

python — publish pipeline failure alert from Glue/EMR job

import boto3, json, traceback
from datetime import datetime, timezone

sns       = boto3.client("sns")
ALERT_ARN = "arn:aws:sns:us-east-1:123456789012:prod-pipeline-alerts"

def publish_failure(pipeline_name: str, run_id: str, error: Exception) -> None:
    """Publish a structured failure event to SNS."""
    payload = {
        "status":        "FAILED",
        "pipeline":      pipeline_name,
        "run_id":        run_id,
        "timestamp":     datetime.now(timezone.utc).isoformat(),
        "error_type":    type(error).__name__,
        "error_message": str(error)[:500],          # truncate long messages
        "stack_trace":   traceback.format_exc()[:1000]
    }

    sns.publish(
        TopicArn=ALERT_ARN,
        Subject=f"❌ Pipeline FAILED: {pipeline_name}",
        Message=json.dumps(payload, indent=2),
        MessageAttributes={
            "pipeline_name": {
                "DataType":    "String",
                "StringValue": pipeline_name
            },
            "severity": {
                "DataType":    "String",
                "StringValue": "HIGH"
            }
        }
    )


# ── Wrap your entire pipeline in try/except ───────────────────────
import uuid

run_id        = str(uuid.uuid4())
pipeline_name = "silver-orders-etl"

try:
    # ... your actual Spark / Glue / EMR job logic here ...
    print("Running pipeline...")
    # simulate an error:
    raise ValueError("Source table orders has 0 rows — possible upstream failure")

except Exception as e:
    publish_failure(pipeline_name, run_id, e)
    raise   # re-raise so Glue/EMR marks the job as FAILED

📌 Always Re-Raise After Publishing

After publish_failure(), always raise the exception again. If you swallow it, Glue/EMR marks the job as SUCCEEDED even though it failed — CloudWatch alarms won't fire and the on-call engineer won't be paged.

CloudWatch Alarm → SNS → Notification

The most common production pattern is: CloudWatch detects a metric breach → fires an alarm → alarm sends to SNS topic → SNS fans out to email + PagerDuty. You don't even need to write any Python for this path — it's pure AWS configuration.

CW Metric Alarm → ALARM state → SNS Topic → Email ✉️ + PagerDuty 📟 + Slack 💬

python — wire a CloudWatch alarm to an SNS topic via boto3

import boto3

cw        = boto3.client("cloudwatch")
ALERT_ARN = "arn:aws:sns:us-east-1:123456789012:prod-pipeline-alerts"

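# Fire SNS alert if the Glue job runs longer than 90 minutes
# NOTE: confirm the exact metric name and dimension set under the Glue namespace
# in CloudWatch first. Glue job metrics are usually published with JobName,
# JobRunId, and Type dimensions, and an alarm only matches a metric when every
# dimension is specified.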
cw.put_metric_alarm(
    AlarmName="GlueJobDurationBreached-silver-orders",
    AlarmDescription="silver-orders-etl exceeded 90 min SLA",
    MetricName="glue.driver.ExecutorRunTime",
    Namespace="Glue",
    Dimensions=[{"Name": "JobName", "Value": "silver-orders-etl"}],
    Statistic="Maximum",
    Period=300,                         # check every 5 minutes
    EvaluationPeriods=1,
    Threshold=5400000,                  # 90 min in milliseconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[ALERT_ARN],            # SNS topic to notify
    OKActions=[ALERT_ARN],               # also notify when it recovers
    TreatMissingData="notBreaching"
)
print("CloudWatch alarm wired to SNS")
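If a Lambda function is subscribed to the alert topic, it receives the alarm as an SNS event whose Message field is a JSON string describing the state change. The sketch below pulls out the fields most useful for a chat or ticket message; summarize_alarm_event is a hypothetical name, and the field names are the ones CloudWatch alarm notifications normally carry.

```python
import json


def summarize_alarm_event(event: dict) -> list:
    """Hypothetical helper: one summary line per CloudWatch alarm
    notification delivered to Lambda through SNS."""
    summaries = []
    for record in event.get("Records", []):
        # The alarm details arrive as a JSON string inside the SNS record
        alarm = json.loads(record["Sns"]["Message"])
        reason = alarm.get("NewStateReason", "no reason given")
        summaries.append(
            f'{alarm["NewStateValue"]}: {alarm["AlarmName"]} ({reason})'
        )
    return summaries


def lambda_handler(event, context):
    for line in summarize_alarm_event(event):
        print(line)  # replace with a call to your chat or ticketing tool
    return {"alarms": len(event.get("Records", []))}
```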

Triggering Multiple Downstream Consumers

When a pipeline finishes successfully, you often need to notify several downstream systems: update a data catalog, trigger a downstream transformation job, send a Slack message to the business team, and update a dashboard. Rather than calling each from your pipeline code, publish one success event to SNS and let each subscriber handle its own action independently.

python — publish pipeline success event with metadata

import boto3, json
from datetime import datetime, timezone

sns          = boto3.client("sns")
SUCCESS_ARN  = "arn:aws:sns:us-east-1:123456789012:pipeline-success-events"

def publish_success(pipeline_name, run_id, rows_written, s3_path, duration_sec):
    payload = {
        "status":         "SUCCEEDED",
        "pipeline":       pipeline_name,
        "run_id":         run_id,
        "timestamp":      datetime.now(timezone.utc).isoformat(),
        "rows_written":   rows_written,
        "output_s3_path": s3_path,
        "duration_sec":   duration_sec
    }

    sns.publish(
        TopicArn=SUCCESS_ARN,
        Subject=f"✅ Pipeline SUCCEEDED: {pipeline_name}",
        Message=json.dumps(payload, indent=2),
        MessageAttributes={
            "pipeline_name": {
                "DataType":    "String",
                "StringValue": pipeline_name
            },
            "status": {
                "DataType":    "String",
                "StringValue": "SUCCEEDED"
            }
        }
    )

# Call at the end of your job
import uuid

publish_success(
    pipeline_name = "silver-orders-etl",
    run_id        = str(uuid.uuid4()),   # in a real job, reuse the run_id generated at start-up
    rows_written  = 4_823_441,
    s3_path       = "s3://data-lake/silver/orders/year=2024/month=06/day=15/",
    duration_sec  = 312
)

✅ Subscribers on This Success Topic

Subscriber 1 — SQS queue → Lambda → triggers the next pipeline (gold layer aggregation)
Subscriber 2 — Email → business BI team: "Silver orders table updated, 4.8M rows"
Subscriber 3 — HTTPS → Slack webhook → posts to #data-engineering channel
All triggered by the single sns.publish() call above.

🔍

Message Filtering — Filter Policies ADVANCED

What Are Filter Policies?

By default, every subscriber on a topic receives every message. Filter policies let you attach a JSON rule to a subscription so that subscriber only receives messages whose MessageAttributes match the rule. This lets you have one topic serve many use cases without each subscriber processing irrelevant messages.

📬 Analogy

Imagine one post box (SNS topic) that receives all mail. Without filtering, every tenant in the building must sort through all letters. With filter policies, each tenant puts a label on their mailbox — "only deliver envelopes marked CRITICAL" — so they only receive what's relevant to them.

python — set a filter policy on a subscription

import boto3, json

sns = boto3.client("sns")

# Existing subscription ARN (from subscribe() call)
subscription_arn = "arn:aws:sns:us-east-1:123456789012:prod-pipeline-alerts:abc123"

# ── Filter: only deliver messages where severity = "HIGH" or "CRITICAL"
#    AND pipeline_name starts with "silver-"
filter_policy = {
    "severity":      ["HIGH", "CRITICAL"],
    "pipeline_name": [{"prefix": "silver-"}]
}

sns.set_subscription_attributes(
    SubscriptionArn=subscription_arn,
    AttributeName="FilterPolicy",
    AttributeValue=json.dumps(filter_policy)
)
print("Filter policy applied — subscriber now only gets HIGH/CRITICAL silver-* alerts")

# ── To remove the filter (receive all messages again) ────────────
sns.set_subscription_attributes(
    SubscriptionArn=subscription_arn,
    AttributeName="FilterPolicy",
    AttributeValue="{}"   # an empty JSON object clears the filter
)

📌 Filter Is On the Subscription, Not the Topic

The filter is set per-subscription using set_subscription_attributes(). The publisher doesn't change anything — it just populates MessageAttributes on the publish() call and each subscriber's filter decides whether to accept or drop it.
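To reason about which messages a policy will let through, it can help to evaluate the policy locally. The function below is a deliberately simplified model, not SNS's implementation: it covers only exact string matches and the prefix operator, with AND across attribute names and OR across the values listed for one name. The name matches is hypothetical.

```python
def matches(filter_policy: dict, attributes: dict) -> bool:
    """Simplified local model of an SNS filter policy (exact and prefix only).

    attributes maps attribute name to its plain string value.
    """
    for name, allowed in filter_policy.items():
        if name not in attributes:
            return False  # every key in the policy must be present
        value = attributes[name]

        hit = False
        for rule in allowed:
            if isinstance(rule, dict) and "prefix" in rule:
                hit = hit or str(value).startswith(rule["prefix"])
            else:
                hit = hit or value == rule
        if not hit:
            return False  # AND across attribute names
    return True


policy = {"severity": ["HIGH", "CRITICAL"],
          "pipeline_name": [{"prefix": "silver-"}]}

print(matches(policy, {"severity": "HIGH", "pipeline_name": "silver-orders-etl"}))  # True
print(matches(policy, {"severity": "LOW", "pipeline_name": "silver-orders-etl"}))   # False
```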

🐍

Complete Boto3 SNS API Reference API REFERENCE

Topic Management — Create, List, Delete

python — topic CRUD operations

import boto3

sns = boto3.client("sns", region_name="us-east-1")

# ── CREATE ────────────────────────────────────────────────────────
resp = sns.create_topic(
    Name="prod-pipeline-alerts",
    Attributes={"DisplayName": "Prod Pipeline Alerts"},
    Tags=[{"Key": "Env", "Value": "prod"}]
)
topic_arn = resp["TopicArn"]   # idempotent — same ARN if topic already exists

# ── LIST (use paginator) ──────────────────────────────────────────
paginator = sns.get_paginator("list_topics")
for page in paginator.paginate():
    for topic in page["Topics"]:
        print(topic["TopicArn"])

# ── DELETE ────────────────────────────────────────────────────────
sns.delete_topic(TopicArn=topic_arn)
print("Topic deleted")

Subscribe, List, Unsubscribe

python — manage subscriptions

# ── SUBSCRIBE ─────────────────────────────────────────────────────
# SQS
sub = sns.subscribe(
    TopicArn=topic_arn,
    Protocol="sqs",
    Endpoint="arn:aws:sqs:us-east-1:123456789012:retry-queue",
    Attributes={"RawMessageDelivery": "true"},
    ReturnSubscriptionArn=True     # return the ARN even if the subscription is still pending confirmation
)
sub_arn = sub["SubscriptionArn"]

# Lambda
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="lambda",
    Endpoint="arn:aws:lambda:us-east-1:123456789012:function:failure-handler"
)

# HTTPS (Slack webhook via proxy)
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="https",
    Endpoint="https://hooks.slack.com/services/T.../B.../..."
)

# ── LIST subscriptions for a topic ───────────────────────────────
paginator = sns.get_paginator("list_subscriptions_by_topic")
for page in paginator.paginate(TopicArn=topic_arn):
    for s in page["Subscriptions"]:
        print(f"{s['Protocol']:10} → {s['Endpoint']}")

# ── UNSUBSCRIBE ───────────────────────────────────────────────────
sns.unsubscribe(SubscriptionArn=sub_arn)

publish() and publish_batch()

python — publish single and batch messages

# Reuses the `sns` client and `topic_arn` from the topic management snippet

# ── SINGLE PUBLISH ────────────────────────────────────────────────
sns.publish(
    TopicArn=topic_arn,
    Subject="❌ Pipeline FAILED: silver-orders-etl",   # email subject line
    Message='{"status":"FAILED","pipeline":"silver-orders-etl","rows":0}',
    MessageAttributes={
        "severity": {
            "DataType":    "String",
            "StringValue": "HIGH"
        },
        "pipeline_name": {
            "DataType":    "String",
            "StringValue": "silver-orders-etl"
        }
    }
)

# ── BATCH PUBLISH (up to 10 messages per call) ───────────────────
# Useful when you need to notify multiple pipeline failures at once
entries = [
    {
        "Id":      str(i),                          # unique ID within batch
        "Message": f'Pipeline {i} failed',
        "Subject": f'Alert: Pipeline {i}',
        "MessageAttributes": {
            "severity": {"DataType": "String", "StringValue": "HIGH"}
        }
    }
    for i in range(3)
]

batch_resp = sns.publish_batch(
    TopicArn=topic_arn,
    PublishBatchRequestEntries=entries
)

print(f"Successful: {len(batch_resp['Successful'])}")
print(f"Failed:     {len(batch_resp['Failed'])}")

# ── Handle partial failures in batch ─────────────────────────────
for failure in batch_resp.get("Failed", []):
    print(f"Failed ID {failure['Id']}: {failure['Code']} — {failure['Message']}")

Quick Reference — All Key SNS APIs

create_topic(), delete_topic(), list_topics() (paginator), subscribe(), unsubscribe(), list_subscriptions_by_topic() (paginator), publish(), publish_batch(), set_subscription_attributes() (filter policy), get_topic_attributes()

☁️ 29.16 Summary

SNS = pub/sub fan-out. One publish() call delivers to every subscriber simultaneously — email, SQS, Lambda, HTTPS, SMS. The critical production pattern is SNS → SQS fan-out: SNS provides the fan-out, SQS provides durable buffering so subscribers don't miss messages if they're temporarily down. Always update the SQS queue policy to allow sns.amazonaws.com to write. Use MessageAttributes + filter policies to route different message types to different subscribers on the same topic. In every pipeline, wrap job logic in try/except, call publish() on failure, and always re-raise so the job is marked FAILED in Glue/EMR.

29.17

AWS Lambda — Serverless Functions for Data Pipelines

Lambda lets you run Python code without managing any servers. You upload a function, configure a trigger, and AWS runs it whenever the trigger fires, typically starting within milliseconds on a warm execution environment (a cold start adds extra latency). You pay only for execution time, billed per millisecond. For Data Engineers, Lambda is the glue between pipeline stages: it reacts to file arrivals, triggers Glue jobs, updates metadata, sends alerts, and handles lightweight transformations.

⚡

Lambda Fundamentals — Handler, Event, Context, Runtime [CORE CONCEPT]

The Handler Function

Every Lambda function has a handler — a Python function that AWS calls when the trigger fires. It receives two arguments: event (the input data from the trigger) and context (runtime metadata like the function name, remaining time, and request ID). The handler's return value becomes the response for synchronous invocations.

🏭 Analogy

Lambda is like a vending machine. You configure what happens when someone presses a button (the trigger). When the button is pressed, the machine wakes up, does its job (dispenses the item), and goes back to sleep. You don't care about the machine's internals — you just care what happens when the button is pressed.

python — minimal Lambda handler structure

import json, logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event: dict, context) -> dict:
    """
    event   — dict containing the trigger payload (S3 event, SQS message, etc.)
    context — LambdaContext object with runtime metadata
    """
    # ── Context object useful fields ──────────────────────────────
    logger.info(f"Function name:    {context.function_name}")
    logger.info(f"Function version: {context.function_version}")
    logger.info(f"Request ID:       {context.aws_request_id}")
    logger.info(f"Memory (MB):      {context.memory_limit_in_mb}")
    logger.info(f"Remaining ms:     {context.get_remaining_time_in_millis()}")

    # ── Log the incoming event ────────────────────────────────────
    logger.info(f"Event received: {json.dumps(event, default=str)}")

    # ── Your business logic here ──────────────────────────────────
    result = {"status": "ok", "processed": True}

    return {
        "statusCode": 200,
        "body":       json.dumps(result)
    }

Memory, Timeout, and Runtimes

Lambda allocates CPU in proportion to configured memory, so raising memory also raises the CPU share. Per AWS documentation, a function gets the equivalent of one full vCPU at 1,769 MB and up to six vCPUs at 10,240 MB. For data engineering tasks (parsing files, calling boto3 APIs, triggering jobs), 256–512 MB is usually sufficient. Timeout can be up to 15 minutes: long enough to poll a Glue job status a few times, but not for heavy Spark work.

Setting	Range	DE Recommendation
Memory	128 MB – 10 GB	256–512 MB for orchestration; 1–3 GB for in-memory data processing
Timeout	1 sec – 15 min	30 sec for simple triggers; 5–10 min if polling an async job status
Runtime	Python 3.9 / 3.10 / 3.11 / 3.12 (newer versions are added over time)	Use the newest supported runtime for new functions (3.12 in this list); older runtimes are deprecated on a schedule
Ephemeral Storage	512 MB – 10 GB	/tmp directory; increase if you need to stage files before uploading to S3
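
Because the timeout is a hard cap, a handler that polls an asynchronous job should stop while it still has time to exit cleanly. The sketch below is one way to do that, not a definitive implementation. `poll_with_budget` is a hypothetical helper; `get_state` and `remaining_ms` are injected callables (in a real handler they would wrap `glue.get_job_run` and `context.get_remaining_time_in_millis`), and both the terminal-state set and the 10-second safety margin are assumptions.

```python
import time
from typing import Callable

# Assumed set of terminal job states; adjust for the service you poll.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}

def poll_with_budget(
    get_state: Callable[[], str],
    remaining_ms: Callable[[], int],
    interval_sec: float = 15.0,
    safety_margin_ms: int = 10_000,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job is terminal or the Lambda time budget is nearly used up.

    Returns the terminal state, or "STILL_RUNNING" if the budget ran out first.
    """
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        # Stop if one more sleep would eat into the safety margin.
        if remaining_ms() - interval_sec * 1000 < safety_margin_ms:
            return "STILL_RUNNING"
        sleep(interval_sec)

# Local demo with fake callables (no AWS calls are made).
states = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
print(poll_with_budget(lambda: next(states), lambda: 60_000,
                       interval_sec=1, sleep=lambda s: None))
```

When the helper returns "STILL_RUNNING", the handler can hand off to a later invocation (for example by re-queueing a message) instead of being killed mid-poll.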

Lambda Layers — Shared Libraries

A Layer is a ZIP archive that Lambda extracts under /opt before your function runs; Python packages placed in the archive's python/ folder land in /opt/python, which is on the import path. Multiple functions can share the same layer, and a function can attach up to five layers. This keeps your deployment package small and lets you manage library versions centrally, e.g. a single boto3-latest layer shared across all 40 pipeline Lambda functions.

bash — build and publish a Lambda layer with extra packages

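# ── 1. Install Linux-compatible wheels into a folder ─────────────
# --platform / --only-binary stop pip from shipping macOS or Windows binaries
# (use manylinux2014_aarch64 for arm64 functions).
# Function code plus all layers must stay under 250 MB unzipped.
mkdir -p layer/python
pip install pandas pyarrow tenacity -t layer/python/ \
  --platform manylinux2014_x86_64 --implementation cp \
  --python-version 3.12 --only-binary=:all: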

# ── 2. Zip the layer ──────────────────────────────────────────────
cd layer && zip -r ../my-de-layer.zip python/
cd ..

# ── 3. Publish the layer to AWS ───────────────────────────────────
aws lambda publish-layer-version \
  --layer-name     my-de-layer \
  --description    "pandas + pyarrow + tenacity for DE lambdas" \
  --zip-file       fileb://my-de-layer.zip \
  --compatible-runtimes python3.12

# ── 4. Attach layer to a function ─────────────────────────────────
aws lambda update-function-configuration \
  --function-name  file-arrival-handler \
  --layers         arn:aws:lambda:us-east-1:123456789012:layer:my-de-layer:1

🔔

Triggers for Data Pipelines — S3, SQS, SNS, EventBridge [KEY PATTERN]

S3 Event Trigger — File Arrival → Lambda

The most common data engineering Lambda pattern: a new file lands in S3, S3 fires an event notification, and Lambda immediately processes it — validates the file, kicks off a Glue job, or writes metadata. The event object contains the bucket name, object key, size, and event time.

python — Lambda triggered by S3 file arrival

import boto3, json, logging
from urllib.parse import unquote_plus

logger = logging.getLogger()
logger.setLevel(logging.INFO)

glue = boto3.client("glue")
s3   = boto3.client("s3")

def lambda_handler(event, context):
    # ── Parse S3 event ────────────────────────────────────────────
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key    = unquote_plus(record["s3"]["object"]["key"])
        size   = record["s3"]["object"]["size"]

        logger.info(f"New file: s3://{bucket}/{key} ({size} bytes)")

        # ── Validate: reject empty files ──────────────────────────
        if size == 0:
            logger.warning("Empty file — skipping")
            continue

        # ── Validate: check expected prefix ───────────────────────
        if not key.startswith("landing/orders/"):
            logger.info("File not in orders prefix — ignoring")
            continue

        # ── Trigger Glue job with file path as argument ───────────
        run = glue.start_job_run(
            JobName="bronze-orders-ingest",
            Arguments={
                "--source_bucket": bucket,
                "--source_key":    key,
                "--file_size":     str(size)
            }
        )
        logger.info(f"Started Glue job run: {run['JobRunId']}")

    return {"statusCode": 200, "body": "done"}

⚠️ URL-Decode the S3 Key

S3 event notifications URL-encode the object key — spaces become + and special chars become %XX. Always call unquote_plus(key) before using the key in any boto3 call, otherwise NoSuchKey errors will appear on files with spaces or special characters.
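
To see the decoding in isolation, here is a small sketch that can run locally without AWS. `parse_s3_records` is a hypothetical helper and `sample_event` is a hand-written, trimmed-down event that contains only the fields the helper reads.

```python
from urllib.parse import unquote_plus

def parse_s3_records(event: dict) -> list[tuple[str, str, int]]:
    """Return (bucket, decoded_key, size) for each record in an S3 event."""
    parsed = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        parsed.append((
            s3_info["bucket"]["name"],
            unquote_plus(s3_info["object"]["key"]),   # "+" -> space, %XX -> character
            s3_info["object"].get("size", 0),
        ))
    return parsed

# Hand-written sample: the object is really named "daily report(2024).csv".
sample_event = {
    "Records": [{
        "s3": {
            "bucket": {"name": "my-data-lake"},
            "object": {"key": "landing/orders/daily+report%282024%29.csv", "size": 1024},
        }
    }]
}

print(parse_s3_records(sample_event))
# [('my-data-lake', 'landing/orders/daily report(2024).csv', 1024)]
```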

SQS Trigger — Consume-Process-Delete Pattern

When Lambda is triggered by SQS, AWS polls the queue for you and delivers batches of messages to the handler. When the handler returns without error, Lambda deletes the whole batch from the queue. If the handler raises an exception, every message in the batch stays in the queue and becomes visible again after the visibility timeout, which naturally enables retries.

python — Lambda triggered by SQS, processes pipeline events

import boto3, json, logging
from botocore.exceptions import ClientError

logger = logging.getLogger()
logger.setLevel(logging.INFO)

glue = boto3.client("glue")
ddb  = boto3.resource("dynamodb").Table("pipeline_audit")

def lambda_handler(event, context):
    """Process a batch of SQS messages — each is a pipeline trigger."""
    failed_ids = []

    for record in event["Records"]:
        message_id = record["messageId"]
        try:
            body = json.loads(record["body"])
            logger.info(f"Processing message {message_id}: {body}")

            # ── Trigger Glue job ──────────────────────────────────────
            run = glue.start_job_run(
                JobName=body["job_name"],
                Arguments={"--run_date": body["run_date"]}
            )
            logger.info(f"Glue run started: {run['JobRunId']}")

        except (KeyError, json.JSONDecodeError) as e:
            # Non-retryable: bad message format. Reporting it as failed lets SQS
            # move it to the DLQ once the queue's maxReceiveCount is reached
            logger.error(f"Bad message format {message_id}: {e}")
            failed_ids.append({"itemIdentifier": message_id})

        except ClientError as e:
            # Retryable: Glue API error → keep in queue for retry
            logger.error(f"Glue error for {message_id}: {e}")
            failed_ids.append({"itemIdentifier": message_id})

    # ── Return partial failure report ─────────────────────────────
    # Successfully processed records are auto-deleted by Lambda
    # Failed record IDs are kept in queue for retry / DLQ routing
    return {"batchItemFailures": failed_ids}

📌 batchItemFailures — Partial Batch Success

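Return {"batchItemFailures": [{"itemIdentifier": msg_id}]} to tell Lambda which specific messages failed. Successfully processed messages are deleted; only failed ones go back to the queue (or to the DLQ after max retries). This only takes effect when the SQS event source mapping has ReportBatchItemFailures enabled in its FunctionResponseTypes; without that setting Lambda ignores the return value and treats the whole batch as successful. With the setting off, the only way to signal failure is to raise an exception, which causes the entire batch to be retried, including messages you already processed successfully.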
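
The partial-batch contract is easy to exercise locally. The following is a minimal sketch, assuming a hypothetical `process_sqs_batch` wrapper and a stand-in `start_job` function in place of the real Glue call; the event is hand-written and contains only the fields used here.

```python
import json
from typing import Callable

def process_sqs_batch(event: dict, handle: Callable[[dict], None]) -> dict:
    """Run `handle` on each SQS record body and report only the failed message IDs."""
    failures = []
    for record in event.get("Records", []):
        try:
            handle(json.loads(record["body"]))
        except Exception:
            # Any failure marks just this message for redelivery.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}

def start_job(body: dict) -> None:
    """Stand-in for real work; raises KeyError if a required field is missing."""
    _ = body["job_name"], body["run_date"]

fake_event = {"Records": [
    {"messageId": "m-1", "body": json.dumps({"job_name": "a", "run_date": "2024-06-15"})},
    {"messageId": "m-2", "body": "not json"},
    {"messageId": "m-3", "body": json.dumps({"job_name": "b"})},
]}

print(process_sqs_batch(fake_event, start_job))
# {'batchItemFailures': [{'itemIdentifier': 'm-2'}, {'itemIdentifier': 'm-3'}]}
```

Only m-2 (invalid JSON) and m-3 (missing run_date) are reported, so m-1 would be deleted from the queue while the other two are redelivered.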

EventBridge Scheduled Trigger — Cron Jobs

EventBridge can trigger Lambda on a cron or rate schedule — replacing traditional cron jobs entirely. The Lambda event contains the scheduled time and rule ARN. Use this for lightweight scheduled tasks: checking watermarks, sending daily summary emails, pruning old S3 files, or triggering a Glue crawler.

python — Lambda triggered by EventBridge schedule (daily at 06:00 UTC)

import boto3, logging
from datetime import datetime, timezone, timedelta

logger = logging.getLogger()
logger.setLevel(logging.INFO)

glue = boto3.client("glue")
sns  = boto3.client("sns")
ALERT_ARN = "arn:aws:sns:us-east-1:123456789012:prod-pipeline-alerts"

def lambda_handler(event, context):
    """Daily 06:00 UTC trigger — start Glue crawler then kick off ETL."""
    logger.info(f"Scheduled trigger fired: {event.get('time', 'unknown')}")

    run_date = (datetime.now(timezone.utc) - timedelta(days=1)).strftime("%Y-%m-%d")
    logger.info(f"Processing run_date: {run_date}")

    try:
        # ── Start Glue crawler to pick up yesterday's landed files ───
        # start_crawler() returns immediately; the ETL job below does not
        # wait for the crawl to finish
        glue.start_crawler(Name="landing-zone-crawler")
        logger.info("Crawler started")

        # ── Start the ETL job with the run date ───────────────────
        run = glue.start_job_run(
            JobName="daily-silver-etl",
            Arguments={"--run_date": run_date}
        )
        logger.info(f"ETL job started: {run['JobRunId']}")

        return {"status": "ok", "run_date": run_date, "run_id": run["JobRunId"]}

    except Exception as e:
        logger.error(f"Scheduled trigger failed: {e}")
        sns.publish(
            TopicArn=ALERT_ARN,
            Subject="[FAILED] Daily ETL trigger failed",   # plain ASCII is the safest choice for SNS subjects
            Message=str(e)
        )
        raise
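
The handler above derives run_date from the wall clock, so a retry that runs after midnight UTC would compute a different date. One hedged alternative is to derive it from the event's time field, which keeps retries deterministic. `run_date_from_event` is a hypothetical helper, and the timestamp format (UTC with a trailing Z, no fractional seconds) is an assumption about the scheduled event payload.

```python
from datetime import datetime, timedelta, timezone

def run_date_from_event(event: dict) -> str:
    """Return the day before the scheduled time as YYYY-MM-DD."""
    raw = event.get("time")
    if raw:
        # Assumed format, e.g. "2024-06-15T06:00:00Z"
        scheduled = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    else:
        # Fallback for manual test invocations that carry no "time" field
        scheduled = datetime.now(timezone.utc)
    return (scheduled - timedelta(days=1)).strftime("%Y-%m-%d")

print(run_date_from_event({"time": "2024-06-15T06:00:00Z"}))   # 2024-06-14
```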

🏗️

Data Engineering Use Cases — Real Patterns [DE PATTERNS]

Triggering EMR Steps from Lambda

Lambda can spin up an EMR cluster and submit a Spark step — or just add a step to an existing long-running cluster. This is useful when file arrival should kick off a Spark job that's too large for Glue (needs custom libraries, specific instance types, etc.).

python — Lambda submits a Spark step to an existing EMR cluster

import boto3, os, logging
from urllib.parse import unquote_plus

logger  = logging.getLogger()
logger.setLevel(logging.INFO)

emr     = boto3.client("emr")
CLUSTER = os.environ["EMR_CLUSTER_ID"]   # pass via env var, not hardcoded

def lambda_handler(event, context):
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key    = unquote_plus(event["Records"][0]["s3"]["object"]["key"])   # S3 events URL-encode keys

    resp = emr.add_job_flow_steps(
        JobFlowId=CLUSTER,
        Steps=[{
            "Name": "Process landed file",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit", "--deploy-mode", "cluster",
                    "s3://my-scripts/process_file.py",
                    "--bucket", bucket,
                    "--key",    key
                ]
            }
        }]
    )
    step_id = resp["StepIds"][0]
    logger.info(f"EMR step submitted: {step_id}")
    return {"step_id": step_id}

Metadata Updates to DynamoDB

After a pipeline completes (triggered by SNS success event), Lambda writes a structured audit record to DynamoDB — capturing run ID, status, row count, S3 output path, and duration. This builds up a complete run history that's queryable for SLA tracking and debugging.

python — Lambda writes pipeline audit record to DynamoDB

import boto3, json, logging
from datetime import datetime, timezone
from decimal  import Decimal

logger    = logging.getLogger()
logger.setLevel(logging.INFO)
ddb_table = boto3.resource("dynamodb").Table("pipeline_audit")

def lambda_handler(event, context):
    """SNS success event → write audit record to DynamoDB."""
    # SNS wraps the message in event["Records"][0]["Sns"]["Message"]
    payload = json.loads(event["Records"][0]["Sns"]["Message"])

    ddb_table.put_item(Item={
        "run_id":         payload["run_id"],
        "pipeline_name":  payload["pipeline"],
        "status":         payload["status"],
        "rows_written":   payload.get("rows_written", 0),
        "output_s3_path": payload.get("output_s3_path", ""),
        "duration_sec":   Decimal(str(payload.get("duration_sec", 0))),
        "recorded_at":    datetime.now(timezone.utc).isoformat()
    })
    logger.info(f"Audit record written for run {payload['run_id']}")
    return {"statusCode": 200}
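
The boto3 DynamoDB resource API rejects Python floats, which is why duration_sec is wrapped in Decimal above. If the payload may carry floats in other fields, a small recursive converter avoids handling each one by hand. This is a sketch; `to_dynamo` is a hypothetical helper name and the payload is made up for illustration.

```python
from decimal import Decimal

def to_dynamo(value):
    """Recursively convert floats to Decimal; leave every other type untouched."""
    if isinstance(value, float):
        return Decimal(str(value))          # str() avoids binary float artefacts
    if isinstance(value, dict):
        return {k: to_dynamo(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_dynamo(v) for v in value]
    return value

payload = {"run_id": "r-1", "rows_written": 100, "duration_sec": 312.5,
           "stats": {"dq": [0.98, 1]}}
print(to_dynamo(payload))
```

In the handler this would be applied to the whole item, for example `ddb_table.put_item(Item=to_dynamo(item))`.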

File Format Conversion — CSV → Parquet for Small Payloads

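For small files (roughly tens to a few hundred MB, depending on the function's memory and /tmp size), Lambda can convert CSV to Parquet using pandas and pyarrow, with no Spark cluster needed. The file is downloaded to /tmp, loaded into memory, converted, and uploaded back to S3. pandas typically needs several times the CSV size in RAM, so for anything larger, use Glue or EMR.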

python — Lambda converts CSV to Parquet using pandas + pyarrow

import boto3, pandas as pd, logging, os
from urllib.parse import unquote_plus

logger = logging.getLogger()
logger.setLevel(logging.INFO)
s3     = boto3.client("s3")

def lambda_handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key    = unquote_plus(record["s3"]["object"]["key"])

    # ── Only process CSV files ─────────────────────────────────────
    if not key.endswith(".csv"):
        logger.info("Not a CSV — skipping")
        return

    local_csv     = f"/tmp/{os.path.basename(key)}"
    local_parquet = local_csv.replace(".csv", ".parquet")
    out_key       = key.replace("landing/", "bronze/").replace(".csv", ".parquet")

    # ── Download CSV from S3 ───────────────────────────────────────
    s3.download_file(bucket, key, local_csv)
    logger.info(f"Downloaded {key} to {local_csv}")

    # ── Convert with pandas ────────────────────────────────────────
    df = pd.read_csv(local_csv)
    df.to_parquet(local_parquet, index=False, engine="pyarrow", compression="snappy")
    logger.info(f"Converted to Parquet: {df.shape[0]} rows, {df.shape[1]} cols")

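    # ── Upload Parquet to S3 bronze zone ──────────────────────────
    s3.upload_file(local_parquet, bucket, out_key)
    logger.info(f"Uploaded to s3://{bucket}/{out_key}")

    # ── Free /tmp: it can persist across warm invocations ─────────
    os.remove(local_csv)
    os.remove(local_parquet)

    return {"statusCode": 200, "output_key": out_key}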

⚡ /tmp Size Tip

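Lambda gives you 512 MB of /tmp by default. In the conversion above, the CSV and the Parquet output sit in /tmp at the same time, and pandas holds the data in memory as well, so size both /tmp and memory for the largest file you expect. For larger files, increase ephemeral storage up to 10 GB in the Lambda configuration, or use Glue/EMR instead. /tmp can persist between warm invocations, so delete staged files when you are done.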
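
A quick pre-flight check can reject files that will not fit before any download starts. The sketch below is illustrative only: `can_convert_in_lambda` is a hypothetical helper, and the Parquet-to-CSV size ratio and the pandas memory factor are rough assumptions, not measured values.

```python
MB = 1024 * 1024

def can_convert_in_lambda(
    csv_bytes: int,
    tmp_mb: int = 512,
    memory_mb: int = 1024,
    parquet_ratio: float = 0.5,     # assumed Parquet size relative to the CSV
    pandas_factor: float = 3.0,     # assumed RAM needed per byte of CSV
) -> bool:
    """Rough check: do /tmp and memory both have room for this file?"""
    tmp_needed = csv_bytes * (1 + parquet_ratio)   # CSV and Parquet coexist in /tmp
    ram_needed = csv_bytes * pandas_factor
    return tmp_needed <= tmp_mb * MB and ram_needed <= memory_mb * MB

print(can_convert_in_lambda(100 * MB))                                # True
print(can_convert_in_lambda(500 * MB))                                # False
print(can_convert_in_lambda(500 * MB, tmp_mb=2048, memory_mb=3008))   # True
```

In an S3-triggered handler, csv_bytes can come straight from the event's object size, so oversized files can be routed to Glue or EMR without touching /tmp.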

🛡️

Error Handling — Retries, DLQ, Structured Logging [PRODUCTION]

Sync vs Async Invocation — Retry Behaviour Differs

Lambda has three invocation paths (synchronous, asynchronous, and poll-based event source mappings) and they handle errors very differently. Understanding this is critical: silent event loss in production often comes down to not knowing which mode is in use.

Mode	Triggered By	On Error	DLQ Support
Synchronous	API Gateway, boto3 invoke (RequestResponse), Cognito	Error returned to caller immediately — no automatic retry	❌ No
Asynchronous	S3 events, SNS, EventBridge	AWS retries 2 more times (total 3 attempts) with delays between	✅ Yes
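Poll-Based	SQS, Kinesis, DynamoDB Streams	SQS: the message becomes visible again after the visibility timeout and moves to the DLQ after maxReceiveCount. Streams: the batch is retried until it succeeds or the records expire, unless a retry limit is configured	✅ SQS DLQ; on-failure destination for streams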

Dead Letter Queue (DLQ) for Async Lambda

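For async-triggered Lambda (S3 events, SNS), configure somewhere for failed events to land: either a classic DLQ (DeadLetterConfig on the function) or an on-failure destination, which the example below uses and which also captures the error details. In both cases an SQS queue receives the events Lambda couldn't process after all retries. Without one, failed events are silently discarded after 3 attempts, and you'd have no record that a file arrived but failed to trigger your pipeline.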

python — configure DLQ on a Lambda function via boto3

import boto3

lm = boto3.client("lambda")

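# Route failed async events to an SQS queue via an on-failure destination
# (the function's execution role needs sqs:SendMessage on the queue
#  and sns:Publish on the success topic)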
lm.put_function_event_invoke_config(
    FunctionName="file-arrival-handler",
    MaximumRetryAttempts=2,            # 0, 1, or 2 retries on async failure
    MaximumEventAgeInSeconds=3600,     # discard event if older than 1 hour
    DestinationConfig={
        "OnFailure": {
            "Destination": "arn:aws:sqs:us-east-1:123456789012:lambda-dlq"
        },
        "OnSuccess": {
            "Destination": "arn:aws:sns:us-east-1:123456789012:pipeline-success-events"
        }
    }
)
print("DLQ and success destination configured")

✅ Full Error Recovery Flow

S3 event → Lambda attempt 1 fails → retry after 1 min → attempt 2 fails → retry after 2 min → attempt 3 fails → event payload goes to DLQ SQS queue → CloudWatch alarm on DLQ depth → SNS alert → engineer investigates → manually re-processes from DLQ.
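
The manual re-processing step needs the original event back out of the failure queue. The sketch below assumes the message body is either an on-failure destination record (which wraps the event under requestPayload) or, for a classic DLQ, the raw event itself. `original_event` is a hypothetical helper and the sample envelope is hand-written and abbreviated.

```python
import json

def original_event(sqs_body: str) -> dict:
    """Recover the event Lambda failed to process from a failure-queue message body."""
    record = json.loads(sqs_body)
    # On-failure destinations wrap the event in an envelope; a classic DLQ
    # (DeadLetterConfig) stores the raw event as the message body.
    if isinstance(record, dict) and "requestPayload" in record:
        return record["requestPayload"]
    return record

s3_event = {"Records": [{"s3": {"bucket": {"name": "my-data-lake"},
                                "object": {"key": "landing/orders/a.csv", "size": 10}}}]}

# Abbreviated destination envelope (real records carry more fields).
envelope = json.dumps({
    "version": "1.0",
    "requestContext": {"condition": "RetriesExhausted", "approximateInvokeCount": 3},
    "requestPayload": s3_event,
    "responsePayload": {"errorMessage": "boom"},
})

print(original_event(envelope) == s3_event)               # True
print(original_event(json.dumps(s3_event)) == s3_event)   # True
```

The recovered dict can then be passed back to the function with an asynchronous invoke() once the underlying problem is fixed.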

Structured Logging to CloudWatch

Lambda automatically sends all print() and logging output to CloudWatch Logs. Use structured JSON logging so CloudWatch Logs Insights can discover the JSON fields automatically and query them with its pipe-based query language, for example to find all runs that processed over 1M rows, or all failures for a specific pipeline in the last 24 hours.

python — structured JSON logging pattern for Lambda

import json, logging, time
from datetime import datetime, timezone

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def log(level: str, message: str, **kwargs):
    """Emit a structured JSON log line — queryable in CloudWatch Log Insights."""
    entry = {
        "timestamp":    datetime.now(timezone.utc).isoformat(),
        "level":         level,
        "message":       message,
        **kwargs
    }
    print(json.dumps(entry))   # Lambda sends print() to CloudWatch automatically

def lambda_handler(event, context):
    start = time.time()

    log("INFO", "Lambda started",
        request_id=context.aws_request_id,
        function=context.function_name)

    try:
        # ... pipeline logic ...
        rows = 4_823_441
        log("INFO", "Pipeline completed",
            rows_processed=rows,
            duration_ms=int((time.time() - start) * 1000))
        return {"statusCode": 200, "rows": rows}

    except Exception as e:
        log("ERROR", "Pipeline failed",
            error_type=type(e).__name__,
            error_message=str(e),
            duration_ms=int((time.time() - start) * 1000))
        raise

# CloudWatch Log Insights query to find all failures in last 24h:
# fields @timestamp, error_type, error_message
# | filter level = "ERROR"
# | sort @timestamp desc
# | limit 50

🐍

Complete Boto3 Lambda API Reference [API REFERENCE]

invoke() — Synchronous and Asynchronous

python — invoke Lambda sync and async from another service

import boto3, json

lm = boto3.client("lambda")

# ── SYNCHRONOUS invoke — waits for response ───────────────────────
resp = lm.invoke(
    FunctionName="file-arrival-handler",
    InvocationType="RequestResponse",    # wait for result
    Payload=json.dumps({
        "bucket": "my-data-lake",
        "key":    "landing/orders/2024-06-15.csv"
    }).encode()
)

status_code = resp["StatusCode"]          # 200 = Lambda ran (not your function's return code)
result      = json.loads(resp["Payload"].read())
func_error  = resp.get("FunctionError")    # "Handled" or "Unhandled" if function threw

print(f"Status: {status_code}, FunctionError: {func_error}")
print(f"Result: {result}")

if func_error:
    raise RuntimeError(f"Lambda function failed: {result.get('errorMessage')}")

# ── ASYNCHRONOUS invoke — fire and forget ─────────────────────────
lm.invoke(
    FunctionName="daily-report-generator",
    InvocationType="Event",               # async: returns 202, no payload back
    Payload=json.dumps({"run_date": "2024-06-15"}).encode()
)
print("Async invocation fired — not waiting for result")

⚠️ StatusCode 200 ≠ Function Succeeded

resp["StatusCode"] == 200 means Lambda received your request and ran the function — not that the function logic succeeded. Always check resp.get("FunctionError") to detect runtime exceptions inside the handler.
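
The check can be wrapped in a small helper so every caller handles it the same way. This is a sketch, not the only way to do it: `parse_invoke_response` is a hypothetical name, and io.BytesIO stands in for the streaming Payload object that invoke() returns so the example runs without AWS.

```python
import io
import json

def parse_invoke_response(resp: dict) -> dict:
    """Return the decoded payload, or raise if the function itself failed."""
    payload = json.loads(resp["Payload"].read() or b"null")
    if resp.get("FunctionError"):
        message = payload.get("errorMessage", "unknown error") if isinstance(payload, dict) else str(payload)
        raise RuntimeError(f"Lambda function failed ({resp['FunctionError']}): {message}")
    return payload

# Fake responses shaped like the fields used above.
ok_resp  = {"StatusCode": 200, "Payload": io.BytesIO(b'{"rows": 10}')}
bad_resp = {"StatusCode": 200, "FunctionError": "Unhandled",
            "Payload": io.BytesIO(b'{"errorMessage": "boom", "errorType": "KeyError"}')}

print(parse_invoke_response(ok_resp))        # {'rows': 10}
try:
    parse_invoke_response(bad_resp)
except RuntimeError as exc:
    print(exc)                               # Lambda function failed (Unhandled): boom
```

Note that both fake responses carry StatusCode 200; only FunctionError distinguishes them.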

Function Management — Create, Update, Deploy

python — create and update Lambda functions via boto3

import boto3, zipfile, io

lm = boto3.client("lambda")

# ── Package code into a ZIP in memory ─────────────────────────────
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("handler.py")
zip_bytes = buf.getvalue()

# ── CREATE a new function ─────────────────────────────────────────
lm.create_function(
    FunctionName="file-arrival-handler",
    Runtime="python3.12",
    Role="arn:aws:iam::123456789012:role/lambda-de-role",
    Handler="handler.lambda_handler",      # filename.function_name
    Code={"ZipFile": zip_bytes},
    Description="Triggered on S3 landing file arrival",
    Timeout=300,                           # 5 minutes
    MemorySize=512,
    Environment={
        "Variables": {
            "EMR_CLUSTER_ID":  "j-ABCDEF123456",
            "ALERT_TOPIC_ARN": "arn:aws:sns:us-east-1:123456789012:prod-alerts"
        }
    }
)

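# ── UPDATE code (redeploy) ────────────────────────────────────────
# A new function starts in the Pending state; wait until it is Active
lm.get_waiter("function_active_v2").wait(FunctionName="file-arrival-handler")
lm.update_function_code(
    FunctionName="file-arrival-handler",
    ZipFile=zip_bytes,
    Publish=True                           # publish a new numbered version
)
# Wait for the code update to finish before changing configuration,
# otherwise the next call can fail with ResourceConflictException
lm.get_waiter("function_updated_v2").wait(FunctionName="file-arrival-handler")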

# ── UPDATE configuration (env vars, memory, timeout) ─────────────
lm.update_function_configuration(
    FunctionName="file-arrival-handler",
    Timeout=600,
    MemorySize=1024,
    Environment={
        "Variables": {
            "EMR_CLUSTER_ID":  "j-NEWCLUSTER",
            "ALERT_TOPIC_ARN": "arn:aws:sns:us-east-1:123456789012:prod-alerts"
        }
    }
)

# ── GET function info ─────────────────────────────────────────────
info = lm.get_function(FunctionName="file-arrival-handler")
print(f"Runtime: {info['Configuration']['Runtime']}")
print(f"Memory:  {info['Configuration']['MemorySize']} MB")
print(f"Timeout: {info['Configuration']['Timeout']} sec")

# ── LIST functions with paginator ─────────────────────────────────
paginator = lm.get_paginator("list_functions")
for page in paginator.paginate():
    for fn in page["Functions"]:
        # Container-image functions have no Runtime key, so use .get()
        print(f"{fn['FunctionName']:40} {fn.get('Runtime', 'image'):12} {fn['MemorySize']} MB")

# ── ADD S3 trigger permission (so S3 can invoke Lambda) ───────────
lm.add_permission(
    FunctionName="file-arrival-handler",
    StatementId="s3-invoke-permission",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn="arn:aws:s3:::my-data-lake",
    SourceAccount="123456789012"
)

# ── DELETE function ───────────────────────────────────────────────
lm.delete_function(FunctionName="file-arrival-handler")

Quick Reference — All Key Lambda APIs

invoke() (RequestResponse / Event) · create_function() · update_function_code() · update_function_configuration() · get_function() · list_functions() (paginator) · add_permission() (resource-based trigger policy) · put_function_event_invoke_config() (async retries + destinations) · delete_function() · publish_layer_version()

☁️ 29.17 Summary

Lambda = serverless event-driven code execution. Every handler receives event (trigger payload) and context (runtime metadata). Memory and CPU scale together — tune memory first for performance. The three DE trigger patterns are: S3 event (file arrival → Glue job), SQS (queue-driven pipeline orchestration with automatic retries), and EventBridge schedule (replacing cron jobs). Always configure a DLQ for async-triggered functions so failed events are never silently lost. Use batchItemFailures for SQS triggers to allow partial batch success. Structure logs as JSON for CloudWatch Log Insights queryability. For invoke(), always check FunctionError — StatusCode 200 only means Lambda ran, not that your code succeeded.

29.18

Amazon CloudWatch — Observability for Data Pipelines

CloudWatch is AWS's unified observability platform. It collects metrics (numbers over time), logs (text events), and fires alarms when thresholds are breached. For Data Engineers, CloudWatch is how you answer: "Did my pipeline finish on time? How many rows did it process? Why did it fail at 3 AM?" — without SSHing into any server.

📊

The Three Pillars — Metrics, Logs, Alarms [CORE CONCEPT]

How the Three Pillars Fit Together

CloudWatch's three components work as a complete observability loop: Metrics tell you what is happening (numbers over time). Logs tell you why it happened (text events with details). Alarms tell you when something needs attention (thresholds on metrics that trigger SNS notifications).

🏥 Analogy

Think of CloudWatch like a hospital monitoring system. Metrics are the vital signs on the screen (heart rate, blood pressure — numbers over time). Logs are the doctor's notes explaining what happened during each visit. Alarms are the beeping monitors that alert nurses when a vital sign goes out of range.

CLOUDWATCH OBSERVABILITY LOOP

┌──────────────────────────────────────────────────────────────┐
│  YOUR PIPELINE (Glue / EMR / Lambda)                         │
│                                                              │
│  boto3 put_metric_data() ──────→ METRICS                     │
│  print() / logging       ──────→ LOGS ──→ Log Insights       │
└──────────────────────────────────────────────────────────────┘
                 │ Metrics
                 ▼
        ┌─────────────────┐
        │   CLOUDWATCH    │  threshold breach?
        │     ALARM       │ ─────────────────→ SNS Topic
        └─────────────────┘                        │
                                                   ▼
                                       Email / Slack / PagerDuty

Built-In vs Custom Metrics

AWS services automatically publish metrics to CloudWatch — you get Glue job duration, EMR step status, Lambda invocation count, and MSK consumer lag for free without writing any code. But these built-in metrics don't know your business logic. Custom metrics — rows processed, DQ score, pipeline SLA status — must be published by your pipeline code using put_metric_data().

Type	Examples	How to Get Them	Cost
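Built-In	glue.driver.aggregate.elapsedTime (Glue), Duration (AWS/Lambda), IsIdle (AWS/ElasticMapReduce)	Automatic, no code needed (Glue job metrics must be enabled on the job)	Free (Basic Monitoring)
Custom	pipeline_rows_processed, dq_score, pipeline_duration_sec	put_metric_data() from your code	Roughly $0.30 per metric per month at list price (each unique name + dimension combination counts as one metric), plus a small per-request API charge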
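
Custom metric cost is driven by the number of distinct series, where a series is one unique combination of namespace, metric name, and dimension set. The sketch below counts series within a single namespace. `metric_series` is a hypothetical helper, and the price constant is an assumed list price that should be checked against current CloudWatch pricing.

```python
def metric_series(metric_data: list[dict]) -> set[tuple]:
    """Return the distinct (metric name, dimension set) combinations in a MetricData list."""
    series = set()
    for datum in metric_data:
        # Sort dimensions so their order does not create a "new" series.
        dims = tuple(sorted((d["Name"], d["Value"]) for d in datum.get("Dimensions", [])))
        series.add((datum["MetricName"], dims))
    return series

PRICE_PER_METRIC_MONTH = 0.30   # assumption: verify against current pricing

pipeline = {"Name": "PipelineName", "Value": "silver-orders-etl"}
env      = {"Name": "Environment",  "Value": "prod"}

data = [
    {"MetricName": "RowsProcessed", "Dimensions": [pipeline, env]},
    {"MetricName": "RowsProcessed", "Dimensions": [env, pipeline]},   # same series, other order
    {"MetricName": "RowsProcessed", "Dimensions": [pipeline]},        # different series
    {"MetricName": "DQScore",       "Dimensions": [pipeline]},
]

count = len(metric_series(data))
print(count, f"${count * PRICE_PER_METRIC_MONTH:.2f}/month")   # 3 $0.90/month
```

High-cardinality dimensions such as a run ID multiply the series count, so they are usually better kept in logs than in metric dimensions.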

📈

Custom Metrics — put_metric_data() from Pipeline Code [KEY PATTERN]

Publishing a Custom Metric — Structure and Concepts

Every custom metric belongs to a Namespace (a folder-like grouping), has a MetricName, a numeric Value, a Unit, and optional Dimensions (key-value labels that let you slice the metric, e.g. filter by pipeline name or environment). CloudWatch stores custom metrics at 1-minute resolution by default (standard); set StorageResolution to 1 on a datum to store it at 1-second resolution (high-res).

python — publish custom pipeline metrics from Glue/EMR/Lambda

import boto3
from datetime import datetime, timezone

cw = boto3.client("cloudwatch", region_name="us-east-1")

# ── Publish multiple pipeline metrics in ONE API call ─────────────
# (batch up to 1000 metrics per call — saves cost and latency)
cw.put_metric_data(
    Namespace="DataPlatform/Pipelines",     # your custom namespace
    MetricData=[
        # ① Rows written by this pipeline run
        {
            "MetricName": "RowsProcessed",
            "Value":      4_823_441,
            "Unit":       "Count",
            "Timestamp":  datetime.now(timezone.utc),
            "Dimensions": [
                {"Name": "PipelineName", "Value": "silver-orders-etl"},
                {"Name": "Environment",  "Value": "prod"}
            ]
        },
        # ② Pipeline run duration in seconds
        {
            "MetricName": "DurationSeconds",
            "Value":      312,
            "Unit":       "Seconds",
            "Timestamp":  datetime.now(timezone.utc),
            "Dimensions": [
                {"Name": "PipelineName", "Value": "silver-orders-etl"},
                {"Name": "Environment",  "Value": "prod"}
            ]
        },
        # ③ Data quality score (0.0 – 1.0)
        {
            "MetricName": "DQScore",
            "Value":      0.987,
            "Unit":       "None",        # use "None" for dimensionless ratios
            "Timestamp":  datetime.now(timezone.utc),
            "Dimensions": [
                {"Name": "PipelineName", "Value": "silver-orders-etl"}
            ]
        },
        # ④ Pipeline success/failure flag (1 = success, 0 = failure)
        {
            "MetricName": "PipelineSuccess",
            "Value":      1,
            "Unit":       "Count",
            "Timestamp":  datetime.now(timezone.utc),
            "Dimensions": [
                {"Name": "PipelineName", "Value": "silver-orders-etl"}
            ]
        }
    ]
)
print("Custom metrics published to CloudWatch")
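
When a pipeline emits more data points than fit in one request, the list has to be split before publishing. The sketch below shows one way to build data points and chunk them. `build_datum` and `chunked` are hypothetical helpers, and the 1,000-per-call limit is taken from the comment in the example above.

```python
from datetime import datetime, timezone
from typing import Iterator, Optional

MAX_METRICS_PER_CALL = 1000   # per-request limit quoted in the example above

def build_datum(name: str, value: float, unit: str = "Count",
                dims: Optional[dict] = None, high_res: bool = False) -> dict:
    """Build one MetricData entry; dims is a plain {name: value} dict."""
    datum = {
        "MetricName": name,
        "Value":      value,
        "Unit":       unit,
        "Timestamp":  datetime.now(timezone.utc),
        "Dimensions": [{"Name": k, "Value": v} for k, v in (dims or {}).items()],
    }
    if high_res:
        datum["StorageResolution"] = 1   # 1-second resolution; standard is 60
    return datum

def chunked(items: list, size: int = MAX_METRICS_PER_CALL) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

data = [build_datum(f"metric_{i}", i, dims={"Environment": "prod"}) for i in range(2500)]
print([len(batch) for batch in chunked(data)])   # [1000, 1000, 500]

# Publishing would then look like this (not executed here):
# for batch in chunked(data):
#     cw.put_metric_data(Namespace="DataPlatform/Pipelines", MetricData=batch)
```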

📌 Namespace Convention for Data Platforms

Use a hierarchical namespace like DataPlatform/Pipelines, DataPlatform/DQ, DataPlatform/SLA. This groups your metrics together in the CloudWatch console, making it easy to build dashboards per domain. Custom namespaces cannot begin with AWS/, which is reserved for AWS services.

Reusable Metrics Publisher — Use in Every Pipeline

Rather than scattering put_metric_data() calls throughout your code, build a small helper class and call it at the end of every pipeline run. This ensures consistent metric names, dimensions, and error handling across all pipelines.

python — reusable PipelineMetrics class for all pipelines

import boto3, logging
from datetime import datetime, timezone
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)

class PipelineMetrics:
    """Publish standard pipeline metrics to CloudWatch."""

    def __init__(self, pipeline_name: str, environment: str = "prod"):
        self.cw            = boto3.client("cloudwatch")
        self.namespace     = "DataPlatform/Pipelines"
        self.pipeline_name = pipeline_name
        self.environment   = environment
        self.base_dims     = [
            {"Name": "PipelineName", "Value": pipeline_name},
            {"Name": "Environment",  "Value": environment}
        ]

    def _put(self, metrics: list):
        """Publish (name, value, unit) tuples in a single put_metric_data call."""
        now = datetime.now(timezone.utc)
        try:
            self.cw.put_metric_data(
                Namespace=self.namespace,
                MetricData=[{
                    "MetricName": name,
                    "Value":      value,
                    "Unit":       unit,
                    "Timestamp":  now,
                    "Dimensions": self.base_dims
                } for name, value, unit in metrics]
            )
        except ClientError as e:
            # Never let metric publishing crash your pipeline
            logger.warning(f"CloudWatch metric failed: {e}")

    def record_success(self, rows: int, duration_sec: float, dq_score: float = 1.0):
        self._put([
            ("RowsProcessed",   rows,         "Count"),
            ("DurationSeconds", duration_sec, "Seconds"),
            ("DQScore",         dq_score,     "None"),
            ("PipelineSuccess", 1,            "Count"),
            ("PipelineFailure", 0,            "Count"),
        ])

    def record_failure(self, duration_sec: float):
        self._put([
            ("PipelineSuccess", 0,            "Count"),
            ("PipelineFailure", 1,            "Count"),
            ("DurationSeconds", duration_sec, "Seconds"),
        ])


# ── Usage in any pipeline ─────────────────────────────────────────
import time

metrics   = PipelineMetrics("silver-orders-etl")
start     = time.time()

try:
    # ... your Spark / Glue logic ...
    rows_written = 4_823_441
    dq_score     = 0.987
    metrics.record_success(rows_written, time.time() - start, dq_score)

except Exception as e:
    metrics.record_failure(time.time() - start)
    raise

🚨

CloudWatch Alarms — Alerting Architecture [PRODUCTION]

How Alarms Work — States and Transitions

A CloudWatch alarm watches one metric over a time window and transitions between three states based on whether the metric crosses a threshold. The AlarmActions list (SNS ARNs) is triggered on every state transition into ALARM — not on every data point.

State	Meaning	Actions Fired on Entering This State
OK	Metric is within threshold	OKActions list (optional — for recovery alerts)
ALARM	Metric has breached threshold	AlarmActions list → SNS → email/Slack/PagerDuty
INSUFFICIENT_DATA	Not enough data points yet	InsufficientDataActions list (optional)
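
The state logic can be approximated with a few lines of plain Python. This is a deliberately simplified model under stated assumptions: it handles only a greater-than-or-equal comparison, requires every evaluated period to breach, and supports three missing-data modes. Real CloudWatch evaluation has more options (M-out-of-N datapoints, an "ignore" mode), and both function names are hypothetical.

```python
from typing import Optional

def alarm_state(datapoints: list[Optional[float]], threshold: float,
                treat_missing: str = "notBreaching") -> str:
    """Simplified model: ALARM only if every evaluated period breaches (>= threshold)."""
    if treat_missing not in {"notBreaching", "breaching", "missing"}:
        raise ValueError(f"unsupported mode: {treat_missing}")
    if treat_missing == "missing" and all(dp is None for dp in datapoints):
        return "INSUFFICIENT_DATA"
    breaches = []
    for dp in datapoints:
        if dp is None:
            if treat_missing == "missing":
                continue                              # skip the gap entirely
            breaches.append(treat_missing == "breaching")
        else:
            breaches.append(dp >= threshold)
    return "ALARM" if breaches and all(breaches) else "OK"

def actions_fired(states: list[str]) -> list[tuple[int, str]]:
    """Return (index, new_state) for each state change; actions fire only on these."""
    fired, previous = [], None
    for i, state in enumerate(states):
        if previous is not None and state != previous:
            fired.append((i, state))
        previous = state
    return fired

print(alarm_state([1], threshold=1))                               # ALARM
print(alarm_state([None], threshold=1))                            # OK
print(alarm_state([None], threshold=1, treat_missing="missing"))   # INSUFFICIENT_DATA
print(actions_fired(["OK", "ALARM", "ALARM", "OK"]))               # [(1, 'ALARM'), (3, 'OK')]
```

The last line illustrates why a pipeline that stays in ALARM for several periods sends one notification, not one per data point.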

python — create production pipeline alarms

import boto3

cw        = boto3.client("cloudwatch")
ALERT_ARN = "arn:aws:sns:us-east-1:123456789012:prod-pipeline-alerts"

# ── ALARM 1: Pipeline failure detected ───────────────────────────
cw.put_metric_alarm(
    AlarmName="PipelineFailure-silver-orders-etl",
    AlarmDescription="silver-orders-etl reported a failure",
    Namespace="DataPlatform/Pipelines",
    MetricName="PipelineFailure",
    Dimensions=[
        {"Name": "PipelineName", "Value": "silver-orders-etl"},
        {"Name": "Environment",  "Value": "prod"}
    ],
    Statistic="Sum",
    Period=300,                          # 5-minute evaluation window
    EvaluationPeriods=1,
    Threshold=1,                          # any failure triggers alarm
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[ALERT_ARN],
    OKActions=[ALERT_ARN],
    TreatMissingData="notBreaching"       # no data = assume OK (pipeline not running)
)

# ── ALARM 2: DQ score dropped below 95% ──────────────────────────
cw.put_metric_alarm(
    AlarmName="DQScoreLow-silver-orders-etl",
    AlarmDescription="Data quality score below 95%",
    Namespace="DataPlatform/Pipelines",
    MetricName="DQScore",
    Dimensions=[                          # must match the published dimension set exactly
        {"Name": "PipelineName", "Value": "silver-orders-etl"},
        {"Name": "Environment",  "Value": "prod"}
    ],
    Statistic="Minimum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0.95,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=[ALERT_ARN],
    TreatMissingData="notBreaching"
)

# ── ALARM 3: Pipeline duration SLA breach (> 60 min) ─────────────
cw.put_metric_alarm(
    AlarmName="SLABreach-silver-orders-etl",
    AlarmDescription="silver-orders-etl exceeded 60 min SLA",
    Namespace="DataPlatform/Pipelines",
    MetricName="DurationSeconds",
    Dimensions=[
        {"Name": "PipelineName", "Value": "silver-orders-etl"},
        {"Name": "Environment",  "Value": "prod"}
    ],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=3600,                       # 60 minutes in seconds
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[ALERT_ARN],
    TreatMissingData="notBreaching"
)

# ── ALARM 4: DLQ depth > 0 (messages stuck in dead letter queue) ─
cw.put_metric_alarm(
    AlarmName="DLQNotEmpty-pipeline-dlq",
    AlarmDescription="Messages in DLQ — pipeline events failed all retries",
    Namespace="AWS/SQS",                 # built-in AWS namespace
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "pipeline-dlq"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[ALERT_ARN],
    TreatMissingData="notBreaching"
)
print("All pipeline alarms created")

Composite Alarms — Reduce Alert Noise

A composite alarm combines the states of several child alarms with AND, OR, and NOT logic. Use one to avoid alert storms: attach the paging action to the composite alarm only, so the on-call engineer gets a single page when either the failure alarm or the DQ alarm fires instead of two separate pages. Switch the rule to AND if you want to page only when both conditions hold at the same time.

python — composite alarm: one page when failure OR DQ breach fires

# Composite alarm fires when EITHER child alarm is in ALARM state
cw.put_composite_alarm(
    AlarmName="CriticalPipelineAlert-silver-orders",
    AlarmDescription="Page on-call: failure OR DQ breach in silver-orders",
    AlarmRule=(
        'ALARM("PipelineFailure-silver-orders-etl") OR '
        'ALARM("DQScoreLow-silver-orders-etl")'
    ),
    AlarmActions=[ALERT_ARN]
)
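
When the list of child alarms grows, building the AlarmRule string by hand gets error-prone. A minimal sketch of a rule builder follows; `alarm_rule` is a hypothetical helper, and it only covers flat AND/OR rules over the ALARM state.

```python
from typing import Iterable

def alarm_rule(alarm_names: Iterable[str], operator: str = "OR") -> str:
    """Build an AlarmRule expression such as 'ALARM("a") OR ALARM("b")'."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    terms = [f'ALARM("{name}")' for name in alarm_names]
    if not terms:
        raise ValueError("at least one child alarm is required")
    return f" {operator} ".join(terms)

children = ["PipelineFailure-silver-orders-etl", "DQScoreLow-silver-orders-etl"]
either = alarm_rule(children, "OR")    # page when any child is in ALARM
both = alarm_rule(children, "AND")     # page only when all children are in ALARM
print(either)
print(both)
```

Either string can be passed as the AlarmRule argument of put_composite_alarm().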

📋

CloudWatch Logs — Structured Logging and Log Insights OBSERVABILITY ▼

Log Groups and Log Streams

CloudWatch Logs organizes log data into log groups (one per service or application, e.g. /aws/lambda/<function-name> for a Lambda function or /aws-glue/jobs/output for Glue job output) and log streams (one per run or instance, e.g. the job run ID). AWS services like Lambda and Glue create these automatically. For custom applications (e.g. pipeline orchestration scripts), you create them yourself.

python — create log group, log stream, and publish log events

import boto3, json, time, logging
from datetime import datetime, timezone

logger = logging.getLogger(__name__)
cw     = boto3.client("logs")

LOG_GROUP  = "/dataplatform/pipelines/silver-orders-etl"
LOG_STREAM = f"run-{datetime.now(timezone.utc).strftime('%Y-%m-%dT%H-%M-%S')}"

# ── 1. Create log group (idempotent — safe to call if already exists) ─
try:
    cw.create_log_group(logGroupName=LOG_GROUP)
    cw.put_retention_policy(                    # keep logs for 30 days
        logGroupName=LOG_GROUP,
        retentionInDays=30
    )
except cw.exceptions.ResourceAlreadyExistsException:
    pass

# ── 2. Create log stream for this run ─────────────────────────────
try:
    cw.create_log_stream(logGroupName=LOG_GROUP, logStreamName=LOG_STREAM)
except cw.exceptions.ResourceAlreadyExistsException:
    pass

# ── 3. Publish log events ─────────────────────────────────────────
# Each event: {"timestamp": epoch_ms, "message": str}
# Must be in chronological order within one put_log_events call

def now_ms() -> int:
    return int(time.time() * 1000)

log_events = [
    {"timestamp": now_ms(), "message": json.dumps({
        "level": "INFO", "event": "pipeline_started",
        "pipeline": "silver-orders-etl", "run_date": "2024-06-15"
    })},
    {"timestamp": now_ms() + 1, "message": json.dumps({
        "level": "INFO", "event": "rows_written",
        "rows": 4_823_441, "table": "silver.orders"
    })},
    {"timestamp": now_ms() + 2, "message": json.dumps({
        "level": "INFO", "event": "pipeline_completed",
        "duration_sec": 312, "dq_score": 0.987
    })}
]

cw.put_log_events(
    logGroupName=LOG_GROUP,
    logStreamName=LOG_STREAM,
    logEvents=log_events
    # sequenceToken is no longer required: CloudWatch Logs ignores it, so
    # repeated put_log_events calls to the same stream work without one
)
print(f"Logs published to {LOG_GROUP}/{LOG_STREAM}")
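
A single put_log_events() call has batch limits in addition to the chronological-order rule. The sketch below sorts events and splits them into batches; `batch_log_events` is a hypothetical helper, the three constants reflect the limits documented for PutLogEvents at the time of writing, and other constraints (such as the maximum time span of one batch or a single oversized event) are not handled.

```python
import json
from typing import Iterator, List

MAX_EVENTS = 10_000          # maximum events per batch
MAX_BYTES = 1_048_576        # maximum batch size in bytes
EVENT_OVERHEAD = 26          # bytes counted per event on top of the message

def batch_log_events(events: List[dict]) -> Iterator[List[dict]]:
    """Yield chronologically sorted batches that respect the two batch limits."""
    batch, batch_bytes = [], 0
    for event in sorted(events, key=lambda e: e["timestamp"]):
        size = len(event["message"].encode("utf-8")) + EVENT_OVERHEAD
        if batch and (len(batch) >= MAX_EVENTS or batch_bytes + size > MAX_BYTES):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(event)
        batch_bytes += size
    if batch:
        yield batch

# 25,000 small events, deliberately generated in reverse time order
events = [{"timestamp": 1_700_000_000_000 + (25_000 - i),
           "message": json.dumps({"level": "INFO", "seq": i})}
          for i in range(25_000)]
batches = list(batch_log_events(events))
print([len(b) for b in batches])
```

Each yielded batch can be passed as the logEvents argument of one put_log_events() call.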

CloudWatch Log Insights — Query Your Logs with SQL-Like Syntax

Log Insights lets you query across all log streams in a log group using a simple query language. If you structured your logs as JSON (as shown above), you can filter by any field — finding all pipeline runs that processed over 1M rows, all ERROR-level events in the last 24 hours, or the average duration per pipeline over the last week.

python — run a Log Insights query via boto3 and fetch results

import boto3, time
from datetime import datetime, timezone, timedelta

cw = boto3.client("logs")

# ── Start the query ───────────────────────────────────────────────
now        = datetime.now(timezone.utc)
start_time = now - timedelta(hours=24)

query_resp = cw.start_query(
    logGroupName="/dataplatform/pipelines/silver-orders-etl",
    startTime=int(start_time.timestamp()),
    endTime=int(now.timestamp()),
    queryString="""
        fields @timestamp, event, rows, duration_sec, dq_score
        | filter level = "INFO" and event = "pipeline_completed"
        | sort @timestamp desc
        | limit 20
    """
)
query_id = query_resp["queryId"]

# ── Poll until the query finishes ─────────────────────────────────
while True:
    result = cw.get_query_results(queryId=query_id)
    status = result["status"]
    print(f"Query status: {status}")
    if status in ["Complete", "Failed", "Cancelled", "Timeout"]:  # terminal states
        break
    time.sleep(1)

# ── Parse results ─────────────────────────────────────────────────
# Each result row is a list of {"field": name, "value": val} dicts
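for row in result["results"]:
    record = {item["field"]: item["value"] for item in row}
    # pipeline_completed events carry duration_sec and dq_score (rows is logged separately)
    print(f"  {record.get('@timestamp')} | duration={record.get('duration_sec')}s | dq={record.get('dq_score')}")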
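
In a pipeline, an unbounded polling loop is risky. A minimal sketch of a bounded wait follows; `wait_for_query` and `FakeLogsClient` are hypothetical names, and the fake client stands in for boto3.client("logs") so the example runs without AWS access.

```python
import time
from typing import Any, Dict, List

TERMINAL = {"Complete", "Failed", "Cancelled", "Timeout"}

def wait_for_query(client: Any, query_id: str, timeout_sec: float = 60.0,
                   poll_sec: float = 1.0) -> List[Dict[str, str]]:
    """Poll get_query_results until a terminal status, then return rows as dicts."""
    deadline = time.monotonic() + timeout_sec
    while True:
        result = client.get_query_results(queryId=query_id)
        status = result["status"]
        if status in TERMINAL:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"query {query_id} still {status} after {timeout_sec}s")
        time.sleep(poll_sec)
    if status != "Complete":
        raise RuntimeError(f"query {query_id} ended with status {status}")
    return [{item["field"]: item["value"] for item in row}
            for row in result["results"]]

class FakeLogsClient:
    """Stand-in for the real logs client: reports Running twice, then Complete."""
    def __init__(self) -> None:
        self.calls = 0

    def get_query_results(self, queryId: str) -> dict:
        self.calls += 1
        if self.calls < 3:
            return {"status": "Running", "results": []}
        return {"status": "Complete", "results": [[
            {"field": "@timestamp", "value": "2024-06-15 06:31:02.000"},
            {"field": "dq_score", "value": "0.987"},
        ]]}

fake = FakeLogsClient()
rows = wait_for_query(fake, "example-query-id", poll_sec=0.01)
print(rows)
```

Note that Log Insights returns every value as a string, so numeric fields such as dq_score need converting before comparison.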

✅ Useful Log Insights Queries for Data Engineers

Find all errors in last 24h:
filter level = "ERROR" | fields @timestamp, error_type, error_message | sort @timestamp desc

Average pipeline duration per run date:
filter event = "pipeline_completed" | stats avg(duration_sec) by run_date

Pipelines with DQ score below 95%:
filter event = "pipeline_completed" and dq_score < 0.95 | fields @timestamp, pipeline, dq_score

filter_log_events() — Search Across Streams Without Log Insights

For simple keyword searches without the query language, use filter_log_events(). It scans all streams in a log group for a pattern match. Useful for quickly finding a specific run ID or error string in production.

python — search logs for a specific run ID or error pattern

import boto3
from datetime import datetime, timezone, timedelta

cw  = boto3.client("logs")
now = datetime.now(timezone.utc)

paginator = cw.get_paginator("filter_log_events")
pages = paginator.paginate(
    logGroupName="/dataplatform/pipelines/silver-orders-etl",
    startTime=int((now - timedelta(hours=6)).timestamp() * 1000),  # epoch ms
    endTime=int(now.timestamp() * 1000),
    filterPattern='"pipeline_failed"'     # exact phrase match
)

for page in pages:
    for event in page["events"]:
        print(f"Stream: {event['logStreamName']}")
        print(f"Time:   {datetime.fromtimestamp(event['timestamp']/1000, tz=timezone.utc)}")
        print(f"Msg:    {event['message']}\n")
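
Because the logs in this section are JSON, filterPattern can also match on fields rather than raw text, using the JSON filter syntax `{ $.field = "value" }`. The sketch below builds such patterns; `json_filter` is a hypothetical helper that handles only equality on string and numeric values.

```python
def json_filter(**conditions) -> str:
    """Build a JSON filter pattern that requires every field to equal its value."""
    terms = []
    for field, value in conditions.items():
        # Strings are quoted; numbers are written bare
        literal = f'"{value}"' if isinstance(value, str) else str(value)
        terms.append(f"($.{field} = {literal})")
    return "{ " + " && ".join(terms) + " }"

pattern = json_filter(level="ERROR", pipeline="silver-orders-etl")
print(pattern)
```

The resulting string can replace the quoted phrase in the filterPattern argument shown above.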

📺

CloudWatch Dashboards — Pipeline Health at a Glance MONITORING ▼

Creating a Pipeline Health Dashboard via boto3

A CloudWatch Dashboard is a JSON-defined collection of widgets — line charts, number widgets, alarm status panels. You define the dashboard body as a JSON string and call put_dashboard(). The result is a live monitoring page in the AWS console that your team can bookmark.

python — create a pipeline health dashboard

import boto3, json

cw = boto3.client("cloudwatch")

dashboard_body = {
    "widgets": [
        # ── Widget 1: Rows processed over time ─────────────────────
        {
            "type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title":  "Rows Processed Per Run",
                "view":   "timeSeries",
                "region": "us-east-1",      # metric widgets require a region
                "period": 300,
                "metrics": [[
                    "DataPlatform/Pipelines", "RowsProcessed",
                    "PipelineName", "silver-orders-etl",
                    "Environment", "prod"
                ]]
            }
        },
        # ── Widget 2: DQ score over time ────────────────────────────
        {
            "type": "metric", "x": 12, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title":  "Data Quality Score",
                "view":   "timeSeries",
                "region": "us-east-1",      # metric widgets require a region
                "period": 300,
                "metrics": [[
                    "DataPlatform/Pipelines", "DQScore",
                    "PipelineName", "silver-orders-etl",
                    "Environment", "prod"   # match the published dimension set
                ]]
            }
        },
        # ── Widget 3: Alarm status panel ────────────────────────────
        {
            "type": "alarm", "x": 0, "y": 6, "width": 24, "height": 3,
            "properties": {
                "title": "Pipeline Alarm Status",
                "alarms": [
                    "arn:aws:cloudwatch:us-east-1:123456789012:alarm:PipelineFailure-silver-orders-etl",
                    "arn:aws:cloudwatch:us-east-1:123456789012:alarm:DQScoreLow-silver-orders-etl",
                    "arn:aws:cloudwatch:us-east-1:123456789012:alarm:DLQNotEmpty-pipeline-dlq"
                ]
            }
        }
    ]
}

cw.put_dashboard(
    DashboardName="DataPlatform-Pipeline-Health",
    DashboardBody=json.dumps(dashboard_body)
)
print("Dashboard created: DataPlatform-Pipeline-Health")
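
Metric widgets repeat the same structure, so a small builder keeps the dashboard body short and consistent. This is a sketch only: `metric_widget` is a hypothetical helper, and the default region and namespace are assumptions taken from the examples in this section.

```python
import json
from typing import Dict, List

def metric_widget(title: str, metric_name: str, dimensions: Dict[str, str],
                  x: int, y: int, region: str = "us-east-1",
                  namespace: str = "DataPlatform/Pipelines") -> dict:
    """Build one 12x6 time-series metric widget for a dashboard body."""
    metric: List[str] = [namespace, metric_name]
    for name, value in dimensions.items():
        metric += [name, value]          # dimensions are flattened name/value pairs
    return {
        "type": "metric", "x": x, "y": y, "width": 12, "height": 6,
        "properties": {"title": title, "view": "timeSeries", "region": region,
                       "period": 300, "metrics": [metric]},
    }

dims = {"PipelineName": "silver-orders-etl", "Environment": "prod"}
body = {"widgets": [
    metric_widget("Rows Processed Per Run", "RowsProcessed", dims, x=0, y=0),
    metric_widget("Data Quality Score", "DQScore", dims, x=12, y=0),
]}
dashboard_json = json.dumps(body)
print(len(body["widgets"]), "widgets,", len(dashboard_json), "bytes")
```

The `dashboard_json` string is what put_dashboard() expects as DashboardBody.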

🔭

Pipeline Observability Patterns — SLA, Freshness, Anomaly ADVANCED ▼

SLA Tracking — Expected vs Actual Completion

Publish a PipelineSuccess metric (value = 1) when the pipeline finishes, then alarm when the sum of PipelineSuccess over the expected run interval is zero, meaning the pipeline never completed. This catches the silent failure: a pipeline that simply didn't execute rather than crashed. CloudWatch alarms evaluate a rolling window, not a fixed wall-clock window, so for a daily pipeline use a 24-hour period. A 1-hour period would sit in ALARM during every hour in which the pipeline is not scheduled to run.

python — SLA miss alarm: fire if no successful run in the last 24 hours

# Alarm: if the sum of PipelineSuccess over the trailing 24 hours is 0,
# the daily run did not complete → SLA breach.
# The window is rolling: a run that finishes later than the previous day's run
# can trip the alarm briefly. A strict "done by 07:00 UTC" deadline needs a
# scheduled check rather than an alarm period.
cw.put_metric_alarm(
    AlarmName="SLAMiss-silver-orders-etl-daily",
    AlarmDescription="silver-orders-etl has not succeeded in the last 24 hours",
    Namespace="DataPlatform/Pipelines",
    MetricName="PipelineSuccess",
    Dimensions=[
        {"Name": "PipelineName", "Value": "silver-orders-etl"},
        {"Name": "Environment",  "Value": "prod"}
    ],
    Statistic="Sum",
    Period=86400,                         # 24-hour rolling window
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=[ALERT_ARN],
    TreatMissingData="breaching"          # ← KEY: no data = SLA breach
)

📌 TreatMissingData = "breaching" is Critical for SLA Alarms

For SLA alarms, set TreatMissingData="breaching". This means "if no metric data arrived in this window, treat it as a threshold breach." Without this, a pipeline that simply never ran would show as OK — the alarm would never fire for the silent failure case.
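
An alarm period cannot be pinned to a wall-clock deadline, so a strict "completed by 07:00 UTC" rule is often checked by a small scheduled job instead. The sketch below shows only the decision logic; `sla_missed` is a hypothetical function, all datetimes are assumed to be in UTC, and where `last_success` comes from (for example a metadata table) is left open.

```python
from datetime import datetime, time, timezone
from typing import Optional

def sla_missed(last_success: Optional[datetime], now: datetime,
               deadline: time = time(7, 0)) -> bool:
    """True if today's deadline has passed and no success was recorded today."""
    if now.time() < deadline:
        return False                      # deadline not reached yet
    return last_success is None or last_success.date() < now.date()

now = datetime(2024, 6, 15, 7, 5, tzinfo=timezone.utc)
on_time = datetime(2024, 6, 15, 6, 31, tzinfo=timezone.utc)
yesterday = datetime(2024, 6, 14, 6, 30, tzinfo=timezone.utc)

print(sla_missed(on_time, now))      # False: today's run finished at 06:31
print(sla_missed(yesterday, now))    # True: last success was yesterday
print(sla_missed(None, now))         # True: pipeline has never succeeded
```

Run shortly after the deadline, a check like this can publish a metric or message to the same SNS topic the alarms use.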

Row Count Anomaly Detection

Publish RowsProcessed after every run. If today's run wrote 10M rows but yesterday wrote 5M, that's either a data explosion or a bug — both worth alerting on. Use CloudWatch Anomaly Detection to automatically set dynamic thresholds based on historical patterns, rather than hard-coding a fixed number.

python — enable anomaly detection on RowsProcessed metric

# Put an anomaly detector on the RowsProcessed metric
cw.put_anomaly_detector(
    Namespace="DataPlatform/Pipelines",
    MetricName="RowsProcessed",
    Dimensions=[
        {"Name": "PipelineName", "Value": "silver-orders-etl"},
        {"Name": "Environment",  "Value": "prod"}
    ],
    Stat="Sum",
    Configuration={
        "ExcludedTimeRanges": []  # optionally exclude known anomalies from training
    }
)
print("Anomaly detector training started — takes ~2 weeks of data")
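
A detector on its own does not alert; an alarm has to compare the metric against the detector's band. The sketch below builds the keyword arguments for such an alarm using the ANOMALY_DETECTION_BAND metric math function. `anomaly_alarm_kwargs` is a hypothetical helper, and the alarm name, period, and band width of 2 standard deviations are illustrative assumptions.

```python
from typing import Dict, List

def anomaly_alarm_kwargs(alarm_name: str, namespace: str, metric_name: str,
                         dimensions: List[Dict[str, str]], alert_arn: str,
                         band_width: int = 2, period: int = 300) -> dict:
    """Keyword arguments for put_metric_alarm using an anomaly detection band."""
    return {
        "AlarmName": alarm_name,
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 1,
        "ThresholdMetricId": "band",     # compare m1 against the band, not a fixed number
        "Metrics": [
            {"Id": "m1", "ReturnData": True, "MetricStat": {
                "Metric": {"Namespace": namespace, "MetricName": metric_name,
                           "Dimensions": dimensions},
                "Period": period, "Stat": "Sum"}},
            {"Id": "band", "ReturnData": True,
             "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})"},
        ],
        "AlarmActions": [alert_arn],
        "TreatMissingData": "notBreaching",
    }

kwargs = anomaly_alarm_kwargs(
    alarm_name="RowCountAnomaly-silver-orders-etl",
    namespace="DataPlatform/Pipelines",
    metric_name="RowsProcessed",
    dimensions=[{"Name": "PipelineName", "Value": "silver-orders-etl"},
                {"Name": "Environment",  "Value": "prod"}],
    alert_arn="arn:aws:sns:us-east-1:123456789012:prod-pipeline-alerts",
)
print(kwargs["Metrics"][1]["Expression"])
# With a real client: cw.put_metric_alarm(**kwargs)
```

A wider band (a larger second argument) makes the alarm less sensitive to normal day-to-day variation in row counts.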

🐍

Complete Boto3 CloudWatch API Reference API REFERENCE ▼

All Key APIs — Metrics, Alarms, Logs, Dashboards

python — complete CloudWatch API quick reference

import boto3, json

cw   = boto3.client("cloudwatch")   # metrics + alarms + dashboards
logs = boto3.client("logs")         # log groups, streams, events, insights

# ════════════════════════════════════════════════════════════════
# METRICS
# ════════════════════════════════════════════════════════════════

# Publish custom metric data points
cw.put_metric_data(Namespace="...", MetricData=[...])

# Get metric statistics (average, sum, min, max over a time range)
cw.get_metric_statistics(
    Namespace="DataPlatform/Pipelines", MetricName="DurationSeconds",
    Dimensions=[{"Name": "PipelineName", "Value": "silver-orders-etl"}],
    StartTime=..., EndTime=..., Period=86400, Statistics=["Average", "Maximum"]
)

# Batch query multiple metrics simultaneously (more efficient than get_metric_statistics)
cw.get_metric_data(
    MetricDataQueries=[
        {"Id": "rows", "MetricStat": {
            "Metric": {"Namespace": "DataPlatform/Pipelines", "MetricName": "RowsProcessed",
                        "Dimensions": [{"Name": "PipelineName", "Value": "silver-orders-etl"}]},
            "Period": 86400, "Stat": "Sum"
        }}
    ],
    StartTime=..., EndTime=...
)

# ════════════════════════════════════════════════════════════════
# ALARMS
# ════════════════════════════════════════════════════════════════

cw.put_metric_alarm(AlarmName="...", ...)     # create or update alarm
cw.put_composite_alarm(AlarmName="...", AlarmRule="...", ...)
cw.describe_alarms(AlarmNames=["..."])         # get current state + config
cw.describe_alarm_history(AlarmName="...")     # state transition history
cw.set_alarm_state(                              # manually force state (testing)
    AlarmName="test-alarm",
    StateValue="ALARM",
    StateReason="Manual test"
)
cw.delete_alarms(AlarmNames=["..."])

# ════════════════════════════════════════════════════════════════
# LOGS
# ════════════════════════════════════════════════════════════════

logs.create_log_group(logGroupName="...")
logs.put_retention_policy(logGroupName="...", retentionInDays=30)
logs.create_log_stream(logGroupName="...", logStreamName="...")
logs.put_log_events(logGroupName="...", logStreamName="...", logEvents=[...])
logs.get_paginator("filter_log_events").paginate(logGroupName="...", filterPattern="...")
logs.describe_log_streams(logGroupName="...")

# Log Insights
logs.start_query(logGroupName="...", queryString="...", startTime=..., endTime=...)
logs.get_query_results(queryId="...")          # poll until status = "Complete"

# ════════════════════════════════════════════════════════════════
# DASHBOARDS
# ════════════════════════════════════════════════════════════════

cw.put_dashboard(DashboardName="...", DashboardBody=json.dumps({...}))
cw.get_dashboard(DashboardName="...")
cw.list_dashboards()
cw.delete_dashboards(DashboardNames=["..."])

Quick Reference — All Key CloudWatch APIs

put_metric_data() · get_metric_statistics() · get_metric_data() · put_metric_alarm() · put_composite_alarm() · describe_alarms() · set_alarm_state() (testing) · create_log_group() · put_log_events() · filter_log_events() paginator · start_query() (Log Insights) · get_query_results() · put_dashboard() · put_anomaly_detector()

☁️ 29.18 Summary

CloudWatch = the observability backbone for every data pipeline. Use custom metrics (put_metric_data()) to track rows processed, DQ score, duration, and success/failure — AWS built-ins don't know your business logic. Wire alarms to SNS topics for automated alerting on failure, DQ drops, SLA breaches, and DLQ depth spikes. Use TreatMissingData="breaching" on SLA alarms so a pipeline that simply never ran triggers the alarm. Structure all logs as JSON so CloudWatch Log Insights can query them by field. Batch metrics together in a single put_metric_data() call (up to 1000 per call) to reduce cost. Use set_alarm_state() in staging to test your SNS alerting pipeline before going to production.

29.19

AWS VPC for Data Engineers

A VPC (Virtual Private Cloud) is the private network every Glue job, EMR cluster, RDS database, and Lambda function lives inside. As a Data Engineer you rarely build VPCs from scratch, but you must understand subnets, route tables, security groups, and especially VPC Endpoints, because many "it works in my notebook but fails in Glue" problems turn out to be network connectivity issues, not code issues.

🧱

VPC & Subnets — Public vs Private CORE CONCEPT ▼

What Is a VPC?

A VPC is an isolated, private slice of the AWS network — your own virtual data center with its own IP address range. Every resource that needs network connectivity — EMR clusters, RDS databases, Glue jobs (when configured with a VPC connection), Lambda functions (when accessing private resources) — runs inside a VPC, even though Glue and Lambda look "serverless" from the outside.

🏢 Analogy

A VPC is like a private office building you lease in a city. The city is the AWS region. Your building (VPC) has its own address range, its own internal hallways (subnets), its own security desk (security groups), and you decide which doors open to the public street (internet) and which stay internal-only.

VPC LAYOUT — TYPICAL DATA PLATFORM

┌──────────────────────────────────────────────────────────────────┐
│                        VPC (10.0.0.0/16)                         │
│                                                                  │
│  ┌─────────────────────────┐   ┌─────────────────────────────┐   │
│  │  PUBLIC SUBNET          │   │  PRIVATE SUBNET             │   │
│  │  10.0.1.0/24            │   │  10.0.10.0/24               │   │
│  │                         │   │                             │   │
│  │  NAT Gateway            │   │  EMR Cluster                │   │
│  │  Bastion Host           │   │  RDS Database               │   │
│  │                         │   │  Glue Job ENIs              │   │
│  └────────────┬────────────┘   └──────────────┬──────────────┘   │
│               │                               │                  │
│       Internet Gateway                  route via NAT            │
└───────────────┼───────────────────────────────┼──────────────────┘
                ▼                               ▼
       Internet (S3, APIs)           Internet (via NAT only)

Subnets — Dividing the VPC

A VPC is split into subnets, each tied to a single Availability Zone (AZ) with its own slice of the VPC's IP range (CIDR block). Every subnet is classified as public or private based on one thing only: does its route table send 0.0.0.0/0 traffic to an Internet Gateway?

Subnet Type	Route for 0.0.0.0/0	Internet Access	Typical Resources
Public	→ Internet Gateway (IGW)	Direct, two-way	NAT Gateways, Bastion hosts, ALBs
Private	→ NAT Gateway (or none)	Outbound only (via NAT), no inbound	EMR clusters, RDS, Redshift, Glue ENIs, Lambda (VPC mode)
Isolated	No internet route at all	None — only VPC Endpoints	Highly sensitive databases, internal-only services
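
The classification rule in the table can be expressed as a small function over a route table's routes. This is a sketch: `classify_subnet` is a hypothetical helper, the route dicts follow the shape returned by describe_route_tables(), and a subnet with no default route at all is treated as isolated.

```python
from typing import Dict, List

def classify_subnet(routes: List[Dict[str, str]]) -> str:
    """Classify a subnet as public, private, or isolated from its routes."""
    for route in routes:
        if route.get("DestinationCidrBlock") == "0.0.0.0/0":
            # A default route to an Internet Gateway makes the subnet public;
            # any other default-route target (such as a NAT Gateway) is private
            is_igw = route.get("GatewayId", "").startswith("igw-")
            return "public" if is_igw else "private"
    return "isolated"                     # no default route at all

LOCAL = {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"}

public_routes = [LOCAL, {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0abc"}]
private_routes = [LOCAL, {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0abc"}]
isolated_routes = [LOCAL, {"DestinationPrefixListId": "pl-0abc", "GatewayId": "vpce-0abc"}]

for name, routes in [("public", public_routes), ("private", private_routes),
                     ("isolated", isolated_routes)]:
    print(name, "->", classify_subnet(routes))
```

The resource IDs are placeholders. The third route table contains only the local route and a gateway VPC Endpoint route, which is the typical shape of an isolated subnet.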

📌 Key Point for Data Engineers

In production, Glue jobs, EMR clusters, and RDS instances should run in private subnets. They reach S3 and AWS APIs through VPC Endpoints (covered later in this section) rather than the public internet, which is both more secure and often cheaper (no NAT Gateway data-processing charges).

CIDR Blocks — Planning IP Address Ranges

A CIDR block (e.g. 10.0.0.0/16) defines the range of private IP addresses available to a VPC or subnet. The number after the slash is the prefix length: smaller numbers mean larger ranges. A /16 gives 65,536 addresses for the whole VPC; each subnet typically gets a /24 (256 addresses, of which 251 are usable because AWS reserves five addresses in every subnet).

CIDR Notation	Number of IPs	Typical Use
`10.0.0.0/16`	65,536	Entire VPC
`10.0.1.0/24`	256	One public subnet (AZ-a)
`10.0.10.0/24`	256	One private subnet (AZ-a) — EMR/RDS
`10.0.11.0/24`	256	Private subnet (AZ-b) — for HA / Multi-AZ

⚠️ Common Mistake

Sizing subnets too small for EMR. Each EMR core/task node consumes an IP address, and large clusters with auto-scaling can quickly exhaust a tiny /28 subnet, causing scale-out failures with no obvious error in Spark itself — only in the EMR cluster provisioning logs.
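
Subnet capacity is easy to check before provisioning with Python's standard ipaddress module. The sketch below subtracts the five addresses AWS reserves in every subnet; `usable_ips` is a hypothetical helper and the /28 CIDR is an illustrative value.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5   # first four addresses and the last one in each subnet

def usable_ips(cidr: str) -> int:
    """Addresses left for ENIs after AWS's per-subnet reservations."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET

vpc = ipaddress.ip_network("10.0.0.0/16")
for cidr in ["10.0.1.0/24", "10.0.10.0/24", "10.0.20.0/28"]:
    subnet = ipaddress.ip_network(cidr)
    inside = subnet.subnet_of(vpc)       # confirm the subnet fits inside the VPC range
    print(f"{cidr:<14} usable={usable_ips(cidr):<4} inside_vpc={inside}")
```

A /28 leaves only 11 usable addresses, which is why a cluster that autoscales beyond a handful of nodes stalls in such a subnet.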

python — create a VPC and a private subnet with boto3

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# ── Create the VPC ────────────────────────────────────────────────
vpc = ec2.create_vpc(
    CidrBlock="10.0.0.0/16",
    TagSpecifications=[{
        "ResourceType": "vpc",
        "Tags": [{"Key": "Name", "Value": "data-platform-vpc"}]
    }]
)
vpc_id = vpc["Vpc"]["VpcId"]

# ── Create a PRIVATE subnet for EMR / RDS / Glue ENIs ───────────────
private_subnet = ec2.create_subnet(
    VpcId=vpc_id,
    CidrBlock="10.0.10.0/24",
    AvailabilityZone="us-east-1a",
    TagSpecifications=[{
        "ResourceType": "subnet",
        "Tags": [{"Key": "Name", "Value": "private-subnet-emr-rds-a"}]
    }]
)
private_subnet_id = private_subnet["Subnet"]["SubnetId"]

# ── Create a PUBLIC subnet for the NAT Gateway ──────────────────────
public_subnet = ec2.create_subnet(
    VpcId=vpc_id,
    CidrBlock="10.0.1.0/24",
    AvailabilityZone="us-east-1a",
    TagSpecifications=[{
        "ResourceType": "subnet",
        "Tags": [{"Key": "Name", "Value": "public-subnet-nat-a"}]
    }]
)

print(f"VPC: {vpc_id} | Private subnet: {private_subnet_id}")

✅ Real World

When you launch an EMR cluster or attach a Glue VPC connection, you choose a subnet ID. Glue ENIs receive private IP addresses only, so a public subnet's IGW route does not give the job a path to the internet. If the subnet has neither a NAT route nor the VPC Endpoints the job needs (S3 at minimum), the run typically fails with a connectivity error rather than a code error. Pick a private subnet that reaches S3 through a VPC Endpoint or NAT.

29.20 — AWS COST OPTIMIZATION

AWS Cost Optimization
for Data Engineers

Cloud costs spiral quickly in data pipelines. This section covers every lever a Data Engineer controls — from Spot Instances and Reserved capacity to S3 lifecycle automation, Glue DPU tuning, Athena query cost control, Lambda memory right-sizing, and Redshift pause/resume. Master these and you can cut pipeline costs by 40–70%.

💡

Spot Instances — When and How Save 60–90% ▼

What is a Spot Instance?

Spot Instances are spare EC2 capacity that AWS sells at up to 90% discount compared to On-Demand pricing. The catch: AWS can reclaim them with a 2-minute warning when that capacity is needed elsewhere. For data pipelines this is usually fine — your EMR or EKS cluster just retries the failed tasks.

💡 Analogy

Think of Spot Instances like standby airline seats — massively cheaper, but you might get bumped. Perfect for batch jobs that can tolerate a retry; terrible for the control plane that orchestrates everything.

The Golden Rule: Spot for Task Nodes, On-Demand for Core + Master

In EMR, always keep the Master node and Core nodes On-Demand (they hold HDFS data and the YARN resource manager). Put all Task nodes on Spot — they only run tasks, hold no HDFS data, and losing them causes a retry, not a cluster failure.

🔑 Key Rule

Master = On-Demand. Core = On-Demand (or 1 On-Demand + rest Spot). Task = 100% Spot.

python — EMR with Spot task nodes

import boto3

emr = boto3.client('emr', region_name='us-east-1')

response = emr.run_job_flow(
    Name='cost-optimised-cluster',
    ReleaseLabel='emr-6.15.0',
    Applications=[{'Name': 'Spark'}],
    Instances={
        # Define every role in InstanceGroups: the API rejects mixing them with
        # MasterInstanceType / SlaveInstanceType / InstanceCount
        'InstanceGroups': [
            {   # Master: On-Demand (controls the cluster)
                'Name': 'Master', 'Market': 'ON_DEMAND', 'InstanceRole': 'MASTER',
                'InstanceType': 'm5.xlarge', 'InstanceCount': 1,
            },
            {   # Core: On-Demand (holds HDFS blocks)
                'Name': 'Core', 'Market': 'ON_DEMAND', 'InstanceRole': 'CORE',
                'InstanceType': 'm5.2xlarge', 'InstanceCount': 1,
            },
            {   # Task: Spot (pure compute, no HDFS data)
                'Name': 'SpotTaskNodes', 'Market': 'SPOT', 'InstanceRole': 'TASK',
                'InstanceType': 'm5.2xlarge', 'InstanceCount': 8,
                'BidPrice': '0.20',  # max price; you pay the current Spot price
            },
        ],
        'KeepJobFlowAliveWhenNoSteps': False,  # terminate once all steps finish
        'TerminationProtected': False,
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print(f"Cluster: {response['JobFlowId']}")

Spot Interruption Handling

When AWS reclaims a Spot node, the 2-minute notification triggers EMR's graceful decommission, which attempts to move in-progress tasks to other nodes. Spark's built-in task retry handles the rest: spark.task.maxFailures defaults to 4 attempts per task, and raising it makes jobs more tolerant of lost nodes.

python — Spark config tolerant to Spot interruptions

# In your EMR Configurations override:
configurations = [
    {
        'Classification': 'spark-defaults',
        'Properties': {
            'spark.task.maxFailures': '8',        # default 4 → raise to 8 for Spot
            'spark.stage.maxConsecutiveAttempts': '8',
            # Spark 3.1+ names (emr-6.15.0 ships Spark 3.4); spark.blacklist.* is deprecated
            'spark.excludeOnFailure.enabled': 'true',   # exclude repeatedly failing nodes
            'spark.excludeOnFailure.task.maxTaskAttemptsPerNode': '2',
        }
    }
]

⚠️ Watch Out

Never run stateful shuffle-heavy jobs on 100% Spot without checkpointing. If too many task nodes are lost mid-shuffle, the stage has to restart from scratch and you burn more time than you saved on cost.

Spot + On-Demand Mix Strategy

A reliable production pattern: provision 20% On-Demand task nodes as a base, and 80% Spot task nodes for burst capacity. Even if all Spot nodes are reclaimed, the job continues (slowly) on On-Demand nodes.

Node Type	Market	Why
Master (1x)	On-Demand	YARN ResourceManager and HDFS NameNode; cannot be lost
Core (2x)	On-Demand	HDFS DataNodes; losing them risks HDFS data loss
Task — base (2x)	On-Demand	Fallback if all Spot reclaimed
Task — burst (8x)	Spot	60–80% cheaper; retried on interruption
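
A rough estimate of what this mix saves can be computed in a few lines. Every number below is an illustrative assumption, not a price quote: the hourly rates, the 70% Spot discount, and the exclusion of the EMR service fee are all simplifications, and `hourly_cost` is a hypothetical helper.

```python
# Assumed On-Demand prices in USD per instance-hour (EC2 only, no EMR fee)
ON_DEMAND = {"m5.xlarge": 0.192, "m5.2xlarge": 0.384}
SPOT_DISCOUNT = 0.70            # assumed average Spot discount

# (role, instance type, count, market), following the table above
CLUSTER = [
    ("master",     "m5.xlarge",  1, "ON_DEMAND"),
    ("core",       "m5.2xlarge", 2, "ON_DEMAND"),
    ("task-base",  "m5.2xlarge", 2, "ON_DEMAND"),
    ("task-burst", "m5.2xlarge", 8, "SPOT"),
]

def hourly_cost(cluster, spot_discount: float) -> float:
    """Sum the hourly EC2 cost, discounting only the Spot groups."""
    total = 0.0
    for _role, instance_type, count, market in cluster:
        price = ON_DEMAND[instance_type]
        if market == "SPOT":
            price *= (1 - spot_discount)
        total += price * count
    return total

mixed = hourly_cost(CLUSTER, SPOT_DISCOUNT)
all_on_demand = hourly_cost(CLUSTER, 0.0)   # same nodes with no Spot discount
print(f"mixed ${mixed:.2f}/hr vs all On-Demand ${all_on_demand:.2f}/hr "
      f"({1 - mixed / all_on_demand:.0%} lower)")
```

Under these assumptions the cluster-wide saving is smaller than the per-node Spot discount, because the On-Demand master, core, and base task nodes still make up a large share of the bill.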

📅

Reserved Instances and Savings Plans Save 30–60% ▼

Reserved Instances (RI)

You commit to using a specific EC2 instance type in a specific region for 1 or 3 years in exchange for a 30–60% discount. Use RIs for your always-on baseline infrastructure, such as EMR master/core nodes that run every day. Persistent Redshift clusters have their own equivalent (reserved nodes). Glue is billed per DPU-hour, so EC2 RIs do not apply to Glue workers.

RI Type	Flexibility	Discount	Use For
Standard RI	Fixed instance type	~60%	Predictable, same-type workloads
Convertible RI	Can change instance family	~40%	When you may resize later
Scheduled RI	Specific time windows	~30%	Daily batch jobs at fixed times (AWS no longer offers these for new purchases)

Savings Plans (Preferred over RIs for most DEs)

Savings Plans are more flexible than RIs. You commit to a dollar amount per hour (e.g., $5/hr) rather than a specific instance type. AWS applies the discount automatically to any matching usage. Compute Savings Plans cover EC2, Lambda, and Fargate — ideal for Data Engineers who run varied workloads.

📌 Example

You run ~$10/hr of On-Demand EMR compute around the clock. You buy a 1-year Compute Savings Plan sized to cover $6/hr of that usage. At a 40% discount, the covered usage costs $3.60/hr (this is your hourly commitment), and the remaining $4/hr is billed at On-Demand rates. You pay $3.60 + $4.00 = $7.60/hr instead of $10. Net saving: ~$2.40/hr → ~$21,000/year.
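
The ~$2.40/hr figure can be reproduced with a few lines of arithmetic. The sketch treats $6/hr as the On-Demand value of the usage the plan covers and assumes the workload runs 24 hours a day, 365 days a year; `savings_plan_saving` is a hypothetical helper.

```python
HOURS_PER_YEAR = 24 * 365

def savings_plan_saving(on_demand_per_hr: float, covered_per_hr: float,
                        discount: float):
    """Return (hourly commitment, new hourly total, hourly saving)."""
    commitment = covered_per_hr * (1 - discount)    # discounted price of covered usage
    uncovered = on_demand_per_hr - covered_per_hr   # still billed at On-Demand rates
    new_total = commitment + uncovered
    return commitment, new_total, on_demand_per_hr - new_total

commitment, new_total, saving = savings_plan_saving(10.0, 6.0, 0.40)
print(f"commitment ${commitment:.2f}/hr, new total ${new_total:.2f}/hr")
print(f"saving ${saving:.2f}/hr = ${saving * HOURS_PER_YEAR:,.0f}/year")
```

If the clusters run only part of the day, the yearly saving shrinks in proportion, while the commitment is still charged for every hour.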

🔑 Rule of Thumb

Use Savings Plans for compute (flexible). Use Standard RIs for Redshift (fixed node type, always running). Layer Spot on top for burst. This three-tier approach maximises savings.

🪣

S3 Lifecycle Policies for Cost Reduction Storage Savings ▼

Storage Class Cost Ladder

S3 has multiple storage tiers — the colder the tier, the cheaper the storage but the higher the retrieval cost. Lifecycle policies automatically move objects between tiers based on age, so you never pay Standard prices for year-old raw data.

Storage Class	Cost (per GB/mo)	Retrieval	Best For
Standard	~$0.023	Instant, free	Active data (last 30 days)
Intelligent-Tiering	~$0.023 + monitoring	Instant	Unknown access patterns
Standard-IA	~$0.0125	Instant, per-GB fee	30–90 day old data, infrequent access
Glacier Instant Retrieval	~$0.004	Milliseconds	90–180 day data, rare access
Glacier Flexible Retrieval	~$0.0036	Minutes–hours	Compliance archives
Glacier Deep Archive	~$0.00099	Hours	7+ year regulatory data

Configuring Lifecycle Rules via Boto3

The typical data lake lifecycle: keep active data in Standard, move raw/bronze data after 30 days to IA, archive after 90 days, expire after your retention window. Configure this once per bucket/prefix and AWS handles it automatically forever.

python — S3 lifecycle policy for data lake

import boto3

s3 = boto3.client('s3')

# Apply tiered lifecycle to raw/ prefix (Bronze zone)
s3.put_bucket_lifecycle_configuration(
    Bucket='my-data-lake',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'raw-zone-tiering',
                'Status': 'Enabled',
                'Filter': {'Prefix': 'raw/'},   # only Bronze/raw data
                'Transitions': [
                    {
                        'Days': 30,
                        'StorageClass': 'STANDARD_IA'   # after 30 days → IA
                    },
                    {
                        'Days': 90,
                        'StorageClass': 'GLACIER_IR'    # after 90 days → Glacier IR
                    },
                    {
                        'Days': 365,
                        'StorageClass': 'DEEP_ARCHIVE'  # after 1 year → cheapest tier
                    },
                ],
                'Expiration': {
                    'Days': 2555  # delete after 7 years (compliance)
                },
                # Also clean up incomplete multipart uploads
                'AbortIncompleteMultipartUpload': {
                    'DaysAfterInitiation': 7
                }
            },
            {
                'ID': 'gold-zone-no-archive',
                'Status': 'Enabled',
                'Filter': {'Prefix': 'gold/'},   # Gold stays hot for dashboards
                'Transitions': [
                    {'Days': 180, 'StorageClass': 'STANDARD_IA'}
                ]
            }
        ]
    }
)
print("Lifecycle policy applied ✅")

📌 Real Saving

100 TB of raw data sitting in Standard costs ~$2,300/month. Move it to Glacier Deep Archive = ~$100/month. Lifecycle policy pays for itself in week 1.
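
The same comparison can be run for any volume and tier. The sketch below uses the approximate per-GB prices from the storage class table in this section and assumes decimal terabytes (1 TB = 1,000 GB), which matches the ~$2,300 figure. It covers storage charges only: retrieval fees, transition requests, and minimum storage durations are left out. `monthly_storage_cost` is a hypothetical helper.

```python
# Approximate prices in USD per GB-month, copied from the storage class table
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "GLACIER_FLEXIBLE": 0.0036,
    "DEEP_ARCHIVE": 0.00099,
}
GB_PER_TB = 1000   # decimal terabytes (assumption)

def monthly_storage_cost(terabytes: float, storage_class: str) -> float:
    """Storage-only monthly cost for a volume held in one storage class."""
    return terabytes * GB_PER_TB * PRICE_PER_GB_MONTH[storage_class]

for storage_class in PRICE_PER_GB_MONTH:
    cost = monthly_storage_cost(100, storage_class)
    print(f"{storage_class:<17} ${cost:>9,.2f}/month")
```

For data that is read back regularly, the retrieval fees of the colder tiers can outweigh the storage saving, so the cheapest line in this output is not always the cheapest choice.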

Abort Incomplete Multipart Uploads — The Hidden Cost

When a large file upload fails mid-way, the partial chunks remain in S3 and you are billed for them. They are invisible in the console and can accumulate to gigabytes. The AbortIncompleteMultipartUpload lifecycle rule deletes them automatically after N days.

⚠️ Always Set This

Every bucket that receives Spark output or large files should have AbortIncompleteMultipartUpload set to 7 days. This is a free money-saver that almost every team overlooks.

⚙️

EMR Cost Patterns Compute Savings ▼

Auto-Terminate After Job (Most Important)

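The single biggest EMR cost mistake: leaving a cluster running after the Spark job finishes. An idle m5.2xlarge cluster costs ~$0.38/hr per node, which is about $273/month per idle node. For transient clusters, always set KeepJobFlowAliveWhenNoSteps=False. For clusters that stay up between steps, attach an auto-termination policy with an idle timeout, or use a Lambda to terminate the cluster after the last step succeeds.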

python — auto-terminate EMR cluster

import boto3

emr = boto3.client('emr')

# Transient cluster: terminate as soon as the last step finishes
response = emr.run_job_flow(
    Name='daily-batch-job',
    Instances={
        'KeepJobFlowAliveWhenNoSteps': False,  # terminate when steps finish
        # ... other config
    },
    # For clusters kept alive between steps, cap idle time instead:
    # AutoTerminationPolicy={'IdleTimeout': 3600},  # seconds of idleness
    Steps=[
        {
            'Name': 'SparkJob',
            'ActionOnFailure': 'TERMINATE_CLUSTER',  # also terminate on failure
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['spark-submit', '--py-files', 's3://mybucket/jobs.zip',
                          's3://mybucket/main.py']
            }
        }
    ]
)

Right-Sizing Clusters

Over-provisioning is silent waste. Use the Spark UI to check executor utilisation after a job run — if executors are mostly idle, you have too many. Start with a smaller cluster, measure, then scale. EMR's managed scaling does this automatically.

python — enable managed scaling on EMR

# Managed scaling: EMR auto-resizes based on YARN pending containers
response = emr.run_job_flow(
    Name='auto-scaled-cluster',
    ManagedScalingPolicy={
        'ComputeLimits': {
            'UnitType': 'Instances',
            'MinimumCapacityUnits': 2,   # minimum 2 nodes
            'MaximumCapacityUnits': 20,  # scale up to 20 nodes
            'MaximumOnDemandCapacityUnits': 4,  # max On-Demand nodes
            'MaximumCoreCapacityUnits': 2,     # core nodes stay small
        }
    },
    # rest of config...
)

EMR Serverless — Zero Cluster Management Cost

With EMR Serverless, you pay only for the vCPU-seconds and GB-memory-seconds your Spark job actually uses. No idle cluster cost. No under/over-provisioning. For intermittent jobs (a few times per day), Serverless is almost always cheaper than a persistent cluster.

📌 When to Choose

Persistent EMR cluster wins when: jobs run continuously or near-continuously, require fast cold start, or need heavy customisation (bootstrap actions, custom AMI). Serverless wins for: scheduled batch jobs, ad-hoc runs, cost unpredictability.
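A quick break-even estimate makes this choice concrete. The sketch below compares an always-on cluster with a Serverless job that runs a few times a day. The Serverless vCPU-hour and GB-hour prices are assumptions (approximate us-east-1 list prices), the node rate reuses the ~$0.38/hr figure from earlier, and the workload shape is invented for illustration.

```python
HOURS_PER_MONTH = 720  # 30-day month


def persistent_cluster_cost(nodes: int, hourly_rate_per_node: float) -> float:
    """Monthly cost of a cluster that runs around the clock."""
    return round(nodes * hourly_rate_per_node * HOURS_PER_MONTH, 2)


def serverless_cost(runs_per_day: float, hours_per_run: float,
                    vcpus: int, memory_gb: int,
                    vcpu_hour_price: float = 0.052624,   # assumed price
                    gb_hour_price: float = 0.0057785) -> float:  # assumed price
    """Monthly cost when paying only for resources used while jobs run."""
    hourly = vcpus * vcpu_hour_price + memory_gb * gb_hour_price
    return round(runs_per_day * hours_per_run * hourly * 30, 2)


always_on = persistent_cluster_cost(nodes=4, hourly_rate_per_node=0.38)
on_demand = serverless_cost(runs_per_day=3, hours_per_run=0.5, vcpus=32, memory_gb=128)
print(f"Persistent: ${always_on:,.2f}/month  |  Serverless: ${on_demand:,.2f}/month")
```

As the number of job-hours per day grows, the Serverless figure climbs toward the persistent one, which is the point where a long-running cluster starts to make sense.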

🔵

Glue DPU Optimisation ETL Savings ▼

What is a DPU?

A Data Processing Unit (DPU) is the billing unit for Glue. One DPU = 4 vCPUs + 16 GB RAM. Glue charges $0.44 per DPU-hour, billed per second with a 1-minute minimum on Glue 2.0 and later (older Glue versions had a 10-minute minimum). By default, Glue allocates 10 DPUs to a Spark ETL job, whether your job needs them or not. This is where most teams waste money.

💡 Analogy

Paying for 10 DPUs on a job that only needs 2 is like booking 10 hotel rooms for a solo trip because that's the default package. Always specify explicitly.
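To put a number on the hotel-room analogy, here is a minimal cost sketch. It assumes the $0.44 per DPU-hour rate quoted above and a one-minute billing minimum (both exposed as parameters), and it assumes the runtime stays the same with fewer DPUs, which only holds when the job was over-provisioned to begin with.

```python
def glue_job_cost(dpus: float, runtime_minutes: float,
                  price_per_dpu_hour: float = 0.44,
                  minimum_billed_minutes: float = 1.0) -> float:
    """Approximate cost in USD of one Glue job run."""
    billed_minutes = max(runtime_minutes, minimum_billed_minutes)
    return round(dpus * (billed_minutes / 60) * price_per_dpu_hour, 4)


default_cost = glue_job_cost(dpus=10, runtime_minutes=15)  # Glue default allocation
right_sized = glue_job_cost(dpus=2, runtime_minutes=15)    # explicitly sized
print(f"10 DPUs: ${default_cost:.2f} per run  |  2 DPUs: ${right_sized:.2f} per run")
```

Multiply the difference by the number of daily runs across all jobs to estimate what the default allocation costs per month.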

Right-Sizing Glue DPUs

For small jobs (under 1 GB of data), 2 DPUs is often enough. For medium jobs (1–50 GB), try 5 DPUs first. Use the Glue Job Metrics (CloudWatch) to see actual executor utilisation, then reduce DPUs accordingly. Also consider G.1X vs G.2X workers — G.1X gives 4 vCPUs / 16 GB and costs less per worker than G.2X (8 vCPUs / 32 GB).

python — create Glue job with optimised DPUs

import boto3

glue = boto3.client('glue')

glue.create_job(
    Name='optimised-etl-job',
    Role='arn:aws:iam::123456789012:role/GlueRole',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://my-scripts/etl.py',
        'PythonVersion': '3',
    },
    # Worker type: G.1X = 4 vCPU / 16 GB (cost-efficient for most jobs)
    # G.2X = 8 vCPU / 32 GB (for memory-intensive transforms)
    WorkerType='G.1X',
    NumberOfWorkers=4,    # was 10 by default — start small!

    GlueVersion='4.0',

    DefaultArguments={
        '--enable-metrics': '',         # publish to CloudWatch for monitoring
        '--enable-job-insights': 'true', # run diagnostics and recommendations in CloudWatch Logs
        '--job-bookmark-option': 'job-bookmark-enable',
    },
    # Timeout protects against runaway cost
    Timeout=60,  # max 60 minutes; kills job after this
)

# Start the job
glue.start_job_run(JobName='optimised-etl-job')

🔑 Glue Job Insights

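Enable --enable-job-insights in your job arguments. After a run, Glue writes diagnostics to CloudWatch Logs, such as likely root causes and recommended actions for failures and bottlenecks. Read these alongside the executor metrics published by --enable-metrics to judge whether the job could run with fewer workers (for example 3 instead of 4).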

Python Shell Jobs for Lightweight Work

Not everything needs a full Spark cluster. If you're running a small metadata update, a file rename, or an API call, use a Glue Python Shell job — it runs on a single small VM (0.0625 DPU) and costs a fraction of a Spark ETL job.

python — Glue Python Shell job (cheapest option)

glue.create_job(
    Name='lightweight-metadata-job',
    Role='arn:aws:iam::123456789012:role/GlueRole',
    Command={
        'Name': 'pythonshell',          # NOT 'glueetl'
        'ScriptLocation': 's3://my-scripts/metadata_update.py',
        'PythonVersion': '3',
    },
    MaxCapacity=0.0625,   # 1/16 of a DPU — minimal cost
)

🔍

Athena Cost Control Query Savings ▼

How Athena Charges You

Athena charges $5 per TB of data scanned. That means a query that scans 10 TB costs $50 every time it runs. The two most powerful cost levers are: (1) partition pruning so Athena only scans relevant partitions, and (2) columnar format (Parquet/ORC) so it only reads relevant columns.

Scenario	Data Scanned	Cost per Query
Full table scan on CSV, no partitions	10 TB	$50.00
Full table scan on Parquet (column pruning)	2 TB	$10.00
Parquet + partition filter (WHERE dt='2024-01-01')	50 GB	$0.25
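The figures in the table follow directly from the $5 per TB rate. The sketch below reproduces them. It assumes decimal units (1 TB = 1000 GB), which is what yields the $0.25 in the last row, and applies Athena's 10 MB per-query minimum. The function name is illustrative.

```python
PRICE_PER_TB_USD = 5.0
BYTES_PER_TB = 1000 ** 4            # assumption: decimal TB, matching the table
MIN_BILLED_BYTES = 10 * 1000 ** 2   # Athena bills at least 10 MB per query


def athena_query_cost(bytes_scanned: int) -> float:
    """Approximate cost in USD of a single Athena query."""
    billed = max(bytes_scanned, MIN_BILLED_BYTES)
    return round(billed / BYTES_PER_TB * PRICE_PER_TB_USD, 2)


scenarios = {
    "CSV, no partitions (10 TB)": 10 * 1000 ** 4,
    "Parquet, column pruning (2 TB)": 2 * 1000 ** 4,
    "Parquet + partition filter (50 GB)": 50 * 1000 ** 3,
}
for label, scanned in scenarios.items():
    print(f"{label:38s} ${athena_query_cost(scanned):.2f}")
```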

Partition Pruning — The Most Important Athena Optimisation

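Always store data partitioned by date (or another column you filter on frequently that has low-to-moderate cardinality; very high-cardinality keys create huge numbers of tiny partitions) and always include the partition column in your WHERE clause. Without this filter, Athena scans every partition. With it, Athena reads only the matching partitions: a one-day filter on a table holding 100 days of data skips 99% of the files.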

sql — partition pruning in Athena

-- BAD: scans ALL partitions = $50/query
SELECT order_id, amount
FROM sales
WHERE customer_id = 12345;   -- customer_id is NOT the partition key

-- GOOD: partition filter reduces scan to 1 day = $0.25/query
SELECT order_id, amount
FROM sales
WHERE dt = '2024-01-15'        -- dt IS the partition key
  AND customer_id = 12345;

Workgroups — Per-Team Cost Control

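Athena Workgroups let you set a per-query data scan limit and workgroup-wide data usage alerts per team. If a query scans more than the per-query limit, Athena cancels it, and you are billed only for the data scanned up to that point. Workgroup-wide limits do not cancel queries by themselves: they raise an alarm (for example through SNS) once the team's scanned bytes cross a threshold within a time window. Together these stop runaway analytical queries from your data consumers from causing surprise bills.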

python — Athena workgroup with cost limits

import boto3

athena = boto3.client('athena')

athena.create_work_group(
    Name='analytics-team',
    Configuration={
        'ResultConfiguration': {
            'OutputLocation': 's3://my-athena-results/analytics-team/'
        },
        'EnforceWorkGroupConfiguration': True,
        'BytesScannedCutoffPerQuery': 10 * 1024**3,  # 10 GB max per query
        'PublishCloudWatchMetricsEnabled': True,   # track usage in CloudWatch
        'EngineVersion': {
            'SelectedEngineVersion': 'Athena engine version 3'
        }
    },
    Description='Analytics team — 10 GB scan limit per query'
)

# Run query scoped to this workgroup
response = athena.start_query_execution(
    QueryString='SELECT * FROM sales WHERE dt = \'2024-01-15\'',
    QueryExecutionContext={'Database': 'prod_db'},
    WorkGroup='analytics-team',   # 🔑 always specify workgroup
)
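To confirm that a scan limit is doing its job, you can read back how much data a query actually scanned. This sketch uses `get_query_execution`; the helper name and the injected client parameter are illustrative.

```python
def query_scan_summary(athena_client, query_execution_id: str):
    """Return (state, gigabytes_scanned) for a query execution."""
    resp = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
    execution = resp["QueryExecution"]
    state = execution["Status"]["State"]  # e.g. QUEUED, RUNNING, SUCCEEDED, FAILED, CANCELLED
    # Statistics may be missing while the query is still queued
    scanned_bytes = execution.get("Statistics", {}).get("DataScannedInBytes", 0)
    return state, scanned_bytes / 1024 ** 3
```

With the query started above, a call would look like `query_scan_summary(athena, response["QueryExecutionId"])`. Logging this value per query is a cheap way to spot the most expensive dashboards.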

Query Result Caching

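Athena can reuse the results of an earlier run for up to 7 days (MaxAgeInMinutes accepts up to 10,080). If an identical query is run again within the chosen window, Athena returns the stored result at zero cost (no data scanned). Result reuse is requested per query through ResultReuseConfiguration and requires Athena engine version 3. It suits dashboard queries that run every 5 minutes on static data.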

python — enable result reuse (caching) per query

response = athena.start_query_execution(
    QueryString='SELECT region, SUM(revenue) FROM sales GROUP BY 1',
    QueryExecutionContext={'Database': 'prod_db'},
    WorkGroup='analytics-team',
    ResultReuseConfiguration={
        'ResultReuseByAgeConfiguration': {
            'Enabled': True,
            'MaxAgeInMinutes': 60   # reuse cached result if < 60 mins old
        }
    }
)

λ

Lambda Cost — Memory Tuning Compute Savings ▼

How Lambda Charges You

Lambda charges on two dimensions: number of invocations ($0.20 per 1M requests) and GB-seconds of duration ($0.0000166667 per GB-second). GB-seconds = memory allocated (GB) × duration (seconds). Allocating more memory means higher cost per second but faster execution — you need to find the sweet spot.

💡 Analogy

Lambda memory is like hiring faster workers. A 1-GB worker processes your file in 10 seconds = 10 GB-seconds. A 3-GB worker does it in 3 seconds = 9 GB-seconds. The 3-GB worker is actually cheaper AND faster.
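The analogy can be checked with the pricing quoted above. This sketch computes the monthly bill for both worker sizes at one million invocations. The invocation count is an arbitrary example.

```python
GB_SECOND_PRICE = 0.0000166667      # USD per GB-second
REQUEST_PRICE = 0.20 / 1_000_000    # USD per invocation


def lambda_cost(memory_mb: int, duration_seconds: float, invocations: int) -> float:
    """Approximate Lambda cost in USD (free tier ignored)."""
    gb_seconds = (memory_mb / 1024) * duration_seconds * invocations
    return round(gb_seconds * GB_SECOND_PRICE + invocations * REQUEST_PRICE, 2)


small = lambda_cost(memory_mb=1024, duration_seconds=10, invocations=1_000_000)
large = lambda_cost(memory_mb=3072, duration_seconds=3, invocations=1_000_000)
print(f"1 GB x 10 s: ${small:.2f}  |  3 GB x 3 s: ${large:.2f}")
```

The larger setting wins here only because the duration dropped more than threefold. If tripling memory merely halved the duration, the smaller setting would be cheaper.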

AWS Lambda Power Tuning Tool

The open-source AWS Lambda Power Tuning tool is a Step Functions state machine that runs your function at multiple memory settings (128 MB to 10,240 MB) and plots cost against speed. Use it to find the optimal memory for your specific function. Lambda allocates CPU in proportion to memory, so CPU-bound data engineering functions often benefit from 512 MB–1024 MB, while functions that mostly wait on API calls gain little from extra memory.

python — update Lambda memory via boto3

import boto3

lam = boto3.client('lambda')

# Right-size Lambda memory after profiling
lam.update_function_configuration(
    FunctionName='file-arrival-trigger',
    MemorySize=512,        # MB — tuned from default 128 MB
    Timeout=300,           # 5 minutes max (fail fast)
    # Note: Environment replaces the function's entire set of variables
    Environment={
        'Variables': {
            'GLUE_JOB_NAME': 'my-etl-job',
        }
    }
)

# Check current config
config = lam.get_function_configuration(FunctionName='file-arrival-trigger')
print(f"Memory: {config['MemorySize']}MB, Timeout: {config['Timeout']}s")

🔴

Redshift Cost — Pause/Resume Clusters DB Savings ▼

Pause and Resume Redshift

Redshift provisioned clusters charge by the hour, even when idle. For dev/staging clusters or analytics clusters only needed during business hours, pause them overnight and on weekends. A paused cluster retains all data and configuration; you stop paying for compute but continue to pay for storage. You can automate this with boto3 or Redshift's built-in scheduler.

python — pause and resume Redshift cluster

import boto3

redshift = boto3.client('redshift')

def pause_cluster(cluster_id: str):
    """Call this at end of business hours (e.g., 7 PM via EventBridge)."""
    redshift.pause_cluster(ClusterIdentifier=cluster_id)
    print(f"Cluster {cluster_id} paused ✅")

def resume_cluster(cluster_id: str):
    """Call this at start of business hours (e.g., 7 AM via EventBridge)."""
    redshift.resume_cluster(ClusterIdentifier=cluster_id)
    print(f"Cluster {cluster_id} resuming... ⏳")

# Check cluster status
def get_cluster_status(cluster_id: str) -> str:
    resp = redshift.describe_clusters(ClusterIdentifier=cluster_id)
    return resp['Clusters'][0]['ClusterStatus']
    # Returns: 'available', 'paused', 'pausing', 'resuming'

# Usage
pause_cluster('my-analytics-cluster')
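A scheduled pause can fail if the cluster is already paused or mid-transition. A small guard avoids noisy errors in the scheduler. This is a sketch with an illustrative function name, taking the Redshift client as a parameter so it can be exercised without AWS access.

```python
def pause_if_available(redshift_client, cluster_id: str) -> bool:
    """Pause the cluster only when it is 'available'; return True if a pause was requested."""
    resp = redshift_client.describe_clusters(ClusterIdentifier=cluster_id)
    status = resp["Clusters"][0]["ClusterStatus"]
    if status != "available":
        print(f"Skipping pause: {cluster_id} is '{status}'")
        return False
    redshift_client.pause_cluster(ClusterIdentifier=cluster_id)
    return True
```

In a Lambda handler triggered by EventBridge, this would be called as `pause_if_available(boto3.client("redshift"), "my-analytics-cluster")`.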

📌 Saving Calculation

A ra3.4xlarge Redshift node costs ~$3.26/hr. Running 24/7 = $2,347/month. Pausing it for about 67% of the hours in a month (roughly 16 hours a day) cuts compute to ~$774/month. You save ~$1,573/month per node.

Redshift Serverless — Pay Per Query

For unpredictable or low-frequency query workloads, Redshift Serverless charges for the RPUs (Redshift Processing Units) consumed while queries run, metered per second. You pay no compute charge when idle, although storage is still billed. For dev environments or occasional reporting, Serverless is almost always cheaper than a provisioned cluster.

🔑 Decision Guide

Provisioned cluster (with RIs): for predictable, high-concurrency production workloads running 8+ hours daily. Redshift Serverless: for dev, staging, ad-hoc, or infrequent analytics. Pause/Resume: for provisioned clusters with predictable downtime windows.

📊

Cost Optimisation Cheat Sheet Quick Reference ▼

Top 10 Actions Every DE Should Take

#	Action	Service	Typical Saving
1	Use Spot for EMR task nodes	EMR	60–80% compute
2	Auto-terminate clusters after job	EMR	Eliminate idle cost
3	S3 lifecycle: move raw data to IA after 30d	S3	45–95% storage
4	Always filter on partition column in Athena	Athena	80–99% scan cost
5	Right-size Glue DPUs (start with 4, not 10)	Glue	40–60% DPU cost
6	Pause Redshift dev clusters at night/weekends	Redshift	60–70% compute
7	Use Compute Savings Plans for baseline EC2/Lambda	All	30–40% overall
8	Enable Athena result caching for dashboards	Athena	Near-zero repeat queries
9	AbortIncompleteMultipartUpload lifecycle rule	S3	Small but free
10	Use Python Shell for lightweight Glue work	Glue	94% vs Spark job

29.21

AWS Data Governance

Data Governance is about knowing who can access what data, where it came from, and whether it can be trusted. On AWS, the governance stack centers on Lake Formation (permissions), Glue Catalog (metadata), and patterns for lineage, PII detection, and cross-account sharing. A senior Data Engineer must be able to design and implement this stack — not just write pipelines.

🏛️

Lake Formation as Governance Layer CORE ▼

What is AWS Lake Formation?

AWS Lake Formation is a centralized permission layer that sits on top of S3 + Glue Catalog. Instead of writing complex bucket policies and IAM statements for every user and table, you grant permissions at the database, table, column, or row level through a single Lake Formation console or API call. Glue, Athena, and Redshift Spectrum honor these permissions automatically, and EMR does so once the cluster is configured for Lake Formation.

🗄️ Analogy

Think of Glue Catalog as a library catalogue (it knows what books exist and where they are), and Lake Formation as the librarian who decides which people are allowed to read which books. You don't change the books or the shelves — you just tell the librarian the rules.

LAKE FORMATION GOVERNANCE STACK
══════════════════════════════════════════════════════

Analysts / Data Scientists / Spark Jobs / Athena / Redshift Spectrum
                   │
                   ▼
        ┌──────────────────────────┐
        │   AWS Lake Formation     │  ← Permission enforcement layer
        │   (who can see what)     │
        └──────────┬───────────────┘
                   │ governs access to
                   ▼
        ┌──────────────────────────┐
        │  AWS Glue Data Catalog   │  ← Metadata: databases, tables, schemas
        └──────────┬───────────────┘
                   │ describes data stored in
                   ▼
        ┌──────────────────────────┐
        │       Amazon S3          │  ← Actual data: Parquet, Delta, Iceberg
        └──────────────────────────┘

PERMISSION HIERARCHY IN LAKE FORMATION:
  Database level → can see all tables in this database
  Table level    → can see this specific table
  Column level   → can see only these columns (hide PII columns)
  Row level      → can see only rows matching a filter expression
  Cell level     → row and column filters combined, so only specific cells are visible

Registering S3 Locations

Before Lake Formation can govern data, you must register the S3 location with Lake Formation. This tells Lake Formation "I am taking ownership of permissions for this S3 path — no longer controlled by raw bucket policies alone." After registration, access to this location requires a Lake Formation permission grant, not just an IAM policy.

python — register S3 location with Lake Formation

import boto3

lf = boto3.client("lakeformation")

# Register the S3 data lake root with Lake Formation
# UseServiceLinkedRole=True lets LF access S3 through its service-linked role;
# pass RoleArn instead to use a custom IAM role
lf.register_resource(
    ResourceArn="arn:aws:s3:::company-data-lake",
    UseServiceLinkedRole=True
)
print("✅ S3 location registered with Lake Formation")

# List all registered S3 locations
resp = lf.list_resources()
for resource in resp["ResourceInfoList"]:
    print(resource["ResourceArn"], "→", resource.get("RoleArn", "service-linked"))

🔑 Key Point

You only need to register the root S3 path (e.g. s3://company-data-lake). All sub-paths (bronze/, silver/, gold/) are automatically governed by Lake Formation once the root is registered.

Granting Table-Level Permissions

After registering S3 and crawling data into the Glue Catalog, you grant Lake Formation permissions to IAM principals (users and roles). The most common grants are SELECT (read), INSERT (write), ALTER (schema changes), and DROP (delete table).

python — grant and revoke Lake Formation table permissions

import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on a table to a data analyst IAM role
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/DataAnalystRole"},
    Resource={
        "Table": {
            "CatalogId":    "123456789012",
            "DatabaseName": "gold_layer",
            "Name":         "sales_summary"
        }
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=[]  # analyst cannot re-grant to others
)
print("✅ SELECT granted on gold_layer.sales_summary")

# Grant SELECT on all tables in a database to an ETL role.
# SELECT is a table-level permission, so target every table with TableWildcard.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/GlueETLRole"},
    Resource={
        "Table": {
            "CatalogId":     "123456789012",
            "DatabaseName":  "silver_layer",
            "TableWildcard": {}
        }
    },
    Permissions=["SELECT", "DESCRIBE"]
)

# Revoke permissions
lf.revoke_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/OldRole"},
    Resource={
        "Table": {
            "CatalogId":    "123456789012",
            "DatabaseName": "gold_layer",
            "Name":         "sales_summary"
        }
    },
    Permissions=["SELECT"]
)
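Grants and revokes are easier to trust when you can read back the current state. The sketch below lists who holds which permissions on a table using `list_permissions`, following `NextToken` by hand. The helper name and the injected client are illustrative.

```python
def list_table_grants(lf_client, catalog_id: str, database: str, table: str):
    """Return a list of (principal_arn, [permissions]) for one table."""
    resource = {"Table": {"CatalogId": catalog_id,
                          "DatabaseName": database,
                          "Name": table}}
    grants, token = [], None
    while True:
        kwargs = {"Resource": resource}
        if token:
            kwargs["NextToken"] = token
        resp = lf_client.list_permissions(**kwargs)
        for entry in resp.get("PrincipalResourcePermissions", []):
            principal = entry["Principal"]["DataLakePrincipalIdentifier"]
            grants.append((principal, sorted(entry["Permissions"])))
        token = resp.get("NextToken")
        if not token:
            return grants
```

Running this for every table in `gold_layer` on a schedule and diffing the output gives a simple permissions audit trail.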

📚

Glue Catalog as Central Metadata Store METADATA ▼

What the Glue Catalog Stores

The AWS Glue Data Catalog is a fully managed Hive-compatible metadata store. It stores the schema, partition info, table location, and data format for every table in your data lake. Athena, EMR Spark, Redshift Spectrum, Glue ETL jobs, and Lake Formation all share this single catalog — meaning a table created by Glue is instantly queryable by Athena without any extra registration.

🗄️

Databases

Logical groupings for tables. Typically one per layer: raw_layer, bronze_layer, silver_layer, gold_layer.

📋

Tables

Schema definitions pointing to S3 paths. Includes column names, types, partition keys, file format (Parquet, ORC, Delta).

📂

Partitions

Partition metadata so Athena/EMR can prune which S3 files to read. E.g. year=2024/month=01/day=15.

🔄

Schema Versions

Glue tracks schema changes over time. You can detect when columns were added or types changed.
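Detecting a schema change comes down to comparing two column lists. The sketch below diffs two lists shaped like the `StorageDescriptor.Columns` entries the Glue API returns (for example from two entries of `get_table_versions`). The function name and sample columns are made up for illustration.

```python
def diff_columns(old_columns, new_columns):
    """Compare two Glue-style column lists and report what changed."""
    old = {c["Name"]: c["Type"] for c in old_columns}
    new = {c["Name"]: c["Type"] for c in new_columns}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "type_changed": sorted(
            name for name in old.keys() & new.keys() if old[name] != new[name]
        ),
    }


previous = [{"Name": "order_id", "Type": "bigint"}, {"Name": "amount", "Type": "int"}]
current = [{"Name": "order_id", "Type": "bigint"}, {"Name": "amount", "Type": "double"},
           {"Name": "currency", "Type": "string"}]
print(diff_columns(previous, current))
```

A non-empty `removed` or `type_changed` list is usually worth an alert, since those changes tend to break downstream queries.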

Reading Catalog Metadata via boto3

Use the Glue client to programmatically inspect the catalog — list all tables, get their schemas, check partition counts. This is useful in governance audits, data discovery tools, and metadata-driven pipeline config generation.

python — inspect Glue Catalog for governance

import boto3

glue = boto3.client("glue")

# List all databases in the catalog
paginator = glue.get_paginator("get_databases")
for page in paginator.paginate():
    for db in page["DatabaseList"]:
        print(f"Database: {db['Name']}")

# Get all tables in a database with their schema
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="gold_layer"):
    for table in page["TableList"]:
        name     = table["Name"]
        sd       = table.get("StorageDescriptor", {})   # may be absent for some table types
        location = sd.get("Location", "n/a")
        fmt      = sd.get("InputFormat", "n/a")
        columns  = sd.get("Columns", [])
        print(f"Table: {name}  |  S3: {location}  |  Format: {fmt}")
        for col in columns:
            print(f"   {col['Name']:30s}  {col['Type']}")

# Get partition count for a table
part_paginator = glue.get_paginator("get_partitions")
count = 0
for page in part_paginator.paginate(
        DatabaseName="gold_layer", TableName="sales_summary"):
    count += len(page["Partitions"])
print(f"Partition count: {count}")

🔗

Data Lineage — Tracking Source → Transform → Target LINEAGE ▼

What is Data Lineage and Why It Matters

Data lineage is the ability to answer: "Where did this data come from? What transformations happened to it? Where did it go?" Without lineage, when a column value looks wrong, you cannot trace back to the root cause. Lineage is also required for regulatory compliance (GDPR, HIPAA) — to prove you can find and delete every copy of a person's PII across the entire data platform.

📌 Real Example

A BI dashboard shows total_revenue = $0 for January. With lineage, you trace: Gold sales_summary ← Silver orders_clean ← Bronze orders_raw ← RDS orders table. You find the Glue job that ran Bronze → Silver had a filter bug that dropped all January rows. Without lineage, you'd spend hours guessing.

DATA LINEAGE GRAPH: EXAMPLE

[RDS: orders table]
        │  (Glue JDBC crawl, daily 02:00)
        ▼
[S3 Bronze: raw/rds/orders/year=2024/month=01/]
        │  (Glue ETL job: orders-bronze-to-silver, daily 03:00)
        ▼
[S3 Silver: delta/silver/orders_clean/]
        │  (Glue ETL job: orders-silver-to-gold, daily 04:00)
        ▼
[S3 Gold: delta/gold/sales_summary/]
        │  (Athena view / Redshift Spectrum)
        ▼
[QuickSight Dashboard: Revenue Summary]

LINEAGE METADATA STORED PER PIPELINE RUN:
  run_id        → uuid
  source        → rds.orders (host, db, table)
  target        → s3://company-data-lake/delta/silver/orders_clean/
  job_name      → orders-bronze-to-silver
  run_time      → 2024-01-15T03:12:44Z
  rows_in       → 142,830
  rows_out      → 142,830
  filters       → WHERE updated_at > '2024-01-14'
  columns_used  → order_id, customer_id, amount, status, updated_at
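Once each run records its source and target, tracing a bad value back to its origin is a walk over those edges. The sketch below hard-codes the example chain as a dictionary and assumes each dataset has a single upstream source. The dataset labels are shortened, hypothetical names.

```python
# target -> source, taken from the example graph (names shortened for illustration)
UPSTREAM = {
    "quicksight.revenue_summary": "gold.sales_summary",
    "gold.sales_summary": "silver.orders_clean",
    "silver.orders_clean": "bronze.orders_raw",
    "bronze.orders_raw": "rds.orders",
}


def trace_upstream(dataset: str, upstream: dict = UPSTREAM) -> list:
    """Follow source links from a dataset back to its origin."""
    path = [dataset]
    seen = {dataset}
    while path[-1] in upstream:
        parent = upstream[path[-1]]
        if parent in seen:   # guard against accidental cycles
            break
        path.append(parent)
        seen.add(parent)
    return path


print(" ← ".join(trace_upstream("gold.sales_summary")))
```

A real platform has datasets with several sources, in which case the dictionary values become lists and the walk becomes a graph traversal.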

Custom Lineage Table in DynamoDB (Practical Pattern)

The most practical way to implement lineage for a mid-size data platform is to write a lineage record to a DynamoDB table at the start and end of every pipeline job. Each record captures the source, target, transformation job, run time, and row counts. Tools like a simple internal portal or Athena queries can then visualize the lineage graph.

python — writing lineage records to DynamoDB

import boto3, uuid
from datetime import datetime, timezone
from decimal  import Decimal

dynamo = boto3.resource("dynamodb")
table  = dynamo.Table("data_lineage")

def record_lineage(
    job_name:    str,
    source:      str,   # e.g. "rds.company_db.orders"
    target:      str,   # e.g. "s3://company-lake/silver/orders_clean/"
    rows_in:     int,
    rows_out:    int,
    status:      str,   # SUCCEEDED / FAILED
    run_id:      str = None,
    error_msg:   str = None
):
    run_id = run_id or str(uuid.uuid4())
    now    = datetime.now(timezone.utc).isoformat()

    table.put_item(Item={
        "run_id":     run_id,
        "job_name":   job_name,
        "source":     source,
        "target":     target,
        "rows_in":    Decimal(rows_in),
        "rows_out":   Decimal(rows_out),
        "status":     status,
        "run_time":   now,
        "error_msg":  error_msg or ""
    })
    return run_id

# Usage: call at start and end of every Glue/EMR/Lambda job
run_id = str(uuid.uuid4())

# At pipeline end, once row counts and final status are known
record_lineage(
    job_name  = "orders-bronze-to-silver",
    source    = "s3://company-lake/raw/rds/orders/",
    target    = "s3://company-lake/delta/silver/orders_clean/",
    rows_in   = 142830,
    rows_out  = 142830,
    status    = "SUCCEEDED",
    run_id    = run_id
)
print(f"✅ Lineage recorded: {run_id}")

# Query lineage for a specific target table
from boto3.dynamodb.conditions import Key   # import the condition builder explicitly

resp = table.query(
    IndexName="target-index",   # GSI on target column
    KeyConditionExpression=Key("target").eq(
        "s3://company-lake/delta/silver/orders_clean/"
    )
)
for item in resp["Items"]:
    print(f"  {item['run_time']}  {item['job_name']}  rows_in={item['rows_in']}  status={item['status']}")

OpenLineage — Open Standard for Lineage

OpenLineage is an open standard (backed by Astronomer, Databricks, and others) that defines a common JSON event format for emitting lineage from any pipeline tool, such as Spark, Airflow, dbt, or Flink. Marquez is an open-source lineage server that collects OpenLineage events and provides a UI to visualize the full lineage graph. For large platforms, adopting OpenLineage usually scales better than maintaining a custom DynamoDB lineage table, because every tool emits the same event format.

OPENLINEAGE ARCHITECTURE

Glue ETL Job               Airflow DAG                dbt Model
     │ (OL Spark listener)      │ (OL Airflow provider)     │ (OL dbt integration)
     │                          │                           │
     └──────────────────────────┼───────────────────────────┘
                                │  OpenLineage events (JSON over HTTP)
                                ▼
                      ┌──────────────────┐
                      │  Marquez Server  │  (or DataHub, Atlan, OpenMetadata)
                      │  (lineage store) │
                      └────────┬─────────┘
                               │
                               ▼
                       Lineage Graph UI
    "orders_raw → orders_clean → sales_summary → BI Dashboard"

🔑 Key Point

The Spark OpenLineage integration adds a listener to your SparkSession. Once the openlineage-spark JAR is on the classpath and the listener is configured, it automatically emits START and COMPLETE lineage events for every Spark job, with no changes to the transformation logic in your ETL scripts.

python — enabling OpenLineage in PySpark (Glue / EMR)

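# Requires the openlineage-spark JAR on the classpath (e.g. via --packages or --jars)
from pyspark.sql import SparkSession

# Add to spark-submit --conf or SparkSession config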
spark = SparkSession.builder \
    .appName("orders-silver-etl") \
    .config("spark.extraListeners",
            "io.openlineage.spark.agent.OpenLineageSparkListener") \
    .config("spark.openlineage.transport.type", "http") \
    .config("spark.openlineage.transport.url",
            "http://marquez-server:5000") \
    .config("spark.openlineage.namespace", "company-data-platform") \
    .getOrCreate()

# Everything after this is automatically tracked by OpenLineage
# reads, writes, transformations — all captured as lineage events
df = spark.read.parquet("s3://company-lake/raw/orders/")
df_clean = df.filter(df.status != "CANCELLED")
df_clean.write.format("delta").save("s3://company-lake/silver/orders_clean/")
# → OpenLineage emits: raw/orders → [filter transform] → silver/orders_clean

🔒

PII Detection and Classification PII / COMPLIANCE ▼

Why PII Detection Matters

PII (Personally Identifiable Information) includes names, email addresses, phone numbers, SSNs, IP addresses, and anything that can identify a person. Under GDPR, HIPAA, and PCI-DSS, you must know where all PII lives in your data lake, control who can see it, and be able to delete it on request (GDPR right to erasure). A Data Engineer must build PII detection into the ingestion pipeline — not as an afterthought.

⚠️ Warning

If raw PII lands in your S3 Bronze layer undetected and unmasked, it can be queried by anyone with Athena access or an EMR cluster. This is a compliance incident. Detect and classify PII at ingestion, before data reaches the catalog.

PII Detection with AWS Glue Sensitive Data Detection

AWS Glue's sensitive data detection feature (exposed as the Detect PII transform in Glue Studio) can automatically scan datasets for PII patterns such as emails, credit card numbers, SSNs, and phone numbers. It uses built-in detectors that match common PII patterns with regex and ML-based classifiers. You attach it to a Glue job to scan data as it flows through the pipeline. When you need full control over the patterns, the same idea can be hand-rolled in PySpark.

python — regex-based PII detection in PySpark (practical pattern)

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("pii-scanner").getOrCreate()

# Load raw data
df = spark.read.parquet("s3://company-lake/raw/customers/")

# PII regex patterns
EMAIL_REGEX = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
PHONE_REGEX = r'(\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
SSN_REGEX   = r'\b\d{3}-\d{2}-\d{4}\b'
CC_REGEX    = r'\b(?:\d{4}[-\s]?){3}\d{4}\b'

# Scan every string column for PII matches
pii_report = {}
for field in df.schema.fields:
    # isinstance is safer than comparing str(dataType), whose text differs between PySpark versions
    if isinstance(field.dataType, StringType):
        col_name = field.name
        pii_count = df.select(
            count(when(col(col_name).rlike(EMAIL_REGEX), 1)).alias("email"),
            count(when(col(col_name).rlike(PHONE_REGEX), 1)).alias("phone"),
            count(when(col(col_name).rlike(SSN_REGEX),   1)).alias("ssn"),
            count(when(col(col_name).rlike(CC_REGEX),    1)).alias("cc"),
        ).collect()[0]

        detected = {k: v for k, v in pii_count.asDict().items() if v > 0}
        if detected:
            pii_report[col_name] = detected
            print(f"⚠️  PII detected in column '{col_name}': {detected}")

if not pii_report:
    print("✅ No PII detected in dataset")

# Output: ⚠️  PII detected in column 'email_address': {'email': 85200}
# Output: ⚠️  PII detected in column 'phone':         {'phone': 85200}
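Regex detectors are worth sanity-checking on a handful of values before running them over a full table, because they overlap. The sketch below applies the same four patterns with Python's `re` module (search semantics, like `rlike`). The sample values are invented.

```python
import re

PATTERNS = {
    "email": re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'),
    "phone": re.compile(r'(\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'),
    "ssn":   re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
    "cc":    re.compile(r'\b(?:\d{4}[-\s]?){3}\d{4}\b'),
}


def classify(value: str) -> list:
    """Return the sorted names of every pattern found anywhere in the value."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(value))


samples = ["jane.doe@example.com", "555-123-4567", "123-45-6789",
           "4111111111111111", "ORD-5551234567"]
for sample in samples:
    print(f"{sample:22s} → {classify(sample)}")
```

Note the last two samples: the phone pattern also fires on an unformatted card number and on any ID containing ten consecutive digits. Treat regex hits as candidates to review, not as a final classification.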

Column Name Heuristics for PII Detection

A fast first-pass approach before regex scanning is to check column names for PII keywords. If a column is called email, phone_number, ssn, date_of_birth, or credit_card, it almost certainly contains PII. This heuristic lets you flag PII instantly during schema discovery without scanning any data.

python — column-name heuristic PII detection

import boto3

# PII keyword hints in column names
PII_KEYWORDS = {
    "email", "phone", "mobile", "ssn", "social_security",
    "credit_card", "card_number", "dob", "date_of_birth",
    "ip_address", "passport", "national_id", "tax_id",
    "first_name", "last_name", "full_name", "address", "postcode"
}

glue = boto3.client("glue")

def scan_catalog_for_pii(database: str):
    """Scan all tables in a Glue Catalog database for PII column names."""
    pii_findings = []
    paginator = glue.get_paginator("get_tables")

    for page in paginator.paginate(DatabaseName=database):
        for table in page["TableList"]:
            tbl_name = table["Name"]
            # StorageDescriptor may be absent for some table types
            columns  = table.get("StorageDescriptor", {}).get("Columns", [])
            for col in columns:
                col_lower = col["Name"].lower()
                if any(kw in col_lower for kw in PII_KEYWORDS):
                    pii_findings.append({
                        "database": database,
                        "table":    tbl_name,
                        "column":   col["Name"],
                        "type":     col["Type"]
                    })
    return pii_findings

findings = scan_catalog_for_pii("bronze_layer")
for f in findings:
    print(f"⚠️  {f['database']}.{f['table']}.{f['column']} ({f['type']}) → likely PII")
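Plain substring matching is fast but over-eager: `dob` matches `adobe_id` and `phone` matches `phonetic_code`. One possible refinement, sketched below with a reduced keyword set, matches whole underscore-separated tokens instead. The function name is illustrative.

```python
PII_KEYWORDS = {"email", "phone", "ssn", "dob", "address", "credit_card"}


def is_pii_column(column_name: str, keywords=PII_KEYWORDS) -> bool:
    """True when a keyword appears as a whole underscore-delimited token."""
    padded = f"_{column_name.lower()}_"
    return any(f"_{keyword}_" in padded for keyword in keywords)


for name in ["customer_phone", "Email_Address", "credit_card_no",
             "adobe_id", "phonetic_code"]:
    print(f"{name:16s} → {is_pii_column(name)}")
```

The trade-off is recall: names without separators, such as `phonenumber`, slip through, so this check works best alongside the data-level scan rather than instead of it.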

🎭

Column-Level Masking Strategies MASKING ▼

The Four Masking Approaches

Once PII is detected, you must decide how to protect it. There are four primary strategies, each with different trade-offs between security, usability, and reversibility.

Strategy	What It Does	Reversible?	Use Case
Nullification	Replace PII with NULL	No	When the value is never needed downstream
Static Masking	Replace with fixed placeholder (***-**-1234)	No	Display in BI dashboards: readable format, no real value
Tokenization	Replace with a token that maps to original value in a secure lookup table	Yes (with key)	Need to join on PII key, but not expose the raw value
Hashing (SHA-256)	Deterministic one-way hash of PII	No	Consistent anonymised join key (same email always same hash)
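The four strategies in the table can be illustrated in plain Python before moving to Spark. This is a sketch only: the `TokenVault` class is an in-memory stand-in for a secured lookup table, and the salt is a demo value that a real pipeline would load from a secret store.

```python
import hashlib
import secrets


def nullify(value):
    """Nullification: discard the value entirely."""
    return None


def static_mask(ssn: str) -> str:
    """Static masking: fixed placeholder that keeps only the last four characters."""
    return "***-**-" + ssn[-4:]


class TokenVault:
    """Tokenization: reversible mapping kept in a (here in-memory) lookup table."""

    def __init__(self):
        self._to_token = {}
        self._to_value = {}

    def tokenize(self, value: str) -> str:
        if value not in self._to_token:
            token = "tok_" + secrets.token_hex(8)
            self._to_token[value] = token
            self._to_value[token] = value
        return self._to_token[value]

    def detokenize(self, token: str) -> str:
        return self._to_value[token]


def salted_hash(value: str, salt: str) -> str:
    """Hashing: deterministic one-way SHA-256 of value plus salt."""
    return hashlib.sha256(f"{value}|{salt}".encode("utf-8")).hexdigest()


vault = TokenVault()
print(static_mask("123-45-6789"))
print(vault.tokenize("jane@example.com"))
print(salted_hash("jane@example.com", "demo-salt")[:16], "...")
```

Only tokenization can be reversed, and only by whoever can read the vault, which is why the vault needs tighter access controls than the masked data itself.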

Implementing Masking in PySpark

Apply masking transformations in your Bronze → Silver ETL job. Raw PII stays in Bronze (with restricted access). Silver and above contain only masked values accessible to analysts.

python — PII masking in PySpark Bronze → Silver transform

from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, sha2, concat_ws, lit,
    regexp_replace, when
)

spark = SparkSession.builder.appName("pii-masking").getOrCreate()

# Bronze layer: contains raw PII (restricted access in Lake Formation)
df_bronze = spark.read.format("delta").load("s3://company-lake/delta/bronze/customers/")

# Parentheses (not backslashes) allow inline comments inside the chained call
df_silver = (
    df_bronze
    .withColumn(
        "email_masked",
        # Static masking: replace the local part, keep the domain
        regexp_replace(col("email"), r'^[^@]+', "****")
        # Result: ****@gmail.com  (domain visible, local part masked)
    )
    .withColumn(
        "phone_masked",
        # Mask every digit that still has at least 4 digits after it
        regexp_replace(col("phone"), r'\d(?=(?:\D*\d){4})', "*")
        # Result: ***-***-1234
    )
    .withColumn(
        "customer_hash",
        # Deterministic hash for joining without exposing raw email.
        # In production, load the salt from Secrets Manager, not source code.
        sha2(concat_ws("|", col("email"), lit("SALT_2024")), 256)
    )
    .withColumn("ssn", lit(None).cast("string"))   # nullify SSN completely
    .withColumn("dob", lit(None).cast("date"))     # nullify DOB
    .drop("email", "phone")                        # drop raw PII columns from silver
)

df_silver.write \
    .format("delta") \
    .mode("overwrite") \
    .save("s3://company-lake/delta/silver/customers_masked/")

print("✅ PII masked — Silver layer written")
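
Masking patterns are easy to get subtly wrong, so it can help to try them on sample values locally before running the Spark job. The sketch below uses Python's `re` module, on the assumption that these particular constructs behave the same way in the Java regex engine Spark uses. The phone pattern is a separator-tolerant variant chosen for this illustration, and `mask_email` / `mask_phone` are hypothetical helper names.

```python
import re

# Everything before the @ sign
EMAIL_LOCAL_PART = re.compile(r"^[^@]+")
# A digit that is followed by at least four more digits, ignoring separators
DIGIT_BEFORE_LAST_FOUR = re.compile(r"\d(?=(?:\D*\d){4})")


def mask_email(email):
    return EMAIL_LOCAL_PART.sub("****", email)


def mask_phone(phone):
    return DIGIT_BEFORE_LAST_FOUR.sub("*", phone)


print(mask_email("jane.doe@gmail.com"))   # ****@gmail.com
print(mask_phone("555-123-4567"))         # ***-***-4567
print(mask_phone("5551234567"))           # ******4567
```

Note that a simpler lookahead such as `\d(?=\d{4})` only masks digits when four digits follow immediately, so a hyphenated number would pass through unmasked.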

Dynamic Column Masking with Lake Formation

Lake Formation supports column-level security: you can hide an entire column from certain roles without physically removing it from the table. When a user without the column permission queries the table via Athena or Redshift Spectrum, the column is excluded from the results. Strictly speaking this is column-level filtering rather than value masking, because the column is withheld entirely instead of being rewritten, and Lake Formation enforces it at query time.

python — grant column-level access (hide PII columns from analyst role)

import boto3

lf = boto3.client("lakeformation")

# Grant SELECT only on specific non-PII columns to the analyst role
# Columns: customer_id, customer_hash, country, segment, created_at (no email, phone, ssn, dob)
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier":
            "arn:aws:iam::123456789012:role/DataAnalystRole"
    },
    Resource={
        "TableWithColumns": {
            "CatalogId":    "123456789012",
            "DatabaseName": "silver_layer",
            "Name":         "customers",
            "ColumnNames": [
                "customer_id",
                "customer_hash",
                "country",
                "segment",
                "created_at"
                # email, phone, ssn, dob NOT in this list → hidden from analyst
            ]
        }
    },
    Permissions=["SELECT"]
)
print("✅ Column-level grant set: analyst sees only safe columns")

🔑 Key Point

Column-level permissions in Lake Formation are additive — you grant exactly the columns the role is allowed to see. Any column not in the grant list is invisible to that principal when they query via Athena, Redshift Spectrum, or EMR (if using Lake Formation fine-grained access).
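
When a table has many safe columns and only a few sensitive ones, listing every allowed column becomes tedious. The `TableWithColumns` resource also accepts a `ColumnWildcard` with `ExcludedColumnNames`, which grants everything except the named columns. The sketch below only builds the request arguments; the helper name `column_grant_request` is hypothetical, and the role ARN and table names are placeholders.

```python
def column_grant_request(role_arn, catalog_id, database, table, excluded_columns):
    """Build kwargs for lakeformation.grant_permissions that hide the given columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "TableWithColumns": {
                "CatalogId": catalog_id,
                "DatabaseName": database,
                "Name": table,
                # Grant all columns except the sensitive ones
                "ColumnWildcard": {"ExcludedColumnNames": sorted(excluded_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


request = column_grant_request(
    role_arn="arn:aws:iam::123456789012:role/DataAnalystRole",
    catalog_id="123456789012",
    database="silver_layer",
    table="customers",
    excluded_columns={"email", "phone", "ssn", "dob"},
)
# With real credentials: boto3.client("lakeformation").grant_permissions(**request)
```

The trade-off is that an exclusion list exposes any new column by default, whereas an explicit allow list hides it by default, so the allow list is the safer choice for tables that may gain PII columns later.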

🗂️

Metadata Management Patterns

Tagging Tables and Columns with Lake Formation LF-Tags

LF-Tags (Lake Formation Tags) are key-value pairs you attach to databases, tables, and columns to classify data. Instead of writing individual permission grants per table, you write a tag-based grant once: "DataAnalystRole can SELECT all tables tagged with classification=public". As new tables are crawled, you simply tag them, and they automatically inherit the right permissions — no per-table grant needed.

python — creating and using LF-Tags for attribute-based access

import boto3

lf = boto3.client("lakeformation")

# Step 1: Create LF-Tag keys and allowed values
lf.create_lf_tag(TagKey="classification", TagValues=["public", "internal", "confidential", "pii"])
lf.create_lf_tag(TagKey="data_layer",     TagValues=["bronze", "silver", "gold"])
lf.create_lf_tag(TagKey="domain",          TagValues=["sales", "finance", "hr", "ops"])

# Step 2: Tag a table
lf.add_lf_tags_to_resource(
    Resource={
        "Table": {
            "DatabaseName": "gold_layer",
            "Name":         "sales_summary"
        }
    },
    LFTags=[
        {"TagKey": "classification", "TagValues": ["internal"]},
        {"TagKey": "data_layer",     "TagValues": ["gold"]},
        {"TagKey": "domain",          "TagValues": ["sales"]},
    ]
)
print("✅ Table tagged: classification=internal, data_layer=gold, domain=sales")

# Step 3: Grant access by LF-Tag (attribute-based access control)
# "SalesAnalystRole can SELECT all tables where classification is internal or public AND domain=sales"
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier":
            "arn:aws:iam::123456789012:role/SalesAnalystRole"
    },
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [
                {"TagKey": "classification", "TagValues": ["internal", "public"]},
                {"TagKey": "domain",          "TagValues": ["sales"]},
            ]
        }
    },
    Permissions=["SELECT", "DESCRIBE"]
)
print("✅ Tag-based grant: SalesAnalystRole can read all internal/public sales tables")
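
To make the matching rule behind an LF-Tag expression concrete: the keys in the expression are combined with AND, and the values listed for one key are combined with OR. The function below is a simplified local model of that rule, not Lake Formation's implementation, and `tag_expression_matches` is a hypothetical name.

```python
def tag_expression_matches(expression, resource_tags):
    """Return True if a resource's tags satisfy every condition in the expression.

    expression:    list of {"TagKey": str, "TagValues": [str, ...]}
    resource_tags: dict mapping tag key -> the single value set on the resource
    """
    return all(
        resource_tags.get(condition["TagKey"]) in condition["TagValues"]
        for condition in expression
    )


expression = [
    {"TagKey": "classification", "TagValues": ["internal", "public"]},
    {"TagKey": "domain", "TagValues": ["sales"]},
]

sales_summary = {"classification": "internal", "data_layer": "gold", "domain": "sales"}
payroll = {"classification": "confidential", "data_layer": "gold", "domain": "hr"}

print(tag_expression_matches(expression, sales_summary))  # True
print(tag_expression_matches(expression, payroll))        # False
```

A table that is missing one of the keys in the expression does not match, which is why untagged tables stay inaccessible until someone tags them.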

Tagging Columns with Sensitivity Labels

You can also tag individual columns with sensitivity labels, for example tagging all email columns with sensitivity=pii, and then grant column permissions based on those tags. When a new PII column is added to a table, tagging it is the only step needed: the existing tag-based grants decide who can see it, so no permission grant has to be updated by hand.

python — tag a specific column as PII in Lake Formation

import boto3

lf = boto3.client("lakeformation")

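# The LF-Tag key must exist before it can be assigned (skip this call if it was already created)
lf.create_lf_tag(TagKey="sensitivity", TagValues=["pii"])

# Tag the PII columns (email, phone, ssn, date_of_birth) as sensitivity=pii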
lf.add_lf_tags_to_resource(
    Resource={
        "TableWithColumns": {
            "DatabaseName": "bronze_layer",
            "Name":         "customers_raw",
            "ColumnNames":  ["email", "phone", "ssn", "date_of_birth"]
        }
    },
    LFTags=[
        {"TagKey": "sensitivity", "TagValues": ["pii"]}
    ]
)
print("✅ PII columns tagged — access now requires sensitivity=pii permission grant")

🌐

Cross-Account Catalog Sharing

Why Cross-Account Sharing?

In large enterprises, the data lake (Account A) is separate from the analytics/BI account (Account B) for cost separation, security isolation, or organisational boundaries. Data engineers in Account A produce Gold layer tables; analysts in Account B need to query them via Athena. Lake Formation handles this elegantly via cross-account data sharing — no data copying required.

CROSS-ACCOUNT DATA SHARING PATTERN

Account A (Data Engineering / Data Lake account)
├── S3: s3://company-data-lake/gold/sales_summary/
├── Glue Catalog: gold_layer.sales_summary
└── Lake Formation: grant SELECT on gold_layer.sales_summary
    to Account B (123456789099)
        │
        │  RAM Resource Share
        ▼
Account B (Analytics / BI account)
├── Glue Catalog: gold_layer.sales_summary  ← appears as shared catalog
└── Athena: SELECT * FROM gold_layer.sales_summary
    → reads data from Account A's S3 directly
    → Lake Formation enforces permissions across accounts

Setting Up Cross-Account Sharing via boto3

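Cross-account sharing combines Lake Formation with AWS Resource Access Manager (RAM). The producer grants table-level permissions to the consumer account in Lake Formation, and Lake Formation uses RAM behind the scenes to share the Glue Catalog resource. If the two accounts are not in the same AWS Organization, the consumer must first accept the RAM invitation. The consumer account then creates a resource link in its own catalog pointing to the shared table.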

python — cross-account Lake Formation table sharing (Account A — producer)

import boto3

lf = boto3.client("lakeformation")

CONSUMER_ACCOUNT_ID = "123456789099"   # Account B

# Step 1: Grant permission to the consumer ACCOUNT (not a role inside it)
# This makes the catalog entry visible to the consumer account
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": CONSUMER_ACCOUNT_ID
    },
    Resource={
        "Table": {
            "CatalogId":    "111111111111",    # producer account (Account A)
            "DatabaseName": "gold_layer",
            "Name":         "sales_summary"
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
    PermissionsWithGrantOption=["SELECT", "DESCRIBE"]
    # PermissionsWithGrantOption allows Account B to further share with its own roles
)
print("✅ Cross-account share granted to Account B")

python — consumer account (Account B) creates a resource link

import boto3

# In Account B — create a resource link in the local Glue Catalog
# pointing to the shared table in Account A
glue_b = boto3.client("glue")

glue_b.create_table(
    DatabaseName="shared_gold",      # local database in Account B's catalog
    TableInput={
        "Name":           "sales_summary_link",
        "TargetTable": {
            "CatalogId":    "111111111111",    # producer's account ID
            "DatabaseName": "gold_layer",
            "Name":         "sales_summary"
        }
    }
)
print("✅ Resource link created — Athena in Account B can now query sales_summary")

# Athena query in Account B now works:
# SELECT * FROM shared_gold.sales_summary_link LIMIT 10;
# → reads from Account A's S3 directly, LF enforces permissions
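
If the producer and consumer accounts are not in the same AWS Organization, the RAM invitation has to be accepted in Account B before the shared table becomes visible. The sketch below shows one way that step could look. The function name `accept_pending_shares` is hypothetical, the client is passed in so the logic can be exercised without AWS, and pagination of the invitation list is ignored for brevity.

```python
def accept_pending_shares(ram_client, producer_account_id):
    """Accept every pending RAM invitation sent by the producer account.

    ram_client is expected to behave like boto3.client("ram").
    Returns the ARNs of the invitations that were accepted.
    """
    accepted = []
    response = ram_client.get_resource_share_invitations()
    for invitation in response.get("resourceShareInvitations", []):
        is_pending = invitation["status"] == "PENDING"
        from_producer = invitation["senderAccountId"] == producer_account_id
        if is_pending and from_producer:
            arn = invitation["resourceShareInvitationArn"]
            ram_client.accept_resource_share_invitation(
                resourceShareInvitationArn=arn
            )
            accepted.append(arn)
    return accepted


# Usage in Account B (requires credentials, so it is left commented out):
# import boto3
# accept_pending_shares(boto3.client("ram"), "111111111111")
```

Filtering on the sender account avoids accepting unrelated invitations that happen to be pending in the same account.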

☁️ Production Pattern

In a real enterprise setup, Account A is the Data Platform / Lake account owned by the Data Engineering team. Accounts B, C, D are Business Unit accounts (Sales Analytics, Finance Analytics, etc.). Each BU gets exactly the Gold tables they need, with column-level restrictions enforced by Lake Formation. No data is physically copied between accounts — it's all permission-based access to the same S3 objects.

📋 29.21 Summary — AWS Data Governance

Lake Formation is the central permission enforcement layer — governs who can access which databases, tables, columns, and rows.
Glue Catalog is the central metadata store — all AWS analytics services (Athena, Glue, EMR, Redshift Spectrum) share it.
Data Lineage answers "where did this data come from?" — implement via DynamoDB records per pipeline run, or use OpenLineage + Marquez for an open-standard approach.
PII Detection must happen at ingestion — use column name heuristics first, then regex scanning. Flag PII before data reaches the catalog.
Masking strategies: nullify for unused PII, hash for join keys, static mask for display, tokenize for reversible lookup.
LF-Tags enable attribute-based access control — tag tables and columns once, write grants once, and new tables automatically inherit correct permissions.
Cross-account sharing via Lake Formation + RAM eliminates data copying — consumer accounts get permission-based access to producer account S3 data.

29.22

Terraform for Data Engineers (IaC)

Infrastructure as Code (IaC) means your S3 buckets, Glue jobs, EMR clusters, Lambda functions, and IAM roles are defined in code — version-controlled, repeatable, and promotable across environments. Terraform is the industry-standard IaC tool for AWS data platforms. A senior Data Engineer must be able to write and own Terraform for every service in their pipeline stack.

🏗️

Why IaC Matters for Data Engineers

The Problem Without IaC

Without IaC, infrastructure is created by hand through the AWS console. This creates three serious problems: snowflake environments (prod and staging drift apart over time because changes are made manually), no audit trail (you don't know who created what or when), and slow recovery (if you need to rebuild the data platform in a new account or region, you have to re-click everything from memory). IaC solves all three.

🗄️ Analogy

Clicking through the AWS console to build infrastructure is like cooking a dish from memory each time — the result varies slightly every time and you can't hand the recipe to someone else. Terraform is the written recipe: version-controlled, reproducible, shareable, and reviewable before you start cooking.

Without IaC (Console Clicking)	With Terraform (IaC)
Environments drift apart	Dev/staging/prod are built from the same code; differences are explicit variables
No audit trail of changes	Every change is a git commit with author + reason
Rebuilding takes days/weeks	terraform apply rebuilds in minutes
New team member must learn console	New member reads .tf files to understand the whole stack
Hard to review changes before applying	terraform plan shows exact diff before any change

📖

Terraform Fundamentals

Providers, Resources, Variables, Outputs

Every Terraform configuration has four building blocks. A provider tells Terraform which cloud to talk to (AWS, GCP, Azure). A resource is an actual infrastructure object you want to create (an S3 bucket, a Glue job, an IAM role). A variable is an input parameter so the same code works across environments. An output exposes values from created resources for use by other modules or scripts.

hcl — terraform basics: provider, resource, variable, output

# provider.tf — tell Terraform to use AWS
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.aws_region
}

# variables.tf — input parameters
variable "aws_region"   { default = "us-east-1" }
variable "environment"  { default = "dev" }          # dev / staging / prod
variable "project_name" { default = "company-datalake" }

# main.tf — create an S3 bucket resource
resource "aws_s3_bucket" "data_lake" {
  bucket = "${var.project_name}-${var.environment}"
  # Result: "company-datalake-dev" or "company-datalake-prod"

  tags = {
    Environment = var.environment
    Project     = var.project_name
    ManagedBy   = "Terraform"
  }
}

# outputs.tf — expose values for other modules
output "data_lake_bucket_name" {
  value = aws_s3_bucket.data_lake.bucket
}
output "data_lake_bucket_arn" {
  value = aws_s3_bucket.data_lake.arn
}

State File — Local vs Remote

Terraform tracks everything it has created in a state file (terraform.tfstate). This file maps your .tf resource definitions to the real AWS resources. If you delete the state file, Terraform loses track of what exists. For teams, the state file must be stored remotely (in S3) so everyone shares the same state, and a DynamoDB table is used to prevent two people from running terraform apply simultaneously (state locking).

hcl — remote state: S3 backend + DynamoDB lock table

# backend.tf — store state remotely in S3, lock with DynamoDB
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"      # dedicated state bucket
    key            = "data-platform/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                            # SSE-S3 encryption
    dynamodb_table = "terraform-state-lock"          # prevents concurrent applies
  }
}

# Create the lock table (run once manually or in a bootstrap Terraform)
resource "aws_dynamodb_table" "tf_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
  tags = { Purpose = "TerraformStateLock" }
}

Workspaces — Dev / Staging / Prod Environments

Terraform workspaces allow a single set of .tf files to manage multiple environments. Each workspace has its own state file in S3, so terraform workspace select prod followed by terraform apply only affects production resources. Combine workspaces with one .tfvars file per environment (dev.tfvars, staging.tfvars, prod.tfvars) for environment-specific variable values, and pass the matching file with -var-file.

bash — terraform workspace commands

# Create workspaces for each environment
terraform workspace new dev
terraform workspace new staging
terraform workspace new prod

# Switch to prod and apply
terraform workspace select prod
terraform plan  -var-file=prod.tfvars
terraform apply -var-file=prod.tfvars

# List all workspaces
terraform workspace list
#   default
#   dev
# * prod       ← current
#   staging

hcl — using workspace name in resource naming

# Resources automatically get env-specific names
resource "aws_s3_bucket" "data_lake" {
  bucket = "company-datalake-${terraform.workspace}"
  # dev     → company-datalake-dev
  # staging → company-datalake-staging
  # prod    → company-datalake-prod
}
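
It helps to know where each workspace's state actually lands in the state bucket. With the S3 backend, the default workspace uses the configured key as-is, while every other workspace is stored under a prefix, which defaults to env: unless workspace_key_prefix is set. The helper below is a small sketch of that naming rule; `state_object_key` is a hypothetical name, not part of any Terraform tooling.

```python
def state_object_key(key, workspace, workspace_key_prefix="env:"):
    """Return the S3 object key the S3 backend uses for a workspace's state."""
    if workspace == "default":
        return key
    return f"{workspace_key_prefix}/{workspace}/{key}"


for ws in ("default", "dev", "prod"):
    print(ws, "->", state_object_key("data-platform/terraform.tfstate", ws))
```

Knowing the layout is useful when writing IAM policies for the state bucket, for example to let a CI role for dev read only the dev state object.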

Modules — Reusable Infrastructure Components

A Terraform module is a reusable block of resources grouped together. Instead of copy-pasting Glue job definitions for every pipeline, you write a glue_job module once and call it with different parameters. Modules are the equivalent of functions in programming — they promote reuse and consistency across the data platform.

hcl — creating and calling a reusable glue_job module

# modules/glue_job/main.tf — the reusable module
variable "job_name"       {}
variable "script_location" {}
variable "role_arn"        {}
variable "glue_version"   { default = "4.0" }
variable "worker_type"    { default = "G.1X" }
variable "num_workers"    { default = 5 }

resource "aws_glue_job" "this" {
  name              = var.job_name
  role_arn          = var.role_arn
  glue_version      = var.glue_version
  worker_type       = var.worker_type
  number_of_workers = var.num_workers

  command {
    script_location = var.script_location
    python_version  = "3"
  }
  default_arguments = {
    "--job-language"              = "python"
    "--enable-glue-datacatalog"   = "true"
    "--enable-metrics"            = "true"
    "--enable-continuous-cloudwatch-log" = "true"
  }
}

output "job_name" { value = aws_glue_job.this.name }

##################################################################
# main.tf — calling the module for multiple pipelines
module "orders_bronze" {
  source          = "./modules/glue_job"
  job_name        = "orders-bronze-etl-${terraform.workspace}"
  script_location = "s3://company-scripts/glue/orders_bronze.py"
  role_arn        = aws_iam_role.glue_role.arn
  num_workers     = 10
}

module "customers_bronze" {
  source          = "./modules/glue_job"
  job_name        = "customers-bronze-etl-${terraform.workspace}"
  script_location = "s3://company-scripts/glue/customers_bronze.py"
  role_arn        = aws_iam_role.glue_role.arn
  num_workers     = 5
}

Terraform Workflow — init → plan → apply → destroy

The four commands are your daily Terraform workflow. init downloads provider plugins. plan shows what will change without touching anything. apply executes the changes. destroy tears everything down. In production CI/CD, a human reviews the plan output before apply is triggered.

bash — complete terraform workflow

# Step 1: Download providers and initialise backend
terraform init

# Step 2: Preview what will be created/changed/destroyed
terraform plan -var-file=dev.tfvars -out=tfplan
# Output shows: + create  ~ update  - destroy
# ALWAYS review this before applying in prod

# Step 3: Apply the plan
terraform apply tfplan
# or interactively: terraform apply -var-file=dev.tfvars

# Step 4: Tear down (only for dev/test environments)
terraform destroy -var-file=dev.tfvars

# Import an existing resource into Terraform state
# (for resources created manually before Terraform was adopted)
terraform import aws_s3_bucket.data_lake company-datalake-dev

# Show current state
terraform show
terraform state list
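
When the plan step is driven from a script (for example a pre-merge check), the -detailed-exitcode flag lets the script tell "no changes" apart from "changes pending": terraform plan then exits with 0 for no changes, 1 for an error, and 2 when changes are present. The wrapper below is a minimal sketch; `run_plan` and `interpret_plan_exit_code` are hypothetical helper names, and `run_plan` assumes the terraform binary is on the PATH.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`
PLAN_OUTCOMES = {0: "no-changes", 1: "error", 2: "changes-present"}


def interpret_plan_exit_code(code):
    """Map a plan exit code to a readable outcome; unknown codes count as errors."""
    return PLAN_OUTCOMES.get(code, "error")


def run_plan(var_file, workdir="."):
    """Run terraform plan and return the interpreted outcome."""
    result = subprocess.run(
        [
            "terraform", "plan",
            "-detailed-exitcode",
            "-input=false",
            f"-var-file={var_file}",
            "-out=tfplan",
        ],
        cwd=workdir,
    )
    return interpret_plan_exit_code(result.returncode)


# Example (needs an initialised Terraform directory):
# if run_plan("dev.tfvars") == "changes-present":
#     print("Review tfplan before applying")
```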

⚙️

Provisioning Data Engineering Resources

S3 Bucket — Versioning, Lifecycle, Encryption, Policy

A production data lake S3 bucket needs versioning (for accidental deletion recovery), lifecycle rules (to auto-transition old data to cheaper storage classes), encryption (SSE-KMS), and a bucket policy that denies non-HTTPS access. Here is the complete Terraform definition.

hcl — production S3 data lake bucket

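# Shared tags referenced as local.common_tags by several resources in this section
locals {
  common_tags = {
    Environment = terraform.workspace
    Project     = var.project_name
    ManagedBy   = "Terraform"
  }
}

resource "aws_s3_bucket" "data_lake" {
  bucket = "company-datalake-${terraform.workspace}"
  tags   = local.common_tags
}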

# Block all public access
resource "aws_s3_bucket_public_access_block" "data_lake" {
  bucket                  = aws_s3_bucket.data_lake.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Enable versioning
resource "aws_s3_bucket_versioning" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id
  versioning_configuration { status = "Enabled" }
}

# SSE-KMS encryption
resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.data_lake.arn
    }
    bucket_key_enabled = true   # S3 Bucket Keys cut KMS request costs (AWS cites up to 99%)
  }
}

# Lifecycle: move old data to cheaper storage
resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    id     = "bronze-lifecycle"
    status = "Enabled"
    filter { prefix = "bronze/" }

    transition {
      days          = 90
      storage_class = "STANDARD_IA"   # after 90 days
    }
    transition {
      days          = 365
      storage_class = "GLACIER"        # after 1 year
    }
    expiration { days = 2555 }           # expire after ~7 years (on a versioned bucket this adds a delete marker)
  }

  rule {
    id     = "abort-incomplete-multipart"
    status = "Enabled"
    filter {}
    abort_incomplete_multipart_upload { days_after_initiation = 7 }
  }
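}

# Bucket policy: deny any request that is not made over HTTPS
resource "aws_s3_bucket_policy" "data_lake_tls_only" {
  bucket = aws_s3_bucket.data_lake.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "DenyInsecureTransport"
      Effect    = "Deny"
      Principal = "*"
      Action    = "s3:*"
      Resource = [
        aws_s3_bucket.data_lake.arn,
        "${aws_s3_bucket.data_lake.arn}/*"
      ]
      Condition = { Bool = { "aws:SecureTransport" = "false" } }
    }]
  })
}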
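
To sanity-check lifecycle thresholds before applying them, it can be handy to model the rule as a small function and try a few object ages. The sketch below mirrors the bronze-lifecycle rule (90, 365 and 2555 days). It is a simplification: S3 evaluates lifecycle rules in daily batches, so the real transition can lag the threshold, and `storage_state` is a hypothetical helper name.

```python
# Thresholds copied from the bronze-lifecycle rule, oldest first
BRONZE_LIFECYCLE = [
    (2555, "EXPIRED"),
    (365, "GLACIER"),
    (90, "STANDARD_IA"),
]


def storage_state(age_days, rules=BRONZE_LIFECYCLE):
    """Return the storage class an object of the given age should be in."""
    for threshold, state in rules:
        if age_days >= threshold:
            return state
    return "STANDARD"


for age in (10, 90, 400, 2555):
    print(age, "->", storage_state(age))
```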

IAM Roles — Glue, EMR, Lambda Execution Roles

Every AWS service (Glue, EMR, Lambda) needs an IAM execution role that grants it permission to access S3, CloudWatch, Secrets Manager, etc. Here are the three most common roles a Data Engineer provisions.

hcl — IAM roles for Glue, EMR, Lambda

# ── Glue ETL Execution Role ──────────────────────────────
resource "aws_iam_role" "glue_role" {
  name = "GlueETLRole-${terraform.workspace}"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "glue.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "glue_service" {
  role       = aws_iam_role.glue_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}

resource "aws_iam_role_policy" "glue_s3" {
  name   = "GlueS3Access"
  role   = aws_iam_role.glue_role.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"]
        Resource = [
          aws_s3_bucket.data_lake.arn,
          "${aws_s3_bucket.data_lake.arn}/*"
        ]
      },
      {
        Effect   = "Allow"
        Action   = ["secretsmanager:GetSecretValue"]
        Resource = "arn:aws:secretsmanager:*:*:secret:company/*"
      }
    ]
  })
}

# ── Lambda Execution Role ─────────────────────────────────
resource "aws_iam_role" "lambda_role" {
  name = "LambdaDataPipelineRole-${terraform.workspace}"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "lambda_basic" {
  role       = aws_iam_role.lambda_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

# ── EMR Service Role (the EC2 instance profile / job flow role is defined separately) ──
resource "aws_iam_role" "emr_service_role" {
  name = "EMRServiceRole-${terraform.workspace}"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "elasticmapreduce.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}
resource "aws_iam_role_policy_attachment" "emr_service" {
  role       = aws_iam_role.emr_service_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole"
}

Glue Resources — Database, Crawler, Job

Provision the entire Glue stack — catalog database, crawler that discovers S3 data, and ETL jobs — as Terraform resources. This means adding a new pipeline is a git commit + terraform apply, not a console session.

hcl — Glue catalog database, crawler, and ETL job

# Glue Catalog Databases
resource "aws_glue_catalog_database" "bronze" {
  name = "bronze_layer_${terraform.workspace}"
}
resource "aws_glue_catalog_database" "silver" {
  name = "silver_layer_${terraform.workspace}"
}
resource "aws_glue_catalog_database" "gold" {
  name = "gold_layer_${terraform.workspace}"
}

# Glue Crawler — discovers schema from S3 Bronze path
resource "aws_glue_crawler" "bronze_orders" {
  name          = "bronze-orders-crawler-${terraform.workspace}"
  role          = aws_iam_role.glue_role.arn
  database_name = aws_glue_catalog_database.bronze.name

  s3_target {
    path = "s3://${aws_s3_bucket.data_lake.bucket}/bronze/rds/orders/"
  }

  schedule     = "cron(0 6 * * ? *)"     # daily at 06:00 UTC
  table_prefix = "rds_"

  schema_change_policy {
    update_behavior = "UPDATE_IN_DATABASE"
    delete_behavior = "LOG"
  }
}

# Glue ETL Job
resource "aws_glue_job" "orders_bronze_etl" {
  name              = "orders-bronze-etl-${terraform.workspace}"
  role_arn          = aws_iam_role.glue_role.arn
  glue_version      = "4.0"
  worker_type       = "G.1X"
  number_of_workers = 10

  command {
    script_location = "s3://${aws_s3_bucket.data_lake.bucket}/scripts/orders_bronze.py"
    python_version  = "3"
  }

  default_arguments = {
    "--SOURCE_TABLE"     = "orders"
    "--TARGET_PATH"      = "s3://${aws_s3_bucket.data_lake.bucket}/bronze/rds/orders/"
    "--ENVIRONMENT"      = terraform.workspace
    "--enable-metrics"   = "true"
    "--job-bookmark-option" = "job-bookmark-enable"
  }

  execution_property { max_concurrent_runs = 1 }
}

Lambda Function Deployment

Terraform can deploy a Lambda function from a ZIP file, attach its IAM role, and wire up its S3 trigger — all in one terraform apply. This is the standard pattern for deploying pipeline trigger Lambdas.

hcl — Lambda function with S3 trigger

# Package the Lambda code into a ZIP
data "archive_file" "lambda_zip" {
  type        = "zip"
  source_dir  = "${path.module}/lambda_src/"
  output_path = "${path.module}/lambda.zip"
}

# Deploy the Lambda function
resource "aws_lambda_function" "pipeline_trigger" {
  function_name    = "data-pipeline-trigger-${terraform.workspace}"
  role             = aws_iam_role.lambda_role.arn
  handler          = "handler.lambda_handler"
  runtime          = "python3.12"
  filename         = data.archive_file.lambda_zip.output_path
  source_code_hash = data.archive_file.lambda_zip.output_base64sha256

  timeout     = 300    # 5 minutes
  memory_size = 512

  environment {
    variables = {
      ENVIRONMENT  = terraform.workspace
      GLUE_JOB     = aws_glue_job.orders_bronze_etl.name
      AUDIT_TABLE  = aws_dynamodb_table.pipeline_audit.name
    }
  }
}

# Allow S3 to invoke the Lambda
resource "aws_lambda_permission" "s3_trigger" {
  statement_id  = "AllowS3Invoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.pipeline_trigger.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.data_lake.arn
}

# Wire up S3 event notification → Lambda on new file arrival
resource "aws_s3_bucket_notification" "landing_trigger" {
  bucket = aws_s3_bucket.data_lake.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.pipeline_trigger.arn
    events              = ["s3:ObjectCreated:*"]
    filter_prefix       = "landing/"
    filter_suffix       = ".parquet"
  }

  depends_on = [aws_lambda_permission.s3_trigger]
}
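
For context, here is a minimal sketch of what the handler.py referenced by handler = "handler.lambda_handler" might contain. It is an assumption about the function body, not the course's actual code: the --SOURCE_PATH job argument is hypothetical, and the Lambda role would need glue:StartJobRun in addition to the basic execution policy. Object keys in S3 event notifications are URL-encoded, so they are decoded before use.

```python
import os
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded (spaces as '+', '=' as '%3D', and so on)
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    # Imported lazily so the parsing helper can be tested without boto3
    import boto3

    glue = boto3.client("glue")
    run_ids = []
    for bucket, key in extract_s3_objects(event):
        response = glue.start_job_run(
            JobName=os.environ["GLUE_JOB"],
            # Hypothetical job argument carrying the file that arrived
            Arguments={"--SOURCE_PATH": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return {"started_runs": run_ids}
```

Because the Glue job is limited to one concurrent run, several files arriving together would make later start_job_run calls fail; a production handler would typically catch that error or queue the work.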

DynamoDB Table for Pipeline Metadata

Every data platform needs a DynamoDB table for pipeline audit logs and watermark tracking. Here is the complete Terraform definition with a GSI on pipeline_name for efficient queries by pipeline.

hcl — DynamoDB pipeline audit table

resource "aws_dynamodb_table" "pipeline_audit" {
  name         = "pipeline-audit-${terraform.workspace}"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "run_id"

  attribute {
    name = "run_id"
    type = "S"
  }
  attribute {
    name = "pipeline_name"
    type = "S"
  }
  attribute {
    name = "run_time"
    type = "S"
  }

  global_secondary_index {
    name            = "pipeline-name-index"
    hash_key        = "pipeline_name"
    range_key       = "run_time"
    projection_type = "ALL"
  }

  ttl {
    attribute_name = "expires_at"
    enabled        = true           # DynamoDB deletes an item once its expires_at (epoch seconds, set by the writer) has passed
  }

  point_in_time_recovery { enabled = true }
  tags = local.common_tags
}

# Watermark table — tracks incremental processing state
resource "aws_dynamodb_table" "pipeline_watermarks" {
  name         = "pipeline-watermarks-${terraform.workspace}"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "pipeline_id"

  attribute {
    name = "pipeline_id"
    type = "S"
  }
  tags = local.common_tags
}
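
The TTL setting only tells DynamoDB which attribute to watch; the retention period itself is decided by whoever writes the item. The sketch below shows one way a pipeline could build an audit record whose expires_at is 90 days after the run. The helper name `build_audit_item` and the status field are assumptions made for this illustration.

```python
import time
import uuid

AUDIT_RETENTION_DAYS = 90
SECONDS_PER_DAY = 86400


def build_audit_item(pipeline_name, status, now_epoch=None):
    """Build an audit record matching the pipeline-audit table's key schema."""
    now_epoch = int(time.time()) if now_epoch is None else int(now_epoch)
    return {
        "run_id": str(uuid.uuid4()),                      # table hash key
        "pipeline_name": pipeline_name,                   # GSI hash key
        "run_time": time.strftime(                        # GSI range key (ISO 8601, UTC)
            "%Y-%m-%dT%H:%M:%SZ", time.gmtime(now_epoch)
        ),
        "status": status,
        # TTL attribute: must be a Number holding epoch seconds
        "expires_at": now_epoch + AUDIT_RETENTION_DAYS * SECONDS_PER_DAY,
    }


item = build_audit_item("orders-bronze-etl", "SUCCEEDED", now_epoch=1700000000)
print(item["run_time"], item["expires_at"])
# With credentials: boto3.resource("dynamodb").Table(table_name).put_item(Item=item)
```

Storing run_time as an ISO 8601 string keeps the GSI range key sortable, so a query on pipeline_name can return the most recent runs first.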

VPC, Subnets, Security Groups, VPC Endpoints

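EMR and RDS run inside a VPC, and Glue jobs join it through a Glue connection when they need to reach private resources. VPC Endpoints are critical: they keep traffic between Glue/EMR and S3/Secrets Manager on the AWS private network, not the public internet. Here is the essential VPC configuration for a data platform. The route table (aws_route_table.private) and the endpoint security group (aws_security_group.endpoints) referenced below are assumed to be defined elsewhere.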

hcl — VPC with private subnets and S3/Secrets Manager endpoints

resource "aws_vpc" "data_platform" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags = { Name = "data-platform-${terraform.workspace}" }
}

resource "aws_subnet" "private_a" {
  vpc_id            = aws_vpc.data_platform.id
  cidr_block        = "10.0.1.0/24"
  availability_zone = "us-east-1a"
  tags = { Name = "private-a-${terraform.workspace}", Tier = "Private" }
}

resource "aws_subnet" "private_b" {
  vpc_id            = aws_vpc.data_platform.id
  cidr_block        = "10.0.2.0/24"
  availability_zone = "us-east-1b"
  tags = { Name = "private-b-${terraform.workspace}", Tier = "Private" }
}

# S3 Gateway Endpoint — free, keeps S3 traffic private
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.data_platform.id
  service_name = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
  tags = { Name = "s3-endpoint" }
}

# Secrets Manager Interface Endpoint — keeps secrets retrieval private
resource "aws_vpc_endpoint" "secretsmanager" {
  vpc_id              = aws_vpc.data_platform.id
  service_name        = "com.amazonaws.us-east-1.secretsmanager"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [aws_subnet.private_a.id, aws_subnet.private_b.id]
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
  tags = { Name = "secretsmanager-endpoint" }
}

# Glue security group — allow Glue jobs to reach RDS, MSK, S3 endpoint
resource "aws_security_group" "glue" {
  name   = "glue-sg-${terraform.workspace}"
  vpc_id = aws_vpc.data_platform.id

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Glue requires self-referencing rule for Spark driver↔executor communication
  ingress {
    from_port = 0
    to_port   = 65535
    protocol  = "tcp"
    self      = true
  }
}

CloudWatch Alarms and Dashboard

Provision monitoring resources in Terraform so every environment gets the same alarms automatically — no manual alarm creation in console. Here is a Glue job failure alarm wired to SNS.

hcl — CloudWatch alarm for Glue job failure → SNS alert

# SNS topic for pipeline alerts
resource "aws_sns_topic" "pipeline_alerts" {
  name = "pipeline-alerts-${terraform.workspace}"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.pipeline_alerts.arn
  protocol  = "email"
  endpoint  = "data-team@company.com"
}

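# CloudWatch alarm: failed Spark tasks > 0 for the orders job
# (an early warning: a task can fail and be retried without failing the whole run)
resource "aws_cloudwatch_metric_alarm" "glue_job_failure" {
  alarm_name          = "glue-job-failure-${terraform.workspace}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "glue.driver.aggregate.numFailedTasks"
  namespace           = "Glue"
  period              = 300
  statistic           = "Sum"
  threshold           = 0

  # Glue job metrics are published per job; without dimensions the alarm matches no data
  dimensions = {
    JobName  = aws_glue_job.orders_bronze_etl.name
    JobRunId = "ALL"
    Type     = "count"
  }

  alarm_description   = "Glue job has failed tasks"
  alarm_actions       = [aws_sns_topic.pipeline_alerts.arn]
  ok_actions          = [aws_sns_topic.pipeline_alerts.arn]
  treat_missing_data  = "notBreaching"
}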
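
A failed-task metric is not quite the same thing as a failed job run. A complementary way to catch whole-run failures is an EventBridge rule on the "Glue Job State Change" event. The sketch below only builds the event pattern; the helper name `glue_failure_event_pattern` is hypothetical, and wiring the rule to the SNS topic is left out.

```python
import json


def glue_failure_event_pattern(job_names):
    """Event pattern matching unsuccessful terminal states for the given Glue jobs."""
    return {
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {
            "jobName": list(job_names),
            "state": ["FAILED", "TIMEOUT", "STOPPED"],
        },
    }


pattern = glue_failure_event_pattern(["orders-bronze-etl-prod"])
print(json.dumps(pattern, indent=2))
# With credentials, the JSON string can be passed as EventPattern to
# boto3.client("events").put_rule(Name=..., EventPattern=json.dumps(pattern))
```

The same pattern can be placed in a Terraform aws_cloudwatch_event_rule via jsonencode, which keeps the alerting definition alongside the rest of the stack.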

🔄

CI/CD with Terraform: GitHub Actions Pipeline

The Production Terraform CI/CD Pattern

In production, nobody runs terraform apply locally against the prod account. Instead, all Terraform changes go through a pull request workflow: PR opened → GitHub Actions runs terraform plan → human reviews the plan → PR merged → GitHub Actions runs terraform apply. This prevents unreviewed changes from hitting production infrastructure.

TERRAFORM CI/CD WORKFLOW

Developer makes infra change (e.g. add a new Glue job)
        │
        ▼
git push → opens Pull Request
        │
        ▼
GitHub Actions (PR trigger)
├── terraform fmt -check        (formatting gate)
├── terraform validate          (syntax check)
└── terraform plan -out=tfplan  (post plan as PR comment)
        │
        ▼
Human reviews PR + plan output
→ sees: + aws_glue_job.new_orders_gold_etl will be created
→ approves PR
        │
        ▼
GitHub Actions (merge trigger)
└── terraform apply tfplan      (applies the approved plan)
        │
        ▼
AWS: Glue job created in prod account ✅

GitHub Actions Terraform Workflow File

yaml — .github/workflows/terraform.yml

name: Terraform Data Platform

on:
  pull_request:
    paths: ["terraform/**"]
  push:
    branches: [main]
    paths: ["terraform/**"]

env:
  TF_VERSION: "1.7.0"
  AWS_REGION: "us-east-1"
  TF_WORKSPACE: "prod"   # or use branch-based logic

jobs:
  terraform:
    runs-on: ubuntu-latest
    permissions:
      id-token: write    # for OIDC auth to AWS — no static keys needed
      contents: read
      pull-requests: write

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC — no stored keys)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/GitHubActionsRole
          aws-region: ${{ env.AWS_REGION }}

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Terraform Init
        working-directory: terraform/
        run: terraform init

      - name: Terraform Format Check
        working-directory: terraform/
        run: terraform fmt -check -recursive

      - name: Terraform Validate
        working-directory: terraform/
        run: terraform validate

      - name: Terraform Plan (on PR)
        if: github.event_name == 'pull_request'
        working-directory: terraform/
        run: |
          # TF_WORKSPACE (set in env above) already selects the workspace;
          # "terraform workspace select" fails while that variable is set.
          set -o pipefail   # fail the step if plan fails, even though output is piped to tee
          terraform plan -var-file=prod.tfvars -out=tfplan -no-color 2>&1 | tee plan.txt
          echo "## Terraform Plan" >> "$GITHUB_STEP_SUMMARY"
          cat plan.txt >> "$GITHUB_STEP_SUMMARY"

      - name: Terraform Apply (on merge to main)
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        working-directory: terraform/
        run: |
          # This runs in a separate workflow run, so the tfplan file from the PR
          # is not available here: the command re-plans against main and applies.
          terraform apply -var-file=prod.tfvars -auto-approve

🔑 Key Point

Use OIDC authentication (the id-token: write permission plus configure-aws-credentials with role-to-assume) instead of storing AWS access keys as GitHub secrets. OIDC issues short-lived credentials for each workflow run, so there is no long-lived key to leak or rotate.
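For role-to-assume to work, the IAM role needs a trust policy that accepts GitHub's OIDC tokens. The following is a minimal sketch of that policy document built in Python. It assumes the GitHub OIDC identity provider is already registered in the account; the helper name, the account ID, and the repository name are placeholders, not values from this module.

```python
import json

GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id: str, repo: str, ref_pattern: str = "*") -> dict:
    """Build an IAM trust policy that lets workflows in one repo assume the role.

    ref_pattern narrows which runs are allowed: "*" accepts every run in the
    repo, "ref:refs/heads/main" would accept only runs on the main branch.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                # Token must be issued for AWS STS
                "StringEquals": {f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com"},
                # Token must come from this repository
                "StringLike": {f"{GITHUB_OIDC_HOST}:sub": f"repo:{repo}:{ref_pattern}"},
            },
        }],
    }


# Placeholder account and repository, for illustration only
policy = github_oidc_trust_policy("123456789012", "example-org/data-platform")
print(json.dumps(policy, indent=2))
```

Restricting the sub condition to a single repository matters: without it, any GitHub repository could request credentials for the role.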

Importing Existing Resources into Terraform State

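If your team created resources manually before adopting Terraform, you can bring them under Terraform management with terraform import. Write a matching resource block in your configuration first (for example, resource "aws_s3_bucket" "data_lake"), then run the import to record the real resource in state. From then on, Terraform includes the resource in plans and applies without recreating it.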

bash — importing existing AWS resources into Terraform state

# Import an existing S3 bucket
terraform import aws_s3_bucket.data_lake company-datalake-prod

# Import an existing Glue job
terraform import aws_glue_job.orders_bronze_etl orders-bronze-etl-prod

# Import an existing DynamoDB table
terraform import aws_dynamodb_table.pipeline_audit pipeline-audit-prod

# Import an existing Lambda function
terraform import aws_lambda_function.pipeline_trigger data-pipeline-trigger-prod

# Import an existing IAM role
terraform import aws_iam_role.glue_role GlueETLRole-prod

# After import, run plan to confirm zero drift
terraform plan
# Goal: "No changes. Your infrastructure matches the configuration."

📋 29.22 Summary — Terraform for Data Engineers

IaC with Terraform makes infrastructure repeatable, version-controlled, and reviewable — essential for maintaining identical dev/staging/prod environments.
Remote state in S3 + DynamoDB lock table is mandatory for team use — never store state locally in a shared project.
Workspaces let one set of .tf files manage multiple environments with workspace-aware resource naming.
Modules are reusable infrastructure components — write a glue_job module once, call it for every pipeline.
Every DE should be able to provision these with Terraform: S3 buckets (versioning, lifecycle, encryption), IAM roles (Glue, EMR, Lambda), Glue resources (database, crawler, job), Lambda (with S3 trigger), DynamoDB (audit + watermark tables), VPC + endpoints, CloudWatch alarms.
CI/CD pattern: PR → terraform plan as PR comment → human review → merge → terraform apply. Use OIDC for AWS auth — no stored keys.
terraform import brings manually-created resources under Terraform management without recreating them.

29.23

Data Quality Engineering

Data Quality (DQ) is not an afterthought — it is a first-class engineering concern built into every layer of the pipeline. Bad data that silently flows into Gold tables destroys trust in the entire data platform. A senior Data Engineer builds DQ checks at ingestion, validates between layers, quarantines bad records, and publishes DQ metrics to dashboards so the team can catch problems before analysts do.

🎯

DQ at Pipeline Level — Not an Afterthought

Where DQ Checks Belong in the Pipeline

DQ checks must run at every layer transition — not just at the end. Catching a problem at the Bronze → Silver boundary is far cheaper than discovering it after Gold tables have been built on top of bad Silver data. The rule is: validate early, validate often, never let bad data silently pass through.

DQ GATES IN A MEDALLION PIPELINE

Source (RDS / Kafka / Files)
        │
        ▼
── DQ Gate 1 ──────────────────────────────────────
│  Schema check: expected columns present?
│  Row count check: did we get any rows at all?
│  Not-null check: primary keys are non-null?
        ▼
Bronze Layer (raw, minimal transform)
        │
        ▼
── DQ Gate 2 ──────────────────────────────────────
│  Null rate check: critical columns < 5% null?
│  Duplicate check: no duplicate primary keys?
│  Range check: amounts > 0 and < 10,000,000?
│  Referential integrity: customer_id exists in customers?
│  Row count reconciliation: bronze rows == source rows?
        ▼
Silver Layer (cleaned, deduplicated)
        │
        ▼
── DQ Gate 3 ──────────────────────────────────────
│  Business rule check: revenue = qty * unit_price?
│  Freshness check: max(updated_at) within last 24h?
│  Threshold check: row count within ±20% of yesterday?
        ▼
Gold Layer (aggregated, business-ready)
        │
        ▼
BI Dashboards / Reports
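The three Gate 3 checks reduce to simple arithmetic. The sketch below works on plain Python rows rather than Spark DataFrames so the logic is easy to follow; the function names, the sample rows, and the 0.01 rounding tolerance are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def business_rule_violations(rows, tolerance=0.01):
    """Rows where revenue differs from qty * unit_price by more than tolerance."""
    return [
        r for r in rows
        if abs(r["revenue"] - r["qty"] * r["unit_price"]) > tolerance
    ]


def is_fresh(rows, now, max_age=timedelta(hours=24)):
    """True if the newest updated_at is within max_age of now."""
    newest = max(r["updated_at"] for r in rows)
    return now - newest <= max_age


def volume_within_threshold(today_count, yesterday_count, max_change=0.20):
    """True if today's row count is within ±max_change of yesterday's."""
    if yesterday_count == 0:
        return today_count == 0
    return abs(today_count - yesterday_count) / yesterday_count <= max_change


# Hypothetical Silver rows for illustration
now = datetime(2024, 6, 2, 12, 0, tzinfo=timezone.utc)
rows = [
    {"qty": 2, "unit_price": 10.0, "revenue": 20.0, "updated_at": now - timedelta(hours=3)},
    {"qty": 3, "unit_price": 5.0,  "revenue": 20.0, "updated_at": now - timedelta(hours=30)},
]

print(len(business_rule_violations(rows)), "business rule violation(s)")
print("fresh:", is_fresh(rows, now))
print("volume ok:", volume_within_threshold(1150, 1000))
```

In a real pipeline the same comparisons would run as Spark aggregations, with yesterday's count read from the audit table.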

⚠️ Warning

A DQ failure at Gate 2 should stop the pipeline and route bad records to a quarantine path — it should never silently pass through and corrupt Silver. The pipeline either succeeds cleanly or fails loudly. Silent corruption is the worst outcome.
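The "succeed cleanly or fail loudly" rule can be sketched as a small gate runner: every check runs, and any failure raises instead of returning quietly. This is a plain-Python illustration with hypothetical names (DataQualityError, run_gate, GATE_1); it is not taken from any library.

```python
class DataQualityError(Exception):
    """Raised when a DQ gate fails, so the orchestrator stops downstream tasks."""


def run_gate(gate_name, rows, checks):
    """Run every check in the gate; raise if any of them fails.

    checks maps a check name to a function that takes the rows and returns a bool.
    """
    failures = [name for name, check in checks.items() if not check(rows)]
    if failures:
        raise DataQualityError(f"{gate_name} failed: {', '.join(failures)}")
    return True


EXPECTED_COLUMNS = {"order_id", "customer_id", "amount"}

# Gate 1 from the diagram: schema, row count, and non-null primary key
GATE_1 = {
    "row_count":   lambda rows: len(rows) > 0,
    "schema":      lambda rows: all(EXPECTED_COLUMNS.issubset(row) for row in rows),
    "pk_not_null": lambda rows: all(row.get("order_id") is not None for row in rows),
}

good_rows = [{"order_id": "o-1", "customer_id": "c-1", "amount": 25.0}]
bad_rows  = [{"order_id": None,  "customer_id": "c-2", "amount": 10.0}]

run_gate("DQ Gate 1", good_rows, GATE_1)
try:
    run_gate("DQ Gate 1", bad_rows, GATE_1)
except DataQualityError as err:
    print(err)
```

Collecting all failures before raising gives one complete error message per run, rather than one failure per rerun.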

✅

Great Expectations (GX)

What is Great Expectations?

Great Expectations (GX) is one of the most widely adopted open-source Python DQ frameworks. You write Expectations: assertions about what your data should look like (column X should never be null, column Y should be between 0 and 100, etc.). GX evaluates those expectations against your actual data and produces a validation result with a pass/fail per expectation and summary statistics. It integrates with Pandas, Spark, and SQL databases.

🗄️ Analogy

Great Expectations is like unit tests for your data. Just as pytest checks that your code behaves correctly, GX checks that your data meets its contract. An expectation like expect_column_values_to_not_be_null("order_id") is the data equivalent of assert order_id is not None.
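To make the analogy concrete, here is a hand-rolled stand-in for a not-null expectation over plain Python rows. It is not the GX API: it only imitates the general shape of a validation result (a success flag plus counts of unexpected values), and the field names in the returned dictionary are illustrative.

```python
def expect_column_values_to_not_be_null(rows, column):
    """Toy expectation: report which rows have a null (None) in `column`."""
    unexpected = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {
        "success": not unexpected,
        "result": {
            "element_count": len(rows),
            "unexpected_count": len(unexpected),
            "unexpected_index_list": unexpected,
        },
    }


orders = [
    {"order_id": "o-1", "amount": 10.0},
    {"order_id": None,  "amount": 12.5},
    {"order_id": "o-3", "amount": 7.0},
]

outcome = expect_column_values_to_not_be_null(orders, "order_id")
print(outcome["success"], outcome["result"]["unexpected_count"])
```

Unlike a bare assert, the result says how many rows failed and where, which is what makes a data test useful for debugging.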

Expectation Suites — Defining DQ Rules

An Expectation Suite is a named collection of expectations for a specific dataset. You define one suite per table or dataset. Suites are stored as JSON files and versioned alongside your pipeline code.

python — defining a Great Expectations suite for an orders dataset

# Written against the GX 0.18.x API, which the Validator and Checkpoint
# examples in this section also use. GX 1.x changed the suite and checkpoint APIs.
import great_expectations as gx
from great_expectations.core import ExpectationConfiguration

context = gx.get_context()   # loads GX config from gx/ directory

# Create (or overwrite) an expectation suite for the orders Bronze table
suite = context.add_or_update_expectation_suite(
    expectation_suite_name="orders_bronze_suite"
)

def expect(expectation_type: str, **kwargs):
    """Add one expectation to the suite."""
    suite.add_expectation(
        ExpectationConfiguration(expectation_type=expectation_type, kwargs=kwargs)
    )

# ── Schema expectations ──────────────────────────────────
expect("expect_table_columns_to_match_ordered_list",
       column_list=["order_id", "customer_id", "amount", "status", "created_at"])

# ── Not-null expectations ────────────────────────────────
for column in ("order_id", "customer_id", "amount"):
    expect("expect_column_values_to_not_be_null", column=column)

# ── Uniqueness expectation ───────────────────────────────
expect("expect_column_values_to_be_unique", column="order_id")

# ── Value range expectation ──────────────────────────────
expect("expect_column_values_to_be_between",
       column="amount", min_value=0.01, max_value=1_000_000)

# ── Allowed values expectation ───────────────────────────
expect("expect_column_values_to_be_in_set",
       column="status",
       value_set=["PENDING", "CONFIRMED", "SHIPPED", "DELIVERED", "CANCELLED"])

# ── Row count expectation ────────────────────────────────
expect("expect_table_row_count_to_be_between",
       min_value=1000, max_value=10_000_000)

# Persist the populated suite to the expectations store
context.add_or_update_expectation_suite(expectation_suite=suite)
print("✅ Expectation suite saved: orders_bronze_suite")

Validators — Running Expectations Against Spark DataFrames

A Validator connects an Expectation Suite to an actual dataset (Pandas DataFrame, Spark DataFrame, or SQL table) and runs the validations. The result tells you which expectations passed and which failed, with detailed statistics per column.

python — validating a Spark DataFrame with Great Expectations

import great_expectations as gx
from pyspark.sql import SparkSession

spark   = SparkSession.builder.appName("gx-validation").getOrCreate()
context = gx.get_context()

# Load the data to validate
df = spark.read.parquet("s3://company-lake/bronze/rds/orders/")

# Convert Spark DF to a GX Spark Datasource batch
datasource = context.sources.add_or_update_spark("spark_datasource")
data_asset = datasource.add_dataframe_asset("orders_asset")
batch_request = data_asset.build_batch_request(dataframe=df)

# Run validation
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="orders_bronze_suite"
)
results = validator.validate()

# Check overall pass/fail
if results.success:
    print("✅ All DQ checks passed — proceeding to Silver transform")
else:
    failed = [r for r in results.results if not r.success]
    print(f"❌ {len(failed)} DQ checks FAILED — stopping pipeline")
    for r in failed:
        print(f"  FAILED: {r.expectation_config.expectation_type}"
              f" | {r.result}")
    raise ValueError("DQ validation failed — pipeline halted")

Checkpoints — Integrating GX into the Pipeline

A Checkpoint bundles a batch request + expectation suite + actions into a single reusable object. Actions define what happens after validation — save data docs to S3, send a Slack alert on failure, update a DQ score in DynamoDB. Checkpoints are the right way to run GX in production pipelines and Airflow DAGs.

python — creating and running a GX Checkpoint in a pipeline

import great_expectations as gx

context = gx.get_context()

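# batch_request is the one built in the Validator example (datasource + data asset).
# Create a checkpoint with post-validation actions
checkpoint = context.add_or_update_checkpoint(
    name="orders_bronze_checkpoint",
    validations=[{
        "batch_request":          batch_request,
        "expectation_suite_name": "orders_bronze_suite"
    }],
    action_list=[
        # Persist the validation result to the configured store
        {
            "name": "store_validation_result",
            "action": {"class_name": "StoreValidationResultAction"}
        },
        # Rebuild the HTML Data Docs (published to S3 when an S3 site is configured)
        {
            "name": "update_data_docs",
            "action": {"class_name": "UpdateDataDocsAction"}
        },
        # Slack notification on failure
        {
            "name": "slack_on_failure",
            "action": {
                "class_name": "SlackNotificationAction",
                # Placeholder URL; load the real webhook from Secrets Manager
                "slack_webhook": "https://hooks.slack.com/services/XXX/YYY/ZZZ",
                "notify_on": "failure",
                # The Slack action needs a renderer to format the message
                "renderer": {
                    "module_name": "great_expectations.render.renderer.slack_renderer",
                    "class_name": "SlackRenderer"
                }
            }
        }
    ]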
)

# Run the checkpoint — returns CheckpointResult
result = context.run_checkpoint("orders_bronze_checkpoint")

if not result.success:
    raise ValueError("DQ Checkpoint failed — pipeline halted")

Data Docs — Publishing DQ Results to S3

Data Docs are auto-generated HTML reports that show every expectation, its result (pass/fail), and statistics. They are published to S3 after each checkpoint run and hosted as a static website, giving the data team a live dashboard of DQ health across all datasets without any extra tooling.

yaml — great_expectations.yml: configure S3 as data docs store

# great_expectations/great_expectations.yml
data_docs_sites:
  s3_site:
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: company-dq-reports
      prefix: data-docs/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder

# After UpdateDataDocsAction runs, open:
# https://company-dq-reports.s3.amazonaws.com/data-docs/index.html
# → shows all suites, all checkpoint runs, pass/fail history

Integrating GX Checkpoints in Airflow DAGs

In production, GX checkpoints run as a PythonOperator task in the Airflow DAG — positioned between the extraction task (Bronze write) and the transformation task (Silver write). If the checkpoint fails, Airflow marks the task as failed and stops the downstream Silver/Gold tasks automatically.

python — Airflow DAG with GX checkpoint between Bronze and Silver

from datetime import datetime

import boto3
import great_expectations as gx
from airflow.decorators import dag, task

# `schedule` needs Airflow 2.4+; older releases use schedule_interval instead.
# A fixed start_date replaces the deprecated days_ago() helper.
@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def orders_pipeline():

    @task
    def extract_to_bronze():
        # Glue job or Spark job writes raw data to Bronze S3
        glue = boto3.client("glue")
        run  = glue.start_job_run(JobName="orders-bronze-etl")
        # poll until complete (omitted for brevity)
        return run["JobRunId"]

    @task
    def validate_bronze():
        """Run GX checkpoint — fails task if DQ checks don't pass."""
        context = gx.get_context()
        result  = context.run_checkpoint("orders_bronze_checkpoint")
        if not result.success:
            raise ValueError("Bronze DQ validation failed — Silver transform blocked")
        print("✅ Bronze DQ passed")

    @task
    def transform_to_silver():
        # Only runs if validate_bronze task succeeded
        # (boto3 is imported at module level so every task can use it)
        glue = boto3.client("glue")
        glue.start_job_run(JobName="orders-silver-transform")

    # DAG structure: Bronze → DQ Gate → Silver
    bronze = extract_to_bronze()
    dq     = validate_bronze()
    silver = transform_to_silver()

    bronze >> dq >> silver

orders_pipeline()

☁️

AWS Glue Data Quality

What is Glue Data Quality?

AWS Glue Data Quality is a native AWS service that lets you define DQ rules using a declarative language called DQDL (Data Quality Definition Language) and attach them directly to Glue ETL jobs or run them standalone against Glue Catalog tables. It produces an overall DQ score (returned by the API as a fraction from 0.0 to 1.0, usually read as a percentage) and a per-rule pass/fail result, all stored in the Glue Catalog without any external framework to install.

🔑 Key Point

Glue Data Quality is the right choice when your entire pipeline is AWS-native (Glue + S3) and you don't want to install and manage a third-party framework like Great Expectations. For multi-cloud or Databricks/EMR environments, Great Expectations is more portable.

Ruleset Definition — DQDL Rules

DQDL rules cover four categories: Completeness (not-null rate), Uniqueness (no duplicates), Freshness (data is recent enough), and Accuracy (values in expected range or set). You write rules in a simple declarative syntax.

python — creating a Glue Data Quality ruleset via boto3

import boto3

glue = boto3.client("glue")

# DQDL ruleset — define rules for the orders Bronze table
RULESET = """
Rules = [
    # Completeness — critical columns must be 100% populated
    IsComplete "order_id",
    IsComplete "customer_id",
    IsComplete "amount",

    # Uniqueness — order_id must be a primary key (no duplicates)
    IsUnique "order_id",

    # Accuracy — amount must be positive and realistic (explicit inclusive bounds)
    ColumnValues "amount" >= 0.01,
    ColumnValues "amount" <= 1000000,

    # Accuracy — status must be one of allowed values
    ColumnValues "status" in [ "PENDING", "CONFIRMED", "SHIPPED", "DELIVERED", "CANCELLED" ],

    # Completeness — allow up to 5% null in non-critical columns
    Completeness "description" >= 0.95,

    # Freshness — at least one record created in the last 48 hours.
    # CustomSql runs Spark SQL; the table being evaluated is referenced as primary.
    CustomSql "select count(*) from primary where created_at >= current_timestamp() - interval 48 hours" > 0,

    # Volume — row count sanity check
    RowCount >= 1000
]
"""

# Create the ruleset attached to the Glue Catalog table
glue.create_data_quality_ruleset(
    Name="orders-bronze-dq-ruleset",
    Description="DQ rules for orders Bronze table",
    Ruleset=RULESET,
    TargetTable={
        "TableName":    "rds_orders",
        "DatabaseName": "bronze_layer"
    }
)
print("✅ Glue DQ ruleset created")
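When rules are kept in pipeline configuration instead of a hand-written string, the ruleset text can be generated. This is a minimal sketch of a hypothetical build_dqdl helper that covers only four rule types (IsComplete, IsUnique, ColumnValues ... in, RowCount); it performs no escaping and assumes column names and values contain no double quotes.

```python
def build_dqdl(complete=(), unique=(), allowed_values=None, min_rows=None) -> str:
    """Render a DQDL ruleset string from simple Python inputs."""
    rules = [f'IsComplete "{column}"' for column in complete]
    rules += [f'IsUnique "{column}"' for column in unique]
    for column, values in (allowed_values or {}).items():
        quoted = ", ".join(f'"{value}"' for value in values)
        rules.append(f'ColumnValues "{column}" in [{quoted}]')
    if min_rows is not None:
        rules.append(f"RowCount >= {min_rows}")
    if not rules:
        raise ValueError("A DQDL ruleset needs at least one rule")
    # Rules are comma-separated inside a single Rules = [ ... ] list
    body = ",\n".join(f"    {rule}" for rule in rules)
    return f"Rules = [\n{body}\n]"


ruleset = build_dqdl(
    complete=["order_id", "customer_id"],
    unique=["order_id"],
    allowed_values={"status": ["PENDING", "SHIPPED"]},
    min_rows=1000,
)
print(ruleset)
```

The returned string can be passed as the Ruleset argument of create_data_quality_ruleset in place of a literal.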

Attaching Rulesets to Glue Jobs and Failing on Violation

You can run a DQ evaluation standalone or attach it to a Glue ETL job so it runs automatically after the job writes data. The key production pattern is: run the DQ evaluation → poll until complete → check the score → stop the pipeline if score is below threshold.

python — running Glue DQ evaluation and failing pipeline on violation

import boto3, time

glue = boto3.client("glue")

# Start the DQ evaluation run
resp = glue.start_data_quality_ruleset_evaluation_run(
    DataSource={
        "GlueTable": {
            "DatabaseName": "bronze_layer",
            "TableName":    "rds_orders"
        }
    },
    Role="arn:aws:iam::123456789012:role/GlueETLRole",
    RulesetNames=["orders-bronze-dq-ruleset"]
)
run_id = resp["RunId"]
print(f"🔍 DQ evaluation started: {run_id}")

# Poll until the evaluation reaches a terminal state.
# STOPPED and TIMEOUT are terminal too; leaving them out would loop forever.
TERMINAL_STATES = ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT")
while True:
    status_resp = glue.get_data_quality_ruleset_evaluation_run(RunId=run_id)
    status      = status_resp["Status"]
    if status in TERMINAL_STATES:
        break
    print(f"  Status: {status} — waiting...")
    time.sleep(15)

if status != "SUCCEEDED":
    raise RuntimeError(f"DQ evaluation run failed with status: {status}")

DQ_THRESHOLD = 0.95   # fail pipeline if score is below 95%

# Inspect per-rule results
result_ids = status_resp.get("ResultIds", [])
for result_id in result_ids:
    result = glue.get_data_quality_result(ResultId=result_id)
    score  = result["Score"]          # overall 0.0 – 1.0
    rules  = result["RuleResults"]   # per-rule PASS / FAIL / ERROR

    print(f"DQ Score: {score:.2%}")
    # Treat ERROR (rule could not be evaluated) as a failure as well
    failed_rules = [r for r in rules if r["Result"] != "PASS"]

    for r in failed_rules:
        print(f"  ❌ {r['Result']}: {r['Name']} — {r.get('EvaluationMessage','')}")

    if score < DQ_THRESHOLD:
        raise ValueError(
            f"DQ score {score:.2%} below threshold {DQ_THRESHOLD:.2%} — pipeline halted"
        )

print("✅ DQ passed — proceeding to Silver transform")
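A polling loop with no upper bound can hang a pipeline if a run never reaches a terminal state. The sketch below factors the loop into a reusable helper with a timeout. The helper name and its parameters are hypothetical; the status fetcher, sleep function, and clock are passed in so the logic can be exercised without calling AWS.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}


def wait_for_terminal_status(get_status, poll_seconds=15, timeout_seconds=1800,
                             sleep=time.sleep, clock=time.monotonic):
    """Call get_status() until it returns a terminal state or the timeout passes."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"Still {status} after {timeout_seconds}s; giving up"
            )
        sleep(poll_seconds)


# Simulated run: two in-progress polls, then success (no real waiting)
fake_statuses = iter(["STARTING", "RUNNING", "SUCCEEDED"])
final = wait_for_terminal_status(lambda: next(fake_statuses), sleep=lambda s: None)
print(final)
```

With boto3, get_status would be something like lambda: glue.get_data_quality_ruleset_evaluation_run(RunId=run_id)["Status"], using the client and run ID from the example above.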

Writing DQ Scores to DynamoDB Audit Table

Every DQ evaluation run should write its score and per-rule results to the pipeline audit DynamoDB table. This gives you a historical record of DQ health per dataset per day — essential for trend analysis and SLA reporting.

python — persisting DQ results to DynamoDB audit table

import boto3, uuid
from datetime import datetime, timezone
from decimal  import Decimal

dynamo = boto3.resource("dynamodb")
table  = dynamo.Table("pipeline-audit-prod")

def write_dq_audit(pipeline_name: str, table_name: str,
                    score: float, passed: bool,
                    failed_rules: list, run_id: str = None):
    table.put_item(Item={
        "run_id":        run_id or str(uuid.uuid4()),
        "pipeline_name": pipeline_name,
        "table_name":    table_name,
        "dq_score":      Decimal(str(round(score, 4))),
        "dq_passed":     passed,
        "failed_rules":  failed_rules,   # list of failed rule names
        "run_time":      datetime.now(timezone.utc).isoformat(),
        "record_type":   "DQ_RESULT"
    })

# Call after every DQ evaluation
write_dq_audit(
    pipeline_name = "orders-bronze-etl",
    table_name    = "bronze_layer.rds_orders",
    score         = score,
    passed        = score >= 0.95,
    failed_rules  = [r["Name"] for r in failed_rules]
)
print("✅ DQ result written to audit table")
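The Decimal(str(...)) conversion above is needed because the boto3 DynamoDB resource rejects Python floats. If the audit item also carries nested statistics, every float inside it needs the same treatment. A minimal sketch of a recursive converter follows; the helper name and the sample item are hypothetical.

```python
from decimal import Decimal


def to_dynamodb_types(value):
    """Recursively replace floats with Decimal so boto3 accepts the item."""
    if isinstance(value, bool):
        return value  # bool is a subclass of int; keep it unchanged
    if isinstance(value, float):
        # Go through str() to avoid binary floating-point noise in the Decimal
        return Decimal(str(value))
    if isinstance(value, dict):
        return {key: to_dynamodb_types(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_dynamodb_types(item) for item in value]
    return value


item = to_dynamodb_types({
    "dq_score":     0.9731,
    "dq_passed":    True,
    "failed_rules": ["IsUnique"],
    "stats":        {"null_rate": 0.05, "rows": 1200},
})
print(item)
```

The converted dictionary can be passed directly as the Item argument of table.put_item.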

🔍

DQ Patterns — Schema, Nulls, Ranges, Quarantine, Reconciliation

Schema Validation Before Load

Before writing any data to Bronze, validate that the incoming schema matches the expected schema. New or missing columns in the source are common causes of silent pipeline corruption. Fail fast at schema validation so the problem is caught immediately.

python — schema validation in PySpark before Bronze write

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Define the expected schema for the orders table
EXPECTED_SCHEMA = StructType([
    StructField("order_id",    StringType(),    nullable=False),
    StructField("customer_id", StringType(),    nullable=False),
    StructField("amount",      DoubleType(),    nullable=False),
    StructField("status",      StringType(),    nullable=False),
    StructField("created_at",  TimestampType(), nullable=True),
])

def validate_schema(df, expected_schema: StructType):
    """Fail fast if incoming schema doesn't match expected schema."""
    actual_cols   = {f.name: f.dataType for f in df.schema.fields}
    expected_cols = {f.name: f.dataType for f in expected_schema.fields}

    missing  = set(expected_cols) - set(actual_cols)
    extra    = set(actual_cols)   - set(expected_cols)
    type_err = {
        col: (expected_cols[col], actual_cols[col])
        for col in expected_cols
        if col in actual_cols and expected_cols[col] != actual_cols[col]
    }

    errors = []
    if missing:  errors.append(f"Missing columns: {missing}")
    if extra:    errors.append(f"Unexpected columns: {extra}")
    if type_err: errors.append(f"Type mismatches: {type_err}")

    if errors:
        raise ValueError("Schema validation FAILED:\n" + "\n".join(errors))
    print("✅ Schema validation passed")

# Call before writing to Bronze
df_source = spark.read.jdbc(url=JDBC_URL, table="orders", properties=JDBC_PROPS)
validate_schema(df_source, EXPECTED_SCHEMA)
df_source.write.parquet("s3://company-lake/bronze/rds/orders/")

Null, Range, and Referential Integrity Checks in PySpark

These are the three most common DQ checks a Data Engineer implements directly in PySpark — before or after each layer write.

python — null rate, range, and referential integrity checks in PySpark

from pyspark.sql.functions import col

def check_null_rates(df, columns: list, max_null_pct: float = 0.05):
    """Fail if any column has more than max_null_pct nulls."""
    total = df.count()
    if total == 0:
        # Avoid dividing by zero; an empty input is itself a DQ failure
        raise ValueError("NULL check FAILED: DataFrame has no rows")
    for col_name in columns:
        null_count = df.filter(col(col_name).isNull()).count()
        null_pct   = null_count / total
        if null_pct > max_null_pct:
            raise ValueError(
                f"NULL check FAILED: '{col_name}' has {null_pct:.1%} nulls "
                f"(max allowed: {max_null_pct:.1%})"
            )
        print(f"  ✅ {col_name}: {null_pct:.2%} nulls (OK)")

def check_value_range(df, col_name: str, min_val: float, max_val: float):
    """Fail if any non-null value is outside [min_val, max_val]."""
    out_of_range = df.filter(
        (col(col_name) < min_val) | (col(col_name) > max_val)
    ).count()
    if out_of_range > 0:
        raise ValueError(
            f"Range check FAILED: '{col_name}' has {out_of_range} "
            f"values outside [{min_val}, {max_val}]"
        )
    print(f"  ✅ {col_name}: all values in range [{min_val}, {max_val}]")

def check_referential_integrity(df_fact, df_dim, fact_key: str, dim_key: str):
    """Fail if fact rows have keys that don't exist in the dimension table."""
    orphans      = df_fact.join(df_dim, df_fact[fact_key] == df_dim[dim_key], "left_anti")
    orphan_count = orphans.count()
    if orphan_count > 0:
        raise ValueError(
            f"Referential integrity FAILED: {orphan_count} rows in '{fact_key}' "
            f"have no matching '{dim_key}' in dimension"
        )
    print(f"  ✅ Referential integrity OK: all {fact_key} keys found in dimension")

# Usage (PySpark has no spark.read.delta(); use format("delta").load())
df_orders    = spark.read.format("delta").load("s3://company-lake/delta/bronze/orders/")
df_customers = spark.read.format("delta").load("s3://company-lake/delta/bronze/customers/")

check_null_rates(df_orders, ["order_id", "customer_id", "amount"], max_null_pct=0.0)
check_value_range(df_orders, "amount", min_val=0.01, max_val=1_000_000)
check_referential_integrity(df_orders, df_customers, "customer_id", "customer_id")

Quarantine Pattern — Bad Records → Separate S3 Path

Instead of failing the entire pipeline on the first bad record, the quarantine pattern separates good and bad records. Good records proceed to Bronze/Silver as normal. Bad records are written to a quarantine/ S3 path with an added rejection_reason column for debugging. The pipeline still succeeds when only a few records are quarantined; it fails only when the quarantine rate exceeds a threshold (1% in the example below), and a CloudWatch alarm on the quarantine count can alert the team before that point.

python — quarantine pattern in PySpark

from pyspark.sql.functions import col, lit, when, current_timestamp

df_raw = spark.read.parquet("s3://company-lake/landing/orders/")

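# Tag each row with a rejection reason (null = valid record).
# Branches are checked in order, so a row gets the first reason that matches.
# Nulls are tested explicitly: a comparison against null yields null, not true,
# so without these branches a null amount or status would pass as valid.
df_tagged = df_raw.withColumn(
    "rejection_reason",
    when(col("order_id").isNull(),           lit("NULL_ORDER_ID"))
    .when(col("amount").isNull(),            lit("NULL_AMOUNT"))
    .when(col("amount") <= 0,                lit("AMOUNT_NOT_POSITIVE"))
    .when(col("amount") > 1_000_000,         lit("AMOUNT_EXCEEDS_MAX"))
    .when(col("status").isNull() | ~col("status").isin(
        "PENDING", "CONFIRMED", "SHIPPED",
        "DELIVERED", "CANCELLED"),           lit("INVALID_STATUS"))
    .otherwise(lit(None))
).withColumn("quarantine_timestamp", current_timestamp())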

# Split into valid and quarantine records
df_valid      = df_tagged.filter(col("rejection_reason").isNull()) \
                         .drop("rejection_reason", "quarantine_timestamp")
df_quarantine = df_tagged.filter(col("rejection_reason").isNotNull())

valid_count  = df_valid.count()
reject_count = df_quarantine.count()
print(f"Valid: {valid_count}  |  Quarantined: {reject_count}")

# Write valid records to Bronze
df_valid.write.format("delta").mode("append") \
    .save("s3://company-lake/delta/bronze/orders/")

# Write bad records to quarantine path
if reject_count > 0:
    df_quarantine.write.format("delta").mode("append") \
        .partitionBy("rejection_reason") \
        .save("s3://company-lake/quarantine/orders/")
    print(f"⚠️  {reject_count} records quarantined")

    # Fail pipeline if quarantine exceeds 1% of total
    total = valid_count + reject_count
    if reject_count / total > 0.01:
        raise ValueError(
            f"Quarantine rate {reject_count/total:.2%} exceeds 1% threshold"
        )

Row Count Reconciliation — Source vs Target

Row count reconciliation is one of the simplest and most effective DQ checks: after writing data, compare the row count of the source against the row count written to the target. A lower target count means records were dropped somewhere in the pipeline; a higher one means records were duplicated or the target already held data from earlier loads.

python — row count reconciliation: source vs Bronze vs Silver

import boto3
from decimal import Decimal

def reconcile_row_counts(
    spark, source_df, target_path: str,
    pipeline_name: str, tolerance_pct: float = 0.0
):
    """Compare source row count to target row count after write."""
    source_count = source_df.count()
    target_count = spark.read.format("delta").load(target_path).count()

    diff     = abs(source_count - target_count)
    diff_pct = diff / source_count if source_count > 0 else 0

    print(f"Reconciliation: source={source_count:,}  target={target_count:,}  diff={diff:,} ({diff_pct:.3%})")

    if diff_pct > tolerance_pct:
        raise ValueError(
            f"Row count mismatch in {pipeline_name}: "
            f"source={source_count} target={target_count} diff={diff_pct:.3%}"
        )

    print(f"  ✅ Row counts reconciled for {pipeline_name}")
    return {"source": source_count, "target": target_count, "diff_pct": diff_pct}

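# Usage after Bronze write. This is a full-table snapshot load, so the target is
# overwritten: with mode("append") the target would also hold earlier loads and
# its count could never equal the source count.
df_source = spark.read.jdbc(url=JDBC_URL, table="orders", properties=JDBC_PROPS)
df_source.write.format("delta").mode("overwrite") \
    .save("s3://company-lake/delta/bronze/orders/")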

reconcile_row_counts(
    spark,
    source_df     = df_source,
    target_path   = "s3://company-lake/delta/bronze/orders/",
    pipeline_name = "orders-rds-to-bronze",
    tolerance_pct = 0.0   # zero tolerance for Bronze — every row must land
)

Deduplication Check After Load

After writing to Bronze or Silver, verify that no duplicate primary keys were introduced. This catches bugs in MERGE logic, append-mode re-runs, and source systems that send the same records twice.

python — deduplication check after table write

from pyspark.sql.functions import count, col

def check_no_duplicates(df, primary_key_cols: list):
    """Verify that the given columns form a unique key."""
    total    = df.count()
    distinct = df.select(primary_key_cols).distinct().count()

    if total != distinct:
        duplicates = total - distinct
        raise ValueError(
            f"Deduplication check FAILED: {duplicates:,} duplicate keys detected "
            f"on columns {primary_key_cols} (total={total:,} distinct={distinct:,})"
        )
    print(f"  ✅ No duplicates on {primary_key_cols} ({total:,} rows, all unique)")

# Check Silver orders after MERGE
# (PySpark has no spark.read.delta(); use format("delta").load())
df_silver = spark.read.format("delta").load("s3://company-lake/delta/silver/orders_clean/")
check_no_duplicates(df_silver, ["order_id"])

DQ Metrics Published to CloudWatch

Every DQ check result should be published as a custom CloudWatch metric so you can build dashboards and alarms on DQ health — alerting the team when DQ score drops below threshold or quarantine rate spikes.

python — publishing DQ metrics to CloudWatch

import boto3

cw = boto3.client("cloudwatch")

def publish_dq_metrics(pipeline_name: str, table_name: str,
                         dq_score: float, quarantine_count: int,
                         source_rows: int):
    cw.put_metric_data(
        Namespace="DataPlatform/DQ",
        MetricData=[
            {
                "MetricName": "DQScore",
                "Dimensions": [
                    {"Name": "PipelineName", "Value": pipeline_name},
                    {"Name": "TableName",    "Value": table_name}
                ],
                "Value": dq_score * 100,    # publish as 0–100 for readability
                "Unit":  "Percent"
            },
            {
                "MetricName": "QuarantineCount",
                "Dimensions": [
                    {"Name": "PipelineName", "Value": pipeline_name},
                    {"Name": "TableName",    "Value": table_name}
                ],
                "Value": quarantine_count,
                "Unit":  "Count"
            },
            {
                "MetricName": "SourceRowCount",
                "Dimensions": [
                    {"Name": "PipelineName", "Value": pipeline_name}
                ],
                "Value": source_rows,
                "Unit":  "Count"
            }
        ]
    )
    print("✅ DQ metrics published to CloudWatch")

# CloudWatch Alarm: alert when DQ score drops below 95%
cw.put_metric_alarm(
    AlarmName          = "dq-score-below-threshold-orders-bronze",
    MetricName         = "DQScore",
    Namespace          = "DataPlatform/DQ",
    Dimensions         = [
        {"Name": "PipelineName", "Value": "orders-bronze-etl"},
        {"Name": "TableName",    "Value": "bronze_layer.rds_orders"}
    ],
    ComparisonOperator = "LessThanThreshold",
    Threshold          = 95.0,
    Period             = 86400,     # 24 hours
    EvaluationPeriods  = 1,
    Statistic          = "Minimum",
    AlarmActions       = ["arn:aws:sns:us-east-1:123456789012:pipeline-alerts"],
    TreatMissingData   = "breaching"   # alarm if pipeline didn't run either
)
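CloudWatch treats each distinct combination of dimensions as a separate metric, so an alarm only sees data points published with exactly the dimensions it names. One way to keep publisher and alarm in step is to build every metric entry from a single dimension list. The helper below is a hypothetical sketch; it only constructs the MetricData payload and does not call AWS.

```python
def build_dq_metric_data(pipeline_name: str, table_name: str, metrics: dict) -> list:
    """Build a MetricData list where every entry shares the same dimensions.

    metrics maps a metric name to a (value, unit) pair.
    """
    dimensions = [
        {"Name": "PipelineName", "Value": pipeline_name},
        {"Name": "TableName",    "Value": table_name},
    ]
    return [
        {
            "MetricName": name,
            "Dimensions": dimensions,
            "Value":      float(value),
            "Unit":       unit,
        }
        for name, (value, unit) in metrics.items()
    ]


metric_data = build_dq_metric_data(
    "orders-bronze-etl",
    "bronze_layer.rds_orders",
    {"DQScore": (97.5, "Percent"), "QuarantineCount": (12, "Count")},
)
print(metric_data)
```

The result can be passed as the MetricData argument of put_metric_data, and the same two dimensions reused in put_metric_alarm.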

📋 29.23 Summary — Data Quality Engineering

DQ gates belong at every layer transition — Bronze ingest, Bronze→Silver, Silver→Gold. Never let bad data silently flow through.
Great Expectations: write Expectation Suites per dataset, validate against Spark DataFrames with Validators, run via Checkpoints in Airflow DAGs, publish Data Docs to S3 for visibility.
Glue Data Quality: AWS-native DQDL rulesets attached to Glue Catalog tables — Completeness, Uniqueness, Freshness, Accuracy. Poll get_data_quality_ruleset_evaluation_run() and fail pipeline if score below threshold.
Schema validation: check expected columns and types before writing to Bronze — fail fast on schema drift.
Null, range, referential integrity: implement directly in PySpark as DQ gate functions between pipeline stages.
Quarantine pattern: split valid vs bad records; bad records go to quarantine/ S3 path with rejection_reason column; fail pipeline only if quarantine rate exceeds threshold.
Row count reconciliation: source count must equal target count — zero tolerance at Bronze boundary.
Deduplication check: verify primary key uniqueness after every MERGE or append write.
CloudWatch DQ metrics: publish DQScore and QuarantineCount per pipeline run; set alarms to alert when DQ degrades.
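
The quarantine-rate rule from the summary can be sketched as a small pure-Python gate. The function name, the 1% default threshold, and the shape of the returned dictionary are illustrative assumptions, not part of the module's pipeline code.

```python
def quarantine_gate(source_rows: int, quarantine_count: int,
                    max_quarantine_rate: float = 0.01) -> dict:
    """Decide whether a run should fail based on its quarantine rate.

    max_quarantine_rate is a hypothetical threshold (1% here).
    """
    if source_rows < 0 or quarantine_count < 0:
        raise ValueError("counts must be non-negative")
    if quarantine_count > source_rows:
        raise ValueError("quarantine_count cannot exceed source_rows")

    # Assumption: an empty source counts as a clean run; a pipeline that
    # never ran has to be detected separately (for example by an alarm).
    rate = quarantine_count / source_rows if source_rows else 0.0
    return {
        "quarantine_rate": rate,
        "dq_score": 1.0 - rate,                    # 0.0 to 1.0
        "fail_pipeline": rate > max_quarantine_rate,
    }


result = quarantine_gate(source_rows=10_000, quarantine_count=50)
print(result)
```

The returned dq_score is on a 0 to 1 scale, so it would be multiplied by 100 before being published as a Percent metric.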

29.24 — AWS DATA ENGINEERING

Streaming Pipelines

Production streaming pipelines on AWS combine Kafka / MSK as the event backbone, Spark Structured Streaming for stateful processing, and S3 / Delta Lake / Iceberg as the durable sink. This section covers every concept you need — from trigger modes and watermarking to exactly-once guarantees and foreachBatch custom sink patterns.

🌊

Spark Structured Streaming — Core Concepts on AWS FOUNDATION ▼

readStream from Kafka / MSK

Spark Structured Streaming treats a Kafka topic as an unbounded table. Each new message is a new row appended to that table. You use spark.readStream with the Kafka source, specifying your MSK bootstrap servers, topic name, and starting offset position.

📦 Analogy

Think of a Kafka topic like a conveyor belt at an airport. New bags (messages) keep arriving. Spark Structured Streaming is the worker who picks up bags in micro-batches, processes them, and sends them to the destination (Delta / S3). The worker never stops — it just waits for the next batch of bags.

python — readStream from MSK (Kafka)

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType

spark = SparkSession.builder \
    .appName("MSK-Streaming-Pipeline") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# ── Define schema of your JSON Kafka message payload ─────────────
order_schema = StructType([
    StructField("order_id",    StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("amount",      DoubleType(), True),
    StructField("event_time",  StringType(), True),  # ISO-8601 string
    StructField("status",      StringType(), True)
])

# ── Connect to MSK and read the stream ───────────────────────────
# subscribe            → single topic
# startingOffsets      → "latest"; use "earliest" to replay all
# maxOffsetsPerTrigger → rate-limit per micro-batch
# failOnDataLoss       → "false": don't fail if offsets gap
raw_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "b-1.mycluster.kafka.us-east-1.amazonaws.com:9092,b-2.mycluster.kafka.us-east-1.amazonaws.com:9092") \
    .option("subscribe", "orders-topic") \
    .option("startingOffsets", "latest") \
    .option("maxOffsetsPerTrigger", 100000) \
    .option("failOnDataLoss", "false") \
    .load()

# ── Kafka gives you: key, value, topic, partition, offset, timestamp
# value is always binary — cast to string first, then parse JSON ──
parsed_stream = raw_stream \
    .select(
        col("value").cast("string").alias("json_str"),
        col("partition"),
        col("offset"),
        col("timestamp").alias("kafka_ts")
    ) \
    .withColumn("data", from_json(col("json_str"), order_schema)) \
    .select("data.*", "kafka_ts")

# parsed_stream now has: order_id, customer_id, amount, event_time, status, kafka_ts
parsed_stream.printSchema()

🔑 Key Points

value is always binary bytes in Kafka — you must .cast("string") before from_json(). MSK with IAM auth requires additional Kafka SASL options — see the MSK section (29.11) for auth config.
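
For orientation, the additional Kafka options for MSK IAM authentication can be collected in one dictionary. This is a minimal sketch, assuming the aws-msk-iam-auth JAR is on the Spark classpath and that the brokers expose the IAM listener on port 9098; the helper name msk_iam_options is made up for this example.

```python
def msk_iam_options(bootstrap_servers: str) -> dict:
    """Kafka source/sink options for MSK IAM authentication.

    Assumes bootstrap_servers points at the cluster's IAM listener
    (port 9098) and the aws-msk-iam-auth library is available to Spark.
    """
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "AWS_MSK_IAM",
        "kafka.sasl.jaas.config":
            "software.amazon.msk.auth.iam.IAMLoginModule required;",
        "kafka.sasl.client.callback.handler.class":
            "software.amazon.msk.auth.iam.IAMClientCallbackHandler",
    }


opts = msk_iam_options("b-1.mycluster.kafka.us-east-1.amazonaws.com:9098")
for key, value in opts.items():
    print(f"{key} = {value}")
```

With a live session these would be passed as spark.readStream.format("kafka").options(**opts) before the subscribe option.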

writeStream to S3 — Parquet, Partitioned by Event Time

Writing to S3 as Parquet is the most common pattern on AWS. You partition by year/month/day derived from your event timestamp so downstream Athena and Glue queries can do efficient partition pruning.

python — writeStream to S3 as partitioned Parquet

from pyspark.sql.functions import to_timestamp, year, month, dayofmonth

# ── Add partition columns from event_time ────────────────────────
enriched = parsed_stream \
    .withColumn("event_ts",  to_timestamp(col("event_time"))) \
    .withColumn("year",       year(col("event_ts"))) \
    .withColumn("month",      month(col("event_ts"))) \
    .withColumn("day",        dayofmonth(col("event_ts")))

# ── Write to S3 with Parquet + partitioning ──────────────────────
query = enriched.writeStream \
    .format("parquet") \
    .option("path",             "s3://my-data-lake/bronze/orders/") \
    .option("checkpointLocation", "s3://my-checkpoints/orders/") \
    .partitionBy("year", "month", "day") \
    .outputMode("append") \
    .trigger(processingTime="5 minutes") \
    .start()

query.awaitTermination()  # blocks — keep the driver alive
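
To make the resulting layout concrete, the sketch below derives the Hive-style partition directory that partitionBy("year", "month", "day") produces for a single event. It is plain Python for illustration only; partition_prefix is a hypothetical helper, and Spark builds these paths itself.

```python
from datetime import datetime


def partition_prefix(base_path: str, event_time: str) -> str:
    """Return the partition directory an event with this ISO-8601 time lands in."""
    ts = datetime.fromisoformat(event_time)
    # Integer partition values are written without zero padding (month=1, not month=01)
    return f"{base_path.rstrip('/')}/year={ts.year}/month={ts.month}/day={ts.day}/"


prefix = partition_prefix("s3://my-data-lake/bronze/orders/", "2024-01-15T10:32:00")
print(prefix)
```

A query that filters on year, month, and day only has to list and read the objects under that one prefix, which is what partition pruning means in practice.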

⚠️ Always Set checkpointLocation

Without a checkpoint, Spark has no memory of what offsets it last processed. On restart it will either replay from the beginning (duplicates) or start from latest (data loss). Always point checkpoint to a durable S3 path.

writeStream to Delta Lake on S3

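Delta Lake is the preferred sink over raw Parquet for streaming because it provides ACID transactions, handles partial failures atomically, and enables downstream MERGE for CDC. Each streaming commit is recorded in Delta's transaction log together with the query and batch identifiers, so a batch that is replayed after a failure is not written twice. Combined with the checkpoint, this gives Spark Structured Streaming exactly-once delivery into the table.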

python — writeStream to Delta Lake on S3

# ── Delta Lake dependency must be on the cluster ─────────────────
# EMR: --packages io.delta:delta-core_2.12:2.4.0
#      (2.4.0 pairs with Spark 3.4; from Delta 3.0 the artifact is named delta-spark)
# Databricks: built-in

query = enriched.writeStream \
    .format("delta") \
    .option("path",             "s3://my-data-lake/bronze/orders_delta/") \
    .option("checkpointLocation", "s3://my-checkpoints/orders_delta/") \
    .outputMode("append") \
    .trigger(processingTime="5 minutes") \
    .start()

# ── Time travel — you can query any version after the fact ────────
# spark.read.format("delta").option("versionAsOf", 5).load("s3://...")
# spark.read.format("delta").option("timestampAsOf", "2024-01-15").load("s3://...")

writeStream to Apache Iceberg on S3

Iceberg is the other major open table format. It's the preferred choice on AWS when using Athena, Glue, or EMR — it has native Athena support and the Glue catalog treats Iceberg tables as first-class objects without needing manifest files (unlike Delta).

python — writeStream to Iceberg via Glue Catalog

# ── Spark config for Iceberg + Glue Catalog (set at SparkSession) ─
spark = SparkSession.builder \
    .appName("Iceberg-Streaming") \
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.glue_catalog",
            "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.warehouse",
            "s3://my-data-lake/") \
    .getOrCreate()

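# ── The target Iceberg table must exist first ─────────────────────
# CREATE TABLE glue_catalog.bronze.orders (
#   order_id STRING, customer_id STRING, amount DOUBLE,
#   event_time TIMESTAMP, status STRING)
# USING iceberg PARTITIONED BY (days(event_time));

# ── Match the table's columns: event_time is a TIMESTAMP there, and
#    the year/month/day helper columns are not part of the table ────
iceberg_ready = enriched.select(
    "order_id", "customer_id", "amount",
    col("event_ts").alias("event_time"),
    "status"
)

# toTable() writes directly to the catalog table and starts the query
query = iceberg_ready.writeStream \
    .format("iceberg") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://my-checkpoints/orders_iceberg/") \
    .toTable("glue_catalog.bronze.orders")

query.awaitTermination()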

🕐

Watermarking — Handling Late-Arriving Data STATE MANAGEMENT ▼

Why Watermarking Exists

In distributed streaming systems, events don't always arrive in the order they were generated. A mobile app might buffer events and send them in bulk minutes later. A network hiccup might delay messages. Without watermarking, Spark would have to hold all state forever waiting for late events — this causes unbounded memory growth and kills your streaming job.

Watermarking tells Spark: "I'm willing to wait up to X time units for late data. After that, drop it and release the state." It's a tradeoff between data completeness (waiting longer) and state size (dropping sooner).

📦 Analogy

Imagine a postal sorting center. Packages are stamped with a send date. The center says: "We'll hold packages for up to 2 days after the stamped date. Anything arriving more than 2 days late gets refused." The 2-day window is your watermark threshold. This lets the center clear shelf space (release state) for old packages while still accepting a reasonable amount of late arrivals.

withWatermark() API — Syntax and Mechanics

withWatermark(eventTimeCol, delayThreshold) tells Spark two things: (1) use eventTimeCol as the clock for measuring event time, and (2) track the maximum event time seen so far and subtract delayThreshold to get the current watermark. State older than the watermark gets evicted.

python — withWatermark with window aggregation

from pyspark.sql.functions import window, sum, count, to_timestamp, col

# ── Cast event_time string → proper timestamp ─────────────────────
with_ts = parsed_stream \
    .withColumn("event_ts", to_timestamp(col("event_time")))

# ── Apply watermark: tolerate up to 10 minutes of late data ───────
watermarked = with_ts.withWatermark("event_ts", "10 minutes")

# ── Tumbling window aggregation: revenue per 5-min window ─────────
windowed_agg = watermarked \
    .groupBy(
        window(col("event_ts"), "5 minutes"),  # 5-minute tumbling window
        col("status")
    ) \
    .agg(
        sum("amount").alias("total_revenue"),
        count("*").alias("order_count")
    )

# ── window column gives: window.start, window.end ─────────────────
result = windowed_agg.select(
    col("window.start").alias("window_start"),
    col("window.end").alias("window_end"),
    col("status"),
    col("total_revenue"),
    col("order_count")
)

# ── With watermark + window, Append mode is allowed ───────────────
# Append: only completed (finalized) windows are written
query = result.writeStream \
    .format("delta") \
    .option("path",             "s3://my-data-lake/silver/order_windows/") \
    .option("checkpointLocation", "s3://my-checkpoints/order_windows/") \
    .outputMode("append") \
    .trigger(processingTime="1 minute") \
    .start()

query.awaitTermination()

🔑 Watermark + Append Mode Rule

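When you use withWatermark + window aggregation + outputMode("append"): Spark only writes a window's result AFTER the watermark has passed that window's end time. Because the watermark trails the maximum event time by the threshold, each result appears at least one threshold after its window closes, but state stays bounded. Without a watermark, append mode is not allowed for aggregations: you must use outputMode("update") or outputMode("complete"). Complete rewrites ALL results every batch, and in both cases state grows without bound.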
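
The timing of the rule above is easier to see in a small simulation. The following is a simplified pure-Python model, not Spark code: it assumes a 5-minute tumbling window and a 10-minute threshold, advances the watermark once per batch, and emits finalized windows in the same batch that advances it (Spark applies the new watermark in the following batch). All names are illustrative.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)    # tumbling window size
DELAY = timedelta(minutes=10)    # watermark threshold
EPOCH = datetime(1970, 1, 1)


def window_start(ts: datetime) -> datetime:
    """Floor an event time to the start of its tumbling window."""
    return ts - ((ts - EPOCH) % WINDOW)


def simulate(batches):
    """Replay micro-batches; return (emitted_windows, dropped_events)."""
    open_windows = {}            # window start -> event count (the "state")
    max_event_time = None
    watermark = None
    emitted, dropped = [], []

    for batch in batches:
        for ts in batch:
            start = window_start(ts)
            # Too late: this event's window already closed behind the watermark
            if watermark is not None and start + WINDOW <= watermark:
                dropped.append(ts)
                continue
            open_windows[start] = open_windows.get(start, 0) + 1
            if max_event_time is None or ts > max_event_time:
                max_event_time = ts

        if max_event_time is None:
            continue
        # Watermark = max event time seen so far minus the threshold
        watermark = max_event_time - DELAY

        # Append mode: emit each window once, when its end <= watermark
        for start in sorted(open_windows):
            if start + WINDOW <= watermark:
                emitted.append((start, open_windows.pop(start)))

    return emitted, dropped


def t(hour, minute):
    return datetime(2024, 1, 15, hour, minute)


emitted, dropped = simulate([
    [t(12, 1), t(12, 3)],     # batch 0: window 12:00 opens
    [t(12, 7), t(12, 16)],    # batch 1: watermark moves to 12:06, window 12:00 is emitted
    [t(12, 2), t(12, 21)],    # batch 2: 12:02 arrives too late and is dropped
])
print(emitted)
print(dropped)
```

The 12:00 window is written only once the watermark reaches 12:06, and the late 12:02 event is discarded because its window has already been finalized and its state released.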

Sliding Windows vs Tumbling Windows

A tumbling window divides time into fixed non-overlapping buckets (e.g. every 5 minutes). A sliding window overlaps — a 10-minute window sliding every 2 minutes means each event falls into 5 different windows simultaneously.

python — tumbling vs sliding window

from pyspark.sql.functions import window

# ── Tumbling: 5-min non-overlapping buckets ───────────────────────
tumbling = watermarked.groupBy(
    window(col("event_ts"), "5 minutes")       # windowDuration only
).agg(sum("amount").alias("revenue"))

# ── Sliding: 10-min window, slides every 2 minutes ───────────────
sliding = watermarked.groupBy(
    window(col("event_ts"), "10 minutes", "2 minutes")  # windowDuration, slideDuration
).agg(sum("amount").alias("revenue"))

# ── Session: gap-based (events within 30 min of each other) ──────
# Available in Spark 3.2+ via session_window()
from pyspark.sql.functions import session_window

session = watermarked.groupBy(
    session_window(col("event_ts"), "30 minutes"),
    col("customer_id")
).agg(sum("amount").alias("session_revenue"))
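
The claim that a 10-minute window sliding every 2 minutes puts each event into 5 windows can be checked with a few lines of plain Python. This is an illustrative sketch, assuming windows are aligned to the Unix epoch; windows_for is a made-up helper, not a Spark function.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def windows_for(ts, window, slide):
    """Return every [start, end) window that contains ts."""
    # Start of the latest window containing ts (windows begin on slide boundaries)
    start = ts - ((ts - EPOCH) % slide)
    found = []
    while start + window > ts:
        found.append((start, start + window))
        start -= slide
    return sorted(found)


event = datetime(2024, 1, 15, 12, 7)
sliding = windows_for(event, timedelta(minutes=10), timedelta(minutes=2))
tumbling = windows_for(event, timedelta(minutes=5), timedelta(minutes=5))

for start, end in sliding:
    print(f"sliding  {start:%H:%M} to {end:%H:%M}")
for start, end in tumbling:
    print(f"tumbling {start:%H:%M} to {end:%H:%M}")
```

Each extra window is an extra state entry and an extra output row per event, which is why a small slide interval on a long window multiplies state size.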

⏱️

Trigger Modes — Controlling When Spark Processes Data TRIGGER ▼

ProcessingTime — Fixed Interval Micro-Batches

The most common trigger. Spark waits for the specified interval, collects all available data since the last batch, and processes it. If processing takes longer than the interval, the next batch starts immediately (no skipping).

python — trigger modes comparison

# PySpark sets triggers through keyword arguments on .trigger();
# the Trigger class exists only in the Scala/Java API, so no import is needed.

# ── 1. ProcessingTime — run every 5 minutes ───────────────────────
query = df.writeStream \
    .trigger(processingTime="5 minutes") \
    .format("delta") \
    .option("checkpointLocation", "s3://...") \
    .start()

# ── 2. Once — process all available data, then stop ───────────────
# Old approach — still works but deprecated in favour of AvailableNow.
# Processes one mega-batch, then exits.
query = df.writeStream \
    .trigger(once=True) \
    .format("delta") \
    .option("checkpointLocation", "s3://...") \
    .start()
query.awaitTermination()

# ── 3. AvailableNow — modern replacement for Once (Spark 3.3+) ────
# Processes ALL available data in multiple micro-batches, then stops
# Much better for large backlogs — doesn't create one huge batch
query = df.writeStream \
    .trigger(availableNow=True) \
    .format("delta") \
    .option("checkpointLocation", "s3://...") \
    .start()
query.awaitTermination()   # blocks until all data processed and job exits

# ── 4. Continuous — millisecond-level latency (experimental) ─────
# Not micro-batch — uses epoch-based commits. Limited operations only,
# with at-least-once (not exactly-once) delivery.
# "1 second" is the checkpoint (commit) interval, not the latency.
query = df.writeStream \
    .trigger(continuous="1 second") \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "...") \
    .option("topic", "output-topic") \
    .option("checkpointLocation", "s3://...") \
    .start()

Trigger	Use Case	Latency	Cost on AWS
`processingTime="5m"`	Near-real-time dashboards	~5 min	Medium — cluster stays running
`availableNow=True`	Scheduled batch-style streaming	Minutes (run on schedule)	Low — cluster auto-terminates
`once=True`	Legacy scheduled runs	Minutes	Low
`continuous="1s"`	Low-latency Kafka-to-Kafka	~1 ms	High — dedicated executors

☁️ AWS Cost Tip

On EMR, use availableNow triggered by EventBridge every 15 minutes. The cluster spins up, processes all new Kafka data, writes to Delta, and terminates. This is dramatically cheaper than a long-running streaming cluster for moderate-volume pipelines (<1M events/min).
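
One way to wire up the 15-minute schedule is an EventBridge rule with a rate expression. The sketch below uses the put_rule and put_targets calls of the boto3 events client, but runs against a small recording stand-in so it works without AWS credentials. The rule name and the Lambda ARN (a function that would launch the EMR job) are hypothetical, and a real Lambda target also needs a resource-based permission for events.amazonaws.com.

```python
class DryRunEvents:
    """Records calls instead of contacting AWS, so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def put_rule(self, **kwargs):
        self.calls.append(("put_rule", kwargs))

    def put_targets(self, **kwargs):
        self.calls.append(("put_targets", kwargs))


def schedule_streaming_run(events_client, rule_name, target_arn,
                           schedule="rate(15 minutes)"):
    """Create (or update) a scheduled rule and attach one target to it."""
    events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule,
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "start-streaming-run", "Arn": target_arn}],
    )


client = DryRunEvents()   # swap for boto3.client("events") in a real account
schedule_streaming_run(
    client,
    rule_name="orders-available-now-15m",                      # hypothetical
    target_arn="arn:aws:lambda:us-east-1:123456789012:function:start-orders-streaming",
)
for name, kwargs in client.calls:
    print(name, kwargs)
```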

📤

Output Modes — append, update, complete OUTPUT ▼

Three Output Modes Explained

Output mode controls which rows are written to the sink per micro-batch. The choice depends on whether you have aggregations and whether you need historical results or only new ones.

Mode	What Gets Written	Requires	Common Use
append	Only NEW rows added since last batch. Rows already written are never modified.	Watermark if aggregating	Raw event landing to S3/Delta Bronze
update	Only rows that CHANGED since last batch (new + updated aggregates).	Aggregation or stateful op	JDBC upserts via foreachBatch
complete	ALL result rows rewritten every batch — full result table.	Aggregation (no watermark needed)	Real-time leaderboards, small lookups
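
The difference between the three modes shows up clearly when the same two micro-batches are replayed in plain Python. This is a toy model with made-up data, not Spark code: "append" shows the raw rows of the batch (no aggregation), while "update" and "complete" show a running total per status.

```python
from collections import defaultdict


def run_modes(batches):
    """Yield what each output mode would hand to the sink for every batch."""
    totals = defaultdict(float)          # running aggregate = streaming state
    for batch_id, batch in enumerate(batches):
        changed = set()
        for status, amount in batch:
            totals[status] += amount
            changed.add(status)
        yield {
            "batch_id": batch_id,
            "append": list(batch),                               # new raw rows only
            "update": {k: totals[k] for k in sorted(changed)},   # changed aggregates
            "complete": dict(sorted(totals.items())),            # full result table
        }


batches = [
    [("PAID", 10.0), ("NEW", 5.0)],   # batch 0
    [("PAID", 7.0)],                  # batch 1
]
outputs = list(run_modes(batches))
for out in outputs:
    print(out)
```

In batch 1 only the PAID total changed, so update emits one row while complete re-emits both.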

python — output mode examples

# ── APPEND: raw events, no aggregation — write each event once ────
raw_events.writeStream \
    .outputMode("append") \
    .format("delta") \
    .option("checkpointLocation", "s3://ckpt/raw/") \
    .start()

# ── UPDATE: streaming aggregation, only changed rows written ──────
# Good for JDBC sinks that support upsert
running_totals = parsed_stream \
    .groupBy("status") \
    .agg(sum("amount").alias("total"), count("*").alias("cnt"))

running_totals.writeStream \
    .outputMode("update") \
    .foreachBatch(lambda df, batch_id: upsert_to_rds(df, batch_id)) \
    .option("checkpointLocation", "s3://ckpt/totals/") \
    .start()

# ── COMPLETE: rewrite entire result table every batch ─────────────
# Good for small aggregations like top-10 lists
leaderboard = parsed_stream \
    .groupBy("customer_id") \
    .agg(sum("amount").alias("total_spent")) \
    .orderBy(col("total_spent").desc()) \
    .limit(10)

# The "memory" sink keeps results in an in-memory table for dashboards
leaderboard.writeStream \
    .outputMode("complete") \
    .format("memory") \
    .queryName("top_customers") \
    .start()

# Query the in-memory table via SQL:
# spark.sql("SELECT * FROM top_customers").show()

⚠️ Complete Mode + Delta = Expensive

outputMode("complete") with Delta rewrites the entire table every micro-batch. For large aggregations this causes massive write amplification. Use update + foreachBatch with MERGE INTO Delta instead for production aggregation pipelines.

💾

Checkpointing — Exactly-Once Guarantees on AWS FAULT TOLERANCE ▼

What a Checkpoint Contains and Why It Matters

A Spark Structured Streaming checkpoint is a directory (typically on S3 in AWS) that stores three things: the offset log (which Kafka offsets were read), the commit log (which batches were successfully written), and the state store (aggregation/join state for stateful operations). Together these enable exact recovery after any failure.

Checkpoint Directory Layout on S3
─────────────────────────────────────────────
s3://my-checkpoints/my-query/
├── offsets/      ← which Kafka offsets were read per batch
│     0           (batch 0 offsets)
│     1           (batch 1 offsets)
│     2 ...
├── commits/      ← which batches were successfully committed
│     0
│     1
├── state/        ← stateful aggregation state (RocksDB or in-memory)
│     0/
│       0.delta
│       1.delta
└── metadata      ← query metadata (id, Spark version)

python — checkpoint best practices on S3

# ── Rule 1: Each query needs its OWN checkpoint path ─────────────
# Never share a checkpoint between two different queries!

query1 = stream_a.writeStream \
    .option("checkpointLocation", "s3://my-ckpt/pipeline-orders/") \
    .format("delta").start()

query2 = stream_b.writeStream \
    .option("checkpointLocation", "s3://my-ckpt/pipeline-events/") \
    .format("delta").start()

# ── Rule 2: Use S3 on a reliable prefix (not temp/scratch) ────────
# Good:  s3://my-checkpoints/prod/orders/
# Bad:   /tmp/ckpt/ (lost on restart)
# Bad:   s3://my-data-lake/bronze/  (mixed with data)

# ── Rule 3: On restart, reuse the SAME checkpoint path ───────────
# Spark will read offset log, find last committed batch,
# and resume from the next un-processed offset automatically.

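# ── Rule 4: To reset (replay from scratch), DELETE the checkpoint
# But this will cause duplicate data unless you also clear the sink!
# A checkpoint is a prefix holding many objects, so every object under
# it must be deleted; delete_object() on the prefix key alone leaves them.
import boto3
s3 = boto3.resource("s3")
# s3.Bucket("my-ckpt").objects.filter(Prefix="pipeline-orders/").delete()  ← careful!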
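
Resetting a checkpoint means removing every object under its prefix. A minimal sketch with the S3 client API follows, using the list_objects_v2 paginator and delete_objects. The helper name delete_prefix and the in-memory stand-in class are assumptions added so the example runs without AWS access; with a real account the client would be boto3.client("s3").

```python
def delete_prefix(s3_client, bucket: str, prefix: str) -> int:
    """Delete all objects under prefix; return how many were removed."""
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    # Each page holds at most 1000 keys, which is also the delete_objects limit
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted


class FakeS3:
    """In-memory stand-in for the S3 client (illustration only)."""

    def __init__(self, keys):
        self.keys = set(keys)

    def get_paginator(self, name):
        return self

    def paginate(self, Bucket, Prefix):
        matching = sorted(k for k in self.keys if k.startswith(Prefix))
        yield {"Contents": [{"Key": k} for k in matching]} if matching else {}

    def delete_objects(self, Bucket, Delete):
        for obj in Delete["Objects"]:
            self.keys.discard(obj["Key"])


fake = FakeS3([
    "pipeline-orders/offsets/0",
    "pipeline-orders/commits/0",
    "pipeline-events/offsets/0",
])
removed = delete_prefix(fake, "my-ckpt", "pipeline-orders/")
print(removed, sorted(fake.keys))
```

The trailing slash in the prefix matters: without it, "pipeline-orders" would also match a sibling such as "pipeline-orders-v2/".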

🔑 How Exactly-Once Works End-to-End

Source: Kafka offsets are checkpointed — on restart Spark replays from last committed offset (no data loss). Processing: Spark's idempotent execution re-runs the same batch if it failed mid-way. Sink: Delta Lake uses the batch ID to make writes idempotent — rewriting batch 5 twice results in only one copy of the data.
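
The recovery decision described above boils down to comparing the offset log with the commit log. The following sketch models that comparison in plain Python under simplified assumptions (batch IDs as integers, no state store); batch_to_run is an illustrative name, not a Spark API.

```python
def batch_to_run(offset_batches, committed_batches):
    """Return (batch_id, is_replay) for the first batch after a restart.

    offset_batches    - batch IDs found under offsets/   (planned)
    committed_batches - batch IDs found under commits/   (finished)
    """
    if not offset_batches:
        return 0, False                      # brand-new query
    latest_planned = max(offset_batches)
    latest_committed = max(committed_batches) if committed_batches else -1
    if latest_planned > latest_committed:
        # Planned but never committed: re-run it with the SAME offset range
        return latest_planned, True
    return latest_committed + 1, False       # clean shutdown: move on


print(batch_to_run([0, 1, 2], [0, 1]))      # crashed during batch 2
print(batch_to_run([0, 1, 2], [0, 1, 2]))   # stopped cleanly after batch 2
```

Because the replayed batch covers exactly the same offsets, an idempotent sink can recognise it by batch ID and avoid writing it twice.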

🔁

foreachBatch — Custom Sink Logic ADVANCED PATTERN ▼

What foreachBatch Does and Why You Need It

foreachBatch gives you access to each micro-batch as a static DataFrame. This unlocks everything that native writeStream sinks can't do: writing to multiple destinations, JDBC upserts, REST API calls, custom Delta MERGE operations, and fan-out patterns.

📦 Analogy

Normal writeStream is like a factory output conveyor that automatically drops products into one specific box (Delta / Parquet / Kafka). foreachBatch is like having a human worker intercept each box of products — they can split it, inspect it, route some pieces to one warehouse, others to another, and stamp each box with a batch number for deduplication.

python — foreachBatch: MERGE INTO Delta (upsert pattern)

from delta.tables import DeltaTable

def upsert_to_delta(batch_df, batch_id):
    """
    Called once per micro-batch.
    batch_df  — static DataFrame of this batch's data
    batch_id  — monotonically increasing integer (0, 1, 2, ...)
    """
    print(f"Processing batch {batch_id}, rows: {batch_df.count()}")

    # ── Deduplicate within this batch first ──────────────────────────
    # A batch might have duplicate order_ids if the producer sent retries
    deduped = batch_df.dropDuplicates(["order_id"])

    # ── MERGE INTO Delta: upsert by order_id ─────────────────────────
    target = DeltaTable.forPath(spark, "s3://my-data-lake/silver/orders/")

    target.alias("t").merge(
        deduped.alias("s"),
        condition="t.order_id = s.order_id"
    ).whenMatchedUpdateAll() \
     .whenNotMatchedInsertAll() \
     .execute()

# ── Wire foreachBatch into the streaming query ────────────────────
# outputMode: use "update" or "append" with foreachBatch
query = parsed_stream.writeStream \
    .foreachBatch(upsert_to_delta) \
    .option("checkpointLocation", "s3://my-ckpt/silver-orders/") \
    .outputMode("update") \
    .trigger(processingTime="5 minutes") \
    .start()

query.awaitTermination()

foreachBatch: Fan-Out — Write to Multiple Destinations

One of the most powerful patterns: write each micro-batch to multiple sinks, one after another, inside a single function. Cache the batch DataFrame first so Spark doesn't re-compute it for each write.

python — foreachBatch fan-out: Delta + DynamoDB + SNS

import boto3, json
from delta.tables import DeltaTable

dynamodb = boto3.resource("dynamodb")
sns      = boto3.client("sns")
audit_table = dynamodb.Table("pipeline-audit")

def multi_sink_write(batch_df, batch_id):
    # ── Cache! Re-used across multiple actions ────────────────────────
    batch_df.cache()

    row_count = batch_df.count()

    # ── Sink 1: Write to Delta Lake Bronze ───────────────────────────
    batch_df.write \
        .format("delta") \
        .mode("append") \
        .option("mergeSchema", "true") \
        .save("s3://my-data-lake/bronze/orders/")

    # ── Sink 2: Write high-value orders to separate table ─────────────
    high_value = batch_df.filter(col("amount") > 1000)
    high_value.write \
        .format("delta") \
        .mode("append") \
        .save("s3://my-data-lake/bronze/high_value_orders/")

    # ── Sink 3: Write audit record to DynamoDB ────────────────────────
    audit_table.put_item(Item={
        "run_id":    f"streaming-batch-{batch_id}",
        "batch_id":  batch_id,
        "row_count": row_count,
        "status":    "success"
    })

    # ── Sink 4: Alert on large batches via SNS ────────────────────────
    if row_count > 500000:
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
            Message=f"Large batch detected: batch_id={batch_id}, rows={row_count}",
            Subject="Streaming Pipeline — Large Batch Alert"
        )

    # ── Always unpersist after all sinks are done ─────────────────────
    batch_df.unpersist()

query = parsed_stream.writeStream \
    .foreachBatch(multi_sink_write) \
    .option("checkpointLocation", "s3://my-ckpt/multi-sink/") \
    .trigger(processingTime="2 minutes") \
    .start()

query.awaitTermination()

Idempotent Writes with batchId

If the job restarts mid-batch, foreachBatch may be called again for the same batch_id. Make your sink logic idempotent using batch_id as a guard — check if that batch was already written before writing again.

python — idempotent foreachBatch using batchId guard

def idempotent_write(batch_df, batch_id):
    # ── Check if this batch was already committed ─────────────────────
    control_table = dynamodb.Table("streaming-control")
    existing = control_table.get_item(
        Key={"pipeline_id": "orders-pipeline", "batch_id": str(batch_id)}
    )

    if "Item" in existing:
        print(f"Batch {batch_id} already committed — skipping (replay safety)")
        return

    # ── Write the data ────────────────────────────────────────────────
    batch_df.write.format("delta").mode("append") \
        .save("s3://my-data-lake/bronze/orders/")

    # ── Mark this batch as done in control table ──────────────────────
    from datetime import datetime, timezone
    control_table.put_item(Item={
        "pipeline_id": "orders-pipeline",
        "batch_id":   str(batch_id),
        "committed_at": datetime.now(timezone.utc).isoformat()
    })
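
A control-table guard leaves a small gap: if the job dies after the Delta write but before the put_item, the replayed batch is written again. For Delta sinks specifically, Delta Lake 2.0+ documents two writer options, txnAppId and txnVersion, that let the table itself skip a batch it has already committed. The sketch below shows the call shape; the recording stand-in classes exist only so the example runs without Spark, and the app ID value is a hypothetical choice.

```python
class RecordingWriter:
    """Stand-in for DataFrameWriter that records the calls it receives."""

    def __init__(self):
        self.fmt = None
        self.save_mode = None
        self.options = {}
        self.path = None

    def format(self, fmt):
        self.fmt = fmt
        return self

    def mode(self, save_mode):
        self.save_mode = save_mode
        return self

    def option(self, key, value):
        self.options[key] = value
        return self

    def save(self, path):
        self.path = path


class FakeBatch:
    """Stand-in for the micro-batch DataFrame (only .write is needed here)."""

    def __init__(self):
        self.write = RecordingWriter()


def write_batch_idempotently(batch_df, batch_id, path, app_id="orders-pipeline"):
    # txnAppId identifies the writer; txnVersion must increase per batch.
    # Delta ignores a write whose (txnAppId, txnVersion) was already committed.
    batch_df.write.format("delta") \
        .mode("append") \
        .option("txnAppId", app_id) \
        .option("txnVersion", batch_id) \
        .save(path)


batch = FakeBatch()
write_batch_idempotently(batch, 5, "s3://my-data-lake/bronze/orders/")
print(batch.write.options)
```

In a real job the function would be passed to foreachBatch through a small wrapper that fixes the path, and the app ID has to stay stable across restarts and change if the checkpoint is deleted.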

📨

Kafka Patterns on AWS MSK — Producer, Consumer, Schema Registry MSK ▼

Producer Design — Serialization and Partitioning Strategy

In Python pipelines, you often need to produce events to Kafka from a Spark job (e.g. after processing, publish enriched events downstream). Use the kafka-python library or write a Spark DataFrame back to Kafka via writeStream.

python — Spark writeStream back to Kafka (producer pattern)

from pyspark.sql.functions import to_json, struct

# ── Kafka requires "value" (and optionally "key") columns ─────────
# Serialize the enriched DataFrame as JSON → Kafka value
kafka_ready = enriched.select(
    col("order_id").alias("key"),        # partition key — same order → same partition
    to_json(struct("*")).alias("value")   # full row as JSON string
)

# ── Write enriched events back to a downstream Kafka topic ────────
query = kafka_ready.writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "b-1.mycluster.kafka.us-east-1.amazonaws.com:9092") \
    .option("topic", "orders-enriched") \
    .option("checkpointLocation", "s3://my-ckpt/kafka-output/") \
    .outputMode("append") \
    .start()

query.awaitTermination()

Schema Registry with Avro on MSK

For production pipelines on MSK, use Avro serialization with AWS Glue Schema Registry, the AWS-managed schema registry that integrates with MSK. Avro is more compact than JSON and enforces schema on both producer and consumer sides, preventing schema drift from breaking downstream pipelines.

python — read MSK topic with Avro + Glue Schema Registry

# ── Dependency: add the AWS Glue Schema Registry Serde JAR ────────
# --packages software.amazon.glue:schema-registry-serde:1.1.14

from pyspark.sql.functions import col

# ── Read raw Avro bytes from MSK ──────────────────────────────────
raw = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "b-1.mycluster.kafka.us-east-1.amazonaws.com:9092") \
    .option("subscribe", "orders-avro") \
    .option("startingOffsets", "latest") \
    .load()

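# ── Deserialize Avro value using from_avro() ──────────────────────
# from_avro() ships in the separate spark-avro package and expects plain
# Avro binary. Records written with the Glue Schema Registry serializer
# carry a schema-version header in front of the payload, which has to be
# removed (or the Glue deserializer used) before from_avro() can decode them.
from pyspark.sql.avro.functions import from_avro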

avro_schema = """
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id",    "type": "string"},
    {"name": "customer_id", "type": "string"},
    {"name": "amount",      "type": "double"},
    {"name": "event_time",  "type": "string"},
    {"name": "status",      "type": "string"}
  ]
}
"""

parsed = raw.select(
    from_avro(col("value"), avro_schema).alias("data")
).select("data.*")

parsed.printSchema()
# root
#  |-- order_id: string
#  |-- customer_id: string
#  |-- amount: double
#  |-- event_time: string
#  |-- status: string

Dead Letter Topic — Routing Bad Records

In production, some Kafka messages will be malformed — bad JSON, wrong schema, or null values where non-null is expected. Instead of crashing the streaming job, route these records to a dead letter topic (DLT) in Kafka for later inspection and replay.

python — dead letter topic pattern in foreachBatch

from pyspark.sql.functions import from_json, col, lit, to_json, struct

def process_with_dlt(batch_df, batch_id):
    # ── Parse JSON — invalid records get null in "data" column ────────
    parsed = batch_df.select(
        col("value").cast("string").alias("raw"),
        from_json(col("value").cast("string"), order_schema).alias("data")
    )

    # ── Split: good records vs bad (null data = parse failure) ────────
    good = parsed.filter(col("data.order_id").isNotNull()).select("data.*")
    bad  = parsed.filter(col("data.order_id").isNull()) \
                 .select(col("raw").alias("value"))    # send raw string to DLT

    # ── Write good records to Delta ───────────────────────────────────
    if good.count() > 0:
        good.write.format("delta").mode("append") \
            .save("s3://my-data-lake/bronze/orders/")

    # ── Route bad records to Kafka dead letter topic ──────────────────
    if bad.count() > 0:
        bad.write \
            .format("kafka") \
            .option("kafka.bootstrap.servers", "b-1.mycluster.kafka.us-east-1.amazonaws.com:9092") \
            .option("topic", "orders-topic-dlq") \
            .save()

query = raw_stream.writeStream \
    .foreachBatch(process_with_dlt) \
    .option("checkpointLocation", "s3://my-ckpt/orders-dlt/") \
    .trigger(processingTime="1 minute") \
    .start()

query.awaitTermination()
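
The good/bad split can be tried locally without Kafka or Spark. The sketch below mirrors the same rule in plain Python: a record is "good" only if it parses as a JSON object with a non-null order_id, and everything else keeps its raw bytes for the dead letter topic. The function name and sample messages are made up for illustration.

```python
import json


def split_records(raw_values):
    """Split raw Kafka values into (parsed good records, raw bad records)."""
    good, bad = [], []
    for raw in raw_values:
        try:
            record = json.loads(raw.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            bad.append(raw)              # not valid UTF-8 or not valid JSON
            continue
        if isinstance(record, dict) and record.get("order_id") is not None:
            good.append(record)
        else:
            bad.append(raw)              # parsed, but the required key is missing/null
    return good, bad


messages = [
    b'{"order_id": "o-1", "amount": 10.5}',
    b'{"amount": 3}',
    b'not json',
    b'{"order_id": null}',
]
good, bad = split_records(messages)
print(len(good), "good /", len(bad), "bad")
```

Keeping the original bytes for rejected records is what makes later inspection and replay from the dead letter topic possible.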

🏗️

Complete AWS Streaming Pipeline — Kafka → Spark → Delta on S3 PRODUCTION PATTERN ▼

End-to-End Pipeline: MSK → EMR Spark Streaming → Delta Bronze

This is the canonical AWS streaming architecture. MSK receives events from producers (apps, microservices). EMR runs a persistent Spark Structured Streaming job that reads from MSK, applies watermarking, does light transformations, and lands data into a Delta Lake Bronze table on S3. Downstream batch jobs then refine Bronze → Silver → Gold.

COMPLETE AWS STREAMING PIPELINE ARCHITECTURE
────────────────────────────────────────────────────────────────
[Producers]         [MSK Kafka]          [EMR Spark]          [S3 Delta Lake]
App servers    →    orders-topic    →    readStream      →    bronze/orders/
Mobile apps         events-topic         watermark                  ↓
Microservices       errors-topic         foreachBatch         silver/orders/
                                         write to             (batch job)
                         ↓                    ↓                     ↓
                    [Schema Registry]    [S3 Checkpoint]      [Athena / Glue]
                    (Avro schemas)       (offset + state)     (query layer)
                         ↓
                    [DLQ Topic]  ← Bad records routed here
                         ↓
                    [Lambda]     ← Alerts on DLQ depth via CloudWatch
────────────────────────────────────────────────────────────────

python — complete production streaming pipeline (EMR)

import boto3, json
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, from_json, to_timestamp, year, month, dayofmonth,
    current_timestamp, lit
)
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from delta.tables import DeltaTable

# ──────────────────────────────────────────────────────────────────
# 1. SparkSession with Delta support
# ──────────────────────────────────────────────────────────────────
spark = SparkSession.builder \
    .appName("MSK-to-Delta-Bronze") \
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

# ──────────────────────────────────────────────────────────────────
# 2. Load config from Secrets Manager (never hardcode creds)
# ──────────────────────────────────────────────────────────────────
sm = boto3.client("secretsmanager", region_name="us-east-1")
secret = json.loads(
    sm.get_secret_value(SecretId="prod/msk/streaming-pipeline")["SecretString"]
)
bootstrap_servers = secret["bootstrap_servers"]
topic             = secret["topic"]
delta_path        = secret["delta_path"]
checkpoint_path   = secret["checkpoint_path"]

# ──────────────────────────────────────────────────────────────────
# 3. Schema
# ──────────────────────────────────────────────────────────────────
order_schema = StructType([
    StructField("order_id",    StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("amount",      DoubleType(), True),
    StructField("event_time",  StringType(), True),
    StructField("status",      StringType(), True)
])

# ──────────────────────────────────────────────────────────────────
# 4. Read from MSK
# ──────────────────────────────────────────────────────────────────
raw = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", bootstrap_servers) \
    .option("subscribe", topic) \
    .option("startingOffsets", "latest") \
    .option("maxOffsetsPerTrigger", 200000) \
    .option("failOnDataLoss", "false") \
    .load()

# ──────────────────────────────────────────────────────────────────
# 5. Parse + watermark + enrich
# ──────────────────────────────────────────────────────────────────
parsed = raw \
    .select(from_json(col("value").cast("string"), order_schema).alias("d")) \
    .select("d.*") \
    .withColumn("event_ts",   to_timestamp(col("event_time"))) \
    .withColumn("ingested_at", current_timestamp()) \
    .withColumn("year",        year(col("event_ts"))) \
    .withColumn("month",       month(col("event_ts"))) \
    .withColumn("day",         dayofmonth(col("event_ts"))) \
    .withWatermark("event_ts", "10 minutes")   # tolerate 10-min late events

cw = boto3.client("cloudwatch", region_name="us-east-1")

# ──────────────────────────────────────────────────────────────────
# 6. foreachBatch: write to Delta + publish metrics
# ──────────────────────────────────────────────────────────────────
def write_batch(batch_df, batch_id):
    batch_df.cache()
    row_count = batch_df.count()

    if row_count > 0:
        batch_df.write \
            .format("delta") \
            .mode("append") \
            .partitionBy("year", "month", "day") \
            .option("mergeSchema", "true") \
            .save(delta_path)

    # Publish rows_processed metric to CloudWatch
    cw.put_metric_data(
        Namespace="DataPipeline/Streaming",
        MetricData=[{
            "MetricName": "RowsProcessed",
            "Value":      row_count,
            "Unit":       "Count",
            "Dimensions": [{"Name": "Pipeline", "Value": "orders-bronze"}]
        }]
    )

    batch_df.unpersist()
    print(f"Batch {batch_id}: wrote {row_count} rows")

# ──────────────────────────────────────────────────────────────────
# 7. Start the streaming query
# ──────────────────────────────────────────────────────────────────
query = parsed.writeStream \
    .foreachBatch(write_batch) \
    .option("checkpointLocation", checkpoint_path) \
    .trigger(processingTime="5 minutes") \
    .start()

query.awaitTermination()

✅ What This Pattern Achieves

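Replay-safe delivery: the checkpoint records Kafka offsets and each Delta write is atomic. Note that foreachBatch on its own is at-least-once, because a batch can be replayed after a crash; for exactly-once results, make the write idempotent, for example with Delta's txnAppId and txnVersion writer options keyed on batch_id. Late-data handling: the 10-minute watermark only takes effect once a stateful operation (aggregation, deduplication, or a stream-stream join) is added; a plain append still writes late rows. No hardcoded secrets: config is loaded from Secrets Manager. Observability: CloudWatch metrics per batch. Cost control: maxOffsetsPerTrigger caps the batch size so burst traffic cannot overwhelm executor memory.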
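Making a foreachBatch write safe to replay comes down to the sink remembering which batch ids it has already committed. The following is a minimal sketch of that idea in plain Python, with no Spark involved; `IdempotentSink` and its fields are hypothetical names used only for illustration, and a real table format would commit the data and the batch id in one atomic transaction.

```python
class IdempotentSink:
    """Toy sink that remembers the last committed batch id (hypothetical helper)."""

    def __init__(self):
        self.rows = []
        self.last_committed_batch = -1

    def write_batch(self, batch_id, rows):
        # A replayed batch carries an id that was already committed: skip it.
        if batch_id <= self.last_committed_batch:
            return False
        # Here the two updates are sequential; a table format makes them atomic.
        self.rows.extend(rows)
        self.last_committed_batch = batch_id
        return True


sink = IdempotentSink()
sink.write_batch(0, ["O-001", "O-002"])
sink.write_batch(1, ["O-003"])

# Simulate a restart that replays batch 1 after a crash.
replayed = sink.write_batch(1, ["O-003"])
print(replayed, sink.rows)
```

The replayed batch is ignored, so the row for O-003 is stored once rather than twice.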

29.25 — AWS DATA ENGINEERING

CDC (Change Data Capture) Pipelines

Change Data Capture lets you stream every INSERT, UPDATE, and DELETE from a source database into your data lake in near real-time — without expensive full table scans. This section covers CDC fundamentals, AWS DMS, Debezium on MSK, and landing CDC events into Delta Lake and Iceberg using MERGE INTO.

♻️

CDC Concepts — Full Load vs Incremental vs CDC

Full Load vs Incremental vs CDC — What Is the Difference?

There are three ways to move data from a source database to your data lake. Each has a different cost, latency, and completeness profile.

Strategy	How It Works	Captures Deletes?	Latency	Cost
Full Load	Read the entire table every run. Overwrite the target.	Yes (missing rows = deleted)	Hours	High — full scan each time
Incremental	Read only rows where `updated_at > last_watermark`. Append or upsert.	No — soft deletes only	Minutes	Medium
CDC	Capture every change event (I/U/D) from the database transaction log. Stream in real-time.	Yes — DELETE events captured	Seconds	Low — log-based, no table scan

📦 Analogy

Full Load is photocopying an entire book every night. Incremental is copying only the pages that were written after a bookmark. CDC is having a live stenographer sitting next to the author, writing down every single edit — including crossed-out words (deletes) — the instant they happen.

Log-Based CDC vs Query-Based CDC

Log-based CDC reads the database's internal transaction log (WAL in PostgreSQL, binlog in MySQL, redo log in Oracle). It captures every change with very little impact on the source: no extra queries against the tables and no table locks, although the log must be retained until the reader has consumed it. Query-based CDC polls the source table with WHERE updated_at > ?. It is simpler to set up, but it cannot capture hard deletes and it adds query load to the source.
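The hard-delete blind spot of query-based CDC is easy to reproduce. The sketch below uses an in-memory SQLite table as a stand-in for the source database; the table layout and the `poll_changes` helper are assumptions made for the demo, not part of any AWS service.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("O-001", "PLACED", 100), ("O-002", "PLACED", 100)],
)


def poll_changes(conn, last_watermark):
    """Query-based CDC: return rows modified after the watermark."""
    return conn.execute(
        "SELECT order_id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY order_id",
        (last_watermark,),
    ).fetchall()


# The first poll picks up both rows and advances the watermark.
first_poll = poll_changes(conn, 0)
watermark = max(row[2] for row in first_poll)

# The source then updates one row and hard-deletes the other.
conn.execute("UPDATE orders SET status = 'SHIPPED', updated_at = 200 WHERE order_id = 'O-001'")
conn.execute("DELETE FROM orders WHERE order_id = 'O-002'")

# The second poll sees the update, but the deleted row leaves nothing to select.
second_poll = poll_changes(conn, watermark)
print(second_poll)
```

The target would keep O-002 forever, because no poll ever returns a row for it. A log-based reader would instead receive an explicit delete event.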

🔑 Production Rule

Always prefer log-based CDC (DMS or Debezium) for production pipelines. Query-based CDC misses hard deletes and creates load on the source database — two problems that bite you at scale.

Insert / Update / Delete Event Structure — Before and After Image

Every CDC event carries an operation type and (depending on the tool) a before image (the row before the change) and an after image (the row after). Understanding this structure is essential for writing correct MERGE logic.

json — CDC event structure (Debezium format)

// INSERT event — op = "c" (create)
{
  "op":     "c",
  "before": null,
  "after":  { "order_id": "O-001", "amount": 99.99, "status": "PLACED" },
  "ts_ms":  1700000000000
}

// UPDATE event — op = "u" (update)
{
  "op":     "u",
  "before": { "order_id": "O-001", "amount": 99.99, "status": "PLACED"   },
  "after":  { "order_id": "O-001", "amount": 99.99, "status": "SHIPPED"  },
  "ts_ms":  1700000060000
}

// DELETE event — op = "d" (delete), tombstone follows
{
  "op":     "d",
  "before": { "order_id": "O-001", "amount": 99.99, "status": "SHIPPED"  },
  "after":  null,
  "ts_ms":  1700000120000
}

// SNAPSHOT / READ event — op = "r" (initial full load row)
{
  "op":     "r",
  "before": null,
  "after":  { "order_id": "O-999", "amount": 150.00, "status": "DELIVERED" },
  "ts_ms":  1700000000000
}
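To make the before/after images concrete, here is a small sketch that replays events shaped like the ones above against an in-memory dictionary. The `apply_event` function is a hypothetical helper; it only shows which image each operation type relies on.

```python
def apply_event(state, event, key="order_id"):
    """Apply one Debezium-style envelope to a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "u", "r"):      # create, update, snapshot read: use the after image
        row = event["after"]
        state[row[key]] = row
    elif op == "d":                # delete: the key is only present in the before image
        state.pop(event["before"][key], None)
    else:
        raise ValueError(f"unknown op: {op!r}")
    return state


placed = {"order_id": "O-001", "amount": 99.99, "status": "PLACED"}
shipped = {"order_id": "O-001", "amount": 99.99, "status": "SHIPPED"}
delivered = {"order_id": "O-999", "amount": 150.00, "status": "DELIVERED"}

events = [
    {"op": "r", "before": None, "after": delivered, "ts_ms": 1700000000000},
    {"op": "c", "before": None, "after": placed, "ts_ms": 1700000000000},
    {"op": "u", "before": placed, "after": shipped, "ts_ms": 1700000060000},
    {"op": "d", "before": shipped, "after": None, "ts_ms": 1700000120000},
]

state = {}
for event in events:
    apply_event(state, event)
print(state)
```

After the replay only O-999 remains: O-001 was created, updated, and then deleted.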

Exactly-Once Delivery Challenges in CDC

CDC pipelines face two failure scenarios: (1) the source emits a change event but the consumer crashes before committing — causing a replay (at-least-once). (2) The consumer commits the offset but fails before writing to the sink — causing data loss (at-most-once). True exactly-once requires checkpointed offsets (Kafka + Spark checkpoint) AND an idempotent sink (Delta MERGE or JDBC upsert keyed by primary key + change timestamp).
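An idempotent sink keyed by primary key plus change timestamp can be sketched in a few lines. This is a simplified illustration in plain Python, assuming each event carries `order_id` and `ts_ms`; the `idempotent_upsert` name is hypothetical.

```python
def idempotent_upsert(table, event):
    """Apply an event only if it is newer than what the table already holds."""
    key, ts = event["order_id"], event["ts_ms"]
    current = table.get(key)
    if current is not None and current["ts_ms"] >= ts:
        return False               # duplicate or stale replay: ignore it
    table[key] = event
    return True


table = {}
e1 = {"order_id": "O-001", "status": "PLACED", "ts_ms": 1000}
e2 = {"order_id": "O-001", "status": "SHIPPED", "ts_ms": 2000}

# Deliver both events, then replay both as if the consumer had restarted.
results = [idempotent_upsert(table, e) for e in (e1, e2, e1, e2)]
print(results, table["O-001"]["status"])
```

Because replays are rejected, at-least-once delivery from the source still produces the same final table as a single clean pass.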

🚚

AWS DMS — Database Migration Service for CDC

DMS Architecture — Replication Instance, Endpoints, Tasks

AWS DMS has three components: a Replication Instance (a managed EC2 instance that runs the replication software), Source and Target Endpoints (connections to your database and your destination), and a Replication Task (the actual job that moves data). For CDC, you run a task in Full load + ongoing replication mode: DMS first takes a snapshot of the table, then continuously streams changes from the transaction log.

AWS DMS — CDC PIPELINE ARCHITECTURE
────────────────────────────────────────────────────────────
[Source DB]        [DMS]                        [Target]
PostgreSQL    →    Replication Instance    →    S3 (Parquet)
MySQL              (t3.medium or larger)        MSK (Kafka)
Oracle                                          Redshift
SQL Server         Source Endpoint              DynamoDB
                   Target Endpoint
                   Replication Task
                   └── Full Load + CDC mode

WAL / Binlog → DMS reads transaction log → writes CDC events
────────────────────────────────────────────────────────────

DMS → S3 Landing (Parquet + CDC op column)

When DMS writes to S3, it creates Parquet (or CSV) files with a special Op column: I for Insert, U for Update, D for Delete. Your downstream Spark job reads these files and applies the changes to your Delta table using MERGE.

python — boto3: create DMS replication task (Full Load + CDC)

import boto3, json

dms = boto3.client("dms", region_name="us-east-1")

# ── Step 1: Create source endpoint (PostgreSQL) ───────────────────
src = dms.create_endpoint(
    EndpointIdentifier="postgres-source",
    EndpointType="source",
    EngineName="postgres",
    ServerName="mydb.cluster-xyz.us-east-1.rds.amazonaws.com",
    Port=5432,
    DatabaseName="orders_db",
    Username="dms_user",
    Password="{{resolved from Secrets Manager}}",
    PostgreSQLSettings={
        "SlotName":               "dms_replication_slot",  # WAL slot
        "CaptureDdls":            True,
        "HeartbeatEnable":        True,
        "HeartbeatFrequency":     5,
        "DatabaseMode":           "default"
    }
)

# ── Step 2: Create target endpoint (S3) ──────────────────────────
tgt = dms.create_endpoint(
    EndpointIdentifier="s3-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "BucketName":           "my-data-lake",
        "BucketFolder":         "cdc-landing/orders/",
        "CompressionType":      "gzip",          # valid values: "none" | "gzip"
        "DataFormat":           "parquet",
        "ParquetVersion":       "parquet-2-0",
        "EnableStatistics":     True,
        "IncludeOpForFullLoad": True,   # add Op column even for full load rows
        # Leave CdcInsertsOnly and CdcInsertsAndUpdates at their default (False)
        # so that I, U and D rows are all written. CdcInsertsAndUpdates=True
        # would drop DELETEs, and there is no "CdcDeletesEnabled" setting.
        "TimestampColumnName":  "dms_timestamp",  # add arrival timestamp
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-role"
    }
)

# ── Step 3: Create replication task ──────────────────────────────
task = dms.create_replication_task(
    ReplicationTaskIdentifier="orders-cdc-task",
    SourceEndpointArn=src["Endpoint"]["EndpointArn"],
    TargetEndpointArn=tgt["Endpoint"]["EndpointArn"],
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:REPLICATION_INSTANCE_ARN",
    MigrationType="full-load-and-cdc",   # snapshot first, then ongoing
    TableMappings=json.dumps({
        "rules": [{
            "rule-type":   "selection",
            "rule-id":     "1",
            "rule-name":   "include-orders",
            "object-locator": {
                "schema-name": "public",
                "table-name":  "orders"
            },
            "rule-action": "include"
        }]
    }),
    ReplicationTaskSettings=json.dumps({
        "TargetMetadata": {
            "TargetSchema": "",
            "SupportLobs":  True,
            "FullLobMode":  False
        },
        "Logging": {
            "EnableLogging": True
        }
    })
)

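# ── Step 4: Wait until the task is ready, then start it ──────────
task_arn = task["ReplicationTask"]["ReplicationTaskArn"]
# A new task stays in "creating" briefly; starting it too early raises an error.
dms.get_waiter("replication_task_ready").wait(
    Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
)
dms.start_replication_task(
    ReplicationTaskArn=task_arn,
    StartReplicationTaskType="start-replication"   # or "resume-processing"
)
print("DMS task started: full load + ongoing CDC running")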

DMS → MSK (Kafka CDC Events)

DMS can also write CDC events directly to MSK. This is the preferred pattern when you want multiple consumers to process the same CDC stream — e.g. one Spark job writes to Delta Bronze, another updates a real-time cache in DynamoDB, and a third triggers alerting logic.

python — DMS target endpoint for MSK Kafka

dms.create_endpoint(
    EndpointIdentifier="msk-target",
    EndpointType="target",
    EngineName="kafka",
    KafkaSettings={
        "Broker":                  "b-1.mycluster.kafka.us-east-1.amazonaws.com:9094",  # MSK TLS port
        "Topic":                   "orders-cdc",
        "MessageFormat":           "json",              # or "json-unformatted"
        "IncludeTransactionDetails": True,
        "IncludePartitionValue":   True,
        "PartitionIncludeSchemaTable": True,
        "IncludeTableAlterOperations": True,
        "IncludeControlDetails":   True,
        "IncludeNullAndEmpty":     False,
        "SecurityProtocol":        "ssl-encryption"     # TLS in transit to MSK
    }
)

DMS Task Monitoring and Error Handling

python — poll DMS task status with boto3

import time

def wait_for_dms_task(task_arn, terminal_states=("stopped", "failed")):
    while True:
        resp  = dms.describe_replication_tasks(
            Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
        )
        task  = resp["ReplicationTasks"][0]
        state = task["Status"]
        stats = task.get("ReplicationTaskStats", {})

        print(f"State: {state} | "
              f"TablesLoaded: {stats.get('TablesLoaded',0)} | "
              f"TablesLoading: {stats.get('TablesLoading',0)} | "
              f"TablesErrored: {stats.get('TablesErrored',0)}")

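        if state in terminal_states:
            break
        # Full load is done when progress reaches 100% and nothing is still loading.
        if (state == "running"
                and stats.get("FullLoadProgressPercent", 0) == 100
                and stats.get("TablesLoading", 0) == 0):
            print("Full load complete; CDC ongoing")
            break   # for full-load-and-cdc, don't wait forever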

        time.sleep(15)

wait_for_dms_task(task["ReplicationTask"]["ReplicationTaskArn"])

⚠️ DMS Gotchas in Production

(1) For PostgreSQL CDC, the source must run with wal_level = logical. On RDS you enable this by setting rds.logical_replication = 1 in the DB parameter group, which requires a reboot. (2) DMS replication slots retain WAL on the source until it is consumed, so monitor slot lag and set max_slot_wal_keep_size (PostgreSQL 13+). (3) A DMS task that fails mid-load leaves partial data, so land into a staging prefix and promote it only after the load completes. (4) LOB columns need SupportLobs=True, which slows replication.

⚡

Debezium on MSK — Open-Source CDC with Kafka Connect

What Debezium Is and How It Works

Debezium is an open-source CDC platform built on top of Kafka Connect. It runs as a Kafka Connect connector: you deploy it alongside MSK (or on MSK Connect, AWS's managed Kafka Connect), configure it to point at your source database, and it reads the transaction log and publishes a structured CDC event to a Kafka topic for every change. There is no polling and no query load on the source tables, and latency is typically sub-second.

DEBEZIUM CDC ARCHITECTURE ON AWS
──────────────────────────────────────────────────────────────
[RDS PostgreSQL]              [MSK Connect]            [MSK Kafka]
WAL (logical replication) ─────────────────────────→   topic: myserver.public.orders
        ↑                     Debezium PostgreSQL      (CDC events: c/u/d/r)
        └── Debezium reads    Connector                        ↓
            pg_logical slot   (Kafka Connect           [Spark Streaming]
                              worker fleet)            foreachBatch → MERGE INTO Delta

Each topic name = {server}.{schema}.{table}
  e.g. myserver.public.orders
       myserver.public.customers
──────────────────────────────────────────────────────────────

Debezium PostgreSQL Connector — Configuration

json — Debezium PostgreSQL connector config (Kafka Connect REST API)

{
  "name": "orders-postgres-cdc",
  "config": {
    "connector.class":                    "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname":                  "mydb.cluster-xyz.us-east-1.rds.amazonaws.com",
    "database.port":                      "5432",
    "database.user":                      "debezium",
    "database.password":                  "${file:/opt/kafka/secrets/db.properties:password}",
    "database.dbname":                    "orders_db",
    "database.server.name":               "myserver",
    "table.include.list":                 "public.orders,public.customers",
    "plugin.name":                        "pgoutput",
    "slot.name":                          "debezium_slot",
    "publication.name":                   "debezium_pub",
    "snapshot.mode":                      "initial",
    "decimal.handling.mode":             "double",
    "time.precision.mode":               "adaptive",
    "tombstones.on.delete":              "true",
    "key.converter":                     "org.apache.kafka.connect.json.JsonConverter",
    "value.converter":                   "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable":      "false",
    "value.converter.schemas.enable":    "false",
    "transforms":                        "unwrap",
    "transforms.unwrap.type":            "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.add.fields":      "op,ts_ms,source.ts_ms",
    "transforms.unwrap.delete.handling.mode": "rewrite"
  }
}

🔑 ExtractNewRecordState Transform

Without the unwrap transform, Debezium events have a complex nested structure: {"before": {...}, "after": {...}, "source": {...}, "op": "u"}. The ExtractNewRecordState SMT (Single Message Transform) flattens this — the value becomes just the after fields, with __op, __ts_ms appended. This makes downstream Spark parsing much simpler.
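The flattening can be approximated in plain Python to see what the downstream schema will look like. This is a rough sketch, not Debezium's implementation: the `unwrap` function is hypothetical, it assumes the "rewrite" delete handling shown in the config, and the exact type of the `__deleted` marker may differ between Debezium versions.

```python
def unwrap(envelope, add_fields=("op", "ts_ms")):
    """Approximate ExtractNewRecordState with delete handling set to 'rewrite'."""
    is_delete = envelope["op"] == "d"
    # For deletes the after image is null, so fall back to the before image.
    row = dict(envelope["before"] if is_delete else envelope["after"])
    for field in add_fields:
        row["__" + field] = envelope[field]          # metadata gets a "__" prefix
    row["__deleted"] = "true" if is_delete else "false"
    return row


update = {
    "op": "u",
    "before": {"order_id": "O-001", "status": "PLACED"},
    "after": {"order_id": "O-001", "status": "SHIPPED"},
    "ts_ms": 1700000060000,
}
delete = {
    "op": "d",
    "before": {"order_id": "O-001", "status": "SHIPPED"},
    "after": None,
    "ts_ms": 1700000120000,
}

flat_update = unwrap(update)
flat_delete = unwrap(delete)
print(flat_update)
print(flat_delete)
```

Both outputs are flat records with the same columns, which is why a single Spark schema can parse inserts, updates, and deletes alike.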

Debezium MySQL Connector

json — Debezium MySQL connector config

{
  "name": "mysql-cdc",
  "config": {
    "connector.class":          "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname":         "mysql.cluster-xyz.us-east-1.rds.amazonaws.com",
    "database.port":             "3306",
    "database.user":             "debezium",
    "database.password":         "${file:/opt/kafka/secrets/db.properties:password}",
    "database.server.id":        "12345",        // unique server ID for binlog
    "database.server.name":      "mysqlserver",
    "database.include.list":     "orders_db",
    "table.include.list":        "orders_db.orders",
    "database.history.kafka.bootstrap.servers": "b-1.mycluster:9092",
    "database.history.kafka.topic": "schema-changes.orders_db",
    "snapshot.mode":             "initial",
    "include.schema.changes":    "true"
  }
}

Tombstone Records for Deletes

When a row is deleted, Debezium publishes two Kafka messages to the topic: (1) a DELETE event with "op": "d" and the before-image, and (2) a tombstone — a message with the same key but a null value. The tombstone tells Kafka log compaction to remove that key from the compacted log. Your Spark consumer should filter out tombstones (null values) before processing.

python — filter tombstone records in Spark

# Kafka value is null for tombstone records — filter them out
non_tombstone = raw_kafka_stream.filter(col("value").isNotNull())

# Now parse the remaining (valid) CDC events
cdc_events = non_tombstone \
    .select(from_json(col("value").cast("string"), cdc_schema).alias("d")) \
    .select("d.*")

🏠

Landing CDC in Delta Lake — MERGE INTO Pattern

MERGE INTO for Upsert from CDC Events (Insert + Update)

The core pattern for landing CDC into Delta: for each micro-batch, MERGE the incoming events into the target Delta table. Rows with matching primary keys get updated; new keys get inserted; delete events remove rows. This produces a current-state table that always reflects the latest state of the source database.
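Before looking at the Spark version, it can help to see the per-batch logic on ordinary Python dictionaries. The sketch below assumes flattened events with `__op` and `__ts_ms` fields; `apply_cdc_batch_local` is a hypothetical name. It shows why each batch is reduced to the latest event per key before anything is applied.

```python
def apply_cdc_batch_local(target, batch, key="order_id"):
    """Reduce a batch to the latest event per key, then upsert or delete."""
    latest = {}
    for event in batch:
        previous = latest.get(event[key])
        if previous is None or event["__ts_ms"] >= previous["__ts_ms"]:
            latest[event[key]] = event

    for pk, event in latest.items():
        if event["__op"] == "d":
            target.pop(pk, None)                       # delete if present
        else:
            # Strip the CDC metadata columns before storing the row.
            target[pk] = {k: v for k, v in event.items() if not k.startswith("__")}
    return target


target = {"O-002": {"order_id": "O-002", "status": "PLACED"}}
batch = [
    {"order_id": "O-001", "status": "PLACED", "__op": "c", "__ts_ms": 1},
    {"order_id": "O-001", "status": "SHIPPED", "__op": "u", "__ts_ms": 2},
    {"order_id": "O-002", "status": "PLACED", "__op": "d", "__ts_ms": 3},
    {"order_id": "O-003", "status": "PLACED", "__op": "c", "__ts_ms": 1},
    {"order_id": "O-003", "status": "PLACED", "__op": "d", "__ts_ms": 2},
]

apply_cdc_batch_local(target, batch)
print(target)
```

O-003 is created and deleted inside the same batch, so it never reaches the target; without the reduction step the insert and the delete would be applied in an undefined order.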

python — MERGE INTO Delta from CDC stream (Debezium / DMS)

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, max as spark_max
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType
from delta.tables import DeltaTable

# ── Schema for the unwrapped Debezium event ───────────────────────
# (after ExtractNewRecordState SMT, fields are flattened)
cdc_schema = StructType([
    StructField("order_id",    StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("amount",      DoubleType(), True),
    StructField("status",      StringType(), True),
    StructField("__op",        StringType(), True),    # c / u / d / r
    StructField("__ts_ms",     LongType(),   True)     # source change timestamp
])

DELTA_PATH   = "s3://my-data-lake/silver/orders/"
CKPT_PATH    = "s3://my-ckpt/cdc-orders/"
KAFKA_BROKER = "b-1.mycluster.kafka.us-east-1.amazonaws.com:9092"

def apply_cdc_batch(batch_df, batch_id):
    """Apply a micro-batch of CDC events to the Delta target table."""
    if batch_df.isEmpty():
        return

    batch_df.cache()

    # ── Deduplicate: keep only the LATEST change per order_id in this batch
    # A batch might have: INSERT O-001, then UPDATE O-001
    # We only want to apply the latest state
    from pyspark.sql.window import Window
    from pyspark.sql.functions import row_number, desc

    window = Window.partitionBy("order_id").orderBy(desc("__ts_ms"))
    latest = batch_df \
        .withColumn("_rn", row_number().over(window)) \
        .filter(col("_rn") == 1) \
        .drop("_rn")

    # ── Split: upserts (c/u/r) vs deletes (d) ────────────────────────
    upserts = latest.filter(col("__op").isin("c", "u", "r")) \
                    .drop("__op", "__ts_ms")
    deletes = latest.filter(col("__op") == "d")

    delta_table = DeltaTable.forPath(spark, DELTA_PATH)

    # ── Count once; every count() is a separate Spark action ─────────
    n_upserts = upserts.count()
    n_deletes = deletes.count()

    # ── Apply upserts via MERGE ───────────────────────────────────────
    if n_upserts > 0:
        delta_table.alias("t").merge(
            upserts.alias("s"),
            "t.order_id = s.order_id"
        ).whenMatchedUpdateAll() \
         .whenNotMatchedInsertAll() \
         .execute()

    # ── Apply deletes ─────────────────────────────────────────────────
    if n_deletes > 0:
        # Build a condition like: order_id IN ('O-001', 'O-002', ...)
        # Collecting ids to the driver is only suitable for small batches.
        delete_ids = [row["order_id"] for row in deletes.select("order_id").collect()]
        delta_table.delete(col("order_id").isin(delete_ids))

    batch_df.unpersist()
    print(f"Batch {batch_id}: upserts={n_upserts}, deletes={n_deletes}")

# ── Read CDC stream from MSK ──────────────────────────────────────
raw = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BROKER) \
    .option("subscribe", "myserver.public.orders") \
    .option("startingOffsets", "earliest") \
    .option("failOnDataLoss", "false") \
    .load()

# Parentheses instead of backslashes: a comment after "\" is a SyntaxError
cdc_stream = (
    raw
    .filter(col("value").isNotNull())     # drop tombstones
    .select(
        from_json(col("value").cast("string"), cdc_schema).alias("d")
    )
    .select("d.*")
)

query = cdc_stream.writeStream \
    .foreachBatch(apply_cdc_batch) \
    .option("checkpointLocation", CKPT_PATH) \
    .trigger(processingTime="1 minute") \
    .start()

query.awaitTermination()

✅ What This Achieves

After this pipeline runs continuously, s3://my-data-lake/silver/orders/ always contains the current state of the orders table — every insert is reflected, every update is applied, every delete removes the row. Downstream Athena queries, dbt models, and Redshift Spectrum tables always see fresh, consistent data.

SCD Type 2 from CDC Stream — History Tracking

Instead of a current-state table, you can build a full history table (SCD Type 2) where every change creates a new row with effective_start, effective_end, and is_current columns. This lets you query the state of any record at any point in time.
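The close-then-append mechanics of SCD Type 2 can be shown on a plain list of dictionaries. This sketch uses the source change timestamp for the effective dates; `apply_scd2` and `as_of` are hypothetical helper names.

```python
def apply_scd2(history, change, key="order_id"):
    """Close the current version for the key (if any) and append a new one."""
    ts = change["ts_ms"]
    for row in history:
        if row[key] == change[key] and row["is_current"]:
            row["is_current"] = False
            row["effective_end"] = ts
    new_row = {k: v for k, v in change.items() if k != "ts_ms"}
    new_row.update(effective_start=ts, effective_end=None, is_current=True)
    history.append(new_row)
    return history


def as_of(history, key_value, ts, key="order_id"):
    """Return the version that was valid at time ts, or None."""
    for row in history:
        still_open = row["effective_end"] is None or ts < row["effective_end"]
        if row[key] == key_value and row["effective_start"] <= ts and still_open:
            return row
    return None


history = []
apply_scd2(history, {"order_id": "O-001", "status": "PLACED", "ts_ms": 100})
apply_scd2(history, {"order_id": "O-001", "status": "SHIPPED", "ts_ms": 200})

print(as_of(history, "O-001", 150)["status"])
print(as_of(history, "O-001", 250)["status"])
```

Each change adds exactly one row and closes at most one, so the history grows by one version per change and only one version per key is current.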

python — SCD Type 2 MERGE INTO Delta from CDC

from pyspark.sql.functions import col, current_timestamp, lit
from delta.tables import DeltaTable

HISTORY_PATH = "s3://my-data-lake/silver/orders_history/"

def apply_scd2_batch(batch_df, batch_id):
    if batch_df.isEmpty(): return

    batch_df.cache()
    # NOTE: deduplicate to one row per order_id first (as in apply_cdc_batch);
    # MERGE fails if several source rows match the same target row.
    upserts = batch_df.filter(col("__op").isin("c", "u", "r")) \
                      .drop("__op", "__ts_ms") \
                      .withColumn("effective_start", current_timestamp()) \
                      .withColumn("effective_end",   lit(None).cast("timestamp")) \
                      .withColumn("is_current",      lit(True))

    delta_table = DeltaTable.forPath(spark, HISTORY_PATH)

    if upserts.count() > 0:
        # Step 1: expire the old current row. No insert clause here, otherwise
        # brand-new keys would be written twice (once here, once by the append).
        delta_table.alias("t").merge(
            upserts.alias("s"),
            "t.order_id = s.order_id AND t.is_current = true"
        ).whenMatchedUpdate(set={
            "is_current":    "false",
            "effective_end": "current_timestamp()"
        }).execute()

        # Step 2: append the NEW current row for every changed or new key
        upserts.write.format("delta").mode("append").save(HISTORY_PATH)

    batch_df.unpersist()

Compaction After CDC Loads

CDC streaming creates many small Delta files, one or more per micro-batch. Over time this degrades query performance. Run a scheduled OPTIMIZE job (on Databricks, or on EMR with a recent open-source Delta Lake release) every few hours to compact small files.
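Compaction is essentially bin-packing: small files are grouped until a group approaches the target size, and each group is rewritten as one file. The sketch below is a simplified greedy planner for intuition only; `plan_compaction` is a hypothetical function and is not how Delta or Iceberg implement their planners.

```python
def plan_compaction(file_sizes, target_size, min_input_files=2):
    """Greedily group small files into bins of at most target_size."""
    small = sorted(size for size in file_sizes if size < target_size)
    groups, current, current_size = [], [], 0
    for size in small:
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    # Rewriting a single file gains nothing, so drop groups below the threshold.
    return [group for group in groups if len(group) >= min_input_files]


# File sizes in MB; the 128 MB file is already at target and is left alone.
sizes_mb = [5, 8, 120, 3, 60, 70, 128]
plan = plan_compaction(sizes_mb, target_size=128)
print(plan)
```

Here four small files are merged into one 76 MB file, while the 70 MB and 120 MB files stay as they are because no partner fits alongside them.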

python — Delta OPTIMIZE + ZORDER after CDC landing

from delta.tables import DeltaTable

# ── Run after your CDC stream has been running for a while ────────
delta_table = DeltaTable.forPath(spark, "s3://my-data-lake/silver/orders/")

# OPTIMIZE: compact small Parquet files toward the target file size (1 GB by default)
# ZORDER BY order_id, customer_id: co-locate related rows for faster lookups on those keys
spark.sql("""
    OPTIMIZE delta.`s3://my-data-lake/silver/orders/`
    ZORDER BY (order_id, customer_id)
""")

# VACUUM: remove files older than 7 days (default retention)
# Don't go below 7 days if you rely on time travel!
delta_table.vacuum(retentionHours=168)   # 168 hours = 7 days

print("Compaction complete")

🧊

CDC Landing in Apache Iceberg — MERGE INTO Iceberg

Iceberg MERGE INTO for CDC (Spark SQL)

Iceberg supports row-level DML (MERGE, UPDATE, DELETE) since Iceberg 0.13 / Spark 3.x. The MERGE syntax is very similar to Delta, making it easy to port your CDC logic between the two formats. Iceberg uses copy-on-write (default) or merge-on-read strategies for row-level changes.
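The difference between the two strategies is where the cost of a delete is paid. The following toy model, in plain Python, treats each data file as a list of keys; the three function names are hypothetical, and real Iceberg delete files are considerably more detailed than a set of keys.

```python
def delete_copy_on_write(data_files, key):
    """Rewrite every file that contains the key; untouched files are reused."""
    rewritten = []
    for data_file in data_files:
        if key in data_file:
            rewritten.append([row for row in data_file if row != key])
        else:
            rewritten.append(data_file)
    return rewritten


def delete_merge_on_read(delete_log, key):
    """Leave data files untouched and only record the deleted key."""
    return delete_log | {key}


def read_merge_on_read(data_files, delete_log):
    """Readers filter out deleted keys while scanning."""
    return [row for data_file in data_files for row in data_file if row not in delete_log]


files = [["O-001", "O-002"], ["O-003"]]

cow_files = delete_copy_on_write(files, "O-001")       # pays at write time
mor_log = delete_merge_on_read(set(), "O-001")          # cheap write
mor_rows = read_merge_on_read(files, mor_log)           # pays at read time
print(cow_files, mor_rows)
```

Copy-on-write suits read-heavy tables with occasional changes, while merge-on-read suits frequent CDC writes followed by periodic compaction.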

python — MERGE INTO Iceberg from CDC stream via foreachBatch

def apply_cdc_to_iceberg(batch_df, batch_id):
    if batch_df.isEmpty(): return

    batch_df.cache()

    # ── Deduplicate within batch ──────────────────────────────────────
    from pyspark.sql.window import Window
    from pyspark.sql.functions import row_number, desc

    window  = Window.partitionBy("order_id").orderBy(desc("__ts_ms"))
    latest  = batch_df.withColumn("_rn", row_number().over(window)) \
                      .filter(col("_rn") == 1).drop("_rn")

    upserts = latest.filter(col("__op").isin("c", "u", "r"))
    deletes = latest.filter(col("__op") == "d")

    # ── Register batch as temp views for SQL MERGE ────────────────────
    upserts.drop("__op", "__ts_ms").createOrReplaceTempView("cdc_upserts")
    deletes.select("order_id").createOrReplaceTempView("cdc_deletes")

    # Temp views are scoped to the micro-batch's own session, so run SQL
    # through batch_df.sparkSession (Spark 3.3+) rather than the global `spark`.
    batch_spark = batch_df.sparkSession

    # ── MERGE upserts into Iceberg target table ───────────────────────
    batch_spark.sql("""
        MERGE INTO glue_catalog.silver.orders AS t
        USING cdc_upserts AS s
        ON t.order_id = s.order_id
        WHEN MATCHED THEN
            UPDATE SET
                t.customer_id = s.customer_id,
                t.amount      = s.amount,
                t.status      = s.status
        WHEN NOT MATCHED THEN
            INSERT (order_id, customer_id, amount, status)
            VALUES (s.order_id, s.customer_id, s.amount, s.status)
    """)

    # ── DELETE rows from Iceberg target table ─────────────────────────
    if deletes.count() > 0:
        batch_spark.sql("""
            DELETE FROM glue_catalog.silver.orders
            WHERE order_id IN (SELECT order_id FROM cdc_deletes)
        """)

    batch_df.unpersist()

# ── Wire into the streaming query ────────────────────────────────
query = cdc_stream.writeStream \
    .foreachBatch(apply_cdc_to_iceberg) \
    .option("checkpointLocation", "s3://my-ckpt/cdc-iceberg/") \
    .trigger(processingTime="1 minute") \
    .start()

query.awaitTermination()

☁️ Iceberg CDC on AWS — Best Practice

Use Iceberg with the Glue Catalog and Athena for the query layer. Athena natively supports Iceberg MERGE INTO — you can also run the CDC MERGE directly from Athena without Spark for small tables. For compaction, use Iceberg's rewrite_data_files procedure in a scheduled Glue job.

Iceberg Compaction After CDC

python — Iceberg rewrite_data_files (compaction) after CDC loads

# ── Compact small files after CDC streaming ───────────────────────
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
        table  => 'silver.orders',
        options => map(
            'target-file-size-bytes', '134217728',   -- 128 MB target
            'min-input-files',        '5'             -- only compact if 5+ small files
        )
    )
""")

# ── Remove old snapshots (Iceberg keeps history like Delta time travel)
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table                => 'silver.orders',
        older_than           => TIMESTAMP '2024-01-01 00:00:00',
        retain_last          => 5
    )
""")

# ── Remove orphan files (files not referenced by any snapshot) ────
# The Spark procedure is named remove_orphan_files.
spark.sql("""
    CALL glue_catalog.system.remove_orphan_files(
        table => 'silver.orders'
    )
""")

🏗️

Complete CDC Pipeline Architecture on AWS

Full Reference Architecture: RDS → Debezium → MSK → Spark → Delta

COMPLETE CDC PIPELINE — AWS REFERENCE ARCHITECTURE
────────────────────────────────────────────────────────────────────
[Source: RDS PostgreSQL]
        │
        │  WAL (logical replication)
        ▼
[MSK Connect — Debezium PostgreSQL Connector]
        │  topic: myserver.public.orders     (JSON CDC events)
        │  topic: myserver.public.customers
        ▼
[Amazon MSK (Kafka)]
        │
        ├─────────────────────────────────────────────┐
        │                                             │
        ▼                                             ▼
[EMR Spark Structured Streaming]              [Lambda — DLQ monitor]
  readStream from MSK                           CloudWatch Alarm
  foreachBatch:                                 → SNS alert if
    - filter tombstones                           DLQ depth > 0
    - deduplicate by pk + ts_ms
    - split: upserts / deletes
    - MERGE INTO Delta (silver/orders)
    - Write audit record to DynamoDB
    - Publish metrics to CloudWatch
        │
        ▼
[S3 Delta Lake — silver/orders/]
  (current-state table, always up-to-date)
        │
        ├──────────────────┐
        │                  │
        ▼                  ▼
[Athena Queries]        [dbt Gold Layer]
[Redshift Spectrum]     (batch transforms)
[QuickSight Dashboards]
────────────────────────────────────────────────────────────────────
Key Guarantees:
  ✅ No full table scans on source RDS
  ✅ Sub-second change capture
  ✅ Exactly-once delivery (Spark checkpoint + Delta MERGE)
  ✅ Hard deletes captured and applied
  ✅ Schema evolution handled (mergeSchema: true)
  ✅ Compaction runs every 4 hours (EventBridge → EMR step)
────────────────────────────────────────────────────────────────────

Boto3: Trigger Compaction After CDC via EventBridge + EMR

Schedule a compaction job every 4 hours using EventBridge to trigger a Lambda that adds an EMR step running the Iceberg/Delta OPTIMIZE logic.

python — Lambda triggered by EventBridge to run EMR compaction step

import boto3, json, os

emr = boto3.client("emr", region_name="us-east-1")

def lambda_handler(event, context):
    cluster_id = os.environ["EMR_CLUSTER_ID"]

    resp = emr.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[{
            "Name":            "Delta-Compaction-Orders",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar":  "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--conf", "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension",
                    "--conf", "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog",
                    "s3://my-code-bucket/scripts/compact_orders.py"
                ]
            }
        }]
    )

    step_id = resp["StepIds"][0]
    print(f"Compaction step submitted: {step_id}")
    return {"statusCode": 200, "stepId": step_id}

29.26 — AWS DATA ENGINEERING

Delta Lake & Apache Iceberg

Delta Lake and Apache Iceberg are open table formats that bring ACID transactions, schema evolution, time travel, and upserts to files stored on S3. Without them, S3 is just a dump of Parquet files: you can't update a row, delete a record, or guarantee consistency. With them, tables on S3 behave much more like tables in a transactional data warehouse. This section covers what you need for production: ACID guarantees, MERGE INTO, OPTIMIZE/VACUUM, hidden partitioning, and when to choose Delta vs Iceberg.

🧊

Why Open Table Formats Exist — The Problem with Raw Parquet on S3

The Problems with Raw Parquet on S3

Before Delta Lake and Iceberg existed, data engineers stored data on S3 as raw Parquet files. This worked for simple append-only workloads but broke down in production because of five fundamental problems.

Problem	What Goes Wrong	Example
No ACID Transactions	Two writers writing simultaneously corrupt data — partial writes are visible to readers	Glue job and Spark job both append → duplicate rows
No UPDATE / DELETE	You cannot modify or remove a specific row — only append or rewrite the entire partition	GDPR erasure request impossible without full rewrite
No Schema Evolution Safety	Adding a column to new files breaks readers that expect the old schema	Athena query fails after producer adds a column
No Time Travel	Once data is overwritten, the old version is gone forever	Analyst runs wrong query, overwrites Gold table — unrecoverable
Slow Partition Discovery	Listing millions of S3 files to discover partitions is slow and expensive	Glue crawler takes 45 minutes on a large lake

📦 Analogy

Raw Parquet on S3 is like a shared Google Doc where anyone can edit any paragraph simultaneously, there is no undo button, you cannot delete a sentence without rewriting the entire page, and every time you open it you have to re-read every page to find out what changed. Delta Lake and Iceberg add a proper version control system — like Git — on top of the same files.

How Open Table Formats Solve These Problems

Both Delta Lake and Iceberg solve these problems by adding a metadata layer on top of the existing Parquet files. The data files themselves are still Parquet on S3 — but the table format adds a transaction log that tracks every change: which files were added, which were removed, what the schema is, and when each operation happened. This transaction log is what enables ACID, time travel, schema evolution, and fast metadata operations.
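To make the idea of a transaction log concrete, here is a minimal pure-Python sketch. It is a toy model, not the real Delta or Iceberg implementation: the `live_files` function and the simplified `add` / `remove` action dictionaries are assumptions made for illustration. It shows how replaying the log up to a given commit yields the set of data files that make up that version of the table.

```python
# Toy model of a table-format transaction log (not the real Delta/Iceberg API).
# Each commit is a list of actions; replaying them yields the live file set.
from typing import Dict, List, Set

Commit = List[Dict[str, str]]


def live_files(log: List[Commit], as_of_version: int) -> Set[str]:
    """Replay commits 0..as_of_version and return the files in that version."""
    files: Set[str] = set()
    for commit in log[: as_of_version + 1]:
        for action in commit:
            if "add" in action:
                files.add(action["add"])
            elif "remove" in action:
                files.discard(action["remove"])
    return files


log = [
    # v0: initial write
    [{"add": "part-0001.parquet"}, {"add": "part-0002.parquet"}],
    # v1: append
    [{"add": "part-0003.parquet"}],
    # v2: a MERGE rewrote part-0001 into part-0004
    [{"remove": "part-0001.parquet"}, {"add": "part-0004.parquet"}],
]

print(sorted(live_files(log, 2)))  # files a reader of the latest version scans
print(sorted(live_files(log, 0)))  # files a time-travel read of version 0 scans
```

Note that `part-0001.parquet` is never physically deleted by the MERGE. It simply stops being part of the current version, which is why older versions stay readable.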

OPEN TABLE FORMAT STRUCTURE

Without Open Format (raw Parquet):
  s3://lake/orders/year=2024/month=01/part-0001.parquet
  s3://lake/orders/year=2024/month=01/part-0002.parquet
  s3://lake/orders/year=2024/month=02/part-0001.parquet
  ← No metadata. No history. No transactions.

With Delta Lake:
  s3://lake/orders/_delta_log/                     ← Transaction log
    00000000000000000000.json                      ← Commit 0: initial write
    00000000000000000001.json                      ← Commit 1: schema change
    00000000000000000002.json                      ← Commit 2: MERGE upsert
    00000000000000000010.checkpoint.parquet        ← Checkpoint every 10 commits
  s3://lake/orders/year=2024/month=01/part-0001.parquet
  s3://lake/orders/year=2024/month=01/part-0002.parquet

With Iceberg:
  s3://lake/orders/metadata/
    v1.metadata.json                               ← Table metadata
    snap-12345.avro                                ← Snapshot 1 manifest list
    snap-12346.avro                                ← Snapshot 2 manifest list
  s3://lake/orders/data/
    00001-1-abc.parquet
    00001-2-def.parquet

Delta Lake vs Iceberg vs Apache Hudi — Quick Comparison

Feature	Delta Lake	Apache Iceberg	Apache Hudi
Created by	Databricks (open-sourced)	Netflix + Apple	Uber
AWS Native	EMR + Glue support	✅ Athena + Glue native	EMR support
MERGE / Upsert	✅ Strong	✅ Strong	✅ (CoW + MoR)
Time Travel	✅ Version/timestamp	✅ Snapshot-based	✅ Limited
Athena Support	Read-only, native on engine v3 (manifest file on older engines)	✅ Native, first-class	Read-only, native
Schema Evolution	✅ Add/rename/drop	✅ Full DDL support	✅ Partial
Best with	Databricks, EMR, Glue	Athena, Glue, EMR, Flink	Spark streaming CDC
Recommendation on AWS	Best for EMR/Glue Spark	Best for Athena + Glue	Only for specific CDC patterns

🔑 AWS Rule of Thumb

Use Delta Lake when your primary compute is EMR or Glue Spark and you don't need Athena to write to the table. Use Iceberg when you need Athena to run MERGE / UPDATE / DELETE, or when your stack is Glue + Athena + Redshift Spectrum.

Δ

Delta Lake — ACID on S3 with Transaction Log DELTA LAKE ▼

The Delta Transaction Log — How ACID Works

The _delta_log/ directory is the heart of Delta Lake. Every write operation (INSERT, UPDATE, DELETE, MERGE) appends a new JSON commit file to this log. Each commit file records exactly which Parquet files were added and which were removed. Readers always check the log first to determine which files are part of the "current" version of the table — this is how Delta achieves snapshot isolation and ACID compliance.

python — creating and writing a Delta table on S3 (PySpark)

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# ── Build SparkSession with Delta Lake support ────────────────────
builder = SparkSession.builder \
    .appName("DeltaLakeDemo") \
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

# ── Write a DataFrame to Delta format ────────────────────────────
data = [
    ("O-001", "C-101", 99.99,  "PLACED"),
    ("O-002", "C-102", 149.50, "SHIPPED"),
    ("O-003", "C-103", 25.00,  "PLACED")
]
df = spark.createDataFrame(data, ["order_id", "customer_id", "amount", "status"])

DELTA_PATH = "s3://my-data-lake/silver/orders/"

df.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("status") \
    .save(DELTA_PATH)

print("✅ Delta table written — _delta_log/ created on S3")

# ── Read back the Delta table ─────────────────────────────────────
df_read = spark.read.format("delta").load(DELTA_PATH)
df_read.show()

# ── Check the transaction log ─────────────────────────────────────
# s3://my-data-lake/silver/orders/_delta_log/00000000000000000000.json
# Contains: {"add": {"path": "status=PLACED/part-xxx.parquet", ...}}

# ── Register as a table in Glue Catalog ──────────────────────────
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver.orders
    USING delta
    LOCATION 's3://my-data-lake/silver/orders/'
""")

ACID Properties — What They Mean in Practice

Delta provides all four ACID properties. Understanding what each means practically helps you design pipelines correctly.

⚛️

Atomicity

A MERGE either fully succeeds or fully fails. If the Spark job crashes mid-write, the transaction log is not committed — readers never see partial data.

🔒

Consistency

Schema constraints are enforced. A write with a mismatched schema fails rather than corrupting the table silently.

📸

Isolation

Readers always see a consistent snapshot. While a writer is running a MERGE, concurrent readers see the pre-MERGE version — no dirty reads.

💾

Durability

Once committed to the S3 transaction log, data survives any cluster failure. S3's 11-nines durability backs the guarantee.
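The atomicity and isolation guarantees above rest on one idea: a commit is a single "create the next numbered log entry if it does not exist yet" step. The sketch below models that with an in-memory dictionary. `ToyLog` and `try_commit` are hypothetical names invented for this illustration, and how the real implementation achieves the put-if-absent step on S3 is outside the scope of the sketch.

```python
class ToyLog:
    """In-memory stand-in for a _delta_log/ directory (illustration only)."""

    def __init__(self):
        self.commits = {}  # version number -> list of actions

    def latest_version(self) -> int:
        return max(self.commits, default=-1)

    def try_commit(self, read_version: int, actions: list) -> bool:
        """Commit as read_version + 1 only if nobody else has taken that slot."""
        next_version = read_version + 1
        if next_version in self.commits:
            return False  # another writer won; the caller must re-read and retry
        self.commits[next_version] = actions
        return True


log = ToyLog()
log.try_commit(-1, ["add part-0001.parquet"])  # version 0

# Two writers both start from version 0 and race to create version 1.
seen = log.latest_version()
writer_a = log.try_commit(seen, ["add part-A.parquet"])
writer_b = log.try_commit(seen, ["add part-B.parquet"])
print(writer_a, writer_b)

# The losing writer re-reads the log and retries on top of the winner's commit.
retry_b = log.try_commit(log.latest_version(), ["add part-B.parquet"])
print(retry_b, log.latest_version())
```

Readers never see a half-written commit in this model: a version either exists in full or does not exist at all.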

Schema Evolution — addColumn, renameColumn, dropColumn

Delta tracks the table schema in the transaction log. When you add a column to the source data, you have two options: use mergeSchema=true to automatically add the new column to the Delta schema, or use overwriteSchema=true to replace the schema entirely. Old files that don't have the new column return NULL for that column when read.

python — Delta schema evolution: add a new column safely

from pyspark.sql.functions import lit

# Original table has: order_id, customer_id, amount, status
# New data source now also includes: discount_pct
new_data = [
    ("O-004", "C-104", 200.00, "PLACED", 0.10),
    ("O-005", "C-105", 75.00,  "PLACED", 0.05)
]
df_new = spark.createDataFrame(
    new_data, ["order_id", "customer_id", "amount", "status", "discount_pct"]
)

# ── mergeSchema=true: add new column to existing Delta table schema
# Parentheses replace backslashes so inline comments are legal Python.
(df_new.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # auto-adds discount_pct column
    .save(DELTA_PATH))

# Old rows (O-001, O-002, O-003) now return NULL for discount_pct
spark.read.format("delta").load(DELTA_PATH).show()
# order_id | customer_id | amount | status  | discount_pct
# O-001    | C-101       | 99.99  | PLACED  | null        ← old row
# O-004    | C-104       | 200.00 | PLACED  | 0.10        ← new row

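# ── ALTER TABLE: rename or drop a column ─────────────────────────
# RENAME/DROP COLUMN need column mapping, which upgrades the table protocol
# (older readers and writers can no longer use the table afterwards).
spark.sql("""
    ALTER TABLE silver.orders SET TBLPROPERTIES (
        'delta.minReaderVersion'   = '2',
        'delta.minWriterVersion'   = '5',
        'delta.columnMapping.mode' = 'name'
    )
""")
spark.sql("ALTER TABLE silver.orders RENAME COLUMN discount_pct TO discount")
spark.sql("ALTER TABLE silver.orders DROP COLUMN discount")

# ── overwriteSchema: replace schema entirely (destructive!) ───────
(df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")   # WARNING: loses old schema definition
    .save(DELTA_PATH))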

⚠️ Schema Evolution Rules

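You can safely add nullable columns (old files return NULL). Type changes are much more restricted: with mergeSchema, Delta accepts only a few safe upcasts (byte→short→int), and wider changes such as int→long depend on the type widening feature in newer Delta releases. You cannot change a column to an incompatible type (string→int fails). Dropping or renaming a column requires either column mapping (ALTER TABLE ... DROP COLUMN / RENAME COLUMN) or a full rewrite with overwriteSchema. Always test schema changes in staging before production.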
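The "old files return NULL" rule can be pictured as projecting every stored row onto the current table schema at read time. The following pure-Python sketch illustrates that behaviour. The `project_to_schema` helper is an assumption made for this example and is not part of the Delta API.

```python
from typing import Any, Dict, List


def project_to_schema(row: Dict[str, Any], table_columns: List[str]) -> Dict[str, Any]:
    """Shape a stored row to the current table schema; missing columns become None."""
    return {column: row.get(column) for column in table_columns}


# Current table schema after discount_pct was added with mergeSchema.
table_columns = ["order_id", "customer_id", "amount", "status", "discount_pct"]

old_row = {"order_id": "O-001", "customer_id": "C-101", "amount": 99.99, "status": "PLACED"}
new_row = {"order_id": "O-004", "customer_id": "C-104", "amount": 200.00,
           "status": "PLACED", "discount_pct": 0.10}

for stored in (old_row, new_row):
    print(project_to_schema(stored, table_columns))
```

The old file is never rewritten. Only the schema recorded in the log changes, which is why adding a nullable column is cheap.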

⏰

Time Travel — Query Any Historical Version TIME TRAVEL ▼

What Time Travel Is and Why It Matters

Delta Lake (and Iceberg) keep the data files and log entries of earlier table versions, so those versions stay readable until VACUUM (Delta) or snapshot expiry (Iceberg) removes them. You can read any retained version, either by version number or by timestamp. This is invaluable for: debugging ("what did this table look like before yesterday's bad pipeline run?"), auditing, reproducing ML training datasets, and recovering from accidental deletes or incorrect MERGE operations.

python — Delta time travel by version and timestamp

from delta.tables import DeltaTable

# ── Read a specific version number ───────────────────────────────
# Parentheses replace backslashes so inline comments are legal Python.
df_v0 = (spark.read
    .format("delta")
    .option("versionAsOf", 0)    # version 0 = initial write
    .load(DELTA_PATH))

df_v2 = (spark.read
    .format("delta")
    .option("versionAsOf", 2)    # version 2 = third commit (versions start at 0)
    .load(DELTA_PATH))

# ── Read as of a specific timestamp ──────────────────────────────
df_yesterday = spark.read \
    .format("delta") \
    .option("timestampAsOf", "2024-01-14T23:59:59.000Z") \
    .load(DELTA_PATH)

# ── SQL syntax for time travel ────────────────────────────────────
spark.sql("""
    SELECT * FROM silver.orders
    VERSION AS OF 2
""").show()

spark.sql("""
    SELECT * FROM silver.orders
    TIMESTAMP AS OF '2024-01-14 23:59:59'
""").show()

# ── View the full history of a Delta table ────────────────────────
delta_table = DeltaTable.forPath(spark, DELTA_PATH)
history = delta_table.history()
history.select("version", "timestamp", "operation", "operationParameters").show()
# version | timestamp           | operation | operationParameters
# 2       | 2024-01-15 08:00:05 | MERGE     | {predicate: "t.order_id = s.order_id"}
# 1       | 2024-01-15 07:00:02 | WRITE     | {mode: "Append"}
# 0       | 2024-01-14 06:00:00 | WRITE     | {mode: "Overwrite"}

# ── RESTORE a table to a previous version ────────────────────────
# USE WITH CARE — this is irreversible without another restore
delta_table.restoreToVersion(1)     # restore to version 1
# OR: delta_table.restoreToTimestamp("2024-01-14T23:59:59.000Z")

✅ Real Example — Recovering from a Bad Pipeline Run

Your nightly MERGE pipeline had a bug — it deleted 50,000 rows it shouldn't have. You discover the bug the next morning. With Delta: delta_table.restoreToVersion(prev_version) — the table is back to its pre-bug state in seconds. Without Delta: you'd need to re-run the entire pipeline from the source, taking hours.
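When recovering from a bad run you usually know the time of the bug, not the version number. The sketch below shows one way to resolve "the last version committed at or before this timestamp" from a history listing. It is plain Python over a hand-written list that mirrors the sample history output above; the `version_as_of` helper is a hypothetical name for this illustration.

```python
from bisect import bisect_right
from datetime import datetime

# (version, commit timestamp), oldest first; values mirror the sample history.
history = [
    (0, datetime(2024, 1, 14, 6, 0, 0)),
    (1, datetime(2024, 1, 15, 7, 0, 2)),
    (2, datetime(2024, 1, 15, 8, 0, 5)),
]


def version_as_of(history, ts: datetime) -> int:
    """Return the latest version whose commit time is at or before ts."""
    timestamps = [committed_at for _, committed_at in history]
    index = bisect_right(timestamps, ts)
    if index == 0:
        raise ValueError("timestamp is earlier than the first commit")
    return history[index - 1][0]


# Version to restore if the faulty MERGE ran at 08:00:05 on 2024-01-15.
print(version_as_of(history, datetime(2024, 1, 15, 8, 0, 0)))
```

The resolved number is what you would pass to restoreToVersion, after checking the history row to confirm it is the commit you expect.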

🔀

MERGE INTO — Upserts, Deletes, SCD Type 2 MOST USED ▼

Basic MERGE INTO (Upsert)

MERGE INTO is the most important Delta Lake operation. It takes a source DataFrame (new/changed data) and merges it into the target Delta table. For each row in the source: if a matching row exists in the target (by primary key), update it; if not, insert it.

python — MERGE INTO Delta: basic upsert

from delta.tables import DeltaTable
from pyspark.sql.functions import current_timestamp

# ── Target: existing Delta table ──────────────────────────────────
target = DeltaTable.forPath(spark, "s3://my-data-lake/silver/orders/")

# ── Source: incoming changes (from CDC, streaming, or batch load) ─
updates = spark.createDataFrame([
    ("O-001", "C-101", 99.99,  "SHIPPED"),   # existing row — status changed
    ("O-006", "C-106", 300.00, "PLACED"),   # new row
], ["order_id", "customer_id", "amount", "status"])

# ── MERGE: update matching rows, insert new rows ──────────────────
# Parentheses replace backslashes so inline comments are legal Python.
(target.alias("t")
    .merge(
        updates.alias("s"),
        condition="t.order_id = s.order_id"   # join condition
    )
    .whenMatchedUpdateAll()        # UPDATE all columns on match
    .whenNotMatchedInsertAll()     # INSERT on no match
    .execute())

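# ── MERGE with selective updates and conditions ───────────────────
# Assumes the target table already has an updated_at column; the four-column
# table created earlier would need that column added first.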
target.alias("t").merge(
    updates.alias("s"),
    "t.order_id = s.order_id"
).whenMatchedUpdate(
    condition="s.amount > t.amount",  # only update if new amount is larger
    set={
        "status":      "s.status",
        "amount":      "s.amount",
        "updated_at":  "current_timestamp()"
    }
).whenNotMatchedInsert(
    values={
        "order_id":    "s.order_id",
        "customer_id": "s.customer_id",
        "amount":      "s.amount",
        "status":      "s.status",
        "updated_at":  "current_timestamp()"
    }
).execute()
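If the MERGE semantics feel abstract, the same upsert logic can be written in a few lines of plain Python over dictionaries. This is only a mental model of what "update on match, insert otherwise" means; `merge_upsert` is a hypothetical helper and does not reflect how Delta executes a MERGE internally.

```python
from typing import Dict, List


def merge_upsert(target: List[Dict], updates: List[Dict], key: str = "order_id") -> List[Dict]:
    """Return target with updates applied: matched keys updated, new keys inserted."""
    merged = {row[key]: dict(row) for row in target}
    for row in updates:
        if row[key] in merged:
            merged[row[key]].update(row)      # WHEN MATCHED THEN UPDATE SET *
        else:
            merged[row[key]] = dict(row)      # WHEN NOT MATCHED THEN INSERT *
    return list(merged.values())


target_rows = [
    {"order_id": "O-001", "customer_id": "C-101", "amount": 99.99, "status": "PLACED"},
    {"order_id": "O-002", "customer_id": "C-102", "amount": 149.50, "status": "SHIPPED"},
]
update_rows = [
    {"order_id": "O-001", "customer_id": "C-101", "amount": 99.99, "status": "SHIPPED"},
    {"order_id": "O-006", "customer_id": "C-106", "amount": 300.00, "status": "PLACED"},
]

result = merge_upsert(target_rows, update_rows)
for row in result:
    print(row["order_id"], row["status"])
```

The input lists are left untouched and a new list is returned, which loosely mirrors how a MERGE writes new files rather than editing existing ones in place.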

MERGE with DELETE Clause

Delta's MERGE also supports a whenMatchedDelete() clause — if a matching row exists in the target AND a condition is met, delete the target row. This is the standard pattern for applying hard deletes from CDC events.

python — MERGE INTO with DELETE (CDC pattern)

from pyspark.sql.functions import col

# Source contains CDC events with __op column: c/u/d
cdc_batch = spark.createDataFrame([
    ("O-001", "C-101", 99.99,  "DELIVERED", "u"),  # update
    ("O-003", "C-103", 25.00,  "CANCELLED", "d"),  # delete
    ("O-007", "C-107", 500.00, "PLACED",    "c"),  # insert
], ["order_id", "customer_id", "amount", "status", "__op"])

target = DeltaTable.forPath(spark, "s3://my-data-lake/silver/orders/")

target.alias("t").merge(
    cdc_batch.alias("s"),
    "t.order_id = s.order_id"
).whenMatchedDelete(
    condition="s.__op = 'd'"           # DELETE if it's a delete event
).whenMatchedUpdate(
    condition="s.__op = 'u'",           # UPDATE if it's an update event
    set={
        "status": "s.status",
        "amount": "s.amount"
    }
).whenNotMatchedInsert(
    condition="s.__op = 'c'",           # INSERT only for create events
    values={
        "order_id":    "s.order_id",
        "customer_id": "s.customer_id",
        "amount":      "s.amount",
        "status":      "s.status"
    }
).execute()
print("✅ CDC batch applied: insert + update + delete")
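One practical caveat with CDC batches: a Delta MERGE fails when several source rows match the same target row, so a batch that contains two events for one key has to be reduced to the latest event per key first. The sketch below shows that reduction in plain Python. The `seq` field is an assumption (a monotonically increasing change sequence number supplied by the CDC tool), and `latest_event_per_key` is a hypothetical helper.

```python
from typing import Dict, List


def latest_event_per_key(events: List[Dict], key: str = "order_id",
                         order_by: str = "seq") -> List[Dict]:
    """Keep only the most recent CDC event for each key."""
    latest: Dict[str, Dict] = {}
    for event in events:
        current = latest.get(event[key])
        if current is None or event[order_by] > current[order_by]:
            latest[event[key]] = event
    return list(latest.values())


cdc_events = [
    {"order_id": "O-001", "status": "SHIPPED",   "__op": "u", "seq": 10},
    {"order_id": "O-001", "status": "DELIVERED", "__op": "u", "seq": 12},
    {"order_id": "O-003", "status": "CANCELLED", "__op": "d", "seq": 11},
]

for event in latest_event_per_key(cdc_events):
    print(event["order_id"], event["__op"], event["status"])
```

In PySpark the same step is typically a window function ordered by the sequence column, applied before the batch is passed to merge().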

SCD Type 2 with MERGE — Full History Tracking

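Slowly Changing Dimension Type 2 keeps the full history of changes: each update closes the old row (sets effective_end and is_current=false) and inserts a new row with the new values. Delta's MERGE keeps the close-out step concise. The two-step version below runs as two separate commits, so readers can briefly see a closed row without its replacement; a single MERGE over a staged source is needed when the whole change must be atomic.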

python — SCD Type 2 MERGE INTO Delta

from pyspark.sql.functions import current_timestamp, lit
from delta.tables import DeltaTable

# SCD2 table structure:
# customer_id | name       | country | effective_start | effective_end | is_current

target = DeltaTable.forPath(spark, "s3://my-data-lake/silver/customers_scd2/")

new_data = spark.createDataFrame([
    ("C-101", "Alice Smith", "US"),   # existing customer — country changed
    ("C-200", "New Person",  "IN"),   # new customer
], ["customer_id", "name", "country"])

# Step 1: Close the old row for changed customers
target.alias("t").merge(
    new_data.alias("s"),
    "t.customer_id = s.customer_id AND t.is_current = true"
    " AND (t.name != s.name OR t.country != s.country)"  # only if data changed
).whenMatchedUpdate(set={
    "is_current":   "false",
    "effective_end": "current_timestamp()"
}).execute()

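# Step 2: Insert new rows (new customers + new versions of changed ones).
# Unchanged customers still have an identical current row, so the anti-join
# drops them; without it, every run would duplicate their current row.
from pyspark.sql.functions import col

current_rows = target.toDF().filter("is_current = true")
rows_to_insert = new_data.alias("s").join(
    current_rows.alias("t"),
    (col("s.customer_id") == col("t.customer_id"))
    & (col("s.name") == col("t.name"))
    & (col("s.country") == col("t.country")),
    "left_anti",
)

new_rows = rows_to_insert \
    .withColumn("effective_start", current_timestamp()) \
    .withColumn("effective_end",   lit(None).cast("timestamp")) \
    .withColumn("is_current",      lit(True))

new_rows.write.format("delta").mode("append").save(
    "s3://my-data-lake/silver/customers_scd2/"
)
print("✅ SCD2 applied")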

⚡

OPTIMIZE, ZORDER & VACUUM — Maintenance for Performance MAINTENANCE ▼

The Small Files Problem in Delta

Every MERGE, every streaming micro-batch, and every incremental append creates new small Parquet files. Over time, a Delta table can accumulate thousands of tiny files. Querying a table with 50,000 x 1 MB files is much slower than querying one with 50 x 1 GB files: more S3 GET requests, more file opens, more task scheduling overhead, and a larger transaction log to replay.

📚 Analogy

OPTIMIZE is like defragmenting a hard drive, or re-shelving a library where thousands of post-it notes have been scattered across the floor. You consolidate all those small notes into proper chapters (large Parquet files), making future reads much faster.

OPTIMIZE — Compacting Small Files

OPTIMIZE reads the small files in the table, combines them into larger files (~1 GB each by default), and writes the result as new files. The old small files are marked as removed in the transaction log, so queries skip them, but they stay on S3 for time travel until VACUUM deletes them. Run OPTIMIZE on a schedule: nightly, or after a long run of streaming micro-batches.

python — OPTIMIZE Delta table (PySpark SQL)

# ── OPTIMIZE: compact all small files in the table ───────────────
spark.sql("OPTIMIZE silver.orders")

# ── OPTIMIZE a specific partition only ───────────────────────────
spark.sql("OPTIMIZE silver.orders WHERE status = 'PLACED'")

# ── OPTIMIZE via Delta Table API ─────────────────────────────────
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "s3://my-data-lake/silver/orders/")
delta_table.optimize().executeCompaction()

# ── OPTIMIZE with target file size (default is 1 GB) ─────────────
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 134217728)  # 128 MB
spark.sql("OPTIMIZE silver.orders")
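Compaction is essentially a bin-packing problem: group small files until each group is close to the target size, then rewrite each group as one file. The sketch below is a simplified greedy planner in plain Python. It is not Delta's actual algorithm, and `plan_compaction` is a hypothetical name, but it shows why the target file size setting directly controls how many output files you get.

```python
from typing import List


def plan_compaction(file_sizes_mb: List[int], target_mb: int = 1024) -> List[List[int]]:
    """Greedily group files (smallest first) so each group stays within target_mb."""
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in sorted(file_sizes_mb):
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


small_files = [1] * 2048                       # 2,048 files of 1 MB each
print(len(plan_compaction(small_files, 1024)))  # output files with a 1 GB target
print(len(plan_compaction(small_files, 128)))   # output files with a 128 MB target
```

A smaller target produces more, smaller files, which can suit tables that are read with highly selective filters; a larger target favours full scans.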

ZORDER — Co-locating Related Data for Faster Queries

ZORDER BY is a multi-dimensional clustering technique that arranges data within files so that rows with similar values of the ZORDER columns are stored together. When a query filters on those columns, Delta can skip entire files (data skipping) rather than scanning every file. ZORDER is most effective on high-cardinality columns that are frequently used in WHERE clauses.

python — OPTIMIZE with ZORDER for faster point queries

# ── ZORDER by order_id and customer_id ───────────────────────────
# After this: queries like WHERE order_id = 'O-001' skip most files
spark.sql("""
    OPTIMIZE silver.orders
    ZORDER BY (order_id, customer_id)
""")

# ZORDER via API
delta_table.optimize().executeZOrderBy("order_id", "customer_id")

# ── Check data skipping stats after ZORDER ───────────────────────
# Each "add" action in the JSON commit files carries per-file statistics.
log_df = spark.read.json("s3://my-data-lake/silver/orders/_delta_log/*.json")
log_df.where("add IS NOT NULL") \
    .select("add.path", "add.stats") \
    .show(truncate=False)
# stats shows: min/max of order_id and customer_id per file
# → Delta skips files where min/max range doesn't contain the query value

🔑 When to Use ZORDER vs Partitioning

Partition by low-cardinality columns (year, month, status) — makes entire directories skippable. ZORDER by high-cardinality columns (order_id, customer_id) — makes individual files skippable within a partition. Use both together for maximum data skipping.
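Data skipping itself is simple once per-file min/max statistics exist: a file can only contain a value if that value falls inside the file's range. The sketch below demonstrates this in plain Python with made-up statistics; `files_to_scan` and the sample ranges are assumptions for illustration. It also shows why clustering matters: the tighter and less overlapping the ranges, the more files are skipped.

```python
from typing import Dict, List


def files_to_scan(file_stats: List[Dict[str, str]], value: str) -> List[str]:
    """Return only the files whose [min, max] range can contain the value."""
    return [f["path"] for f in file_stats if f["min"] <= value <= f["max"]]


# After ZORDER: each file covers a narrow, non-overlapping order_id range.
clustered = [
    {"path": "part-0001.parquet", "min": "O-001", "max": "O-250"},
    {"path": "part-0002.parquet", "min": "O-251", "max": "O-500"},
    {"path": "part-0003.parquet", "min": "O-501", "max": "O-750"},
]

# Before ZORDER: every file spans almost the whole key range.
unclustered = [
    {"path": "part-0001.parquet", "min": "O-002", "max": "O-749"},
    {"path": "part-0002.parquet", "min": "O-001", "max": "O-750"},
    {"path": "part-0003.parquet", "min": "O-003", "max": "O-748"},
]

print(files_to_scan(clustered, "O-300"))
print(files_to_scan(unclustered, "O-300"))
```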

VACUUM — Removing Old Files to Save S3 Cost

VACUUM permanently deletes Parquet files that are no longer referenced by the Delta transaction log AND are older than the retention threshold (default 7 days). Without regular VACUUM, deleted/overwritten files accumulate indefinitely on S3, and your storage cost grows even though the table data hasn't grown.

python — VACUUM to reclaim S3 storage

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "s3://my-data-lake/silver/orders/")

# ── VACUUM: delete files older than 7 days (default) ─────────────
delta_table.vacuum()

# ── VACUUM with custom retention (minimum 168 hours = 7 days) ────
delta_table.vacuum(retentionHours=168)   # 7 days

# ── SQL: VACUUM with dry run (see what WOULD be deleted) ─────────
spark.sql("VACUUM silver.orders DRY RUN").show()  # no actual deletion
spark.sql("VACUUM silver.orders RETAIN 168 HOURS")  # actually delete

# ⚠️ NEVER go below 168 hours (7 days) if you use time travel!
# If you VACUUM files needed for a time-travel query, that query will fail.

# ── To allow shorter retention (dev/test only — NOT production) ───
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
delta_table.vacuum(retentionHours=1)   # DEV ONLY — removes all old versions
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")

⚠️ VACUUM and Time Travel Trade-off

The longer you set the VACUUM retention, the more old files are kept on S3 (higher storage cost) but the further back you can time-travel. Production rule: 7 days retention (168 hours) balances time travel utility with storage cost. Set a CloudWatch alarm on S3 bucket size growth to detect when VACUUM is not running regularly.
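The retention rule can be expressed as a single comparison: a file is eligible for deletion only if it is no longer referenced and was removed before the cutoff. The sketch below models that in plain Python. It is a simplification of what VACUUM does, and `vacuum_candidates` and the sample timestamps are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def vacuum_candidates(removed_files: Dict[str, datetime], now: datetime,
                      retention_hours: int = 168) -> List[str]:
    """Return unreferenced files whose removal time is older than the retention window."""
    cutoff = now - timedelta(hours=retention_hours)
    return sorted(path for path, removed_at in removed_files.items() if removed_at < cutoff)


now = datetime(2024, 1, 20, 0, 0, 0)
removed_files = {
    "part-0001.parquet": datetime(2024, 1, 5, 2, 0, 0),    # removed 15 days ago
    "part-0002.parquet": datetime(2024, 1, 12, 2, 0, 0),   # removed 8 days ago
    "part-0003.parquet": datetime(2024, 1, 18, 2, 0, 0),   # removed 2 days ago
}

print(vacuum_candidates(removed_files, now))                      # 7-day retention
print(vacuum_candidates(removed_files, now, retention_hours=1))   # dev-only setting
```

With a one-hour retention every removed file becomes a candidate, which is exactly why that setting breaks time travel to any older version.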

☁️

Delta Lake on AWS — EMR, Glue, Athena Integration AWS INTEGRATION ▼

Delta Lake on EMR

On EMR, you install Delta Lake via the --packages option to spark-submit, or via a bootstrap action that pre-installs the JAR. EMR 6.9+ has native Delta support you can enable via cluster configuration.

bash — spark-submit with Delta on EMR

# Option 1: --packages (downloads the JAR from Maven when the job is submitted)
spark-submit \
  --packages io.delta:delta-core_2.12:2.4.0 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  s3://my-code/scripts/delta_pipeline.py

# Option 2: EMR 6.9+ native Delta support via Configurations

json — EMR cluster config for native Delta Lake support

[
  {
    "Classification": "delta-defaults",
    "Properties": {
      "delta.enabled": "true"
    }
  }
]

Delta Lake on Glue

In AWS Glue 4.0+, Delta Lake is a supported format. You add a Glue job parameter --datalake-formats delta and Glue automatically adds the Delta JAR. No manual dependency management needed.

python — Glue job with Delta Lake (Glue 4.0+)

# In Glue job DefaultArguments (a dict can hold only one "--conf" key,
# so chain the additional Spark settings inside its value):
# "--datalake-formats": "delta"
# "--conf": "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

# In your Glue PySpark script:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from delta.tables import DeltaTable

sc    = SparkContext()
glue  = GlueContext(sc)
spark = glue.spark_session

# Now use Delta normally
delta_table = DeltaTable.forPath(spark, "s3://my-data-lake/silver/orders/")
delta_table.history().show()

# Read the Delta table as a regular Spark DataFrame
orders_df = spark.read.format("delta").load("s3://my-data-lake/silver/orders/")
orders_df.show(5)

Querying Delta with Athena — Manifest File Pattern

Athena engine version 3 can read Delta tables registered in the Glue Data Catalog directly, but only for reads. On older engine versions, or when a table cannot be registered that way, Athena does not read Delta's _delta_log and instead needs a manifest file that lists the active Parquet files. Delta can generate this manifest. After generating it, you create an Athena external table pointing to the manifest. The manifest must be regenerated after every write, so this pattern adds operational overhead compared to Iceberg's native Athena support.

python — generate manifest + create Athena table for Delta

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "s3://my-data-lake/silver/orders/")

# ── Generate symlink manifest for Athena ─────────────────────────
delta_table.generate("symlink_format_manifest")
# Creates: s3://my-data-lake/silver/orders/_symlink_format_manifest/
#   manifest file listing all active Parquet file paths

# ── Create Athena external table pointing to manifest ─────────────
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS silver_athena.orders (
        order_id    STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (status STRING)   -- must match the Delta table's partition column
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION 's3://my-data-lake/silver/orders/_symlink_format_manifest/'
""")
# Run MSCK REPAIR TABLE to load partitions, then query from Athena

# ⚠️ Must re-run delta_table.generate("symlink_format_manifest")
# after EVERY write to keep the Athena table in sync.

☁️ Delta + Athena Recommendation

The manifest approach works but adds operational overhead. If Athena is a primary query interface, consider Iceberg instead of Delta — Athena has native Iceberg support with no manifest files needed. Use Delta when Databricks or EMR Spark is your primary compute and Athena is only occasionally used.

🧊

Apache Iceberg — The Cloud-Native Table Format APACHE ICEBERG ▼

Iceberg Architecture — Metadata, Manifests, Data Files

Iceberg's metadata structure is three-tiered: Table Metadata File (the entry point — lists all snapshots), Manifest Lists (one per snapshot — lists all manifest files for that snapshot), and Manifest Files (list the actual Parquet data files with per-file statistics). This hierarchy enables extremely fast metadata operations even on tables with billions of files.

ICEBERG METADATA STRUCTURE
─────────────────────────────────────────────────────
s3://lake/orders/metadata/
  v1.metadata.json                    ← Table Metadata (current snapshot pointer)
    └── current-snapshot-id: snap-12346
  snap-12346-manifest-list.avro       ← Manifest List for snapshot 12346
    ├── manifest-file-1.avro          ← Manifest 1: lists data files in partition A
    │     part-0001.parquet (with stats: min/max order_id per file)
    │     part-0002.parquet
    └── manifest-file-2.avro          ← Manifest 2: lists data files in partition B
          part-0003.parquet
          part-0004.parquet

s3://lake/orders/data/
  year=2024/month=01/part-0001.parquet
  year=2024/month=01/part-0002.parquet
  year=2024/month=02/part-0003.parquet

WHY THIS IS FASTER THAN HIVE:
  Hive:    list all S3 files to find partitions → slow for large tables
  Iceberg: read manifest list → instantly know all partitions and file counts
  → Partition discovery takes milliseconds not minutes
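The three-tier lookup can be traced in a few lines of plain Python. The sketch below is a toy model using dictionaries in place of the real JSON and Avro files; the names `plan_scan`, `partition_min`, and `partition_max` are assumptions for illustration, not Iceberg's actual field names. It shows how a query planner goes from table metadata to manifest list to manifests, and prunes whole manifests using partition summaries before it ever looks at a data file.

```python
# Toy stand-ins for Iceberg's metadata files (illustration only).
table_metadata = {
    "current-snapshot-id": 12346,
    "snapshots": {12345: "manifest-list-12345", 12346: "manifest-list-12346"},
}

manifest_lists = {
    "manifest-list-12346": [
        {"manifest": "manifest-file-1", "partition_min": "2024-01", "partition_max": "2024-01"},
        {"manifest": "manifest-file-2", "partition_min": "2024-02", "partition_max": "2024-02"},
    ],
}

manifests = {
    "manifest-file-1": ["part-0001.parquet", "part-0002.parquet"],
    "manifest-file-2": ["part-0003.parquet", "part-0004.parquet"],
}


def plan_scan(month: str) -> list:
    """Walk metadata -> manifest list -> manifests, skipping non-matching manifests."""
    snapshot_id = table_metadata["current-snapshot-id"]
    manifest_list = manifest_lists[table_metadata["snapshots"][snapshot_id]]
    data_files = []
    for entry in manifest_list:
        if entry["partition_min"] <= month <= entry["partition_max"]:
            data_files.extend(manifests[entry["manifest"]])  # only now open the manifest
    return data_files


print(plan_scan("2024-02"))
```

No S3 LIST call appears anywhere in this walk, which is the key difference from Hive-style partition discovery.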

Creating an Iceberg Table with Glue Catalog

On AWS, you use Iceberg with the Glue Data Catalog as the metadata store. The Glue Catalog stores the pointer to Iceberg's latest metadata file. Athena, Glue Spark, and EMR Spark all look up the table from Glue — so one table definition works across all three tools.

python — Iceberg with Glue Catalog on EMR/Glue (SparkSession config)

from pyspark.sql import SparkSession

# ── SparkSession config for Iceberg + Glue Catalog ────────────────
spark = SparkSession.builder \
    .appName("IcebergDemo") \
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.glue_catalog",
            "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.warehouse",
            "s3://my-data-lake/") \
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO") \
    .getOrCreate()

# ── Create an Iceberg table via SQL ───────────────────────────────
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.silver.orders (
        order_id    STRING  NOT NULL,
        customer_id STRING,
        amount      DOUBLE,
        status      STRING,
        created_at  TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(created_at))   -- hidden partitioning!
    TBLPROPERTIES (
        'write.format.default'       = 'parquet',
        'write.parquet.compression-codec' = 'snappy',
        'format-version'             = '2'           -- enables row-level deletes
    )
""")
print("✅ Iceberg table created in Glue Catalog")

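# ── Write a DataFrame to the Iceberg table ───────────────────────
# The DataFrame must supply every table column (including created_at), and
# order_id is declared non-nullable to match the NOT NULL column.
from datetime import datetime
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

schema = StructType([
    StructField("order_id",    StringType(),    nullable=False),
    StructField("customer_id", StringType(),    nullable=True),
    StructField("amount",      DoubleType(),    nullable=True),
    StructField("status",      StringType(),    nullable=True),
    StructField("created_at",  TimestampType(), nullable=True),
])
data = [
    ("O-001", "C-101", 99.99,  "PLACED",  datetime(2024, 1, 15, 8, 0, 0)),
    ("O-002", "C-102", 149.50, "SHIPPED", datetime(2024, 1, 15, 9, 30, 0)),
]
df = spark.createDataFrame(data, schema)
df.writeTo("glue_catalog.silver.orders").append()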

# ── Query the table ───────────────────────────────────────────────
spark.table("glue_catalog.silver.orders").show()

Hidden Partitioning — Iceberg's Killer Feature

In Hive-style partitioning (and Delta), you must include the partition column value explicitly: partitionBy("year", "month", "day") and then manually add year/month/day columns. Iceberg's hidden partitioning lets you partition on transforms of a column — days(created_at), months(created_at), bucket(16, customer_id) — without adding extra columns to your data. Iceberg computes the partition value automatically from the existing column.

sql — Iceberg hidden partitioning transforms

-- Hidden partitioning on a timestamp column
CREATE TABLE glue_catalog.silver.events (
    event_id   STRING,
    user_id    STRING,
    event_type STRING,
    occurred_at TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(occurred_at));    -- automatically partitions by date
-- NO year/month/day columns in the data! Iceberg handles it internally.

-- Query: Iceberg converts the filter on occurred_at into a partition filter
SELECT * FROM glue_catalog.silver.events
WHERE occurred_at >= TIMESTAMP '2024-01-15 00:00:00';
-- Iceberg automatically prunes partitions before 2024-01-15. No manual filter!

-- Other transform options:
-- PARTITIONED BY (hours(occurred_at))           -- hourly partitions
-- PARTITIONED BY (months(occurred_at))          -- monthly partitions
-- PARTITIONED BY (years(occurred_at))           -- yearly partitions
-- PARTITIONED BY (bucket(16, customer_id))      -- hash bucket by customer_id
-- PARTITIONED BY (truncate(10, product_code))   -- string prefix partitioning
-- PARTITIONED BY (days(created_at), bucket(8, customer_id))  -- multi-level

🔑 Why Hidden Partitioning Changes Everything

With Hive partitioning: adding year/month/day columns to every row is a schema burden, and if you want to change your partition granularity (from month to day), you must rewrite every file. With Iceberg hidden partitioning: your data schema stays clean, and you can evolve the partition scheme without rewriting data — just add a new partition spec and future writes use it while old files keep the old layout.
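Partition transforms are small pure functions of the source column. The sketch below reimplements the time-based ones in plain Python, following the definitions in the Iceberg table spec (days, hours, and months counted from the Unix epoch). The function names and the `prune_day_partitions` helper are assumptions for illustration; the bucket transform is left out because it depends on a specific hash function.

```python
from datetime import datetime, timezone
from typing import List

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def days_transform(ts: datetime) -> int:
    """days(ts): whole days since 1970-01-01."""
    return (ts - EPOCH).days


def hours_transform(ts: datetime) -> int:
    """hours(ts): whole hours since 1970-01-01 00:00."""
    return int((ts - EPOCH).total_seconds() // 3600)


def months_transform(ts: datetime) -> int:
    """months(ts): whole months since 1970-01."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


def prune_day_partitions(partitions: List[int], lower_bound: datetime) -> List[int]:
    """Keep the day partitions that can hold rows with occurred_at >= lower_bound."""
    return [p for p in partitions if p >= days_transform(lower_bound)]


occurred_at = datetime(2024, 1, 15, 8, 30, tzinfo=timezone.utc)
print(days_transform(occurred_at), hours_transform(occurred_at), months_transform(occurred_at))

stored_partitions = [days_transform(datetime(2024, 1, d, tzinfo=timezone.utc)) for d in (13, 14, 15, 16)]
print(prune_day_partitions(stored_partitions, datetime(2024, 1, 15, tzinfo=timezone.utc)))
```

Because the partition value is derived from occurred_at, a filter on occurred_at alone is enough for pruning, and no extra year/month/day column is needed in the data.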

Iceberg Time Travel and Snapshots

Iceberg versions data through snapshots. Each write creates a new snapshot with a unique snapshot ID and timestamp. You can query any snapshot by ID or timestamp — just like Delta time travel.

python / sql — Iceberg time travel and snapshot management

# ── View all snapshots ────────────────────────────────────────────
spark.sql("""
    SELECT snapshot_id, committed_at, operation, summary
    FROM glue_catalog.silver.orders.snapshots
""").show()

# ── Time travel by snapshot ID ────────────────────────────────────
spark.sql("""
    SELECT * FROM glue_catalog.silver.orders
    VERSION AS OF 12345678901234567
""").show()

# ── Time travel by timestamp ──────────────────────────────────────
spark.sql("""
    SELECT * FROM glue_catalog.silver.orders
    TIMESTAMP AS OF '2024-01-14 23:59:59'
""").show()

# ── Iceberg DataFrame API time travel ────────────────────────────
df_old = spark.read \
    .option("snapshot-id", "12345678901234567") \
    .format("iceberg") \
    .load("glue_catalog.silver.orders")

# ── Roll back to a previous snapshot ─────────────────────────────
spark.sql("""
    CALL glue_catalog.system.rollback_to_snapshot(
        'silver.orders', 12345678901234567
    )
""")

# ── Expire old snapshots (cleanup) ───────────────────────────────
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table       => 'silver.orders',
        older_than  => TIMESTAMP '2024-01-01 00:00:00',
        retain_last => 5
    )
""")

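The interaction between older_than and retain_last in expire_snapshots is easy to misread, so here is a plain-Python sketch of the selection rule as described above: snapshots older than the cutoff are expired, except that the most recent retain_last snapshots are always kept. `snapshots_to_expire` is a hypothetical helper for illustration and does not call Iceberg.

```python
from datetime import datetime
from typing import List, Tuple

Snapshot = Tuple[int, datetime]  # (snapshot_id, committed_at)


def snapshots_to_expire(snapshots: List[Snapshot], older_than: datetime,
                        retain_last: int = 1) -> List[int]:
    """Pick snapshots older than the cutoff, always protecting the newest retain_last."""
    ordered = sorted(snapshots, key=lambda s: s[1])
    protected = {sid for sid, _ in ordered[-retain_last:]} if retain_last > 0 else set()
    return [sid for sid, committed_at in ordered
            if committed_at < older_than and sid not in protected]


# Seven snapshots: one commit per day from 25 to 31 December 2023.
snapshots = [(100 + day, datetime(2023, 12, day)) for day in range(25, 32)]

# Every snapshot is older than the cutoff, but the newest five survive.
print(snapshots_to_expire(snapshots, datetime(2024, 1, 1), retain_last=5))
```

Time travel to an expired snapshot is no longer possible, so choose the cutoff with the same care as a Delta VACUUM retention window.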
Iceberg MERGE INTO (Full DML: INSERT, UPDATE, DELETE)

Iceberg supports MERGE INTO, UPDATE, and DELETE via standard SQL. This is especially powerful with Athena — you can run UPDATE and DELETE directly from Athena without Spark.

sql — Iceberg MERGE, UPDATE, DELETE (works in Athena and Spark)

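-- MERGE INTO (upsert)
-- Names below use the Spark catalog prefix; in Athena, reference the tables
-- as silver.orders and staging.orders_updates (database.table).
MERGE INTO glue_catalog.silver.orders AS t
USING glue_catalog.staging.orders_updates AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN
    UPDATE SET status = s.status, amount = s.amount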
WHEN NOT MATCHED THEN
    INSERT (order_id, customer_id, amount, status, created_at)
    VALUES (s.order_id, s.customer_id, s.amount, s.status, s.created_at);

-- UPDATE specific rows
UPDATE glue_catalog.silver.orders
SET status = 'DELIVERED'
WHERE order_id = 'O-001';

-- DELETE rows (GDPR erasure)
DELETE FROM glue_catalog.silver.orders
WHERE customer_id = 'C-101';

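-- Athena runs these statements natively, so no Spark cluster is required for Iceberg.
-- In Athena, drop the Spark catalog prefix: use silver.orders, not glue_catalog.silver.orders.
-- This is Iceberg's biggest advantage over Delta on AWS.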

✅ Iceberg + Athena = GDPR Compliance Without Spark

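A GDPR erasure request comes in: delete all data for customer C-101. With Iceberg on Athena, this is a one-line SQL statement from the Athena console, with no Spark cluster, no EMR job, and no boto3 code. The delete is an ACID transaction and is reflected in all subsequent queries. The deleted rows still exist in older snapshots and data files until those snapshots are expired and the files are cleaned up (expire_snapshots in Spark, or OPTIMIZE followed by VACUUM in Athena), so schedule that maintenance to complete the physical erasure.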

Iceberg Compaction — rewrite_data_files

Iceberg's compaction is done via a stored procedure: rewrite_data_files. It consolidates small files into larger ones and can also re-sort the data for better data skipping. Run it on a schedule after heavy streaming or incremental loads.
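To make the idea of consolidating small files concrete, here is a minimal pure-Python sketch of bin-packing. It is a simplified illustration, not Iceberg's actual planner: the function name plan_bin_pack and its grouping rules are assumptions, and only the target size and minimum file count mirror the options used in this section.

```python
MB = 1024 * 1024


def plan_bin_pack(file_sizes, target_bytes=128 * MB, min_input_files=5):
    """Group small files into rewrite groups of roughly target_bytes each."""
    groups, current, current_size = [], [], 0
    # Files already at or above the target size are left alone
    for size in sorted(s for s in file_sizes if s < target_bytes):
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    # Skip groups with too few files to be worth rewriting
    return [g for g in groups if len(g) >= min_input_files]


# 40 small 8 MB files from streaming writes plus one healthy 200 MB file
sizes = [8 * MB] * 40 + [200 * MB]
groups = plan_bin_pack(sizes)
print([len(g) for g in groups])
```

Each group becomes one rewritten file close to the target size, which is why compaction after heavy streaming loads can cut the file count sharply.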

python / sql — Iceberg compaction and maintenance

# ── Compact small files → 128 MB target size ─────────────────────
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
        table   => 'silver.orders',
        options => map(
            'target-file-size-bytes', '134217728',
            'min-input-files',        '5',
            'max-concurrent-file-group-rewrites', '10'
        )
    )
""")

# ── Sort while compacting (like ZORDER in Delta) ──────────────────
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
        table      => 'silver.orders',
        strategy   => 'sort',
        sort_order => 'order_id ASC NULLS LAST, customer_id ASC',
        options    => map('target-file-size-bytes', '134217728')
    )
""")

# ── Remove orphan files not referenced by any snapshot ───────────
# The Spark procedure is named remove_orphan_files
spark.sql("""
    CALL glue_catalog.system.remove_orphan_files(
        table => 'silver.orders',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")

# ── Rewrite position-delete files as merge-on-read optimization ──
spark.sql("""
    CALL glue_catalog.system.rewrite_position_delete_files(
        table => 'silver.orders'
    )
""")

Copy-on-Write (CoW) vs Merge-on-Read (MoR)

Iceberg supports two strategies for row-level updates and deletes. Copy-on-Write (CoW) rewrites the affected data files immediately on every update/delete — reads are fast (no merging needed), but writes are expensive. Merge-on-Read (MoR) writes a small delete/update file rather than rewriting the whole data file — writes are fast, but reads need to merge the data file with the delete file. CoW is better for read-heavy tables; MoR is better for CDC-heavy tables with frequent small updates.

Strategy	Write Cost	Read Cost	Best For
Copy-on-Write (default)	Higher — rewrites files	Lower — no merge needed	Gold layer, dashboard tables, read-heavy
Merge-on-Read	Lower — small delete files	Higher — merge at read time	CDC landing, Silver layer with frequent updates
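The trade-off in the table can be seen in a toy model. The sketch below is plain Python and not the Iceberg API: data files are lists of rows, and the function names are hypothetical. It only illustrates that CoW pays at write time while MoR pays at read time.

```python
def delete_copy_on_write(files, order_id):
    """Rewrite every data file that contains the row; return (new_files, rows_written)."""
    new_files, rows_written = {}, 0
    for name, rows in files.items():
        if any(r[0] == order_id for r in rows):
            kept = [r for r in rows if r[0] != order_id]
            new_files[name + "-rewritten"] = kept
            rows_written += len(kept)      # the whole file is written again
        else:
            new_files[name] = rows         # untouched files are reused as-is
    return new_files, rows_written


def delete_merge_on_read(delete_log, order_id):
    """Record the delete in a small side file; data files stay untouched."""
    return delete_log | {order_id}, 1      # one small delete record written


def read_merge_on_read(files, delete_log):
    """Readers must filter every data file against the delete log."""
    return [r for rows in files.values() for r in rows if r[0] not in delete_log]


files = {
    "data-001": [("O-001", "NEW"), ("O-002", "NEW"), ("O-003", "SHIPPED")],
    "data-002": [("O-004", "NEW"), ("O-005", "NEW")],
}

cow_files, cow_written = delete_copy_on_write(files, "O-002")
cow_rows = [r for rows in cow_files.values() for r in rows]

mor_log, mor_written = delete_merge_on_read(set(), "O-002")
mor_rows = read_merge_on_read(files, mor_log)

print("rows written (CoW, MoR):", cow_written, mor_written)
print("same query result:", sorted(cow_rows) == sorted(mor_rows))
```

Both strategies return the same rows; they differ only in where the cost lands, which is why frequent small updates favor MoR and read-heavy tables favor CoW.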

sql — setting CoW vs MoR on an Iceberg table

-- Default: Copy-on-Write for all operations
CREATE TABLE glue_catalog.gold.orders_summary (...)
USING iceberg
TBLPROPERTIES (
    'write.delete.mode' = 'copy-on-write',
    'write.update.mode' = 'copy-on-write',
    'write.merge.mode'  = 'copy-on-write'
);

-- Merge-on-Read for CDC landing tables (fast writes)
CREATE TABLE glue_catalog.silver.orders_cdc_landing (...)
USING iceberg
TBLPROPERTIES (
    'write.delete.mode' = 'merge-on-read',
    'write.update.mode' = 'merge-on-read',
    'write.merge.mode'  = 'merge-on-read',  -- MERGE INTO otherwise stays copy-on-write
    'format-version'    = '2'    -- required for MoR
);

⚖️

Delta vs Iceberg Decision Guide for AWS Data Engineers

Decision Framework

DELTA vs ICEBERG: DECISION TREE

Q1: Is Athena a primary write interface for your team?
    YES → Use Iceberg (Athena can run MERGE, UPDATE, DELETE natively)
    NO  → Either works; consider Q2

Q2: Is Databricks your primary compute?
    YES → Use Delta (native Databricks format, most features)
    NO  → Either works; consider Q3

Q3: Do you need complex partition evolution (change partition columns later)?
    YES → Use Iceberg (hidden partitioning + partition spec evolution)
    NO  → Either works; consider Q4

Q4: Is EMR or Glue Spark your primary compute?
    YES → Either works; Delta slightly easier setup on Glue 4.0
    NO (all Athena) → Use Iceberg

Q5: Do you need sub-second CDC updates with MoR optimization?
    YES → Use Iceberg format-version=2 with MoR mode
    NO  → Either works

SUMMARY:
    Athena-heavy platform    → Iceberg
    Databricks platform      → Delta
    EMR/Glue-only platform   → Either (Delta slightly simpler)
    Multi-engine (all 3)     → Iceberg (widest native support)
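The decision tree can also be written down as a small function, which is handy for documenting a platform choice in code review. This is only a sketch of the five questions in this guide; the function and parameter names are hypothetical.

```python
def recommend_table_format(athena_writes: bool,
                           databricks_primary: bool,
                           partition_evolution: bool,
                           spark_on_emr_or_glue: bool,
                           fast_cdc_with_mor: bool) -> str:
    """Walk Q1..Q5 in order and return 'iceberg', 'delta', or 'either'."""
    if athena_writes:              # Q1: Athena runs MERGE/UPDATE/DELETE on Iceberg
        return "iceberg"
    if databricks_primary:         # Q2: Delta is the native Databricks format
        return "delta"
    if partition_evolution:        # Q3: partition spec evolution
        return "iceberg"
    if not spark_on_emr_or_glue:   # Q4: an all-Athena platform
        return "iceberg"
    if fast_cdc_with_mor:          # Q5: format-version=2 with MoR
        return "iceberg"
    return "either"


print(recommend_table_format(False, False, False, True, False))
```

The order of the checks matters: an earlier question overrides the later ones, exactly as in the tree.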

Feature Comparison Summary

Feature	Delta Lake	Apache Iceberg
ACID Transactions	✅ Transaction log	✅ Snapshot-based
Time Travel	✅ version/timestamp	✅ snapshot-based
MERGE INTO	✅ Full MERGE API	✅ MERGE/UPDATE/DELETE
Athena Write (MERGE)	❌ Read-only (native reads on Athena engine v3; manifests on older setups)	✅ Native MERGE in Athena
Partition Evolution	Limited	✅ Full partition spec evolution
Hidden Partitioning	❌ Manual year/month/day columns	✅ days()/months()/bucket()
Small File Compaction	✅ OPTIMIZE	✅ rewrite_data_files
Data Skipping	✅ Column stats + ZORDER	✅ Column stats per manifest
Merge-on-Read	⚠️ Deletion vectors only (newer Delta releases)	✅ format-version=2
Glue 4.0 Native	✅ --datalake-formats delta	✅ --datalake-formats iceberg

Production Recommendation — AWS Standard Stack

🏗️

Bronze Layer

Either format works. Delta is fine for append-only CDC landing. Use Iceberg MoR if you need fast incremental updates without rewriting data files on every write, and schedule compaction so reads stay fast.

⚙️

Silver Layer

Iceberg recommended — the MERGE INTO from CDC is clean, Athena can directly read/query, and hidden partitioning keeps the schema clean.

🌟

Gold Layer

Iceberg with CoW — Gold tables are read-heavy. Fast reads matter more than fast writes. Athena and Redshift Spectrum can directly query.

📋 29.26 Summary — Delta Lake & Apache Iceberg

Raw Parquet on S3 lacks ACID, UPDATE/DELETE, schema evolution safety, and time travel — open table formats solve all four.
Delta Lake uses a _delta_log/ transaction log. ACID via atomic commits. Time travel by version or timestamp. MERGE INTO for upserts and CDC. OPTIMIZE + ZORDER for query performance. VACUUM for cost control.
Apache Iceberg uses a three-tier metadata structure (metadata → manifest list → manifest files). Hidden partitioning eliminates manual year/month/day columns. Native Athena MERGE/UPDATE/DELETE. Copy-on-Write (read-optimized) or Merge-on-Read (write-optimized) strategies.
On AWS: use Iceberg when Athena is a primary interface (native DML support). Use Delta when Databricks or Glue ETL is the primary compute and Athena is secondary.
OPTIMIZE/ZORDER (Delta) and rewrite_data_files (Iceberg) must run on a schedule — streaming and CDC create many small files that degrade query performance without compaction.
VACUUM (Delta) and expire_snapshots (Iceberg) reclaim S3 storage from old versions — run at minimum weekly with 7-day retention to balance time travel and cost.

29.27 — AWS DATA ENGINEERING

Data Modeling for Data Engineers

Dimensional modeling, fact and dimension table design, Slowly Changing Dimensions (SCD), Data Vault, and dbt — the techniques used to turn raw lake data into a well-structured, query-friendly Gold layer.

🗃️

29.27 Data Modeling for Data Engineers

Dimensional Modeling — The Foundation

Dimensional modeling is the most widely used technique in data warehousing. It organizes data into two types of tables: fact tables (what happened — metrics, events, measurements) and dimension tables (context — who, what, where, when). It was popularized by Ralph Kimball and is the backbone of most Gold-layer designs in a lakehouse.

🧠 Analogy

Think of a sales receipt. The fact is the line-item (product sold, quantity, price). The dimensions are the customer, the store, the product, and the date. You always measure the fact through the lens of dimensions.

Fact Tables — Transaction, Snapshot, Accumulating Snapshot, Factless

Fact tables hold measurable numeric data (revenue, quantity, duration) and foreign keys to dimension tables. The grain of a fact table is the most important design decision — it defines exactly what one row represents.

Type	What One Row Represents	Example
Transaction Fact	One discrete event	One order line item on a specific date
Periodic Snapshot	State at a fixed interval	Account balance at end of each month
Accumulating Snapshot	Progress of a process through stages	Order lifecycle: placed → shipped → delivered
Factless Fact	An event with no numeric measure	Student enrolled in a course (just the relationship)

sql — transaction fact table (grain = one order line)

CREATE TABLE fact_order_lines (
    order_line_sk     BIGINT,       -- surrogate key
    order_date_sk     INT,          -- FK → dim_date
    customer_sk       INT,          -- FK → dim_customer
    product_sk        INT,          -- FK → dim_product
    store_sk          INT,          -- FK → dim_store
    quantity          INT,
    unit_price        DECIMAL(10,2),
    discount_amount   DECIMAL(10,2),
    net_revenue       DECIMAL(10,2),
    load_dts          TIMESTAMP     -- audit column
);

python — writing fact table from PySpark

from pyspark.sql import functions as F

# Read silver layer (cleaned orders)
orders = spark.table("silver.orders")
order_items = spark.table("silver.order_items")

# Join and compute measures.
# Assumes the silver tables already carry the dimension surrogate keys
# (customer_sk, product_sk, store_sk); if not, join to the dimensions first.
fact_df = order_items.join(orders, "order_id") \
    .withColumn("net_revenue", F.col("unit_price") * F.col("quantity") - F.col("discount_amount")) \
    .withColumn("order_date_sk", F.date_format("order_date", "yyyyMMdd").cast("int")) \
    .withColumn("load_dts", F.current_timestamp()) \
    .select(
        F.col("order_line_id").alias("order_line_sk"),  # rename to match the DDL column
        "order_date_sk", "customer_sk",
        "product_sk", "store_sk", "quantity",
        "unit_price", "discount_amount", "net_revenue", "load_dts"
    )

# Write to Gold Delta table
fact_df.write.format("delta").mode("append") \
    .saveAsTable("gold.fact_order_lines")

Dimension Tables — Conformed, Degenerate, Junk

Dimension tables provide the descriptive context for facts. A conformed dimension is shared across multiple fact tables (e.g., dim_date used in sales, HR, and finance facts). A degenerate dimension is a dimension with no attributes beyond its key — it lives on the fact table itself (e.g., order number). A junk dimension bundles several low-cardinality flags and indicators into one table to avoid cluttering the fact table (e.g., is_rush_order, is_gift_wrapped, is_loyalty_member combined).

sql — conformed dim_date (shared across all facts)

CREATE TABLE dim_date (
    date_sk       INT PRIMARY KEY,  -- 20240115
    full_date     DATE,
    year          INT,
    quarter       INT,
    month         INT,
    month_name    STRING,
    week_of_year  INT,
    day_of_week   INT,
    day_name      STRING,
    is_weekend    BOOLEAN,
    is_holiday    BOOLEAN,
    fiscal_year   INT,
    fiscal_quarter INT
);

python — generating dim_date with PySpark

from pyspark.sql import functions as F

# Generate a date sequence from 2020-01-01 to 2030-12-31
date_range = spark.sql("SELECT sequence(to_date('2020-01-01'), to_date('2030-12-31'), interval 1 day) as date_array") \
    .select(F.explode("date_array").alias("full_date"))

dim_date = date_range.select(
    F.date_format("full_date", "yyyyMMdd").cast("int").alias("date_sk"),
    F.col("full_date"),
    F.year("full_date").alias("year"),
    F.quarter("full_date").alias("quarter"),
    F.month("full_date").alias("month"),
    F.date_format("full_date", "MMMM").alias("month_name"),
    F.weekofyear("full_date").alias("week_of_year"),
    F.dayofweek("full_date").alias("day_of_week"),
    F.date_format("full_date", "EEEE").alias("day_name"),
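    (F.dayofweek("full_date").isin([1, 7])).alias("is_weekend")  # dayofweek: 1 = Sunday, 7 = Saturday
    # is_holiday, fiscal_year, and fiscal_quarter from the DDL are not derived here;
    # they depend on your holiday list and fiscal calendar
)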

dim_date.write.format("delta").mode("overwrite") \
    .saveAsTable("gold.dim_date")

📘 Junk Dimension Example

Instead of adding is_rush_order, is_gift_wrapped, is_loyalty_member directly to the fact table, you create dim_order_flags with all combinations (2³ = 8 rows max), then store only the order_flag_sk on the fact. Keeps fact tables lean.
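A junk dimension like this is small enough to generate up front. The sketch below builds all eight flag combinations in plain Python and looks up the surrogate key for one order. The sample order and the key numbering are illustrative assumptions; the flag names come from the example above.

```python
from itertools import product

FLAGS = ["is_rush_order", "is_gift_wrapped", "is_loyalty_member"]

# One row per combination of flag values: 2**3 = 8 rows
dim_order_flags = [
    {"order_flag_sk": sk, **dict(zip(FLAGS, combo))}
    for sk, combo in enumerate(product([False, True], repeat=len(FLAGS)), start=1)
]

# Lookup used while loading the fact table: flag combination -> surrogate key
flag_sk_lookup = {
    tuple(row[f] for f in FLAGS): row["order_flag_sk"] for row in dim_order_flags
}

# Hypothetical incoming order
order = {"order_id": "O-001", "is_rush_order": True,
         "is_gift_wrapped": False, "is_loyalty_member": True}
order_flag_sk = flag_sk_lookup[tuple(order[f] for f in FLAGS)]

print(len(dim_order_flags), order_flag_sk)
```

The fact row then stores only order_flag_sk instead of three boolean columns.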

Star Schema vs Snowflake Schema

A star schema has the fact table at the center connected directly to denormalized dimension tables — one join to get all context. A snowflake schema normalizes dimensions further into sub-dimensions (e.g., dim_product → dim_category → dim_department). Star is preferred in analytics because fewer joins = faster queries and simpler SQL; snowflake is used when storage is a concern or dimensions are extremely wide.

Aspect	Star Schema	Snowflake Schema
Dimension normalization	Denormalized (wide tables)	Normalized (sub-dimensions)
Query complexity	Simple — fewer joins	More joins needed
Storage	Slightly more (duplicated values)	Less storage
Preferred for	OLAP, BI tools, Redshift/Snowflake queries	Normalized source systems

Grain Definition — Most Important Modeling Decision

The grain is the precise definition of what one row in a fact table means. You must declare it before writing a single line of SQL. Get it wrong and your queries will double-count or miss records silently. Common grains: one row per order line, one row per daily account snapshot, one row per user session event.

⚠️ Anti-Pattern

Mixing grains in a single fact table — e.g., having some rows at order level and others at line-item level — leads to wildly incorrect aggregations. Always pick one grain and stick to it.
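A tiny example shows how mixed grains inflate totals. The rows below are made-up sample data: two line-item rows for order O-001 plus an order-level row that repeats the same revenue.

```python
# Line-item grain: one row per product in an order
line_rows = [
    {"order_id": "O-001", "grain": "line", "net_revenue": 30.0},
    {"order_id": "O-001", "grain": "line", "net_revenue": 20.0},
    {"order_id": "O-002", "grain": "line", "net_revenue": 15.0},
]

# Order grain: one row per order, repeating what the line rows already say
order_rows = [
    {"order_id": "O-001", "grain": "order", "net_revenue": 50.0},
]

mixed_total = sum(r["net_revenue"] for r in line_rows + order_rows)
correct_total = sum(r["net_revenue"] for r in line_rows)

print("mixed grain total:", mixed_total)
print("single grain total:", correct_total)
```

No error is raised in the mixed case. The query simply returns a larger number, which is why this mistake tends to go unnoticed.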

sql — grain declaration in comments (best practice)

-- GRAIN: One row per order line item (one product in one order on one date)
-- Primary Key: (order_id, product_sk, order_date_sk)
-- Each row = one product sold in one order on one date
-- To report per product, GROUP BY product_sk; never load order-level rows into this table
CREATE TABLE gold.fact_order_lines (
    order_id       STRING,
    product_sk     INT,
    order_date_sk  INT,
    quantity       INT,
    net_revenue    DECIMAL(10,2)
);

Slowly Changing Dimensions (SCD) — Type 1, 2, 3

Dimension attributes change over time (a customer moves cities, a product changes category). SCD strategies define how to handle those changes in the dimension table.

Type	Strategy	History Kept?	Use When
SCD Type 1	Overwrite the old value	No	Corrections, typos, no historical analysis needed
SCD Type 2	Add a new row, keep old row	Yes (full)	Track history — "what was the customer's city when they ordered?"
SCD Type 3	Add a "previous value" column	Partial (1 prior)	Only need to know one previous value (e.g., last city)

sql — SCD Type 2 table design

CREATE TABLE dim_customer (
    customer_sk         BIGINT,       -- surrogate key (auto-increment or hash)
    customer_id         STRING,       -- natural / business key
    customer_name       STRING,
    city                STRING,
    state               STRING,
    effective_start_date DATE,        -- when this version became active
    effective_end_date   DATE,        -- NULL while current (some teams use 9999-12-31 instead)
    is_current           BOOLEAN,     -- TRUE for the active row
    load_dts             TIMESTAMP
);
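To see how the effective-date columns answer a question like "what was the customer's city when they ordered?", here is a minimal pure-Python lookup. The two sample rows and the half-open interval convention (start inclusive, end exclusive) are assumptions for illustration.

```python
from datetime import date

# Hypothetical SCD Type 2 history for one customer
dim_customer = [
    {"customer_sk": 1, "customer_id": "C-101", "city": "Austin",
     "effective_start_date": date(2023, 1, 1),
     "effective_end_date": date(2024, 3, 1), "is_current": False},
    {"customer_sk": 2, "customer_id": "C-101", "city": "Denver",
     "effective_start_date": date(2024, 3, 1),
     "effective_end_date": None, "is_current": True},
]


def version_as_of(rows, customer_id, as_of):
    """Return the dimension row that was active on the given date, or None."""
    for row in rows:
        end = row["effective_end_date"]
        if (row["customer_id"] == customer_id
                and row["effective_start_date"] <= as_of
                and (end is None or as_of < end)):
            return row
    return None


print(version_as_of(dim_customer, "C-101", date(2023, 6, 15))["city"])
print(version_as_of(dim_customer, "C-101", date(2024, 3, 1))["city"])
```

In a fact load, the same date-range condition is what resolves customer_id plus order date to the right customer_sk.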

python — SCD Type 2 MERGE with PySpark + Delta

from delta.tables import DeltaTable
from pyspark.sql import functions as F

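dim_table = DeltaTable.forName(spark, "gold.dim_customer")

# Current dimension rows, plus a marker column to detect brand-new customers
current = dim_table.toDF().filter("is_current = true").select(
    "customer_id",
    F.col("city").alias("old_city"),
    F.col("state").alias("old_state"),
    F.lit(True).alias("in_dim"),
)

# Keep only new customers and customers whose tracked attributes changed.
# Appending every incoming row would duplicate unchanged customers.
# eqNullSafe treats NULL = NULL as equal, unlike != which returns NULL.
# Assumes at most one row per customer_id in silver.customer_updates.
updates = spark.table("silver.customer_updates") \
    .join(current, "customer_id", "left") \
    .filter(
        F.col("in_dim").isNull()
        | ~F.col("city").eqNullSafe(F.col("old_city"))
        | ~F.col("state").eqNullSafe(F.col("old_state"))
    ) \
    .drop("old_city", "old_state", "in_dim") \
    .withColumn("effective_start_date", F.current_date()) \
    .withColumn("effective_end_date", F.lit(None).cast("date")) \
    .withColumn("is_current", F.lit(True)) \
    .withColumn("load_dts", F.current_timestamp()) \
    .withColumn("customer_sk",  # one option for a hash-based surrogate key
                F.xxhash64("customer_id", "effective_start_date"))

# Step 1: expire the current row of each changed customer
dim_table.alias("dim").merge(
    updates.alias("src"),
    "dim.customer_id = src.customer_id AND dim.is_current = true"
).whenMatchedUpdate(set={
    "is_current": "false",
    "effective_end_date": "current_date()"
}).execute()

# Step 2: insert the new versions (changed and brand-new customers only)
updates.write.format("delta").mode("append") \
    .saveAsTable("gold.dim_customer")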

python — SCD Type 1 (simple overwrite with MERGE)

from delta.tables import DeltaTable

updates = spark.table("silver.customer_updates")
dim_table = DeltaTable.forName(spark, "gold.dim_customer_scd1")

# Just overwrite the attributes — no history
dim_table.alias("dim").merge(
    updates.alias("src"),
    "dim.customer_id = src.customer_id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

sql — SCD Type 3 table design (one prior value)

CREATE TABLE dim_customer_scd3 (
    customer_sk       BIGINT,
    customer_id       STRING,
    current_city      STRING,
    previous_city     STRING,   -- only one prior value stored
    city_changed_date DATE
);

Data Vault (Awareness Level) — Hubs, Links, Satellites

Data Vault is a modeling methodology designed for enterprise raw vaults where auditability, scalability, and parallel loading matter more than query simplicity. It separates data into three entity types.

Entity	Purpose	Example
Hub	Stores only the unique business keys	`hub_customer` — customer_id only
Link	Stores relationships between hubs	`link_order_customer` — order_id + customer_id
Satellite	Stores descriptive attributes + history	`sat_customer_details` — name, city, updated_at

📌 When Companies Adopt Data Vault

Data Vault shines in multi-source enterprise environments where sources change frequently, auditability is required, and you want to load in parallel without dependencies. It is NOT beginner-friendly for query writing — it's typically used for the Raw Vault layer, and a dimensional mart is built on top for analysts.

dbt for Transformation

dbt (data build tool) is the industry-standard tool for writing, testing, and documenting SQL-based transformations in your warehouse or lakehouse. It replaces ad-hoc SQL scripts with version-controlled, tested, documented models. A model is just a .sql file that defines a SELECT — dbt materializes it as a table or view.

sql — dbt model: stg_orders.sql (staging layer)

-- models/staging/stg_orders.sql
-- Materialization: view (lightweight, always fresh)
{{ config(materialized='view') }}

SELECT
    order_id,
    customer_id,
    order_date::date                         AS order_date,
    total_amount::decimal(10,2)              AS total_amount,
    status,
    CURRENT_TIMESTAMP                        AS _loaded_at
FROM {{ source('raw', 'orders') }}
WHERE order_id IS NOT NULL

sql — dbt incremental model: fct_orders.sql

-- models/marts/fct_orders.sql
{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge'
) }}

SELECT
    o.order_id,
    o.customer_id,
    o.order_date,                             -- kept so the incremental filter can reference it
    d.date_sk                                 AS order_date_sk,
    o.total_amount,
    o.status,
    CURRENT_TIMESTAMP                         AS load_dts
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('dim_date') }} d ON o.order_date = d.full_date

{% if is_incremental() %}
  -- Reprocess from the latest loaded date; the merge on order_id deduplicates
  WHERE o.order_date >= (SELECT MAX(order_date) FROM {{ this }})
{% endif %}

yaml — dbt tests (schema.yml)

version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customer')
              field: customer_id
      - name: total_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:   # range check; requires the dbt_utils package
              min_value: 0.01
              max_value: 999999.99

python — running dbt in Airflow via BashOperator

# In the Airflow DAG: BashOperator to run dbt
from airflow.operators.bash import BashOperator

run_dbt = BashOperator(
    task_id="run_dbt_gold_models",
    bash_command="""
        cd /opt/dbt/my_project &&
        dbt run --select tag:gold --target prod --profiles-dir /opt/dbt/profiles
    """,
    env={"DBT_PROFILES_DIR": "/opt/dbt/profiles"},
    append_env=True,  # keep PATH and other inherited variables (Airflow 2.3+)
)

# Run tests after models complete
test_dbt = BashOperator(
    task_id="test_dbt_gold_models",
    bash_command="""
        cd /opt/dbt/my_project &&
        dbt test --select tag:gold --target prod --profiles-dir /opt/dbt/profiles
    """
)

run_dbt >> test_dbt

Materialization	What it Creates	When to Use
view	SQL view — no data stored	Staging / lightweight transforms
table	Full table, rebuilt every run	Small-medium marts
incremental	Table, only new rows merged/appended	Large fact tables (avoid full rebuild)
ephemeral	CTE — no physical object created	Intermediate logic reused within a run

📌 dbt Cloud vs dbt Core

dbt Core is the open-source CLI: free, runs anywhere, and you manage orchestration yourself, typically with Airflow. dbt Cloud is the SaaS product, which adds a scheduler, IDE, lineage graph, CI integration, and managed jobs. Many teams run dbt Core with Airflow.

29.28 — AWS DATA ENGINEERING

Spark Performance Engineering

Partitioning strategy, join optimization, Adaptive Query Execution (AQE), caching, and small-file compaction — the core levers for tuning Spark pipelines at scale.

⚡

29.28 Spark Performance Engineering

Partitioning Strategy

Spark processes data in partitions — chunks that run in parallel across executor cores. The number and size of partitions directly determines whether your job is fast, slow, or crashes with OOM errors. Too few partitions = underutilized cluster. Too many = scheduling overhead and tiny tasks.

🧠 Analogy

Think of partitions as puzzle pieces. Too few big pieces and only a few people can work at once. Too many tiny pieces and people spend more time picking up pieces than placing them. The sweet spot is pieces sized so every person is always busy.

How Many Partitions — spark.sql.shuffle.partitions

spark.sql.shuffle.partitions controls how many partitions are created after a shuffle (groupBy, join, orderBy). The default is 200 — correct for large clusters, terrible for small datasets (200 tiny partitions waste scheduling overhead). Rule of thumb: target 128 MB – 200 MB per partition after the shuffle.

python — tuning shuffle partitions

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("perf-tuning").getOrCreate()

# Default is 200 — too high for small/medium data
print(spark.conf.get("spark.sql.shuffle.partitions"))  # 200

# For a 10 GB dataset with 200 MB target size → 50 partitions
spark.conf.set("spark.sql.shuffle.partitions", "50")

# For very large datasets (1 TB+) you might go higher
spark.conf.set("spark.sql.shuffle.partitions", "2000")

# With AQE enabled, Spark adjusts this automatically at runtime
spark.conf.set("spark.sql.adaptive.enabled", "true")
# AQE coalesces small post-shuffle partitions dynamically
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
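The rule of thumb reduces to one division. The helper below is a hypothetical convenience function, not a Spark API: it takes the expected shuffle size and a target partition size and returns a starting value for spark.sql.shuffle.partitions.

```python
import math


def suggest_shuffle_partitions(shuffle_bytes: int,
                               target_partition_bytes: int = 200 * 1024**2,
                               min_partitions: int = 1) -> int:
    """Partitions needed so each one holds about target_partition_bytes."""
    return max(min_partitions, math.ceil(shuffle_bytes / target_partition_bytes))


ten_gib = 10 * 1024**3
one_tib = 1024**4

print(suggest_shuffle_partitions(ten_gib))
print(suggest_shuffle_partitions(one_tib))
print(suggest_shuffle_partitions(one_tib, target_partition_bytes=128 * 1024**2))
```

Treat the result as a starting point and confirm partition sizes in the Spark UI. With AQE enabled, it is usually safer to err on the high side and let coalescing merge small partitions.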

repartition() vs coalesce() — When to Use Each

repartition(n) performs a full shuffle — data moves across the network to create exactly n evenly distributed partitions. Use it to increase partitions or when you need balanced distribution before a join. coalesce(n) is a narrow transformation — it merges partitions locally without a shuffle. Use it to decrease partitions cheaply (e.g., before writing output to avoid thousands of small files).

python — repartition vs coalesce

df = spark.table("silver.events")

print(f"Original partitions: {df.rdd.getNumPartitions()}")  # e.g. 800

# repartition — full shuffle, balanced, use before heavy joins
df_balanced = df.repartition(200)
# Can also repartition by column — co-locates same keys together
df_by_date = df.repartition(200, "event_date")

# coalesce — no shuffle, just merges existing partitions
# Use before writing output to reduce small files
df_small = df.coalesce(10)

# Write with coalesce to get at most 10 output files
df_small.write.mode("overwrite").parquet("s3://bucket/output/")

# Using repartition(10) only to reduce the file count costs a full shuffle.
# Prefer coalesce, unless coalescing to very few partitions would also cut
# the parallelism of the upstream stage; then repartition can be faster.
# df.repartition(10).write...  ← usually wasteful if only reducing count

Aspect	repartition(n)	coalesce(n)
Shuffle	Yes — full network shuffle	No — local merge only
Partition balance	Roughly even (round-robin, or hashed by column)	Uneven (merges as-is)
Increasing partitions	Yes	No
Decreasing partitions	Yes (expensive)	Yes (cheap)
Best use	Before join, balancing skewed data	Before write, reducing output files
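The difference in partition balance is easy to see with plain lists. This toy model is not how Spark implements either operation; it assumes coalesce merges neighbouring partitions whole and repartition deals rows out round-robin.

```python
def coalesce(partitions, n):
    """Merge neighbouring partitions without moving individual rows (no shuffle)."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions, n):
    """Redistribute every row round-robin across n partitions (full shuffle)."""
    out = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


# Four input partitions of very different sizes: 100, 5, 5, and 10 rows
partitions = [list(range(100)), list(range(5)), list(range(5)), list(range(10))]

coalesced_sizes = [len(p) for p in coalesce(partitions, 2)]
repartitioned_sizes = [len(p) for p in repartition(partitions, 2)]

print("coalesce(2):   ", coalesced_sizes)
print("repartition(2):", repartitioned_sizes)
```

The coalesced output inherits the imbalance of its inputs, while the repartitioned output is even because every row was moved.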

Partition by Columns for Downstream Reads

When writing to Delta/Parquet, partitionBy() physically organizes files on disk by column values. This enables partition pruning — downstream jobs and queries skip entire folders when filtering on those columns.

python — partitionBy for storage-level partitioning

# Writing partitioned by date columns — Athena/Spark can skip entire date folders
df.write \
  .format("delta") \
  .mode("overwrite") \
  .partitionBy("year", "month", "day") \
  .save("s3://my-lake/gold/events/")

# Reading: Spark pushes the filter down and skips irrelevant partitions,
# so only the year=2024/month=1/ folders are read.
# A comment cannot follow a trailing backslash, so the chain is wrapped in parentheses.
(
    spark.read.format("delta")
    .load("s3://my-lake/gold/events/")
    .filter("year = 2024 AND month = 1")
    .count()
)

⚠️ Partition Column Cardinality

Never partition by a high-cardinality column like user_id or order_id — this creates millions of tiny files (the small files problem). Good partition columns have low cardinality: date parts, region, status, category.

Join Optimization

Joins are the most expensive operation in Spark. Choosing the wrong join strategy on large tables causes massive shuffles, OOM errors, and hour-long jobs. Understanding when Spark picks each join type — and how to force the right one — is a core senior DE skill.

Broadcast Join — When and How

A broadcast join sends a copy of the small table to every executor, so the large table never moves. It completely eliminates the shuffle. Spark automatically broadcasts tables under spark.sql.autoBroadcastJoinThreshold (default 10 MB). You can force it with a hint.
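Conceptually, each task builds a hash table from the small table and probes it while scanning its own slice of the large table. The sketch below models that in plain Python; the sample rows are made up, and it assumes the small table has unique join keys.

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner join where the small side is copied to every partition of the large side."""
    lookup = {row[key]: row for row in small_rows}   # built once, "sent" to every task
    joined = []
    for partition in large_partitions:               # each partition is joined in place
        for row in partition:
            match = lookup.get(row[key])
            if match is not None:
                joined.append({**row, **match})
    return joined


large_partitions = [
    [{"txn_id": 1, "store_id": "S1"}, {"txn_id": 2, "store_id": "S2"}],
    [{"txn_id": 3, "store_id": "S1"}, {"txn_id": 4, "store_id": "S9"}],  # S9 has no match
]
small_rows = [
    {"store_id": "S1", "store_name": "Downtown"},
    {"store_id": "S2", "store_name": "Airport"},
]

result = broadcast_hash_join(large_partitions, small_rows, "store_id")
print([(r["txn_id"], r["store_name"]) for r in result])
```

Rows of the large table never leave their partition, which is the reason no shuffle is needed.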

python — broadcast join hint

from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

large_df = spark.table("silver.transactions")   # 500 GB
small_df = spark.table("gold.dim_store")         # 2 MB

# Auto broadcast (if small_df < autoBroadcastJoinThreshold)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))  # 50 MB

# Force broadcast with hint — even if Spark wouldn't auto-broadcast
result = large_df.join(broadcast(small_df), "store_id")

# SQL hint equivalent
spark.sql("""
    SELECT /*+ BROADCAST(s) */ t.*, s.store_name
    FROM silver.transactions t
    JOIN gold.dim_store s ON t.store_id = s.store_id
""")

Sort-Merge Join — Default for Large Tables

When both tables are large, Spark uses a sort-merge join: shuffle both tables by the join key, sort each partition, then merge matching rows. This is the safe default but requires two shuffles (one per table) and sorting — expensive but correct for any data size.
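The merge phase is simple once both sides are sorted by the join key. The following single-machine sketch shows the sort and merge steps only (the shuffle is omitted); the sample rows are illustrative.

```python
def sort_merge_join(left, right, key):
    """Inner join of two lists of dicts by sorting both and merging in one pass."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row in the run with every right row in the run
            while i < len(left) and left[i][key] == lk:
                out.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return out


transactions = [
    {"order_id": 2, "customer_id": "C2"},
    {"order_id": 1, "customer_id": "C1"},
    {"order_id": 2, "customer_id": "C3"},
]
orders = [
    {"order_id": 2, "order_total": 20.0},
    {"order_id": 3, "order_total": 30.0},
    {"order_id": 1, "order_total": 10.0},
]

joined = sort_merge_join(transactions, orders, "order_id")
print([(r["order_id"], r["customer_id"], r["order_total"]) for r in joined])
```

Because both inputs are consumed in key order, memory use stays bounded by the size of one key's run, which is what makes this strategy safe for very large tables.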

python — force sort-merge join hint

# Force sort-merge join (useful when you've pre-bucketed both tables)
result = spark.sql("""
    SELECT /*+ MERGE(t, o) */ t.customer_id, o.order_total
    FROM silver.transactions t
    JOIN silver.orders o ON t.order_id = o.order_id
""")

Data Skew in Joins — Salting Technique

Skew happens when one join key value has disproportionately more rows than others (e.g., 80% of orders belong to one customer_id = 'UNKNOWN'). This causes one executor task to process millions of rows while others sit idle — the slowest task determines the total job time.

Salting fixes skew by adding a random suffix to the skewed key, splitting one huge task into many small balanced ones.

python — salting to fix join skew

from pyspark.sql import functions as F

SALT_FACTOR = 10  # split into 10 buckets

# Large skewed table: add a random salt column
transactions = spark.table("silver.transactions") \
    .withColumn("salt", (F.rand() * SALT_FACTOR).cast("int"))

# Small dimension: replicate each row once per salt value
dim_customer = spark.table("gold.dim_customer")
salt_range = spark.range(SALT_FACTOR) \
    .select(F.col("id").cast("int").alias("salt"))
dim_exploded = dim_customer.crossJoin(salt_range)

# Join on (customer_id, salt): no single task gets all the UNKNOWN rows.
# Joining on column names keeps one customer_id column in the result.
result = transactions.join(dim_exploded, ["customer_id", "salt"]) \
    .drop("salt")
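The effect of salting on partition sizes can be checked without a cluster. In the sketch below, zlib.crc32 stands in for Spark's hash partitioner (an assumption made so the result is reproducible), and the key distribution of 8,000 UNKNOWN rows out of 10,000 is made up.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8
SALT_FACTOR = 10


def partition_for(key: str) -> int:
    """Deterministic stand-in for a hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


rng = random.Random(42)
keys = ["UNKNOWN"] * 8000 + [f"C-{i}" for i in range(2000)]

# Without salt, every UNKNOWN row hashes to the same partition
unsalted = Counter(partition_for(k) for k in keys)

# With salt, UNKNOWN becomes UNKNOWN_0 .. UNKNOWN_9 and spreads out
salted = Counter(partition_for(f"{k}_{rng.randrange(SALT_FACTOR)}") for k in keys)

print("largest partition without salt:", max(unsalted.values()))
print("largest partition with salt:   ", max(salted.values()))
```

The largest partition bounds the slowest task, so shrinking it is what shortens the job.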

Skew Hint in Spark 3+

Spark 3 added AQE-based automatic skew handling, so manual salting is rarely needed when AQE is on: AQE detects skewed partitions at runtime and splits them automatically. The explicit SKEW join hint is a Databricks Runtime feature; open-source Spark on EMR or Glue does not recognize it.
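AQE flags a partition as skewed only when it is both several times larger than the median partition and larger than an absolute size threshold. The sketch below reproduces that two-part test in plain Python with the default factor of 5 and threshold of 256 MB; the function name and the sample sizes are illustrative, and the splitting step that follows detection is not modelled.

```python
from statistics import median

MB = 1024 * 1024


def find_skewed_partitions(sizes_bytes, factor=5, threshold_bytes=256 * MB):
    """Indexes of partitions larger than factor x median AND larger than the threshold."""
    med = median(sizes_bytes)
    return [
        i for i, size in enumerate(sizes_bytes)
        if size > factor * med and size > threshold_bytes
    ]


# Post-shuffle partition sizes; index 3 holds the hot key
sizes = [64 * MB, 70 * MB, 60 * MB, 2048 * MB, 300 * MB, 66 * MB]
skewed = find_skewed_partitions(sizes)
print(skewed)
```

The 300 MB partition passes the absolute threshold but not the median test, so it is left alone. Both conditions must hold.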

python — AQE skew join (no manual salting needed)

# Enable AQE — handles skew automatically
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# A partition is skewed if it's > this many times the median partition size
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
# And larger than this absolute threshold
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", str(256 * 1024 * 1024))

# AQE will automatically split the skewed partition during the join
result = transactions.join(dim_customer, "customer_id")

# Databricks Runtime only: explicit skew hint. Open-source Spark (EMR, Glue)
# ignores unrecognized hints, so rely on AQE or salting there.
result = spark.sql("""
    SELECT /*+ SKEW('t', 'customer_id', ('UNKNOWN', 'GUEST')) */ *
    FROM silver.transactions t
    JOIN gold.dim_customer c ON t.customer_id = c.customer_id
""")

Predicate Pushdown

Predicate pushdown means Spark (via Catalyst optimizer) moves filter conditions as close to the data source as possible — ideally into the file reader or even the storage layer — so less data is ever read into memory. It works automatically with Parquet, ORC, Delta, and Iceberg.

With Parquet / ORC

Parquet and ORC store min/max statistics per row group (Parquet) or stripe (ORC). Spark reads these statistics first and skips entire row groups where the filter can't match — without reading the actual data.

python — verify predicate pushdown in explain plan

df = spark.read.parquet("s3://my-lake/silver/events/")

# This filter gets pushed into the Parquet reader
result = df.filter("event_date = '2024-01-15' AND event_type = 'purchase'")

# Check the explain plan — look for "PushedFilters" in the scan
result.explain(True)
# You'll see: PushedFilters: [IsNotNull(event_date), EqualTo(event_date,2024-01-15)]
# This means Spark is filtering at the file level, not after reading all data
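Row-group skipping comes down to a range check against stored min/max values. The toy example below uses made-up statistics for three row groups and keeps only the groups whose range could contain the filter value.

```python
# Hypothetical per-row-group statistics, as a Parquet footer would record them
row_groups = [
    {"id": 0, "min_event_date": "2024-01-01", "max_event_date": "2024-01-10"},
    {"id": 1, "min_event_date": "2024-01-11", "max_event_date": "2024-01-20"},
    {"id": 2, "min_event_date": "2024-01-21", "max_event_date": "2024-01-31"},
]


def row_groups_to_read(groups, value):
    """Keep only row groups whose [min, max] range could contain the value."""
    return [
        g["id"] for g in groups
        if g["min_event_date"] <= value <= g["max_event_date"]
    ]


# ISO-formatted dates compare correctly as strings
print(row_groups_to_read(row_groups, "2024-01-15"))
print(row_groups_to_read(row_groups, "2024-02-01"))
```

Skipping works best when the data is sorted or clustered on the filter column, because then the ranges of different row groups overlap less.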

With Delta Lake and Iceberg

Delta and Iceberg add an additional layer: file-level statistics stored in the transaction log (Delta) or manifest files (Iceberg). Spark checks these stats and skips entire files before even opening them.

python — Delta file skipping via stats

# Delta stores per-file min/max stats in _delta_log (by default for the first 32 columns)
# This query skips files where max(amount) <= 1000
result = spark.read.format("delta") \
    .load("s3://my-lake/gold/transactions/") \
    .filter("amount > 1000 AND event_date = '2024-01-15'")

# ZORDER clusters data by columns → makes file skipping more effective
# Run this as a maintenance job
spark.sql("""
    OPTIMIZE gold.transactions
    ZORDER BY (event_date, customer_id)
""")

Partition Pruning in Queries

Partition pruning is the coarsest and most impactful pushdown — Spark skips entire directories (not just row groups) when the filter matches a partition column. The key requirement: the column you filter on must be a partition column (used in partitionBy() at write time).

python — partition pruning example

# Table written with partitionBy("year", "month")
# File layout: .../year=2024/month=1/... year=2024/month=2/... year=2023/...

df = spark.read.format("delta").load("s3://my-lake/gold/events/")

# Spark reads ONLY .../year=2024/month=1/ and skips all other folders
result = df.filter("year = 2024 AND month = 1")

# Check explain — look for "PartitionFilters" in the scan node
result.explain()
# Output shows: PartitionFilters: [isnotnull(year#10), (year#10 = 2024), ...]

# Dynamic partition pruning (DPP) — filter from a join key narrows partitions
large = spark.table("gold.fact_sales")          # partitioned by date_sk
small = spark.table("gold.dim_date").filter("fiscal_year = 2024")
# Spark computes dim_date result first, then uses those date_sk values
# to prune partitions in fact_sales — even though fact_sales isn't filtered directly
result = large.join(small, "date_sk")

Caching — persist() vs cache(), Storage Levels

cache() and persist() store a DataFrame in memory (and optionally disk) so downstream actions don't re-read from source or recompute expensive transformations. Use caching when you reuse a DataFrame more than once in the same job — without it, Spark recomputes the entire lineage from source every time.

python — cache vs persist with storage levels

from pyspark import StorageLevel

# cache() = shorthand for persist(StorageLevel.MEMORY_AND_DISK)
df_cached = df.cache()

# persist() lets you control exactly where data is stored.
# Pick ONE level per DataFrame: calling persist() again on an already-persisted
# DataFrame does not change its level (call unpersist() first).

# MEMORY_ONLY: fastest, but partitions that don't fit are dropped and recomputed on access
df.persist(StorageLevel.MEMORY_ONLY)

# MEMORY_AND_DISK: spills to disk if memory is full (safest default)
# df.persist(StorageLevel.MEMORY_AND_DISK)

# DISK_ONLY: always on disk, slow but survives memory pressure
# df.persist(StorageLevel.DISK_ONLY)

# MEMORY_ONLY_SER is a Scala/Java storage level; PySpark 3.x does not expose it,
# so StorageLevel.MEMORY_ONLY_SER raises AttributeError there

# Always trigger an action to materialize the cache
df_cached.count()   # materializes the cache

# Unpersist when done — free up memory for other jobs
df_cached.unpersist()

When to Cache and When Not To

Situation	Cache?	Reason
DataFrame used 2+ times in same job	Yes	Avoids re-reading source and recomputing
Iterative ML training loops	Yes	Same dataset used in every iteration
foreachBatch with multiple writes	Yes	batchDF used for Delta + Kafka simultaneously
Single-use DataFrame (one action)	No	Cache overhead without benefit
Very large dataset (fills memory + disk)	No	Spill cost may exceed recompute cost
Quickly computed transformation	No	Recomputing is faster than cache read

Spark UI — Reading It for Performance Diagnosis

The Spark UI (port 4040 while a job runs, History Server after) is your primary diagnostic tool. Every performance problem leaves a fingerprint in the UI — you need to know where to look.

DAG Visualization

The Jobs → Stages → DAG view shows how Spark decomposed your query into stages separated by shuffle boundaries. Each boundary (an Exchange node in the plan) comes from a wide transformation such as a join, groupBy, or repartition, and each one is a potential performance bottleneck. Identify the heaviest stages and focus optimization there.

Stages and Tasks — Shuffle Read/Write Metrics

Click into a Stage to see per-task metrics. Key columns to check:

Metric	What It Means	Problem Signal
Shuffle Read Size	Data read from shuffle files	Very large → shuffle bottleneck
Shuffle Write Size	Data written to shuffle files	Large → consider tuning partitions
Spill (Memory)	Size of the spilled data as it was held in memory (deserialized) before being spilled	Any spill → increase executor memory or reduce partition size
Spill (Disk)	Size of the same spilled data once written to disk (serialized, usually smaller)	High → serious memory pressure
Task Duration (max vs median)	How long tasks take	Max >> median → data skew
GC Time	Garbage collection time per task	>10% of task time → GC pressure, increase memory
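The two signals in the table that need arithmetic (max vs median task duration, and GC share of task time) can be checked with a small helper. This is a rough sketch: the diagnose_stage function, the 3x skew ratio, and computing the GC share across the whole stage rather than per task are assumptions for illustration; the 10% GC limit follows the table above.

```python
from statistics import median

def diagnose_stage(task_durations_s, gc_times_s, skew_ratio=3.0, gc_limit=0.10):
    """Return a list of problem labels for one stage, based on per-task metrics."""
    findings = []
    med = median(task_durations_s)
    longest = max(task_durations_s)
    # Max much larger than median suggests a few tasks carry most of the data
    if med > 0 and longest / med >= skew_ratio:
        findings.append("skew")
    total = sum(task_durations_s)
    # GC share computed over the whole stage (a simplification)
    if total > 0 and sum(gc_times_s) / total > gc_limit:
        findings.append("gc_pressure")
    return findings

# Values copied by hand from the Spark UI task table (illustrative numbers)
skewed = diagnose_stage([12, 11, 13, 12, 95], [1, 1, 1, 1, 20])
healthy = diagnose_stage([10, 11, 10, 12], [0.2, 0.2, 0.2, 0.2])
print(skewed, healthy)
```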

GC Time — When to Increase Executor Memory

If GC time is high, executors are spending more time cleaning up objects than doing actual work. Fix: increase spark.executor.memory, switch to Kryo serialization, or reduce data held in memory per task.

python — executor memory and GC tuning config

from pyspark.sql import SparkSession

# Parentheses allow inline comments; a trailing "\" followed by a comment is a SyntaxError
spark = (
    SparkSession.builder
    .appName("perf-tuned-job")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memoryOverhead", "2g")
    .config("spark.memory.fraction", "0.8")          # 80% of (heap - 300 MB reserved) for execution + storage
    .config("spark.memory.storageFraction", "0.3")   # 30% of that region is protected for cached data
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.registrationRequired", "false")
    .getOrCreate()
)
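To see what these fractions mean in megabytes, the arithmetic of Spark's unified memory model can be worked through by hand. The sketch below assumes the documented 300 MB reserved region and ignores memoryOverhead, which lives outside the JVM heap; unified_memory_mb is a hypothetical helper name.

```python
RESERVED_MB = 300  # fixed reserved memory in Spark's unified memory manager

def unified_memory_mb(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Approximate sizes of the unified, storage, and execution regions in MB."""
    unified = (heap_mb - RESERVED_MB) * memory_fraction
    storage = unified * storage_fraction   # cached blocks in this part are protected from eviction
    return {
        "unified": unified,
        "storage": storage,
        "execution": unified - storage,    # execution can also borrow unused storage memory
    }

# Settings from the config above: 8g heap, memory.fraction=0.8, storageFraction=0.3
regions = unified_memory_mb(8 * 1024, memory_fraction=0.8, storage_fraction=0.3)
for name, size in regions.items():
    print(f"{name}: {size:.0f} MB")
```

The defaults in the function signature (0.6 and 0.5) are Spark's defaults, so calling it with only a heap size shows what an untuned executor gets.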

Spill to Disk — Sign of Memory Pressure

Spill means Spark ran out of execution memory and had to write intermediate data (shuffle buffers, sort buffers, aggregation hash maps) to disk. This can slow the affected stages dramatically, because disk I/O and extra serialization replace in-memory work. The fix depends on the cause:

python — diagnosing and fixing spill

# Cause 1: Too few partitions → each partition too large → spill
# Fix: increase partition count
spark.conf.set("spark.sql.shuffle.partitions", "500")  # was 200

# Cause 2: Executor memory too small
# Fix: increase executor memory (if cluster resources allow)
# spark.executor.memory = "16g"

# Cause 3: Data skew — one giant partition spills while others are fine
# Fix: AQE skew handling or manual salting (see join optimization above)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Cause 4: Expensive aggregation with high cardinality
# Fix: use approx functions where exactness not required
from pyspark.sql import functions as F
df.agg(F.approx_count_distinct("user_id", rsd=0.05))  # much less memory than countDistinct
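For Cause 1, a partition count can be derived from the stage's shuffle size instead of guessed. A minimal sketch, assuming a 128 MB target per partition and using Spark's default of 200 as a floor; suggest_shuffle_partitions is a hypothetical helper, and the shuffle size would be read from the Spark UI.

```python
import math

def suggest_shuffle_partitions(shuffle_bytes,
                               target_partition_bytes=128 * 1024**2,
                               min_partitions=200):
    """Partition count that keeps each shuffle partition near the target size."""
    needed = math.ceil(shuffle_bytes / target_partition_bytes)
    return max(min_partitions, needed)

large_shuffle = suggest_shuffle_partitions(100 * 1024**3)  # 100 GB of shuffle data
small_shuffle = suggest_shuffle_partitions(1 * 1024**3)    # 1 GB of shuffle data
print(large_shuffle, small_shuffle)
```

The result would then be passed to spark.sql.shuffle.partitions as a string, for example spark.conf.set("spark.sql.shuffle.partitions", str(large_shuffle)).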

AQE — Adaptive Query Execution

AQE (introduced in Spark 3.0) re-optimizes the query plan at runtime using actual statistics collected during execution — unlike the Catalyst optimizer which plans ahead of time with estimates. It has three major features that often make jobs significantly faster with zero code changes.

python — enabling and configuring AQE

# Enable AQE — should be on by default in Spark 3.2+
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Feature 1: Auto-optimize shuffle partitions
# Coalesces many small post-shuffle partitions into fewer larger ones
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", str(128 * 1024 * 1024))  # 128 MB target

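# Feature 2: Convert sort-merge join → broadcast join at runtime
# If Spark overestimated a table's size but discovers it's small after the shuffle.
# The size limit comes from spark.sql.adaptive.autoBroadcastJoinThreshold (Spark 3.2+,
# defaults to spark.sql.autoBroadcastJoinThreshold); the local shuffle reader
# setting below avoids extra network transfer once a join has been converted
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")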

# Feature 3: Skew join handling (covered above)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Verify AQE is working — look for "AdaptiveSparkPlan" in explain output
df_result = large_df.join(medium_df, "key").groupBy("category").count()
df_result.explain()
# Output: == Physical Plan ==
# AdaptiveSparkPlan isFinalPlan=false
#   +- HashAggregate ...  ← AQE manages this
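The partition coalescing in Feature 1 can be pictured as a greedy merge of adjacent post-shuffle partitions. The function below is a simplified illustration of that idea under an assumed 128 MB advisory size; it is not Spark's actual implementation, and coalesce_partitions is a hypothetical name.

```python
def coalesce_partitions(sizes, target_bytes):
    """Group adjacent partition indexes until adding the next one would exceed the target."""
    groups, current, current_size = [], [], 0
    for idx, size in enumerate(sizes):
        if current and current_size + size > target_bytes:
            groups.append(current)          # close the current group
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        groups.append(current)
    return groups

MB = 1024**2
sizes = [10 * MB, 20 * MB, 90 * MB, 5 * MB, 130 * MB, 40 * MB]
groups = coalesce_partitions(sizes, 128 * MB)
print(groups)
```

Six shuffle partitions become three tasks here: the four small ones are merged, while the oversized partition stays on its own (splitting large partitions is the job of skew handling, not coalescing).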

Serialization — Kryo vs Java

Serialization converts objects to bytes for network transfer (shuffle) and disk storage (spill, cache). Java serialization is the default — simple but slow and produces large byte arrays. Kryo serialization is 2–10x faster and produces smaller byte arrays — use it for any job with heavy shuffle or caching.

python — configuring Kryo serialization

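from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryo.registrationRequired", "false") \
    .getOrCreate()

# Optionally register custom JVM classes for maximum Kryo efficiency.
# This is a startup setting: pass it on the builder (or spark-submit --conf),
# not via spark.conf.set() on a running session
# .config("spark.kryo.classesToRegister", "com.mycompany.MyClass")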

📌 When Serialization Matters

DataFrame/Dataset operations use Tungsten's UnsafeRow binary format internally, so they bypass Java/Kryo entirely and operate directly on binary memory. Kryo matters most for RDD operations and broadcast variables that serialize JVM objects; Python objects in PySpark are pickled, not Kryo-serialized.

File Compaction — Solving the Small Files Problem

The small files problem occurs when a pipeline writes many tiny files over time (e.g., streaming writes every minute, or partitioned writes with too many partitions). Reading thousands of small files is slow — each file has metadata overhead (S3 API call, Parquet footer read). The fix is periodic compaction — merging small files into larger ones (target: 128 MB – 1 GB per file).

Small Files Problem — Causes

Common causes: streaming micro-batch writes (one batch = one file per partition), over-partitioned data (too many partition columns = too many folders), incremental appends without compaction.

Compaction Job Pattern in Spark

python — manual compaction job (read → coalesce → overwrite)

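# Read a partition that has accumulated many small files
partition_path = "s3://my-lake/silver/events/year=2024/month=01/day=15/"
# Staging location for the compacted output (hypothetical path, adjust to your lake)
staging_path   = "s3://my-lake/_compaction_staging/silver/events/year=2024/month=01/day=15/"

df = spark.read.format("parquet").load(partition_path)
# inputFiles() lists the source files; getNumPartitions() would count Spark partitions instead
print(f"Files before compaction: {len(df.inputFiles())}")  # e.g. 800 small files

# Coalesce to target file count (128MB per file → 10 GB / 128 MB = ~80 files)
df_compacted = df.coalesce(80)

# Do NOT overwrite partition_path directly: Spark reads lazily, so overwriting
# the path being read fails or loses data. Write to staging first.
df_compacted.write.mode("overwrite").parquet(staging_path)

# Then swap: re-read the staged files and overwrite the original partition
spark.read.parquet(staging_path).write.mode("overwrite").parquet(partition_path)
print("Compaction complete")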

OPTIMIZE in Delta Lake

python — Delta OPTIMIZE (built-in compaction)

# Delta OPTIMIZE compacts small files into larger ones (target size is about 1 GB by default)
spark.sql("OPTIMIZE silver.events")

# Optimize only a specific partition — much faster for large tables
spark.sql("OPTIMIZE silver.events WHERE year = 2024 AND month = 1")

# ZORDER co-locates related rows in the same files for better file skipping
spark.sql("OPTIMIZE gold.transactions ZORDER BY (customer_id, event_date)")

# Run as a scheduled Airflow task (e.g., daily at midnight)
# After streaming writes accumulate small files, OPTIMIZE consolidates them

Compaction with Iceberg rewrite_data_files

python — Iceberg compaction

# Via Spark SQL (requires an Iceberg SparkCatalog, here named my_catalog, to be configured)
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'silver.events',
        options => map(
            'target-file-size-bytes', '134217728',  -- 128 MB
            'min-file-size-bytes',    '33554432',   -- only compact files < 32 MB
            'max-concurrent-file-group-rewrites', '10'
        )
    )
""")

📌 Compaction in Production

Schedule OPTIMIZE / compaction as a separate maintenance DAG in Airflow — not inline with your ingestion pipeline. Run it during off-peak hours (nightly for batch tables, every few hours for streaming tables). This keeps your ingestion fast and your reads fast.

29.29 — AWS DATA ENGINEERING

Pipeline Observability & Reliability

Metrics, logging, lineage tracking, SLA monitoring, and alerting — the practices that keep production data pipelines reliable and make failures visible before they become incidents.

Observability Pillars — Metrics, Logs, Lineage

Observability is the ability to understand what your pipeline is doing, why it failed, and where data came from — from the outside, by examining its outputs. In data engineering, observability rests on three pillars. Without all three, you are flying blind in production.

🧠 Analogy

Metrics tell you your car's speed dropped to zero. Logs tell you the engine threw a fault code at 14:32:05. Lineage tells you the fuel came from this supplier, was processed at this refinery, and arrived via this pipeline. You need all three to diagnose and fix the problem.

Pillar	Answers	Example in Data Engineering
Metrics	What happened? How much? How long?	rows_processed=1.2M, duration_seconds=142, dq_failures=3
Logs	Why did it happen? What was the error?	ERROR: NullPointerException at silver.orders line 47, batch_id=20240115
Lineage	Where did this data come from? What consumed it?	gold.daily_revenue ← silver.orders ← bronze.raw_orders ← PostgreSQL.orders

Custom CloudWatch Metrics

AWS built-in metrics cover infrastructure (CPU, memory, disk). But pipeline-level metrics — rows processed, DQ failures, pipeline duration — must be published manually from your Glue/EMR/Lambda code using put_metric_data(). These custom metrics are then used to build dashboards and alarms.

python — publishing custom pipeline metrics to CloudWatch

import boto3
import time
from datetime import datetime, timezone

cw = boto3.client("cloudwatch", region_name="us-east-1")

def publish_pipeline_metrics(pipeline_name, rows_read, rows_written,
                              rows_rejected, duration_seconds, dq_score):
    """Publish all key pipeline metrics in a single batch call."""
    now = datetime.now(timezone.utc)
    dimensions = [{"Name": "PipelineName", "Value": pipeline_name}]

    cw.put_metric_data(
        Namespace="DataEngineering/Pipelines",
        MetricData=[
            {
                "MetricName": "RowsProcessed",
                "Dimensions": dimensions,
                "Value": rows_written,
                "Unit": "Count",
                "Timestamp": now
            },
            {
                "MetricName": "RowsRejected",
                "Dimensions": dimensions,
                "Value": rows_rejected,
                "Unit": "Count",
                "Timestamp": now
            },
            {
                "MetricName": "PipelineDurationSeconds",
                "Dimensions": dimensions,
                "Value": duration_seconds,
                "Unit": "Seconds",
                "Timestamp": now
            },
            {
                "MetricName": "DataQualityScore",
                "Dimensions": dimensions,
                "Value": dq_score,
                "Unit": "Percent",
                "Timestamp": now
            },
            {
                "MetricName": "PipelineSuccess",
                "Dimensions": dimensions,
                "Value": 1,          # 1 = success, publish 0 on failure
                "Unit": "Count",
                "Timestamp": now
            }
        ]
    )
    print(f"[METRICS] Published metrics for {pipeline_name}")

# Usage at end of Glue ETL job
start_time = time.time()

# ... your ETL logic here ...
rows_read     = 1_200_000
rows_written  = 1_198_500
rows_rejected = 1_500
dq_score      = 99.87

duration = time.time() - start_time
publish_pipeline_metrics(
    pipeline_name   = "silver-orders-etl",
    rows_read       = rows_read,
    rows_written    = rows_written,
    rows_rejected   = rows_rejected,
    duration_seconds= duration,
    dq_score        = dq_score
)

📌 Namespace Convention

Always use a structured namespace like DataEngineering/Pipelines or Company/DataPlatform. This groups your custom metrics separately from AWS built-in metrics and makes dashboard creation much cleaner. Use Dimensions (PipelineName, Environment, SourceSystem) to slice metrics in CloudWatch.
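One way to keep dimensions consistent across pipelines is to build every metric datum through a single helper. The sketch below only constructs the MetricData payload and does not call AWS; build_metric and chunked are hypothetical helpers, and the batch size passed to chunked should be taken from the current PutMetricData quota rather than from this example.

```python
from datetime import datetime, timezone

def build_metric(name, value, unit, dimensions, timestamp=None):
    """Build one MetricData entry from a plain dict of dimension names to values."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in sorted(dimensions.items())],
        "Value": value,
        "Unit": unit,
        "Timestamp": timestamp or datetime.now(timezone.utc),
    }

def chunked(items, size):
    """Yield successive slices so each put_metric_data call stays within a request limit."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

dims = {"PipelineName": "silver-orders-etl", "Environment": "prod"}
metrics = [
    build_metric("RowsProcessed", 1_198_500, "Count", dims),
    build_metric("RowsRejected", 1_500, "Count", dims),
]
batches = list(chunked(metrics, 1))   # batch size of 1 only to show the splitting
print(len(batches), metrics[0]["Dimensions"])
```

Each batch would then be sent with cw.put_metric_data(Namespace="DataEngineering/Pipelines", MetricData=batch).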

Alerting Architecture — CloudWatch Alarm → SNS → PagerDuty / Slack

A well-designed alerting architecture has a clear flow: metric threshold breach → CloudWatch Alarm → SNS Topic → subscribers (email, Slack webhook, PagerDuty). Each pipeline should have at minimum four alarms: failure alarm, duration alarm, DQ failure alarm, and DLQ depth alarm.

python — creating CloudWatch alarms for a pipeline

import boto3

cw  = boto3.client("cloudwatch", region_name="us-east-1")
sns = boto3.client("sns",         region_name="us-east-1")

PIPELINE_NAME   = "silver-orders-etl"
SNS_TOPIC_ARN   = "arn:aws:sns:us-east-1:123456789012:data-pipeline-alerts"

def create_pipeline_alarms(pipeline_name, sns_topic_arn):
    dims = [{"Name": "PipelineName", "Value": pipeline_name}]

    # Alarm 1 — Pipeline failure (PipelineSuccess drops to 0)
    cw.put_metric_alarm(
        AlarmName        = f"{pipeline_name}-failure",
        AlarmDescription = f"{pipeline_name} did not complete successfully",
        Namespace        = "DataEngineering/Pipelines",
        MetricName       = "PipelineSuccess",
        Dimensions       = dims,
        Statistic        = "Sum",
        Period           = 3600,       # evaluate hourly; for a once-a-day pipeline use 86400, otherwise every hour without a run counts as breaching
        EvaluationPeriods= 1,
        Threshold        = 1,
        ComparisonOperator = "LessThanThreshold",
        TreatMissingData = "breaching", # missing metric = pipeline didn't run = alert
        AlarmActions     = [sns_topic_arn],
        OKActions        = [sns_topic_arn]
    )

    # Alarm 2 — Duration exceeded SLA (e.g., must finish within 30 min = 1800s)
    cw.put_metric_alarm(
        AlarmName        = f"{pipeline_name}-duration-sla-breach",
        AlarmDescription = f"{pipeline_name} exceeded 30-minute SLA",
        Namespace        = "DataEngineering/Pipelines",
        MetricName       = "PipelineDurationSeconds",
        Dimensions       = dims,
        Statistic        = "Maximum",
        Period           = 3600,
        EvaluationPeriods= 1,
        Threshold        = 1800,       # 30 minutes
        ComparisonOperator = "GreaterThanThreshold",
        AlarmActions     = [sns_topic_arn]
    )

    # Alarm 3 — DQ score dropped below threshold
    cw.put_metric_alarm(
        AlarmName        = f"{pipeline_name}-dq-failure",
        AlarmDescription = f"{pipeline_name} data quality score below 99%",
        Namespace        = "DataEngineering/Pipelines",
        MetricName       = "DataQualityScore",
        Dimensions       = dims,
        Statistic        = "Minimum",
        Period           = 3600,
        EvaluationPeriods= 1,
        Threshold        = 99.0,
        ComparisonOperator = "LessThanThreshold",
        AlarmActions     = [sns_topic_arn]
    )

    # Alarm 4 — DLQ depth (failed messages piling up)
    cw.put_metric_alarm(
        AlarmName        = f"{pipeline_name}-dlq-depth",
        AlarmDescription = "Dead letter queue has messages — pipeline errors not being handled",
        Namespace        = "AWS/SQS",
        MetricName       = "ApproximateNumberOfMessagesVisible",
        Dimensions       = [{"Name": "QueueName", "Value": f"{pipeline_name}-dlq"}],
        Statistic        = "Sum",
        Period           = 300,        # check every 5 minutes
        EvaluationPeriods= 1,
        Threshold        = 1,
        ComparisonOperator = "GreaterThanOrEqualToThreshold",
        AlarmActions     = [sns_topic_arn]
    )

    print(f"[ALARMS] Created 4 alarms for {pipeline_name}")

create_pipeline_alarms(PIPELINE_NAME, SNS_TOPIC_ARN)

python — SNS topic setup with Slack webhook subscriber

import boto3, json

sns = boto3.client("sns", region_name="us-east-1")

# Create the alert topic
response = sns.create_topic(Name="data-pipeline-alerts")
topic_arn = response["TopicArn"]

# Subscribe an email endpoint
sns.subscribe(TopicArn=topic_arn, Protocol="email",
              Endpoint="data-team@company.com")

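# Subscribe a Lambda function that forwards to Slack
# (Lambda reads SNS message and calls Slack webhook URL)
# Note: the function's resource policy must also allow sns.amazonaws.com to invoke it
sns.subscribe(TopicArn=topic_arn, Protocol="lambda",
              Endpoint="arn:aws:lambda:us-east-1:123456789012:function:slack-notifier")

# The Lambda (slack_notifier) is deployed as its own module and looks like this:
import json
import urllib.request

# Placeholder URL; in production load it from Secrets Manager or an environment variable
SLACK_WEBHOOK = "https://hooks.slack.com/services/T.../B.../..."

def lambda_handler(event, context):
    for record in event["Records"]:
        raw   = record["Sns"]["Message"]
        alarm = record["Sns"].get("Subject") or "(no subject)"
        try:
            message = json.loads(raw)    # CloudWatch alarm notifications are JSON
        except json.JSONDecodeError:
            message = raw                # plain-text messages (e.g. SLA reports)
        payload = {
            "text": f":rotating_light: *PIPELINE ALERT*\n*Alarm:* {alarm}\n*Detail:* {message}"
        }
        req = urllib.request.Request(
            SLACK_WEBHOOK,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"}
        )
        urllib.request.urlopen(req, timeout=10)   # fail fast instead of hanging the Lambda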
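The message formatting can be separated from the HTTP call so it is easy to test locally. The sketch below is one possible shape, assuming CloudWatch alarm notifications carry the AlarmName, NewStateValue, and NewStateReason fields and that other publishers send plain text; format_slack_text is a hypothetical helper name.

```python
import json

def format_slack_text(sns_record):
    """Turn one SNS record into Slack message text, handling JSON and plain-text bodies."""
    sns = sns_record["Sns"]
    subject = sns.get("Subject") or "(no subject)"
    raw = sns["Message"]
    try:
        body = json.loads(raw)
    except json.JSONDecodeError:
        body = None
    if not isinstance(body, dict):
        # Plain-text message, for example an SLA breach report
        return f"*Alarm:* {subject}\n*Detail:* {raw}"
    name = body.get("AlarmName", subject)
    state = body.get("NewStateValue", "UNKNOWN")
    reason = body.get("NewStateReason", "")
    return f"*Alarm:* {name}\n*State:* {state}\n*Reason:* {reason}"

# Hand-written sample records for a local check (not real AWS output)
alarm_record = {"Sns": {"Subject": "ALARM", "Message": json.dumps({
    "AlarmName": "silver-orders-etl-failure",
    "NewStateValue": "ALARM",
    "NewStateReason": "Threshold Crossed",
})}}
text_record = {"Sns": {"Subject": "SLA Breach", "Message": "gold-daily-revenue: NO RUN RECORDED"}}

alarm_text = format_slack_text(alarm_record)
plain_text = format_slack_text(text_record)
print(alarm_text)
print(plain_text)
```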

SLA Monitoring — Expected Completion Time per Pipeline

Every production pipeline has an SLA — a contractual or internal commitment that data will be available by a certain time (e.g., "daily sales report ready by 06:00 UTC"). SLA monitoring checks whether each pipeline completed on time, and fires an alert before business users notice the data is missing.

python — SLA check via Lambda on a schedule

import boto3
from datetime import datetime, timezone, timedelta
import json

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
sns      = boto3.client("sns",       region_name="us-east-1")

AUDIT_TABLE   = dynamodb.Table("pipeline_audit")
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:data-pipeline-alerts"

# SLA definitions — pipeline_name → expected completion UTC hour
PIPELINE_SLAS = {
    "silver-orders-etl":  {"expected_by_hour": 4,  "expected_by_minute": 30},
    "gold-daily-revenue": {"expected_by_hour": 5,  "expected_by_minute": 0},
    "gold-executive-kpi": {"expected_by_hour": 6,  "expected_by_minute": 0},
}

def lambda_handler(event, context):
    """Run at 06:30 UTC daily — check all pipelines completed on time."""
    today = datetime.now(timezone.utc).date().isoformat()
    breaches = []

    for pipeline_name, sla in PIPELINE_SLAS.items():
        # Look up today's run in audit table
        response = AUDIT_TABLE.get_item(
            Key={"pipeline_name": pipeline_name, "run_date": today}
        )
        item = response.get("Item")

        if not item:
            breaches.append(f"❌ {pipeline_name}: NO RUN RECORDED for {today}")
            continue

        if item.get("status") != "SUCCESS":
            breaches.append(f"❌ {pipeline_name}: Status={item.get('status')} on {today}")
            continue

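        # Check if it completed before the SLA deadline
        end_time = datetime.fromisoformat(item["end_time"])
        if end_time.tzinfo is None:
            # Treat naive timestamps as UTC; comparing naive and aware datetimes raises TypeError
            end_time = end_time.replace(tzinfo=timezone.utc)
        deadline = datetime(
            end_time.year, end_time.month, end_time.day,
            sla["expected_by_hour"], sla["expected_by_minute"],
            tzinfo=timezone.utc
        )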
        if end_time > deadline:
            delay_mins = int((end_time - deadline).total_seconds() / 60)
            breaches.append(
                f"⚠️ {pipeline_name}: Completed {delay_mins} min LATE "
                f"(finished {end_time.strftime('%H:%M')} UTC, SLA was "
                f"{sla['expected_by_hour']:02d}:{sla['expected_by_minute']:02d} UTC)"
            )

    if breaches:
        message = "SLA BREACH REPORT — " + today + "\n\n" + "\n".join(breaches)
        sns.publish(
            TopicArn = SNS_TOPIC_ARN,
            # SNS subjects must be ASCII and under 100 characters, so no emoji here
            Subject  = f"SLA Breach Detected: {len(breaches)} pipeline(s)",
            Message  = message
        )
        print(f"[SLA] Published breach alert: {len(breaches)} pipeline(s)")
    else:
        print(f"[SLA] All pipelines met SLA for {today} ✅")

SLA Miss → SNS Alert (Airflow Integration)

python — Airflow SLA miss callback

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

def sla_miss_callback(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Called by Airflow 2.x when tasks miss their SLA.

    Unlike task-level callbacks, this one receives five positional
    arguments rather than a context dict.
    """
    sns = boto3.client("sns", region_name="us-east-1")   # created here, not at DAG parse time
    missed = "\n".join(f"- {s.task_id} (logical date {s.execution_date})" for s in slas)
    sns.publish(
        TopicArn = "arn:aws:sns:us-east-1:123456789012:data-pipeline-alerts",
        Subject  = f"SLA Miss: {dag.dag_id}"[:99],        # SNS subjects: ASCII, under 100 chars
        Message  = (
            f"DAG {dag.dag_id} has tasks that missed their SLA:\n{missed}\n"
            f"Check Airflow UI for details."
        )
    )

with DAG(
    dag_id          = "silver_orders_etl",
    schedule_interval = "0 2 * * *",      # run at 02:00 UTC
    start_date      = datetime(2024, 1, 1),   # illustrative start date
    catchup         = False,
    sla_miss_callback = sla_miss_callback,
    default_args    = {
        "owner": "data-engineering",
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        # SLA is measured from the scheduled start of the DAG run, not from task start
        "sla": timedelta(hours=2)
    }
) as dag:
    pass  # your tasks here

Lineage Tracking — Source → Transform → Target

Data lineage is the record of where data came from, what transformations it went through, and where it landed. It answers critical questions: "Which source tables feed this Gold report?", "If this upstream table is wrong, which dashboards are affected?", "Who changed this data and when?".

Recording Lineage in DynamoDB / RDS Metadata Table

python — writing lineage records per pipeline run

import boto3
from datetime import datetime, timezone
import uuid

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
lineage_table = dynamodb.Table("pipeline_lineage")

def record_lineage(run_id, pipeline_name, source_tables,
                   target_table, transformation_summary, rows_out):
    """Write one lineage record per pipeline run."""
    lineage_table.put_item(Item={
        "lineage_id"             : str(uuid.uuid4()),
        "run_id"                 : run_id,
        "pipeline_name"          : pipeline_name,
        "source_tables"          : source_tables,        # list of strings
        "target_table"           : target_table,
        "transformation_summary" : transformation_summary,
        "rows_out"               : rows_out,
        "recorded_at"            : datetime.now(timezone.utc).isoformat()
    })

# Example usage after a Glue ETL job
record_lineage(
    run_id                 = "run-20240115-142300",
    pipeline_name          = "gold-daily-revenue",
    source_tables          = ["silver.orders", "silver.order_items", "gold.dim_customer"],
    target_table           = "gold.fact_daily_revenue",
    transformation_summary = "Joined orders + items, grouped by date and store, applied SCD2 customer lookup",
    rows_out               = 45_230
)
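Once lineage records exist, the question "if this upstream table is wrong, what is affected?" becomes a graph traversal. The sketch below works on plain dicts shaped like the records written above (in practice they would be loaded from the lineage table); downstream_tables is a hypothetical helper name.

```python
from collections import deque

def downstream_tables(lineage_records, start_table):
    """Return every table reachable downstream of start_table, sorted by name."""
    # Build source -> set of targets from the per-run lineage records
    children = {}
    for rec in lineage_records:
        for src in rec["source_tables"]:
            children.setdefault(src, set()).add(rec["target_table"])

    # Breadth-first search over the lineage graph
    seen, queue = set(), deque([start_table])
    while queue:
        table = queue.popleft()
        for child in children.get(table, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

records = [
    {"source_tables": ["bronze.raw_orders"], "target_table": "silver.orders"},
    {"source_tables": ["silver.orders", "silver.order_items"],
     "target_table": "gold.fact_daily_revenue"},
    {"source_tables": ["gold.fact_daily_revenue"], "target_table": "gold.executive_kpi"},
]

impacted = downstream_tables(records, "bronze.raw_orders")
print(impacted)
```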

OpenLineage — Open Standard for Lineage

OpenLineage is an open-source specification for collecting lineage metadata from data pipelines. Instead of building your own lineage tables from scratch, you emit standardized run events (START, COMPLETE, FAIL) with input/output datasets, and any compatible backend (Marquez, DataHub, Atlan) can store and visualize the lineage graph.

python — emitting OpenLineage events from a Spark job

# Install: pip install openlineage-python
from openlineage.client import OpenLineageClient
# Legacy openlineage.client.run API: the dataset classes live in the same module as RunEvent
from openlineage.client.run import (
    RunEvent, RunState, Run, Job, InputDataset, OutputDataset
)
from openlineage.client.facet import (
    DataSourceDatasetFacet, SchemaDatasetFacet, SchemaField
)
import uuid
from datetime import datetime, timezone

client = OpenLineageClient.from_environment()  # reads OPENLINEAGE_URL env var

run_id  = str(uuid.uuid4())
job     = Job(namespace="aws-glue", name="silver-orders-etl")

# Emit START event
client.emit(RunEvent(
    eventType = RunState.START,
    eventTime = datetime.now(timezone.utc).isoformat(),
    run       = Run(runId=run_id),
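    job       = job,
    producer  = "https://example.com/pipelines/silver-orders-etl",  # hypothetical URI identifying the emitting code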
    inputs    = [
        InputDataset(
            namespace = "postgresql://prod-db:5432",
            name      = "public.orders",
            facets    = {
                "dataSource": DataSourceDatasetFacet(
                    name="prod-postgres", uri="postgresql://prod-db:5432/salesdb"
                )
            }
        )
    ],
    outputs   = []
))

# ... run ETL ...

# Emit COMPLETE event with output dataset
client.emit(RunEvent(
    eventType = RunState.COMPLETE,
    eventTime = datetime.now(timezone.utc).isoformat(),
    run       = Run(runId=run_id),
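    job       = job,
    producer  = "https://example.com/pipelines/silver-orders-etl",  # hypothetical URI identifying the emitting code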
    inputs    = [],
    outputs   = [
        OutputDataset(
            namespace = "s3://my-lake",
            name      = "silver/orders",
            facets    = {
                "schema": SchemaDatasetFacet(fields=[
                    SchemaField("order_id", "STRING"),
                    SchemaField("customer_id", "STRING"),
                    SchemaField("net_revenue", "DOUBLE")
                ])
            }
        )
    ]
))

Marquez — Open Source Lineage Server

Marquez is the open-source reference implementation of an OpenLineage backend. It provides a REST API that receives lineage events and a UI that renders the full lineage graph — showing exactly which datasets feed which jobs and which jobs produce which datasets.

bash — running Marquez locally with Docker

# Marquez needs a PostgreSQL database and a separate web UI container, so the
# project's Docker Compose quickstart is the simplest way to start everything
git clone https://github.com/MarquezProject/marquez.git
cd marquez
./docker/up.sh    # API on :5000, admin on :5001, web UI on :3000

# Set env var so OpenLineage client sends events to Marquez
export OPENLINEAGE_URL=http://localhost:5000

# Open Marquez UI at http://localhost:3000
# Navigate: Namespaces → Jobs → click any job to see lineage graph

Audit Table Pattern — One Record per Pipeline Run

The audit table is the heartbeat log of your entire data platform. Every pipeline run — success or failure — writes one record. It gives you historical visibility into pipeline health, run durations, row counts, and error messages — all queryable with SQL.

sql — audit table DDL

-- PostgreSQL dialect (gen_random_uuid() is built in from PostgreSQL 13)
CREATE TABLE pipeline_audit (
    audit_id          UUID DEFAULT gen_random_uuid(),
    run_id            TEXT NOT NULL,
    pipeline_name     TEXT NOT NULL,
    dag_id            TEXT,
    task_id           TEXT,
    source_table      TEXT,
    target_table      TEXT,
    status            TEXT NOT NULL,     -- SUCCESS / FAILED / RUNNING
    start_time        TIMESTAMP NOT NULL,
    end_time          TIMESTAMP,
    duration_seconds  INT,
    rows_read         BIGINT,
    rows_written      BIGINT,
    rows_rejected     BIGINT,
    dq_score          DECIMAL(5,2),
    error_message     TEXT,
    error_type        TEXT,              -- NullConstraint / SchemaError / TimeoutError
    batch_id          TEXT,
    watermark_value   TIMESTAMP,
    environment       TEXT,              -- dev / staging / prod
    created_at        TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (audit_id)
);

python — audit record writer (reusable utility)

import boto3, uuid, time
from datetime import datetime, timezone

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table    = dynamodb.Table("pipeline_audit")

class PipelineAudit:
    """Context manager that writes audit records automatically."""

    def __init__(self, pipeline_name, source_table, target_table,
                 run_id=None, environment="prod"):
        self.pipeline_name = pipeline_name
        self.source_table  = source_table
        self.target_table  = target_table
        self.run_id        = run_id or str(uuid.uuid4())
        self.environment   = environment
        self.start_time    = None
        self.rows_read     = 0
        self.rows_written  = 0
        self.rows_rejected = 0
        self.dq_score      = 100.0

    def __enter__(self):
        self.start_time = datetime.now(timezone.utc)
        # Write RUNNING record immediately so we know the pipeline started.
        # run_id doubles as the item key (audit_id), so the final write in
        # __exit__ overwrites this item: one record per pipeline run
        table.put_item(Item={
            "audit_id"      : self.run_id,
            "run_id"        : self.run_id,
            "pipeline_name" : self.pipeline_name,
            "source_table"  : self.source_table,
            "target_table"  : self.target_table,
            "status"        : "RUNNING",
            "start_time"    : self.start_time.isoformat(),
            "environment"   : self.environment
        })
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        end_time = datetime.now(timezone.utc)
        duration = int((end_time - self.start_time).total_seconds())

        if exc_type is None:
            status        = "SUCCESS"
            error_message = None
            error_type    = None
        else:
            status        = "FAILED"
            error_message = str(exc_val)[:1000]   # DynamoDB 400 KB item limit
            error_type    = exc_type.__name__

        # Same key as the RUNNING record, so this replaces it with the final state
        table.put_item(Item={
            "audit_id"        : self.run_id,
            "run_id"          : self.run_id,
            "pipeline_name"   : self.pipeline_name,
            "source_table"    : self.source_table,
            "target_table"    : self.target_table,
            "status"          : status,
            "start_time"      : self.start_time.isoformat(),
            "end_time"        : end_time.isoformat(),
            "duration_seconds": duration,
            "rows_read"       : self.rows_read,
            "rows_written"    : self.rows_written,
            "rows_rejected"   : self.rows_rejected,
            "dq_score"        : str(self.dq_score),   # boto3 rejects Python floats, so store as string
            "error_message"   : error_message,
            "error_type"      : error_type,
            "environment"     : self.environment
        })
        return False   # do not suppress exceptions

# Usage — wraps your entire ETL logic
with PipelineAudit("silver-orders-etl",
                   source_table="bronze.raw_orders",
                   target_table="silver.orders") as audit:
    # your ETL logic here
    df = spark.table("bronze.raw_orders")
    audit.rows_read = df.count()

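    df_clean = df.dropna(subset=["order_id", "customer_id"])
    clean_count = df_clean.count()   # count once; every count() triggers a full Spark job
    audit.rows_written  = clean_count
    audit.rows_rejected = audit.rows_read - clean_count
    # Guard against an empty source so the score never divides by zero
    audit.dq_score = round(clean_count / audit.rows_read * 100, 2) if audit.rows_read else 0.0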

    df_clean.write.format("delta").mode("overwrite") \
        .saveAsTable("silver.orders")

Querying the Audit Table for Ops Insights

sql — useful audit table queries

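-- Syntax is PostgreSQL/Redshift style (INTERVAL literals, ::float casts); assumes the
-- audit records are exposed to a SQL engine as a table named pipeline_audit.

-- 1. Last 7 days pipeline success rate per pipeline
SELECT
    pipeline_name,
    COUNT(*)                                           AS total_runs,
    SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) AS successes,
    ROUND(100.0 * SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) / COUNT(*), 2) AS success_rate_pct
FROM pipeline_audit
WHERE start_time >= CURRENT_TIMESTAMP - INTERVAL '7 days'
  AND status IN ('SUCCESS', 'FAILED')   -- skip RUNNING start markers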
GROUP BY pipeline_name
ORDER BY success_rate_pct ASC;

-- 2. Pipelines that took longer than their SLA today
SELECT pipeline_name, start_time, end_time, duration_seconds, rows_written
FROM pipeline_audit
WHERE DATE(start_time) = CURRENT_DATE
  AND status = 'SUCCESS'
  AND duration_seconds > 1800     -- 30 min SLA
ORDER BY duration_seconds DESC;

-- 3. DQ score trend for a specific pipeline
SELECT DATE(start_time) AS run_date, AVG(dq_score::float) AS avg_dq_score
FROM pipeline_audit
WHERE pipeline_name = 'silver-orders-etl'
  AND status = 'SUCCESS'
  AND start_time >= CURRENT_TIMESTAMP - INTERVAL '30 days'
GROUP BY run_date
ORDER BY run_date;

-- 4. Most common error types
SELECT error_type, COUNT(*) AS occurrences, MAX(start_time) AS last_seen
FROM pipeline_audit
WHERE status = 'FAILED'
  AND start_time >= CURRENT_TIMESTAMP - INTERVAL '30 days'
GROUP BY error_type
ORDER BY occurrences DESC;

Full Production Observability Stack — End to End

Putting it all together — here is how all observability components connect in a production AWS data platform.

PIPELINE OBSERVABILITY STACK

┌─────────────────────────────────────────────────────────────┐
│                  DATA PIPELINE (Glue / EMR / Lambda)        │
│                                                             │
│  ① publish_pipeline_metrics()  → CloudWatch Custom Metrics  │
│  ② put_log_events()            → CloudWatch Log Groups      │
│  ③ PipelineAudit context mgr   → DynamoDB audit table       │
│  ④ record_lineage()            → DynamoDB lineage table      │
│  ⑤ OpenLineage client.emit()   → Marquez / DataHub          │
└──────────────┬──────────────────────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────────────────────┐
│              CloudWatch Alarms (4 per pipeline)             │
│  • PipelineSuccess < 1   → ALARM                           │
│  • Duration > SLA        → ALARM                           │
│  • DQScore < 99%         → ALARM                           │
│  • DLQ depth ≥ 1         → ALARM                           │
└──────────────┬──────────────────────────────────────────────┘
               │ AlarmActions
               ▼
┌─────────────────────────────────────────────────────────────┐
│                    SNS Topic                                │
│            "data-pipeline-alerts"                           │
└──┬──────────────────┬───────────────────┬───────────────────┘
   │                  │                   │
   ▼                  ▼                   ▼
Email            Lambda →            PagerDuty
(data-team)      Slack Webhook       (on-call DE)

+ SLA Check Lambda (runs at 06:30 UTC daily)
  → scans DynamoDB audit table
  → compares end_time vs SLA deadline
  → publishes breach report to SNS
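The SLA check boils down to comparing each run's end_time with a deadline. A minimal sketch of that comparison, assuming audit items shaped like the PipelineAudit records (ISO-8601 UTC strings) and a hypothetical deadline table named SLA_DEADLINES_UTC; the DynamoDB scan and the SNS publish are deliberately left out:

```python
from datetime import date, datetime, time, timezone

# Hypothetical SLA table: UTC time of day by which each pipeline must have finished.
SLA_DEADLINES_UTC = {"silver-orders-etl": time(6, 0)}

def find_sla_breaches(audit_items, run_date):
    """Return one entry per pipeline with no successful run finished by its deadline."""
    breaches = []
    for name, deadline in SLA_DEADLINES_UTC.items():
        deadline_dt = datetime.combine(run_date, deadline, tzinfo=timezone.utc)
        day_start   = datetime.combine(run_date, time(0, 0), tzinfo=timezone.utc)
        on_time = [
            item for item in audit_items
            if item["pipeline_name"] == name
            and item["status"] == "SUCCESS"      # RUNNING items carry no end_time
            and day_start <= datetime.fromisoformat(item["end_time"]) <= deadline_dt
        ]
        if not on_time:
            breaches.append({"pipeline_name": name, "deadline": deadline_dt.isoformat()})
    return breaches

# Illustration with hand-written items instead of a DynamoDB scan.
items = [{"pipeline_name": "silver-orders-etl", "status": "SUCCESS",
          "end_time": "2024-01-02T07:10:00+00:00"}]
print(find_sla_breaches(items, date(2024, 1, 2)))
```

Keeping the comparison in a pure function like this makes the Lambda handler a thin wrapper that is easy to unit test.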

📌 Production Checklist

Every pipeline in production should have:
✅ Audit record on every run
✅ Custom CloudWatch metrics
✅ At least 4 CloudWatch alarms
✅ SNS alert routing to Slack + email
✅ SLA check Lambda running daily
✅ Lineage recorded per run
✅ DLQ for failed messages

29.30.1 — BOTO3 DEEP DIVE

Boto3 Fundamentals

Sessions, clients, resources, credential resolution, and production-grade client configuration — the foundation for every AWS API call you'll make from Python.

🐍


Session vs Client vs Resource — Difference and When to Use Each

Boto3 has three levels of abstraction for interacting with AWS. Understanding which one to use — and why — is the first thing every data engineer must get right before writing a single API call.

🧠 Analogy

A Session is your AWS identity card — it holds your credentials and region. A Client is a low-level phone line to one AWS service — you speak the raw API language. A Resource is a high-level object wrapper — it hides the phone line and gives you Python objects with methods. Most production DE code uses Client because it maps 1:1 to the AWS API docs and gives full control.

Session

A Session stores configuration — credentials, region, profile — and is the factory from which you create clients and resources. By default, boto3 uses an implicit default session. You create explicit sessions when you need to use different credentials (cross-account, assumed role) or different regions in the same script.

python — session basics

import boto3

# Implicit default session — uses ~/.aws/credentials or instance profile
s3 = boto3.client("s3")   # boto3 creates a default session internally

# Explicit session — useful when you need custom region or profile
session = boto3.Session(
    region_name="eu-west-1",      # override default region
    profile_name="data-dev"       # use a named profile from ~/.aws/config
)
s3_eu = session.client("s3")

# Cross-account session using temporary credentials from STS assume_role
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::999999999999:role/CrossAccountDataRole",
    RoleSessionName="etl-cross-account"
)["Credentials"]

cross_session = boto3.Session(
    aws_access_key_id     = creds["AccessKeyId"],
    aws_secret_access_key = creds["SecretAccessKey"],
    aws_session_token     = creds["SessionToken"],
    region_name           = "us-east-1"
)
s3_cross = cross_session.client("s3")  # now operates in the target account
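Temporary credentials from assume_role expire (one hour by default), and the same Credentials dict carries an Expiration timestamp. A small sketch of a refresh check; needs_refresh and the five-minute margin are illustrative choices, not part of boto3:

```python
from datetime import datetime, timedelta, timezone

def needs_refresh(expiration, margin=timedelta(minutes=5), now=None):
    """True when temporary credentials expire within `margin`.

    `expiration` is the timezone-aware datetime found under
    creds["Expiration"] in the assume_role response.
    """
    now = now or datetime.now(timezone.utc)
    return expiration - now <= margin

# Illustration with fixed timestamps instead of a real STS call.
issued  = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
expires = issued + timedelta(hours=1)          # STS default session length
print(needs_refresh(expires, now=issued))                          # False
print(needs_refresh(expires, now=issued + timedelta(minutes=57)))  # True
```

A long-running job would call assume_role again and rebuild the session whenever this returns True.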

Client — Low-Level, Full API Coverage

A Client is the low-level interface — every method maps directly to one AWS API call. It returns raw Python dictionaries. Use Client for all production data engineering code — it covers 100% of the AWS API, has predictable return structures, and aligns with the AWS documentation.

python — client usage pattern

import boto3

# Create a client — specify service name and region
s3     = boto3.client("s3",     region_name="us-east-1")
glue   = boto3.client("glue",   region_name="us-east-1")
emr    = boto3.client("emr",    region_name="us-east-1")
dynamo = boto3.client("dynamodb", region_name="us-east-1")

# Client methods return raw dicts — you access fields by key
response = s3.get_object(Bucket="my-lake", Key="silver/orders/part-0001.parquet")
body     = response["Body"].read()           # raw bytes
metadata = response["Metadata"]             # dict of custom metadata
size     = response["ContentLength"]        # integer

# List all S3 buckets — returns dict with "Buckets" key
buckets = s3.list_buckets()
for b in buckets["Buckets"]:
    print(b["Name"], b["CreationDate"])

Resource — High-Level Object-Oriented Interface

A Resource is the high-level, object-oriented interface. It wraps AWS entities as Python objects with attributes and methods (e.g., bucket.objects.all() instead of s3.list_objects_v2()). Resources exist only for a handful of services (S3, DynamoDB, EC2, IAM, SQS, SNS, CloudWatch, CloudFormation, Glacier, OpsWorks), and AWS no longer adds features to the resource interface. They are convenient for simple scripts but lack full API coverage, so they are not recommended for production pipelines.

python — resource usage (S3 and DynamoDB examples)

import boto3

# S3 Resource — object-oriented access
s3_resource = boto3.resource("s3", region_name="us-east-1")
bucket      = s3_resource.Bucket("my-lake")

# List objects under a prefix (resource style; the collection paginates for you)
for obj in bucket.objects.filter(Prefix="silver/orders/"):
    print(obj.key, obj.size)

# Upload a file — simpler syntax than client.upload_file
bucket.upload_file("local_file.parquet", "silver/orders/part-0001.parquet")

# DynamoDB Resource — table as an object
dynamo_resource = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamo_resource.Table("pipeline_audit")

# put_item with resource — cleaner syntax than client
table.put_item(Item={"run_id": "abc123", "status": "SUCCESS", "rows": 50000})

# get_item with resource — no need to parse nested response
response = table.get_item(Key={"run_id": "abc123"})
item     = response.get("Item")   # already deserialized — no type wrappers

Aspect	Session	Client	Resource
Purpose	Credential/config store	Raw API calls	OO wrapper
Return type	N/A	Raw dict	Python objects
API coverage	N/A	100% of AWS API	Subset of services
Production DE use	Always (implicit)	Primary choice	Simple scripts only
Services supported	All	All	S3, DynamoDB, EC2, IAM, SQS, SNS and a few others

Authentication Patterns

Boto3 resolves credentials using a credential chain — it tries each method in order and uses the first one it finds. In production, you never hardcode credentials. In local dev, you use profiles. On AWS compute, the instance/task/pod role is automatically used.

Credential Chain — Order of Resolution

text — boto3 credential resolution order

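1. Explicit credentials passed to boto3.client() or Session()
2. Environment variables:
     AWS_ACCESS_KEY_ID
     AWS_SECRET_ACCESS_KEY
     AWS_SESSION_TOKEN (for temporary creds)
     (AWS_DEFAULT_REGION sets the region; it is not a credential)
     → Lambda exposes its execution role's temporary credentials this way
3. Assume-role profiles (role_arn + source_profile in ~/.aws/config)
4. Assume role with web identity (OIDC token file, e.g. IAM roles for service accounts on EKS)
5. AWS IAM Identity Center (SSO) credential cache (after aws sso login)
6. Shared credentials file: ~/.aws/credentials
7. AWS config file: ~/.aws/config
     [default] profile or named profiles
8. Container credentials (ECS task role, via the container credentials endpoint)
9. EC2 instance profile (also used by EMR nodes running on EC2)
     → fetched automatically from the EC2 metadata service (169.254.169.254)
     → Role-based credentials are the standard production pattern on AWS compute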
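"First match wins" has a practical consequence: stale access keys left in environment variables shadow the role attached to the compute. The toy model below only illustrates that ordering; resolve_source and its provider names are hypothetical and far simpler than botocore's real resolver:

```python
import os

def resolve_source(explicit_key=None, env=None, profile_keys=None, role_available=False):
    """Toy model of the credential chain: return the first source that has credentials."""
    env = os.environ if env is None else env
    providers = [
        ("explicit",     lambda: explicit_key is not None),
        ("env-vars",     lambda: "AWS_ACCESS_KEY_ID" in env and "AWS_SECRET_ACCESS_KEY" in env),
        ("shared-file",  lambda: bool(profile_keys)),
        ("compute-role", lambda: role_available),
    ]
    for name, found in providers:
        if found():
            return name      # later providers are never consulted
    return None

# Env vars win even though a compute role is attached.
stale_env = {"AWS_ACCESS_KEY_ID": "AKIA-old", "AWS_SECRET_ACCESS_KEY": "old-secret"}
print(resolve_source(env=stale_env, role_available=True))   # env-vars
```

When an instance "ignores" its role, checking for leftover AWS_* variables is usually the first debugging step.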

IAM Role Assumption — Production Standard

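In production, AWS compute services (Glue jobs, Lambda functions, EMR clusters) run under an IAM role. Boto3 automatically picks up the role's temporary credentials from the runtime: the instance metadata service on EC2 and EMR nodes, environment variables on Lambda, the container credentials endpoint on ECS. Managed runtimes such as Glue supply them for you as well, so you never need to pass credentials explicitly. This is the most secure pattern.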

python — production pattern: no explicit credentials needed

import boto3

# On Glue / Lambda / EMR, boto3 uses the attached IAM role automatically
# No credentials in code: the runtime supplies temporary ones (IMDS, env vars, or container endpoint)
s3   = boto3.client("s3",   region_name="us-east-1")
glue = boto3.client("glue", region_name="us-east-1")

# Verify which identity boto3 resolved — useful for debugging auth issues
sts      = boto3.client("sts")
identity = sts.get_caller_identity()
print(f"Account : {identity['Account']}")
print(f"UserId  : {identity['UserId']}")
print(f"ARN     : {identity['Arn']}")
# Output on a Glue job:
# ARN: arn:aws:sts::123456789012:assumed-role/GlueExecutionRole/session-name
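The ARN returned by get_caller_identity can also be checked programmatically, so a job fails fast when it runs under the wrong role. A sketch under the assumption that the identity is an assumed role in the shape shown above; role_name_from_arn and assert_expected_role are hypothetical helper names:

```python
def role_name_from_arn(arn):
    """Extract the role name from an STS assumed-role ARN, else return None."""
    # Expected shape: arn:aws:sts::<account>:assumed-role/<role-name>/<session-name>
    resource = arn.split(":", 5)[-1]
    parts = resource.split("/")
    if len(parts) >= 3 and parts[0] == "assumed-role":
        return parts[1]
    return None

def assert_expected_role(arn, expected_role):
    """Raise at startup when the resolved identity is not the expected role."""
    actual = role_name_from_arn(arn)
    if actual != expected_role:
        raise RuntimeError(f"Expected role {expected_role!r}, resolved {actual!r} from {arn}")

arn = "arn:aws:sts::123456789012:assumed-role/GlueExecutionRole/session-name"
print(role_name_from_arn(arn))                  # GlueExecutionRole
assert_expected_role(arn, "GlueExecutionRole")  # passes silently
```

In a real job the ARN would come from sts.get_caller_identity()["Arn"].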

Environment Variables — CI/CD and Docker

bash — setting credentials via environment variables

# Used in CI/CD pipelines (GitHub Actions, GitLab CI) and Docker containers
# Store in GitHub Secrets / GitLab CI Variables — never in code

export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI..."
export AWS_DEFAULT_REGION="us-east-1"

# boto3 picks these up automatically — no code change needed
python my_etl_script.py

~/.aws/credentials and Config Files — Local Dev

ini — ~/.aws/credentials and ~/.aws/config

# ~/.aws/credentials
[default]
aws_access_key_id     = AKIA...
aws_secret_access_key = wJalrX...

[data-dev]
aws_access_key_id     = AKIA...DEV
aws_secret_access_key = wJalrX...DEV

# ~/.aws/config
[default]
region = us-east-1
output = json

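[profile data-dev]
region = eu-west-1
role_arn       = arn:aws:iam::999999999999:role/DevDataRole
# Assume DevDataRole using the [default] profile's credentials.
# Keep comments on their own line: trailing text may be read as part of the value.
source_profile = default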

python — using named profiles in local dev

import boto3

# Use the data-dev profile from ~/.aws/config
session = boto3.Session(profile_name="data-dev")
s3      = session.client("s3")

# Useful for local testing against a dev AWS account
# while prod code uses the instance role (no profile needed)

Instance Profile — EC2 / Lambda / Glue Auto-Auth

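When your code runs on EC2 (including EMR nodes), boto3 automatically fetches temporary credentials from the EC2 Instance Metadata Service (IMDS) at 169.254.169.254. Lambda exposes its execution role's credentials through environment variables, and ECS tasks use the container credentials endpoint; the effect is the same. AWS rotates these credentials automatically before they expire, so you never manage them. This is the gold standard for production.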

python — confirming instance profile is being used

import boto3
from botocore.exceptions import BotoCoreError, ClientError

# On any AWS compute, this confirms which role boto3 resolved automatically
sts = boto3.client("sts")
try:
    identity = sts.get_caller_identity()
    print(f"✅ Auth via: {identity['Arn']}")
except (BotoCoreError, ClientError) as e:
    print(f"❌ Auth failed: {e}. Check the IAM role attachment.")

Regions and Endpoint Configuration

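Always specify region_name explicitly when creating clients; never rely on the default region being correct in a different environment. IAM is a global service, so any region works (us-east-1 is the convention). STS also offers regional endpoints, which AWS recommends over the global one. S3 buckets are regional: create_bucket needs a CreateBucketConfiguration with a LocationConstraint for any region other than us-east-1. For VPC interface endpoints without private DNS, you may need to specify a custom endpoint_url; S3 gateway endpoints need no client-side change.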

python — region and custom endpoint configuration

import boto3

# Always specify region explicitly — avoid silent wrong-region bugs
s3   = boto3.client("s3",   region_name="us-east-1")
glue = boto3.client("glue", region_name="eu-west-1")   # Glue in Ireland

# Custom endpoint — for VPC Interface Endpoints (private subnet access)
# Instead of going over the public internet, traffic stays inside VPC
s3_private = boto3.client(
    "s3",
    region_name  = "us-east-1",
    endpoint_url = "https://bucket.vpce-0a1b2c3d4e-abcdef.s3.us-east-1.vpce.amazonaws.com"
)

# LocalStack — local AWS emulation for integration testing
s3_local = boto3.client(
    "s3",
    region_name            = "us-east-1",
    endpoint_url           = "http://localhost:4566",
    aws_access_key_id      = "test",
    aws_secret_access_key  = "test"
)
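To keep one code path for LocalStack and real AWS, the client arguments can be derived from the environment. A sketch that only builds the keyword arguments; PIPELINE_ENV and LOCALSTACK_URL are hypothetical variable names chosen for this example:

```python
import os

def client_kwargs(service, env=None):
    """Build keyword arguments for boto3.client() from environment settings."""
    env = os.environ if env is None else env
    kwargs = {
        "service_name": service,
        "region_name": env.get("AWS_DEFAULT_REGION", "us-east-1"),
    }
    # PIPELINE_ENV and LOCALSTACK_URL are hypothetical variable names.
    if env.get("PIPELINE_ENV") == "local":
        kwargs.update(
            endpoint_url=env.get("LOCALSTACK_URL", "http://localhost:4566"),
            aws_access_key_id="test",        # dummy values accepted by LocalStack
            aws_secret_access_key="test",
        )
    return kwargs

print(client_kwargs("s3", env={"PIPELINE_ENV": "local"})["endpoint_url"])
# Real usage: s3 = boto3.client(**client_kwargs("s3"))
```

Outside the local environment no endpoint or keys are set, so the normal credential chain applies.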

Boto3 Config Object — Retries, Timeouts, Connection Pool

The botocore.config.Config object controls retry behavior, connection timeouts, and connection pool size. In production pipelines that make thousands of API calls, tuning these settings prevents silent failures from transient errors and connection exhaustion.

python — production boto3 config object

import boto3
from botocore.config import Config

# Production-grade config for high-throughput DE pipelines
prod_config = Config(
    region_name = "us-east-1",

    # Retry configuration
    retries = {
        "total_max_attempts": 10, # total attempts (1 initial + 9 retries);
                                  # the older "max_attempts" key counts retries only
        "mode": "adaptive"        # adaptive: standard behaviour + client-side rate limiting (experimental)
                                  # standard: exponential backoff, broader set of retryable errors
                                  # legacy:   boto3's default mode; older, narrower retry handler
    },

    # Timeout configuration (in seconds)
    connect_timeout = 5,          # time to establish TCP connection
    read_timeout    = 60,         # time to wait for server response after connected

    # Connection pool — increase for high-concurrency pipelines
    max_pool_connections = 50     # default is 10; increase for parallel uploads/downloads
)

# Apply config to a client
s3   = boto3.client("s3",   config=prod_config)
glue = boto3.client("glue", config=prod_config)

# Verify the config took effect
print(s3.meta.config.retries)         # normalized retry settings (total_max_attempts, mode)
print(s3.meta.config.connect_timeout) # 5

Retry Mode	Behaviour	Use When
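legacy	Exponential backoff on a narrower set of errors, 5 total attempts by default	Still boto3's default; set a mode explicitly instead
standard	Exponential backoff on a broader set of retryable errors, 3 total attempts by default	Most production use cases
adaptive	Standard behaviour + token bucket rate limiter (marked experimental in the boto3 docs)	High-throughput pipelines hitting AWS rate limits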

⚠️ max_pool_connections vs ThreadPoolExecutor

If you use ThreadPoolExecutor with 50 threads each making S3 calls on the same client, the default pool of 10 connections becomes a bottleneck. Set max_pool_connections to at least match your thread count to avoid connection wait time.
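A sketch of keeping the thread count and the connection pool in step. fetch_all and MAX_WORKERS are illustrative names, and fake_fetch stands in for a real S3 call so the example runs without AWS access:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 16   # keep at or below max_pool_connections on the shared client

def fetch_all(fetch_one, keys, max_workers=MAX_WORKERS):
    """Run fetch_one(key) for every key on a bounded thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, keys))

# Stand-in for: lambda key: s3.get_object(Bucket="my-lake", Key=key)["Body"].read()
def fake_fetch(key):
    return len(key)

sizes = fetch_all(fake_fetch, ["a.parquet", "bb.parquet"])
print(sizes)   # [9, 10]
```

With a real client, the same MAX_WORKERS value would be passed as max_pool_connections in the Config object so no thread waits for a free connection.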

Complete Production Boto3 Client Setup Template

This is the pattern every DE should use as their standard template for initializing boto3 clients in production pipelines — combining explicit region, production config, and environment-aware credential resolution.

python — reusable boto3 client factory

import boto3
import os
from botocore.config import Config

REGION = os.environ.get("AWS_DEFAULT_REGION", "us-east-1")

# Shared settings; the region is passed per client so one config serves every region
PROD_CONFIG = Config(
    retries              = {"total_max_attempts": 10, "mode": "adaptive"},
    connect_timeout      = 5,
    read_timeout         = 60,
    max_pool_connections = 50
)

def get_client(service_name: str, region: str = REGION):
    """
    Factory for boto3 clients.
    - On AWS compute  → auto-uses the attached role (no creds needed)
    - In local dev    → uses ~/.aws/credentials or env vars
    - In CI/CD        → uses env var credentials
    """
    return boto3.client(service_name, region_name=region, config=PROD_CONFIG)

# Usage throughout the pipeline
s3     = get_client("s3")
glue   = get_client("glue")
dynamo = get_client("dynamodb")
sns    = get_client("sns")
cw     = get_client("cloudwatch")

📌 Golden Rules for Production Boto3

① Never hardcode AWS credentials in code — use IAM roles on AWS compute, env vars in CI/CD.
② Always specify region_name explicitly — never rely on defaults.
③ Always pass a Config with an explicit retry mode ("standard" or "adaptive") and reasonable timeouts.
④ Use sts.get_caller_identity() at startup to verify the correct identity is being used.
⑤ Use explicit Sessions only for cross-account or multi-region scenarios.

29.30.2

Error Handling & Retry Patterns

The single most missed topic in boto3. Every production pipeline will hit throttling, transient failures, and service errors. Knowing how to catch, classify, and retry them correctly separates junior scripts from senior-grade pipelines.

🚨

Exception Hierarchy


ClientError — The Most Common Exception

ClientError is raised when AWS returns an HTTP error response — meaning your request reached AWS, but AWS rejected it or something went wrong on its side. It wraps all service-level errors: access denied, resource not found, throttling, concurrent run limits, etc.

🧠 Analogy

Think of AWS as a restaurant. ClientError is when you order and the waiter comes back saying "sorry, we're out of that dish" or "your card was declined." Your request arrived — AWS just couldn't fulfill it.

python — catching ClientError

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", region_name="us-east-1")

try:
    response = s3.get_object(Bucket="my-bucket", Key="data/file.parquet")
except ClientError as e:
    # Always inspect these two fields first
    error_code    = e.response["Error"]["Code"]       # e.g. "NoSuchKey"
    error_message = e.response["Error"]["Message"]    # human-readable
    http_status   = e.response["ResponseMetadata"]["HTTPStatusCode"]  # e.g. 404

    print(f"Code: {error_code} | HTTP: {http_status} | Msg: {error_message}")

📌 Key Structure

Every ClientError has e.response["Error"]["Code"] and e.response["Error"]["Message"]. Always parse these — never just print the raw exception, as it loses the code you need to branch on.

BotoCoreError — SDK-Side Errors

BotoCoreError is the base class for errors raised by the SDK itself rather than returned by AWS: bad parameters, missing credentials, connection failures, timeouts. Most of them occur before a request ever reaches the service.

python — SDK-side exceptions

import boto3
from botocore.exceptions import (
    BotoCoreError,
    NoCredentialsError,
    EndpointConnectionError,
    ParamValidationError,
    ClientError
)

try:
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.get_object(Bucket="my-bucket", Key="file.parquet")

except NoCredentialsError:
    # No credentials found at all: misconfigured environment
    print("ERROR: No AWS credentials found. Check IAM role or env vars.")

except ParamValidationError as e:
    # Wrong parameter type or missing required parameter; raised before any network call
    print(f"ERROR: Bad parameters passed to boto3: {e}")

except EndpointConnectionError as e:
    # Could not connect to the AWS endpoint: DNS or network issue
    print(f"ERROR: Cannot reach AWS endpoint: {e}")

except ClientError as e:
    # AWS responded with an error (service-side)
    print(f"AWS Error: {e.response['Error']['Code']}: {e.response['Error']['Message']}")

except BotoCoreError as e:
    # Catch-all for any other SDK-level error
    print(f"Boto3 SDK error: {e}")

✅ Best Practice Order

Always catch specific exceptions first (NoCredentialsError, ParamValidationError), then ClientError, then the broad BotoCoreError last. Python's except chain stops at the first match.

Common Error Codes Every DE Must Handle

These are the error codes that appear in real pipelines. When you catch a ClientError, branch on e.response["Error"]["Code"] to take the right action for each case.

python — branching on error codes

import boto3
from botocore.exceptions import ClientError

def safe_get_object(bucket, key):
    s3 = boto3.client("s3", region_name="us-east-1")
    try:
        return s3.get_object(Bucket=bucket, Key=key)

    except ClientError as e:
        code = e.response["Error"]["Code"]

        if code == "NoSuchBucket":
            # The S3 bucket doesn't exist — configuration error
            raise ValueError(f"Bucket '{bucket}' does not exist. Check your config.")

        elif code == "NoSuchKey":
            # The file doesn't exist — could be expected (e.g. first run)
            print(f"File s3://{bucket}/{key} not found — may be first run.")
            return None

        elif code == "AccessDenied":
            # IAM permissions issue — escalate immediately
            raise PermissionError(f"Access denied to s3://{bucket}/{key}. Check IAM role.")

        elif code in ("ThrottlingException", "RequestLimitExceeded", "SlowDown"):
            # AWS is rate-limiting us — back off and retry
            print("Being throttled by S3 — apply exponential backoff.")
            raise  # re-raise so retry logic catches it

        elif code in ("ServiceUnavailable", "ServiceUnavailableException", "InternalError"):
            # Transient AWS service issue (S3 reports it as ServiceUnavailable or InternalError)
            raise  # re-raise for retry

        else:
            # Unknown error — log and re-raise
            print(f"Unhandled error code: {code}")
            raise

Error Code	Service	Meaning	Action
`NoSuchBucket`	S3	Bucket doesn't exist	Config error — fail fast
`NoSuchKey`	S3	Object doesn't exist	Handle gracefully (first run etc.)
`AccessDenied`	S3 (many other services return `AccessDeniedException`)	IAM policy blocks action	Fail fast, alert ops
`ThrottlingException`	Most (some return `Throttling` or `TooManyRequestsException`)	API rate limit hit	Backoff + retry
`SlowDown`	S3	S3 request rate limit	Backoff + retry
`EntityAlreadyExists`	IAM	Role/policy already exists	Skip or update
`ConcurrentRunsExceededException`	Glue	Job already running	Wait or queue
`ServiceUnavailableException`	Most (S3 returns `ServiceUnavailable`)	Transient AWS outage	Retry with backoff
`ValidationException`	Many	Invalid input	Fix parameters, don't retry
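The Action column can be encoded once and reused across pipelines. A minimal sketch that works on the ClientError response dict (e.response); the set names and the classify helper are illustrative, and unknown codes default to failing:

```python
FAIL_FAST = {"NoSuchBucket", "AccessDenied", "ValidationException"}
RETRY     = {"ThrottlingException", "SlowDown", "ServiceUnavailableException"}
TOLERATE  = {"NoSuchKey", "EntityAlreadyExists"}
WAIT      = {"ConcurrentRunsExceededException"}

def classify(error_response):
    """Map a ClientError response dict to one of: fail, retry, tolerate, wait."""
    code = error_response.get("Error", {}).get("Code", "Unknown")
    if code in RETRY:
        return "retry"
    if code in TOLERATE:
        return "tolerate"
    if code in WAIT:
        return "wait"
    # FAIL_FAST codes and anything unrecognized both stop the pipeline.
    return "fail"

print(classify({"Error": {"Code": "SlowDown", "Message": "Reduce your request rate."}}))
```

Inside an except ClientError block the call would be classify(e.response).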

🔁

Retry Patterns


Exponential Backoff with Jitter — Manual Implementation

When AWS throttles you, retrying immediately makes things worse — all clients pile on at the same moment. Exponential backoff doubles the wait time after each failure. Jitter adds randomness so multiple pipeline instances don't all retry at exactly the same second.

🧠 Analogy

Imagine 50 people all try to walk through a revolving door at once — it jams. Exponential backoff tells each person to wait 1s, then 2s, then 4s if they fail. Jitter makes each person wait a random amount within that window so they spread out naturally.

python — exponential backoff with jitter

import boto3, time, random
from botocore.exceptions import ClientError

# Error codes that are safe to retry (transient)
RETRYABLE_CODES = {
    "ThrottlingException",
    "RequestLimitExceeded",
    "SlowDown",
    "ServiceUnavailableException",
    "InternalError",
    "RequestTimeout",
    "ProvisionedThroughputExceededException",  # DynamoDB
}

def with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=32.0):
    """
    Call fn() with exponential backoff + full jitter on retryable errors.

    Wait formula: random.uniform(0, min(max_delay, base_delay * 2^attempt))
    Caps with the defaults: 1s, 2s, 4s, 8s (one sleep before each of the 4 retries)
    """
    for attempt in range(max_attempts):
        try:
            return fn()  # call the boto3 operation

        except ClientError as e:
            code = e.response["Error"]["Code"]

            if code not in RETRYABLE_CODES or attempt == max_attempts - 1:
                # Non-retryable error OR we've exhausted all attempts
                raise

            # Exponential cap with full jitter
            cap   = min(max_delay, base_delay * (2 ** attempt))
            sleep = random.uniform(0, cap)   # full jitter: random between 0 and cap

            print(f"[Attempt {attempt+1}/{max_attempts}] Got {code}. "
                  f"Retrying in {sleep:.2f}s...")
            time.sleep(sleep)

    raise RuntimeError("Exhausted all retry attempts")


# ── Usage example ──
s3 = boto3.client("s3", region_name="us-east-1")

response = with_backoff(
    lambda: s3.get_object(Bucket="my-bucket", Key="data/events.parquet")
)
body = response["Body"].read()
print(f"Downloaded {len(body)} bytes")

📌 Why Full Jitter?

AWS itself recommends "full jitter" — random value between 0 and the exponential cap. This prevents thundering herd: when many clients all retry at exactly 2s, 4s, 8s, they still collide. Jitter spreads them out.
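To see what the schedule looks like, the caps can be computed separately from the random draw. A sketch using the same defaults as with_backoff; backoff_caps and full_jitter are illustrative helper names:

```python
import random

def backoff_caps(max_attempts=5, base_delay=1.0, max_delay=32.0):
    """Upper bound of the sleep before each retry (one fewer than max_attempts)."""
    return [min(max_delay, base_delay * 2 ** attempt) for attempt in range(max_attempts - 1)]

def full_jitter(cap, rng=random):
    """Full jitter: a uniform draw between 0 and the cap."""
    return rng.uniform(0, cap)

caps = backoff_caps()
print(caps)        # [1.0, 2.0, 4.0, 8.0]
print(sum(caps))   # 15.0, the worst-case total sleep across all retries
print([round(full_jitter(c), 2) for c in caps])   # different on every run
```

The worst-case total is useful when setting a Lambda or Glue timeout: the job must allow for it on top of the actual work.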

tenacity Library — @retry Decorator

tenacity is a Python library that handles retries declaratively with decorators. It's cleaner than writing manual loops for every function. It supports exponential backoff, jitter, stop conditions, and custom retry predicates — all in one decorator.

python — tenacity @retry decorator

# pip install tenacity
import boto3
from botocore.exceptions import ClientError
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential_jitter,
    retry_if_exception,
    before_sleep_log,
    RetryError
)
import logging

logger = logging.getLogger(__name__)

# ── Define what counts as a retryable error ──
RETRYABLE_CODES = {
    "ThrottlingException", "SlowDown", "ServiceUnavailableException",
    "RequestLimitExceeded", "InternalError", "ProvisionedThroughputExceededException"
}

def is_retryable(exc):
    """Return True if this exception should trigger a retry."""
    if isinstance(exc, ClientError):
        return exc.response["Error"]["Code"] in RETRYABLE_CODES
    return False


# ── Decorated function — retries automatically ──
@retry(
    retry        = retry_if_exception(is_retryable),   # only retry on throttle/transient
    stop         = stop_after_attempt(5),               # give up after 5 total attempts
    wait         = wait_exponential_jitter(             # exponential backoff + jitter
                       initial=1,    # base of the first wait = 1s, doubling on each retry
                       max=30,       # cap wait at 30s
                       jitter=2      # add up to 2s random jitter on top
                   ),
    before_sleep = before_sleep_log(logger, logging.WARNING),  # log each retry
    reraise      = True             # re-raise the original exception on final failure
)
def download_s3_object(bucket: str, key: str) -> bytes:
    s3 = boto3.client("s3", region_name="us-east-1")
    response = s3.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()


# ── Usage ──
try:
    data = download_s3_object("my-data-lake", "bronze/events/2024/01/01/events.parquet")
    print(f"Downloaded {len(data):,} bytes")

except ClientError as e:
    # reraise=True makes the original ClientError surface here after the last attempt;
    # without it, tenacity would wrap the failure in RetryError instead
    print(f"Final failure after retries: {e.response['Error']['Code']}")

✅ Why Use tenacity Over Manual Loops?

Manual retry loops become noisy with logging, sleep, counter tracking. tenacity handles all of that cleanly. The @retry decorator keeps your function body focused on business logic, not retry plumbing.

python — tenacity for Glue job polling

import boto3
from tenacity import RetryError, retry, retry_if_result, stop_after_attempt, wait_fixed

glue = boto3.client("glue", region_name="us-east-1")

def is_still_running(state: str) -> bool:
    """Return True while the job is in a non-terminal state, so polling continues."""
    return state in ("STARTING", "WAITING", "RUNNING", "STOPPING")

@retry(
    retry         = retry_if_result(is_still_running),  # keep polling while non-terminal
    stop          = stop_after_attempt(60),              # max 60 polls (~10 minutes)
    wait          = wait_fixed(10),                      # poll every 10 seconds
    reraise       = True
)
def poll_glue_job(job_name: str, run_id: str) -> str:
    """Poll until Glue job reaches a terminal state. Returns final state string."""
    response = glue.get_job_run(JobName=job_name, RunId=run_id)
    state    = response["JobRun"]["JobRunState"]
    print(f"  Glue job state: {state}")
    return state  # retry_if_result inspects this value


# ── Usage ──
run = glue.start_job_run(JobName="my-etl-job", Arguments={"--env": "prod"})
run_id = run["JobRunId"]

try:
    final_state = poll_glue_job("my-etl-job", run_id)
except RetryError:
    # Raised when all 60 polls are used up and the job is still running
    raise TimeoutError(f"Glue job run {run_id} did not finish within the polling window")

if final_state == "SUCCEEDED":
    print("✅ Glue job completed successfully")
else:
    raise RuntimeError(f"❌ Glue job ended with state: {final_state}")

boto3 Adaptive Retry Mode — Automatic Backoff

boto3 has a built-in retry mechanism configurable via the Config object. With mode="adaptive", boto3 applies exponential backoff automatically for all retryable errors, plus a client-side rate limiter that slows requests before they even get throttled.

python — boto3 built-in adaptive retry

import boto3
from botocore.config import Config

# ── Configure adaptive retry globally for this client ──
retry_config = Config(
    retries={
        "total_max_attempts": 10,   # 1 initial attempt + 9 retries ("max_attempts" would count retries only)
        "mode": "adaptive"          # adaptive = standard retries + token bucket rate limiter
    }
)

s3    = boto3.client("s3",    region_name="us-east-1", config=retry_config)
glue  = boto3.client("glue",  region_name="us-east-1", config=retry_config)
dynamo= boto3.client("dynamodb", region_name="us-east-1", config=retry_config)

# Now all boto3 calls on these clients automatically retry on:
# ThrottlingException, ServiceUnavailableException, TransientError, etc.
# No extra code needed for basic retry cases.
response = s3.get_object(Bucket="my-bucket", Key="data/file.parquet")
print("Success — boto3 retried internally if needed")

Mode	Backoff	Rate Limiter	Best For
`legacy`	Exponential, narrower error set	None	boto3's default; prefer an explicit mode
`standard`	Exponential	None	Most pipelines
`adaptive`	Exponential	Token bucket	High-throughput pipelines

⚠️ boto3 Retry vs Your Own Retry

boto3 built-in retry handles transient/throttling errors automatically. But you still need your own retry logic for business-level polling (e.g. waiting for a Glue job to finish, polling Athena query state) — boto3 won't retry those because they're not HTTP errors, they're just job states you need to check.
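State polling of this kind can be written once and reused for Glue, Athena, or EMR. A sketch with an explicit timeout; poll_until_terminal is a hypothetical helper, and the state list below is fed from an iterator instead of a real Athena call:

```python
import time

def poll_until_terminal(get_state, terminal_states, interval=10.0, timeout=600.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Call get_state() until it returns a terminal state or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        state = get_state()
        if state in terminal_states:
            return state
        if clock() + interval > deadline:
            raise TimeoutError(f"Still in state {state!r} after {timeout}s")
        sleep(interval)

# Stand-in for repeated athena.get_query_execution(...) calls.
states = iter(["QUEUED", "RUNNING", "SUCCEEDED"])
final = poll_until_terminal(lambda: next(states),
                            {"SUCCEEDED", "FAILED", "CANCELLED"},
                            interval=0.01, timeout=1.0)
print(final)   # SUCCEEDED
```

Passing sleep and clock as parameters keeps the loop testable without real waiting.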

🏭

Full Production Error Handling Pattern


Complete Pattern: catch → classify → log → alert → audit

In a real production pipeline, catching the error is just step one. You also need to log it to CloudWatch, write it to an audit table in DynamoDB, publish a failure alert to SNS, and decide whether to retry or fail the pipeline. Here is a production-grade template that combines these steps:

python — full production error handling template

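import boto3, json, time, logging
from botocore.config import Config
from botocore.exceptions import ClientError, NoCredentialsError
from datetime import datetime, timezone

# ── Clients (adaptive retry, so transient errors are retried inside boto3) ──
retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})

s3     = boto3.client("s3",         region_name="us-east-1", config=retry_config)
dynamo = boto3.client("dynamodb",   region_name="us-east-1", config=retry_config)
sns    = boto3.client("sns",        region_name="us-east-1", config=retry_config)
cw     = boto3.client("cloudwatch", region_name="us-east-1", config=retry_config)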

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

AUDIT_TABLE   = "pipeline_audit"
ALERT_TOPIC   = "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"
PIPELINE_NAME = "s3_to_delta_bronze"
RUN_ID        = f"run_{int(time.time())}"

# ── Retryable error codes ──
RETRYABLE = {
    "ThrottlingException", "SlowDown", "ServiceUnavailableException",
    "RequestLimitExceeded", "InternalError"
}

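RUN_STARTED_AT = datetime.now(timezone.utc).isoformat()   # captured once per run

def write_audit(status: str, error_msg: str = "", rows: int = 0):
    """Write pipeline run status to DynamoDB audit table (put_item overwrites the item for this run_id)."""
    dynamo.put_item(
        TableName=AUDIT_TABLE,
        Item={
            "run_id":        {"S": RUN_ID},
            "pipeline_name": {"S": PIPELINE_NAME},
            "status":        {"S": status},
            "start_time":    {"S": RUN_STARTED_AT},   # stays fixed across status updates
            "updated_at":    {"S": datetime.now(timezone.utc).isoformat()},
            "rows_processed":{"N": str(rows)},
            "error_message": {"S": error_msg},
        }
    )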

def publish_alert(subject: str, message: str):
    """Push failure notification to SNS → email/Slack."""
    sns.publish(
        TopicArn=ALERT_TOPIC,
        Subject=subject,
        Message=json.dumps({
            "pipeline": PIPELINE_NAME,
            "run_id":   RUN_ID,
            "message":  message,
            "time":     datetime.now(timezone.utc).isoformat()
        }, indent=2)
    )

def publish_metric(metric_name: str, value: float, unit: str = "Count"):
    """Publish custom CloudWatch metric for dashboards and alarms."""
    cw.put_metric_data(
        Namespace="DataPipeline/Bronze",
        MetricData=[{
            "MetricName": metric_name,
            "Value":      value,
            "Unit":       unit,
            "Dimensions": [{"Name": "Pipeline", "Value": PIPELINE_NAME}]
        }]
    )


def run_pipeline():
    """Main pipeline entry point with full error handling."""
    rows_processed = 0

    try:
        logger.info(f"Starting pipeline: {PIPELINE_NAME} | run_id: {RUN_ID}")
        write_audit("RUNNING")

        # ── Step 1: Download source file from S3 ──
        try:
            response = s3.get_object(Bucket="raw-data-lake", Key="events/2024/01/01/events.json")
            data     = response["Body"].read()
            logger.info(f"Downloaded {len(data):,} bytes from S3")

        except ClientError as e:
            code = e.response["Error"]["Code"]

            if code == "NoSuchKey":
                logger.warning("Source file not found — skipping run (may be first run).")
                write_audit("SKIPPED", "Source file not found")
                return  # graceful exit — not a failure

            elif code in RETRYABLE:
                logger.error(f"Transient S3 error: {code} — should be handled by boto3 retry config")
                raise  # re-raise — boto3 adaptive mode should have already retried

            else:
                raise  # unexpected error — propagate to outer handler

        # ── Step 2: Transform and write output ──
        # (your Spark / Pandas / business logic here)
        rows_processed = 1_000_000   # example

        # ── Step 3: Success ──
        write_audit("SUCCEEDED", rows=rows_processed)
        publish_metric("RowsProcessed", rows_processed)
        publish_metric("PipelineSuccess", 1)
        logger.info(f"✅ Pipeline succeeded. Rows: {rows_processed:,}")

    except NoCredentialsError:
        # Without credentials the audit, alert, and metric calls would fail as well,
        # so log locally and let the orchestrator surface the failure.
        msg = "No AWS credentials found. Check IAM role configuration."
        logger.critical(msg)
        raise

    except ClientError as e:
        code = e.response["Error"]["Code"]
        msg  = f"AWS ClientError [{code}]: {e.response['Error']['Message']}"
        logger.error(msg)
        write_audit("FAILED", msg)
        publish_alert(f"FAILED: {PIPELINE_NAME} — {code}", msg)
        publish_metric("PipelineFailure", 1)
        raise

    except Exception as e:
        msg = f"Unexpected error: {type(e).__name__}: {str(e)}"
        logger.exception(msg)
        write_audit("FAILED", msg)
        publish_alert(f"FAILED: {PIPELINE_NAME} — Unexpected Error", msg)
        publish_metric("PipelineFailure", 1)
        raise


if __name__ == "__main__":
    run_pipeline()

☁️ What This Pattern Gives You

✅ Every run is recorded in DynamoDB (audit trail)
✅ Failures immediately trigger SNS alerts (ops team gets email/Slack)
✅ CloudWatch metrics power dashboards and alarms (PipelineFailure alarm → PagerDuty)
✅ Errors are classified — graceful skips vs real failures vs config errors
✅ Logs have run_id for correlation across all systems
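
The classification step can be isolated into a small pure function, which makes the decision logic unit-testable without touching AWS. The sketch below is one possible shape, not part of the template itself: `Action`, `classify_error_code`, and the two code sets are hypothetical names, and the sets simply mirror the codes used in the template rather than an exhaustive AWS list.

```python
from enum import Enum


class Action(Enum):
    SKIP = "skip"     # expected condition: exit gracefully, record SKIPPED
    RETRY = "retry"   # transient: worth another attempt
    FAIL = "fail"     # real failure: audit, alert, re-raise


# Assumption: these sets mirror the template above; extend them per service.
SKIPPABLE_CODES = frozenset({"NoSuchKey"})
RETRYABLE_CODES = frozenset({
    "ThrottlingException", "SlowDown", "ServiceUnavailableException",
    "RequestLimitExceeded", "InternalError",
})


def classify_error_code(code: str) -> Action:
    """Map an AWS error code (e.response["Error"]["Code"]) to a pipeline action."""
    if code in SKIPPABLE_CODES:
        return Action.SKIP
    if code in RETRYABLE_CODES:
        return Action.RETRY
    return Action.FAIL


for code in ("NoSuchKey", "SlowDown", "AccessDenied"):
    print(code, "->", classify_error_code(code).value)
```

Inside an `except ClientError as e:` block you would call `classify_error_code(e.response["Error"]["Code"])` and branch on the result.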

29.30.3

Paginators

Most AWS list and read APIs cap the number of results (or bytes) returned per call. Paginators are boto3's built-in mechanism for transparently walking through all pages. Prefer them to hand-written NextToken loops wherever the operation has one.

📄

Why Paginators Exist

Foundation ▼

The Problem — AWS Truncates Results

AWS APIs like list_objects_v2, get_tables, scan return at most a fixed number of items per call (e.g. S3 returns max 1000 objects per call). If there are more, AWS returns a continuation token (NextToken, NextMarker, ContinuationToken) that you must pass in the next call to get the next page.

🧠 Analogy

Think of Google Search results — you see page 1, then click "Next" to get page 2. AWS is the same. Paginators are like an automatic "click Next" loop that collects every page for you without you writing that loop manually.

⚠️ The Bug That Bites Everyone

If you call s3.list_objects_v2(Bucket="my-bucket") and get back 1000 objects, you might think that's all of them. But if the bucket has 50,000 objects, you silently missed 49,000. Paginators prevent this silent truncation bug.

python — wrong vs right way

import boto3
s3 = boto3.client("s3", region_name="us-east-1")

# ❌ WRONG — silently misses objects beyond the first 1000
response = s3.list_objects_v2(Bucket="my-data-lake", Prefix="bronze/events/")
objects  = response.get("Contents", [])
print(f"Found {len(objects)} objects")   # might say 1000, but real count is 50,000

# ✅ CORRECT — paginator walks all pages automatically
paginator = s3.get_paginator("list_objects_v2")
pages     = paginator.paginate(Bucket="my-data-lake", Prefix="bronze/events/")

all_objects = []
for page in pages:
    all_objects.extend(page.get("Contents", []))

print(f"Found {len(all_objects)} objects")   # correct: 50,000

How Paginators Work Internally

get_paginator("operation_name") returns a Paginator object tied to that API operation. Calling .paginate(**kwargs) on it returns a PageIterator — a lazy iterator that makes one API call per page, automatically passing the continuation token between calls.

python — paginator anatomy

import boto3
s3 = boto3.client("s3", region_name="us-east-1")

# Step 1 — get a Paginator for a specific operation
paginator = s3.get_paginator("list_objects_v2")
# paginator knows: which field is the token, what the max page size is, etc.

# Step 2 — call paginate() with the same args you'd pass to the API
page_iterator = paginator.paginate(
    Bucket    = "my-data-lake",
    Prefix    = "bronze/events/2024/",
    PaginationConfig = {
        "MaxItems"  : 5000,   # total items to return across ALL pages (optional cap)
        "PageSize"  : 500,    # items per individual page / API call
        "StartingToken": None # resume from a specific page (advanced use)
    }
)

# Step 3 — iterate over pages (each page = one API call)
total = 0
for page in page_iterator:
    items = page.get("Contents", [])
    total += len(items)
    print(f"  Page had {len(items)} items")

print(f"Total objects found: {total}")
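
Conceptually, the PageIterator runs a loop like the one sketched below. This is a simplified illustration against a hypothetical in-memory API, not botocore's code: `fake_list_objects` and `paginate` are made-up names, and the integer-as-string token is an assumption chosen to keep the example readable.

```python
from typing import Iterator, Optional

# Hypothetical in-memory stand-in for a truncating list API (not a boto3 call).
_FAKE_KEYS = [f"bronze/events/file_{i:04d}.parquet" for i in range(2500)]


def fake_list_objects(max_keys: int = 1000,
                      continuation_token: Optional[str] = None) -> dict:
    """Return one page of results plus a token when more results remain."""
    start = int(continuation_token) if continuation_token else 0
    chunk = _FAKE_KEYS[start:start + max_keys]
    page = {"Contents": [{"Key": k} for k in chunk], "IsTruncated": False}
    if start + max_keys < len(_FAKE_KEYS):
        page["IsTruncated"] = True
        page["NextContinuationToken"] = str(start + max_keys)
    return page


def paginate(max_keys: int = 1000) -> Iterator[dict]:
    """Yield pages lazily, feeding each response's token into the next request."""
    token = None
    while True:
        page = fake_list_objects(max_keys=max_keys, continuation_token=token)
        yield page
        token = page.get("NextContinuationToken")
        if not token:
            break


page_sizes = [len(p["Contents"]) for p in paginate()]
print(page_sizes)   # [1000, 1000, 500]
```

Because `paginate` is a generator, only one page is held in memory at a time, which is the same property that makes boto3 paginators safe on very large listings.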

🔍

JMESPath Search — Filtering Across Pages

Power Feature ▼

page_iterator.search() with JMESPath

Instead of looping over pages and manually extracting fields, you can use .search(jmespath_expression) on a PageIterator. It applies a JMESPath query across all pages and yields matching values lazily — no need to collect all pages into memory first.

🧠 Analogy

JMESPath is like XPath for JSON. Contents[].Key means "from the Contents array in each page, give me the Key field of every item." It's a mini query language for navigating nested JSON structures.

python — JMESPath search on paginator

import boto3
s3 = boto3.client("s3", region_name="us-east-1")

paginator     = s3.get_paginator("list_objects_v2")
page_iterator = paginator.paginate(Bucket="my-data-lake", Prefix="bronze/")

# ── Extract only the Keys (file paths) from all pages ──
# JMESPath: "Contents[].Key" → from Contents array, get Key field of each item
all_keys = list(page_iterator.search("Contents[].Key"))
print(f"All S3 keys: {all_keys[:5]}")   # e.g. ['bronze/events/2024/01/file1.parquet', ...]

# ── Filter: only .parquet files, get their keys and sizes ──
page_iterator2 = paginator.paginate(Bucket="my-data-lake", Prefix="bronze/")

parquet_info = list(page_iterator2.search(
    "Contents[?ends_with(Key, '.parquet')].[Key, Size]"
))
# Returns list of [key, size] pairs for .parquet files only
for key, size in parquet_info:
    print(f"  {key}  ({size:,} bytes)")

# ── Get keys larger than 100MB ──
page_iterator3 = paginator.paginate(Bucket="my-data-lake", Prefix="bronze/")
large_files = list(page_iterator3.search("Contents[?Size > `104857600`].Key"))
print(f"Files > 100MB: {large_files}")

📌 JMESPath Quick Reference

Contents[].Key — all Keys from Contents array
Contents[?Size > `1000`].Key — filter by condition
Contents[?ends_with(Key, '.parquet')] — ends_with filter
Contents[].{k: Key, s: Size} — project to new shape
length(Contents[]) — count items (with .search(), this yields one count per page; sum them for a total)
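
If JMESPath syntax is new to you, it can help to see what each expression in the quick reference does to a single page. The sketch below expresses the same selections as plain Python comprehensions over a made-up sample page; the keys and sizes are invented for illustration.

```python
# One hypothetical page of a list_objects_v2 response (sample data, not real output).
page = {
    "Contents": [
        {"Key": "bronze/a.parquet", "Size": 150_000_000},
        {"Key": "bronze/b.csv",     "Size": 2_048},
        {"Key": "bronze/c.parquet", "Size": 512},
    ]
}
contents = page.get("Contents", [])   # the key is absent when a page has no objects

# Contents[].Key
keys = [obj["Key"] for obj in contents]

# Contents[?Size > `1000`].Key
big_keys = [obj["Key"] for obj in contents if obj["Size"] > 1000]

# Contents[?ends_with(Key, '.parquet')].[Key, Size]
parquet_pairs = [[obj["Key"], obj["Size"]] for obj in contents
                 if obj["Key"].endswith(".parquet")]

# Contents[].{k: Key, s: Size}
projected = [{"k": obj["Key"], "s": obj["Size"]} for obj in contents]

# length(Contents[])
count = len(contents)

print(keys, big_keys, parquet_pairs, projected, count, sep="\n")
```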

☁️

Paginators by Service

Reference ▼

S3 — list_objects_v2

The most common paginator in data engineering. Used to list all files in a bucket prefix — essential for building file inventories, finding new files to process, or checking what's landed in a landing zone.

python — S3 paginator

import boto3
from datetime import datetime, timezone, timedelta

s3        = boto3.client("s3", region_name="us-east-1")
paginator = s3.get_paginator("list_objects_v2")

# ── List all objects in a prefix ──
def list_all_s3_objects(bucket: str, prefix: str) -> list[dict]:
    """Return list of all S3 object metadata dicts under prefix."""
    pages   = paginator.paginate(Bucket=bucket, Prefix=prefix)
    objects = []
    for page in pages:
        objects.extend(page.get("Contents", []))
    return objects

# ── List only files modified in the last N hours (incremental check) ──
def list_recent_files(bucket: str, prefix: str, hours: int = 24) -> list[str]:
    cutoff    = datetime.now(timezone.utc) - timedelta(hours=hours)
    pages     = paginator.paginate(Bucket=bucket, Prefix=prefix)
    new_files = []
    for page in pages:
        for obj in page.get("Contents", []):
            if obj["LastModified"] >= cutoff:
                new_files.append(obj["Key"])
    return new_files

# ── Usage ──
all_files   = list_all_s3_objects("my-data-lake", "bronze/events/")
print(f"Total files: {len(all_files)}")

recent = list_recent_files("my-data-lake", "bronze/events/", hours=6)
print(f"Files landed in last 6 hours: {len(recent)}")

# ── List common prefixes (pseudo-folders) — useful for partition discovery ──
folder_paginator = s3.get_paginator("list_objects_v2")
folder_pages     = folder_paginator.paginate(
    Bucket    = "my-data-lake",
    Prefix    = "bronze/events/",
    Delimiter = "/"   # treat / as folder separator
)
partitions = []
for page in folder_pages:
    for prefix_obj in page.get("CommonPrefixes", []):
        partitions.append(prefix_obj["Prefix"])
print(f"Partition folders: {partitions}")
# e.g. ['bronze/events/year=2023/', 'bronze/events/year=2024/']  (S3 lists prefixes in lexicographic order)
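
Once you have the partition prefixes, you usually want the partition values themselves. The helper below is a small sketch for Hive-style `key=value` prefixes; `parse_partition_prefix` is a hypothetical name and the sample paths are made up.

```python
def parse_partition_prefix(prefix: str, table_root: str) -> dict[str, str]:
    """Turn 'root/year=2024/month=01/' into {'year': '2024', 'month': '01'}.

    Segments that are not key=value pairs are ignored.
    """
    if not prefix.startswith(table_root):
        raise ValueError(f"{prefix!r} is not under {table_root!r}")
    relative = prefix[len(table_root):].strip("/")
    values: dict[str, str] = {}
    for segment in relative.split("/"):
        if "=" not in segment:
            continue
        name, _, value = segment.partition("=")
        values[name] = value
    return values


print(parse_partition_prefix("bronze/events/year=2024/month=01/", "bronze/events/"))
# {'year': '2024', 'month': '01'}
```

Values come back as strings, exactly as they appear in the path; cast them to integers or dates in the caller if the pipeline needs typed comparisons.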

Glue — get_tables and get_partitions

Used to list all tables in a Glue database, or all partitions of a table. Essential for metadata-driven pipelines that discover tables dynamically rather than hardcoding table names.

python — Glue paginators

import boto3
glue = boto3.client("glue", region_name="us-east-1")

# ── List all tables in a Glue database ──
def list_glue_tables(database: str) -> list[str]:
    paginator  = glue.get_paginator("get_tables")
    pages      = paginator.paginate(DatabaseName=database)
    table_names = []
    for page in pages:
        for table in page["TableList"]:
            table_names.append(table["Name"])
    return table_names

tables = list_glue_tables("bronze_db")
print(f"Tables in bronze_db: {tables}")

# ── List all partitions of a table ──
def list_glue_partitions(database: str, table: str) -> list[dict]:
    paginator  = glue.get_paginator("get_partitions")
    pages      = paginator.paginate(DatabaseName=database, TableName=table)
    partitions = []
    for page in pages:
        partitions.extend(page["Partitions"])
    return partitions

parts = list_glue_partitions("bronze_db", "events")
print(f"Partition count: {len(parts)}")
for p in parts[:3]:
    print(f"  Values: {p['Values']} | Location: {p['StorageDescriptor']['Location']}")

# ── List all Glue databases ──
db_paginator = glue.get_paginator("get_databases")
all_dbs = list(db_paginator.paginate().search("DatabaseList[].Name"))
print(f"All databases: {all_dbs}")

Athena — get_query_results

Athena returns query results in pages of up to 1000 rows. To get all rows from a large result set you must paginate get_query_results. You also need to parse the ResultSet structure into a usable list of dicts.

python — Athena paginator + result parsing

import boto3, time
athena = boto3.client("athena", region_name="us-east-1")

def run_athena_query(sql: str, database: str, output_s3: str) -> list[dict]:
    """
    Run an Athena query, wait for it, paginate all results, return list of dicts.
    """
    # Step 1 — start query
    start = athena.start_query_execution(
        QueryString           = sql,
        QueryExecutionContext = {"Database": database},
        ResultConfiguration   = {"OutputLocation": output_s3}
    )
    query_id = start["QueryExecutionId"]
    print(f"Started query: {query_id}")

    # Step 2 — poll until done (Athena has no built-in waiter)
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state  = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        elif state in ("FAILED", "CANCELLED"):
            reason = status["QueryExecution"]["Status"].get("StateChangeReason", "")
            raise RuntimeError(f"Athena query {state}: {reason}")
        time.sleep(2)   # poll every 2 seconds

    # Step 3 — paginate results
    paginator    = athena.get_paginator("get_query_results")
    pages        = paginator.paginate(QueryExecutionId=query_id)

    rows         = []
    column_names = None

    for page in pages:
        result_set = page["ResultSet"]
        if column_names is None:
            # First row of first page = header row with column names
            column_names = [
                col["VarCharValue"]
                for col in result_set["Rows"][0]["Data"]
            ]
            data_rows = result_set["Rows"][1:]   # skip header
        else:
            data_rows = result_set["Rows"]

        for row in data_rows:
            values = [cell.get("VarCharValue", "") for cell in row["Data"]]
            rows.append(dict(zip(column_names, values)))

    print(f"Query returned {len(rows)} rows")
    return rows


# ── Usage ──
results = run_athena_query(
    sql       = "SELECT event_type, COUNT(*) as cnt FROM events GROUP BY 1 ORDER BY 2 DESC",
    database  = "bronze_db",
    output_s3 = "s3://my-athena-results/queries/"
)
for row in results[:5]:
    print(row)  # {'event_type': 'click', 'cnt': '1234567'}

EMR — list_steps and list_clusters

Used to audit running clusters and steps programmatically — useful for monitoring dashboards, cost tracking, and finding stuck or zombie jobs.

python — EMR paginators

import boto3
emr = boto3.client("emr", region_name="us-east-1")

# ── List all active clusters (RUNNING or WAITING) ──
def list_running_clusters() -> list[dict]:
    paginator = emr.get_paginator("list_clusters")
    pages     = paginator.paginate(ClusterStates=["RUNNING", "WAITING"])
    clusters  = []
    for page in pages:
        clusters.extend(page["Clusters"])
    return clusters

running = list_running_clusters()
for c in running:
    print(f"Cluster: {c['Name']} | ID: {c['Id']} | State: {c['Status']['State']}")

# ── List all steps for a cluster ──
def list_cluster_steps(cluster_id: str) -> list[dict]:
    paginator = emr.get_paginator("list_steps")
    pages     = paginator.paginate(ClusterId=cluster_id)
    steps     = []
    for page in pages:
        steps.extend(page["Steps"])
    return steps

steps = list_cluster_steps("j-XXXXXXXXXXXXX")
for s in steps:
    print(f"  Step: {s['Name']} | State: {s['Status']['State']}")

DynamoDB — scan and query

DynamoDB scan and query return up to 1MB of data per call. For large tables or audit tables with many records, you must paginate. Use query over scan whenever possible — scan reads the entire table, query uses the index.

python — DynamoDB paginators

import boto3
# The low-level client takes conditions as expression strings;
# the Key/Attr helpers in boto3.dynamodb.conditions are for boto3.resource tables.

dynamo = boto3.client("dynamodb", region_name="us-east-1")

# ── Paginated scan — read all items in audit table ──
def scan_all_items(table_name: str) -> list[dict]:
    paginator = dynamo.get_paginator("scan")
    pages     = paginator.paginate(TableName=table_name)
    items     = []
    for page in pages:
        items.extend(page["Items"])
    return items

all_runs = scan_all_items("pipeline_audit")
print(f"Total audit records: {len(all_runs)}")

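# ── Paginated query — get all runs for a specific pipeline ──
# Assumes pipeline_name is the table's partition key; if it is the key of a GSI, also pass IndexName.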
def query_pipeline_runs(table_name: str, pipeline_name: str) -> list[dict]:
    paginator = dynamo.get_paginator("query")
    pages     = paginator.paginate(
        TableName                 = table_name,
        KeyConditionExpression    = "pipeline_name = :pn",
        ExpressionAttributeValues = {":pn": {"S": pipeline_name}},
        ScanIndexForward          = False   # newest first
    )
    items = []
    for page in pages:
        items.extend(page["Items"])
    return items

runs = query_pipeline_runs("pipeline_audit", "s3_to_delta_bronze")
for run in runs[:5]:
    print(f"  run_id={run['run_id']['S']} | status={run['status']['S']}")

⚠️ Avoid scan() on Large Tables in Production

scan reads every item in the table — it costs RCUs proportional to table size. For production audit tables with millions of records, always design a GSI (Global Secondary Index) and use query instead.
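
Items returned by the low-level client carry DynamoDB type tags (`{"S": ...}`, `{"N": ...}`), which is why the code above reads `run['run_id']['S']`. The sketch below unwraps the most common tags into plain Python values. `unwrap_attribute` and `unwrap_item` are hypothetical helpers covering only a subset of types; boto3 ships `boto3.dynamodb.types.TypeDeserializer` for complete coverage.

```python
from decimal import Decimal


def unwrap_attribute(attr: dict):
    """Convert one typed attribute such as {"S": "x"} or {"N": "42"} to a Python value."""
    (tag, value), = attr.items()      # each attribute has exactly one type tag
    if tag == "S":
        return value
    if tag == "N":
        return Decimal(value)         # numbers arrive as strings; Decimal keeps precision
    if tag == "BOOL":
        return value
    if tag == "NULL":
        return None
    if tag == "L":
        return [unwrap_attribute(v) for v in value]
    if tag == "M":
        return {k: unwrap_attribute(v) for k, v in value.items()}
    raise ValueError(f"Unsupported DynamoDB type tag: {tag}")


def unwrap_item(item: dict) -> dict:
    """Unwrap every attribute of a DynamoDB item."""
    return {name: unwrap_attribute(attr) for name, attr in item.items()}


# Sample item shaped like the audit records written earlier (made-up values).
raw_item = {
    "run_id":         {"S": "run_1700000000"},
    "status":         {"S": "SUCCEEDED"},
    "rows_processed": {"N": "1000000"},
}
print(unwrap_item(raw_item))
```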

📋

Paginator Quick Reference

Cheatsheet ▼

All Common Paginators for Data Engineers

Service	Operation	Key Result Field	Token Field
S3	`list_objects_v2`	`Contents`	`ContinuationToken`
S3	`list_buckets`	`Buckets`	`ContinuationToken` (recent boto3 releases; older ones return every bucket in one call)
Glue	`get_tables`	`TableList`	`NextToken`
Glue	`get_partitions`	`Partitions`	`NextToken`
Glue	`get_databases`	`DatabaseList`	`NextToken`
Glue	`get_job_runs`	`JobRuns`	`NextToken`
Athena	`get_query_results`	`ResultSet.Rows`	`NextToken`
Athena	`list_query_executions`	`QueryExecutionIds`	`NextToken`
EMR	`list_clusters`	`Clusters`	`Marker`
EMR	`list_steps`	`Steps`	`Marker`
DynamoDB	`scan`	`Items`	`LastEvaluatedKey`
DynamoDB	`query`	`Items`	`LastEvaluatedKey`
SNS	`list_topics`	`Topics`	`NextToken`
Lambda	`list_functions`	`Functions`	`NextMarker`
IAM	`list_roles`	`Roles`	`Marker`

📌 Golden Rule

Every time you call a "list" or "get_*s" boto3 API, ask yourself: "Could there be more than one page of results?" If yes — use a paginator. In data engineering, the answer is almost always yes.

29.30.4

Waiters

Waiters are boto3's built-in polling mechanism — they repeatedly call an API until a resource reaches a desired state. Instead of writing your own while-loop with sleep(), you use a waiter and boto3 handles the polling, delay, and timeout for you.

⏳

Built-in Waiters

Foundation ▼

What Is a Waiter and How It Works

A waiter calls a specific AWS API on a fixed interval (e.g. every 15 seconds) and checks a field in the response. When that field matches the desired state (e.g. State = "running"), the waiter returns. If the field matches a failure state, it raises WaiterError. If max attempts are exhausted, it also raises WaiterError.

🧠 Analogy

A waiter is like a flight tracker app that refreshes every 30 seconds and beeps when your flight status changes to "Landed." You don't have to keep hitting refresh manually — it polls for you and alerts you when the condition is met.
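
The mechanics described above fit in a few lines. The sketch below is a conceptual model only, not botocore's waiter code: `wait_until` and `WaitFailed` are hypothetical names, and the state sequence is simulated so the example runs without AWS.

```python
import time
from typing import Callable, Collection


class WaitFailed(Exception):
    """Raised on a failure state or when the attempt budget is exhausted."""


def wait_until(get_state: Callable[[], str],
               success: Collection[str],
               failure: Collection[str],
               delay: float = 15,
               max_attempts: int = 40,
               sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll get_state() at a fixed interval until a success or failure state appears."""
    for attempt in range(1, max_attempts + 1):
        state = get_state()
        if state in success:
            return state
        if state in failure:
            raise WaitFailed(f"reached failure state: {state}")
        if attempt < max_attempts:
            sleep(delay)
    raise WaitFailed(f"max attempts ({max_attempts}) exceeded")


# Simulated status API: three polls, the third reports RUNNING.
states = iter(["STARTING", "BOOTSTRAPPING", "RUNNING"])
final_state = wait_until(
    get_state=lambda: next(states),
    success={"RUNNING", "WAITING"},
    failure={"TERMINATED", "TERMINATED_WITH_ERRORS"},
    delay=0,
    sleep=lambda seconds: None,   # no real sleeping in the demo
)
print(final_state)   # RUNNING
```

Both exits on the unhappy path raise the same exception type, which matches how a built-in waiter raises `WaiterError` for a failure state and for exhausted attempts alike.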

python — basic waiter pattern

import boto3
from botocore.exceptions import WaiterError

s3 = boto3.client("s3", region_name="us-east-1")

# ── Get a waiter by name ──
waiter = s3.get_waiter("bucket_exists")

# ── Wait until the S3 bucket exists ──
# Polls s3.head_bucket() every 5s, up to 20 attempts (100s total)
try:
    waiter.wait(Bucket="my-new-data-lake-bucket")
    print("✅ Bucket is now available")
except WaiterError as e:
    print(f"❌ Waiter failed: {e}")
    # Either bucket never appeared, or an error occurred during polling

📌 How to Discover Available Waiters

client.waiter_names lists all built-in waiters for that service client.
Example: print(s3.waiter_names) → ['bucket_exists', 'bucket_not_exists', 'object_exists', 'object_not_exists']

S3 Waiters — bucket_exists, object_exists

S3 waiters are used to confirm that a bucket or object exists before proceeding — essential in pipelines where an upstream step creates a file and a downstream step needs to read it.

python — S3 waiters

import boto3
from botocore.exceptions import WaiterError
# Note: botocore has no WaiterConfig class. wait() accepts a plain dict through the
# WaiterConfig keyword, with the keys "Delay" (seconds) and "MaxAttempts".

s3 = boto3.client("s3", region_name="us-east-1")

# ── Wait for a bucket to exist ──
bucket_waiter = s3.get_waiter("bucket_exists")
bucket_waiter.wait(
    Bucket = "my-data-lake-bucket",
    WaiterConfig = {"Delay": 5, "MaxAttempts": 12}  # 5s × 12 = 60s max
)
print("Bucket exists, proceeding")

# ── Wait for a specific object (file) to appear ──
object_waiter = s3.get_waiter("object_exists")
try:
    object_waiter.wait(
        Bucket = "my-data-lake-bucket",
        Key    = "bronze/events/2024/01/01/events.parquet",
        WaiterConfig = {"Delay": 10, "MaxAttempts": 30}  # 10s × 30 = 5min max
    )
    print("✅ File has landed — safe to read")
except WaiterError:
    raise RuntimeError("File did not appear within 5 minutes — upstream job may have failed")

# ── Wait for object to be DELETED (useful for cleanup verification) ──
delete_waiter = s3.get_waiter("object_not_exists")
delete_waiter.wait(
    Bucket = "my-data-lake-bucket",
    Key    = "temp/staging/scratch_file.csv"
)
print("Temp file is gone")

EMR Waiters — cluster_running, step_complete

EMR has built-in waiters for cluster and step states. These are the most commonly used waiters in data pipelines that run Spark jobs on EMR — you submit a step and then wait for it to complete before writing the audit record or triggering the next step.

python — EMR waiters

import boto3
from botocore.exceptions import WaiterError
# Polling settings go to wait() as a plain dict: WaiterConfig={"Delay": seconds, "MaxAttempts": n}

emr = boto3.client("emr", region_name="us-east-1")

# ── Step 1: Start a cluster ──
cluster = emr.run_job_flow(
    Name         = "bronze-ingestion-cluster",
    ReleaseLabel = "emr-6.15.0",
    Instances    = {
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType" : "m5.xlarge",
        "InstanceCount"     : 3,
        "KeepJobFlowAliveWhenNoSteps": True   # cluster stays up until terminated explicitly
    },
    Applications  = [{"Name": "Spark"}],
    JobFlowRole   = "EMR_EC2_DefaultRole",
    ServiceRole   = "EMR_DefaultRole"
    # run_job_flow has no AutoTerminate parameter; KeepJobFlowAliveWhenNoSteps controls this
)
cluster_id = cluster["JobFlowId"]
print(f"Cluster launched: {cluster_id}")

# ── Step 2: Wait for cluster to be RUNNING ──
cluster_waiter = emr.get_waiter("cluster_running")
try:
    cluster_waiter.wait(
        ClusterId    = cluster_id,
        WaiterConfig = {"Delay": 30, "MaxAttempts": 40}  # 30s × 40 = 20min max
    )
    print("✅ Cluster is running")
except WaiterError as e:
    raise RuntimeError(f"Cluster failed to start: {e}")

# ── Step 3: Submit a Spark step ──
step_response = emr.add_job_flow_steps(
    JobFlowId = cluster_id,
    Steps     = [{
        "Name"            : "run-bronze-etl",
        "ActionOnFailure" : "CONTINUE",
        "HadoopJarStep"   : {
            "Jar" : "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "s3://my-scripts/bronze_etl.py",
                "--date", "2024-01-01"
            ]
        }
    }]
)
step_id = step_response["StepIds"][0]
print(f"Step submitted: {step_id}")

# ── Step 4: Wait for step to COMPLETE ──
step_waiter = emr.get_waiter("step_complete")
try:
    step_waiter.wait(
        ClusterId    = cluster_id,
        StepId       = step_id,
        WaiterConfig = {"Delay": 30, "MaxAttempts": 120}  # 30s × 120 = 60min max
    )
    print("✅ Spark step completed successfully")
except WaiterError as e:
    # Check if it FAILED or just timed out
    step_info  = emr.describe_step(ClusterId=cluster_id, StepId=step_id)
    step_state = step_info["Step"]["Status"]["State"]
    raise RuntimeError(f"Step ended in state: {step_state} — Error: {e}")

# ── Step 5: Terminate cluster ──
emr.terminate_job_flows(JobFlowIds=[cluster_id])
print("Cluster termination initiated")

Lambda Waiters — function_active, function_updated

After deploying a new Lambda function or updating its code, it takes a few seconds to become active. These waiters ensure the function is ready before you invoke it — important in CI/CD pipelines that deploy and then immediately test the function.

python — Lambda waiters

import boto3

lam = boto3.client("lambda", region_name="us-east-1")

# ── Deploy new function code ──
lam.update_function_code(
    FunctionName = "my-pipeline-trigger",
    S3Bucket     = "my-deploy-bucket",
    S3Key        = "lambdas/pipeline_trigger_v2.zip"
)
print("Code update submitted")

# ── Wait for update to complete before invoking ──
# WaiterConfig is a plain dict with the keys "Delay" and "MaxAttempts"
update_waiter = lam.get_waiter("function_updated")
update_waiter.wait(
    FunctionName = "my-pipeline-trigger",
    WaiterConfig = {"Delay": 5, "MaxAttempts": 20}   # 5s × 20 = 100s max
)
print("✅ Lambda function updated and ready")

# ── Now safe to invoke ──
response = lam.invoke(
    FunctionName   = "my-pipeline-trigger",
    InvocationType = "RequestResponse",
    Payload        = b'{"env": "prod", "date": "2024-01-01"}'
)
print(f"Lambda response status: {response['StatusCode']}")

⚙️

Waiter Config — delay and max_attempts

Configuration ▼

Customising Waiter Behaviour

Every waiter has a default delay (seconds between polls) and max_attempts (how many times to poll before giving up). The defaults are often too short for slow operations like EMR cluster startup. Override them per call by passing a WaiterConfig dict with the keys Delay and MaxAttempts to wait(), sized to your operation's expected duration.

python — WaiterConfig customisation

import boto3
from botocore.exceptions import WaiterError

emr = boto3.client("emr", region_name="us-east-1")

# ── Default waiter config (often too short for EMR) ──
# cluster_running default: delay=30s, max_attempts=60 → 30 minutes max
# step_complete   default: delay=30s, max_attempts=60 → 30 minutes max

# ── Override for a long-running Spark job ──
# WaiterConfig is a plain dict passed to wait(); botocore has no WaiterConfig class.
long_job_config = {
    "Delay"       : 60,    # poll every 60 seconds
    "MaxAttempts" : 120    # max 120 polls = 120 minutes (2 hours) max wait
}

step_waiter = emr.get_waiter("step_complete")
try:
    step_waiter.wait(
        ClusterId    = "j-XXXXXXXXXXXXX",
        StepId       = "s-XXXXXXXXXXXXX",
        WaiterConfig = long_job_config
    )
    print("✅ Step complete within 2 hours")
except WaiterError:
    # Raised both when the step fails/cancels and when the attempts run out
    print("❌ Step failed or did not complete within 2 hours; investigate")

# ── Fast config for quick operations like S3 object check ──
fast_config = {"Delay": 3, "MaxAttempts": 20}  # 3s × 20 = 60s max

s3 = boto3.client("s3", region_name="us-east-1")
s3.get_waiter("object_exists").wait(
    Bucket       = "my-bucket",
    Key          = "trigger/ready.flag",
    WaiterConfig = fast_config
)

Waiter	Default delay	Default max_attempts	Recommended override
s3 bucket_exists	5s	20	delay=5, max=12 (60s)
s3 object_exists	5s	20	delay=10, max=30 (5min)
emr cluster_running	30s	60	delay=30, max=40 (20min)
emr step_complete	30s	60	delay=60, max=120 (2hr)
lambda function_active	5s	60	delay=5, max=20 (100s)
lambda function_updated	5s	60	delay=5, max=20 (100s)
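
The overrides in the table all follow the same arithmetic: maximum wait = Delay × MaxAttempts. If you think in terms of a timeout instead, a tiny helper can derive the dict that `wait()` accepts through its `WaiterConfig` keyword. `waiter_config` is a hypothetical helper name, not part of boto3.

```python
import math


def waiter_config(timeout_seconds: int, delay_seconds: int) -> dict:
    """Build a WaiterConfig dict whose total wait covers timeout_seconds."""
    if timeout_seconds <= 0 or delay_seconds <= 0:
        raise ValueError("timeout_seconds and delay_seconds must be positive")
    return {
        "Delay": delay_seconds,
        "MaxAttempts": math.ceil(timeout_seconds / delay_seconds),  # round up so the timeout is covered
    }


print(waiter_config(7200, 60))   # {'Delay': 60, 'MaxAttempts': 120}
print(waiter_config(100, 30))    # {'Delay': 30, 'MaxAttempts': 4}
```

You would then call, for example, `step_waiter.wait(ClusterId=..., StepId=..., WaiterConfig=waiter_config(7200, 60))`.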

🔧

Services Without Built-in Waiters — Manual Polling

Important ▼

Glue and Athena Have No Built-in Waiters

Two of the most commonly used services in data engineering — Glue and Athena — do not have built-in boto3 waiters. You must write your own polling loop. The pattern is always the same: call the status API, check the state field, sleep, repeat until terminal state.

⚠️ Common Mistake

Many engineers call glue.start_job_run() and assume the job is done when the call returns. It isn't — start_job_run is asynchronous. You must poll get_job_run() until the state is SUCCEEDED or FAILED.

python — manual Glue job waiter

import boto3, time
from botocore.exceptions import ClientError

glue = boto3.client("glue", region_name="us-east-1")

# Terminal states: stop polling when we hit one of these
GLUE_TERMINAL_STATES = {"SUCCEEDED", "FAILED", "ERROR", "TIMEOUT", "STOPPED", "EXPIRED"}

def wait_for_glue_job(
    job_name    : str,
    run_id      : str,
    poll_interval: int = 30,    # seconds between polls
    max_wait    : int  = 7200   # maximum total wait time in seconds (2 hours)
) -> str:
    """
    Poll Glue job until terminal state. Returns final JobRunState string.
    Raises RuntimeError if the job ends in a non-success state, TimeoutError if max_wait elapses.
    """
    elapsed = 0
    attempt = 0

    while elapsed < max_wait:
        attempt += 1
        try:
            response = glue.get_job_run(JobName=job_name, RunId=run_id)
            state    = response["JobRun"]["JobRunState"]
            duration = response["JobRun"].get("ExecutionTime", 0)

            print(f"  [{attempt}] Glue job '{job_name}' state: {state} "
                  f"(elapsed: {elapsed}s, execution: {duration}s)")

            if state in GLUE_TERMINAL_STATES:
                if state == "SUCCEEDED":
                    print(f"✅ Glue job completed successfully in {duration}s")
                    return state
                else:
                    error_msg = response["JobRun"].get("ErrorMessage", "No error message")
                    raise RuntimeError(
                        f"❌ Glue job '{job_name}' ended with state '{state}': {error_msg}"
                    )

        except ClientError as e:
            code = e.response["Error"]["Code"]
            if code in ("ThrottlingException", "ServiceUnavailableException"):
                print(f"  Throttled during polling — will retry after sleep")
            else:
                raise

        time.sleep(poll_interval)
        elapsed += poll_interval

    raise TimeoutError(
        f"Glue job '{job_name}' (run {run_id}) did not reach a terminal state "
        f"within {max_wait}s; the job may still be running"
    )


# ── Full usage pattern ──
job_name = "bronze-events-etl"

# Start the job
run_response = glue.start_job_run(
    JobName   = job_name,
    Arguments = {
        "--date"       : "2024-01-01",
        "--env"        : "prod",
        "--output_path": "s3://my-data-lake/bronze/events/"
    }
)
run_id = run_response["JobRunId"]
print(f"Started Glue job run: {run_id}")

# Wait for it to finish
final_state = wait_for_glue_job(job_name, run_id, poll_interval=30, max_wait=3600)
print(f"Final state: {final_state}")

Manual Athena Query Waiter

Athena query execution is also asynchronous — start_query_execution returns immediately with a query ID. You must poll get_query_execution to know when it finishes. Here is the production-grade pattern with proper error handling and backoff.

python — manual Athena waiter

import boto3, time
from botocore.exceptions import ClientError

athena = boto3.client("athena", region_name="us-east-1")

ATHENA_TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}

def wait_for_athena_query(
    query_execution_id: str,
    poll_interval     : int = 2,     # start with 2s polls
    max_wait          : int = 1800   # 30 minutes max
) -> dict:
    """
    Poll Athena query until terminal state.
    Returns the full QueryExecution dict on success.
    Raises RuntimeError on FAILED/CANCELLED, TimeoutError if max_wait elapses.
    """
    elapsed  = 0
    attempt  = 0
    interval = poll_interval   # will increase with backoff

    while elapsed < max_wait:
        attempt += 1
        try:
            response   = athena.get_query_execution(QueryExecutionId=query_execution_id)
            execution  = response["QueryExecution"]
            state      = execution["Status"]["State"]

            print(f"  [{attempt}] Athena query state: {state} (elapsed: {elapsed:.0f}s)")

            if state in ATHENA_TERMINAL_STATES:
                if state == "SUCCEEDED":
                    stats = execution.get("Statistics", {})
                    print(f"✅ Query succeeded | "
                          f"Scanned: {stats.get('DataScannedInBytes',0)/1e6:.1f} MB | "
                          f"Runtime: {stats.get('TotalExecutionTimeInMillis',0)/1000:.1f}s")
                    return execution
                else:
                    reason = execution["Status"].get("StateChangeReason", "No reason given")
                    raise RuntimeError(
                        f"❌ Athena query {state}: {reason} "
                        f"(QueryExecutionId: {query_execution_id})"
                    )

        except ClientError as e:
            code = e.response["Error"]["Code"]
            if code == "ThrottlingException":
                interval = min(interval * 2, 30)   # back off on throttle, cap at 30s
                print(f"  Throttled — backing off to {interval}s")
            else:
                raise

        time.sleep(interval)
        elapsed += interval
        # Gradually increase poll interval to reduce API calls for long queries
        if elapsed > 30 and interval < 10:
            interval = 10
        elif elapsed > 120 and interval < 20:
            interval = 20

    raise TimeoutError(
        f"Athena query {query_execution_id} did not complete within {max_wait}s"
    )


# ── Full Athena workflow ──
start = athena.start_query_execution(
    QueryString           = """
        SELECT  date_trunc('hour', event_time) AS hour,
                event_type,
                COUNT(*)                       AS cnt
        FROM    bronze_db.events
        WHERE   year = '2024' AND month = '01'
        GROUP BY 1, 2
        ORDER BY 1, 3 DESC
    """,
    QueryExecutionContext = {"Database": "bronze_db"},
    ResultConfiguration   = {"OutputLocation": "s3://my-athena-results/queries/"}
)
query_id = start["QueryExecutionId"]
print(f"Athena query started: {query_id}")

execution = wait_for_athena_query(query_id, poll_interval=2, max_wait=900)
print(f"Query complete — now fetching results...")
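The workflow above stops at "now fetching results". A minimal sketch of that last step follows, assuming a SELECT query and the standard get_query_results paginator. The helper name iter_athena_rows is hypothetical, and every value comes back as a string (or None for NULL), so cast as needed.

```python
from typing import Any, Dict, Iterator, List


def iter_athena_rows(athena_client: Any, query_execution_id: str) -> Iterator[Dict[str, Any]]:
    """Yield each result row of a finished SELECT query as {column_name: value}.

    Assumption: for SELECT queries the first row of the first page repeats
    the column names, so that row is skipped.
    """
    paginator = athena_client.get_paginator("get_query_results")
    columns: List[str] = []
    first_page = True

    for page in paginator.paginate(QueryExecutionId=query_execution_id):
        result_set = page["ResultSet"]
        if not columns:
            columns = [c["Name"] for c in result_set["ResultSetMetadata"]["ColumnInfo"]]

        rows = result_set["Rows"]
        if first_page:
            rows = rows[1:]          # drop the header row
            first_page = False

        for row in rows:
            # NULL cells arrive as an empty dict with no VarCharValue key
            values = [cell.get("VarCharValue") for cell in row["Data"]]
            yield dict(zip(columns, values))


# Usage with the client and query_id from the workflow above:
#   for row in iter_athena_rows(athena, query_id):
#       print(row["hour"], row["event_type"], int(row["cnt"]))
```

For large result sets it is usually cheaper to read the CSV that Athena writes to the OutputLocation than to page through the API.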

🏗️

Custom Waiter with WaiterModel

Advanced ▼

Building a Custom Waiter for Any Service

botocore, the library underneath boto3, lets you define a custom waiter with WaiterModel: a JSON-like config that tells it which API to call, which field to inspect, and which values count as success or failure. This is a clean way to build reusable waiters for services like Glue that don't ship built-in ones.

python — custom WaiterModel for Glue

import boto3
from botocore.waiter import WaiterModel, create_waiter_with_client
from botocore.exceptions import WaiterError

glue = boto3.client("glue", region_name="us-east-1")

# ── Define the custom waiter model ──
# This tells boto3: call GetJobRun, look at JobRun.JobRunState,
# succeed on SUCCEEDED, fail on FAILED/ERROR/TIMEOUT/STOPPED
glue_job_waiter_model = WaiterModel({
    "version" : 2,
    "waiters" : {
        "JobRunComplete": {
            "delay"      : 30,           # poll every 30 seconds
            "maxAttempts": 120,          # up to 120 attempts = 60 minutes
            "operation"  : "GetJobRun",  # which boto3 API to call
            "acceptors"  : [
                {
                    "matcher"  : "path",
                    "expected" : "SUCCEEDED",
                    "argument" : "JobRun.JobRunState",
                    "state"    : "success"    # waiter returns when this matches
                },
                {
                    "matcher"  : "path",
                    "expected" : "FAILED",
                    "argument" : "JobRun.JobRunState",
                    "state"    : "failure"    # waiter raises WaiterError when this matches
                },
                {
                    "matcher"  : "path",
                    "expected" : "ERROR",
                    "argument" : "JobRun.JobRunState",
                    "state"    : "failure"
                },
                {
                    "matcher"  : "path",
                    "expected" : "TIMEOUT",
                    "argument" : "JobRun.JobRunState",
                    "state"    : "failure"
                },
                {
                    "matcher"  : "path",
                    "expected" : "STOPPED",
                    "argument" : "JobRun.JobRunState",
                    "state"    : "failure"
                }
            ]
        }
    }
})

# ── Create the waiter from the model ──
glue_job_waiter = create_waiter_with_client(
    waiter_name   = "JobRunComplete",
    waiter_model  = glue_job_waiter_model,
    client        = glue
)

# ── Use it exactly like a built-in waiter ──
job_name = "bronze-events-etl"
run      = glue.start_job_run(JobName=job_name, Arguments={"--date": "2024-01-01"})
run_id   = run["JobRunId"]
print(f"Started Glue job: {run_id}")

try:
    glue_job_waiter.wait(JobName=job_name, RunId=run_id)
    print("✅ Glue job succeeded")
except WaiterError as e:
    # "from e" keeps the original waiter traceback attached to the new error
    raise RuntimeError(f"❌ Glue job failed or timed out: {e}") from e

✅ When to Use Custom WaiterModel vs Manual Loop

Use WaiterModel when you want a reusable, shareable waiter that behaves like a built-in — good for team libraries and frameworks.
Use a manual polling loop when you need dynamic logic — e.g. adaptive poll intervals, logging intermediate states, or writing progress to a DynamoDB audit table during the wait.

📋

Waiters Quick Reference

Cheatsheet ▼

All Waiters Relevant to Data Engineers

Service	Waiter Name	Polls API	Waits For
S3	`bucket_exists`	head_bucket	Bucket to appear
S3	`bucket_not_exists`	head_bucket	Bucket to be deleted
S3	`object_exists`	head_object	Object to appear
S3	`object_not_exists`	head_object	Object to be deleted
EMR	`cluster_running`	describe_cluster	Cluster in RUNNING or WAITING state
EMR	`cluster_terminated`	describe_cluster	Cluster terminated
EMR	`step_complete`	describe_step	Step COMPLETED
Lambda	`function_active`	get_function_configuration	Function Active state
Lambda	`function_updated`	get_function_configuration	Update propagated
Glue	❌ None built-in	get_job_run (manual)	SUCCEEDED / FAILED
Athena	❌ None built-in	get_query_execution (manual)	SUCCEEDED / FAILED
DynamoDB	`table_exists`	describe_table	Table ACTIVE
DynamoDB	`table_not_exists`	describe_table	Table deleted

📌 The Rule

Built-in waiter available → always use it.
No built-in waiter (Glue, Athena) → write a manual polling loop with proper terminal state checks and error handling, or build a WaiterModel for team reuse.
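Built-in waiters ship with their own delay and attempt limits, which are often too short for pipeline files that land late. The sketch below shows one way to override them through the WaiterConfig argument of wait(). The helper names waiter_config and wait_for_object are hypothetical, and the client is passed in so the function stays easy to test.

```python
import math
from typing import Any, Dict


def waiter_config(max_wait_s: int, delay_s: int) -> Dict[str, int]:
    """Translate a total time budget into the Delay/MaxAttempts pair waiters accept."""
    if max_wait_s <= 0 or delay_s <= 0:
        raise ValueError("max_wait_s and delay_s must be positive")
    return {"Delay": delay_s, "MaxAttempts": math.ceil(max_wait_s / delay_s)}


def wait_for_object(s3_client: Any, bucket: str, key: str,
                    max_wait_s: int = 600, delay_s: int = 10) -> None:
    """Block until the object exists; botocore raises WaiterError when attempts run out."""
    waiter = s3_client.get_waiter("object_exists")
    waiter.wait(
        Bucket=bucket,
        Key=key,
        WaiterConfig=waiter_config(max_wait_s, delay_s),
    )


# Usage (assumes an S3 client created elsewhere):
#   wait_for_object(s3, "my-data-lake", "landing/events.parquet", max_wait_s=1800, delay_s=30)
```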

29.30.5

S3 APIs

S3 is the backbone of every data lake. As a data engineer you will use S3 APIs daily — listing files, uploading data, downloading for processing, copying between locations, generating presigned URLs, and managing large files with multipart uploads. This section covers every S3 API you need in production.

🪣

Bucket Operations

Foundation ▼

create_bucket, list_buckets, delete_bucket, head_bucket

Bucket-level operations are used in infrastructure setup, CI/CD pipelines, and environment provisioning. head_bucket is the right way to check if a bucket exists — it's lightweight and doesn't list contents.

python — bucket operations

import boto3
from botocore.exceptions import ClientError

REGION = "us-east-1"
s3     = boto3.client("s3", region_name=REGION)

# ── Create a bucket ──
# Note: us-east-1 does NOT take CreateBucketConfiguration — all other regions do
def create_bucket(bucket_name: str, region: str = "us-east-1"):
    if region == "us-east-1":
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(
            Bucket                    = bucket_name,
            CreateBucketConfiguration = {"LocationConstraint": region}
        )
    print(f"✅ Created bucket: {bucket_name}")

create_bucket("my-data-lake-bronze")

# ── List all buckets in the account ──
response = s3.list_buckets()
for bucket in response["Buckets"]:
    print(f"  {bucket['Name']}  (created: {bucket['CreationDate'].date()})")

# ── Check if bucket exists (lightweight — does NOT list contents) ──
def bucket_exists(bucket_name: str) -> bool:
    try:
        s3.head_bucket(Bucket=bucket_name)
        return True
    except ClientError as e:
        code = e.response["Error"]["Code"]
        if code in ("404", "NoSuchBucket"):
            return False
        raise   # re-raise unexpected errors (e.g. access denied)

print(bucket_exists("my-data-lake-bronze"))   # True
print(bucket_exists("nonexistent-bucket"))    # False

# ── Delete a bucket (must be empty first) ──
s3.delete_bucket(Bucket="my-temp-bucket")
print("Bucket deleted")

Bucket Policy and Lifecycle Configuration

Bucket policies control access at the bucket level. Lifecycle rules automate storage class transitions and object expiry — critical for cost control in data lakes where raw/bronze data accumulates over time.

python — bucket policy and lifecycle

import boto3, json

s3 = boto3.client("s3", region_name="us-east-1")

# ── Set a bucket policy ──
# This policy allows a specific IAM role to read all objects
bucket_policy = {
    "Version"  : "2012-10-17",
    "Statement": [{
        "Sid"      : "AllowGlueRead",
        "Effect"   : "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:role/GlueETLRole"},
        "Action"   : ["s3:GetObject", "s3:ListBucket"],
        "Resource" : [
            "arn:aws:s3:::my-data-lake-bronze",
            "arn:aws:s3:::my-data-lake-bronze/*"
        ]
    }]
}
s3.put_bucket_policy(
    Bucket = "my-data-lake-bronze",
    Policy = json.dumps(bucket_policy)
)
print("Bucket policy applied")

# ── Read current policy ──
policy = s3.get_bucket_policy(Bucket="my-data-lake-bronze")
print(json.loads(policy["Policy"]))

# ── Set lifecycle rules — transition old data to cheaper storage ──
lifecycle_config = {
    "Rules": [
        {
            "ID"     : "bronze-archive-policy",
            "Status" : "Enabled",
            "Filter" : {"Prefix": "bronze/"},   # apply to bronze/ prefix only
            "Transitions": [
                {"Days": 30,  "StorageClass": "STANDARD_IA"},    # after 30 days → IA
                {"Days": 90,  "StorageClass": "GLACIER_IR"},     # after 90 days → Glacier IR
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},   # after 1 year → Deep Archive
            ],
            "Expiration": {"Days": 2555}   # delete after 7 years
        },
        {
            "ID"    : "delete-temp-files",
            "Status": "Enabled",
            "Filter": {"Prefix": "temp/"},
            "Expiration": {"Days": 3}      # temp files auto-deleted after 3 days
        }
    ]
}
s3.put_bucket_lifecycle_configuration(
    Bucket                  = "my-data-lake-bronze",
    LifecycleConfiguration  = lifecycle_config
)
print("Lifecycle rules applied")

# ── Enable versioning ──
s3.put_bucket_versioning(
    Bucket                  = "my-data-lake-bronze",
    VersioningConfiguration = {"Status": "Enabled"}
)
print("Versioning enabled")

📌 Lifecycle Best Practice for Data Lakes

Bronze (raw) data: transition to Standard-IA after 30 days, Glacier after 90 days.
Temp/staging data: expire after 3–7 days automatically.
Gold (curated) data: keep in Standard, no transition — it's queried frequently.

📦

Object Operations

Core ▼

upload_file vs put_object — When to Use Which

Both upload data to S3, but they work differently. upload_file is a high-level method backed by the S3 Transfer Manager: it automatically switches to multipart upload for large files, uploads parts concurrently, and retries failed parts. put_object is a low-level single HTTP PUT (hard limit 5 GB per request) that gets only botocore's standard request-level retries. Use it for small objects or when you need full control over metadata and content type.

🧠 Analogy

upload_file is like a courier service: it figures out the best way to ship your package (breaks it into pieces if it's too big, re-sends only the piece that got lost). put_object is like mailing a letter yourself: simple and direct, but best kept to small items, because any failure means sending the whole thing again.

python — upload_file vs put_object

import boto3, json
from botocore.exceptions import ClientError

s3 = boto3.client("s3", region_name="us-east-1")

# ── upload_file — USE THIS for most cases (especially large files) ──
# Automatically uses multipart upload for files > 8MB
# Has built-in retry logic and progress callback support
s3.upload_file(
    Filename    = "/local/path/events_2024_01_01.parquet",  # local file path
    Bucket      = "my-data-lake",
    Key         = "bronze/events/year=2024/month=01/day=01/events.parquet",
    ExtraArgs   = {
        "ContentType"        : "application/octet-stream",
        "ServerSideEncryption": "AES256",    # SSE-S3 encryption
        "Metadata"           : {
            "pipeline"  : "bronze-ingestion",
            "source"    : "kafka-events",
            "row_count" : "1000000"
        }
    }
)
print("✅ Large file uploaded with multipart automatically")

# ── put_object — USE FOR small objects, in-memory data, config files ──
# Single HTTP PUT, no multipart; a failed request is retried as a whole by botocore's standard retries
config_data = {"pipeline": "bronze-etl", "version": "2.1.0", "active": True}
s3.put_object(
    Bucket      = "my-data-lake",
    Key         = "config/pipeline_config.json",
    Body        = json.dumps(config_data).encode("utf-8"),
    ContentType = "application/json"
)
print("Config file written to S3")

# ── put_object for writing a string directly (e.g. SQL, manifest) ──
sql_query = "SELECT * FROM events WHERE year = '2024'"
s3.put_object(
    Bucket = "my-data-lake",
    Key    = "queries/daily_events.sql",
    Body   = sql_query.encode("utf-8"),
    ContentType = "text/plain"
)

# ── put_object for writing bytes from memory ──
import io, pandas as pd
df    = pd.DataFrame({"id": [1,2,3], "value": ["a","b","c"]})
buf   = io.BytesIO()
df.to_parquet(buf, index=False)
buf.seek(0)
s3.put_object(
    Bucket      = "my-data-lake",
    Key         = "temp/small_df.parquet",
    Body        = buf.read(),
    ContentType = "application/octet-stream"
)

Method	Source	Multipart	Retry	Best For
`upload_file()`	Local file path	✅ Auto	✅ Built-in	Large files (>8MB)
`upload_fileobj()`	File-like object	✅ Auto	✅ Built-in	Streaming / in-memory
`put_object()`	Bytes / string	❌ None	Request-level only (botocore)	Small objects <5MB
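The table lists upload_fileobj, but the examples above never call it. Here is a minimal sketch, assuming the data is a list of dicts that should land as gzipped JSON Lines. The helper name upload_records_gzip and the "application/gzip" content type are illustrative choices, not part of the original pipeline.

```python
import gzip
import io
import json
from typing import Any, Dict, Iterable


def upload_records_gzip(s3_client: Any, records: Iterable[Dict[str, Any]],
                        bucket: str, key: str) -> int:
    """Serialize records as gzipped JSON Lines in memory and upload them.

    Returns the compressed size in bytes.
    """
    buf = io.BytesIO()
    # GzipFile.close() flushes the gzip trailer but leaves buf open
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for rec in records:
            gz.write((json.dumps(rec) + "\n").encode("utf-8"))

    size = buf.tell()
    buf.seek(0)   # rewind so the Transfer Manager reads from the start

    s3_client.upload_fileobj(
        Fileobj=buf,
        Bucket=bucket,
        Key=key,
        ExtraArgs={"ContentType": "application/gzip"},
    )
    return size


# Usage (assumes an S3 client created elsewhere):
#   upload_records_gzip(s3, [{"id": 1}, {"id": 2}], "my-data-lake", "temp/sample.json.gz")
```

Because upload_fileobj goes through the Transfer Manager, the same call switches to multipart automatically once the buffer exceeds the multipart threshold.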

download_file vs get_object

Same pattern on the download side. download_file saves directly to disk, splitting large objects into parallel ranged GETs and retrying failed chunks. get_object returns a streaming response body. Use it when you want to read content into memory without saving to disk first.

python — download_file vs get_object

import boto3, json, io
import pandas as pd

s3 = boto3.client("s3", region_name="us-east-1")

# ── download_file — saves directly to disk ──
s3.download_file(
    Bucket   = "my-data-lake",
    Key      = "bronze/events/year=2024/month=01/day=01/events.parquet",
    Filename = "/tmp/events.parquet"
)
df = pd.read_parquet("/tmp/events.parquet")
print(f"Downloaded and loaded: {len(df):,} rows")

# ── get_object — read into memory (no disk write) ──
response = s3.get_object(
    Bucket = "my-data-lake",
    Key    = "config/pipeline_config.json"
)
# response["Body"] is a StreamingBody — must read() it
config = json.loads(response["Body"].read().decode("utf-8"))
print(f"Config loaded: {config}")

# ── get_object for Parquet directly into pandas ──
response = s3.get_object(
    Bucket = "my-data-lake",
    Key    = "bronze/events/year=2024/month=01/day=01/events.parquet"
)
buf = io.BytesIO(response["Body"].read())
df  = pd.read_parquet(buf)
print(f"Loaded parquet from S3 into memory: {df.shape}")

# ── download_fileobj — streaming download into file-like object ──
buf = io.BytesIO()
s3.download_fileobj(
    Bucket   = "my-data-lake",
    Key      = "bronze/events/year=2024/month=01/day=01/events.parquet",
    Fileobj  = buf
)
buf.seek(0)
df = pd.read_parquet(buf)
print(f"Streaming download complete: {df.shape}")
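The get_object examples above call read() and pull the whole object into memory. For large line-oriented files, the StreamingBody can be consumed incrementally instead. This is a minimal sketch, assuming an uncompressed JSON Lines object; the helper name iter_json_lines is hypothetical.

```python
import json
from typing import Any, Dict, Iterator


def iter_json_lines(s3_client: Any, bucket: str, key: str) -> Iterator[Dict[str, Any]]:
    """Stream a JSON Lines object from S3 one record at a time."""
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"]
    try:
        # iter_lines() yields bytes without the trailing newline
        for raw in body.iter_lines():
            if raw:                 # skip blank lines
                yield json.loads(raw)
    finally:
        body.close()                # release the underlying HTTP connection


# Usage (assumes an S3 client created elsewhere):
#   errors = sum(1 for rec in iter_json_lines(s3, "my-data-lake", "bronze/logs/run.jsonl")
#                if rec.get("status") == "ERROR")
```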

head_object — Check Existence and Get Metadata

head_object fetches an object's metadata without downloading its content. It is the correct and efficient way to check if a file exists, get its size, last modified time, and custom metadata — all in a single lightweight API call.

python — head_object

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", region_name="us-east-1")

def get_object_info(bucket: str, key: str) -> dict | None:
    """
    Return object metadata if it exists, None if it doesn't.
    Never downloads the object body.
    """
    try:
        response = s3.head_object(Bucket=bucket, Key=key)
        return {
            "size"         : response["ContentLength"],        # bytes
            "last_modified": response["LastModified"],         # datetime
            "content_type" : response["ContentType"],
            "etag"         : response["ETag"].strip('"'),      # MD5 only for single-part, non-KMS uploads
            "metadata"     : response.get("Metadata", {}),    # custom metadata
            "storage_class": response.get("StorageClass", "STANDARD")   # field is absent for STANDARD
        }
    except ClientError as e:
        if e.response["Error"]["Code"] in ("404", "NoSuchKey"):
            return None
        raise

# ── Check if today's file has already landed ──
info = get_object_info(
    "my-data-lake",
    "bronze/events/year=2024/month=01/day=01/events.parquet"
)
if info:
    print(f"File exists — size: {info['size']/1e6:.1f} MB, "
          f"modified: {info['last_modified']}")
    print(f"Row count from metadata: {info['metadata'].get('row_count', 'unknown')}")
else:
    print("File not found — upstream job may not have run yet")

delete_object, delete_objects (batch), copy_object

Deleting and copying are common in pipeline cleanup, archiving, and promotion workflows. delete_objects can delete up to 1000 objects in a single API call — always use the batch version when cleaning up many files to avoid thousands of individual API calls.

python — delete and copy operations

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# ── Delete single object ──
s3.delete_object(Bucket="my-data-lake", Key="temp/staging/scratch.csv")
print("Single file deleted")

# ── Batch delete up to 1000 objects per call ──
# First list the objects to delete
paginator = s3.get_paginator("list_objects_v2")
pages     = paginator.paginate(Bucket="my-data-lake", Prefix="temp/")
keys_to_delete = [
    {"Key": obj["Key"]}
    for page in pages
    for obj in page.get("Contents", [])
]

# Delete in batches of 1000 (S3 limit per call)
for i in range(0, len(keys_to_delete), 1000):
    batch    = keys_to_delete[i:i+1000]
    response = s3.delete_objects(
        Bucket = "my-data-lake",
        Delete = {
            "Objects": batch,
            "Quiet"  : True    # suppress per-object success responses
        }
    )
    errors = response.get("Errors", [])
    if errors:
        for err in errors:
            print(f"  Failed to delete {err['Key']}: {err['Message']}")
    else:
        print(f"  Deleted batch of {len(batch)} objects")

print(f"Delete requested for {len(keys_to_delete)} objects (any failures are listed above)")

# ── Copy object — promote from bronze to silver ──
# Copy does NOT download the file to your machine — it's a server-side copy
s3.copy_object(
    CopySource = {
        "Bucket": "my-data-lake",
        "Key"   : "bronze/events/year=2024/month=01/day=01/events.parquet"
    },
    Bucket     = "my-data-lake",
    Key        = "silver/events/year=2024/month=01/day=01/events.parquet",
    MetadataDirective = "COPY"   # COPY keeps original metadata; REPLACE overwrites it
)
print("Server-side copy complete — no data downloaded")

📌 copy_object Is Server-Side

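copy_object never moves data through your machine: AWS copies it entirely within S3. For large files this is far faster and cheaper than download + re-upload. Use it for promotions (bronze → silver), archiving, and cross-bucket copies within the same region. A single copy_object call handles objects up to 5 GB; above that, use the managed copy() method or a multipart copy built on upload_part_copy.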
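A single copy_object call is limited to 5 GB, so a promotion helper that must work for any file size can lean on the client's managed copy() method, which falls back to a multipart copy for large objects. The sketch below assumes keys start with a layer prefix such as bronze/; the names promoted_key and promote_object are hypothetical.

```python
from typing import Any


def promoted_key(key: str, src_layer: str = "bronze", dst_layer: str = "silver") -> str:
    """Rewrite 'bronze/...' to 'silver/...' and reject keys outside the source layer."""
    prefix = src_layer + "/"
    if not key.startswith(prefix):
        raise ValueError(f"{key!r} is not under {prefix!r}")
    return dst_layer + "/" + key[len(prefix):]


def promote_object(s3_client: Any, bucket: str, src_key: str,
                   src_layer: str = "bronze", dst_layer: str = "silver") -> str:
    """Server-side copy of one object into the next layer; returns the new key."""
    dst_key = promoted_key(src_key, src_layer, dst_layer)
    # copy() is the managed transfer: still server-side, but not capped at 5 GB
    s3_client.copy(
        CopySource={"Bucket": bucket, "Key": src_key},
        Bucket=bucket,
        Key=dst_key,
    )
    return dst_key


# Usage (assumes an S3 client created elsewhere):
#   promote_object(s3, "my-data-lake", "bronze/events/year=2024/month=01/day=01/events.parquet")
```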

generate_presigned_url — Time-Limited Access

A presigned URL grants temporary access to a private S3 object without requiring the recipient to have AWS credentials. The URL is valid for a specified number of seconds. Used for sharing pipeline outputs with external teams, giving applications download links, and enabling secure file uploads from browsers.

python — presigned URLs

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# ── Presigned GET URL — share a file for download ──
download_url = s3.generate_presigned_url(
    ClientMethod = "get_object",
    Params       = {
        "Bucket": "my-data-lake",
        "Key"   : "gold/reports/daily_summary_2024_01_01.csv"
    },
    ExpiresIn    = 3600   # valid for 1 hour (3600 seconds)
)
print(f"Share this URL (expires in 1 hour):\n{download_url}")
# Anyone with this URL can download the file without AWS credentials

# ── Presigned PUT URL — allow external system to upload directly to S3 ──
upload_url = s3.generate_presigned_url(
    ClientMethod = "put_object",
    Params       = {
        "Bucket"     : "my-data-lake",
        "Key"        : "landing/external_feed/partner_data.csv",
        "ContentType": "text/csv"
    },
    ExpiresIn = 900   # valid for 15 minutes
)
print(f"Upload URL for partner:\n{upload_url}")
# Partner can HTTP PUT their file to this URL without any AWS SDK,
# but the request must carry the same Content-Type header ("text/csv") that was signed above
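On the partner's side, using the presigned PUT URL needs nothing beyond an HTTP client. A minimal sketch with the standard library follows; the helper name build_presigned_put is hypothetical. Because ContentType was part of the signed parameters, the request has to send the same Content-Type header, otherwise S3 rejects the signature.

```python
import urllib.request


def build_presigned_put(url: str, data: bytes,
                        content_type: str = "text/csv") -> urllib.request.Request:
    """Build the HTTP PUT request for a presigned upload URL (does not send it)."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Content-Type": content_type},   # must match the signed ContentType
    )


# Sending it (network call, shown as a comment so the sketch has no side effects):
#   req = build_presigned_put(upload_url, open("partner_data.csv", "rb").read())
#   with urllib.request.urlopen(req, timeout=30) as resp:
#       print(resp.status)   # S3 answers 200 on success
```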

🚀

Advanced S3 — Multipart Upload and Transfer Manager

Advanced ▼

Multipart Upload — Manual API

A single PUT tops out at 5 GB, so larger objects must use multipart upload, and AWS recommends it from roughly 100 MB upward. You split the file into parts, upload each part separately, then tell S3 to assemble them. If a part fails, you only retry that part. upload_file does this automatically, but understanding the manual API is important for custom streaming scenarios.

python — manual multipart upload

import boto3, os

s3          = boto3.client("s3", region_name="us-east-1")
BUCKET      = "my-data-lake"
KEY         = "bronze/large_dataset/events_full_year.parquet"
LOCAL_FILE  = "/data/events_full_year.parquet"
PART_SIZE   = 100 * 1024 * 1024   # 100 MB per part (minimum 5MB except last part)

# ── Step 1: Initiate multipart upload ──
mpu      = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
upload_id = mpu["UploadId"]
print(f"Multipart upload initiated: {upload_id}")

parts     = []
part_num  = 1

try:
    with open(LOCAL_FILE, "rb") as f:
        while True:
            data = f.read(PART_SIZE)
            if not data:
                break   # end of file

            # ── Step 2: Upload each part ──
            response = s3.upload_part(
                Bucket     = BUCKET,
                Key        = KEY,
                UploadId   = upload_id,
                PartNumber = part_num,
                Body       = data
            )
            parts.append({
                "PartNumber": part_num,
                "ETag"      : response["ETag"]   # AWS returns ETag per part
            })
            print(f"  Uploaded part {part_num} ({len(data)/1e6:.1f} MB)")
            part_num += 1

    # ── Step 3: Complete the multipart upload ──
    s3.complete_multipart_upload(
        Bucket          = BUCKET,
        Key             = KEY,
        UploadId        = upload_id,
        MultipartUpload = {"Parts": parts}
    )
    print(f"✅ Multipart upload complete — {part_num-1} parts uploaded")

except Exception as e:
    # ── CRITICAL: Always abort on failure to avoid storage charges ──
    s3.abort_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload_id)
    print(f"❌ Upload failed — multipart aborted: {e}")
    raise
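The fixed 100 MB PART_SIZE above works until the file is large enough to need more than 10,000 parts, which is the S3 limit for one multipart upload. A small helper can pick a safe part size up front. This is a sketch; the name choose_part_size and its default are illustrative.

```python
MIB = 1024 * 1024
MAX_PARTS = 10_000        # S3 limit on parts per multipart upload
MIN_PART_SIZE = 5 * MIB   # S3 minimum for every part except the last


def choose_part_size(file_size: int, preferred: int = 100 * MIB) -> int:
    """Return a part size that keeps the upload within MAX_PARTS parts."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    # Smallest part size that fits the file into MAX_PARTS parts (integer ceiling)
    needed = (file_size + MAX_PARTS - 1) // MAX_PARTS
    return max(preferred, needed, MIN_PART_SIZE)


# Usage with the manual upload above:
#   PART_SIZE = choose_part_size(os.path.getsize(LOCAL_FILE))
```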

⚠️ Always Abort on Failure

Incomplete multipart uploads accumulate in S3 and incur storage charges. Always call abort_multipart_upload in your except block. Set a lifecycle rule to auto-abort incomplete multipart uploads after 7 days as a safety net.
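The safety-net rule can be sketched as follows. put_bucket_lifecycle_configuration replaces the bucket's entire rule set, so the helper merges the new rule into whatever rules already exist. The rule ID and function names are hypothetical.

```python
from typing import Any, Dict, List

ABORT_RULE_ID = "abort-incomplete-multipart"   # hypothetical rule name


def with_abort_rule(existing_rules: List[Dict[str, Any]], days: int = 7) -> List[Dict[str, Any]]:
    """Return the rule list plus one bucket-wide abort rule (replacing an older copy)."""
    rule = {
        "ID": ABORT_RULE_ID,
        "Status": "Enabled",
        "Filter": {"Prefix": ""},   # empty prefix applies to the whole bucket
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
    }
    kept = [r for r in existing_rules if r.get("ID") != ABORT_RULE_ID]
    return kept + [rule]


def apply_abort_rule(s3_client: Any, bucket: str,
                     existing_rules: List[Dict[str, Any]], days: int = 7) -> None:
    """Write the merged rule set back to the bucket."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": with_abort_rule(existing_rules, days)},
    )


# Usage (assumes an S3 client created elsewhere):
#   current = s3.get_bucket_lifecycle_configuration(Bucket="my-data-lake")["Rules"]
#   apply_abort_rule(s3, "my-data-lake", current)
```

If the bucket has no lifecycle configuration yet, get_bucket_lifecycle_configuration raises a ClientError with code NoSuchLifecycleConfiguration; in that case pass an empty list.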

S3 Transfer Manager — TransferConfig for High-Throughput

The S3 Transfer Manager (used by upload_file and download_file) can be tuned via TransferConfig to maximise throughput. Key parameters: multipart_threshold (file size above which multipart kicks in), max_concurrency (parallel part uploads), and multipart_chunksize (size of each part).

python — TransferConfig for high-throughput pipelines

import boto3, os, threading
from boto3.s3.transfer import TransferConfig
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3", region_name="us-east-1")

# ── Tune the Transfer Manager ──
transfer_config = TransferConfig(
    multipart_threshold  = 50  * 1024 * 1024,   # use multipart for files > 50 MB
    multipart_chunksize  = 50  * 1024 * 1024,   # each part = 50 MB
    max_concurrency      = 20,                   # 20 parallel threads per transfer
    use_threads          = True
)

# ── Upload with tuned config ──
def upload_with_progress(local_path: str, bucket: str, key: str):
    uploaded_bytes = [0]
    file_size      = os.path.getsize(local_path)
    lock           = threading.Lock()   # Callback is invoked from several worker threads

    def progress(bytes_transferred):
        with lock:
            uploaded_bytes[0] += bytes_transferred
            pct = uploaded_bytes[0] / file_size * 100
            print(f"\r  Progress: {pct:.1f}% ({uploaded_bytes[0]/1e6:.1f}/{file_size/1e6:.1f} MB)", end="")

    s3.upload_file(
        Filename = local_path,
        Bucket   = bucket,
        Key      = key,
        Config   = transfer_config,
        Callback = progress
    )
    print(f"\n✅ Uploaded: {key}")

upload_with_progress(
    "/data/events_2024.parquet",
    "my-data-lake",
    "bronze/events/events_2024.parquet"
)

# ── Parallel upload of multiple files using ThreadPoolExecutor ──
files_to_upload = [
    ("/data/jan.parquet", "bronze/events/year=2024/month=01/data.parquet"),
    ("/data/feb.parquet", "bronze/events/year=2024/month=02/data.parquet"),
    ("/data/mar.parquet", "bronze/events/year=2024/month=03/data.parquet"),
]

def upload_one(args):
    local_path, s3_key = args
    s3.upload_file(local_path, "my-data-lake", s3_key, Config=transfer_config)
    print(f"  ✅ Uploaded: {s3_key}")

with ThreadPoolExecutor(max_workers=4) as pool:
    # list() drains the result iterator, so an exception raised in any worker is re-raised here
    list(pool.map(upload_one, files_to_upload))
print("All files uploaded in parallel")

S3 Select — Query Inside Objects Without Full Download

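S3 Select lets you run a SQL expression against a single S3 object (CSV, JSON, or Parquet) and retrieve only the matching rows, without downloading the entire file. The saving is proportional to how selective the filter is: a query that keeps 10% of the rows transfers roughly 10% of the data. Note that AWS closed S3 Select to new customers in 2024, while existing accounts can keep using it, so check availability in your account before building on it.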

python — S3 Select

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# ── S3 Select on a CSV file — get only rows where status = 'ERROR' ──
response = s3.select_object_content(
    Bucket         = "my-data-lake",
    Key            = "bronze/pipeline_logs/logs_2024_01_01.csv",
    ExpressionType = "SQL",
    Expression     = "SELECT s.run_id, s.status, s.error_msg FROM S3Object s WHERE s.status = 'ERROR'",
    InputSerialization  = {
        "CSV": {"FileHeaderInfo": "USE"},   # USE = first row is header
        "CompressionType": "NONE"
    },
    OutputSerialization = {
        "CSV": {}
    }
)

# ── Read the streaming result ──
# Record events are arbitrary byte chunks: one event can hold many rows,
# and a row can be split across two events. Collect the bytes first, split afterwards.
chunks = []
for event in response["Payload"]:
    if "Records" in event:
        chunks.append(event["Records"]["Payload"])
    elif "Stats" in event:
        stats = event["Stats"]["Details"]
        print(f"Bytes scanned: {stats['BytesScanned']:,} | "
              f"Bytes returned: {stats['BytesReturned']:,}")

result_rows = b"".join(chunks).decode("utf-8").splitlines()
print(f"Error rows found: {len(result_rows)}")
for row in result_rows[:5]:
    print(f"  {row.strip()}")

✅ When S3 Select Is Worth It

S3 Select works well on a single large compressed CSV or JSON file where you need a small filtered subset. For Parquet files across many partitions, use Athena instead — it parallelises across files and has better predicate pushdown.

🔔

S3 Event Notifications

Event-Driven ▼

put_bucket_notification_configuration — Trigger Pipelines on File Arrival

S3 event notifications fire when objects are created, deleted, or restored. You configure them to send events to SQS, SNS, or Lambda. The most common data engineering pattern: new file lands in S3 → SQS receives the event → pipeline polls SQS and starts processing.

python — S3 event notification to SQS

import boto3

s3  = boto3.client("s3",  region_name="us-east-1")
sqs = boto3.client("sqs", region_name="us-east-1")

BUCKET    = "my-data-lake"
SQS_ARN   = "arn:aws:sqs:us-east-1:123456789012:s3-file-arrival-queue"
LAMBDA_ARN= "arn:aws:lambda:us-east-1:123456789012:function:trigger-pipeline"

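# ── Configure S3 to send events to SQS on object creation ──
# This call REPLACES the bucket's whole notification configuration, so include every
# existing rule. S3 also validates each destination: the SQS queue policy and the Lambda
# resource policy must already allow s3.amazonaws.com, or the call is rejected.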
s3.put_bucket_notification_configuration(
    Bucket                    = BUCKET,
    NotificationConfiguration = {
        "QueueConfigurations": [
            {
                "Id"    : "new-bronze-file-notification",
                "QueueArn": SQS_ARN,
                "Events": ["s3:ObjectCreated:*"],   # all create events (PUT, POST, COPY)
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "landing/"},    # only landing/ prefix
                            {"Name": "suffix", "Value": ".parquet"}     # only .parquet files
                        ]
                    }
                }
            }
        ],
        # ── Also trigger a Lambda directly for CSV files ──
        "LambdaFunctionConfigurations": [
            {
                "Id"                : "csv-arrival-lambda",
                "LambdaFunctionArn" : LAMBDA_ARN,
                "Events"            : ["s3:ObjectCreated:Put"],
                "Filter"            : {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "landing/csv/"},
                            {"Name": "suffix", "Value": ".csv"}
                        ]
                    }
                }
            }
        ]
    }
)
print("✅ S3 event notifications configured")
print("  New .parquet in landing/ → SQS queue")
print("  New .csv in landing/csv/ → Lambda function")
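S3 can only deliver to the queue if the queue's access policy allows it. A minimal sketch of such a policy follows, restricted to one bucket and one account. The function name s3_to_sqs_policy is hypothetical, and the queue URL in the usage comment is a placeholder.

```python
import json


def s3_to_sqs_policy(queue_arn: str, bucket: str, account_id: str) -> str:
    """Build an SQS access policy that lets one S3 bucket send messages to the queue."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowS3SendMessage",
            "Effect": "Allow",
            "Principal": {"Service": "s3.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            "Condition": {
                # Only events from this bucket, owned by this account
                "ArnLike": {"aws:SourceArn": f"arn:aws:s3:::{bucket}"},
                "StringEquals": {"aws:SourceAccount": account_id},
            },
        }],
    }
    return json.dumps(policy)


# Usage (assumes an SQS client and the queue URL are available):
#   sqs.set_queue_attributes(
#       QueueUrl   = queue_url,
#       Attributes = {"Policy": s3_to_sqs_policy(SQS_ARN, BUCKET, "123456789012")},
#   )
```

Apply the policy before calling put_bucket_notification_configuration, since S3 checks the destination when the notification is created.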

☁️ Event-Driven Pipeline Pattern

S3 object created → S3 sends event to SQS → your pipeline polls SQS with receive_message() → extracts bucket/key from message → starts Glue job or Spark step → deletes SQS message on success.
This is the foundation of file-arrival triggered batch pipelines on AWS.
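The consume side of this pattern can be sketched as below, assuming S3 delivers straight to SQS (no SNS wrapper around the message). Object keys arrive URL-encoded in the event, so they are decoded before use. The names extract_s3_objects and process_batch are hypothetical, and handler stands in for whatever starts the Glue job or Spark step.

```python
import json
from typing import Any, Callable, List, Tuple
from urllib.parse import unquote_plus


def extract_s3_objects(message_body: str) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs from one S3 event notification body."""
    event = json.loads(message_body)
    objects = []
    # The s3:TestEvent message S3 sends on setup has no "Records" list
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])   # "year%3D2024" -> "year=2024"
        objects.append((bucket, key))
    return objects


def process_batch(sqs_client: Any, queue_url: str,
                  handler: Callable[[str, str], None]) -> int:
    """Receive up to 10 messages, process them, and delete each one on success."""
    resp = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,       # long polling
    )
    handled = 0
    for msg in resp.get("Messages", []):
        for bucket, key in extract_s3_objects(msg["Body"]):
            handler(bucket, key)  # an exception here leaves the message for redelivery
        sqs_client.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
        handled += 1
    return handled
```

Deleting only after the handler returns means a crash mid-processing makes the message visible again after the visibility timeout, so the handler should be idempotent.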

29.30.6 — BOTO3 DEEP DIVE

Glue APIs

AWS Glue is your central ETL orchestration and metadata layer. Its boto3 API covers four domains: Glue Data Catalog (databases, tables, partitions), Crawlers (schema discovery), ETL Jobs (start/poll/manage), and Data Quality. Every production pipeline touches at least two of these.

🗄️

Catalog — Databases (create, get, delete, list)

What Is the Glue Data Catalog?

The Glue Data Catalog is a central metadata store — a fully managed Hive Metastore replacement. It stores database and table definitions (schema, location, partition info) that are shared across Glue jobs, EMR Spark, Athena, Redshift Spectrum, and Lake Formation. Think of it as the single source of truth for "what tables exist and where their data lives on S3."

📦 Analogy

The Glue Catalog is like a library card catalog. The actual books (data) are on S3. The card catalog (Glue) tells you which shelf (S3 path), what the book contains (schema), and how it's organized (partitions). Athena, EMR, and Glue jobs all consult the same catalog.

🔑 Key Fact

A Glue Database is just a namespace — a logical container for tables. It doesn't store data itself. All data lives on S3 (or another store). The database simply groups related tables together.

create_database() — Create a new Glue database

Creates a logical namespace in the Glue Catalog. You provide a name and optional description. All tables for a project or domain live under one database.

python — create_database()

import boto3
from botocore.exceptions import ClientError

glue = boto3.client("glue", region_name="us-east-1")

try:
    glue.create_database(
        DatabaseInput={
            "Name":        "orders_db",
            "Description": "Orders domain — Bronze and Silver tables",
            "LocationUri": "s3://my-datalake/orders/"  # optional default S3 location
        }
    )
    print("Database created: orders_db")

except ClientError as e:
    if e.response["Error"]["Code"] == "AlreadyExistsException":
        print("Database already exists — skipping")
    else:
        raise

✅ Real Usage

In a metadata-driven pipeline framework, you call create_database() once during environment provisioning (via Terraform or an init script). Day-to-day pipeline code never creates databases — it only creates tables inside an existing database.

get_database() and get_databases() — Read database metadata

get_database() fetches a single database by name. get_databases() lists all databases — use a paginator since there can be many.

python — get_database() and paginating get_databases()

# ── Get a single database ─────────────────────────────────────────
response = glue.get_database(Name="orders_db")
db = response["Database"]
print(db["Name"], db.get("Description"), db.get("LocationUri"))

# ── List ALL databases with paginator ─────────────────────────────
paginator = glue.get_paginator("get_databases")
for page in paginator.paginate():
    for db in page["DatabaseList"]:
        print(db["Name"])

delete_database() — Remove a database

Deletes the database definition from the catalog. It does NOT delete the underlying S3 data — only the metadata. Be careful in production — deleting a database removes all table definitions under it.

python — delete_database()

try:
    glue.delete_database(Name="old_staging_db")
    print("Deleted")
except ClientError as e:
    if e.response["Error"]["Code"] == "EntityNotFoundException":
        print("Database not found — nothing to delete")
    else:
        raise

⚠️ Warning

delete_database() cascades to all tables inside it — their catalog definitions are permanently removed. The S3 data files are untouched, but Athena and Glue jobs will no longer be able to find the tables until you re-register them.

📋

Catalog — Tables (create, get, update, delete, batch_delete)

create_table() — Register a table in the Glue Catalog

You call create_table() to tell the Glue Catalog that a table exists — pointing it to an S3 path with schema information. This is what Glue Crawlers do under the hood. You can also do it manually for full schema control.

python — create_table() — register a Parquet table

glue.create_table(
    DatabaseName="orders_db",
    TableInput={
        "Name":        "orders_bronze",
        "Description": "Raw orders landed from DMS CDC",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id",   "Type": "string"},
                {"Name": "customer_id","Type": "string"},
                {"Name": "amount",      "Type": "double"},
                {"Name": "status",      "Type": "string"},
                {"Name": "created_at",  "Type": "timestamp"},
            ],
            "Location":          "s3://my-datalake/bronze/orders/",
            "InputFormat":       "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat":      "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
                "Parameters":           {"serialization.format": "1"}
            },
            "Compressed":        False,
            "StoredAsSubDirectories": False
        },
        "PartitionKeys": [
            {"Name": "year",  "Type": "string"},
            {"Name": "month", "Type": "string"},
            {"Name": "day",   "Type": "string"},
        ],
        "TableType": "EXTERNAL_TABLE",  # data lives on S3, not managed by Glue
        "Parameters": {
            "classification": "parquet",
            "EXTERNAL":       "TRUE"
        }
    }
)

📦 Analogy

create_table() is like registering a filing cabinet in the office directory. You're telling everyone "there's a cabinet called orders_bronze on the 3rd floor shelf (S3 path), and here's what kinds of documents are inside (schema)." The documents themselves aren't moved — only the registration is created.

get_table() — Fetch a single table definition

Retrieves the full table metadata — schema, S3 location, partition keys, SerDe info. Useful for validation before running a pipeline: check if the table exists and if the schema matches what you expect.

python — get_table() with existence check

def table_exists(glue_client, database: str, table: str) -> bool:
    try:
        glue_client.get_table(DatabaseName=database, Name=table)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "EntityNotFoundException":
            return False
        raise

if table_exists(glue, "orders_db", "orders_bronze"):
    resp = glue.get_table(DatabaseName="orders_db", Name="orders_bronze")
    tbl  = resp["Table"]
    cols = tbl["StorageDescriptor"]["Columns"]
    loc  = tbl["StorageDescriptor"]["Location"]
    print(f"Location: {loc}")
    print(f"Columns:  {[c['Name'] for c in cols]}")
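The "schema matches what you expect" check mentioned above could look something like this. It is a sketch only: diff_schema and the expected-schema dict are hypothetical, and the comparison is a plain case-insensitive match on the type strings Glue returns.

```python
def diff_schema(expected: dict, actual_columns: list) -> dict:
    """Compare an expected {column: type} mapping with a Glue 'Columns' list."""
    actual = {c["Name"]: c["Type"] for c in actual_columns}
    shared = set(expected) & set(actual)
    return {
        "missing":       sorted(set(expected) - set(actual)),   # expected but not in catalog
        "unexpected":    sorted(set(actual) - set(expected)),   # in catalog but not expected
        "type_mismatch": sorted(
            name for name in shared
            if expected[name].lower() != actual[name].lower()
        ),
    }


# Offline example using the same shape as tbl["StorageDescriptor"]["Columns"]
expected_schema = {"order_id": "string", "amount": "double"}
catalog_columns = [
    {"Name": "order_id", "Type": "string"},
    {"Name": "amount",   "Type": "decimal(10,2)"},
    {"Name": "status",   "Type": "string"},
]
schema_report = diff_schema(expected_schema, catalog_columns)
print(schema_report)
```

A pipeline would typically fail fast when "missing" or "type_mismatch" is non-empty and only warn on "unexpected".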

update_table() — Schema evolution: add or modify columns

When your source schema evolves (a new column arrives), you update the Glue table definition so Athena and Glue jobs see the new column. You must pass the full updated TableInput — not just the delta.

python — update_table() to add a new column

# 1. Fetch existing definition
existing = glue.get_table(DatabaseName="orders_db", Name="orders_bronze")["Table"]
sd = existing["StorageDescriptor"]

# 2. Add the new column to the existing list
sd["Columns"].append({"Name": "discount_pct", "Type": "double"})

# 3. Push the full updated definition back
glue.update_table(
    DatabaseName="orders_db",
    TableInput={
        "Name":               existing["Name"],
        "StorageDescriptor":  sd,
        "PartitionKeys":      existing.get("PartitionKeys", []),
        "TableType":          existing.get("TableType", "EXTERNAL_TABLE"),
        "Parameters":         existing.get("Parameters", {}),
    }
)
print("Schema updated — discount_pct column added")

🔑 Important Pattern

Always fetch first, then modify, then push. Calling update_table() with an incomplete TableInput will wipe out fields you didn't include (like PartitionKeys). Never build the TableInput from scratch when updating.
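One way to apply fetch-modify-push without dropping fields is to copy every TableInput-compatible key from the fetched table. get_table() also returns read-only fields (DatabaseName, CreateTime, and so on) that update_table() does not accept, so they must be filtered out. The sketch below assumes the key list shown; table_to_input and with_column are hypothetical helpers, and the list should be checked against the boto3 documentation for your SDK version.

```python
import copy

# Assumption: keys that TableInput accepts and that get_table() can return
TABLE_INPUT_KEYS = (
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
)


def table_to_input(table: dict) -> dict:
    """Keep only the keys of a get_table() result that update_table() accepts."""
    return {k: table[k] for k in TABLE_INPUT_KEYS if k in table}


def with_column(table_input: dict, name: str, col_type: str) -> dict:
    """Return a copy with one column appended; unchanged if it already exists."""
    updated = copy.deepcopy(table_input)
    columns = updated["StorageDescriptor"]["Columns"]
    if not any(c["Name"] == name for c in columns):
        columns.append({"Name": name, "Type": col_type})
    return updated


# Usage with a real client (not executed here):
#   existing = glue.get_table(DatabaseName="orders_db", Name="orders_bronze")["Table"]
#   glue.update_table(
#       DatabaseName="orders_db",
#       TableInput=with_column(table_to_input(existing), "discount_pct", "double"),
#   )
```

Because with_column() is a no-op when the column is already present, re-running the same migration does not create duplicate columns.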

get_tables() — List all tables in a database (with paginator)

Returns all table definitions in a database. A single database can have hundreds of tables — always use the paginator.

python — get_tables() with paginator

paginator = glue.get_paginator("get_tables")
all_tables = []

for page in paginator.paginate(DatabaseName="orders_db"):
    all_tables.extend(page["TableList"])

print(f"Found {len(all_tables)} tables")
for t in all_tables:
    print(t["Name"], t["StorageDescriptor"]["Location"])

delete_table() and batch_delete_table() — Remove table definitions

delete_table() removes a single table from the catalog. batch_delete_table() removes up to 100 tables in one API call, which makes it useful for cleanup scripts.

python — delete single table and batch delete

# ── Single delete ─────────────────────────────────────────────────
glue.delete_table(DatabaseName="orders_db", Name="orders_temp")

# ── Batch delete (up to 100 at once) ──────────────────────────────
tables_to_drop = ["stg_orders_20230101", "stg_orders_20230102", "stg_orders_20230103"]

response = glue.batch_delete_table(
    DatabaseName="orders_db",
    TablesToDelete=tables_to_drop
)

errors = response.get("Errors", [])
if errors:
    for err in errors:
        print(f"Failed to delete {err['TableName']}: {err['ErrorDetail']['ErrorMessage']}")
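When a cleanup script has more tables than a single call accepts, the list has to be split into batches. A minimal sketch, assuming hypothetical helpers chunked and drop_tables; batch_size defaults to a conservative 25 and can be raised up to the per-call limit.

```python
def chunked(items: list, size: int):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def drop_tables(glue_client, database: str, tables: list, batch_size: int = 25) -> list:
    """Delete tables batch by batch; return the per-table errors Glue reported."""
    failures = []
    for batch in chunked(tables, batch_size):
        resp = glue_client.batch_delete_table(
            DatabaseName=database,
            TablesToDelete=batch,
        )
        # Per-table problems are reported in "Errors", not raised as exceptions
        failures.extend(resp.get("Errors", []))
    return failures
```

Returning the collected errors instead of raising on the first one lets the caller decide whether a missing staging table is worth failing the cleanup run.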

🗂️

Catalog — Partitions (create, batch_create, get, delete, update)

Why Do Partitions Need to Be Registered?

When your Spark job writes Parquet to S3 in partitioned directories (year=2024/month=01/day=15/), Athena and Glue don't automatically know about new partitions. You must register new partitions in the Glue Catalog so query engines can find them. Without registration, SELECT * FROM orders_bronze WHERE year='2024' in Athena returns zero rows even though the data is on S3.

Spark writes to S3 → New partition dir created → batch_create_partition() → Athena can query it

🔑 Alternative

You can also run MSCK REPAIR TABLE orders_bronze in Athena or Glue SQL to auto-discover all partitions — but this scans the entire S3 path and is slow for large tables. The programmatic batch_create_partition() approach is always preferred in production pipelines.

batch_create_partition() — Bulk register new partitions (preferred)

The most important partition API. Registers up to 100 partitions in a single call. Call this at the end of every Spark write job to register the partitions just written.

python — batch_create_partition() after Spark write

from datetime import date

def register_daily_partitions(glue_client, database, table, s3_base, year, month, day):
    """Register a single date partition after Spark write."""
    partition_values = [str(year), str(month).zfill(2), str(day).zfill(2)]
    s3_location = f"{s3_base}/year={year}/month={str(month).zfill(2)}/day={str(day).zfill(2)}/"

    # Fetch parent table's StorageDescriptor to clone it for the partition
    parent_sd = glue_client.get_table(
        DatabaseName=database, Name=table
    )["Table"]["StorageDescriptor"]

    # Override location for this specific partition
    partition_sd = {**parent_sd, "Location": s3_location}

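    # batch_create_partition() does not raise for per-partition problems;
    # it reports them in the "Errors" list of the response.
    response = glue_client.batch_create_partition(
        DatabaseName=database,
        TableName=table,
        PartitionInputList=[{
            "Values":              partition_values,
            "StorageDescriptor":   partition_sd,
            "Parameters":          {}
        }]
    )
    errors = response.get("Errors", [])
    for err in errors:
        detail = err["ErrorDetail"]
        if detail["ErrorCode"] == "AlreadyExistsException":
            print(f"Partition {partition_values} already exists, skipping")
        else:
            raise RuntimeError(f"Partition {partition_values} failed: {detail['ErrorMessage']}")
    if not errors:
        print(f"Registered partition: {partition_values}")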

# ── Usage: register today's partition after pipeline completes ────
today = date.today()
register_daily_partitions(
    glue_client=glue,
    database="orders_db",
    table="orders_bronze",
    s3_base="s3://my-datalake/bronze/orders",
    year=today.year, month=today.month, day=today.day
)
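For backfills you usually register a whole date range rather than a single day. The sketch below builds one PartitionInput per day and sends them in chunks of up to 100. daily_partition_inputs and register_partitions are hypothetical names, and the code assumes the year/month/day layout used above. Note that per-partition failures come back in the response's "Errors" list.

```python
from datetime import date, timedelta


def daily_partition_inputs(parent_sd: dict, s3_base: str, start: date, end: date) -> list:
    """Build one PartitionInput per day in [start, end], cloning the table's StorageDescriptor."""
    inputs = []
    day = start
    while day <= end:
        y, m, d = f"{day.year}", f"{day.month:02d}", f"{day.day:02d}"
        inputs.append({
            "Values": [y, m, d],
            "StorageDescriptor": {
                **parent_sd,
                "Location": f"{s3_base}/year={y}/month={m}/day={d}/",
            },
            "Parameters": {},
        })
        day += timedelta(days=1)
    return inputs


def register_partitions(glue_client, database: str, table: str,
                        partition_inputs: list, batch_size: int = 100) -> list:
    """Register partitions in batches; return errors other than 'already exists'."""
    real_errors = []
    for i in range(0, len(partition_inputs), batch_size):
        resp = glue_client.batch_create_partition(
            DatabaseName=database,
            TableName=table,
            PartitionInputList=partition_inputs[i:i + batch_size],
        )
        for err in resp.get("Errors", []):
            if err["ErrorDetail"]["ErrorCode"] != "AlreadyExistsException":
                real_errors.append(err)
    return real_errors


# Offline example: three days spanning a leap-year month boundary
example_inputs = daily_partition_inputs(
    {"Columns": [], "Location": "s3://my-datalake/bronze/orders/"},
    "s3://my-datalake/bronze/orders",
    date(2024, 2, 28), date(2024, 3, 1),
)
print([p["Values"] for p in example_inputs])
```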

get_partitions() — List partitions with paginator

Lists all registered partitions for a table. Returns partition values and their S3 locations. You can filter with an Expression (Hive-style filter like "year='2024' AND month='01'").

python — get_partitions() with filter

paginator = glue.get_paginator("get_partitions")

all_partitions = []
for page in paginator.paginate(
    DatabaseName="orders_db",
    TableName="orders_bronze",
    Expression="year='2024' AND month='03'"   # optional filter
):
    all_partitions.extend(page["Partitions"])

print(f"Found {len(all_partitions)} partitions for 2024-03")
for p in all_partitions[:3]:
    print(p["Values"], p["StorageDescriptor"]["Location"])

delete_partition() and batch_delete_partition() — Remove partitions

Used for cleanup — remove old partition registrations (e.g., after a retention policy deletes the S3 data). Removes the catalog entry only — does not delete S3 data.

python — batch_delete_partition() for cleanup

# Remove two specific date partitions
glue.batch_delete_partition(
    DatabaseName="orders_db",
    TableName="orders_bronze",
    PartitionsToDelete=[
        {"Values": ["2022", "01", "01"]},
        {"Values": ["2022", "01", "02"]},
    ]
)
print("Old partitions removed from catalog")
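A retention cleanup usually combines get_partitions() with batch_delete_partition(). The following sketch assumes partitions keyed by [year, month, day] as in this section; partitions_older_than and delete_partitions are hypothetical helpers, and the deletes are sent in batches of at most 25.

```python
from datetime import date


def partitions_older_than(partitions: list, cutoff: date) -> list:
    """Select partitions whose [year, month, day] values fall before `cutoff`."""
    stale = []
    for p in partitions:
        year, month, day = (int(v) for v in p["Values"])
        if date(year, month, day) < cutoff:
            stale.append({"Values": p["Values"]})
    return stale


def delete_partitions(glue_client, database: str, table: str,
                      to_delete: list, batch_size: int = 25) -> list:
    """Remove catalog entries in batches; return any per-partition errors."""
    errors = []
    for i in range(0, len(to_delete), batch_size):
        resp = glue_client.batch_delete_partition(
            DatabaseName=database,
            TableName=table,
            PartitionsToDelete=to_delete[i:i + batch_size],
        )
        errors.extend(resp.get("Errors", []))
    return errors


# Offline example using the shape returned in page["Partitions"]
registered = [
    {"Values": ["2022", "01", "01"]},
    {"Values": ["2022", "12", "31"]},
    {"Values": ["2023", "01", "01"]},
]
stale_partitions = partitions_older_than(registered, cutoff=date(2023, 1, 1))
print(stale_partitions)
```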

🕷️

Crawlers — create, start, stop, poll, delete

What Is a Glue Crawler?

A Glue Crawler is an automated schema discovery tool. You point it at an S3 path (or JDBC database), and it scans the files, infers the schema, detects partitions, and writes or updates table definitions in the Glue Catalog. You use crawlers for initial table registration and to detect schema changes automatically.

📦 Analogy

A crawler is like a librarian who walks through new shipments of books, reads the title page of each one, and writes a catalog entry for it. You don't have to tell the librarian exactly what each book contains — they figure it out themselves.

🟢

READY

Crawler is idle and ready to run.

🔵

RUNNING

Crawler is scanning S3 / JDBC now.

🟡

STOPPING

Run is finishing or a stop was requested; winding down before returning to READY.

create_crawler() — Define a new crawler

Creates a crawler that points to an S3 path. You specify which database to write tables into, what IAM role to use, and what S3 paths to crawl.

python — create_crawler() pointing to S3

glue.create_crawler(
    Name="orders-bronze-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="orders_db",
    Description="Crawl S3 bronze orders and update Glue Catalog",
    Targets={
        "S3Targets": [
            {"Path": "s3://my-datalake/bronze/orders/"},
        ]
    },
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",  # add new columns automatically
        "DeleteBehavior": "LOG"                  # log removals, don't auto-delete
    },
    # Full recrawl. CRAWL_NEW_FOLDERS_ONLY (incremental) is only accepted when
    # both schema-change behaviors above are set to "LOG".
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVERYTHING"}
)

start_crawler() and stop_crawler() — Run and halt a crawler

python — start_crawler() and stop_crawler()

# ── Start crawler ─────────────────────────────────────────────────
try:
    glue.start_crawler(Name="orders-bronze-crawler")
    print("Crawler started")
except ClientError as e:
    if e.response["Error"]["Code"] == "CrawlerRunningException":
        print("Crawler already running — skipping")
    else:
        raise

# ── Stop crawler ──────────────────────────────────────────────────
glue.stop_crawler(Name="orders-bronze-crawler")

Manual Waiter Pattern — Poll get_crawler() until READY

Glue does NOT provide a built-in waiter for crawlers. You must implement polling manually: call get_crawler() in a loop, check the State field, and sleep between checks. This is the standard production pattern.

python — poll_until_crawler_ready() — production pattern

import time

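def wait_for_crawler(glue_client, crawler_name: str, poll_sec: int = 15, timeout_sec: int = 900):
    """Block until the crawler is back in READY and its last crawl succeeded."""
    elapsed = 0
    while elapsed < timeout_sec:
        crawler = glue_client.get_crawler(Name=crawler_name)["Crawler"]
        state   = crawler["State"]            # READY | RUNNING | STOPPING
        print(f"[{elapsed}s] Crawler state: {state}")

        if state == "READY":
            # State only says the crawler is idle; the run outcome is in LastCrawl
            last = crawler.get("LastCrawl", {})
            if last.get("Status") != "SUCCEEDED":   # SUCCEEDED | CANCELLED | FAILED
                raise RuntimeError(
                    f"Crawler ended with status {last.get('Status')}: "
                    f"{last.get('ErrorMessage', 'no error message')}"
                )
            print("✅ Crawler finished")
            return

        # RUNNING and STOPPING are both normal phases of a run: keep polling
        time.sleep(poll_sec)
        elapsed += poll_sec

    raise TimeoutError(f"Crawler did not finish within {timeout_sec}s")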

# ── Full pattern: start → wait → confirm ──────────────────────────
glue.start_crawler(Name="orders-bronze-crawler")
wait_for_crawler(glue, "orders-bronze-crawler")
print("Catalog updated — Athena can now query new partitions")

✅ Production Tip

In Airflow, you implement this same polling logic inside a sensor (or use a while True loop in a PythonOperator). Set timeout_sec to something reasonable for your data volume — crawling 1 TB of Parquet can take 10–20 minutes.

⚙️

Jobs — create, start_job_run(), get_job_run(), poll, stop

Glue Job Types — Spark vs Python Shell

A Glue ETL Job is a managed execution environment for your transformation code. The two most important types: Spark ETL jobs (run your PySpark script on a managed cluster — most common) and Python Shell jobs (run plain Python on a single worker — for lightweight transforms, metadata updates, or API calls).

Job Type	Worker	Use Case	Billing Unit
Glue ETL (Spark)	G.1X, G.2X, G.4X, G.8X	PySpark transforms, large-scale data processing	DPU-hours
Python Shell	0.0625 or 1 DPU	Lightweight Python: API calls, metadata updates, orchestration helpers	DPU-hours (cheap)
Glue Streaming	G.1X	Continuous Spark Structured Streaming jobs	DPU-hours (continuous)

create_job() — Define a Glue ETL job

Registers a Glue job definition. The actual script lives on S3. You can pass default arguments that the script reads at runtime (like the target date, S3 path, or environment name).

python — create_job() for a PySpark ETL job

glue.create_job(
    Name="orders-bronze-to-silver",
    Description="Transform raw orders (Bronze) into clean orders (Silver)",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={
        "Name":           "glueetl",          # "glueetl" for Spark, "pythonshell" for Python
        "ScriptLocation": "s3://my-scripts/glue/bronze_to_silver.py",
        "PythonVersion":  "3"
    },
    DefaultArguments={
        "--job-language":                   "python",
        "--TempDir":                        "s3://my-datalake/glue-temp/",
        "--enable-metrics":                 "true",
        "--enable-continuous-cloudwatch-log":"true",
        "--enable-spark-ui":                "true",
        "--spark-event-logs-path":          "s3://my-datalake/spark-ui-logs/",
        "--SOURCE_DB":   "orders_db",
        "--SOURCE_TABLE":"orders_bronze",
        "--TARGET_PATH": "s3://my-datalake/silver/orders/",
    },
    GlueVersion="4.0",              # Spark 3.3 + Python 3.10
    WorkerType="G.1X",              # 4 vCPU, 16 GB per worker
    NumberOfWorkers=5,              # 1 driver + 4 executors
    Timeout=60,                     # minutes — job killed if exceeded
    MaxRetries=1,
    Tags={"Project": "orders-platform", "Env": "prod"}
)

start_job_run() — Trigger a Glue job execution

Launches a new run of an existing job. You can override default arguments at runtime — this is how metadata-driven pipelines work: one job definition, different arguments per pipeline or date.

python — start_job_run() with runtime argument overrides

from datetime import date

run_date = date.today().isoformat()  # "2024-03-15"

response = glue.start_job_run(
    JobName="orders-bronze-to-silver",
    Arguments={
        "--RUN_DATE":    run_date,          # override for this specific run
        "--ENV":         "prod",
        "--DRY_RUN":     "false",
    }
)

job_run_id = response["JobRunId"]
print(f"Started job run: {job_run_id}")
# JobRunId looks like: jr_abc123xyz456
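In a metadata-driven setup the Arguments dict is usually generated from a config record rather than written by hand. A small sketch, assuming a hypothetical build_glue_arguments helper and a hypothetical config shape; Glue expects every argument value to be a string.

```python
def build_glue_arguments(params: dict) -> dict:
    """Convert {"RUN_DATE": "2024-03-15", "DRY_RUN": False} into Glue's "--KEY": "value" form."""
    arguments = {}
    for key, value in params.items():
        if isinstance(value, bool):
            text = "true" if value else "false"   # lowercase, matching the style used above
        else:
            text = str(value)
        arguments[f"--{key}"] = text
    return arguments


# Hypothetical pipeline metadata: one job definition, different arguments per entry
pipeline_configs = [
    {"job": "orders-bronze-to-silver", "params": {"RUN_DATE": "2024-03-15", "ENV": "prod", "DRY_RUN": False}},
    {"job": "orders-bronze-to-silver", "params": {"RUN_DATE": "2024-03-16", "ENV": "prod", "DRY_RUN": True}},
]
built_arguments = [build_glue_arguments(c["params"]) for c in pipeline_configs]
print(built_arguments[0])

# With a real client (not executed here):
#   glue.start_job_run(JobName=cfg["job"], Arguments=build_glue_arguments(cfg["params"]))
```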

get_job_run() — Poll job status (manual waiter)

Glue has no built-in waiter for jobs. You must poll get_job_run() until the JobRunState reaches a terminal state. Terminal states are: SUCCEEDED, FAILED, ERROR, TIMEOUT, STOPPED, EXPIRED.

python — wait_for_glue_job() — production polling pattern

import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "ERROR", "TIMEOUT", "STOPPED", "EXPIRED"}

def wait_for_glue_job(glue_client, job_name: str, run_id: str,
                       poll_sec: int = 20, timeout_sec: int = 3600) -> dict:
    """Poll until Glue job reaches terminal state. Raises on failure."""
    elapsed = 0
    while elapsed < timeout_sec:
        resp  = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        run   = resp["JobRun"]
        state = run["JobRunState"]
        print(f"[{elapsed}s] {job_name} — {state}")

        if state in TERMINAL_STATES:
            if state != "SUCCEEDED":
                error_msg = run.get("ErrorMessage", "No error message")
                raise RuntimeError(f"Glue job {job_name} {state}: {error_msg}")
            print(f"✅ Job succeeded in {run.get('ExecutionTime', '?')}s")
            return run

        time.sleep(poll_sec)
        elapsed += poll_sec

    raise TimeoutError(f"Job {run_id} did not complete within {timeout_sec}s")

# ── Full pattern: start → wait → register partitions ──────────────
run_id = glue.start_job_run(
    JobName="orders-bronze-to-silver",
    Arguments={"--RUN_DATE": "2024-03-15"}
)["JobRunId"]

wait_for_glue_job(glue, "orders-bronze-to-silver", run_id)
print("Pipeline step complete — moving to next stage")

✅ What to Log After the Job

After wait_for_glue_job() returns, read run["ExecutionTime"] (seconds) and run["CompletedOn"], plus run.get("DPUSeconds") (a cost metric that Glue reports only for some runs, such as Flex or auto-scaled jobs), and write them to your DynamoDB audit table. This gives you a full history of every run with cost and duration per pipeline.
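Turning the returned run into an audit record might look like the sketch below. The attribute names and build_audit_item are hypothetical. Fields that Glue may omit are read defensively, and floats are converted to Decimal because boto3's DynamoDB resource does not accept Python floats.

```python
from datetime import datetime
from decimal import Decimal


def build_audit_item(job_name: str, run: dict) -> dict:
    """Flatten a Glue JobRun dict into an item for an audit table."""
    item = {
        "pipeline_name": job_name,          # hypothetical attribute names
        "run_id":        run["Id"],
        "state":         run["JobRunState"],
    }
    if "ExecutionTime" in run:
        item["execution_time_sec"] = run["ExecutionTime"]
    if "CompletedOn" in run:
        item["completed_on"] = run["CompletedOn"].isoformat()
    if "DPUSeconds" in run:
        # DPUSeconds is a float; convert it before writing to DynamoDB
        item["dpu_seconds"] = Decimal(str(run["DPUSeconds"]))
    return item


# Offline example with the shape returned by get_job_run()["JobRun"]
sample_run = {
    "Id": "jr_abc123",
    "JobRunState": "SUCCEEDED",
    "ExecutionTime": 420,
    "CompletedOn": datetime(2024, 3, 15, 12, 30, 0),
    "DPUSeconds": 2100.5,
}
audit_item = build_audit_item("orders-bronze-to-silver", sample_run)
print(audit_item)

# With a real table resource (not executed here):
#   boto3.resource("dynamodb").Table("pipeline_audit").put_item(Item=audit_item)
```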

get_job_runs() — View run history with paginator

Returns all historical runs for a job. Useful for dashboards, SLA reporting, and debugging repeated failures.

python — get_job_runs() to fetch last 10 runs

paginator = glue.get_paginator("get_job_runs")
runs = []
# MaxItems caps the total number of runs returned across all pages
for page in paginator.paginate(
    JobName="orders-bronze-to-silver",
    PaginationConfig={"MaxItems": 10}
):
    runs.extend(page["JobRuns"])

for r in runs:
    print(r["Id"], r["JobRunState"], r.get("ExecutionTime", "?"), "sec")
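For SLA reporting, the run list can be reduced to a few summary numbers. This is only a sketch: summarize_runs is a hypothetical helper, and what counts as "failed" here (FAILED, ERROR, TIMEOUT) is an assumption you may want to adjust.

```python
def summarize_runs(runs: list) -> dict:
    """Aggregate JobRun dicts into success rate and average duration."""
    succeeded = [r for r in runs if r["JobRunState"] == "SUCCEEDED"]
    failed    = [r for r in runs if r["JobRunState"] in ("FAILED", "ERROR", "TIMEOUT")]
    durations = [r["ExecutionTime"] for r in succeeded if "ExecutionTime" in r]
    finished  = len(succeeded) + len(failed)   # runs still in progress are excluded
    return {
        "runs":             len(runs),
        "succeeded":        len(succeeded),
        "failed":           len(failed),
        "success_rate":     len(succeeded) / finished if finished else None,
        "avg_duration_sec": sum(durations) / len(durations) if durations else None,
    }


# Offline example with the shape returned in page["JobRuns"]
sample_runs = [
    {"Id": "jr_1", "JobRunState": "SUCCEEDED", "ExecutionTime": 100},
    {"Id": "jr_2", "JobRunState": "SUCCEEDED", "ExecutionTime": 200},
    {"Id": "jr_3", "JobRunState": "FAILED"},
    {"Id": "jr_4", "JobRunState": "RUNNING"},
]
run_summary = summarize_runs(sample_runs)
print(run_summary)
```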

stop_job_run() and batch_stop_job_run() — Cancel running jobs

Use these to cancel jobs that are taking too long, consuming too many DPUs, or were triggered by mistake. batch_stop_job_run() cancels up to 25 runs in one call.

python — stop a running Glue job

# Stop a single run
glue.stop_job_run(JobName="orders-bronze-to-silver", RunId=job_run_id)

# Batch stop (useful in a cleanup handler)
glue.batch_stop_job_run(
    JobName="orders-bronze-to-silver",
    JobRunIds=["jr_run1", "jr_run2"]
)

✅

Glue Data Quality APIs

What Is Glue Data Quality?

Glue Data Quality is a built-in rule engine that lets you define quality checks (completeness, uniqueness, freshness, custom SQL expressions) and run them against tables in the Glue Catalog. Results can be used to halt the pipeline on quality violations — preventing bad data from propagating to Silver or Gold layers.

📊

Completeness

% non-null values in a column ≥ threshold. E.g. order_id must be 100% complete.

🔑

Uniqueness

% distinct values ≥ threshold. E.g. order_id must be 100% unique.

🕐

Freshness

Max timestamp in column is within N hours of now (DQDL rule type: DataFreshness). E.g. data must be <24 hours old.

🔢

RowCount

Table must have at least N rows. Catches empty-table bugs.

create_data_quality_ruleset() — Define DQ rules

A ruleset is a named set of DQ rules written in DQDL (Data Quality Definition Language). Rules are declarative — you describe what "good" looks like, not how to check it.

python — create_data_quality_ruleset() with DQDL rules

glue.create_data_quality_ruleset(
    Name="orders-bronze-dq-rules",
    Description="DQ checks for Bronze orders table",
    Ruleset="""
        Rules = [
            Completeness "order_id"    >= 1.0,
            Uniqueness   "order_id"    >= 0.999,
            Completeness "customer_id" >= 0.95,
            RowCount     >= 1000,
            ColumnValues "amount" between 0 and 100000,
            IsComplete   "status"
        ]
    """,
    TargetTable={
        "TableName":    "orders_bronze",
        "DatabaseName": "orders_db"
    }
)
print("DQ ruleset created: orders-bronze-dq-rules")

start_data_quality_ruleset_evaluation_run() — Run DQ checks

Triggers an evaluation of the ruleset against the target table. Returns a RunId that you poll to get results. The evaluation runs as a Glue job under the hood.

python — run DQ evaluation and poll for results

import time
import boto3
from botocore.exceptions import ClientError

glue = boto3.client("glue", region_name="us-east-1")

# ── 1. Start the evaluation ──────────────────────────────────────
resp = glue.start_data_quality_ruleset_evaluation_run(
    DataSource={
        "GlueTable": {
            "DatabaseName": "orders_db",
            "TableName":    "orders_bronze"
        }
    },
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    RulesetNames=["orders-bronze-dq-rules"],
    AdditionalRunOptions={"CloudWatchMetricsEnabled": True}
)
run_id = resp["RunId"]
print(f"DQ evaluation started: {run_id}")

# ── 2. Poll until complete ────────────────────────────────────────
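for i in range(60):  # up to 30 minutes
    time.sleep(30)
    run_detail = glue.get_data_quality_ruleset_evaluation_run(RunId=run_id)
    status     = run_detail["Status"]
    print(f"DQ run status: {status}")

    if status == "SUCCEEDED":
        break
    if status in ("FAILED", "STOPPED", "TIMEOUT"):
        raise RuntimeError(f"DQ evaluation failed: {status} {run_detail.get('ErrorString', '')}")
else:
    # The loop ended without break: the run never reached SUCCEEDED
    raise TimeoutError("DQ evaluation did not finish within 30 minutes")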

# ── 3. Read results and gate the pipeline ─────────────────────────
results = run_detail.get("ResultIds", [])
if not results:
    raise RuntimeError("No DQ results returned")

result_detail = glue.get_data_quality_result(ResultId=results[0])
rule_results  = result_detail["RuleResults"]

# Result is PASS, FAIL, or ERROR; treat anything other than PASS as a failure
failed_rules = [r for r in rule_results if r["Result"] != "PASS"]

if failed_rules:
    for r in failed_rules:
        print(f"❌ FAILED: {r['Name']} — {r.get('EvaluationMessage', '')}")
    raise RuntimeError(f"{len(failed_rules)} DQ rules failed — pipeline halted")

print(f"✅ All {len(rule_results)} DQ rules passed — proceeding to Silver")

🔑 Production Gate Pattern

Insert this DQ check between Bronze write and Silver transformation. If DQ fails: publish a CloudWatch metric (dq_failure_count), write the failure to your DynamoDB audit table, trigger an SNS alert, and stop the pipeline. This prevents bad data from contaminating Silver and Gold layers.

✅ Full Production DQ Flow

Glue Bronze Job → start DQ eval run → poll status → check PASS/FAIL → if PASS → Silver Job

If any rule fails → CloudWatch metric + SNS alert + DynamoDB audit record → pipeline stopped.
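The failure branch of this flow could be sketched as below. It covers the CloudWatch metric and the SNS alert only; the DynamoDB audit write is left out. report_dq_failure, the metric namespace, and the dimension name are hypothetical, while cloudwatch and sns are ordinary boto3 clients passed in by the caller.

```python
import json


def report_dq_failure(cloudwatch, sns, topic_arn: str, table: str, failed_rules: list) -> None:
    """Publish a failure-count metric and send an SNS alert for failed DQ rules."""
    cloudwatch.put_metric_data(
        Namespace="DataPipeline/DQ",            # hypothetical namespace
        MetricData=[{
            "MetricName": "dq_failure_count",
            "Dimensions": [{"Name": "Table", "Value": table}],
            "Value":      len(failed_rules),
            "Unit":       "Count",
        }],
    )
    sns.publish(
        TopicArn=topic_arn,
        Subject=f"DQ failure: {table}"[:100],   # SNS subjects are limited to 100 characters
        Message=json.dumps({
            "table": table,
            "failed_rules": [
                {"name": r["Name"], "message": r.get("EvaluationMessage", "")}
                for r in failed_rules
            ],
        }, indent=2),
    )


# With real clients (not executed here), call this before raising the RuntimeError
# that halts the pipeline:
#   report_dq_failure(boto3.client("cloudwatch"), boto3.client("sns"),
#                     "arn:aws:sns:us-east-1:123456789012:dq-alerts",
#                     "orders_bronze", failed_rules)
```

Reporting before raising means the alert still goes out even if the orchestrator kills the task as soon as it fails.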

29.30.7 — BOTO3 DEEP DIVE

Athena APIs

Amazon Athena is a serverless SQL engine that queries data directly on S3 using Trino under the hood. Its boto3 API is asynchronous — you submit a query, get back a QueryExecutionId, poll for completion, then fetch results. Mastering this pattern is essential for every automation script that reads from your data lake.

🚀

start_query_execution() — Submit a SQL query

How Athena Execution Works (Async Model)

Athena does not return results immediately like a regular database call. Instead it works like a job system: you submit a query → Athena returns a QueryExecutionId → you poll until the state is SUCCEEDED → then you fetch results from S3 (or via the API). The results are always written to an S3 output location first.

start_query_execution() → QueryExecutionId → poll get_query_execution() → SUCCEEDED → get_query_results()

📦 Analogy

Submitting an Athena query is like dropping off a package at the post office. You get a tracking number (QueryExecutionId) immediately. You check the tracking site periodically (poll) until it says "Delivered" (SUCCEEDED). Only then do you go to the destination to pick up what was delivered (fetch results).

start_query_execution() — All key parameters

The four most important parameters: QueryString (your SQL), QueryExecutionContext (which database to run against), ResultConfiguration (where to write results on S3), and WorkGroup (which cost/permission boundary to use).

python — start_query_execution() full example

import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="""
        SELECT
            customer_id,
            COUNT(*)        AS order_count,
            SUM(amount)     AS total_spent,
            MAX(created_at) AS last_order_at
        FROM orders_db.orders_bronze
        WHERE year = '2024' AND month = '03'
        GROUP BY customer_id
        ORDER BY total_spent DESC
        LIMIT 100
    """,
    QueryExecutionContext={
        "Database": "orders_db",   # default database — avoids db prefix in SQL
        "Catalog":  "AwsDataCatalog"  # use Glue Catalog (default)
    },
    ResultConfiguration={
        "OutputLocation": "s3://my-datalake/athena-results/",
        "EncryptionConfiguration": {
            "EncryptionOption": "SSE_S3"   # encrypt results at rest
        }
    },
    WorkGroup="data-engineering-wg"   # workgroup controls cost and permissions
)

query_execution_id = response["QueryExecutionId"]
print(f"Query submitted: {query_execution_id}")

🔑 WorkGroup Best Practice

Always specify a WorkGroup in production. WorkGroups let you set per-query byte-scan limits (e.g. max 10 GB per query), enforce output encryption, and track costs separately per team. Avoid the primary workgroup for production pipelines: by default it has no scan limit or enforced settings.
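A workgroup with these guardrails could be defined roughly as follows. This is a sketch under stated assumptions: workgroup_configuration is a hypothetical helper, and the 10 GB limit and SSE_S3 encryption simply mirror the examples in this section.

```python
def workgroup_configuration(output_location: str, scan_limit_gb: int = 10) -> dict:
    """Build the Configuration dict for athena.create_work_group()."""
    return {
        "ResultConfiguration": {
            "OutputLocation": output_location,
            "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
        },
        "EnforceWorkGroupConfiguration":   True,   # queries cannot override these settings
        "PublishCloudWatchMetricsEnabled": True,   # per-workgroup query metrics
        "BytesScannedCutoffPerQuery":      scan_limit_gb * 1024 ** 3,
    }


wg_config = workgroup_configuration("s3://my-datalake/athena-results/")
print(wg_config["BytesScannedCutoffPerQuery"])

# With a real client (not executed here):
#   athena.create_work_group(
#       Name="data-engineering-wg",
#       Configuration=wg_config,
#       Description="Guardrails for data engineering pipelines",
#   )
```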

🔄

get_query_execution() — Poll until SUCCEEDED / FAILED

Query State Machine

An Athena query is always in one of five states. Keep polling until it reaches one of the three terminal states: SUCCEEDED, FAILED, or CANCELLED.

State	Meaning	Action
QUEUED	Query is waiting for resources	Keep polling
RUNNING	Query is actively executing on Athena/Trino	Keep polling
SUCCEEDED	Query finished — results are on S3	Fetch results ✅
FAILED	Query errored — check StateChangeReason	Raise exception ❌
CANCELLED	Query was stopped by user or timeout	Raise exception ❌

get_query_execution() — Production polling pattern with backoff

Poll with exponential backoff — start with a 2-second sleep, increase gradually. Most short queries finish in under 10 seconds; large scans can take minutes. The response also contains useful metadata: bytes scanned (cost) and execution time.

python — wait_for_athena_query() with exponential backoff

import time
from botocore.exceptions import ClientError

def wait_for_athena_query(athena_client, query_execution_id: str,
                           max_wait_sec: int = 300) -> dict:
    """Poll Athena until terminal state. Returns execution detail on success."""
    terminal = {"SUCCEEDED", "FAILED", "CANCELLED"}
    sleep_sec = 2
    elapsed   = 0

    while elapsed < max_wait_sec:
        resp      = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
        execution = resp["QueryExecution"]
        state     = execution["Status"]["State"]
        print(f"[{elapsed}s] Athena query state: {state}")

        if state == "SUCCEEDED":
            stats = execution.get("Statistics", {})
            print(f"  ✅ Done in {stats.get('TotalExecutionTimeInMillis', 0) / 1000:.1f}s")
            print(f"  💰 Bytes scanned: {stats.get('DataScannedInBytes', 0):,}")
            return execution

        if state in ("FAILED", "CANCELLED"):
            reason = execution["Status"].get("StateChangeReason", "Unknown")
            raise RuntimeError(f"Athena query {state}: {reason}")

        # Exponential backoff: 2 → 4 → 8 → 16 → 30 → 30 → …
        time.sleep(sleep_sec)
        elapsed  += sleep_sec
        sleep_sec = min(sleep_sec * 2, 30)

    raise TimeoutError(f"Athena query did not complete within {max_wait_sec}s")

# ── Usage ─────────────────────────────────────────────────────────
execution_detail = wait_for_athena_query(athena, query_execution_id)
output_location  = execution_detail["ResultConfiguration"]["OutputLocation"]
print(f"Results at: {output_location}")
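The Statistics block returned on success can be turned into a rough cost figure. The sketch below assumes on-demand pricing of 5 USD per TB scanned with a 10 MB minimum per query; both numbers are assumptions to verify against the current Athena pricing for your region, and estimate_athena_cost is a hypothetical helper name, not part of boto3.

```python
def estimate_athena_cost(execution: dict,
                         usd_per_tb: float = 5.0,              # assumed on-demand price
                         min_billed_bytes: int = 10_000_000    # assumed 10 MB minimum
                         ) -> float:
    """Rough cost of one query, from the 'QueryExecution' dict of get_query_execution()."""
    scanned = execution.get("Statistics", {}).get("DataScannedInBytes", 0)
    if scanned == 0:
        return 0.0   # nothing scanned (e.g. DDL), treated as free in this simple model
    billed = max(scanned, min_billed_bytes)
    return billed / 1e12 * usd_per_tb


# Example with a hand-written execution dict (no AWS call needed)
sample = {"Statistics": {"DataScannedInBytes": 2_000_000_000_000}}
print(f"Estimated cost: ${estimate_athena_cost(sample):.2f}")
```

In a pipeline you would pass the dict returned by wait_for_athena_query() and log the estimate next to the bytes scanned.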

📥

get_query_results() — Fetch results and parse into DataFrame

Understanding the ResultSet Structure

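Athena returns results as a ResultSet object. For SELECT queries, the first row holds the column headers and the remaining rows hold the data. Each row is a list of {"VarCharValue": "..."} dicts, so every value arrives as a string regardless of the column type listed in ResultSetMetadata, and a NULL cell is an empty dict with no VarCharValue key. You need to parse this structure into something usable, such as a list of dicts or a Pandas DataFrame.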

json — raw ResultSet structure from Athena

{
  "ResultSet": {
    "Rows": [
      {"Data": [{"VarCharValue": "customer_id"}, {"VarCharValue": "order_count"}, {"VarCharValue": "total_spent"}]},
      {"Data": [{"VarCharValue": "CUST-001"},   {"VarCharValue": "12"},          {"VarCharValue": "1450.99"}]},
      {"Data": [{"VarCharValue": "CUST-002"},   {"VarCharValue": "5"},           {"VarCharValue": "230.50"}]}
    ],
    "ResultSetMetadata": {
      "ColumnInfo": [
        {"Name": "customer_id", "Type": "varchar"},
        {"Name": "order_count",  "Type": "bigint"},
        {"Name": "total_spent",  "Type": "double"}
      ]
    }
  }
}

get_query_results() with paginator — full parse pattern

A single query can return millions of rows, and get_query_results() returns at most 1,000 rows per call, so always use the paginator. The first page's first row is the header — skip it. Subsequent pages have no header row.

python — get_query_results() → list of dicts → DataFrame

import pandas as pd

def athena_results_to_df(athena_client, query_execution_id: str) -> pd.DataFrame:
    """Paginate Athena results and return a Pandas DataFrame."""
    paginator = athena_client.get_paginator("get_query_results")
    pages     = paginator.paginate(QueryExecutionId=query_execution_id)

    headers = None
    rows    = []

    for page_num, page in enumerate(pages):
        result_rows = page["ResultSet"]["Rows"]

        if page_num == 0:
            # First row of first page = column headers
            headers = [col["VarCharValue"] for col in result_rows[0]["Data"]]
            result_rows = result_rows[1:]  # skip header row

        for row in result_rows:
            # Each cell is {"VarCharValue": "..."} — extract the value
            values = [cell.get("VarCharValue", None) for cell in row["Data"]]
            rows.append(dict(zip(headers, values)))

    return pd.DataFrame(rows)

# ── Full end-to-end pattern ───────────────────────────────────────
# 1. Submit
qid = athena.start_query_execution(
    QueryString="SELECT customer_id, SUM(amount) AS total FROM orders_bronze GROUP BY 1",
    QueryExecutionContext={"Database": "orders_db"},
    ResultConfiguration={"OutputLocation": "s3://my-datalake/athena-results/"}
)["QueryExecutionId"]

# 2. Poll
wait_for_athena_query(athena, qid)

# 3. Fetch and parse
df = athena_results_to_df(athena, qid)
print(df.head())
print(f"Rows returned: {len(df)}")

✅ Pro Tip — Large Result Sets

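For queries returning millions of rows, don't use get_query_results(): paginating 1,000 rows at a time is slow at scale. Instead, read the result CSV directly from S3 using boto3 s3.get_object() or Pandas read_csv() (reading an s3:// URL with Pandas requires the s3fs package). For SELECT queries, Athena writes the results under the OutputLocation as a CSV file named {QueryExecutionId}.csv; the full path is returned in ResultConfiguration.OutputLocation by get_query_execution().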

python — read large Athena results directly from S3

import io

import boto3
import pandas as pd

# After SUCCEEDED — results are at OutputLocation/{QueryExecutionId}.csv
s3 = boto3.client("s3")

bucket      = "my-datalake"
key         = f"athena-results/{qid}.csv"

obj         = s3.get_object(Bucket=bucket, Key=key)
df_large    = pd.read_csv(io.BytesIO(obj["Body"].read()))
print(f"Loaded {len(df_large):,} rows from S3 directly")
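Rather than hard-coding the bucket and key, they can be derived from the OutputLocation string that get_query_execution() returns. A minimal sketch using only the standard library; split_s3_uri is a hypothetical helper name.

```python
from urllib.parse import urlparse

def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://my-datalake/athena-results/1234-abcd.csv")
print(bucket, key)
```

The resulting pair can be passed straight to s3.get_object(Bucket=bucket, Key=key).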

🛠️

stop_query_execution(), list_query_executions(), Named Queries

stop_query_execution() — Cancel a running query

Immediately cancels a QUEUED or RUNNING query. Useful in timeout handlers or when a runaway query is scanning too much data.

python — stop_query_execution()

athena.stop_query_execution(QueryExecutionId=query_execution_id)
print(f"Cancelled query: {query_execution_id}")

list_query_executions() — Audit recent queries with paginator

Returns QueryExecutionIds for recent queries in a workgroup. Useful for dashboards, cost auditing, and debugging. Note: it returns IDs only — you then call get_query_execution() to get details for each.

python — list recent query IDs and fetch their status

paginator = athena.get_paginator("list_query_executions")
query_ids = []

# PaginationConfig caps the total number of IDs returned across all pages
for page in paginator.paginate(
    WorkGroup="data-engineering-wg",
    PaginationConfig={"MaxItems": 20, "PageSize": 20},
):
    query_ids.extend(page["QueryExecutionIds"])

# Fetch details for each (batch_get_query_execution — up to 50 at once)
details = athena.batch_get_query_execution(QueryExecutionIds=query_ids[:20])
for q in details["QueryExecutions"]:
    state   = q["Status"]["State"]
    scanned = q.get("Statistics", {}).get("DataScannedInBytes", 0)
    print(f"{q['QueryExecutionId'][:8]}… {state:10s} {scanned/1e6:.1f} MB scanned")

create_named_query() and get_named_query() — Save reusable queries

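Named queries are SQL statements saved in Athena. They resemble saved snippets more than stored procedures: Athena stores the text but does not parameterize or schedule it for you. They're visible in the Athena console and can be retrieved by ID with get_named_query() (use list_named_queries() to discover the IDs) and then executed programmatically. The ${year}-style placeholders in the example are a plain-text convention that your own code substitutes before submitting the query. Named queries are useful for standard reconciliation queries, SLA checks, or data quality assertions that run on a schedule.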

python — create and execute a named query

# ── Create a named query ──────────────────────────────────────────
resp = athena.create_named_query(
    Name="daily-order-count-check",
    Description="Row count reconciliation — run daily after Bronze load",
    Database="orders_db",
    QueryString="""
        SELECT
            year, month, day,
            COUNT(*) AS row_count
        FROM orders_bronze
        WHERE year = '${year}' AND month = '${month}'
        GROUP BY year, month, day
        ORDER BY day
    """,
    WorkGroup="data-engineering-wg"
)
named_query_id = resp["NamedQueryId"]

# ── Retrieve and execute it ───────────────────────────────────────
named = athena.get_named_query(NamedQueryId=named_query_id)
sql   = named["NamedQuery"]["QueryString"]

# Replace template variables and submit
sql_resolved = sql.replace("${year}", "2024").replace("${month}", "03")
qid = athena.start_query_execution(
    QueryString=sql_resolved,
    QueryExecutionContext={"Database": "orders_db"},
    ResultConfiguration={"OutputLocation": "s3://my-datalake/athena-results/"}
)["QueryExecutionId"]
wait_for_athena_query(athena, qid)
df = athena_results_to_df(athena, qid)
print(df)
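Chained str.replace() calls silently leave a placeholder in place if one is forgotten. As an alternative sketch, the standard library's string.Template uses the same ${name} syntax and raises an error when a value is missing; render_sql is a hypothetical helper name.

```python
from string import Template

def render_sql(template_sql: str, **params: str) -> str:
    """Fill ${name} placeholders; raises KeyError if a placeholder has no value."""
    return Template(template_sql).substitute(**params)


sql_template = "SELECT COUNT(*) FROM orders_bronze WHERE year = '${year}' AND month = '${month}'"
print(render_sql(sql_template, year="2024", month="03"))
```

Values are inserted verbatim, so validate anything that comes from outside your pipeline before rendering. A literal dollar sign in the SQL must be written as $$ when using Template.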

🏭

Full Production Athena Pattern — Every DE Must Know

Complete reusable Athena query utility

This is the pattern you will use in almost every pipeline that needs to query the data lake programmatically — reconciliation checks, DQ row counts, audit queries, Silver→Gold transformation triggers via SQL.

python — AthenaRunner — complete production utility

import boto3, time, io, pandas as pd
from botocore.exceptions import ClientError

class AthenaRunner:
    """Production Athena query runner — submit, poll, parse."""

    def __init__(self, region: str, output_location: str, workgroup: str = "primary"):
        self.athena         = boto3.client("athena", region_name=region)
        self.s3             = boto3.client("s3", region_name=region)
        self.output_location= output_location   # e.g. "s3://bucket/athena-results/"
        self.workgroup      = workgroup

    def run(self, sql: str, database: str, max_wait: int = 300) -> pd.DataFrame:
        """Submit SQL, wait for completion, return DataFrame."""
        qid = self._submit(sql, database)
        self._wait(qid, max_wait)
        return self._fetch(qid)

    def _submit(self, sql: str, database: str) -> str:
        resp = self.athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": self.output_location},
            WorkGroup=self.workgroup
        )
        return resp["QueryExecutionId"]

    def _wait(self, qid: str, max_wait: int):
        sleep, elapsed = 2, 0
        while elapsed < max_wait:
            status = self.athena.get_query_execution(QueryExecutionId=qid)
            state  = status["QueryExecution"]["Status"]["State"]
            if state == "SUCCEEDED":
                return
            if state in ("FAILED", "CANCELLED"):
                reason = status["QueryExecution"]["Status"].get("StateChangeReason", "?")
                raise RuntimeError(f"Athena {state}: {reason}")
            time.sleep(sleep); elapsed += sleep; sleep = min(sleep * 2, 30)
        raise TimeoutError(f"Athena query {qid} timed out after {max_wait}s")

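    def _fetch(self, qid: str) -> pd.DataFrame:
        # Read the result CSV directly from S3 (faster than paginating for large results).
        # Ask Athena for the exact path: a workgroup setting can override
        # self.output_location, and the configured prefix may lack a trailing slash.
        execution   = self.athena.get_query_execution(QueryExecutionId=qid)
        location    = execution["QueryExecution"]["ResultConfiguration"]["OutputLocation"]
        bucket, key = location.replace("s3://", "", 1).split("/", 1)
        obj         = self.s3.get_object(Bucket=bucket, Key=key)
        return pd.read_csv(io.BytesIO(obj["Body"].read()))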


# ── Usage ─────────────────────────────────────────────────────────
runner = AthenaRunner(
    region="us-east-1",
    output_location="s3://my-datalake/athena-results/",
    workgroup="data-engineering-wg"
)

# Daily reconciliation check
df = runner.run(
    sql="SELECT COUNT(*) AS cnt, SUM(amount) AS total FROM orders_bronze WHERE year='2024' AND month='03' AND day='15'",
    database="orders_db"
)
print(df)

# Gate: if row count is zero, halt the pipeline
if int(df["cnt"].iloc[0]) == 0:
    raise RuntimeError("No rows found for 2024-03-15 — pipeline halted")

✅ Where You Use This in Production

Row count reconciliation — compare source DB count vs Athena table count after every load
DQ assertions — SELECT COUNT(*) FROM orders WHERE order_id IS NULL — assert 0
SLA freshness check — SELECT MAX(created_at) FROM orders — assert within 24 hours
Pipeline triggers — run Athena SQL to create Gold summary tables after Silver is ready
Ad-hoc audit queries — triggered from Lambda or Airflow on a schedule

29.30.8 — BOTO3 DEEP DIVE

EMR APIs

Amazon EMR (Elastic MapReduce) is AWS's managed platform for running large-scale Spark jobs. It comes in three deployment modes, each with its own boto3 client: classic EMR clusters on EC2 (client "emr", you manage nodes), EMR Serverless (client "emr-serverless", zero cluster management), and EMR on EKS (client "emr-containers"). You'll use these APIs to spin up clusters, submit Spark steps, poll for completion, and tear down, all programmatically from Airflow, Lambda, or a control script.

🖥️

Cluster Management — run_job_flow(), describe_cluster(), list_clusters()

How EMR Cluster Lifecycle Works

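An EMR cluster goes through a predictable lifecycle: STARTING → BOOTSTRAPPING → RUNNING (executing steps) or WAITING (idle, ready for steps) → TERMINATING → TERMINATED. A cluster created without steps typically goes from BOOTSTRAPPING to WAITING, then moves between RUNNING and WAITING as steps arrive and finish. You create it with run_job_flow(), submit Spark steps to it, poll its state, and optionally auto-terminate it after all steps complete. If KeepJobFlowAliveWhenNoSteps=False, EMR shuts the cluster down as soon as no steps remain, which is great for cost control.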

run_job_flow() → STARTING / BOOTSTRAPPING → WAITING → add_job_flow_steps() → RUNNING step → TERMINATED

📦 Analogy

Think of EMR like renting a supercomputer farm. run_job_flow() is you calling the rental company and saying "set up 10 machines with Spark installed." Once they're ready (WAITING), you send your job (add steps). When the job finishes, the machines are returned (TERMINATED). You only pay while the machines are running.

run_job_flow() — Spin up an EMR Cluster

run_job_flow() is the main API to create an EMR cluster. Key parameters: ReleaseLabel (EMR version), Instances (node types and counts), Applications (Spark, Hadoop etc.), JobFlowRole (EC2 instance profile), ServiceRole (EMR service IAM role), and AutoTerminationPolicy / KeepJobFlowAliveWhenNoSteps for cost control.

python — run_job_flow() full cluster creation

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="de-spark-pipeline-cluster",
    ReleaseLabel="emr-6.15.0",          # EMR version — includes Spark 3.4
    LogUri="s3://my-datalake/emr-logs/", # all cluster logs go here

    Instances={
        "MasterInstanceType":  "m5.xlarge",
        "SlaveInstanceType":   "m5.2xlarge",
        "InstanceCount":       5,          # 1 master + 4 core nodes
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate when done!
        "TerminationProtected": False,
        "Ec2SubnetId": "subnet-0abc123def456",  # private subnet
        "EmrManagedMasterSecurityGroup": "sg-master-xxx",
        "EmrManagedSlaveSecurityGroup":  "sg-slave-xxx",
    },

    Applications=[
        {"Name": "Spark"},
        {"Name": "Hadoop"},
        {"Name": "Hive"},        # for Glue metastore access
    ],

    Configurations=[
        {
            "Classification": "spark-defaults",
            "Properties": {
                "spark.sql.shuffle.partitions": "200",
                "spark.executor.memory":         "6g",
                "spark.executor.cores":          "2",
                "spark.dynamicAllocation.enabled":"true",
            }
        },
        {
            "Classification": "hive-site",
            "Properties": {
                "hive.metastore.client.factory.class":
                    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            }   # use Glue as the Hive metastore; Spark SQL reads the same property from "spark-hive-site"
        }
    ],

    BootstrapActions=[
        {
            "Name": "Install Python Libraries",
            "ScriptBootstrapAction": {
                "Path": "s3://my-datalake/bootstrap/install_deps.sh",
                "Args": []
            }
        }
    ],

    JobFlowRole="EMR_EC2_DefaultRole",   # IAM instance profile for EC2 nodes
    ServiceRole="EMR_DefaultRole",        # IAM role for EMR service itself

    Tags=[
        {"Key": "Project",     "Value": "DataPlatform"},
        {"Key": "Environment", "Value": "prod"},
        {"Key": "CostCenter",  "Value": "DE-Team"},
    ],

    VisibleToAllUsers=True,
)

cluster_id = response["JobFlowId"]   # e.g. "j-2AXXXXXXGAPLF"
print(f"Cluster created: {cluster_id}")

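🔑 Set KeepJobFlowAliveWhenNoSteps=False for Transient Clusters

In production batch pipelines, set KeepJobFlowAliveWhenNoSteps=False and pass the pipeline's steps in the Steps parameter of the same run_job_flow() call. The cluster then auto-terminates after the last step finishes, so you never pay for an idle cluster. Be aware that a cluster created with KeepJobFlowAliveWhenNoSteps=False and no steps shuts down as soon as it finishes starting, so it never stays in WAITING. If you plan to add steps later with add_job_flow_steps(), create the cluster with KeepJobFlowAliveWhenNoSteps=True and terminate it explicitly. A cluster left idle can silently cost hundreds of dollars overnight.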
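The sketch below shows one way to build the run_job_flow() arguments for a transient cluster, with the steps queued at creation time so that KeepJobFlowAliveWhenNoSteps=False behaves as intended. The helper name transient_cluster_request, the instance sizes, and the S3 paths are assumptions for illustration.

```python
def transient_cluster_request(name: str, steps: list, log_uri: str) -> dict:
    """Build run_job_flow() kwargs for a cluster that runs `steps` and then terminates."""
    if not steps:
        raise ValueError("A transient cluster needs at least one step")
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",
        "LogUri": log_uri,
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.2xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,   # terminate after the last step
        },
        "Applications": [{"Name": "Spark"}],
        "Steps": steps,                             # queued at creation time
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


step = {
    "Name": "Silver Transform",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "--deploy-mode", "cluster",
                 "s3://my-datalake/code/silver_transform.py"],
    },
}
request = transient_cluster_request("nightly-silver", [step], "s3://my-datalake/emr-logs/")
# With a real client: cluster_id = emr.run_job_flow(**request)["JobFlowId"]
print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"], len(request["Steps"]))
```

Building the request as a plain dict keeps it easy to unit-test and to log before the cluster is actually created.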

describe_cluster() — Check Cluster State

After creating a cluster, use describe_cluster() to poll its state. The key field is Cluster['Status']['State']. Valid states: STARTING, BOOTSTRAPPING, RUNNING, WAITING, TERMINATING, TERMINATED, TERMINATED_WITH_ERRORS.

python — describe_cluster() + polling loop

import time

def wait_for_cluster_ready(emr_client, cluster_id, poll_interval=30):
    """Poll until cluster reaches WAITING (ready) or starts shutting down."""
    # TERMINATING is included so a cluster that is already shutting down fails fast
    failed_states = {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}

    while True:
        response = emr_client.describe_cluster(ClusterId=cluster_id)
        state    = response["Cluster"]["Status"]["State"]
        reason   = response["Cluster"]["Status"].get("StateChangeReason", {})  # dict: Code, Message

        print(f"Cluster {cluster_id} state: {state}")

        if state == "WAITING":
            print("✅ Cluster is ready — WAITING for steps.")
            return True

        if state in failed_states:
            print(f"❌ Cluster failed. Reason: {reason}")
            raise RuntimeError(f"Cluster {cluster_id} terminated unexpectedly: {reason}")

        time.sleep(poll_interval)   # check every 30 seconds

# Usage
wait_for_cluster_ready(emr, cluster_id)

list_clusters() with Paginator — Filter by State

Use list_clusters() with a paginator to list all clusters, optionally filtered by state. Useful for finding all RUNNING or WAITING clusters to monitor cost or add steps programmatically.

python — list_clusters() with paginator

# List all active clusters (RUNNING or WAITING)
paginator = emr.get_paginator("list_clusters")
pages = paginator.paginate(ClusterStates=["RUNNING", "WAITING"])

for page in pages:
    for cluster in page["Clusters"]:
        print(f"ID: {cluster['Id']}  Name: {cluster['Name']}  State: {cluster['Status']['State']}")
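For cost monitoring, the cluster summaries returned by list_clusters() include a creation timestamp under Status.Timeline.CreationDateTime. A minimal sketch that flags clusters older than a threshold; clusters_older_than is a hypothetical helper, and the sample data stands in for real API output.

```python
from datetime import datetime, timedelta, timezone

def clusters_older_than(clusters: list, max_age: timedelta, now: datetime = None) -> list:
    """Return IDs of clusters whose creation time is more than max_age ago."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for cluster in clusters:
        created = cluster["Status"]["Timeline"]["CreationDateTime"]
        if now - created > max_age:
            stale.append(cluster["Id"])
    return stale


# Hand-written summaries shaped like the items in page["Clusters"]
now = datetime(2024, 3, 15, 12, 0, tzinfo=timezone.utc)
clusters = [
    {"Id": "j-OLD", "Status": {"Timeline": {"CreationDateTime": now - timedelta(hours=30)}}},
    {"Id": "j-NEW", "Status": {"Timeline": {"CreationDateTime": now - timedelta(hours=2)}}},
]
print(clusters_older_than(clusters, timedelta(hours=24), now=now))
```

Feeding it the RUNNING and WAITING clusters from the paginator gives a simple list of candidates to investigate or terminate.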

terminate_job_flows() — Shut Down Clusters

If you used KeepJobFlowAliveWhenNoSteps=True (e.g. for a long-running cluster), you must explicitly terminate it when done. terminate_job_flows() accepts a list of cluster IDs — useful for bulk cleanup.

python — terminate_job_flows()

# Terminate a single cluster
emr.terminate_job_flows(JobFlowIds=[cluster_id])
print(f"Termination requested for {cluster_id}")

# Terminate multiple clusters in one call
emr.terminate_job_flows(
    JobFlowIds=["j-CLUSTER1", "j-CLUSTER2", "j-CLUSTER3"]
)

⚠️ Termination Protection

If TerminationProtected=True was set on the cluster, terminate_job_flows() is rejected with a validation error and the cluster keeps running. You must first call set_termination_protection(JobFlowIds=[id], TerminationProtected=False) before terminating. Keep TerminationProtected=False for auto-managed clusters.
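The two calls can be combined into one cleanup helper. This is a sketch under the assumption that you really do want to override protection; force_terminate is a hypothetical name, and the fake client only records calls so the example runs without AWS.

```python
def force_terminate(emr_client, cluster_ids: list) -> None:
    """Disable termination protection, then request termination."""
    emr_client.set_termination_protection(
        JobFlowIds=cluster_ids, TerminationProtected=False
    )
    emr_client.terminate_job_flows(JobFlowIds=cluster_ids)


# Stand-in client that records the calls it receives
class _FakeEmr:
    def __init__(self):
        self.calls = []

    def set_termination_protection(self, JobFlowIds, TerminationProtected):
        self.calls.append(("set_termination_protection", tuple(JobFlowIds), TerminationProtected))

    def terminate_job_flows(self, JobFlowIds):
        self.calls.append(("terminate_job_flows", tuple(JobFlowIds)))


fake_emr = _FakeEmr()
force_terminate(fake_emr, ["j-CLUSTER1"])
print(fake_emr.calls)
```

Termination protection usually exists for a reason, so reserve a helper like this for clusters your own pipeline created.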

⚡

Steps — add_job_flow_steps(), describe_step(), list_steps()

What is an EMR Step?

A Step in EMR is a unit of work submitted to the cluster — in practice, this is almost always a spark-submit command. Steps run sequentially by default. Each step has a state: PENDING → RUNNING → COMPLETED / FAILED / CANCELLED. The ActionOnFailure field controls what happens when a step fails — either continue to the next step or terminate the entire cluster.

📦 Analogy

A cluster is like a factory floor with machines. Steps are the work orders you send to the factory. Each work order (step) runs one at a time. If a step fails, you can choose to either skip it and continue the next order (CONTINUE) or shut down the entire factory (TERMINATE_CLUSTER).

add_job_flow_steps() — Submit a Spark Step

The HadoopJarStep for Spark always uses command-runner.jar as the Jar, and passes the actual spark-submit command as Args. This is the standard pattern for running PySpark scripts on EMR.

python — add_job_flow_steps() — spark-submit step

response = emr.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[
        {
            "Name": "Silver Layer Transform — Orders",
            "ActionOnFailure": "CONTINUE",   # or "TERMINATE_CLUSTER"
            "HadoopJarStep": {
                "Jar": "command-runner.jar",   # always this value for spark-submit
                "Args": [
                    "spark-submit",
                    "--deploy-mode",  "cluster",
                    "--master",       "yarn",
                    "--conf", "spark.sql.shuffle.partitions=200",
                    "--conf", "spark.executor.memory=6g",
                    "--conf", "spark.executor.cores=2",
                    "--py-files", "s3://my-datalake/code/utils.zip",
                    "s3://my-datalake/code/orders_silver_transform.py",
                    # Any extra args become sys.argv in your script:
                    "--env",        "prod",
                    "--run-date",   "2024-03-15",
                    "--source-path","s3://my-datalake/bronze/orders/",
                    "--target-path","s3://my-datalake/silver/orders/",
                ]
            }
        }
    ]
)

step_ids = response["StepIds"]   # StepIds is already a list of step ID strings
print(f"Steps submitted: {step_ids}")   # e.g. ['s-2XXXXXXHXXXXXX']

🔑 ActionOnFailure choices

CONTINUE — step fails, next step still runs (good for independent steps).
TERMINATE_CLUSTER — step fails, the whole cluster shuts down immediately (good for dependent pipelines where failure of step 1 makes step 2 meaningless).
CANCEL_AND_WAIT — step fails, remaining steps are cancelled but cluster stays alive (good for debugging).

Submitting Multiple Steps at Once

You can pass multiple step definitions in one add_job_flow_steps() call. EMR queues them and runs them sequentially. This is the cleanest pattern for a multi-stage pipeline (Bronze → Silver → Gold).

python — multi-step pipeline: Bronze → Silver → Gold

def make_step(name, script_s3_path, extra_args=None):
    args = [
        "spark-submit", "--deploy-mode", "cluster",
        "--master", "yarn",
        script_s3_path
    ]
    if extra_args:
        args.extend(extra_args)
    return {
        "Name": name,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args}
    }

response = emr.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[
        make_step("Bronze Ingest",   "s3://my-bucket/code/bronze_ingest.py"),
        make_step("Silver Transform", "s3://my-bucket/code/silver_transform.py"),
        make_step("Gold Aggregate",   "s3://my-bucket/code/gold_aggregate.py"),
    ]
)
step_ids = response["StepIds"]   # StepIds is already a list of step ID strings
print(f"3 steps submitted: {step_ids}")

describe_step() — Poll Step State

Use describe_step() to poll a specific step's state. The terminal states are COMPLETED, FAILED, and CANCELLED. Always poll with a sleep to avoid throttling.

python — describe_step() polling at a fixed interval

import time

def wait_for_step(emr_client, cluster_id, step_id, poll_interval=30):
    """Poll until step reaches a terminal state. Returns True on COMPLETED."""
    # INTERRUPTED is also terminal: the step was stopped after an earlier step failed
    failure_states = {"FAILED", "CANCELLED", "INTERRUPTED"}

    while True:
        response = emr_client.describe_step(
            ClusterId=cluster_id,
            StepId=step_id
        )
        step   = response["Step"]
        state  = step["Status"]["State"]
        name   = step["Name"]
        reason = step["Status"].get("FailureDetails", {})

        print(f"Step '{name}' ({step_id}): {state}")

        if state == "COMPLETED":
            print("✅ Step completed successfully.")
            return True

        if state in failure_states:
            print(f"❌ Step failed: {reason}")
            raise RuntimeError(f"EMR step {step_id} ended with state {state}: {reason}")

        time.sleep(poll_interval)

# Poll the last submitted step
wait_for_step(emr, cluster_id, step_ids[-1])

list_steps() with Paginator — All Steps History

Use list_steps() with a paginator to get the history of all steps on a cluster. Filter by StepStates to narrow results.

python — list_steps() with paginator

paginator = emr.get_paginator("list_steps")
pages = paginator.paginate(
    ClusterId=cluster_id,
    StepStates=["COMPLETED", "FAILED", "RUNNING"]
)

for page in pages:
    for step in page["Steps"]:
        print(
            f"StepId: {step['Id']}  "
            f"Name: {step['Name']}  "
            f"State: {step['Status']['State']}"
        )

cancel_steps() — Cancel Pending/Running Steps

If you need to abort a running or queued step (e.g. runaway job), use cancel_steps(). It accepts a list of step IDs.

python — cancel_steps()

emr.cancel_steps(
    ClusterId=cluster_id,
    StepIds=["s-STEPID1", "s-STEPID2"],
    StepCancellationOption="SEND_INTERRUPT"  # or "TERMINATE_PROCESS"
)
print("Steps cancellation requested.")

Using the Built-in EMR Waiter

Boto3 has built-in waiters for EMR. The step_complete waiter polls describe_step() internally until the step reaches COMPLETED, and raises a WaiterError if the step ends in FAILED or CANCELLED or if the attempts run out. It's simpler than a manual loop but less customizable.

python — EMR step_complete waiter

# Built-in waiter: defaults are a 30-second delay and 60 attempts (30 min); overridden below
waiter = emr.get_waiter("step_complete")
waiter.wait(
    ClusterId=cluster_id,
    StepId=step_ids[0],
    WaiterConfig={
        "Delay":      30,    # seconds between polls
        "MaxAttempts":120,   # 120 × 30s = 60 minutes max
    }
)
print("Step completed (waiter returned).")
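Since a waiter's time budget is simply Delay multiplied by MaxAttempts, it can be convenient to derive WaiterConfig from a timeout. A small sketch; waiter_config is a hypothetical helper name.

```python
import math

def waiter_config(timeout_sec: int, delay_sec: int = 30) -> dict:
    """Translate a total time budget into a boto3 WaiterConfig dict."""
    attempts = max(1, math.ceil(timeout_sec / delay_sec))
    return {"Delay": delay_sec, "MaxAttempts": attempts}


print(waiter_config(3600))   # one-hour budget with the default 30-second delay
```

The result can be passed as WaiterConfig=waiter_config(3600) in the waiter.wait() call.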

☁️ Available EMR Waiters

cluster_running — waits for cluster state = RUNNING or WAITING
cluster_terminated — waits for cluster state = TERMINATED
step_complete — waits for step state = COMPLETED

Note: cluster_running returns as soon as the cluster is either RUNNING or WAITING, so it cannot tell an idle cluster from a busy one. If you need the WAITING state specifically (cluster idle and ready for steps), use a manual polling loop as shown above.

☁️

EMR Serverless — create_application(), start_job_run(), get_job_run()

What is EMR Serverless and When to Use It

EMR Serverless removes all cluster management overhead. There are no master nodes, no core nodes, no bootstrap actions — you just submit a Spark job and AWS figures out compute. You pay only for the vCPU-hours and GB-hours your job actually uses. It's ideal for sporadic or unpredictable workloads where you don't want an idle cluster burning money.

Feature	Classic EMR	EMR Serverless
Cluster startup time	5–10 min	30–60 sec (pre-init)
Cluster management	You manage	AWS manages
Cost model	Per-hour (instance)	Per-use (vCPU·s)
Custom bootstrap	Yes	No (use custom images)
Best for	Long-running / predictable	Sporadic / variable
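The two cost models in the table can be compared with simple arithmetic. The sketch below uses round placeholder rates, not real AWS prices; substitute the current rates for your region. Both function names are hypothetical.

```python
def serverless_cost(vcpu_hours: float, gb_hours: float,
                    usd_per_vcpu_hour: float = 0.05,    # placeholder rate
                    usd_per_gb_hour: float = 0.006      # placeholder rate
                    ) -> float:
    """EMR Serverless: pay for the vCPU-hours and GB-hours actually consumed."""
    return vcpu_hours * usd_per_vcpu_hour + gb_hours * usd_per_gb_hour


def classic_cost(instance_count: int, hours: float, usd_per_instance_hour: float) -> float:
    """Classic EMR: pay per instance for as long as the cluster is up."""
    return instance_count * hours * usd_per_instance_hour


# A job that uses 20 vCPU and 80 GB of memory for 30 minutes
job_serverless = serverless_cost(vcpu_hours=20 * 0.5, gb_hours=80 * 0.5)

# The same job on a 5-node cluster that stays up for a full hour (startup + idle time)
job_classic = classic_cost(instance_count=5, hours=1.0, usd_per_instance_hour=0.50)

print(f"Serverless: ${job_serverless:.2f}  Classic: ${job_classic:.2f}")
```

The comparison flips for clusters that stay busy all day, which is why the table lists classic EMR as the better fit for long-running, predictable workloads.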

create_application() and start_application()

With EMR Serverless, you first create an Application (a reusable Spark runtime environment), then submit job runs to it. The application can be pre-initialized to reduce cold-start latency.

python — create and start an EMR Serverless application

emr_serverless = boto3.client("emr-serverless", region_name="us-east-1")

# Step 1: Create application (one-time setup, reuse across job runs)
app_response = emr_serverless.create_application(
    name="de-spark-app",
    releaseLabel="emr-6.15.0",
    type="SPARK",
    autoStartConfiguration={"enabled": True},   # auto-start when job submitted
    autoStopConfiguration={
        "enabled":         True,
        "idleTimeoutMinutes": 15   # stop if idle for 15 min
    },
    initialCapacity={           # pre-warm workers to reduce latency
        "DRIVER": {
            "workerCount": 1,
            "workerConfiguration": {"cpu":"2vCPU", "memory":"4GB"}
        },
        "EXECUTOR": {
            "workerCount": 5,
            "workerConfiguration": {"cpu":"4vCPU", "memory":"16GB"}
        }
    }
)
application_id = app_response["applicationId"]
print(f"Application created: {application_id}")

# Step 2: Start the application (if not auto-starting)
emr_serverless.start_application(applicationId=application_id)
print("Application started.")

start_job_run() — Submit a Spark Job

Once the application exists, submit Spark jobs with start_job_run(). The key section is jobDriver.sparkSubmit where you provide the script S3 path, entry point arguments, and Spark configuration overrides.

python — start_job_run() — Serverless Spark job

job_response = emr_serverless.start_job_run(
    applicationId=application_id,
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessExecutionRole",
    name="silver-orders-transform-2024-03-15",

    jobDriver={
        "sparkSubmit": {
            "entryPoint":     "s3://my-datalake/code/silver_transform.py",
            "entryPointArguments": [
                "--run-date", "2024-03-15",
                "--env",      "prod"
            ],
            "sparkSubmitParameters": (
                "--conf spark.executor.cores=4 "
                "--conf spark.executor.memory=16g "
                "--conf spark.sql.shuffle.partitions=200 "
                "--py-files s3://my-datalake/code/utils.zip"
            )
        }
    },

    configurationOverrides={
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://my-datalake/emr-serverless-logs/"
            }
        }
    },

    tags={"Project": "DataPlatform", "Env": "prod"}
)

job_run_id = job_response["jobRunId"]
print(f"Job submitted: {job_run_id}")
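Because sparkSubmitParameters is a single space-separated string, it is easy to drop a space when concatenating by hand. A sketch that builds the string from a dict of Spark settings; build_spark_submit_parameters is a hypothetical helper name.

```python
def build_spark_submit_parameters(conf: dict, py_files: list = None) -> str:
    """Render Spark settings as the single string expected by sparkSubmitParameters."""
    parts = [f"--conf {key}={value}" for key, value in conf.items()]
    if py_files:
        parts.append("--py-files " + ",".join(py_files))
    return " ".join(parts)


params = build_spark_submit_parameters(
    {"spark.executor.cores": 4, "spark.executor.memory": "16g"},
    py_files=["s3://my-datalake/code/utils.zip"],
)
print(params)
```

The returned string can be used directly as the sparkSubmitParameters value in start_job_run().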

get_job_run() — Poll Job State

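Poll get_job_run() to track your serverless job. States: SUBMITTED → PENDING → SCHEDULED → RUNNING → SUCCESS / FAILED / CANCELLED. CANCELLING is a transitional state that appears after cancel_job_run() and before CANCELLED, so keep polling through it.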

python — poll get_job_run() until terminal state

import time

def wait_for_serverless_job(client, app_id, job_run_id, poll_interval=20):
    terminal_states = {"SUCCESS", "FAILED", "CANCELLED"}

    while True:
        resp  = client.get_job_run(applicationId=app_id, jobRunId=job_run_id)
        state = resp["jobRun"]["state"]
        print(f"Job {job_run_id}: {state}")

        if state == "SUCCESS":
            print("✅ Serverless job succeeded.")
            return True

        if state in {"FAILED", "CANCELLED"}:
            details = resp["jobRun"].get("stateDetails", "No details")
            raise RuntimeError(f"Serverless job {job_run_id} ended: {state} — {details}")

        time.sleep(poll_interval)

wait_for_serverless_job(emr_serverless, application_id, job_run_id)

cancel_job_run() and stop_application()

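Use cancel_job_run() to abort a running serverless job. Use stop_application() to stop the application (releases all pre-initialized capacity). Use delete_application() to permanently remove the application. All three calls are asynchronous and order-dependent: an application can be stopped only once its jobs have finished or been cancelled, and deleted only once it is in the STOPPED (or CREATED) state, so in real code wait for each state change before issuing the next call.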

python — cancel, stop, delete

# Cancel a running job
emr_serverless.cancel_job_run(
    applicationId=application_id,
    jobRunId=job_run_id
)

# Stop application (releases pre-init capacity but keeps config)
emr_serverless.stop_application(applicationId=application_id)

# Delete application permanently
emr_serverless.delete_application(applicationId=application_id)

🏗️

Full Production Pattern — Spin Up → Submit Steps → Poll → Terminate

Complete End-to-End EMR Pipeline (Classic)

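This is the full pattern you'll use in Airflow, Lambda, or a control script: create cluster → wait for WAITING state → submit steps → poll each step → alert on failure → terminate cluster. The CloudWatch client is created so that audit metrics can be published at the end of a run; the metric call itself is not part of this listing.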

python — complete EMR batch pipeline orchestration

import boto3, time, json
from botocore.exceptions import ClientError

emr = boto3.client("emr",         region_name="us-east-1")
cw  = boto3.client("cloudwatch",   region_name="us-east-1")
sns = boto3.client("sns",          region_name="us-east-1")

ALERT_TOPIC = "arn:aws:sns:us-east-1:123456789012:de-alerts"

def run_emr_pipeline(run_date):
    cluster_id = None
    try:
        # ── 1. Create cluster ─────────────────────
        resp = emr.run_job_flow(
            Name=f"pipeline-{run_date}",
            ReleaseLabel="emr-6.15.0",
            LogUri="s3://my-datalake/emr-logs/",
            Instances={
                "MasterInstanceType": "m5.xlarge",
                "SlaveInstanceType":  "m5.2xlarge",
                "InstanceCount":      4,
                "KeepJobFlowAliveWhenNoSteps": True,  # we'll terminate manually
                "Ec2SubnetId": "subnet-0abc123"
            },
            Applications=[{"Name": "Spark"}],
            JobFlowRole="EMR_EC2_DefaultRole",
            ServiceRole="EMR_DefaultRole",
        )
        cluster_id = resp["JobFlowId"]
        print(f"Cluster created: {cluster_id}")

        # ── 2. Wait for cluster to reach WAITING ──
        while True:
            state = emr.describe_cluster(
                ClusterId=cluster_id
            )["Cluster"]["Status"]["State"]
            print(f"  Cluster state: {state}")
            if state == "WAITING":   break
            if state in {"TERMINATED", "TERMINATED_WITH_ERRORS"}:
                raise RuntimeError(f"Cluster failed during startup: {state}")
            time.sleep(30)

        # ── 3. Submit Spark steps ─────────────────
        step_resp = emr.add_job_flow_steps(
            JobFlowId=cluster_id,
            Steps=[
                {
                    "Name": f"Silver Transform {run_date}",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar":  "command-runner.jar",
                        "Args": [
                            "spark-submit", "--deploy-mode", "cluster",
                            "s3://my-datalake/code/silver_transform.py",
                            "--run-date", run_date
                        ]
                    }
                }
            ]
        )
        step_id = step_resp["StepIds"][0]
        print(f"Step submitted: {step_id}")

        # ── 4. Poll step ──────────────────────────
        while True:
            step_state = emr.describe_step(
                ClusterId=cluster_id, StepId=step_id
            )["Step"]["Status"]["State"]
            print(f"  Step state: {step_state}")
            if step_state == "COMPLETED": break
            if step_state in {"FAILED", "CANCELLED"}:
                raise RuntimeError(f"Step {step_id} ended: {step_state}")
            time.sleep(30)

        print("✅ Pipeline completed successfully.")

    except Exception as e:   # ClientError subclasses Exception, so boto3 errors are covered too
        # ── 5. Alert on failure ───────────────────
        sns.publish(
            TopicArn=ALERT_TOPIC,
            Subject=f"[FAILURE] EMR pipeline {run_date}",
            Message=str(e)
        )
        raise

    finally:
        # ── 6. Always terminate cluster ───────────
        if cluster_id:
            emr.terminate_job_flows(JobFlowIds=[cluster_id])
            print(f"Cluster {cluster_id} termination requested.")

run_emr_pipeline("2024-03-15")

🔑 Always Terminate in a finally Block

Always wrap your EMR orchestration in a try/finally block and terminate the cluster in the finally. If you don't, a failed pipeline leaves an idle cluster running indefinitely, silently burning money. This is one of the most common and expensive production mistakes.

Spot Instances — Cost Reduction Pattern

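Task nodes on EMR are stateless (no HDFS data), which makes them a good fit for Spot Instances. Use On-Demand for master and core nodes (stable), and Spot for task nodes (can be interrupted without data loss). On task-heavy clusters this can reduce cluster cost by 60–80%; the actual saving depends on instance type, region, and how much of the cluster runs on Spot.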

python — Mixed On-Demand + Spot instance groups

response = emr.run_job_flow(
    Name="cost-optimized-cluster",
    ReleaseLabel="emr-6.15.0",
    LogUri="s3://my-datalake/emr-logs/",

    Instances={
        "InstanceGroups": [
            {  # Master — On-Demand (never interrupt master)
                "Name":          "Master",
                "InstanceRole":  "MASTER",
                "InstanceType":  "m5.xlarge",
                "InstanceCount": 1,
                "Market":        "ON_DEMAND",
            },
            {  # Core — On-Demand (stores HDFS data, can't be interrupted)
                "Name":          "Core",
                "InstanceRole":  "CORE",
                "InstanceType":  "m5.2xlarge",
                "InstanceCount": 2,
                "Market":        "ON_DEMAND",
            },
            {  # Task — Spot (stateless, safe to interrupt)
                "Name":          "Task-Spot",
                "InstanceRole":  "TASK",
                "InstanceType":  "m5.2xlarge",
                "InstanceCount": 6,
                "Market":        "SPOT",
                "BidPrice":      "0.10",    # max bid per hour
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
        "Ec2SubnetId": "subnet-0abc123"
    },

    Applications=[{"Name": "Spark"}],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

29.30.9 — BOTO3 DEEP DIVE

Lambda APIs

AWS Lambda lets you run Python code without managing servers. In data engineering, Lambda is the glue between services — it reacts to S3 file arrivals, triggers Glue jobs, invokes EMR steps, writes audit records to DynamoDB, and sends alerts via SNS. The boto3 Lambda API lets you invoke functions, deploy code updates, manage configuration, and set up triggers — all programmatically.

⚡

invoke() — Call a Lambda Function

How invoke() Works — Sync vs Async

invoke() calls a Lambda function. The key parameter is InvocationType:

• RequestResponse (default) — synchronous. Your code waits until Lambda finishes and returns the response payload. Use this when you need the result (e.g. a validation function that returns pass/fail).
• Event — asynchronous. boto3 returns immediately after queuing the invocation; Lambda runs in the background. Use this when you just want to trigger something and don't need the result (e.g. fire-and-forget notification Lambda).

📦 Analogy

RequestResponse is like calling a colleague on the phone and waiting on hold until they answer. Event is like leaving a voicemail — you hang up immediately and they call back later (or not — you won't know).

python — invoke() synchronous (RequestResponse)

import boto3
import json

lambda_client = boto3.client("lambda", region_name="us-east-1")

# ── Synchronous invoke — wait for result ──
response = lambda_client.invoke(
    FunctionName   = "validate-orders-fn",        # function name or ARN
    InvocationType = "RequestResponse",           # sync — wait for response
    Payload        = json.dumps({                 # must be bytes or JSON string
        "bucket"   : "my-datalake",
        "key"      : "bronze/orders/2024/03/15/orders.parquet",
        "run_date" : "2024-03-15"
    }).encode("utf-8")
)

# ── Parse the response ──
status_code = response["StatusCode"]             # HTTP 200 = Lambda invoked OK
function_error = response.get("FunctionError")  # "Handled" or "Unhandled" if Lambda threw

payload = json.loads(response["Payload"].read()) # read the StreamingBody

print(f"HTTP Status : {status_code}")           # 200
print(f"FunctionError: {function_error}")        # None if success
print(f"Lambda result: {payload}")
# Example payload: {"status": "passed", "row_count": 48291, "dq_score": 99.2}

if function_error:
    raise RuntimeError(f"Lambda function error: {payload}")

⚠️ StatusCode 200 ≠ Lambda Success

response["StatusCode"] == 200 just means Lambda was invoked successfully — not that your function ran without errors. Always check response.get("FunctionError"). If it's "Handled" or "Unhandled", the function threw an exception. The actual error is in the Payload.

python — invoke() asynchronous (Event)

import boto3, json

lambda_client = boto3.client("lambda", region_name="us-east-1")

# ── Async invoke — fire and forget ──
response = lambda_client.invoke(
    FunctionName   = "send-pipeline-alert-fn",
    InvocationType = "Event",                      # async — returns immediately (202)
    Payload        = json.dumps({
        "pipeline" : "orders-silver-load",
        "status"   : "SUCCESS",
        "rows"     : 48291,
        "duration" : "4m 32s"
    }).encode("utf-8")
)

# StatusCode 202 = accepted for async execution
print(f"Async invoke accepted. Status: {response['StatusCode']}")
# No Payload to read — Lambda runs in the background

📌 When to Use Each

Use RequestResponse when the next step in your pipeline depends on the Lambda result (validation, data lookup, token generation). Use Event for notifications, audit writes, and fire-and-forget triggers where you don't need confirmation before proceeding.

Reading Response Payload Correctly

The Payload in a Lambda response is a StreamingBody object — you must call .read() on it and then json.loads() to get the actual data. If you forget .read(), you'll get a StreamingBody object instead of your data.

python — safe payload reader with error handling

import boto3, json
from botocore.exceptions import ClientError

lambda_client = boto3.client("lambda", region_name="us-east-1")

def invoke_lambda(function_name: str, payload: dict) -> dict:
    """
    Invoke Lambda synchronously and return the parsed response.
    Raises RuntimeError if Lambda itself threw an exception.
    """
    try:
        response = lambda_client.invoke(
            FunctionName   = function_name,
            InvocationType = "RequestResponse",
            Payload        = json.dumps(payload).encode("utf-8")
        )
    except ClientError as e:
        # boto3-level error (e.g. function not found, no permissions)
        code = e.response["Error"]["Code"]
        if code == "ResourceNotFoundException":
            raise ValueError(f"Lambda function '{function_name}' not found")
        raise

    # Read and parse the payload
    raw = response["Payload"].read()
    result = json.loads(raw) if raw else {}

    # Check for Lambda-level errors (function threw an exception)
    if response.get("FunctionError"):
        error_type = result.get("errorType", "UnknownError")
        error_msg  = result.get("errorMessage", str(result))
        raise RuntimeError(f"Lambda {function_name} failed [{error_type}]: {error_msg}")

    return result


# ── Usage ──
result = invoke_lambda(
    function_name = "validate-orders-fn",
    payload       = {"bucket": "my-datalake", "key": "bronze/orders/2024-03-15/"}
)
print(f"DQ Score: {result['dq_score']}")
print(f"Row Count: {result['row_count']}")

🔧

Managing Lambda Functions — list, get, create, update, delete

list_functions() with Paginator

list_functions() returns its results in pages (at most 50 functions per call). Always use a paginator, since an account can have hundreds of functions. This is useful for auditing all Lambda functions, finding functions by naming convention, or building a deployment inventory.

python — list_functions() with paginator

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# ── List all Lambda functions ──
paginator = lambda_client.get_paginator("list_functions")
pages = paginator.paginate()

de_functions = []  # collect only data engineering functions

for page in pages:
    for fn in page["Functions"]:
        name    = fn["FunctionName"]
        runtime = fn.get("Runtime", "container")   # container-image functions have no Runtime key
        memory  = fn["MemorySize"]
        timeout = fn["Timeout"]
        role    = fn["Role"]

        # Filter to only DE-related Lambdas by naming convention
        if "pipeline" in name or "etl" in name or "glue" in name:
            de_functions.append(name)
            print(f"{name:50s} | {runtime:12s} | {memory}MB | {timeout}s")

print(f"\nTotal DE Lambda functions: {len(de_functions)}")

get_function() — Inspect a Specific Function

get_function() returns full metadata about a Lambda function: its runtime, memory, timeout, environment variables, VPC config, layers, and a pre-signed URL to download the current deployment package.

python — get_function()

response = lambda_client.get_function(FunctionName="validate-orders-fn")

config = response["Configuration"]
print(f"Runtime      : {config['Runtime']}")         # python3.11
print(f"Memory       : {config['MemorySize']} MB")   # 512
print(f"Timeout      : {config['Timeout']} seconds") # 300
print(f"Last Modified: {config['LastModified']}")
print(f"Code Size    : {config['CodeSize']} bytes")

# Environment variables (values are returned in plaintext, so print only the keys)
env_vars = config.get("Environment", {}).get("Variables", {})
print(f"Env vars     : {list(env_vars.keys())}")     # ['ENV', 'BUCKET', 'TABLE']

# VPC config (if function is in a VPC)
vpc = config.get("VpcConfig", {})
print(f"VPC Subnets  : {vpc.get('SubnetIds', [])}")

# Pre-signed S3 URL to download current code package
code_url = response["Code"]["Location"]
print(f"Code download URL (expires 10min): {code_url[:60]}...")

create_function() — Deploy a New Lambda

You can create Lambda functions programmatically, which is useful in CI/CD pipelines or when building infrastructure-as-code without Terraform. For ZIP-based functions like the one below, the code is supplied either inline as bytes (ZipFile) or by reference to an object in an S3 bucket; container-image functions reference an ECR image instead.

python — create_function() from S3

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# ── Deploy Lambda from S3 ZIP ──
response = lambda_client.create_function(
    FunctionName  = "trigger-glue-on-arrival-fn",
    Runtime       = "python3.11",
    Role          = "arn:aws:iam::123456789012:role/LambdaGlueRole",
    Handler       = "lambda_handler.handler",   # filename.function_name
    Code          = {
        "S3Bucket": "my-datalake",
        "S3Key"   : "lambda-code/trigger-glue-on-arrival.zip"
    },
    Description   = "Trigger Glue ETL job when a new file lands in Bronze S3",
    Timeout       = 300,          # 5 minutes (Lambda hard limit = 900 s / 15 min)
    MemorySize    = 256,          # MB — start low, tune after profiling
    Environment   = {
        "Variables": {
            "GLUE_JOB_NAME" : "orders-bronze-to-silver",
            "ENV"           : "prod",
            "AUDIT_TABLE"   : "pipeline-audit-log"
        }
    },
    VpcConfig = {
        "SubnetIds"        : ["subnet-private-1a", "subnet-private-1b"],
        "SecurityGroupIds" : ["sg-lambda-outbound"]
    },
    Layers = [
        "arn:aws:lambda:us-east-1:123456789012:layer:pandas-layer:3"
    ],
    Tags = {"Project": "DataPlatform", "Environment": "prod"}
)

arn = response["FunctionArn"]
print(f"Created Lambda: {arn}")

📌 Handler Format

Handler = "filename.function_name". If your file is lambda_handler.py and your function is def handler(event, context):, then Handler = "lambda_handler.handler". The filename is the Python module name (no .py).
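
As a minimal sketch of what that Handler string points at, the module below assumes a hypothetical file named lambda_handler.py at the root of the deployment ZIP. The event fields (bucket, key) are illustrative and mirror the payloads used earlier in this section; a real handler would do actual work instead of echoing the path.

```python
# lambda_handler.py  (hypothetical file at the root of the deployment ZIP)
import json


def handler(event, context):
    """Entry point referenced by Handler = "lambda_handler.handler"."""
    # 'event' is the invocation payload as a dict; 'context' carries runtime metadata.
    bucket = event.get("bucket", "unknown")
    key = event.get("key", "unknown")
    return {"status": "received", "path": f"s3://{bucket}/{key}"}


# Local smoke test: call the handler directly with a sample event.
result = handler({"bucket": "my-datalake", "key": "bronze/orders/orders.parquet"}, None)
print(json.dumps(result))
```

Calling the handler locally like this is a quick way to check the module name and function name before packaging the ZIP.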

update_function_code() — Deploy New Code

When you update your Lambda code (bug fix, new logic), use update_function_code() to push the new ZIP. This is what CI/CD pipelines do after building and uploading a new deployment package to S3.

python — update_function_code() from S3

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# ── Update code — typically run from a CI/CD pipeline after new ZIP is uploaded ──
response = lambda_client.update_function_code(
    FunctionName = "trigger-glue-on-arrival-fn",
    S3Bucket     = "my-datalake",
    S3Key        = "lambda-code/trigger-glue-on-arrival-v2.zip",
    Publish      = True    # True = create a new published version (immutable snapshot)
)

print(f"Updated to version : {response['Version']}")
print(f"Last modified      : {response['LastModified']}")
print(f"Code SHA256        : {response['CodeSha256']}")

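# ── Alternative: upload ZIP bytes directly (for small functions in CI) ──
# Direct uploads are limited to 50 MB zipped; larger packages must go through S3.
# Run this instead of the S3 call above: a second update issued while the first
# is still in progress fails with ResourceConflictException.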
with open("my_lambda.zip", "rb") as f:
    zip_bytes = f.read()

lambda_client.update_function_code(
    FunctionName = "trigger-glue-on-arrival-fn",
    ZipFile      = zip_bytes
)

update_function_configuration() — Change Settings

Use update_function_configuration() to change runtime settings without redeploying code: update environment variables, increase memory or timeout, change the execution role, or update VPC settings.

python — update_function_configuration()

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# ── Update environment variables ──
lambda_client.update_function_configuration(
    FunctionName = "trigger-glue-on-arrival-fn",
    Environment  = {
        # Environment replaces the whole variable set, so resend the keys you want to keep
        "Variables": {
            "GLUE_JOB_NAME" : "orders-bronze-to-silver-v2",   # updated job name
            "ENV"           : "prod",
            "AUDIT_TABLE"   : "pipeline-audit-log",
            "ALERT_SNS_ARN" : "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"  # new var
        }
    },
    Timeout    = 600,   # increased from 300 to 600 seconds
    MemorySize = 512    # increased from 256 to 512 MB
)

print("Function configuration updated")

# ── Important: wait for the update to finish before invoking or updating again ──
# The function_updated waiter polls until the update has been applied
waiter = lambda_client.get_waiter("function_updated")
waiter.wait(FunctionName="trigger-glue-on-arrival-fn")

⚠️ Update Propagation Delay

After calling update_function_code() or update_function_configuration(), the update takes a few seconds to apply. While it is in progress, invocations still run the previous code and configuration, and a second update call fails with ResourceConflictException. Use the function_updated waiter (or, as a cruder fallback, a brief sleep) in CI/CD pipelines.
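
To make the wait condition concrete, it can be sketched as a small check on the dictionary returned by get_function_configuration(). This is a minimal sketch that relies on the State, LastUpdateStatus, and LastUpdateStatusReason fields of that response; the helper name update_is_live and the sample dictionaries are hypothetical.

```python
def update_is_live(config: dict) -> bool:
    """True when the function is Active and its most recent update has finished."""
    if config.get("LastUpdateStatus") == "Failed":
        # Surface failed updates instead of polling forever
        raise RuntimeError(config.get("LastUpdateStatusReason", "update failed"))
    return (
        config.get("State") == "Active"
        and config.get("LastUpdateStatus") == "Successful"
    )


# Illustrative responses, trimmed to the fields the check reads
in_progress = {"State": "Active", "LastUpdateStatus": "InProgress"}
finished = {"State": "Active", "LastUpdateStatus": "Successful"}

print(update_is_live(in_progress))
print(update_is_live(finished))

# Against a real function (requires boto3 and AWS credentials):
# config = boto3.client("lambda").get_function_configuration(
#     FunctionName="trigger-glue-on-arrival-fn")
```

In practice the built-in waiter is preferable; a hand-written check like this is mainly useful when you want custom logging or timeouts around the polling loop.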

add_permission() — Grant Trigger Access

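When you want S3, EventBridge, or SNS to invoke your Lambda, you must grant the service permission via a resource-based policy using add_permission(). Without this, the trigger source gets a permission error when it tries to call Lambda. SQS is different: Lambda polls the queue through an event source mapping, so access is granted on the function's execution role (sqs:ReceiveMessage, sqs:DeleteMessage, sqs:GetQueueAttributes) rather than with add_permission().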

python — add_permission() for S3 trigger

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# ── Allow S3 to invoke this Lambda (for S3 event notifications) ──
lambda_client.add_permission(
    FunctionName  = "trigger-glue-on-arrival-fn",
    StatementId   = "AllowS3Invoke-bronze-bucket",   # unique ID for this permission
    Action        = "lambda:InvokeFunction",
    Principal     = "s3.amazonaws.com",
    SourceArn     = "arn:aws:s3:::my-datalake",      # only THIS bucket can invoke
    SourceAccount = "123456789012"                    # prevents confused deputy attack
)
print("S3 trigger permission added")


# ── Allow EventBridge to invoke Lambda (for scheduled runs) ──
lambda_client.add_permission(
    FunctionName = "trigger-glue-on-arrival-fn",
    StatementId  = "AllowEventBridgeInvoke",
    Action       = "lambda:InvokeFunction",
    Principal    = "events.amazonaws.com",
    SourceArn    = "arn:aws:events:us-east-1:123456789012:rule/daily-etl-rule"
)

💡 Resource Policy vs Execution Role

Execution Role controls what the Lambda function can do (e.g. read S3, call Glue). Resource Policy (add_permission) controls who can invoke the Lambda. Both are needed, and both are set on the function: the execution role is the IAM role the function assumes, and each resource policy statement names one trigger source that is allowed to invoke it.
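
To see which trigger sources a function currently allows, you can read its resource policy back with get_policy(), which returns the policy as a JSON string under the "Policy" key. The sketch below parses such a string; the helper name summarize_invoke_permissions is hypothetical, and the sample document is an illustration of the shape add_permission() typically produces, not output captured from AWS.

```python
import json


def summarize_invoke_permissions(policy_json: str) -> list:
    """Return (statement id, principal, source ARN) for each policy statement."""
    policy = json.loads(policy_json)
    summary = []
    for stmt in policy.get("Statement", []):
        principal = stmt.get("Principal")
        if isinstance(principal, dict):
            # Service principals look like {"Service": "s3.amazonaws.com"}
            principal = principal.get("Service") or principal.get("AWS")
        source_arn = stmt.get("Condition", {}).get("ArnLike", {}).get("AWS:SourceArn")
        summary.append((stmt.get("Sid"), principal, source_arn))
    return summary


# Illustrative policy document (assumed shape, values taken from the examples above)
sample_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowS3Invoke-bronze-bucket",
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "lambda:InvokeFunction",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:trigger-glue-on-arrival-fn",
        "Condition": {
            "StringEquals": {"AWS:SourceAccount": "123456789012"},
            "ArnLike": {"AWS:SourceArn": "arn:aws:s3:::my-datalake"},
        },
    }],
})

permissions = summarize_invoke_permissions(sample_policy)
for row in permissions:
    print(row)

# Against a real function (requires boto3 and AWS credentials):
# policy_json = boto3.client("lambda").get_policy(
#     FunctionName="trigger-glue-on-arrival-fn")["Policy"]
```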

delete_function()

Delete a Lambda function when decommissioning a pipeline. You can delete a specific published version or the entire function (all versions).

python — delete_function()

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# Delete entire function (all versions and aliases)
lambda_client.delete_function(FunctionName="old-legacy-pipeline-fn")
print("Function deleted")

# Delete only a specific published version (keeps $LATEST and other versions)
lambda_client.delete_function(
    FunctionName = "trigger-glue-on-arrival-fn",
    Qualifier    = "5"   # version number
)
print("Version 5 deleted")

🏭

Lambda in Data Engineering — Production Patterns

Pattern 1 — S3 File Arrival → Trigger Glue Job

The most common DE Lambda pattern: S3 sends an event notification to Lambda when a file lands. Lambda reads the event, extracts the bucket and key, then starts a Glue job with the file details as arguments. This makes pipelines event-driven — they run automatically the moment data arrives.

python — Lambda handler: S3 event → Glue job

import boto3, os, logging
from datetime import datetime, timezone
from urllib.parse import unquote_plus

logger = logging.getLogger()
logger.setLevel(logging.INFO)

glue   = boto3.client("glue",     region_name="us-east-1")
dynamo = boto3.client("dynamodb", region_name="us-east-1")

GLUE_JOB_NAME = os.environ["GLUE_JOB_NAME"]    # from Lambda env vars
AUDIT_TABLE   = os.environ["AUDIT_TABLE"]


def handler(event, context):
    """Triggered by S3 event notification when a new file lands."""
    run_ids = []   # one Glue run per S3 record in this event

    for record in event.get("Records", []):

        # ── Extract S3 file details from the event ──
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (e.g. spaces become "+")
        key    = unquote_plus(record["s3"]["object"]["key"])
        size   = record["s3"]["object"].get("size", 0)
        logger.info(f"File arrived: s3://{bucket}/{key}  size={size} bytes")

        now = datetime.now(timezone.utc)

        # ── Start the Glue ETL job ──
        try:
            response = glue.start_job_run(
                JobName   = GLUE_JOB_NAME,
                Arguments = {
                    "--source_bucket" : bucket,
                    "--source_key"    : key,
                    "--run_date"      : now.strftime("%Y-%m-%d"),
                    "--trigger"       : "s3_event"
                }
            )
            run_id = response["JobRunId"]
            logger.info(f"Glue job started: {run_id}")

        except glue.exceptions.ConcurrentRunsExceededException:
            # Skip this file but keep processing the remaining records
            logger.warning(f"Glue job at max concurrent runs, skipping {key}")
            continue

        # ── Write audit record to DynamoDB ──
        dynamo.put_item(
            TableName = AUDIT_TABLE,
            Item = {
                "run_id"    : {"S": run_id},
                "job_name"  : {"S": GLUE_JOB_NAME},
                "source_key": {"S": key},
                "status"    : {"S": "STARTED"},
                "triggered_at": {"S": now.isoformat()},
                "trigger"   : {"S": "s3_event"}
            }
        )
        run_ids.append(run_id)

    # run_ids is empty when every record was skipped or the event had no records
    return {"status": "ok" if run_ids else "skipped", "glue_run_ids": run_ids}

✅ Real Usage

Your data provider drops an orders CSV into s3://my-datalake/landing/orders/ every 30 minutes. S3 notifies Lambda, Lambda triggers the Glue Bronze ingestion job with the exact file path. No polling, no cron — runs the moment data arrives.
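
The S3 side of this wiring is a bucket notification configuration. The sketch below builds the dictionary passed to put_bucket_notification_configuration() for the landing prefix described above; the helper name, the notification Id, and the account ID in the function ARN are assumptions for illustration. Two caveats worth knowing: the call replaces the bucket's entire existing notification configuration, and the add_permission() grant for s3.amazonaws.com should already be in place, otherwise S3 rejects the configuration.

```python
def build_s3_lambda_notification(function_arn: str, prefix: str, suffix: str) -> dict:
    """Build a NotificationConfiguration that sends new-object events to one Lambda."""
    return {
        "LambdaFunctionConfigurations": [{
            "Id": "orders-landing-to-lambda",          # hypothetical identifier
            "LambdaFunctionArn": function_arn,
            "Events": ["s3:ObjectCreated:*"],          # any PUT / POST / COPY / multipart upload
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": prefix},
                {"Name": "suffix", "Value": suffix},
            ]}},
        }]
    }


notification = build_s3_lambda_notification(
    function_arn="arn:aws:lambda:us-east-1:123456789012:function:trigger-glue-on-arrival-fn",
    prefix="landing/orders/",
    suffix=".csv",
)
print(notification)

# Applying it (requires boto3 and AWS credentials):
# boto3.client("s3").put_bucket_notification_configuration(
#     Bucket="my-datalake", NotificationConfiguration=notification)
```

Filtering on prefix and suffix keeps the function from firing on files it should not process, such as its own output written to another prefix of the same bucket.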

Pattern 2 — Trigger EMR Step from Lambda

When your heavy processing runs on EMR, Lambda acts as the orchestrator: it submits a Spark step to an already-running cluster (here the cluster ID is read from an environment variable) and returns the step ID. A separate polling mechanism (Airflow, another Lambda, or a Step Function) tracks completion.

python — Lambda: trigger EMR Spark step

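import boto3, os
from datetime import datetime, timezone

emr = boto3.client("emr", region_name="us-east-1")

CLUSTER_ID  = os.environ["EMR_CLUSTER_ID"]
SCRIPT_PATH = os.environ["SPARK_SCRIPT_S3_PATH"]   # s3://my-datalake/code/transform.py


def handler(event, context):
    # Default to today's UTC date when the caller does not pass run_date
    run_date = event.get("run_date") or datetime.now(timezone.utc).strftime("%Y-%m-%d")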

    # Submit Spark step to existing running cluster
    response = emr.add_job_flow_steps(
        JobFlowId = CLUSTER_ID,
        Steps = [{
            "Name"           : f"Silver Transform {run_date}",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep"  : {
                "Jar" : "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "--master",      "yarn",
                    SCRIPT_PATH,
                    "--run-date", run_date
                ]
            }
        }]
    )

    step_id = response["StepIds"][0]
    print(f"EMR step submitted: {step_id}")

    return {"cluster_id": CLUSTER_ID, "step_id": step_id, "run_date": run_date}
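

The separate polling mechanism mentioned above can be as small as one describe_step() call per check. The sketch below classifies the step state into succeeded, failed, or in progress; the helper name check_emr_step and the StubEmr class are hypothetical, and the stub exists only so the example runs without AWS access. With a real client you would pass boto3.client("emr") instead.

```python
FAILED_STATES = {"FAILED", "CANCELLED", "INTERRUPTED"}


def check_emr_step(emr_client, cluster_id: str, step_id: str) -> dict:
    """Look up one EMR step and classify its current state."""
    status = emr_client.describe_step(
        ClusterId=cluster_id, StepId=step_id
    )["Step"]["Status"]
    state = status["State"]
    if state == "COMPLETED":
        outcome = "succeeded"
    elif state in FAILED_STATES:
        outcome = "failed"
    else:
        # PENDING, RUNNING, CANCEL_PENDING: check again later
        outcome = "in_progress"
    return {"step_id": step_id, "state": state, "outcome": outcome}


class StubEmr:
    """Stand-in for boto3.client("emr") so the sketch runs offline."""

    def __init__(self, state: str):
        self.state = state

    def describe_step(self, ClusterId, StepId):
        return {"Step": {"Id": StepId, "Status": {"State": self.state}}}


running = check_emr_step(StubEmr("RUNNING"), "j-EXAMPLE", "s-EXAMPLE")
done = check_emr_step(StubEmr("COMPLETED"), "j-EXAMPLE", "s-EXAMPLE")
print(running["outcome"], done["outcome"])
```

Returning a small dictionary rather than raising keeps the function usable from a Step Functions Choice state or an Airflow sensor, where "not finished yet" is a normal result rather than an error.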

Pattern 3 — Metadata Write to DynamoDB from Lambda

Lambda is commonly used as a pipeline audit writer — called at the end of every Glue or EMR job (via SNS or EventBridge) to record the run result in a DynamoDB audit table. This gives you a central job history table queryable from anywhere.

python — Lambda: write audit record from Glue job completion event

import boto3, os, json
from datetime import datetime

dynamo = boto3.client("dynamodb", region_name="us-east-1")
AUDIT_TABLE = os.environ["AUDIT_TABLE"]


def handler(event, context):
    """
    Called by EventBridge when a Glue job state changes to SUCCEEDED or FAILED.
    EventBridge Glue event detail structure:
    {
      "jobName": "orders-bronze-to-silver",
      "state"  : "SUCCEEDED",
      "jobRunId": "jr_abc123",
      "message": ""
    }
    """
    detail   = event.get("detail", {})
    job_name = detail.get("jobName", "unknown")
    state    = detail.get("state", "unknown")
    run_id   = detail.get("jobRunId", "unknown")
    message  = detail.get("message", "")

    dynamo.put_item(
        TableName = AUDIT_TABLE,
        Item = {
            "run_id"      : {"S": run_id},
            "job_name"    : {"S": job_name},
            "status"      : {"S": state},
            "message"     : {"S": message},
            "recorded_at" : {"S": datetime.utcnow().isoformat()}
        }
    )
    print(f"Audit written: {job_name} → {state}")
    return {"ok": True}
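

For the handler above to receive these events, an EventBridge rule has to match Glue job state changes and target the function. The sketch below builds the event pattern as a JSON string; the helper name and the rule name in the commented calls are hypothetical. It assumes the "Glue Job State Change" detail-type, which Glue emits for the SUCCEEDED, FAILED, TIMEOUT, and STOPPED states.

```python
import json


def build_glue_state_change_pattern(job_names, states=("SUCCEEDED", "FAILED", "TIMEOUT", "STOPPED")) -> str:
    """Return an EventBridge event pattern matching state changes of the given Glue jobs."""
    return json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {
            "jobName": list(job_names),
            "state": list(states),
        },
    })


pattern = build_glue_state_change_pattern(["orders-bronze-to-silver"])
print(pattern)

# Creating the rule and target (requires boto3 and AWS credentials):
# events = boto3.client("events")
# events.put_rule(Name="glue-audit-rule", EventPattern=pattern, State="ENABLED")
# events.put_targets(Rule="glue-audit-rule",
#                    Targets=[{"Id": "audit-writer", "Arn": "<audit Lambda ARN>"}])
```

The function also needs an add_permission() grant for events.amazonaws.com scoped to the rule's ARN, as shown in the add_permission() section.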

Pattern 4 — Pipeline Failure Notification Handler

A dedicated Lambda function subscribed to an SNS failure topic handles all pipeline failure alerts. It formats a rich message and sends it to Slack (via HTTP), email, or PagerDuty. Centralising notifications in one Lambda keeps all other pipeline code clean.

python — Lambda: failure notification handler

import boto3, os, json
import urllib.request

SLACK_WEBHOOK = os.environ.get("SLACK_WEBHOOK_URL")


def handler(event, context):
    """Triggered by SNS pipeline-failure topic."""
    for record in event.get("Records", []):
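        raw = record["Sns"]["Message"]
        try:
            message = json.loads(raw)
        except json.JSONDecodeError:
            message = None
        if not isinstance(message, dict):
            # Plain-text alerts (e.g. Message=str(e)) are wrapped so the lookups below still work
            message = {"error": raw}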

        pipeline  = message.get("pipeline", "unknown")
        status    = message.get("status",   "FAILED")
        error     = message.get("error",    "No error details")
        run_id    = message.get("run_id",   "unknown")
        run_date  = message.get("run_date", "unknown")

        # ── Format Slack message ──
        text = (
            f":rotating_light: *Pipeline Alert*\n"
            f"Pipeline : `{pipeline}`\n"
            f"Status   : *{status}*\n"
            f"Run ID   : `{run_id}`\n"
            f"Run Date : `{run_date}`\n"
            f"Error    : ```{error}```"
        )

        if SLACK_WEBHOOK:
            payload = json.dumps({"text": text}).encode("utf-8")
            req = urllib.request.Request(
                SLACK_WEBHOOK,
                data=payload,
                headers={"Content-Type": "application/json"},
                method="POST"
            )
            with urllib.request.urlopen(req, timeout=10) as resp:   # fail fast if Slack is unreachable
                print(f"Slack notification sent: {resp.status}")

    return {"ok": True}

🔶 Full Alert Architecture

CloudWatch Alarm detects pipeline failure → publishes to SNS pipeline-alerts topic → SNS fans out to: (1) this Lambda for Slack/PagerDuty, (2) Email subscription, (3) SQS for dead-letter storage. One SNS topic, three notification channels.

Error Handling in Lambda Handlers

Lambda automatically retries failed asynchronous invocations (2 retries by default), and S3, SNS, and EventBridge all invoke Lambda asynchronously. This means your handler must be idempotent, that is, safe to run more than once with the same event. For Glue triggers, always check if a run is already in progress before starting another.

python — idempotent Lambda handler pattern

import boto3, os, json, logging
from botocore.exceptions import ClientError

logger = logging.getLogger()
logger.setLevel(logging.INFO)

glue = boto3.client("glue", region_name="us-east-1")
GLUE_JOB_NAME = os.environ["GLUE_JOB_NAME"]


def is_job_already_running(job_name: str) -> bool:
    """Check if a Glue job run is already in progress."""
    paginator = glue.get_paginator("get_job_runs")
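    # PaginationConfig stops the scan after 20 runs; passing MaxResults straight
    # to paginate() only sets the page size and still walks the full run history
    pages = paginator.paginate(
        JobName=job_name,
        PaginationConfig={"MaxItems": 20, "PageSize": 20}
    )
    for page in pages:
        for run in page["JobRuns"]:
            if run["JobRunState"] in ("STARTING", "RUNNING", "STOPPING", "WAITING"):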
                logger.info(f"Job already running: {run['Id']}")
                return True
    return False


def handler(event, context):
    try:
        # Idempotency check — skip if already running
        if is_job_already_running(GLUE_JOB_NAME):
            return {"status": "skipped", "reason": "already_running"}

        response = glue.start_job_run(
            JobName   = GLUE_JOB_NAME,
            Arguments = {"--run_date": event.get("run_date", "latest")}
        )
        run_id = response["JobRunId"]
        logger.info(f"Started Glue job: {run_id}")
        return {"status": "started", "run_id": run_id}

    except ClientError as e:
        code = e.response["Error"]["Code"]
        logger.error(f"ClientError [{code}]: {e}")
        # Re-raise so Lambda marks this invocation as failed and retries
        raise
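

The running-job check above narrows the window for duplicates but does not close it: two invocations can both see "nothing running" and both start a job. One common way to tighten this, sketched below under stated assumptions, is to claim each event in DynamoDB with a conditional write before doing any work. The table name, the event_id key attribute, and the FakeDynamo class are hypothetical; the fake only mimics the conditional-put behaviour so the example runs without AWS access.

```python
def claim_event(dynamo_client, table_name: str, event_id: str) -> bool:
    """Record event_id if unseen. Returns True for the first caller, False for duplicates."""
    try:
        dynamo_client.put_item(
            TableName=table_name,
            Item={"event_id": {"S": event_id}},
            # The write succeeds only when no item with this key exists yet
            ConditionExpression="attribute_not_exists(event_id)",
        )
        return True
    except dynamo_client.exceptions.ConditionalCheckFailedException:
        return False


class FakeDynamo:
    """In-memory stand-in for boto3.client("dynamodb"), for illustration only."""

    class exceptions:
        class ConditionalCheckFailedException(Exception):
            pass

    def __init__(self):
        self._seen = set()

    def put_item(self, TableName, Item, ConditionExpression):
        key = Item["event_id"]["S"]
        if key in self._seen:
            raise self.exceptions.ConditionalCheckFailedException()
        self._seen.add(key)


fake = FakeDynamo()
event_id = "my-datalake/landing/orders/orders.csv"
first = claim_event(fake, "pipeline-event-claims", event_id)
second = claim_event(fake, "pipeline-event-claims", event_id)
print(first, second)
```

In a handler, you would call claim_event() first and return early when it yields False. The event ID should be something stable across retries, for example the bucket and key (plus the sequencer field) of an S3 record.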

⏳

Lambda Waiters — FunctionActive, FunctionUpdated

Built-in Lambda Waiters

Lambda ships several built-in boto3 waiters. Two of them are especially useful in CI/CD pipelines:

• function_active — waits until a newly created function is fully active and invocable.
• function_updated — waits until an update_function_code() or update_function_configuration() update has fully propagated and the function is ready to invoke again.

Without waiters, invoking a function that is still being created, or issuing another update while the previous one is in progress, can fail with ResourceConflictException.

python — Lambda waiters in CI/CD pipeline

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

FUNCTION_NAME = "trigger-glue-on-arrival-fn"

# ── 1. Deploy new code ──
lambda_client.update_function_code(
    FunctionName = FUNCTION_NAME,
    S3Bucket     = "my-datalake",
    S3Key        = "lambda-code/trigger-glue-v3.zip"
)
print("Code update submitted. Waiting for propagation...")

# ── 2. Wait until update is fully applied ──
waiter = lambda_client.get_waiter("function_updated")
waiter.wait(
    FunctionName = FUNCTION_NAME,
    WaiterConfig = {
        "Delay"      : 5,    # poll every 5 seconds
        "MaxAttempts": 20    # give up after 100 seconds
    }
)
print("Update complete. Function is ready.")

# ── 3. Safe to invoke now ──
response = lambda_client.invoke(
    FunctionName   = FUNCTION_NAME,
    InvocationType = "RequestResponse",
    Payload        = b'{"test": true}'
)
print(f"Test invoke status: {response['StatusCode']}")

✅ When This Matters

In a GitHub Actions CI/CD pipeline: upload ZIP to S3 → update_function_code() → function_updated waiter → smoke-test invoke → done. Without the waiter, the smoke test might invoke the old version and give a false green.

📋

Lambda API Quick Reference

All Lambda boto3 APIs at a Glance

API	What It Does	Key Parameter	Returns
`invoke()`	Call a Lambda function	`InvocationType` (RequestResponse / Event)	StatusCode, Payload, FunctionError
`list_functions()`	List all functions (paginated)	—	List of function configs
`get_function()`	Get metadata + code URL for a function	`FunctionName`	Configuration + Code
`create_function()`	Deploy a new Lambda	`Runtime, Role, Handler, Code`	FunctionArn
`update_function_code()`	Push new deployment package	`S3Bucket/S3Key` or `ZipFile`	Version, CodeSha256
`update_function_configuration()`	Change memory, timeout, env vars	`MemorySize, Timeout, Environment`	Updated config
`add_permission()`	Allow a service to invoke Lambda	`Principal, Action, SourceArn`	Statement JSON
`delete_function()`	Delete function or specific version	`FunctionName, Qualifier`	—
Waiter: `function_active`	Wait until new function is invocable	`FunctionName`	—
Waiter: `function_updated`	Wait until code/config update is live	`FunctionName`	—

29.30.10

Secrets Manager APIs

AWS Secrets Manager is where production pipelines store database passwords, API tokens, and private keys — never in code or environment variables committed to git. Every Data Engineer must know how to retrieve, create, rotate, and list secrets using boto3. This section covers every API you will actually use, with the full production pattern that appears in almost every real pipeline.

🗝️

get_secret_value() — The Core Retrieval API

What is get_secret_value()?

get_secret_value() is the single most important Secrets Manager API. It retrieves the current value of a secret. Secrets are stored either as a plain string (SecretString) or as binary data (SecretBinary). In Data Engineering, almost all secrets are stored as a JSON string inside SecretString — so you always call json.loads(SecretString) after retrieval.

💡 Analogy

Think of Secrets Manager as a safety deposit box in a bank vault. Your code does not carry the keys around — it goes to the vault, shows its IAM identity badge, and retrieves the key it needs. The bank (AWS) logs every visit. No one else can open your box without the right badge.

📌 Key Concept

The secret value is always in response["SecretString"] for text secrets. The value is a raw string — if you stored JSON (username + password + host), you must parse it with json.loads(). This is the universal pattern in every production pipeline.

python — get_secret_value() basic usage

import boto3
import json

client = boto3.client("secretsmanager", region_name="us-east-1")

# ── Retrieve a secret ──
response = client.get_secret_value(SecretId="prod/orders-db/credentials")

# SecretString contains the raw JSON string stored in Secrets Manager
secret_string = response["SecretString"]

# Parse it into a Python dict
credentials = json.loads(secret_string)

# Now use individual fields
db_host     = credentials["host"]
db_port     = credentials["port"]
db_user     = credentials["username"]
db_password = credentials["password"]
db_name     = credentials["dbname"]

print(f"Connecting to {db_host}:{db_port}/{db_name} as {db_user}")

📦 Example Secret JSON stored in Secrets Manager

The secret value stored in AWS (SecretString) looks like this:

{"host":"orders-db.abc123.us-east-1.rds.amazonaws.com","port":5432,"username":"etl_user","password":"S3cur3P@ss!","dbname":"orders"}

After json.loads(), you get a Python dict with all those fields ready to use in your JDBC connection or psycopg2 call.
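
The basic example assumes the secret arrives as a JSON SecretString. A slightly more defensive sketch is shown here, assuming that boto3 returns SecretBinary as bytes and that a binary secret also holds UTF-8 JSON (an assumption; a binary secret may just as well be a keystore or certificate). The helper name parse_secret_response is hypothetical.

```python
import json


def parse_secret_response(response: dict) -> dict:
    """Return the secret as a dict, whether it arrived as text or binary.

    Hypothetical helper: assumes the payload is a JSON object either way.
    """
    if response.get("SecretString") is not None:
        raw = response["SecretString"]
    elif response.get("SecretBinary") is not None:
        # Assumption: the binary payload is UTF-8 encoded JSON.
        raw = response["SecretBinary"].decode("utf-8")
    else:
        raise ValueError("Response has neither SecretString nor SecretBinary")

    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("Secret is valid JSON but not a JSON object")
    return parsed


# Shape of a get_secret_value() response, trimmed to the relevant key.
fake_response = {"SecretString": '{"host": "db.example.internal", "port": 5432}'}
creds = parse_secret_response(fake_response)
print(creds["host"], creds["port"])
```

Checking that the parsed value is a dict catches the case where someone stored a bare string or list, which would otherwise fail later with a confusing TypeError.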

SecretId — Name vs ARN

The SecretId parameter accepts either the secret name (human-readable) or the full ARN. For same-account access, the name works fine and is easier to read in code. To read a secret that lives in another account you must pass the full ARN, because a bare name is looked up in the caller's own account. The ARN also removes ambiguity when names could collide.

python — SecretId as name vs ARN

import boto3, json

client = boto3.client("secretsmanager", region_name="us-east-1")

# Option 1 — by name (simple, same-account)
resp = client.get_secret_value(SecretId="prod/orders-db/credentials")

# Option 2 — by full ARN (cross-account, unambiguous)
resp = client.get_secret_value(
    SecretId="arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/orders-db/credentials-AbCdEf"
)

# Option 3 — retrieve a specific VERSION (for rotation testing)
resp = client.get_secret_value(
    SecretId    = "prod/orders-db/credentials",
    VersionStage= "AWSPREVIOUS"   # AWSCURRENT (default) or AWSPREVIOUS
)

credentials = json.loads(resp["SecretString"])
print(credentials)

ℹ️ Version Stages

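Secrets Manager tracks versions with staging labels. AWSCURRENT (the default) marks the active value your app uses, and AWSPREVIOUS marks the value it replaced, which is kept until the next rotation so that connections still using the old credential can be diagnosed or rolled back. A third label, AWSPENDING, exists only while a rotation is in progress. You almost always read AWSCURRENT; AWSPREVIOUS is useful for debugging rotation issues.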

Full Production Pattern — DB Connection from Secret

This is the pattern every real pipeline uses. Retrieve secret → parse JSON → build JDBC URL or psycopg2 connection. The secret is fetched once at startup (or per Lambda invocation) and cached in a module-level variable to avoid hitting Secrets Manager on every row processed.

python — production DB connection pattern with Secrets Manager

import boto3, json, psycopg2
from functools import lru_cache

sm = boto3.client("secretsmanager", region_name="us-east-1")


@lru_cache(maxsize=1)
def get_db_credentials(secret_name: str) -> dict:
    """
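    Fetch and cache DB credentials from Secrets Manager.
    lru_cache means Secrets Manager is called only once per process for a given
    secret name (maxsize=1 keeps just the most recently requested name).
    In Lambda, this cache persists across warm invocations, saving cost and latency.
    Caveat: the cached value never expires, so after a rotation the process keeps
    the old password until it restarts or get_db_credentials.cache_clear() is called.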
    """
    resp = sm.get_secret_value(SecretId=secret_name)
    return json.loads(resp["SecretString"])


def get_db_connection(secret_name: str):
    """Return a live psycopg2 connection using credentials from Secrets Manager."""
    creds = get_db_credentials(secret_name)
    conn = psycopg2.connect(
        host     = creds["host"],
        port     = creds.get("port", 5432),
        user     = creds["username"],
        password = creds["password"],
        dbname   = creds["dbname"]
    )
    return conn


# ── Usage in a pipeline ──
if __name__ == "__main__":
    SECRET = "prod/orders-db/credentials"
    conn   = get_db_connection(SECRET)
    cursor = conn.cursor()
    cursor.execute("SELECT COUNT(*) FROM orders WHERE status = 'PENDING'")
    count = cursor.fetchone()[0]
    print(f"Pending orders: {count}")
    conn.close()
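
Because lru_cache never expires an entry, a long-running job or a warm Lambda can keep using a password that has since been rotated. One way to bound that staleness is a small time-to-live cache. The sketch below is illustrative only: the class name TTLSecretCache, the injected fetch callable, and the 300-second default are assumptions, and the class is not thread-safe.

```python
import json
import time
from typing import Callable, Dict, Tuple


class TTLSecretCache:
    """Cache parsed secrets for a limited time (illustrative, not thread-safe)."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch      # returns the raw SecretString for a secret id
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the expiry logic can be tested
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, secret_id: str) -> dict:
        now = self._clock()
        cached = self._entries.get(secret_id)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # Missing or expired: fetch again and remember when we did.
        value = json.loads(self._fetch(secret_id))
        self._entries[secret_id] = (now, value)
        return value

    def invalidate(self, secret_id: str) -> None:
        """Drop one entry, e.g. after the database rejects the cached password."""
        self._entries.pop(secret_id, None)


# Wiring it to boto3 (assumes `sm` is a Secrets Manager client):
#   cache = TTLSecretCache(
#       lambda sid: sm.get_secret_value(SecretId=sid)["SecretString"]
#   )
#   creds = cache.get("prod/orders-db/credentials")
```

Calling invalidate() when a connection attempt fails with an authentication error, then retrying once, is a common way to recover from a rotation that happened mid-run.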

📌 PySpark JDBC Pattern

For Spark JDBC connections, pass the password directly from the secret into spark.read.jdbc() — never hardcode it in the url string or write it to a config file on disk.

python — PySpark JDBC with Secrets Manager credentials

import boto3, json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders-etl").getOrCreate()
sm    = boto3.client("secretsmanager", region_name="us-east-1")

# ── Fetch credentials ──
creds = json.loads(
    sm.get_secret_value(SecretId="prod/orders-db/credentials")["SecretString"]
)

jdbc_url = (
    f"jdbc:postgresql://{creds['host']}:{creds.get('port',5432)}/{creds['dbname']}"
)

# ── Read from PostgreSQL via JDBC ──
df = (
    spark.read
    .format("jdbc")
    .option("url",      jdbc_url)
    .option("dbtable",  "orders")
    .option("user",     creds["username"])
    .option("password", creds["password"])  # ← from Secrets Manager, not hardcoded
    .option("driver",   "org.postgresql.Driver")
    .load()
)

df.show(5)

➕

create_secret() and put_secret_value() — Creating and Updating Secrets

create_secret() — First-Time Secret Creation

create_secret() creates a brand-new secret in Secrets Manager. You typically call this in Terraform or a one-time setup script — not in your running pipeline. It sets the name, description, optional KMS key, and the initial secret value.

python — create_secret()

import boto3, json

sm = boto3.client("secretsmanager", region_name="us-east-1")

# ── Create a new secret ──
response = sm.create_secret(
    Name        = "prod/orders-db/credentials",   # The secret name (path-style is best practice)
    Description = "PostgreSQL credentials for the orders pipeline",
    KmsKeyId    = "alias/my-pipeline-key",         # Optional — use CMK instead of default key
    SecretString= json.dumps({                     # Always store as JSON for structured creds
        "host"    : "orders-db.abc123.us-east-1.rds.amazonaws.com",
        "port"    : 5432,
        "username": "etl_user",
        "password": "InitialP@ssword123!",
        "dbname"  : "orders"
    }),
    Tags = [
        {"Key": "Project",     "Value": "orders-platform"},
        {"Key": "Environment", "Value": "prod"},
        {"Key": "ManagedBy",   "Value": "terraform"}
    ]
)

print(f"Secret created: {response['ARN']}")

⚠️ Already Exists Error

If you call create_secret() on a name that already exists, you get a ResourceExistsException. Always wrap in a try/except and fall back to put_secret_value() if the secret already exists, or use Terraform/CDK to manage the lifecycle.

put_secret_value() — Rotate or Update an Existing Secret

put_secret_value() updates the value of an existing secret. This is used during manual rotation (changing a password), updating an API token that has expired, or programmatic rotation outside of Secrets Manager's built-in rotation. The old value becomes AWSPREVIOUS; the new value becomes AWSCURRENT.

python — put_secret_value() for password rotation

import boto3, json

sm = boto3.client("secretsmanager", region_name="us-east-1")

SECRET_NAME = "prod/orders-db/credentials"

# ── Read current secret ──
current = json.loads(
    sm.get_secret_value(SecretId=SECRET_NAME)["SecretString"]
)

# ── Build updated secret (only change what rotated) ──
updated = {**current, "password": "NewRotatedP@ss456!"}

# ── Write new version ──
response = sm.put_secret_value(
    SecretId    = SECRET_NAME,
    SecretString= json.dumps(updated),
    # VersionStages is optional — AWS sets AWSCURRENT automatically
)

print(f"Secret updated. New version: {response['VersionId']}")

📦 Real Use Case — Refreshing an Expired API Token

Some APIs (Snowflake OAuth, Salesforce, etc.) return short-lived tokens. A scheduled Lambda runs every hour, fetches a fresh token from the API, and calls put_secret_value() to update Secrets Manager. All pipelines reading that secret automatically get the fresh token on their next invocation — no restarts needed.

Upsert Pattern — create or update safely

In automation scripts, you often don't know if the secret exists yet. The pattern is: try create_secret(), and if it fails with ResourceExistsException, call put_secret_value() instead. This is the safe upsert pattern for secrets.

python — secret upsert pattern

import boto3, json
from botocore.exceptions import ClientError

sm = boto3.client("secretsmanager", region_name="us-east-1")

def upsert_secret(secret_name: str, secret_dict: dict) -> str:
    """Create secret if it doesn't exist, else update it."""
    secret_string = json.dumps(secret_dict)
    try:
        resp = sm.create_secret(
            Name         = secret_name,
            SecretString = secret_string
        )
        print(f"Created new secret: {secret_name}")
        return resp["ARN"]

    except ClientError as e:
        if e.response["Error"]["Code"] == "ResourceExistsException":
            # Secret already exists — just update the value
            resp = sm.put_secret_value(
                SecretId     = secret_name,
                SecretString = secret_string
            )
            print(f"Updated existing secret: {secret_name}")
            return resp["ARN"]   # put_secret_value() also returns the ARN
        else:
            raise   # Re-raise any other error


# ── Usage ──
upsert_secret(
    "prod/snowflake/credentials",
    {"account": "xy12345", "user": "etl_svc", "private_key": "..."}
)

🔍

describe_secret() and list_secrets() — Inspection APIs

describe_secret() — Metadata Without the Value

describe_secret() returns metadata about a secret — its ARN, name, description, rotation configuration, version IDs, tags — but NOT the actual secret value. Use this to check if a secret exists, find its rotation status, or audit its tags without exposing the credential.

python — describe_secret()

import boto3
from botocore.exceptions import ClientError

sm = boto3.client("secretsmanager", region_name="us-east-1")

def secret_exists(secret_name: str) -> bool:
    """Check if a secret exists without fetching its value."""
    try:
        meta = sm.describe_secret(SecretId=secret_name)
        print(f"Secret found: {meta['ARN']}")
        print(f"Last rotated: {meta.get('LastRotatedDate', 'Never')}")
        print(f"Rotation enabled: {meta.get('RotationEnabled', False)}")
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ResourceNotFoundException":
            return False
        raise


# ── Check rotation status in a pipeline health check ──
if not secret_exists("prod/orders-db/credentials"):
    raise RuntimeError("Required secret not found — check Secrets Manager setup!")

print("Secret health check passed ✓")

📌 When to use describe_secret vs get_secret_value

Use describe_secret() when you only need to check existence or metadata (health checks, audits). Use get_secret_value() when you need the actual password or token. Both are billed, rate-limited API calls; the advantage of describe_secret() is that the credential is never returned, so a health-check role needs only the secretsmanager:DescribeSecret permission and never sees the secret value.

list_secrets() with Paginator — Audit All Secrets

list_secrets() returns all secrets in the account/region. Since an account can have hundreds of secrets, always use the paginator — never call list_secrets() in a bare loop with NextToken. You can filter by name prefix or tags to narrow results.

python — list_secrets() with paginator

import boto3

sm = boto3.client("secretsmanager", region_name="us-east-1")
paginator = sm.get_paginator("list_secrets")

# ── List ALL secrets in the account ──
all_secrets = []
for page in paginator.paginate():
    for secret in page["SecretList"]:
        all_secrets.append({
            "name"            : secret["Name"],
            "arn"             : secret["ARN"],
            "rotation_enabled": secret.get("RotationEnabled", False),
            "last_rotated"    : secret.get("LastRotatedDate", None),
            "tags"            : {t["Key"]: t["Value"] for t in secret.get("Tags", [])}
        })

print(f"Total secrets: {len(all_secrets)}")

# ── Filter: find secrets for prod environment ──
prod_secrets = [s for s in all_secrets if s["tags"].get("Environment") == "prod"]
print(f"Prod secrets: {len(prod_secrets)}")

# ── Audit: find secrets where rotation is NOT enabled ──
no_rotation = [s for s in all_secrets if not s["rotation_enabled"]]
print(f"Secrets without rotation: {len(no_rotation)}")
for s in no_rotation:
    print(f"  ⚠️  {s['name']}")

list_secrets() with Filters — Find by Name Prefix

The Filters parameter lets you narrow down secrets without fetching all of them. Filter by name, tag-key, tag-value, or description. This is much more efficient when you have hundreds of secrets and only care about a specific project's secrets.

python — list_secrets() with Filters

import boto3

sm = boto3.client("secretsmanager", region_name="us-east-1")
paginator = sm.get_paginator("list_secrets")

# ── Find only secrets under the "prod/" prefix ──
for page in paginator.paginate(
    Filters=[
        {"Key": "name", "Values": ["prod/"]}   # Prefix match on name
    ]
):
    for secret in page["SecretList"]:
        print(secret["Name"])
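
The same Filters parameter can narrow by tag. The sketch below assumes the tag-key and tag-value filters are applied separately on the server side, so it confirms the exact key/value pair again on the client. The function name find_secrets_by_tag is hypothetical, and `client` can be any boto3 Secrets Manager client.

```python
from typing import Iterator


def find_secrets_by_tag(client, tag_key: str, tag_value: str) -> Iterator[str]:
    """Yield names of secrets carrying the tag tag_key=tag_value.

    `client` is a boto3 Secrets Manager client (or a test double with the
    same get_paginator("list_secrets") interface).
    """
    paginator = client.get_paginator("list_secrets")
    pages = paginator.paginate(
        Filters=[
            {"Key": "tag-key", "Values": [tag_key]},
            {"Key": "tag-value", "Values": [tag_value]},
        ]
    )
    for page in pages:
        for secret in page["SecretList"]:
            tags = {t["Key"]: t["Value"] for t in secret.get("Tags", [])}
            # Server-side filters only narrow the listing; confirm the pair here.
            if tags.get(tag_key) == tag_value:
                yield secret["Name"]


# Usage (assumes `sm` is a boto3 Secrets Manager client):
#   for name in find_secrets_by_tag(sm, "Environment", "prod"):
#       print(name)
```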

🗑️

delete_secret() — Deletion with Recovery Window

delete_secret() — Soft Delete with Recovery Window

Secrets Manager does NOT immediately delete secrets. By default, a secret enters a recovery window (30 days unless you set RecoveryWindowInDays to a value from 7 to 30) during which it can be restored. This protects against accidental deletion: if a pipeline breaks because a secret was deleted, you have time to recover. After the recovery window, the secret is permanently gone.

python — delete_secret()

import boto3

sm = boto3.client("secretsmanager", region_name="us-east-1")

# ── Soft delete with a 7-day recovery window (30 days is the default if omitted) ──
response = sm.delete_secret(
    SecretId              = "prod/old-api/token",
    RecoveryWindowInDays  = 7      # Can be 7 to 30 days
)
print(f"Secret scheduled for deletion on: {response['DeletionDate']}")

# ── Hard delete (immediate, NO recovery possible) — use with extreme caution ──
sm.delete_secret(
    SecretId                   = "dev/temp-secret/test",
    ForceDeleteWithoutRecovery = True   # Permanent — no going back
)

⚠️ ForceDeleteWithoutRecovery

Never use ForceDeleteWithoutRecovery=True in production pipelines. This permanently destroys the secret with no way to recover it. It is only appropriate for ephemeral test secrets. In production, always use the recovery window.

ℹ️ Restore a Deleted Secret

During the recovery window, call sm.restore_secret(SecretId="...") to cancel the deletion and bring the secret back to active status. After the recovery window expires, this is no longer possible.
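
A small guard can make the restore idempotent. This sketch assumes describe_secret() includes a DeletedDate field only while a secret is scheduled for deletion; the function name is hypothetical and `client` is any boto3 Secrets Manager client.

```python
def restore_if_scheduled_for_deletion(client, secret_id: str) -> bool:
    """Cancel a pending deletion. Returns True if a restore was issued."""
    meta = client.describe_secret(SecretId=secret_id)
    # Assumption: DeletedDate is present only during the recovery window.
    if meta.get("DeletedDate") is None:
        return False
    client.restore_secret(SecretId=secret_id)
    return True


# Usage (assumes `sm` is a boto3 Secrets Manager client):
#   if restore_if_scheduled_for_deletion(sm, "prod/old-api/token"):
#       print("Deletion cancelled")
```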

🔄

Automatic Secret Rotation

How Automatic Rotation Works

Secrets Manager can automatically rotate credentials on a schedule using a Lambda function. When rotation triggers, Secrets Manager invokes the Lambda once per step: (1) createSecret generates a new credential and stores it as a new version labelled AWSPENDING, (2) setSecret updates the DB/API with the new credential, (3) testSecret verifies that the new credential works, (4) finishSecret moves the AWSCURRENT label to the new version. The old value stays as AWSPREVIOUS until the next rotation. Pipelines that call get_secret_value() get AWSCURRENT automatically, so no restart is needed as long as they re-fetch the secret rather than holding a cached copy forever.

┌─────────────────────────────────────────────────────────────────┐
│                     AUTOMATIC ROTATION FLOW                     │
│                                                                 │
│  Schedule (e.g. every 30 days)                                  │
│        │                                                        │
│        ▼                                                        │
│  Secrets Manager ──► triggers ──► Rotation Lambda               │
│                                          │                      │
│                                 ┌────────▼────────────────────┐ │
│                                 │ 1. Generate pw (AWSPENDING) │ │
│                                 │ 2. Update RDS user pw       │ │
│                                 │ 3. Verify new creds work    │ │
│                                 │ 4. Promote to AWSCURRENT    │ │
│                                 └─────────────────────────────┘ │
│                                                                 │
│  Pipeline calls get_secret_value() ──► AWSCURRENT (new pw)      │
│  (no restart needed, always fresh)                              │
└─────────────────────────────────────────────────────────────────┘

python — enable automatic rotation via boto3

import boto3

sm = boto3.client("secretsmanager", region_name="us-east-1")

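# ── Enable automatic rotation with a 30-day schedule ──
# Note: by default rotate_secret() also starts a rotation immediately;
# pass RotateImmediately=False to wait for the first scheduled window.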
sm.rotate_secret(
    SecretId           = "prod/orders-db/credentials",
    RotationLambdaARN  = "arn:aws:lambda:us-east-1:123456789012:function:SecretsManagerRotation",
    RotationRules      = {
        "AutomaticallyAfterDays": 30   # Rotate every 30 days
    }
)

print("Automatic rotation enabled. Lambda will rotate the secret every 30 days.")

📌 AWS Managed Rotation Lambdas

For RDS PostgreSQL, MySQL, and Oracle, AWS provides pre-built rotation Lambda functions available in the Serverless Application Repository. You don't need to write your own rotation Lambda — just reference the AWS-managed one when enabling rotation for RDS secrets.

⚖️

Secrets Manager vs Parameter Store — When to Use Which

Side-by-Side Comparison

Both services store secrets, but they serve different use cases. Knowing when to use which is a common interview question and a real decision you make in every project.

Feature	Secrets Manager	Parameter Store
Cost	$0.40/secret/month + API calls	Free (Standard tier)
Automatic Rotation	✅ Built-in with Lambda	❌ Manual only
Max Value Size	64 KB (65,536 bytes)	4KB (Standard), 8KB (Advanced)
Cross-account access	✅ Native resource policy	⚠️ Requires more setup
Versioning	✅ AWSCURRENT / AWSPREVIOUS	✅ Parameter history (both tiers)
KMS Encryption	✅ Always encrypted	SecureString only
Best for	DB passwords, API tokens, private keys	Config values, feature flags, non-secret params

💡 Rule of Thumb

If it's a password, key, or token → Secrets Manager. If it's a config value (S3 bucket name, environment name, feature flag, non-sensitive URL) → Parameter Store. The cost difference is real: 100 secrets = $40/month in Secrets Manager, $0 in Parameter Store.
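
The cost comparison is simple arithmetic and easy to sanity-check. The sketch below uses the $0.40 per secret per month figure from the table; the $0.05 per 10,000 API calls rate is an assumption about list pricing and should be checked against the current AWS price list.

```python
def secrets_manager_monthly_cost(num_secrets: int, api_calls: int = 0,
                                 per_secret: float = 0.40,
                                 per_10k_calls: float = 0.05) -> float:
    """Estimate the monthly Secrets Manager bill in USD.

    per_10k_calls is an assumed rate; confirm it before budgeting.
    """
    storage = num_secrets * per_secret
    requests = (api_calls / 10_000) * per_10k_calls
    return round(storage + requests, 2)


# 100 secrets with no API traffic, then with one million reads per month.
print(secrets_manager_monthly_cost(100))
print(secrets_manager_monthly_cost(100, api_calls=1_000_000))
```

Under these assumptions the per-secret charge dominates until read volume gets very high, which is one more reason to cache secrets in-process rather than fetching them per record.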

🏭

Full Production Pattern — Every DE Must Know This

The Universal Secret Retrieval Utility

Every real project has a secrets utility module that all pipelines import. This module handles: retrieval, JSON parsing, caching, error handling, and logging. Write it once, use it everywhere. Here is the complete production-grade version:

python — secrets_util.py — production secrets utility module

"""
secrets_util.py — Universal Secrets Manager utility for data pipelines.
Usage:
    from secrets_util import get_secret, get_db_creds
"""
import boto3
import json
import logging
from functools import lru_cache
from typing import Any
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)

# Clients are cached per region and reused across calls (and Lambda warm invocations)
@lru_cache(maxsize=None)
def _get_client(region: str = "us-east-1"):
    return boto3.client("secretsmanager", region_name=region)


@lru_cache(maxsize=32)
def get_secret(secret_name: str, region: str = "us-east-1") -> dict:
    """
    Retrieve and parse a JSON secret from Secrets Manager.
    Result is cached in-process (safe for Lambda warm invocations).

    Args:
        secret_name: Secret name or ARN
        region     : AWS region

    Returns:
        dict parsed from the SecretString JSON

    Raises:
        ValueError  : if secret is not valid JSON
        RuntimeError: if secret not found or access denied
    """
    client = _get_client(region)
    try:
        logger.info(f"Fetching secret: {secret_name}")
        response      = client.get_secret_value(SecretId=secret_name)
        secret_string = response.get("SecretString")

        if not secret_string:
            raise ValueError(f"Secret '{secret_name}' has no SecretString (may be binary)")

        return json.loads(secret_string)

    except ClientError as e:
        code = e.response["Error"]["Code"]
        if code == "ResourceNotFoundException":
            raise RuntimeError(f"Secret not found: {secret_name}") from e
        elif code == "AccessDeniedException":
            raise RuntimeError(f"IAM permission denied for secret: {secret_name}") from e
        elif code in ("DecryptionFailure", "InternalServiceError"):
            raise RuntimeError(f"KMS/internal error retrieving {secret_name}: {code}") from e
        else:
            raise

    except json.JSONDecodeError as e:
        raise ValueError(f"Secret '{secret_name}' is not valid JSON: {e}") from e


def get_db_creds(secret_name: str, region: str = "us-east-1") -> dict:
    """
    Convenience wrapper — returns DB credential keys with validation.
    Expects secret to have: host, port, username, password, dbname
    """
    creds = dict(get_secret(secret_name, region))   # copy so the cached dict is not mutated below
    required = {"host", "username", "password", "dbname"}
    missing  = required - creds.keys()
    if missing:
        raise ValueError(f"Secret '{secret_name}' missing keys: {missing}")
    creds.setdefault("port", 5432)
    return creds


def get_jdbc_url(secret_name: str, engine: str = "postgresql") -> tuple[str, str, str]:
    """
    Returns (jdbc_url, user, password) ready for spark.read.jdbc().
    engine: 'postgresql' | 'mysql' | 'redshift'
    """
    creds = get_db_creds(secret_name)
    driver_map = {
        "postgresql": ("postgresql", "org.postgresql.Driver"),
        "mysql"     : ("mysql",      "com.mysql.cj.jdbc.Driver"),
        "redshift"  : ("redshift",   "com.amazon.redshift.jdbc42.Driver"),
    }
    scheme, _ = driver_map.get(engine, ("postgresql", "org.postgresql.Driver"))
    url = f"jdbc:{scheme}://{creds['host']}:{creds['port']}/{creds['dbname']}"
    return url, creds["username"], creds["password"]


# ─────────────────────────────────────────────
# Example usage in a Glue / EMR PySpark job
# ─────────────────────────────────────────────
if __name__ == "__main__":
    SECRET = "prod/orders-db/credentials"

    # Pattern 1 — get raw dict
    creds = get_secret(SECRET)
    print(f"Host: {creds['host']}")

    # Pattern 2 — get JDBC tuple for Spark
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("test").getOrCreate()

    jdbc_url, user, pwd = get_jdbc_url(SECRET, engine="postgresql")
    df = (
        spark.read.format("jdbc")
        .option("url",      jdbc_url)
        .option("dbtable",  "orders")
        .option("user",     user)
        .option("password", pwd)
        .load()
    )
    df.printSchema()

🔶 Architecture Note

In Glue jobs, the job's IAM role must have secretsmanager:GetSecretValue permission on the specific secret ARN (not *). In EMR, the EC2 instance profile attached to the cluster must have the same permission. In Lambda, the execution role needs it. This is the IAM setup that makes boto3 auth work without any keys in code.
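
A least-privilege policy for that role can be generated rather than hand-written. The following is a minimal sketch; the function name, the Sid value, and the example ARN are illustrative assumptions.

```python
import json
from typing import List


def build_secret_read_policy(secret_arns: List[str]) -> dict:
    """Build an IAM policy document allowing reads of specific secrets only."""
    if not secret_arns:
        raise ValueError("Provide at least one secret ARN")
    for arn in secret_arns:
        # Refuse wildcards so the policy stays scoped to named secrets.
        if "*" in arn or not arn.startswith("arn:aws:secretsmanager:"):
            raise ValueError(f"Not a specific Secrets Manager ARN: {arn!r}")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadPipelineSecrets",
                "Effect": "Allow",
                "Action": ["secretsmanager:GetSecretValue"],
                "Resource": list(secret_arns),
            }
        ],
    }


policy = build_secret_read_policy([
    "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/orders-db/credentials-AbCdEf"
])
print(json.dumps(policy, indent=2))
```

If the secret is encrypted with a customer managed KMS key, the role also needs kms:Decrypt on that key; this sketch does not add that statement.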

Quick API Reference Summary

API	Purpose	Key Parameters	Returns
`get_secret_value()`	Retrieve secret value	SecretId, VersionStage	SecretString or SecretBinary
`create_secret()`	Create brand-new secret	Name, SecretString, KmsKeyId	ARN, Name, VersionId
`put_secret_value()`	Update/rotate existing secret	SecretId, SecretString	ARN, VersionId
`describe_secret()`	Get metadata (no value)	SecretId	ARN, rotation info, tags
`list_secrets()`	List all secrets (paginate!)	Filters	SecretList[]
`delete_secret()`	Schedule deletion	SecretId, RecoveryWindowInDays	DeletionDate
`restore_secret()`	Cancel scheduled deletion	SecretId	ARN, Name
`rotate_secret()`	Enable/trigger rotation	SecretId, RotationLambdaARN, RotationRules	ARN

29.30.11 — BOTO3 DEEP DIVE

SQS APIs

Amazon SQS (Simple Queue Service) is a fully managed message queue used to decouple pipeline stages, buffer events, and handle failures gracefully via Dead Letter Queues. Every DE must master the consume-process-delete pattern — the backbone of reliable event-driven pipelines.

📬

Queue Operations — create, get URL, delete

create_queue()

Creates a new SQS queue. You specify the queue name and optional attributes like visibility timeout, message retention, and whether it's a FIFO queue.

📦 Analogy

Think of a queue like a post-office mailbox. create_queue() is registering a new mailbox. Messages go in one end, your consumer picks them up from the other.

python

import boto3

sqs = boto3.client('sqs', region_name='us-east-1')

# Create a Standard Queue
response = sqs.create_queue(
    QueueName='pipeline-events',
    Attributes={
        'VisibilityTimeout': '60',          # seconds — how long msg is hidden after receive
        'MessageRetentionPeriod': '86400',   # 1 day in seconds
        'ReceiveMessageWaitTimeSeconds': '20', # long polling — always set to 20
    }
)
queue_url = response['QueueUrl']
print(queue_url)
# https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events


# Create a FIFO Queue (ordered + exactly-once)
fifo_response = sqs.create_queue(
    QueueName='ordered-jobs.fifo',   # FIFO queues MUST end in .fifo
    Attributes={
        'FifoQueue': 'true',
        'ContentBasedDeduplication': 'true'
    }
)

💡 Key Point

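Standard queues = at-least-once delivery, best-effort ordering. FIFO queues = exactly-once processing and strict ordering within a message group, but by default limited to 300 API calls per second per action (up to 3,000 messages per second with batches of 10) unless high-throughput mode is enabled. For most data pipelines, Standard is sufficient.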
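
Sending to a FIFO queue differs from the Standard case in one important way: every message needs a MessageGroupId, and ordering is guaranteed only within a group. The helper below is a hypothetical sketch that builds the keyword arguments for send_message(); the group and deduplication IDs shown are made-up examples.

```python
import json
from typing import Optional


def build_fifo_send_kwargs(queue_url: str, payload: dict, group_id: str,
                           dedup_id: Optional[str] = None) -> dict:
    """Build keyword arguments for sqs.send_message() on a FIFO queue."""
    if not queue_url.endswith(".fifo"):
        raise ValueError("FIFO queue names (and URLs) end in .fifo")
    kwargs = {
        "QueueUrl": queue_url,
        # sort_keys keeps the body stable, which helps content-based deduplication.
        "MessageBody": json.dumps(payload, sort_keys=True),
        "MessageGroupId": group_id,
    }
    if dedup_id is not None:
        # Optional when ContentBasedDeduplication is enabled on the queue.
        kwargs["MessageDeduplicationId"] = dedup_id
    return kwargs


kwargs = build_fifo_send_kwargs(
    "https://sqs.us-east-1.amazonaws.com/123456789012/ordered-jobs.fifo",
    {"job": "customer_load", "run_date": "2024-01-15"},
    group_id="customer_load",
)
print(kwargs["MessageGroupId"], kwargs["MessageBody"])
# Then: sqs.send_message(**kwargs)   (assumes `sqs` is a boto3 SQS client)
```

Choosing the group ID per pipeline or per entity, rather than one global group, lets unrelated work proceed in parallel while keeping each stream ordered.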

get_queue_url()

Retrieves the URL of an existing queue by name. You need the URL for every SQS operation — treat it like a queue's address.

python

# Get the URL of an existing queue
response = sqs.get_queue_url(QueueName='pipeline-events')
queue_url = response['QueueUrl']
print(f"Queue URL: {queue_url}")

# For cross-account queue access, pass the owner account ID
response = sqs.get_queue_url(
    QueueName='shared-pipeline-events',
    QueueOwnerAWSAccountId='987654321012'
)

delete_queue()

Permanently deletes a queue and all its messages. Irreversible — use with caution in production.

python

sqs.delete_queue(QueueUrl=queue_url)
print("Queue deleted")

# Note: After deletion, the queue name cannot be reused for 60 seconds

📤

Sending Messages — send_message() & send_message_batch()

send_message() — single message

Sends a single message to the queue. The message body can be any string — typically a JSON payload. You can optionally set a delay and attach message attributes (metadata).

python

import json

# Basic send — JSON payload is the standard pattern
event = {
    "pipeline_name": "customer_load",
    "s3_path": "s3://my-bucket/raw/customers/2024-01-15.csv",
    "triggered_by": "s3_event"
}

response = sqs.send_message(
    QueueUrl=queue_url,
    MessageBody=json.dumps(event),
    DelaySeconds=0,                # 0 = available immediately (default)
    MessageAttributes={
        'source': {
            'DataType': 'String',
            'StringValue': 's3-event-trigger'
        },
        'priority': {
            'DataType': 'Number',
            'StringValue': '1'
        }
    }
)
print(f"Message ID: {response['MessageId']}")

📦 Analogy

DelaySeconds is like a timed-release envelope — you drop it in the mailbox now, but the recipient can't open it until the delay expires. Useful for spacing out pipeline triggers.

send_message_batch() — up to 10 messages

Sends up to 10 messages in one API call. More efficient than calling send_message() in a loop. Each entry needs a unique Id within the batch.

python

# Send 3 pipeline events in one batch call
files = [
    "s3://bucket/raw/orders/part-001.parquet",
    "s3://bucket/raw/orders/part-002.parquet",
    "s3://bucket/raw/orders/part-003.parquet",
]

entries = [
    {
        'Id': str(i),                          # Unique ID within this batch
        'MessageBody': json.dumps({'file': f}),
        'DelaySeconds': 0
    }
    for i, f in enumerate(files)
]

response = sqs.send_message_batch(
    QueueUrl=queue_url,
    Entries=entries
)

# Always check for failures in batch operations
if response.get('Failed'):
    for failure in response['Failed']:
        print(f"Failed: {failure['Id']} — {failure['Message']}")
print(f"Sent: {len(response['Successful'])} messages")

💡 Key Point

Always inspect response['Failed'] in batch calls. Unlike single send_message(), batch calls don't raise exceptions for individual failures — they just report them in the response.
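
With more than 10 messages, the list has to be split into batches before calling send_message_batch(). The sketch below shows one way to do that while collecting failed entries for a later retry; the function names are hypothetical and `client` is any boto3 SQS client.

```python
import json
from typing import Iterator, List, Sequence


def chunked(items: Sequence, size: int = 10) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def send_in_batches(client, queue_url: str, payloads: Sequence[dict]) -> List[dict]:
    """Send payloads in batches of up to 10; return the entries that failed."""
    failed_entries: List[dict] = []
    for batch in chunked(payloads, 10):
        entries = [
            {"Id": str(i), "MessageBody": json.dumps(payload)}
            for i, payload in enumerate(batch)   # Ids only need to be unique per batch
        ]
        response = client.send_message_batch(QueueUrl=queue_url, Entries=entries)
        failed_ids = {f["Id"] for f in response.get("Failed", [])}
        failed_entries.extend(e for e in entries if e["Id"] in failed_ids)
    return failed_entries


# Usage (assumes `sqs` and `queue_url` from the examples above):
#   leftovers = send_in_batches(sqs, queue_url, [{"file": f} for f in files])
#   if leftovers:
#       print(f"{len(leftovers)} messages need a retry")
```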

📥

receive_message() — Long Polling, VisibilityTimeout

receive_message() — the core consumer call

receive_message() fetches up to 10 messages at a time from the queue. After you receive a message, it becomes invisible to other consumers for the VisibilityTimeout duration — giving you time to process and delete it.

📦 Analogy

It's like picking up a package from a shared locker. Once you open the locker, it's locked to others for 60 seconds. If you don't confirm you took the package (delete the message) before the timer runs out, the locker re-opens and someone else can grab it.

python

response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,       # max allowed is 10
    WaitTimeSeconds=20,           # ALWAYS use 20 — enables long polling
    VisibilityTimeout=120,        # hide msg for 120s while you process
    MessageAttributeNames=['All'] # receive custom attributes too
)

messages = response.get('Messages', [])  # empty list if queue is empty
print(f"Received {len(messages)} messages")

for msg in messages:
    body = json.loads(msg['Body'])
    receipt_handle = msg['ReceiptHandle']   # needed to delete the message
    message_id = msg['MessageId']

    print(f"Processing: {body}")
    print(f"Receipt: {receipt_handle[:40]}...")

Long Polling vs Short Polling

Short polling (WaitTimeSeconds=0) returns immediately even if the queue is empty — wasteful and expensive. Long polling (WaitTimeSeconds=20) waits up to 20 seconds for a message to arrive — saves cost and reduces empty responses.

python

# ❌ Short polling — costs money for empty polls
response = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=0)

# ✅ Long polling — waits up to 20s, returns as soon as message arrives
response = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=20)

💡 Key Point

Always set WaitTimeSeconds=20. It's the maximum and almost always the right choice. Long polling reduces the number of empty ReceiveMessage responses and lowers SQS costs significantly.

VisibilityTimeout — the invisible window

After a message is received, it's hidden from other consumers for VisibilityTimeout seconds. If you don't delete it within that window, it becomes visible again and another consumer can pick it up. Set it to at least 3x your expected processing time.

python

# If processing takes longer than expected, extend the timeout
sqs.change_message_visibility(
    QueueUrl=queue_url,
    ReceiptHandle=receipt_handle,
    VisibilityTimeout=300  # reset the timeout to 5 minutes from now (not added to the old value)
)
print("Extended visibility timeout")

🗑️

delete_message() & delete_message_batch()

delete_message() — confirm successful processing

After successfully processing a message, you must delete it using its ReceiptHandle. If you don't delete it, it reappears in the queue after the visibility timeout — causing duplicate processing.

python

# Delete a single message after processing
sqs.delete_message(
    QueueUrl=queue_url,
    ReceiptHandle=receipt_handle   # from receive_message() response
)
print("Message deleted — processing confirmed")

delete_message_batch() — bulk delete (up to 10)

Deletes up to 10 messages in a single API call. Use this when you receive a batch of messages and process them all together.

python

# Delete all successfully processed messages in one call
# (processed_messages: the received messages that were handled without error)
entries = [
    {'Id': str(i), 'ReceiptHandle': msg['ReceiptHandle']}
    for i, msg in enumerate(processed_messages)
]

response = sqs.delete_message_batch(
    QueueUrl=queue_url,
    Entries=entries
)

if response.get('Failed'):
    print(f"Some deletes failed: {response['Failed']}")

📊

get_queue_attributes() — Queue Depth & DLQ Monitoring

get_queue_attributes() — inspect queue health

Returns queue metadata including approximate message counts, configuration, and the DLQ redrive policy. Vital for monitoring queue depth in pipeline observability.

python

response = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=['All']
)
attrs = response['Attributes']

# Key metrics every DE monitors
visible_msgs    = int(attrs['ApproximateNumberOfMessages'])
invisible_msgs  = int(attrs['ApproximateNumberOfMessagesNotVisible'])
delayed_msgs    = int(attrs['ApproximateNumberOfMessagesDelayed'])
redrive_policy  = attrs.get('RedrivePolicy', 'No DLQ configured')

print(f"Visible (waiting):     {visible_msgs}")
print(f"Invisible (in-flight): {invisible_msgs}")
print(f"Delayed:               {delayed_msgs}")
print(f"DLQ policy:            {redrive_policy}")

# Alert if queue is building up
if visible_msgs > 1000:
    print("⚠️  Queue depth alarm — pipeline may be falling behind")

Dead Letter Queue (DLQ) — handling failed messages

A DLQ is a separate queue where messages are automatically moved after failing processing maxReceiveCount times. It's your safety net — failed messages don't vanish, they accumulate in the DLQ for inspection and replay.

📦 Analogy

Think of the DLQ as a "return to sender" bin. Letters that couldn't be delivered (processed) automatically go there. You can inspect them, fix the issue, and re-send them — or alert on-call when the bin fills up.

python

# Step 1 — Create the DLQ first
dlq_response = sqs.create_queue(QueueName='pipeline-events-dlq')
dlq_url = dlq_response['QueueUrl']

# Get DLQ ARN (needed for redrive policy)
dlq_attrs = sqs.get_queue_attributes(
    QueueUrl=dlq_url,
    AttributeNames=['QueueArn']
)
dlq_arn = dlq_attrs['Attributes']['QueueArn']

# Step 2 — Create main queue with DLQ redrive policy
main_queue = sqs.create_queue(
    QueueName='pipeline-events',
    Attributes={
        'RedrivePolicy': json.dumps({
            'deadLetterTargetArn': dlq_arn,
            'maxReceiveCount': '3'   # after 3 failed attempts → DLQ
        }),
        'VisibilityTimeout': '60'
    }
)

# Step 3 — Monitor DLQ depth with CloudWatch alarm
# DLQ filling up = alert on-call immediately
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_alarm(
    AlarmName='pipeline-dlq-depth',
    MetricName='ApproximateNumberOfMessagesVisible',
    Namespace='AWS/SQS',
    Dimensions=[{'Name': 'QueueName', 'Value': 'pipeline-events-dlq'}],
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    EvaluationPeriods=1,
    Period=60,
    Statistic='Sum',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:on-call-alerts']
)
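Once the root cause is fixed, messages in the DLQ usually need to be replayed. One way to do that by hand is sketched below, assuming sqs_client behaves like a boto3 SQS client; the function name replay_dlq and the max_messages cap are hypothetical. Only the message body is copied, so message attributes would need extra handling.

```python
from typing import Any


def replay_dlq(sqs_client: Any, dlq_url: str, main_url: str,
               max_messages: int = 100) -> int:
    """Move up to max_messages from the DLQ back to the main queue."""
    moved = 0
    while moved < max_messages:
        response = sqs_client.receive_message(
            QueueUrl=dlq_url,
            MaxNumberOfMessages=min(10, max_messages - moved),
            WaitTimeSeconds=1,
        )
        messages = response.get('Messages', [])
        if not messages:
            break  # DLQ drained, or nothing visible right now
        for msg in messages:
            # Re-send first, delete second: a crash in between duplicates
            # the message instead of losing it.
            sqs_client.send_message(QueueUrl=main_url, MessageBody=msg['Body'])
            sqs_client.delete_message(
                QueueUrl=dlq_url, ReceiptHandle=msg['ReceiptHandle']
            )
            moved += 1
    return moved
```

The send-then-delete order favors duplicates over data loss, which matches the at-least-once behavior of the rest of the pipeline.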

🔄

Full Consume-Process-Delete Pattern — Production Template PRODUCTION

The canonical SQS consumer loop

This is the most important SQS pattern for data engineers. The rule: receive → process → delete on success → let visibility timeout expire on failure (message returns to queue → eventually hits DLQ).

receive_message() → process payload
  ✅ success? → delete_message()
  ❌ failure? → let timeout expire → retry → DLQ

python

import boto3, json, time, logging
from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)
sqs = boto3.client('sqs', region_name='us-east-1')
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events"

def process_message(body: dict) -> None:
    """Your actual pipeline logic goes here."""
    pipeline_name = body['pipeline_name']
    s3_path = body['s3_path']
    logger.info(f"Starting pipeline: {pipeline_name} for {s3_path}")
    # e.g. trigger Glue job, run Spark transform, etc.
    glue = boto3.client('glue')
    glue.start_job_run(
        JobName=pipeline_name,
        Arguments={'--input_path': s3_path}
    )

def run_consumer_loop():
    print("Consumer loop started...")
    while True:
        try:
            # Step 1 — Receive (long polling, up to 10 at once)
            response = sqs.receive_message(
                QueueUrl=QUEUE_URL,
                MaxNumberOfMessages=10,
                WaitTimeSeconds=20,      # long poll
                VisibilityTimeout=120    # 2 min to process
            )
            messages = response.get('Messages', [])

            if not messages:
                print("No messages, waiting...")
                continue

            for msg in messages:
                receipt = msg['ReceiptHandle']

                try:
                    # Step 2 — Parse and process
                    # (parsing sits inside the try so a malformed body is
                    # treated as a failed message, not a crashed consumer)
                    body = json.loads(msg['Body'])
                    process_message(body)

                    # Step 3 — Delete ONLY on success
                    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=receipt)
                    logger.info(f"✅ Processed and deleted: {msg['MessageId']}")

                except Exception as e:
                    # Step 4 — On failure: do NOT delete
                    # Message becomes visible again after VisibilityTimeout
                    # After maxReceiveCount attempts → goes to DLQ
                    logger.error(f"❌ Processing failed for {msg['MessageId']}: {e} (message will retry)")

        except ClientError as e:
            logger.error(f"SQS error: {e.response['Error']['Code']}")
            time.sleep(5)

if __name__ == "__main__":
    run_consumer_loop()

⚠️ Critical Rule

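Never delete a message before processing succeeds. The delete is your commit: only commit when you know the work is done. This gives you at-least-once processing, not exactly-once. A standard queue can deliver the same message more than once, and a crash between processing and delete causes a retry, so the processing step itself should be idempotent.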
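Because a standard queue can deliver the same message more than once, delete-after-success is normally paired with an idempotency check. A minimal sketch follows. The make_idempotent wrapper is hypothetical and uses an in-memory set, which is an assumption made for illustration only; a real consumer would need a durable store shared by all workers.

```python
from typing import Callable, Dict, Set


def make_idempotent(handler: Callable[[Dict], None],
                    seen_ids: Set[str]) -> Callable[[Dict], bool]:
    """Wrap a handler so a redelivered SQS message is processed only once.

    seen_ids is an in-memory stand-in for a durable deduplication store.
    """
    def handle(message: Dict) -> bool:
        message_id = message['MessageId']
        if message_id in seen_ids:
            return False           # duplicate delivery: skip the work
        handler(message)
        seen_ids.add(message_id)   # record only after the work succeeded
        return True
    return handle
```

Whether the wrapper returns True or False, the consumer can safely delete the message afterwards, since the work has been done exactly once from the pipeline's point of view.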

change_message_visibility() — extend timeout mid-processing

If your processing is taking longer than expected, extend the visibility timeout before it expires. Otherwise the message reappears and gets double-processed.

python

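import threading

def keep_alive_heartbeat(queue_url, receipt_handle, stop_event, interval=90):
    """Extend visibility every `interval` seconds until stop_event is set."""
    # Event.wait() returns False on timeout (keep going) and True once set (stop)
    while not stop_event.wait(interval):
        try:
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=120   # reset the clock; must exceed `interval`
            )
            logger.info("Heartbeat: visibility extended")
        except Exception:
            break  # message already deleted or expired

# Inside the consumer loop: run the heartbeat only while this message is processed
stop_event = threading.Event()
t = threading.Thread(
    target=keep_alive_heartbeat,
    args=(QUEUE_URL, receipt, stop_event),
    daemon=True
)
t.start()
try:
    process_message(body)
finally:
    stop_event.set()   # stop the heartbeat whether processing succeeded or failed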

📋

SQS APIs Quick Reference REFERENCE

API Call	What It Does	Key Param
`create_queue()`	Create new queue (Standard or FIFO)	`QueueName`, `Attributes`
`get_queue_url()`	Get URL of existing queue by name	`QueueName`
`delete_queue()`	Permanently delete queue + messages	`QueueUrl`
`send_message()`	Send one message	`MessageBody`, `DelaySeconds`
`send_message_batch()`	Send up to 10 messages at once	`Entries` (list, max 10)
`receive_message()`	Fetch up to 10 messages	`WaitTimeSeconds=20`, `VisibilityTimeout`
`delete_message()`	Confirm processing — remove from queue	`ReceiptHandle`
`delete_message_batch()`	Delete up to 10 messages at once	`Entries` with ReceiptHandles
`change_message_visibility()`	Extend or reset visibility timeout	`VisibilityTimeout`
`get_queue_attributes()`	Queue depth, DLQ policy, config	`AttributeNames=['All']`

💡 Interview Tip

The three things interviewers always ask about SQS: (1) difference between Standard and FIFO, (2) what VisibilityTimeout does and why you must delete messages, (3) how DLQ works and why it matters for reliability.

29.30.12 — BOTO3 DEEP DIVE

SNS APIs

Amazon SNS (Simple Notification Service) is a fully managed pub/sub messaging service. One topic can fan-out a single message to many subscribers — email, SQS queues, Lambda functions, or HTTPS endpoints. For data engineers, SNS is the backbone of pipeline alerting (failure notifications, SLA breaches, CloudWatch alarms) and fan-out architectures where one event needs to trigger multiple independent consumers.

📋

Topic Management — create, delete, list SETUP

create_topic()

Creates a new SNS topic. A topic is a logical channel — publishers send messages to it, and SNS delivers a copy to every subscriber. Returns a TopicArn which you'll use for every other operation.

📦 Analogy

Think of a topic like a radio station. create_topic() launches the station. Anyone can tune in (subscribe), and when the station broadcasts (publish), every listener hears it at the same time — regardless of how many listeners there are.

python

import boto3

sns = boto3.client('sns', region_name='us-east-1')

# Create a Standard topic (most common for DE)
response = sns.create_topic(
    Name='pipeline-alerts',
    Attributes={
        'DisplayName': 'Pipeline Alerts'   # shown as the sender name in email notifications
    }
)
topic_arn = response['TopicArn']
print(topic_arn)
# arn:aws:sns:us-east-1:123456789012:pipeline-alerts


# Create a FIFO topic — ordered, exactly-once delivery to SQS FIFO subscribers
fifo_response = sns.create_topic(
    Name='pipeline-events.fifo',   # FIFO topics MUST end in .fifo
    Attributes={
        'FifoTopic': 'true',
        'ContentBasedDeduplication': 'true'
    }
)

💡 Key Point

create_topic() is idempotent: calling it again with the same Name returns the existing topic's ARN instead of erroring, provided the attributes match (a call with conflicting attributes is rejected). Safe to call on every pipeline startup.

delete_topic()

Permanently deletes a topic and all its subscriptions. Any messages already delivered to subscribers (e.g. sitting in an SQS queue) are not removed — only the topic and the subscription links are deleted.

python

sns.delete_topic(TopicArn=topic_arn)
print("Topic deleted")

# Note: subscriptions attached to this topic are also removed,
# but messages already in downstream SQS queues remain there.

list_topics() — with paginator

Lists all SNS topics in the account/region. Like most list operations, results are paginated — always use a paginator instead of manually following NextToken.

python

paginator = sns.get_paginator('list_topics')

for page in paginator.paginate():
    for topic in page['Topics']:
        print(topic['TopicArn'])

# Filter for pipeline-related topics only
pipeline_topics = [
    t['TopicArn'] for page in paginator.paginate()
    for t in page['Topics']
    if 'pipeline' in t['TopicArn']
]

🔔

Subscriptions — subscribe, unsubscribe, filter policies SUBSCRIBERS

subscribe() — endpoint & protocol

Attaches a subscriber to a topic. The Protocol tells SNS how to deliver the message, and Endpoint tells it where. Common protocols for data pipelines: email, sqs, lambda, https, sms.

📦 Analogy

subscribe() is like giving the radio station your address and saying "deliver every broadcast here, by mail / by SQS queue / by phone call." The station doesn't care how many addresses are on file — it broadcasts once, and the postal system (SNS) delivers to all of them.

python

# 1. Email subscription — for on-call alerts
email_sub = sns.subscribe(
    TopicArn=topic_arn,
    Protocol='email',
    Endpoint='data-oncall@company.com'
)
# Subscriber must confirm via a link sent to their inbox before
# messages start flowing — SNS marks this as PendingConfirmation


# 2. SQS subscription — for pipeline-to-pipeline fan-out
sqs_sub = sns.subscribe(
    TopicArn=topic_arn,
    Protocol='sqs',
    Endpoint='arn:aws:sqs:us-east-1:123456789012:dq-failure-queue',
    Attributes={
        'RawMessageDelivery': 'true'   # delivers raw JSON, not wrapped in SNS envelope
    }
)


# 3. Lambda subscription — for routing to Slack via a Lambda function
lambda_sub = sns.subscribe(
    TopicArn=topic_arn,
    Protocol='lambda',
    Endpoint='arn:aws:lambda:us-east-1:123456789012:function:slack-notifier'
)


# 4. HTTPS subscription — for webhooks (e.g. PagerDuty, custom API)
https_sub = sns.subscribe(
    TopicArn=topic_arn,
    Protocol='https',
    Endpoint='https://events.pagerduty.com/integration/abc123/enqueue',
    ReturnSubscriptionArn=True
)
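The RawMessageDelivery attribute on the SQS subscription above changes what the consumer finds in the SQS message body. With it enabled, the body is the published payload; without it, the payload is a string nested inside an SNS JSON envelope. A consumer that has to cope with both could unwrap it roughly as follows. The helper name extract_payload is hypothetical, and the sketch assumes the published payload is itself JSON.

```python
import json
from typing import Any, Dict


def extract_payload(sqs_body: str) -> Dict[str, Any]:
    """Return the published payload whether or not raw delivery is enabled."""
    body = json.loads(sqs_body)
    # An SNS envelope is a JSON object with Type == "Notification" and a
    # string "Message" field that holds the original payload.
    if isinstance(body, dict) and body.get('Type') == 'Notification' and 'Message' in body:
        return json.loads(body['Message'])
    return body
```

Enabling raw delivery on every pipeline subscription removes the need for this branch, which is why the fan-out examples in this section set it.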

⚠️ Confirmation Required

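email and https subscriptions sit in PendingConfirmation state until the endpoint confirms (clicking a link, or responding to an SNS handshake POST). sqs and lambda subscriptions created in the same account as the topic are confirmed automatically. Delivery still requires that the queue's access policy (or the function's resource policy) allows SNS to send to it.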

unsubscribe()

Removes a subscription using its SubscriptionArn (returned by subscribe() or found via list_subscriptions_by_topic()).

python

sns.unsubscribe(
    SubscriptionArn=sqs_sub['SubscriptionArn']   # a confirmed subscription
)
print("Unsubscribed")

# Note: while a subscription is still pending confirmation, its SubscriptionArn
# is a placeholder string ('pending confirmation' from subscribe(),
# 'PendingConfirmation' in list output), not a real ARN. It cannot be
# unsubscribed via the API and is removed automatically if never confirmed.

list_subscriptions_by_topic()

Lists every subscriber attached to a specific topic — useful for auditing who/what receives pipeline alerts.

python

paginator = sns.get_paginator('list_subscriptions_by_topic')

for page in paginator.paginate(TopicArn=topic_arn):
    for sub in page['Subscriptions']:
        print(f"{sub['Protocol']:6} -> {sub['Endpoint']} ({sub['SubscriptionArn']})")

# email  -> data-oncall@company.com (PendingConfirmation)
# sqs    -> arn:aws:sqs:...:dq-failure-queue (arn:aws:sns:...:8f2e...)
# lambda -> arn:aws:lambda:...:slack-notifier (arn:aws:sns:...:a1b9...)

set_subscription_attributes() — filter policies

By default, every subscriber gets every message published to a topic. A filter policy lets a subscriber say "only send me messages where MessageAttributes match this pattern" — turning one topic into many logical channels.

📦 Analogy

Filter policies are like telling the radio station "only ring my phone for the traffic segment, not the weather segment" — same station, same broadcasts, but you only get notified for the parts relevant to you.

python

import json

# Only deliver to this subscriber when severity = ERROR or CRITICAL
# AND the pipeline is one of the "tier-1" pipelines
filter_policy = {
    "severity": ["ERROR", "CRITICAL"],
    "pipeline_tier": ["tier1"]
}

sns.set_subscription_attributes(
    SubscriptionArn=sqs_sub['SubscriptionArn'],
    AttributeName='FilterPolicy',
    AttributeValue=json.dumps(filter_policy)
)

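# Optionally match the filter policy against the JSON message body
# instead of MessageAttributes (FilterPolicyScope='MessageBody').
# With this scope, the policy keys (severity, pipeline_tier) must be
# fields in the published JSON body, otherwise nothing matches.
sns.set_subscription_attributes(
    SubscriptionArn=sqs_sub['SubscriptionArn'],
    AttributeName='FilterPolicyScope',
    AttributeValue='MessageBody'
)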

💡 Key Point

Filter policies are evaluated by SNS, not by your consumer code. This means filtered-out messages never even reach your SQS queue or Lambda — saving cost and reducing noise. This is the core mechanism behind SNS fan-out routing.
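The matching rule is easier to see in code: every key in the policy must be present (AND across keys), and the attribute's value must be one of the listed values (OR within a key). The function below is a local approximation for reasoning and unit tests only. It handles exact string matching and none of the richer operators that SNS supports, and the name matches_filter_policy is hypothetical.

```python
from typing import Dict, List


def matches_filter_policy(policy: Dict[str, List[str]],
                          attributes: Dict[str, str]) -> bool:
    """Approximate SNS filter matching for exact string values only."""
    return all(
        key in attributes and attributes[key] in allowed_values
        for key, allowed_values in policy.items()
    )


policy = {"severity": ["ERROR", "CRITICAL"], "pipeline_tier": ["tier1"]}

delivered = matches_filter_policy(
    policy, {"severity": "CRITICAL", "pipeline_tier": "tier1"}
)   # both keys match, so the subscriber would receive the message
dropped = matches_filter_policy(
    policy, {"severity": "INFO", "pipeline_tier": "tier1"}
)   # severity is not in the allowed list, so the message is filtered out
```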

📤

Publishing Messages — publish() & publish_batch() PRODUCER

publish() — TopicArn, Message, Subject

Sends a single message to every subscriber of a topic (subject to filter policies). Subject becomes the email subject line for email subscribers; for other protocols it appears only as a field in the standard SNS JSON envelope. MessageAttributes is metadata used by filter policies and is delivered separately from the message body.

python

import json

# Basic alert publish — the classic "pipeline failed" notification
message_body = {
    "pipeline_name": "customer_load",
    "run_id": "run_20260615_0300",
    "status": "FAILED",
    "error": "Schema mismatch on column 'email'"
}

response = sns.publish(
    TopicArn=topic_arn,
    Subject="❌ Pipeline Failed: customer_load",   # subject line for email subscribers
    Message=json.dumps(message_body),
    MessageAttributes={
        'severity': {
            'DataType': 'String',
            'StringValue': 'ERROR'
        },
        'pipeline_tier': {
            'DataType': 'String',
            'StringValue': 'tier1'
        }
    }
)
print(response['MessageId'])

💡 Key Point

MessageAttributes are not part of Message — they're separate key/value metadata sent alongside it. Filter policies match against MessageAttributes (or the message body, if FilterPolicyScope='MessageBody'), not against the raw text of Message.

publish_batch() — up to 10 messages

Publishes up to 10 messages in a single API call — much more efficient than calling publish() in a loop when a pipeline run produces multiple events (e.g. one alert per failed table in a multi-table load).

python

failed_tables = ["orders", "customers", "inventory"]

entries = []
for i, table in enumerate(failed_tables):
    entries.append({
        'Id': str(i),                     # unique within this batch only
        'Message': json.dumps({
            "table": table,
            "status": "FAILED",
            "run_id": "run_20260615_0300"
        }),
        'Subject': f"Table load failed: {table}",
        'MessageAttributes': {
            'severity': {'DataType': 'String', 'StringValue': 'ERROR'}
        }
    })

response = sns.publish_batch(
    TopicArn=topic_arn,
    PublishBatchRequestEntries=entries
)

# Always check for partial failures
for failed in response.get('Failed', []):
    print(f"Entry {failed['Id']} failed: {failed['Code']} - {failed['Message']}")

⚠️ Partial Failures

publish_batch() can partially succeed: some entries land in Successful, others in Failed. Always check both lists; a call that returns without raising doesn't mean every message was published.
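A sketch of how partial failures might be handled in practice: chunk the entries into groups of 10, retry the ones that failed for transient reasons, and give up on the ones SNS marks as the sender's fault. The function name publish_all and its max_attempts and pause parameters are hypothetical; sns_client is assumed to behave like a boto3 SNS client.

```python
import time
from typing import Any, Dict, List

SNS_BATCH_LIMIT = 10  # publish_batch accepts at most 10 entries


def publish_all(sns_client: Any, topic_arn: str, entries: List[Dict[str, Any]],
                max_attempts: int = 3, pause: float = 0.5) -> List[Dict[str, Any]]:
    """Publish entries in chunks of 10; return the entries that never succeeded."""
    gave_up: List[Dict[str, Any]] = []
    for start in range(0, len(entries), SNS_BATCH_LIMIT):
        pending = entries[start:start + SNS_BATCH_LIMIT]
        for attempt in range(max_attempts):
            response = sns_client.publish_batch(
                TopicArn=topic_arn, PublishBatchRequestEntries=pending
            )
            failed = response.get('Failed', [])
            # SenderFault=True means the entry itself is invalid; retrying cannot help
            permanent = {f['Id'] for f in failed if f.get('SenderFault')}
            transient = {f['Id'] for f in failed} - permanent
            gave_up.extend(e for e in pending if e['Id'] in permanent)
            pending = [e for e in pending if e['Id'] in transient]
            if not pending:
                break
            if attempt < max_attempts - 1:
                time.sleep(pause * 2 ** attempt)   # simple exponential backoff
        gave_up.extend(pending)   # transient failures that outlived every attempt
    return gave_up
```

Entry Ids must stay unique within each chunk, which holds when they are generated from a running index as in the example above.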

🔀

Fan-Out Architecture & Data Engineering Patterns PRODUCTION

SNS → SQS fan-out pattern

One published event needs to trigger multiple independent consumers — e.g. a "new file arrived" event should (1) trigger an ingestion pipeline, (2) update a data catalog, and (3) log to an audit system. Publish once to SNS; attach an SQS queue per consumer.

S3 file arrives → sns.publish() (one event)
  → SQS Queue A (ingestion)
  ↘ SQS Queue B (catalog update)
  ↘ SQS Queue C (audit log)

python

# Setup once: one topic, three SQS subscribers
topic_arn = sns.create_topic(Name='s3-file-arrivals')['TopicArn']

for queue_arn in [
    'arn:aws:sqs:us-east-1:123456789012:ingestion-queue',
    'arn:aws:sqs:us-east-1:123456789012:catalog-update-queue',
    'arn:aws:sqs:us-east-1:123456789012:audit-log-queue',
]:
    sns.subscribe(
        TopicArn=topic_arn,
        Protocol='sqs',
        Endpoint=queue_arn,
        Attributes={'RawMessageDelivery': 'true'}
    )

# Each time a file lands, publish ONE event — all 3 queues get a copy
sns.publish(
    TopicArn=topic_arn,
    Message=json.dumps({
        "bucket": "my-data-lake",
        "key": "raw/customers/2026-06-15.csv",
        "event_time": "2026-06-15T03:00:00Z"
    })
)
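Subscribing a queue is not enough on its own: each queue's access policy must also allow the topic to send to it, otherwise published messages never arrive. The sketch below builds such a policy. The helper names are hypothetical and the statement Sid is arbitrary; the condition restricts sending to the one topic.

```python
import json
from typing import Any, Dict


def sns_to_sqs_policy(queue_arn: str, topic_arn: str) -> Dict[str, Any]:
    """Queue access policy letting one SNS topic send messages to one queue."""
    return {
        'Version': '2012-10-17',
        'Statement': [{
            'Sid': 'AllowSnsTopicSend',          # arbitrary label
            'Effect': 'Allow',
            'Principal': {'Service': 'sns.amazonaws.com'},
            'Action': 'sqs:SendMessage',
            'Resource': queue_arn,
            # Only this topic may send, not every topic in every account
            'Condition': {'ArnEquals': {'aws:SourceArn': topic_arn}},
        }],
    }


def policy_attribute(queue_arn: str, topic_arn: str) -> Dict[str, str]:
    """Shape expected by the Attributes argument of sqs.set_queue_attributes()."""
    return {'Policy': json.dumps(sns_to_sqs_policy(queue_arn, topic_arn))}
```

With a boto3 SQS client this would be applied as sqs.set_queue_attributes(QueueUrl=queue_url, Attributes=policy_attribute(queue_arn, topic_arn)), once per subscribed queue.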

Pipeline failure alerts — SNS → email / Slack

The most common SNS pattern in data engineering: when a Glue job, EMR step, or Airflow task fails, publish to an alert topic. Email subscribers get a notification directly; a Lambda subscriber can reformat the message and post it to Slack.

python

# Inside a Lambda subscriber — reformats SNS message for Slack
import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"

def lambda_handler(event, context):
    for record in event['Records']:
        sns_message = json.loads(record['Sns']['Message'])

        slack_payload = {
            "text": (
                f"🚨 *Pipeline Failure*\n"
                f"Pipeline: `{sns_message['pipeline_name']}`\n"
                f"Run ID: `{sns_message['run_id']}`\n"
                f"Error: `{sns_message['error']}`"
            )
        }
        req = urllib.request.Request(
            SLACK_WEBHOOK,
            data=json.dumps(slack_payload).encode(),
            headers={'Content-Type': 'application/json'}
        )
        # Timeout keeps a slow webhook from holding the Lambda until it times out
        urllib.request.urlopen(req, timeout=10)

CloudWatch Alarm → SNS → notification

CloudWatch Alarms publish directly to an SNS topic when a metric breaches a threshold (e.g. EMR step failure count, Glue job duration, DLQ depth). You don't call publish() yourself here — CloudWatch does it for you once the alarm is wired to the topic ARN.

python

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='glue-job-duration-exceeded',
    Namespace='Glue',            # Glue job metrics use the 'Glue' namespace, not 'AWS/Glue'
    MetricName='glue.driver.aggregate.elapsedTime',
    Statistic='Maximum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=3600000,           # 1 hour in ms
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=[topic_arn],     # <-- SNS topic gets notified on breach
    Dimensions=[
        {'Name': 'JobName',  'Value': 'customer_load_etl'},
        {'Name': 'JobRunId', 'Value': 'ALL'},
        {'Name': 'Type',     'Value': 'count'}
    ]
)

# Now: alarm fires -> SNS publishes -> every subscriber (email/Slack/PagerDuty) notified

📦 Analogy

SNS is the nervous system of a pipeline — CloudWatch is the nerve ending that feels the pain (a breached threshold), SNS is the spinal cord that broadcasts the signal, and your email/Slack/PagerDuty subscribers are the brain that decides what to do about it.

📊

SNS APIs Quick Reference REFERENCE

API Call	What It Does	Key Param
`create_topic()`	Create a topic (Standard or FIFO)	`Name` (idempotent)
`delete_topic()`	Delete topic + all subscriptions	`TopicArn`
`list_topics()`	List all topics (paginated)	—
`subscribe()`	Attach a subscriber to a topic	`Protocol`, `Endpoint`
`unsubscribe()`	Remove a subscriber	`SubscriptionArn`
`list_subscriptions_by_topic()`	List subscribers on a topic (paginated)	`TopicArn`
`set_subscription_attributes()`	Set filter policy / raw delivery	`AttributeName='FilterPolicy'`
`publish()`	Send one message to all subscribers	`Message`, `MessageAttributes`
`publish_batch()`	Send up to 10 messages at once	`PublishBatchRequestEntries`

💡 Interview Tip

The three things interviewers always ask about SNS: (1) SNS vs SQS: SNS is push pub/sub (fan-out to many), SQS is a pull queue (one consumer per message); (2) how filter policies let one topic serve multiple consumers with different needs; (3) the SNS → SQS fan-out pattern and why RawMessageDelivery matters.

29.30.13 — BOTO3 DEEP DIVE

DynamoDB APIs

Amazon DynamoDB is a fully managed, serverless key-value and document database. For Data Engineers, DynamoDB is not a primary data warehouse — it's the metadata backbone of your pipeline: job audit tables, control tables, watermark tracking, run history, and config storage. It's millisecond-latency at any scale, requires zero schema management, and is the go-to choice for storing pipeline state that your pipeline code reads and writes programmatically via boto3.

🗄️

Table Operations — create, describe, delete, list SETUP

create_table()

Creates a DynamoDB table. You must define the KeySchema (partition key + optional sort key) and their types in AttributeDefinitions. Every other attribute is schema-free — you just put any JSON you want per item. The two billing modes are PAY_PER_REQUEST (on-demand, great for pipelines with variable traffic) and PROVISIONED (fixed read/write capacity units).

📦 Analogy

DynamoDB is like a giant filing cabinet where every drawer (partition key) has labelled folders (sort key), but each folder can hold any kind of paper — you don't need to pre-declare what fields the paper will have. You only decide which label to put on the drawer and folder at creation time; everything inside is free-form.

python

import boto3

dynamodb = boto3.client('dynamodb', region_name='us-east-1')

# Pipeline audit table — partition key = pipeline_id, sort key = run_id
dynamodb.create_table(
    TableName='pipeline_audit',
    KeySchema=[
        {'AttributeName': 'pipeline_id', 'KeyType': 'HASH'},   # partition key
        {'AttributeName': 'run_id',      'KeyType': 'RANGE'}  # sort key
    ],
    AttributeDefinitions=[
        {'AttributeName': 'pipeline_id', 'AttributeType': 'S'},  # S=String
        {'AttributeName': 'run_id',      'AttributeType': 'S'}
    ],
    BillingMode='PAY_PER_REQUEST'   # no capacity planning needed
)

# Watermark table — just a partition key, no sort key
dynamodb.create_table(
    TableName='pipeline_watermark',
    KeySchema=[
        {'AttributeName': 'pipeline_id', 'KeyType': 'HASH'}
    ],
    AttributeDefinitions=[
        {'AttributeName': 'pipeline_id', 'AttributeType': 'S'}
    ],
    BillingMode='PAY_PER_REQUEST'
)

💡 Key Point

Only the key attributes (pipeline_id, run_id) appear in AttributeDefinitions. Every other field you'll write — status, row_count, error_message — is not declared anywhere. DynamoDB is schema-less beyond the key.

describe_table()

Returns the full table description including its status (CREATING, ACTIVE, DELETING), key schema, billing mode, item count, and size. Use this to wait for a table to become ACTIVE after creation before writing to it.

python

import time

# Wait for table to be ACTIVE before writing
def wait_for_table_active(table_name, max_wait=60):
    for _ in range(max_wait // 2):
        resp = dynamodb.describe_table(TableName=table_name)
        status = resp['Table']['TableStatus']
        if status == 'ACTIVE':
            print(f"{table_name} is ACTIVE")
            return
        print(f"Status: {status} — waiting...")
        time.sleep(2)
    raise TimeoutError("Table did not become ACTIVE in time")

wait_for_table_active('pipeline_audit')

# Also useful for reading current item count (eventually consistent)
info = dynamodb.describe_table(TableName='pipeline_audit')['Table']
print(f"Items: {info['ItemCount']}, Size: {info['TableSizeBytes']} bytes")
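boto3 also ships a built-in waiter, table_exists, which polls DescribeTable until the table is ACTIVE, so the hand-written loop can usually be replaced. A minimal sketch follows. The wrapper name wait_until_active and its default delay and attempt values are assumptions; dynamodb_client is expected to be a boto3 DynamoDB client.

```python
from typing import Any


def wait_until_active(dynamodb_client: Any, table_name: str,
                      delay: int = 2, max_attempts: int = 30) -> None:
    """Block until the table is ACTIVE using the built-in table_exists waiter."""
    waiter = dynamodb_client.get_waiter('table_exists')
    # The waiter raises an error if the table is still not ACTIVE
    # after delay * max_attempts seconds.
    waiter.wait(
        TableName=table_name,
        WaiterConfig={'Delay': delay, 'MaxAttempts': max_attempts},
    )
```

Calling wait_until_active(dynamodb, 'pipeline_audit') right after create_table() gives the same guarantee as the polling loop with less code to maintain.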

delete_table() and list_tables()

delete_table() permanently destroys the table and all its items. list_tables() returns all tables in the region — note it paginates via LastEvaluatedTableName, not the standard NextToken, so use the paginator.

python

# Delete a table
dynamodb.delete_table(TableName='pipeline_watermark')

# List all tables — always use paginator
paginator = dynamodb.get_paginator('list_tables')
for page in paginator.paginate():
    for table_name in page['TableNames']:
        print(table_name)

✍️

Item Operations — put, get, update, delete CORE CRUD

put_item() — full item write

put_item() writes a complete item. If an item with the same key already exists, it is completely replaced. All attribute values must be typed using DynamoDB's type notation: S (string), N (number — always a string!), BOOL, NULL, L (list), M (map).

📦 Analogy

put_item() is like placing a whole new document into the filing cabinet. If a document with that label already exists, it gets shredded and replaced — not merged. If you want to update only specific fields, use update_item() instead.

python

from datetime import datetime, timezone

# Write a job audit record — every pipeline run writes one of these
dynamodb.put_item(
    TableName='pipeline_audit',
    Item={
        'pipeline_id':  {'S': 'sales-etl-daily'},
        'run_id':       {'S': 'run-2024-01-15-001'},
        'status':       {'S': 'SUCCEEDED'},
        'start_time':   {'S': '2024-01-15T06:00:00Z'},
        'end_time':     {'S': '2024-01-15T06:45:23Z'},
        'rows_read':    {'N': '1482930'},     # N is ALWAYS a string
        'rows_written': {'N': '1482930'},
        'rows_rejected':{'N': '0'},
        'dq_score':     {'N': '99.8'},
        'error_message':{'NULL': True}      # NULL type for no error
    }
)

# Prevent overwrite — only write if this run_id does NOT already exist
try:
    dynamodb.put_item(
        TableName='pipeline_audit',
        Item={'pipeline_id': {'S': 'sales-etl-daily'}, 'run_id': {'S': 'run-001'}},
        ConditionExpression='attribute_not_exists(run_id)'
    )
except dynamodb.exceptions.ConditionalCheckFailedException:
    print("Run ID already exists — skipping (idempotent write)")

⚠️ Numbers are Strings

DynamoDB's N type requires the value to be a string — e.g. {'N': '1482930'} not {'N': 1482930}. This is a very common bug. When reading back, the number is also returned as a string and you must cast it: int(item['rows_read']['N']).
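Hand-writing the type tags is where most of these bugs come from, so pipelines often convert plain Python values in one place. The sketch below is a small stdlib-only converter that covers the types listed in this section (S, N, BOOL, NULL, L, M). The names to_attr and from_attr are hypothetical, and sets and binary values are deliberately left out. boto3 provides its own TypeSerializer and TypeDeserializer in boto3.dynamodb.types for complete coverage.

```python
from decimal import Decimal
from typing import Any, Dict


def to_attr(value: Any) -> Dict[str, Any]:
    """Convert a plain Python value to DynamoDB's typed representation."""
    if value is None:
        return {'NULL': True}
    if isinstance(value, bool):      # checked before int: bool is an int subclass
        return {'BOOL': value}
    if isinstance(value, (int, float, Decimal)):
        return {'N': str(value)}     # numbers always travel as strings
    if isinstance(value, str):
        return {'S': value}
    if isinstance(value, list):
        return {'L': [to_attr(v) for v in value]}
    if isinstance(value, dict):
        return {'M': {k: to_attr(v) for k, v in value.items()}}
    raise TypeError(f"Unsupported type: {type(value).__name__}")


def from_attr(attr: Dict[str, Any]) -> Any:
    """Convert one typed DynamoDB value back to a plain Python value."""
    type_tag, raw = next(iter(attr.items()))
    if type_tag == 'NULL':
        return None
    if type_tag in ('S', 'BOOL'):
        return raw
    if type_tag == 'N':
        return Decimal(raw)          # Decimal avoids float rounding surprises
    if type_tag == 'L':
        return [from_attr(v) for v in raw]
    if type_tag == 'M':
        return {k: from_attr(v) for k, v in raw.items()}
    raise TypeError(f"Unsupported DynamoDB type tag: {type_tag}")
```

An item for put_item() can then be built as {name: to_attr(value) for name, value in record.items()}.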

get_item() — point read by key

get_item() fetches a single item by its exact partition key (and sort key if the table has one). This is a point read — DynamoDB routes directly to the partition that holds this key and returns it in single-digit milliseconds. There is no table scan involved.

python

# Read a specific pipeline run record
response = dynamodb.get_item(
    TableName='pipeline_audit',
    Key={
        'pipeline_id': {'S': 'sales-etl-daily'},
        'run_id':       {'S': 'run-2024-01-15-001'}
    }
)

# 'Item' key is absent if the item doesn't exist — always check!
if 'Item' in response:
    item = response['Item']
    print(f"Status : {item['status']['S']}")
    print(f"Rows   : {int(item['rows_written']['N'])}")
    print(f"DQ     : {float(item['dq_score']['N'])}")
else:
    print("Item not found")


# Use ProjectionExpression to fetch only specific attributes. This shrinks the
# response, but consumed RCUs are still based on the full item size.
response = dynamodb.get_item(
    TableName='pipeline_audit',
    Key={
        'pipeline_id': {'S': 'sales-etl-daily'},
        'run_id':       {'S': 'run-2024-01-15-001'}
    },
    ProjectionExpression='#s, rows_written',      # only these fields
    ExpressionAttributeNames={'#s': 'status'}       # 'status' is a reserved word
)

💡 Key Point

DynamoDB has reserved words (like status, name, size) that conflict with expression syntax. Always use ExpressionAttributeNames with a # placeholder when your attribute name matches a reserved word.

update_item() — partial attribute update

update_item() modifies specific attributes of an existing item without replacing the whole item. This is the right choice when your pipeline wants to update just the status and end_time without losing other fields already written. Uses UpdateExpression with clauses: SET (add/update), REMOVE (delete attribute), ADD (increment numbers, add to sets), DELETE (remove from sets).

python

# Pattern: pipeline starts → write RUNNING status
#           pipeline ends  → update to SUCCEEDED/FAILED + end_time + counts

# Step 1: mark job as RUNNING at startup
dynamodb.update_item(
    TableName='pipeline_audit',
    Key={
        'pipeline_id': {'S': 'sales-etl-daily'},
        'run_id':       {'S': 'run-2024-01-15-002'}
    },
    UpdateExpression='SET #s = :s, start_time = :t',
    ExpressionAttributeNames={'#s': 'status'},
    ExpressionAttributeValues={
        ':s': {'S': 'RUNNING'},
        ':t': {'S': datetime.now(timezone.utc).isoformat()}
    }
)

# Step 2: mark job as SUCCEEDED at completion
dynamodb.update_item(
    TableName='pipeline_audit',
    Key={
        'pipeline_id': {'S': 'sales-etl-daily'},
        'run_id':       {'S': 'run-2024-01-15-002'}
    },
    UpdateExpression='SET #s = :s, end_time = :e, rows_written = :r, dq_score = :d',
    ExpressionAttributeNames={'#s': 'status'},
    ExpressionAttributeValues={
        ':s': {'S': 'SUCCEEDED'},
        ':e': {'S': datetime.now(timezone.utc).isoformat()},
        ':r': {'N': '2341000'},
        ':d': {'N': '99.9'}
    }
)

# ADD clause — atomically increment a counter (no race condition)
dynamodb.update_item(
    TableName='pipeline_audit',
    Key={'pipeline_id': {'S': 'sales-etl-daily'}, 'run_id': {'S': 'run-002'}},
    UpdateExpression='ADD retry_count :one',
    ExpressionAttributeValues={':one': {'N': '1'}}
)

# REMOVE clause — delete an attribute from the item
dynamodb.update_item(
    TableName='pipeline_audit',
    Key={'pipeline_id': {'S': 'sales-etl-daily'}, 'run_id': {'S': 'run-002'}},
    UpdateExpression='REMOVE temp_debug_payload'   # delete a field when done debugging
)

# ConditionExpression — optimistic locking: only update if still RUNNING
try:
    dynamodb.update_item(
        TableName='pipeline_audit',
        Key={'pipeline_id': {'S': 'sales-etl-daily'}, 'run_id': {'S': 'run-002'}},
        UpdateExpression='SET #s = :failed',
        ConditionExpression='#s = :running',      # only if currently RUNNING
        ExpressionAttributeNames={'#s': 'status'},
        ExpressionAttributeValues={':failed': {'S': 'FAILED'}, ':running': {'S': 'RUNNING'}}
    )
except dynamodb.exceptions.ConditionalCheckFailedException:
    print("Item was already updated by another process")
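Writing UpdateExpression strings by hand gets repetitive and makes it easy to forget a reserved-word placeholder. One option is to generate the three arguments from a dict and give every attribute name a # placeholder, so reserved words never need special handling. This is an illustrative sketch: the function name build_set_expression and the #n0 / :v0 naming scheme are assumptions.

```python
from typing import Any, Dict, Tuple


def build_set_expression(
    fields: Dict[str, Dict[str, Any]]
) -> Tuple[str, Dict[str, str], Dict[str, Dict[str, Any]]]:
    """Build UpdateExpression, ExpressionAttributeNames and
    ExpressionAttributeValues for a SET of already-typed values."""
    if not fields:
        raise ValueError("At least one field is required")
    names: Dict[str, str] = {}
    values: Dict[str, Dict[str, Any]] = {}
    clauses = []
    for index, (attr_name, typed_value) in enumerate(fields.items()):
        name_key, value_key = f'#n{index}', f':v{index}'
        names[name_key] = attr_name      # placeholder sidesteps reserved words
        values[value_key] = typed_value
        clauses.append(f'{name_key} = {value_key}')
    return 'SET ' + ', '.join(clauses), names, values


expression, attr_names, attr_values = build_set_expression({
    'status': {'S': 'SUCCEEDED'},
    'rows_written': {'N': '2341000'},
})
```

The three results map directly onto the UpdateExpression, ExpressionAttributeNames, and ExpressionAttributeValues arguments of update_item().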

delete_item()

Deletes a single item by its key. Optionally use ConditionExpression to only delete if a condition is met — for example, only delete a watermark record if the pipeline is marked as inactive.

python

# Simple delete
dynamodb.delete_item(
    TableName='pipeline_watermark',
    Key={'pipeline_id': {'S': 'deprecated-pipeline'}}
)

# Conditional delete — only if the pipeline is marked inactive
try:
    dynamodb.delete_item(
        TableName='pipeline_watermark',
        Key={'pipeline_id': {'S': 'sales-etl-daily'}},
        ConditionExpression='is_active = :false',
        ExpressionAttributeValues={':false': {'BOOL': False}}
    )
except dynamodb.exceptions.ConditionalCheckFailedException:
    print("Pipeline is still active — not deleting watermark")

📦

Batch Operations — batch_write_item() & batch_get_item() BULK

batch_write_item() — bulk audit log writes (up to 25 items)

batch_write_item() lets you write or delete up to 25 items per call across one or more tables in a single round trip. This is the standard pattern for writing pipeline audit records in bulk — e.g. writing one record per Spark partition processed. Important: batch writes do not support ConditionExpression, and any unprocessed items must be retried manually.

📦 Analogy

batch_write_item() is like mailing a bundle of 25 letters at once instead of one at a time. The post office (DynamoDB) delivers all 25 in one trip but may say "I couldn't deliver 3 of these today" (UnprocessedItems). You're responsible for resending those 3.

python

import time

def batch_write_audit_records(records):
    """Write a list of audit record dicts to DynamoDB in batches of 25."""
    chunk_size = 25
    for i in range(0, len(records), chunk_size):
        chunk = records[i:i + chunk_size]
        request_items = {
            'pipeline_audit': [
                {'PutRequest': {'Item': record}}
                for record in chunk
            ]
        }
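        # Retry unprocessed items with exponential backoff
        delay = 0.5
        for attempt in range(5):
            response = dynamodb.batch_write_item(RequestItems=request_items)
            unprocessed = response.get('UnprocessedItems', {})
            if not unprocessed:
                break
            print(f"Retrying {len(unprocessed['pipeline_audit'])} unprocessed items...")
            request_items = unprocessed
            time.sleep(delay)
            delay *= 2
        else:
            # Every attempt left items behind: fail loudly instead of silently dropping writes
            raise RuntimeError(
                f"{len(unprocessed['pipeline_audit'])} audit records still unprocessed after 5 attempts"
            )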


# Example: write audit records for 3 Spark partitions
audit_records = [
    {
        'pipeline_id':  {'S': 'sales-etl-daily'},
        'run_id':       {'S': f'run-2024-01-15-part-{p}'},
        'status':       {'S': 'SUCCEEDED'},
        'rows_written': {'N': str(p * 50000)}
    }
    for p in range(3)
]
batch_write_audit_records(audit_records)

⚠️ Always Handle UnprocessedItems

DynamoDB can return items in UnprocessedItems due to throttling or hot partitions, even if your table is in on-demand mode. Always check this field and retry with backoff. Ignoring it silently drops writes.

batch_get_item() — bulk point reads (up to 100 items)

Fetches up to 100 items by their keys in a single call. Useful for looking up pipeline configs, watermarks, or status records for a set of pipeline IDs at once. Like batch_write_item(), it can return UnprocessedKeys that must be retried.

python

# Fetch watermark records for multiple pipelines at once
pipeline_ids = ['sales-etl-daily', 'inventory-etl', 'returns-etl']

response = dynamodb.batch_get_item(
    RequestItems={
        'pipeline_watermark': {
            'Keys': [
                {'pipeline_id': {'S': pid}}
                for pid in pipeline_ids
            ],
            'ProjectionExpression': 'pipeline_id, last_watermark_value'
        }
    }
)

for item in response['Responses']['pipeline_watermark']:
    pid = item['pipeline_id']['S']
    wm  = item['last_watermark_value']['S']
    print(f"{pid} → last watermark: {wm}")

# Always handle UnprocessedKeys
if response.get('UnprocessedKeys'):
    print("Warning: some keys were not processed — retry!")
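The warning above only reports the problem. A minimal sketch of an actual retry loop follows, assuming `client` is a `boto3.client('dynamodb')` object passed in by the caller; the helper name `batch_get_all` and its parameters are hypothetical. It relies on the fact that `UnprocessedKeys` comes back in the same shape as `RequestItems`, so it can be resubmitted as-is.

```python
import time


def batch_get_all(client, table, keys, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Fetch every key via batch_get_item, retrying UnprocessedKeys with backoff.

    client: assumed to be boto3.client('dynamodb') (any object with batch_get_item works)
    keys:   list of keys in DynamoDB typed format, e.g. {'pipeline_id': {'S': 'sales-etl-daily'}}
    """
    items = []
    for start in range(0, len(keys), 100):            # at most 100 keys per call
        request = {table: {'Keys': keys[start:start + 100]}}
        delay = base_delay
        for _ in range(max_attempts):
            response = client.batch_get_item(RequestItems=request)
            items.extend(response.get('Responses', {}).get(table, []))
            # UnprocessedKeys has the same shape as RequestItems, so resubmit it directly
            request = response.get('UnprocessedKeys') or {}
            if not request:
                break
            sleep(delay)
            delay *= 2
        else:
            remaining = len(request[table]['Keys'])
            raise RuntimeError(f"{remaining} keys still unprocessed after {max_attempts} attempts")
    return items
```

Typical use would be `batch_get_all(dynamodb, 'pipeline_watermark', [{'pipeline_id': {'S': pid}} for pid in pipeline_ids])`. The `sleep` argument exists only so the backoff can be stubbed out in tests.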

🔍

Query and Scan — query(), scan() with paginators

query() — by partition key (the right way to read)

query() fetches all items that share the same partition key, with optional filtering on the sort key. This is efficient — DynamoDB goes directly to that partition and returns results. Use this to get all runs for a specific pipeline, filtered by date range on the run_id sort key.

📦 Analogy

query() is like walking to a specific drawer in the filing cabinet (partition key = sales-etl-daily) and reading all the folders inside it, optionally filtered by date. You go directly to the right drawer — no rummaging through the whole cabinet.

python

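# The low-level client takes expression strings plus ExpressionAttributeValues;
# the Key/Attr helpers in boto3.dynamodb.conditions apply only to the resource (Table) API.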

# Get all runs for sales-etl-daily in January 2024
paginator = dynamodb.get_paginator('query')

pages = paginator.paginate(
    TableName='pipeline_audit',
    KeyConditionExpression='pipeline_id = :pid AND begins_with(run_id, :prefix)',
    ExpressionAttributeValues={
        ':pid':    {'S': 'sales-etl-daily'},
        ':prefix': {'S': 'run-2024-01'}     # sort key prefix filter
    },
    ScanIndexForward=False    # descending order = newest run first
)

all_runs = []
for page in pages:
    all_runs.extend(page['Items'])

print(f"Found {len(all_runs)} runs in January 2024")


# FilterExpression — post-query filter (applied AFTER fetching from partition)
# Use this for non-key attributes like status, but note it does NOT reduce RCU cost
pages = paginator.paginate(
    TableName='pipeline_audit',
    KeyConditionExpression='pipeline_id = :pid',
    FilterExpression='#s = :failed',     # filter to only FAILED runs
    ExpressionAttributeNames={'#s': 'status'},
    ExpressionAttributeValues={
        ':pid':    {'S': 'sales-etl-daily'},
        ':failed': {'S': 'FAILED'}
    }
)

failed_runs = [item for page in pages for item in page['Items']]
print(f"Failed runs: {len(failed_runs)}")

💡 Key Point

KeyConditionExpression operates on keys and is evaluated by DynamoDB before fetching — it reduces how much data is read. FilterExpression is applied after fetching, so it doesn't reduce read cost (RCUs) — only the result set size returned to your code.
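One way to see this in practice is to compare `ScannedCount` (items read from the partition) with `Count` (items that survived the filter) on each query page. The sketch below assumes the pages come from a query paginator called with `ReturnConsumedCapacity='TOTAL'`; the helper name `query_efficiency` is hypothetical.

```python
def query_efficiency(pages):
    """Summarise how much a FilterExpression discarded across query pages.

    Each page is a query() response dict. ConsumedCapacity is only present
    when the query was made with ReturnConsumedCapacity='TOTAL'.
    """
    scanned = 0
    returned = 0
    capacity = 0.0
    for page in pages:
        scanned += page.get('ScannedCount', 0)     # items read before filtering
        returned += page.get('Count', 0)           # items left after filtering
        capacity += page.get('ConsumedCapacity', {}).get('CapacityUnits', 0.0)
    return {
        'scanned': scanned,
        'returned': returned,
        'discarded': scanned - returned,
        'capacity_units': capacity,
    }
```

A large `discarded` value relative to `returned` suggests the filtered attribute belongs in the sort key or in a secondary index rather than in a FilterExpression.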

scan() — full table read (use sparingly)

scan() reads every item in the table. It is expensive (every item costs a read), slow for large tables, and should be avoided in production hot paths. Acceptable use cases: one-time admin scripts, small config tables (<100 items), or tables where you genuinely need all items.

⚠️ Avoid scan() on Large Tables

A scan() on a table with 1 million items reads all 1 million items and charges you for every single one. On a pipeline audit table that grows daily, this gets expensive fast. Always design your access patterns around query() using partition + sort key.

python

# Scan is OK for small config/control tables — always use paginator
paginator = dynamodb.get_paginator('scan')

pages = paginator.paginate(
    TableName='pipeline_config',     # small table: ~20 pipeline configs
    FilterExpression='is_active = :true',
    ExpressionAttributeValues={':true': {'BOOL': True}}
)

active_pipelines = [item for page in pages for item in page['Items']]
print(f"Active pipelines: {[p['pipeline_id']['S'] for p in active_pipelines]}")

# Parallel scan — for large tables that absolutely must be scanned
# Split into N segments and scan each in a separate thread
import concurrent.futures

TOTAL_SEGMENTS = 4

def scan_segment(segment):
    items = []
    paginator = dynamodb.get_paginator('scan')
    for page in paginator.paginate(
        TableName='pipeline_audit',
        Segment=segment,
        TotalSegments=TOTAL_SEGMENTS
    ):
        items.extend(page['Items'])
    return items

with concurrent.futures.ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as ex:
    results = list(ex.map(scan_segment, range(TOTAL_SEGMENTS)))
all_items = [item for segment in results for item in segment]

📋

Pipeline Audit Pattern — the complete DE use case

DynamoDB as Pipeline State Store

The most important DynamoDB use case for Data Engineers is storing pipeline run state: one record per job run with start time, end time, status, row counts, DQ score, and error messages. This is your audit trail, your SLA monitoring source, and your retry/recovery control plane — all in one place.

☁️ Why DynamoDB (not RDS)?

Your pipeline code may run on EMR, Glue, Lambda, or EC2 across different regions and VPCs. DynamoDB is reached over a regional HTTPS endpoint via boto3, so there is no VPC peering, no connection pooling, and no idle connections to manage. Jobs in private subnets without internet egress still need a NAT gateway or a DynamoDB gateway VPC endpoint. Apart from that, it's the easiest metadata store to access from any pipeline runtime.

python — complete pipeline audit helper class

import boto3, uuid
from datetime import datetime, timezone

class PipelineAudit:
    """Write pipeline run records to DynamoDB. Used at start and end of every job."""

    TABLE = 'pipeline_audit'

    def __init__(self):
        self.ddb = boto3.client('dynamodb', region_name='us-east-1')

    def start_run(self, pipeline_id: str) -> str:
        """Call at job start. Returns the run_id to pass to end_run()."""
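        # Dashed date keeps run_id compatible with prefix queries such as begins_with(run_id, 'run-2024-01')
        run_id = f"run-{datetime.now(timezone.utc).strftime('%Y-%m-%d-%H%M%S')}-{str(uuid.uuid4())[:8]}"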
        self.ddb.put_item(
            TableName=self.TABLE,
            Item={
                'pipeline_id': {'S': pipeline_id},
                'run_id':       {'S': run_id},
                'status':       {'S': 'RUNNING'},
                'start_time':   {'S': datetime.now(timezone.utc).isoformat()},
                'end_time':     {'NULL': True},
                'rows_read':    {'N': '0'},
                'rows_written': {'N': '0'},
                'rows_rejected':{'N': '0'},
                'dq_score':     {'N': '0'},
                'error_message':{'NULL': True}
            }
        )
        return run_id

    def end_run(self, pipeline_id: str, run_id: str, status: str,
                  rows_read=0, rows_written=0, rows_rejected=0,
                  dq_score=100.0, error_message=None):
        """Call at job end with final counts and status."""
        item_updates = {
            'status':        {'S': status},
            'end_time':      {'S': datetime.now(timezone.utc).isoformat()},
            'rows_read':     {'N': str(rows_read)},
            'rows_written':  {'N': str(rows_written)},
            'rows_rejected': {'N': str(rows_rejected)},
            'dq_score':      {'N': str(dq_score)},
            'error_message': {'S': error_message} if error_message else {'NULL': True}
        }
        update_expr = 'SET ' + ', '.join(
            f"#attr_{i} = :val_{i}" for i in range(len(item_updates))
        )
        attr_names  = {f"#attr_{i}": k for i, k in enumerate(item_updates)}
        attr_values = {f":val_{i}": v for i, v in enumerate(item_updates.values())}
        self.ddb.update_item(
            TableName=self.TABLE,
            Key={'pipeline_id': {'S': pipeline_id}, 'run_id': {'S': run_id}},
            UpdateExpression=update_expr,
            ExpressionAttributeNames=attr_names,
            ExpressionAttributeValues=attr_values
        )

    def get_last_run(self, pipeline_id: str) -> dict:
        """Get the most recent run record for a pipeline."""
        response = self.ddb.query(
            TableName=self.TABLE,
            KeyConditionExpression='pipeline_id = :pid',
            ExpressionAttributeValues={':pid': {'S': pipeline_id}},
            ScanIndexForward=False,   # newest first
            Limit=1
        )
        items = response.get('Items', [])
        return items[0] if items else None


# ── Usage in a Glue / EMR PySpark job ──────────────────────────
audit = PipelineAudit()
run_id = audit.start_run("sales-etl-daily")

try:
    # ... your Spark transform logic here ...
    rows_read, rows_written = 1482930, 1482930
    audit.end_run("sales-etl-daily", run_id, "SUCCEEDED",
                  rows_read=rows_read, rows_written=rows_written, dq_score=99.8)
except Exception as e:
    audit.end_run("sales-etl-daily", run_id, "FAILED", error_message=str(e))
    raise

Watermark Table Pattern

For incremental pipelines, DynamoDB is the perfect place to store the high watermark — the last processed timestamp or ID. Each pipeline run reads the watermark, processes everything since that point, then updates the watermark to the new maximum. This is the classic stateful incremental ETL pattern.

python — watermark read/write pattern

def get_watermark(pipeline_id: str, default: str = "1970-01-01T00:00:00Z") -> str:
    """Read last watermark. Return default if first run."""
    response = dynamodb.get_item(
        TableName='pipeline_watermark',
        Key={'pipeline_id': {'S': pipeline_id}}
    )
    if 'Item' in response:
        return response['Item']['last_watermark_value']['S']
    return default   # first ever run


def update_watermark(pipeline_id: str, new_watermark: str):
    """Save new watermark after a successful run."""
    dynamodb.put_item(
        TableName='pipeline_watermark',
        Item={
            'pipeline_id':          {'S': pipeline_id},
            'last_watermark_value':  {'S': new_watermark},
            'updated_timestamp':     {'S': datetime.now(timezone.utc).isoformat()}
        }
    )


# In your Spark job:
watermark = get_watermark("sales-etl-daily")
print(f"Processing records AFTER: {watermark}")

# df = spark.read.jdbc(...).filter(f"updated_at > '{watermark}'")
# new_max_ts = df.agg({"updated_at": "max"}).collect()[0][0]
new_max_ts = "2024-01-15T06:00:00Z"   # from your Spark result

# Only update watermark after job succeeds
update_watermark("sales-etl-daily", new_max_ts)
print(f"Watermark updated to: {new_max_ts}")

✅ Real-World Example

A production sales ETL runs nightly at 2AM. It calls get_watermark() → gets 2024-01-14T23:59:59Z → reads only the new rows from the source DB → processes 50,000 new records → calls update_watermark() with 2024-01-15T23:59:59Z. Next night, it only processes the delta again. Without this pattern, it would full-scan millions of rows every night.
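If two runs of the same pipeline overlap, a plain put_item() can move the watermark backwards. A hedged sketch of a forward-only update follows, using a ConditionExpression on update_item(). It assumes `client` is a `boto3.client('dynamodb')` object and that all watermarks are ISO-8601 strings in the same format and timezone, so string comparison matches time order. The helper name `advance_watermark` is hypothetical.

```python
def advance_watermark(client, pipeline_id, new_watermark, table='pipeline_watermark'):
    """Move the watermark forward only.

    Returns True if the stored value was updated, False if the table already
    holds a watermark that is equal to or later than new_watermark.
    """
    try:
        client.update_item(
            TableName=table,
            Key={'pipeline_id': {'S': pipeline_id}},
            UpdateExpression='SET last_watermark_value = :new',
            # Write only on first run or when the new value is strictly later
            ConditionExpression=(
                'attribute_not_exists(last_watermark_value) '
                'OR last_watermark_value < :new'
            ),
            ExpressionAttributeValues={':new': {'S': new_watermark}},
        )
        return True
    except client.exceptions.ConditionalCheckFailedException:
        return False
```

A caller could log a warning when the function returns False, since that usually means a newer run has already committed its watermark.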

29.30.14 — BOTO3 DEEP DIVE

CloudWatch APIs

Amazon CloudWatch is the observability backbone of every AWS data pipeline. As a Data Engineer you use it to publish custom pipeline metrics, create alarms that page your team on failures, query logs with SQL-like syntax, and build dashboards that show pipeline health at a glance. Mastering these APIs turns your pipelines from black boxes into fully observable systems.

📊

Metrics — put_metric_data(), get_metric_statistics(), get_metric_data()

put_metric_data() — Publish Custom Pipeline Metrics

put_metric_data() lets you push custom metrics from your pipeline code into CloudWatch. Built-in AWS services (Glue, EMR, Lambda) publish their own metrics automatically, but things like rows processed, DQ failure count, or pipeline duration are business metrics that only your code knows. You publish these to a custom Namespace you define.

📦 Analogy

Think of CloudWatch Metrics like a fitness tracker. AWS services automatically record their "heartbeat", but your pipeline recording "I processed 5 million rows" is like you manually logging a workout. put_metric_data() is that manual log entry.

python — put_metric_data() basics

import boto3
from datetime import datetime, timezone

cw = boto3.client('cloudwatch', region_name='us-east-1')

# ── Publish a single metric: rows processed by this pipeline run ──
cw.put_metric_data(
    Namespace='DataPlatform/Pipelines',   # your custom namespace — use /  to organise
    MetricData=[
        {
            'MetricName': 'RowsProcessed',
            'Value'     : 5_432_100,
            'Unit'      : 'Count',
            'Timestamp' : datetime.now(timezone.utc),
            'Dimensions': [
                {'Name': 'PipelineName', 'Value': 'orders-bronze-to-silver'},
                {'Name': 'Environment',  'Value': 'prod'}
            ]
        }
    ]
)
print("Metric published")

python — publishing multiple metrics in one call

import boto3, time
from datetime import datetime, timezone

cw = boto3.client('cloudwatch', region_name='us-east-1')

pipeline_name = 'orders-bronze-to-silver'
start_time    = time.time()

# ... your ETL logic runs here ...
rows_read     = 5_432_100
rows_written  = 5_431_990
rows_rejected = 110
dq_score      = 99.8
duration_secs = time.time() - start_time

# ── Publish all pipeline KPIs in one API call ──
ts = datetime.now(timezone.utc)
dims = [{'Name': 'PipelineName', 'Value': pipeline_name},
        {'Name': 'Environment',  'Value': 'prod'}]

cw.put_metric_data(
    Namespace='DataPlatform/Pipelines',
    MetricData=[
        {'MetricName': 'RowsRead',         'Value': rows_read,     'Unit': 'Count',   'Timestamp': ts, 'Dimensions': dims},
        {'MetricName': 'RowsWritten',      'Value': rows_written,  'Unit': 'Count',   'Timestamp': ts, 'Dimensions': dims},
        {'MetricName': 'RowsRejected',    'Value': rows_rejected, 'Unit': 'Count',   'Timestamp': ts, 'Dimensions': dims},
        {'MetricName': 'DQScore',         'Value': dq_score,      'Unit': 'Percent', 'Timestamp': ts, 'Dimensions': dims},
        {'MetricName': 'DurationSeconds', 'Value': duration_secs, 'Unit': 'Seconds', 'Timestamp': ts, 'Dimensions': dims},
    ]
)
print("All pipeline KPIs published to CloudWatch")

💡 Units Matter

Always set the correct Unit. CloudWatch units include Count, Seconds, Milliseconds, Bytes, Megabytes, Percent, and None, among others. A correct unit keeps thresholds unambiguous: an alarm on DurationSeconds > 3600 clearly means one hour.

⚠️ Limit: 1,000 metrics per call

put_metric_data() accepts a maximum of 1,000 MetricData entries per API call (older versions of the API capped this at 20), and the request payload is limited to 1 MB. If you have more, split them into batches and call the API multiple times.
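Splitting into batches can be sketched as below. The helper name `publish_in_batches` is hypothetical, `cw` is assumed to be a `boto3.client('cloudwatch')` object, and `batch_size` is left as a required argument so it can follow whatever per-request limit applies to you.

```python
def publish_in_batches(cw, namespace, metric_data, batch_size):
    """Send metric_data to CloudWatch in slices of at most batch_size entries.

    Returns the number of put_metric_data calls made.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    calls = 0
    for start in range(0, len(metric_data), batch_size):
        cw.put_metric_data(
            Namespace=namespace,
            MetricData=metric_data[start:start + batch_size],
        )
        calls += 1
    return calls
```

For example, `publish_in_batches(cw, 'DataPlatform/Pipelines', all_metrics, batch_size=20)` would send 45 metrics in three calls.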

get_metric_statistics() — Query a Single Metric Over Time

get_metric_statistics() retrieves aggregated data points for a single metric over a time range. Useful for checking "how many rows did this pipeline process in the last hour?" or "what was the average duration over the last 7 days?". You get back a list of timestamped data points.

python — get_metric_statistics()

import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client('cloudwatch', region_name='us-east-1')

# ── Get RowsProcessed for the last 24 hours, in 1-hour buckets ──
now       = datetime.now(timezone.utc)
one_day_ago = now - timedelta(hours=24)

response = cw.get_metric_statistics(
    Namespace  ='DataPlatform/Pipelines',
    MetricName ='RowsProcessed',
    Dimensions =[
        {'Name': 'PipelineName', 'Value': 'orders-bronze-to-silver'},
        {'Name': 'Environment',  'Value': 'prod'}
    ],
    StartTime  = one_day_ago,
    EndTime    = now,
    Period     = 3600,           # 1 hour buckets in seconds
    Statistics = ['Sum', 'Maximum', 'Average'],
    Unit       = 'Count'
)

# Sort by time and print
datapoints = sorted(response['Datapoints'], key=lambda x: x['Timestamp'])
for dp in datapoints:
    print(f"{dp['Timestamp']:%Y-%m-%d %H:%M}  sum={dp['Sum']:,.0f}  max={dp['Maximum']:,.0f}")

💡 Period Must Align

For standard-resolution metrics, Period must be a multiple of 60 seconds. Common values: 60 (1 min), 300 (5 min), 3600 (1 hour), 86400 (1 day). The total time range divided by Period gives the maximum number of data points returned; buckets with no data are omitted.
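The arithmetic matters because a single get_metric_statistics() call returns at most 1,440 data points. The sketch below works out how many buckets a time range produces and picks the smallest 60-second-aligned period that stays under that cap. The function names are hypothetical.

```python
import math
from datetime import timedelta

MAX_DATAPOINTS = 1440   # cap on data points from one get_metric_statistics call


def expected_datapoints(time_range: timedelta, period_seconds: int) -> int:
    """Upper bound on data points: one per Period-sized bucket in the range."""
    return math.ceil(time_range.total_seconds() / period_seconds)


def smallest_valid_period(time_range: timedelta, base_period: int = 60) -> int:
    """Smallest multiple of base_period that keeps the call under MAX_DATAPOINTS."""
    period = base_period
    while expected_datapoints(time_range, period) > MAX_DATAPOINTS:
        period += base_period
    return period
```

A 24-hour range with Period=3600 gives 24 buckets, which is what the example above requests. A 7-day range at Period=60 would need 10,080 buckets, so the period has to grow before the call is accepted.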

get_metric_data() — Batch Query Multiple Metrics at Once

get_metric_data() is the more powerful successor to get_metric_statistics(). It lets you query multiple metrics simultaneously in one API call, supports metric math expressions (e.g. calculate rejection rate = rejected / read * 100), and is paginated for large result sets.

python — get_metric_data() with metric math

import boto3
from datetime import datetime, timedelta, timezone

cw  = boto3.client('cloudwatch', region_name='us-east-1')
now = datetime.now(timezone.utc)

dims = [
    {'Name': 'PipelineName', 'Value': 'orders-bronze-to-silver'},
    {'Name': 'Environment',  'Value': 'prod'}
]

response = cw.get_metric_data(
    MetricDataQueries=[
        # ── Raw metric: rows read ──
        {
            'Id'    : 'm1',
            'Label' : 'Rows Read',
            'MetricStat': {
                'Metric': {'Namespace': 'DataPlatform/Pipelines', 'MetricName': 'RowsRead',      'Dimensions': dims},
                'Period': 3600,
                'Stat'  : 'Sum'
            }
        },
        # ── Raw metric: rows rejected ──
        {
            'Id'    : 'm2',
            'Label' : 'Rows Rejected',
            'MetricStat': {
                'Metric': {'Namespace': 'DataPlatform/Pipelines', 'MetricName': 'RowsRejected', 'Dimensions': dims},
                'Period': 3600,
                'Stat'  : 'Sum'
            }
        },
        # ── Metric math: rejection rate % = (rejected / read) * 100 ──
        {
            'Id'         : 'rejection_rate',
            'Label'      : 'Rejection Rate %',
            'Expression' : '(m2 / m1) * 100',   # metric math — references m1 and m2 above
            'ReturnData' : True
        },
    ],
    StartTime = now - timedelta(hours=24),
    EndTime   = now,
)

for result in response['MetricDataResults']:
    print(f"\n{result['Label']} ({result['Id']})")
    for ts, val in zip(result['Timestamps'], result['Values']):
        print(f"  {ts:%H:%M}  →  {val:,.2f}")

💡 Metric Math

Metric math lets you compute derived metrics (ratios, sums across pipelines, percentages) without storing them separately. Use Expression referencing other query IDs. Set 'ReturnData': False on raw metrics you only use for math (they won't appear in results).
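For long time ranges, get_metric_data() may split one result across several pages. A minimal sketch for stitching those pages back together, keyed by query Id, is shown below. The helper name `merge_metric_pages` is hypothetical, and `pages` is assumed to come from `cw.get_paginator('get_metric_data').paginate(...)` called with the same arguments as above.

```python
from collections import defaultdict


def merge_metric_pages(pages):
    """Combine MetricDataResults from several pages.

    Returns {query_id: [(timestamp, value), ...]} with points sorted by time.
    """
    merged = defaultdict(list)
    for page in pages:
        for result in page['MetricDataResults']:
            # Timestamps and Values are parallel lists in each result
            merged[result['Id']].extend(zip(result['Timestamps'], result['Values']))
    return {query_id: sorted(points) for query_id, points in merged.items()}
```

The same series can then be read with `merged['rejection_rate']` regardless of how many pages the service returned.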

🔔

Alarms — put_metric_alarm(), describe_alarms(), delete_alarms()

put_metric_alarm() — Create or Update an Alarm

CloudWatch Alarms watch a metric and trigger an action (SNS notification → email / Slack / PagerDuty) when the metric crosses a threshold. As a DE you create alarms for things like: Glue job duration exceeded, DLQ depth > 0, EMR step failed, or custom pipeline DQ score dropped below 95%.

📦 Analogy

An alarm is like a smoke detector. You set a threshold ("if smoke particle count exceeds X") and an action ("ring the bell and call the fire station"). You don't watch the sensor yourself — it watches automatically and calls you only when something is wrong.

python — put_metric_alarm() for pipeline failures

import boto3

cw = boto3.client('cloudwatch', region_name='us-east-1')

SNS_ALERT_ARN = 'arn:aws:sns:us-east-1:123456789012:data-platform-alerts'

# ── Alarm 1: Alert if more than 1,000 rows are rejected within an hour ──
cw.put_metric_alarm(
    AlarmName          = 'orders-pipeline-high-rejection-rate',
    AlarmDescription   = 'More than 1,000 rows rejected in one hour: investigate data quality',
    Namespace          = 'DataPlatform/Pipelines',
    MetricName         = 'RowsRejected',
    Dimensions         = [
        {'Name': 'PipelineName', 'Value': 'orders-bronze-to-silver'},
        {'Name': 'Environment',  'Value': 'prod'}
    ],
    Statistic          = 'Sum',
    Period             = 3600,              # evaluate over last 1 hour
    EvaluationPeriods  = 1,                # 1 period must breach → alarm
    Threshold          = 1000,             # alert if > 1000 rows rejected in the hour
    ComparisonOperator = 'GreaterThanThreshold',
    TreatMissingData   = 'notBreaching',   # no data = pipeline didn't run = OK
    AlarmActions       = [SNS_ALERT_ARN],
    OKActions          = [SNS_ALERT_ARN],    # notify when alarm recovers too
)

# ── Alarm 2: Alert if pipeline takes longer than 2 hours ──
cw.put_metric_alarm(
    AlarmName          = 'orders-pipeline-duration-breach',
    AlarmDescription   = 'Pipeline exceeded 2-hour SLA',
    Namespace          = 'DataPlatform/Pipelines',
    MetricName         = 'DurationSeconds',
    Dimensions         = [
        {'Name': 'PipelineName', 'Value': 'orders-bronze-to-silver'},
        {'Name': 'Environment',  'Value': 'prod'}
    ],
    Statistic          = 'Maximum',
    Period             = 3600,
    EvaluationPeriods  = 1,
    Threshold          = 7200,             # 2 hours in seconds
    ComparisonOperator = 'GreaterThanThreshold',
    TreatMissingData   = 'notBreaching',
    AlarmActions       = [SNS_ALERT_ARN],
)

# ── Alarm 3: DLQ depth > 0 — something failed and landed in DLQ ──
cw.put_metric_alarm(
    AlarmName          = 'pipeline-dlq-has-messages',
    AlarmDescription   = 'Messages in DLQ — pipeline messages failed processing',
    Namespace          = 'AWS/SQS',         # built-in AWS namespace for SQS
    MetricName         = 'ApproximateNumberOfMessagesVisible',
    Dimensions         = [{'Name': 'QueueName', 'Value': 'pipeline-events-dlq'}],
    Statistic          = 'Sum',
    Period             = 300,               # check every 5 minutes
    EvaluationPeriods  = 1,
    Threshold          = 0,
    ComparisonOperator = 'GreaterThanThreshold',
    TreatMissingData   = 'notBreaching',
    AlarmActions       = [SNS_ALERT_ARN],
)

print("All alarms created")

💡 TreatMissingData Options

notBreaching — no data is treated as OK (most common for batch pipelines that don't run 24/7). breaching — no data triggers the alarm. ignore — keeps previous state. missing — alarm state becomes INSUFFICIENT_DATA.
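Alarm 1 above thresholds on an absolute count of rejected rows. To alarm on an actual percentage, put_metric_alarm() also accepts a Metrics list containing metric math, in place of Namespace, MetricName, Dimensions, Statistic, and Period. The sketch below only builds the keyword arguments. The helper name `rejection_rate_alarm_kwargs` is hypothetical, and the metric names RowsRead and RowsRejected are the custom metrics published earlier in this section.

```python
def rejection_rate_alarm_kwargs(alarm_name, namespace, dimensions,
                                threshold_pct, sns_arn, period=3600):
    """Build put_metric_alarm arguments for an alarm on rejected / read * 100."""

    def raw_metric(query_id, metric_name):
        # Input series for the expression; ReturnData=False keeps it out of evaluation
        return {
            'Id': query_id,
            'MetricStat': {
                'Metric': {
                    'Namespace': namespace,
                    'MetricName': metric_name,
                    'Dimensions': dimensions,
                },
                'Period': period,
                'Stat': 'Sum',
            },
            'ReturnData': False,
        }

    return {
        'AlarmName': alarm_name,
        'Metrics': [
            raw_metric('m1', 'RowsRead'),
            raw_metric('m2', 'RowsRejected'),
            # Exactly one entry returns data: the series the alarm evaluates
            {'Id': 'e1', 'Expression': '(m2 / m1) * 100',
             'Label': 'Rejection Rate %', 'ReturnData': True},
        ],
        'EvaluationPeriods': 1,
        'Threshold': threshold_pct,
        'ComparisonOperator': 'GreaterThanThreshold',
        'TreatMissingData': 'notBreaching',
        'AlarmActions': [sns_arn],
    }
```

It would be applied with `cw.put_metric_alarm(**rejection_rate_alarm_kwargs(...))`, passing a threshold of 1.0 for a 1% rejection rate.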

Composite Alarms — Combine Multiple Alarms with AND / OR

A composite alarm fires based on the state of other alarms — not directly on a metric. This lets you build logic like: "alert only if BOTH the rejection rate is high AND the pipeline duration is long" (AND), or "alert if ANY of our 5 pipelines fail" (OR). This reduces alert noise.

python — put_composite_alarm()

# ── Fire only when BOTH duration AND rejection alarms are breaching ──
cw.put_composite_alarm(
    AlarmName        = 'orders-pipeline-critical-composite',
    AlarmDescription = 'Critical: both SLA breach AND high rejection rate simultaneously',
    AlarmRule        = (
        'ALARM("orders-pipeline-duration-breach") AND '
        'ALARM("orders-pipeline-high-rejection-rate")'
    ),
    AlarmActions = ['arn:aws:sns:us-east-1:123456789012:data-platform-critical'],
)

# ── Fire when ANY of the three child alarms breach (OR logic) ──
cw.put_composite_alarm(
    AlarmName  = 'any-pipeline-failure',
    AlarmRule  = (
        'ALARM("orders-pipeline-duration-breach") OR '
        'ALARM("orders-pipeline-high-rejection-rate") OR '
        'ALARM("pipeline-dlq-has-messages")'
    ),
    AlarmActions = ['arn:aws:sns:us-east-1:123456789012:data-platform-alerts'],
)
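When the list of child alarms is generated (for example, one alarm per pipeline), the AlarmRule string can be assembled rather than typed by hand. A small sketch, with the hypothetical helper name `alarm_rule`:

```python
def alarm_rule(alarm_names, operator='OR'):
    """Join child alarm names into a composite AlarmRule such as
    ALARM("a") OR ALARM("b")."""
    if operator not in ('AND', 'OR'):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return f' {operator} '.join(f'ALARM("{name}")' for name in alarm_names)
```

The result can be passed directly as `AlarmRule=alarm_rule(names, 'AND')` to put_composite_alarm().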

describe_alarms() — Check Alarm State Programmatically

Use describe_alarms() to check whether an alarm is currently in OK, ALARM, or INSUFFICIENT_DATA state. Useful in pipeline code to gate execution: "don't start today's load if yesterday's DQ alarm is still firing".

python — describe_alarms() and state check

import boto3

cw = boto3.client('cloudwatch', region_name='us-east-1')

# ── Check specific alarms by name ──
response = cw.describe_alarms(
    AlarmNames=[
        'orders-pipeline-high-rejection-rate',
        'orders-pipeline-duration-breach'
    ],
    AlarmTypes=['MetricAlarm']   # or 'CompositeAlarm'
)

for alarm in response['MetricAlarms']:
    name  = alarm['AlarmName']
    state = alarm['StateValue']       # 'OK', 'ALARM', or 'INSUFFICIENT_DATA'
    reason = alarm['StateReason']
    print(f"{name:50s}  state={state}  reason={reason[:60]}")

# ── Gate pattern: stop pipeline if any alarm is FIRING ──
firing = [a['AlarmName'] for a in response['MetricAlarms'] if a['StateValue'] == 'ALARM']
if firing:
    raise RuntimeError(f"Blocking pipeline start — alarms firing: {firing}")

print("All alarms OK — safe to proceed")
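The gate above checks a fixed list of metric alarms. A hedged variant that covers every alarm sharing a name prefix, including composite alarms, and paginates the results could look like this. `cw` is assumed to be a `boto3.client('cloudwatch')` object and the helper name `firing_alarms` is hypothetical.

```python
def firing_alarms(cw, name_prefix):
    """Return the sorted names of all alarms under name_prefix currently in ALARM state."""
    paginator = cw.get_paginator('describe_alarms')
    pages = paginator.paginate(
        AlarmNamePrefix=name_prefix,
        StateValue='ALARM',                          # let the service filter by state
        AlarmTypes=['MetricAlarm', 'CompositeAlarm'],
    )
    names = []
    for page in pages:
        # Metric and composite alarms come back under separate response keys
        names.extend(a['AlarmName'] for a in page.get('MetricAlarms', []))
        names.extend(a['AlarmName'] for a in page.get('CompositeAlarms', []))
    return sorted(names)
```

A pipeline could then refuse to start with `if firing_alarms(cw, 'orders-pipeline-'): raise RuntimeError(...)`.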

delete_alarms()

Remove alarms when decommissioning a pipeline. You can delete up to 100 alarms in one call.

python — delete_alarms()

cw.delete_alarms(
    AlarmNames=[
        'orders-pipeline-high-rejection-rate',
        'orders-pipeline-duration-breach',
        'pipeline-dlq-has-messages'
    ]
)
print("Alarms deleted")

📋

Logs — create_log_group(), create_log_stream(), put_log_events()

Log Groups and Log Streams — Structure

CloudWatch Logs are organized in a two-level hierarchy. A Log Group is a container (e.g. one per pipeline or application). A Log Stream is a sequence of events within a group (e.g. one per pipeline run or per host). Think of it as: Log Group = a notebook, Log Stream = one chapter per run.

📦 Analogy

Log Group = a filing cabinet drawer (one per pipeline). Log Stream = a folder inside (one per run). Log Events = the pages in the folder (individual log lines). You always write to a specific folder (stream) inside a drawer (group).

python — create log group and stream

import boto3
from datetime import datetime, timezone

logs = boto3.client('logs', region_name='us-east-1')

LOG_GROUP  = '/data-platform/pipelines/orders-bronze-to-silver'
run_id     = 'run-2024-01-15-083000'
LOG_STREAM = f'prod/{run_id}'

# ── Create the log group (the try/except makes this safe to re-run if it already exists) ──
try:
    logs.create_log_group(
        logGroupName = LOG_GROUP,
        tags         = {'Pipeline': 'orders-bronze-to-silver', 'Env': 'prod'}
    )
except logs.exceptions.ResourceAlreadyExistsException:
    pass   # already exists — that's fine

# ── Set retention policy (don't keep logs forever — costs money) ──
logs.put_retention_policy(
    logGroupName    = LOG_GROUP,
    retentionInDays = 90    # keep 90 days, then auto-delete
)

# ── Create a new log stream for this run ──
logs.create_log_stream(
    logGroupName  = LOG_GROUP,
    logStreamName = LOG_STREAM
)
print(f"Log stream created: {LOG_GROUP}/{LOG_STREAM}")

put_log_events() — Write Structured Log Lines

put_log_events() writes one or more log events to a stream. Each event needs a Unix timestamp in milliseconds and a message string. For production pipelines, write structured JSON logs so you can query them later with Log Insights.

python — put_log_events() with structured JSON logging

import boto3, json, time
from datetime import datetime, timezone

logs      = boto3.client('logs', region_name='us-east-1')
LOG_GROUP  = '/data-platform/pipelines/orders-bronze-to-silver'
LOG_STREAM = 'prod/run-2024-01-15-083000'

def ts_ms():
    """Current time in milliseconds — required by put_log_events"""
    return int(time.time() * 1000)

# ── Build structured log events ──
events = [
    {
        'timestamp': ts_ms(),
        'message'  : json.dumps({
            'level'        : 'INFO',
            'pipeline'     : 'orders-bronze-to-silver',
            'run_id'       : 'run-2024-01-15-083000',
            'stage'        : 'extract',
            'rows_read'    : 5_432_100,
            'source_table' : 'orders_raw',
            'message'      : 'Extract completed successfully'
        })
    },
    {
        'timestamp': ts_ms() + 1,  # events in a batch must be in chronological order (ties are allowed)
        'message'  : json.dumps({
            'level'          : 'WARN',
            'pipeline'       : 'orders-bronze-to-silver',
            'run_id'         : 'run-2024-01-15-083000',
            'stage'          : 'validate',
            'rows_rejected'  : 110,
            'rejection_reason': 'null_order_id',
            'message'        : '110 rows rejected — null order_id'
        })
    },
]

# ── First call ──
response = logs.put_log_events(
    logGroupName  = LOG_GROUP,
    logStreamName = LOG_STREAM,
    logEvents     = events
)

# ── Check for events rejected as too old, too new, or past retention ──
if 'rejectedLogEventsInfo' in response:
    print(f"Some events were rejected: {response['rejectedLogEventsInfo']}")

# ── Subsequent calls: no sequenceToken needed; the service now ignores it ──
more_events = [
    {'timestamp': ts_ms() + 2, 'message': json.dumps({'level':'INFO','stage':'load','rows_written':5_431_990,'message':'Load complete'})}
]

logs.put_log_events(
    logGroupName  = LOG_GROUP,
    logStreamName = LOG_STREAM,
    logEvents     = more_events
)

⚠️ sequenceToken is No Longer Required

Older versions of CloudWatch Logs required every call after the first to pass the sequenceToken from the previous response, and raised InvalidSequenceTokenException otherwise. The service now ignores the token and always accepts the call, so you can drop it from new code; legacy code that still passes response['nextSequenceToken'] keeps working. Check rejectedLogEventsInfo in the response instead: it reports events dropped for being too old, too new, or expired.

💡 Batch Your Events

Each put_log_events() can carry up to 10,000 events or 1 MB total per call. In pipeline code, buffer log events during processing and flush in one batch at the end rather than calling the API for every single log line.
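A buffer along these lines can be sketched as follows. The class name `LogBuffer` is hypothetical and `client` is assumed to be a `boto3.client('logs')` object. The batch size is estimated the way CloudWatch Logs documents it: UTF-8 message bytes plus 26 bytes per event.

```python
import json
import time

MAX_EVENTS = 10_000        # events per put_log_events call
MAX_BYTES = 1_048_576      # 1 MB per call
EVENT_OVERHEAD = 26        # bytes counted per event on top of the message


class LogBuffer:
    """Collect structured log events and flush them in as few API calls as possible."""

    def __init__(self, client, group, stream, max_bytes=MAX_BYTES, max_events=MAX_EVENTS):
        self.client, self.group, self.stream = client, group, stream
        self.max_bytes, self.max_events = max_bytes, max_events
        self.events, self.size = [], 0

    def log(self, **fields):
        """Queue one JSON log line; flush first if adding it would exceed a limit."""
        message = json.dumps(fields)
        event_size = len(message.encode('utf-8')) + EVENT_OVERHEAD
        too_many = len(self.events) >= self.max_events
        too_big = self.size + event_size > self.max_bytes
        if self.events and (too_many or too_big):
            self.flush()
        self.events.append({'timestamp': int(time.time() * 1000), 'message': message})
        self.size += event_size

    def flush(self):
        """Send everything queued so far; does nothing when the buffer is empty."""
        if not self.events:
            return
        self.client.put_log_events(
            logGroupName=self.group,
            logStreamName=self.stream,
            logEvents=self.events,
        )
        self.events, self.size = [], 0
```

Calling `flush()` in a `finally` block at the end of the job keeps buffered lines from being lost when the pipeline raises.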

filter_log_events() — Search Across Log Streams

filter_log_events() searches across all streams in a log group using a filter pattern. Useful for programmatic debugging: "find all ERROR events from the orders pipeline in the last hour". Note: for complex queries, Log Insights (below) is faster.

python — filter_log_events() to find errors

import boto3
from datetime import datetime, timedelta, timezone

logs = boto3.client('logs', region_name='us-east-1')
now  = datetime.now(timezone.utc)

# ── Find all ERROR log events from the last 1 hour ──
paginator = logs.get_paginator('filter_log_events')
pages     = paginator.paginate(
    logGroupName  = '/data-platform/pipelines/orders-bronze-to-silver',
    startTime     = int((now - timedelta(hours=1)).timestamp() * 1000),
    endTime       = int(now.timestamp() * 1000),
    filterPattern = '"ERROR"',         # simple string match
)

for page in pages:
    for event in page['events']:
        ts  = datetime.fromtimestamp(event['timestamp']/1000, tz=timezone.utc)
        msg = event['message']
        print(f"[{ts:%H:%M:%S}] {msg[:120]}")
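Because the events in this section are written as JSON, filterPattern can also match on fields instead of raw text, using the JSON filter syntax `{ $.field = "value" }`. The sketch below builds such patterns. The helper name `json_filter` is hypothetical, and field names like `level` and `stage` are the ones used in the structured log examples above.

```python
def json_filter(**equals):
    """Build a CloudWatch Logs JSON filter pattern that matches when every
    given field equals the given string value."""
    if not equals:
        raise ValueError("at least one field is required")
    terms = [f'$.{field} = "{value}"' for field, value in equals.items()]
    if len(terms) == 1:
        return '{ ' + terms[0] + ' }'
    # Multiple conditions are parenthesised and joined with &&
    return '{ ' + ' && '.join(f'({term})' for term in terms) + ' }'
```

For example, `filterPattern=json_filter(level='ERROR', stage='load')` restricts the search to load-stage errors, whereas the plain `'"ERROR"'` pattern matches that text anywhere in the message.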

🔍

Log Insights — start_query(), get_query_results()

What is Log Insights?

CloudWatch Log Insights lets you run SQL-like queries across your log groups without downloading logs. It's purpose-built for querying JSON-structured logs at scale. You use it to answer questions like: "how many rows did each pipeline run process this week?", "which runs had DQ rejections?", "what was the average pipeline duration by day?".

📦 Analogy

Log Insights is like Athena for your logs. Instead of moving logs to S3 and querying with SQL, you query the logs directly in CloudWatch using a simplified query language. It's asynchronous: you start a query and then poll for results.

start_query() and get_query_results() — Full Pattern

Log Insights is asynchronous like Athena. You call start_query() to submit the query, get back a queryId, then poll get_query_results() until the status is Complete. The query language uses commands like fields, filter, stats, sort, limit.

python — Log Insights: start → poll → parse results

import boto3, time
from datetime import datetime, timedelta, timezone

logs = boto3.client('logs', region_name='us-east-1')
now  = datetime.now(timezone.utc)

# ── 1. Submit the query ──
query_response = logs.start_query(
    logGroupName = '/data-platform/pipelines/orders-bronze-to-silver',
    startTime    = int((now - timedelta(days=7)).timestamp()),  # epoch seconds (not ms!)
    endTime      = int(now.timestamp()),
    queryString  = """
        fields @timestamp, run_id, stage, rows_read, rows_written, rows_rejected, level
        | filter level = "INFO" and stage = "load"
        | stats sum(rows_written) as total_rows by run_id
        | sort total_rows desc
        | limit 20
    """
)
query_id = query_response['queryId']
print(f"Query started: {query_id}")

# ── 2. Poll until complete ──
while True:
    result = logs.get_query_results(queryId=query_id)
    status = result['status']   # 'Scheduled', 'Running', 'Complete', 'Failed', 'Cancelled', 'Timeout', 'Unknown'
    print(f"Status: {status}")
    if status in ('Complete', 'Failed', 'Cancelled', 'Timeout'):
        break
    time.sleep(2)

if status != 'Complete':
    raise RuntimeError(f"Log Insights query failed: {status}")

# ── 3. Parse results ──
# results is a list of rows, each row is a list of {field, value} dicts
rows = result['results']
print(f"\nTop pipeline runs by rows written:")
for row in rows:
    # Convert list of {field, value} to a dict for easy access
    record = {item['field']: item['value'] for item in row}
    print(f"  run_id={record.get('run_id','?'):30s}  rows_written={record.get('total_rows','?')}")
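The start → poll → parse pattern can also be wrapped in a reusable helper with an upper bound on waiting, so a stuck query cannot block a pipeline indefinitely. This is a minimal sketch under stated assumptions: wait_for_insights_query is a hypothetical helper name, logs_client is assumed to be a boto3 'logs' client, and the timeout values are illustrative defaults.

```python
import time

TERMINAL_STATUSES = {'Complete', 'Failed', 'Cancelled', 'Timeout'}


def wait_for_insights_query(logs_client, query_id, timeout_secs=300,
                            poll_secs=2, sleep=time.sleep):
    """Poll get_query_results() until the query ends or the deadline passes.

    Returns the result rows as a list of {field: value} dicts.
    """
    waited = 0
    while True:
        result = logs_client.get_query_results(queryId=query_id)
        status = result['status']
        if status in TERMINAL_STATUSES:
            break
        if waited >= timeout_secs:
            raise TimeoutError(f"Query {query_id} still {status} after {waited}s")
        sleep(poll_secs)
        waited += poll_secs

    if status != 'Complete':
        raise RuntimeError(f"Log Insights query {query_id} ended with status {status}")
    # Each row arrives as a list of {'field': ..., 'value': ...} items
    return [{item['field']: item['value'] for item in row}
            for row in result['results']]
```

Usage: rows = wait_for_insights_query(logs, query_id). The sleep parameter exists only so the helper can be exercised in tests without real waiting.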

💡 Log Insights Query Language Essentials

fields @timestamp, run_id, level — select fields (use @ prefix for built-in fields)
filter level = "ERROR" — filter rows
stats sum(rows_written) as total by run_id — aggregation
sort total desc — sort results
limit 50 — cap result rows (max 10,000)
parse @message '* rows written' as rows — extract from unstructured text

Production Pattern — Automated Log Insights for SLA Monitoring

Combine Log Insights with CloudWatch metrics publishing to build a self-monitoring pipeline: query yesterday's logs every morning, compute SLA metrics (which runs finished late?), publish them as custom metrics, and trigger alarms if SLA breach count exceeds zero.

python — SLA monitoring via Log Insights

import boto3, time
from datetime import datetime, timedelta, timezone

logs = boto3.client('logs',        region_name='us-east-1')
cw   = boto3.client('cloudwatch', region_name='us-east-1')
now  = datetime.now(timezone.utc)

# ── Query: find any runs that exceeded 2h duration (SLA breach) ──
r = logs.start_query(
    logGroupName = '/data-platform/pipelines/orders-bronze-to-silver',
    startTime    = int((now - timedelta(hours=24)).timestamp()),
    endTime      = int(now.timestamp()),
    queryString  = """
        fields run_id, duration_seconds
        | filter stage = "complete" and duration_seconds > 7200
        | stats count() as sla_breaches
    """
)
query_id = r['queryId']
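while True:
    res = logs.get_query_results(queryId=query_id)
    # Stop on any terminal status so a failed query cannot loop forever
    if res['status'] in ('Complete', 'Failed', 'Cancelled', 'Timeout'): break
    time.sleep(2)
if res['status'] != 'Complete':
    raise RuntimeError(f"Log Insights query ended with status {res['status']}")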

breach_count = 0
if res['results']:
    record       = {item['field']: item['value'] for item in res['results'][0]}
    breach_count = int(record.get('sla_breaches', 0))

# ── Publish SLA breach count as a CloudWatch metric ──
cw.put_metric_data(
    Namespace='DataPlatform/SLA',
    MetricData=[{
        'MetricName': 'SLABreaches',
        'Value'     : breach_count,
        'Unit'      : 'Count',
        'Timestamp' : now,
        'Dimensions': [{'Name': 'PipelineName', 'Value': 'orders-bronze-to-silver'}]
    }]
)
print(f"SLA breaches in last 24h: {breach_count} — metric published")

🏭

Pipeline Observability Patterns — Freshness, Volume, Anomaly, Dashboard PRODUCTION ▼

Data Freshness Monitoring

Data freshness = how recent the latest data in your target table is. Your SLA might say "gold layer orders table must be updated by 06:00 UTC every day". A freshness check queries the max load timestamp in your table and compares it to the expected freshness window. Publish the result as a metric and alarm on it.

python — data freshness check + publish metric

import boto3, time
from datetime import datetime, timezone

cw     = boto3.client('cloudwatch', region_name='us-east-1')
athena = boto3.client('athena',     region_name='us-east-1')

# ── 1. Query the target table for latest load timestamp ──
r = athena.start_query_execution(
    QueryString            = "SELECT MAX(load_dts) AS latest_load FROM gold.orders",
    QueryExecutionContext  = {'Database': 'gold'},
    ResultConfiguration    = {'OutputLocation': 's3://my-datalake/athena-results/'}
)
qid = r['QueryExecutionId']

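while True:
    state = athena.get_query_execution(QueryExecutionId=qid)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'): break
    time.sleep(3)
if state != 'SUCCEEDED':
    raise RuntimeError(f"Athena freshness query ended in state {state}")

result = athena.get_query_results(QueryExecutionId=qid)
# Rows[0] is the header row; Rows[1] holds the single MAX() value.
# Assumes the table is non-empty and load_dts is stored in UTC.
latest_str = result['ResultSet']['Rows'][1]['Data'][0]['VarCharValue']
latest_load = datetime.fromisoformat(latest_str).replace(tzinfo=timezone.utc)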

# ── 2. Calculate staleness in hours ──
staleness_hours = (datetime.now(timezone.utc) - latest_load).total_seconds() / 3600
print(f"Data staleness: {staleness_hours:.1f} hours")

# ── 3. Publish as CloudWatch metric ──
cw.put_metric_data(
    Namespace='DataPlatform/Freshness',
    MetricData=[{
        'MetricName': 'StalenessHours',
        'Value'     : staleness_hours,
        'Unit'      : 'None',   # CloudWatch has no hours unit; the metric name carries it
        'Dimensions': [{'Name': 'Table', 'Value': 'gold.orders'}]
    }]
)
# Alarm on StalenessHours > 6 → pipeline likely missed its SLA window
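The closing comment in the block above mentions alarming on StalenessHours > 6. One way to express that alarm is sketched here as a function that only builds the keyword arguments for put_metric_alarm(). The helper name freshness_alarm_kwargs, the alarm naming scheme, the one-hour period, and the SNS topic ARN are assumptions for illustration. The sketch also assumes the freshness check publishes at least once per hour, because missing data is treated as breaching.

```python
def freshness_alarm_kwargs(table, threshold_hours, sns_topic_arn):
    """Build keyword arguments for cloudwatch.put_metric_alarm()."""
    return {
        'AlarmName'         : f"{table.replace('.', '-')}-stale",
        'AlarmDescription'  : f"{table} has not been loaded for more than {threshold_hours}h",
        'Namespace'         : 'DataPlatform/Freshness',
        'MetricName'        : 'StalenessHours',
        # Must match the dimensions used when the metric was published
        'Dimensions'        : [{'Name': 'Table', 'Value': table}],
        'Statistic'         : 'Maximum',
        'Period'            : 3600,
        'EvaluationPeriods' : 1,
        'Threshold'         : float(threshold_hours),
        'ComparisonOperator': 'GreaterThanThreshold',
        # No datapoint means the freshness check itself did not run
        'TreatMissingData'  : 'breaching',
        'AlarmActions'      : [sns_topic_arn],
    }
```

Usage: cw.put_metric_alarm(**freshness_alarm_kwargs('gold.orders', 6, 'arn:aws:sns:us-east-1:123456789012:data-platform-alerts')). Keeping the arguments in a plain function makes the alarm definition easy to unit-test and to reuse across tables.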

Row Count Anomaly Detection

Publishing daily row counts as a CloudWatch metric lets you use Anomaly Detection — CloudWatch automatically learns the expected range for each time period and alerts you when today's count is statistically unusual. This catches silent data drops without you having to set hard thresholds.

python — enable anomaly detection alarm on RowsWritten

import boto3

cw = boto3.client('cloudwatch', region_name='us-east-1')

# ── First: train the anomaly detection model on your metric ──
cw.put_anomaly_detector(
    Namespace  = 'DataPlatform/Pipelines',
    MetricName = 'RowsWritten',
    Dimensions = [
        {'Name': 'PipelineName', 'Value': 'orders-bronze-to-silver'},
        {'Name': 'Environment',  'Value': 'prod'}
    ],
    Stat       = 'Sum',
    Configuration = {'ExcludedTimeRanges': []}   # optionally exclude maintenance windows
)

# ── Then: create an alarm that fires when RowsWritten is outside the band ──
cw.put_metric_alarm(
    AlarmName           = 'orders-rows-anomaly',
    AlarmDescription    = 'Row count is statistically anomalous — check data source',
    Metrics=[
        {
            'Id': 'm1',
            'MetricStat': {
                'Metric': {'Namespace':'DataPlatform/Pipelines','MetricName':'RowsWritten',
                           # Dimensions must match the anomaly detector above exactly
                           'Dimensions':[{'Name':'PipelineName','Value':'orders-bronze-to-silver'},
                                         {'Name':'Environment','Value':'prod'}]},
                'Period':86400, 'Stat':'Sum'
            }
        },
        {
            'Id'        : 'ad1',
            'Expression': 'ANOMALY_DETECTION_BAND(m1, 2)',  # 2 = standard deviations
            'Label'     : 'Expected Band'
        }
    ],
    ComparisonOperator = 'LessThanLowerOrGreaterThanUpperThreshold',
    ThresholdMetricId  = 'ad1',
    EvaluationPeriods  = 1,
    TreatMissingData   = 'breaching',   # missing data (pipeline didn't run) = alarm
    AlarmActions       = ['arn:aws:sns:us-east-1:123456789012:data-platform-alerts']
)
print("Anomaly detection alarm created")

Complete CloudWatch API Reference for Data Engineers

API	What it does	Key Parameters
`put_metric_data()`	Publish custom metrics	Namespace, MetricData (list, max 1,000 metrics per call)
`get_metric_statistics()`	Query one metric over time	Namespace, MetricName, StartTime, EndTime, Period, Statistics
`get_metric_data()`	Query multiple metrics + math	MetricDataQueries (list with Id, MetricStat/Expression)
`put_metric_alarm()`	Create/update threshold alarm	AlarmName, Threshold, ComparisonOperator, AlarmActions (SNS ARN)
`put_composite_alarm()`	Alarm from other alarm states	AlarmRule (AND/OR logic on alarm names)
`put_anomaly_detector()`	Enable ML anomaly detection	Namespace, MetricName, Stat
`describe_alarms()`	Check alarm state (OK/ALARM)	AlarmNames, AlarmTypes
`delete_alarms()`	Remove alarms	AlarmNames (list, max 100)
`create_log_group()`	Create log container	logGroupName, tags
`put_retention_policy()`	Set log expiry	logGroupName, retentionInDays
`create_log_stream()`	Create log stream (per run)	logGroupName, logStreamName
`put_log_events()`	Write log lines	logGroupName, logStreamName, logEvents (sequenceToken is accepted but now ignored)
`filter_log_events()`	Search log events	logGroupName, filterPattern, startTime, endTime
`start_query()`	Start Log Insights query	logGroupName, startTime, endTime, queryString
`get_query_results()`	Poll + fetch Insights results	queryId

29.30.15 — BOTO3 DEEP DIVE

STS APIs

AWS Security Token Service (STS) issues temporary credentials that expire automatically. For data engineers, STS is the backbone of cross-account access — you assume a role in another account and get short-lived keys to read their S3, Glue, or Redshift without ever sharing long-term credentials.

🪪

get_caller_identity() — Who Am I? IDENTITY ▼

What it does and why you need it

get_caller_identity() returns the Account ID, User ID, and ARN of the currently authenticated identity — whether that is an IAM user, an IAM role, or an assumed role session. It requires no parameters and makes no changes. It is the AWS equivalent of whoami on Linux.

🪪 Analogy

Imagine you are a contractor working at a client's office. Before you touch any of their servers you want to confirm: "Which badge am I wearing right now? Am I logged in as myself or as a client role?" get_caller_identity() is exactly that — it tells you which identity boto3 is currently operating as, so you can verify the right role is active before running a destructive operation.

python — get_caller_identity()

import boto3

sts = boto3.client('sts', region_name='us-east-1')

identity = sts.get_caller_identity()

print("Account  :", identity['Account'])   # e.g. '123456789012'
print("UserId   :", identity['UserId'])    # e.g. 'AROAEXAMPLEID:my-session'
print("ARN      :", identity['Arn'])       # e.g. 'arn:aws:sts::123456789012:assumed-role/GlueExecutionRole/my-session'

✅ Real Use Case

At the start of every production pipeline, call get_caller_identity() and log the result to CloudWatch. If a pipeline ever misbehaves in production you can immediately see "was it running as the right role, in the right account?" rather than guessing.

python — guard: abort if wrong account

import boto3, sys

EXPECTED_ACCOUNT = '123456789012'  # prod account ID

sts      = boto3.client('sts')
identity = sts.get_caller_identity()

if identity['Account'] != EXPECTED_ACCOUNT:
    print(f"ABORT: running in account {identity['Account']}, expected {EXPECTED_ACCOUNT}")
    sys.exit(1)

print("Account verified — proceeding with pipeline")
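The account guard can be tightened to also check which role is active, since the ARN returned by get_caller_identity() encodes the role and session name. The following is a minimal sketch: describe_identity and require_role are hypothetical helper names, and the parsing covers only the common ARN shapes (assumed-role sessions, IAM users, and the account root).

```python
def describe_identity(arn):
    """Split an identity ARN into account, kind, name and (for roles) session."""
    # Layout: arn:partition:service:region:account-id:resource
    _, _, _, _, account, resource = arn.split(':', 5)
    kind, _, rest = resource.partition('/')
    if kind == 'assumed-role':
        name, _, session = rest.partition('/')
    else:
        # IAM users may carry a path prefix, e.g. user/etl/alice
        name, session = rest.rsplit('/', 1)[-1], None
    return {'account': account, 'kind': kind,
            'name': name or None, 'session': session or None}


def require_role(identity, expected_account, expected_role):
    """Raise PermissionError unless the caller is the expected role in the expected account.

    identity: the dict returned by sts.get_caller_identity().
    """
    info = describe_identity(identity['Arn'])
    if identity['Account'] != expected_account:
        raise PermissionError(
            f"running in account {identity['Account']}, expected {expected_account}")
    if info['kind'] != 'assumed-role' or info['name'] != expected_role:
        raise PermissionError(
            f"running as {info['kind']}/{info['name']}, expected role {expected_role}")
    return info
```

Usage: require_role(sts.get_caller_identity(), '123456789012', 'GlueExecutionRole') as the first statement of a pipeline. Raising an exception instead of calling sys.exit() lets the orchestrator record the failure reason.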

🔄

assume_role() — Become Another Role ASSUME ROLE ▼

How assume_role() works — the 3-step model

assume_role() is how one AWS identity temporarily gains the permissions of another IAM role. You call STS, it verifies your identity has permission to assume the target role, and it returns temporary credentials (Access Key, Secret Key, Session Token) that expire after 15 minutes to 12 hours.

🎭 Analogy

Think of it as an actor putting on a costume. Your main IAM identity is the actor. The target role is the costume. While wearing the costume (within the session duration) you have all the permissions of that role — not your original permissions. When the session expires the costume comes off automatically.

🔑

RoleArn

Full ARN of the role to assume. Provided by the target account admin.

🏷️

RoleSessionName

A label for this session — appears in CloudTrail logs. Use your pipeline name + run ID for traceability.

⏱️

DurationSeconds

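Default 3600 (1 hour). The maximum is the role's MaxSessionDuration setting (up to 43200 = 12 hours). When the caller is itself an assumed-role session (role chaining), STS caps the new session at 1 hour regardless of that setting.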

🔒

ExternalId

Optional extra secret. Used in cross-account vendor patterns to prevent confused deputy attacks.
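The four parameters above can be assembled by a small helper that also keeps RoleSessionName valid and traceable. This is a sketch under stated assumptions: assume_role_kwargs is a hypothetical helper name, and the "pipeline name + run ID" naming scheme follows the suggestion in the RoleSessionName card. STS accepts alphanumerics plus the characters _ + = , . @ - in a session name, up to 64 characters, so anything else is replaced with a hyphen here.

```python
import re

# Characters STS does not accept in RoleSessionName
_INVALID_SESSION_CHARS = re.compile(r'[^\w+=,.@-]', re.ASCII)


def assume_role_kwargs(role_arn, pipeline, run_id, duration=3600, external_id=None):
    """Build keyword arguments for sts.assume_role()."""
    # Replace invalid characters and respect the 64-character limit
    session_name = _INVALID_SESSION_CHARS.sub('-', f"{pipeline}-{run_id}")[:64]
    kwargs = {
        'RoleArn'        : role_arn,
        'RoleSessionName': session_name,
        'DurationSeconds': duration,
    }
    # ExternalId is only sent when the target role's trust policy requires it
    if external_id is not None:
        kwargs['ExternalId'] = external_id
    return kwargs
```

Usage: response = sts.assume_role(**assume_role_kwargs(role_arn, 'orders-pipeline', run_id, external_id=vendor_external_id)). The external ID value itself is agreed with the owner of the target account and is not generated by this code.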

python — assume_role() and build a new boto3 session

import boto3

sts = boto3.client('sts', region_name='us-east-1')

# ── Step 1: Assume the target role ──
response = sts.assume_role(
    RoleArn         = 'arn:aws:iam::999888777666:role/DataLakeReadRole',
    RoleSessionName = 'orders-pipeline-cross-account-read',
    DurationSeconds = 3600   # 1 hour
)

# ── Step 2: Extract the temporary credentials ──
creds = response['Credentials']
# creds contains: AccessKeyId, SecretAccessKey, SessionToken, Expiration

print("Temp creds expire at:", creds['Expiration'])

# ── Step 3: Build a boto3 session using the temp credentials ──
assumed_session = boto3.Session(
    aws_access_key_id     = creds['AccessKeyId'],
    aws_secret_access_key = creds['SecretAccessKey'],
    aws_session_token     = creds['SessionToken'],
    region_name           = 'us-east-1'
)

# ── Now use this session to create service clients in the target account ──
s3_target    = assumed_session.client('s3')
glue_target  = assumed_session.client('glue')

# ── Read data from the target account's S3 bucket ──
obj = s3_target.get_object(
    Bucket = 'partner-data-lake-999888777666',
    Key    = 'bronze/orders/2024-01-15/orders.parquet'
)
print("Read", len(obj['Body'].read()), "bytes from target account S3")

# ── List Glue tables in target account's catalog ──
tables_resp = glue_target.get_tables(DatabaseName='partner_bronze_db')
for t in tables_resp['TableList']:
    print("  table:", t['Name'])

⚠️ All Three Credential Fields Are Required

When building a boto3 session from assumed-role credentials you must pass all three: aws_access_key_id, aws_secret_access_key, AND aws_session_token. Omitting the session token causes authentication errors such as AuthFailure, InvalidClientTokenId, or (from S3) InvalidAccessKeyId, even though the access key and secret look valid.

Token expiry — handling long-running pipelines

Temporary credentials expire. If your pipeline runs longer than the session duration, the boto3 clients will start failing with ExpiredToken or ExpiredTokenException (the exact code varies by service). The pattern for long pipelines is to check expiry before each API call and re-assume the role when needed.

python — auto-refreshing assumed role helper

import boto3
from datetime import datetime, timezone, timedelta

class AssumedRoleSession:
    """Wrapper that auto-refreshes STS credentials before expiry."""

    def __init__(self, role_arn, session_name, duration=3600, refresh_before_secs=300):
        self.role_arn            = role_arn
        self.session_name        = session_name
        self.duration            = duration
        self.refresh_before_secs = refresh_before_secs  # refresh 5 min before expiry
        self.sts                 = boto3.client('sts')
        self._session            = None
        self._expiration         = None

    def _refresh(self):
        resp  = self.sts.assume_role(
            RoleArn         = self.role_arn,
            RoleSessionName = self.session_name,
            DurationSeconds = self.duration
        )
        creds            = resp['Credentials']
        self._expiration = creds['Expiration']
        self._session    = boto3.Session(
            aws_access_key_id     = creds['AccessKeyId'],
            aws_secret_access_key = creds['SecretAccessKey'],
            aws_session_token     = creds['SessionToken']
        )
        print(f"[STS] Role assumed — expires {self._expiration}")

    def session(self):
        now = datetime.now(tz=timezone.utc)
        if self._session is None or (self._expiration - now) < timedelta(seconds=self.refresh_before_secs):
            self._refresh()
        return self._session

# ── Usage ──
role_session = AssumedRoleSession(
    role_arn     = 'arn:aws:iam::999888777666:role/DataLakeReadRole',
    session_name = 'long-running-pipeline'
)

# Each call to .session() auto-refreshes if expiry is near
s3 = role_session.session().client('s3')
# ... do work ...
s3 = role_session.session().client('s3')  # safe to call again hours later
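A client that was created before a refresh keeps its old credentials, so a long-held client can still hit an expired-token error. A complementary safety net is to catch that error once, rebuild the client from a fresh session, and retry. The sketch below is an assumption-laden illustration: call_with_fresh_credentials and aws_error_code are hypothetical names, get_session is any zero-argument callable returning a boto3 Session (for example the .session method of the wrapper above), and the error code is read by duck typing from the response attribute that botocore's ClientError carries.

```python
EXPIRED_CODES = {'ExpiredToken', 'ExpiredTokenException'}


def aws_error_code(exc):
    """Return the AWS error code of a ClientError-like exception, else None."""
    response = getattr(exc, 'response', None) or {}
    return response.get('Error', {}).get('Code')


def call_with_fresh_credentials(get_session, service, operation, max_attempts=2):
    """Run operation(client); rebuild the client and retry if credentials expired."""
    for attempt in range(1, max_attempts + 1):
        # A new client is built on every attempt so it picks up refreshed creds
        client = get_session().client(service)
        try:
            return operation(client)
        except Exception as exc:
            expired = aws_error_code(exc) in EXPIRED_CODES
            if not expired or attempt == max_attempts:
                raise
```

Usage: call_with_fresh_credentials(role_session.session, 's3', lambda s3: s3.list_objects_v2(Bucket='partner-data-lake-999888777666')). Any error other than an expired token is re-raised immediately.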

🏢

Cross-Account Data Access Pattern — Full Pipeline PRODUCTION ▼

Architecture — How cross-account access works end to end

Large organisations split AWS accounts by team, domain, or environment. Your pipeline runs in Account A (the source account) but the data lives in Account B (the target account). STS assume_role() bridges the two accounts — no VPN, no credential sharing, no permanent access.

ACCOUNT A (your pipeline account — 123456789012)
  IAM Role: GlueExecutionRole
    └─ Policy: sts:AssumeRole on arn:aws:iam::999888777666:role/DataLakeReadRole
          │
          │  STS AssumeRole call
          ▼
ACCOUNT B (data lake account — 999888777666)
  IAM Role: DataLakeReadRole
    └─ Trust Policy: allow 123456789012 / GlueExecutionRole to assume this role
    └─ Permission Policy: s3:GetObject on s3://partner-data-lake-999888777666/*
                          glue:GetTable, glue:GetDatabase on partner_bronze_db
          │
          │  Temporary creds (1 hour)
          ▼
┌──────────────────────────────────────────────────────────────┐
│ boto3.Session(AccessKeyId, SecretAccessKey, SessionToken)    │
│ s3_client   → reads from s3://partner-data-lake-999888777666 │
│ glue_client → reads partner_bronze_db tables                 │
└──────────────────────────────────────────────────────────────┘

💡 Two Things Must Be Set Up in Account B

1. The role's Trust Policy must list Account A's role as a principal allowed to assume it.
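2. The role's Permission Policy must grant the actual resource actions (s3:GetObject, glue:GetTable etc.). Listing calls need their own actions: list_objects_v2() requires s3:ListBucket on the bucket and get_tables() requires glue:GetTables. Both policies are needed: the trust is the door, the permission is the key.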
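The two policy documents can be sketched as Python dictionaries, ready to be serialised with json.dumps() and handed to the Account B administrator or to an IaC tool. This is a minimal sketch: trust_policy and permission_policy are hypothetical helper names, the region default is an assumption, and a real policy would normally be reviewed and tightened (for example with conditions) by the owning team.

```python
import json


def trust_policy(source_account, source_role):
    """Trust policy for the Account B role: who may assume it."""
    return {
        'Version': '2012-10-17',
        'Statement': [{
            'Effect'   : 'Allow',
            'Principal': {'AWS': f'arn:aws:iam::{source_account}:role/{source_role}'},
            'Action'   : 'sts:AssumeRole',
        }],
    }


def permission_policy(bucket, glue_db, target_account, region='us-east-1'):
    """Permission policy for the Account B role: what it may do once assumed."""
    glue = f'arn:aws:glue:{region}:{target_account}'
    return {
        'Version': '2012-10-17',
        'Statement': [
            # Object reads apply to keys, listing applies to the bucket itself
            {'Effect': 'Allow', 'Action': 's3:GetObject',
             'Resource': f'arn:aws:s3:::{bucket}/*'},
            {'Effect': 'Allow', 'Action': 's3:ListBucket',
             'Resource': f'arn:aws:s3:::{bucket}'},
            {'Effect': 'Allow',
             'Action': ['glue:GetDatabase', 'glue:GetTable', 'glue:GetTables'],
             'Resource': [f'{glue}:catalog',
                          f'{glue}:database/{glue_db}',
                          f'{glue}:table/{glue_db}/*']},
        ],
    }


print(json.dumps(trust_policy('123456789012', 'GlueExecutionRole'), indent=2))
```

The account IDs, role name, bucket and database used in a call would be the ones from the architecture diagram above.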

Full Production Cross-Account Pipeline — boto3 code

A complete pattern: assume role in target account → list S3 objects → read Glue table metadata → write a summary of the results back to source account S3.

python — complete cross-account pipeline

import boto3
from botocore.exceptions import ClientError

# ════════════════════════════════════════════════
# CONFIG
# ════════════════════════════════════════════════
SOURCE_ACCOUNT     = '123456789012'
TARGET_ACCOUNT     = '999888777666'
TARGET_ROLE_ARN    = f'arn:aws:iam::{TARGET_ACCOUNT}:role/DataLakeReadRole'
TARGET_BUCKET      = 'partner-data-lake-999888777666'
TARGET_GLUE_DB     = 'partner_bronze_db'
OUTPUT_BUCKET      = 'my-pipeline-output-123456789012'   # source account bucket
OUTPUT_KEY         = 'silver/orders/cross_account_summary.json'   # small summary object, not Parquet data

# ════════════════════════════════════════════════
# STEP 1 — Verify current identity (sanity check)
# ════════════════════════════════════════════════
sts      = boto3.client('sts')
identity = sts.get_caller_identity()
print(f"Running as: {identity['Arn']} in account {identity['Account']}")

# ════════════════════════════════════════════════
# STEP 2 — Assume role in target account
# ════════════════════════════════════════════════
try:
    response = sts.assume_role(
        RoleArn         = TARGET_ROLE_ARN,
        RoleSessionName = 'orders-cross-account-read-2024',
        DurationSeconds = 3600
    )
except ClientError as e:
    print(f"Failed to assume role: {e.response['Error']['Code']}: {e.response['Error']['Message']}")
    raise

creds = response['Credentials']
print(f"Assumed role — session expires: {creds['Expiration']}")

# ════════════════════════════════════════════════
# STEP 3 — Build clients in target account
# ════════════════════════════════════════════════
target_session = boto3.Session(
    aws_access_key_id     = creds['AccessKeyId'],
    aws_secret_access_key = creds['SecretAccessKey'],
    aws_session_token     = creds['SessionToken'],
    region_name           = 'us-east-1'
)

s3_target   = target_session.client('s3')
glue_target = target_session.client('glue')

# ════════════════════════════════════════════════
# STEP 4 — List objects in target S3 (with paginator)
# ════════════════════════════════════════════════
paginator = s3_target.get_paginator('list_objects_v2')
pages     = paginator.paginate(Bucket=TARGET_BUCKET, Prefix='bronze/orders/2024-01-15/')

keys = []
for page in pages:
    for obj in page.get('Contents', []):
        keys.append(obj['Key'])
print(f"Found {len(keys)} objects in target bucket")

# ════════════════════════════════════════════════
# STEP 5 — Read Glue table metadata from target account
# ════════════════════════════════════════════════
tables_resp = glue_target.get_tables(DatabaseName=TARGET_GLUE_DB)
print("\nGlue tables in target account:")
for t in tables_resp['TableList']:
    print(f"  {t['Name']} — {t.get('StorageDescriptor',{}).get('Location','?')}")

# ════════════════════════════════════════════════
# STEP 6 — Write result back to SOURCE account S3
# (use default boto3, not target_session)
# ════════════════════════════════════════════════
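import json  # only needed for this step; normally placed with the imports at the top

summary   = {'keys_found': len(keys), 'tables': [t['Name'] for t in tables_resp['TableList']]}
s3_source = boto3.client('s3')  # default creds → source account
s3_source.put_object(
    Bucket  = OUTPUT_BUCKET,
    Key     = OUTPUT_KEY,
    Body    = json.dumps(summary).encode('utf-8')   # valid JSON rather than a Python repr
)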
print(f"\nResults written to s3://{OUTPUT_BUCKET}/{OUTPUT_KEY}")

☸️

assume_role_with_web_identity() — EKS Pod Identity KUBERNETES ▼

What it is and when you use it

When Spark runs on Amazon EKS (Kubernetes), driver and executor pods need AWS credentials to access S3, Glue, etc. The modern approach is IRSA — IAM Roles for Service Accounts. Kubernetes automatically injects a web identity token into each pod, and boto3 exchanges that token for temporary credentials using assume_role_with_web_identity(). You do not call this manually — boto3 does it automatically when IRSA is configured. But you need to understand what is happening under the hood.

☸️ Analogy

Think of the Kubernetes service account token as a temporary visitor badge issued by the cluster. The pod shows this badge to STS and STS exchanges it for an AWS IAM role session. The pod never holds permanent AWS credentials — only the badge which is auto-rotated.

python — how boto3 auto-uses IRSA (for understanding, not manual usage)

# When IRSA is configured on the EKS pod, boto3 automatically does this:
# 1. Reads the web identity token from the file at $AWS_WEB_IDENTITY_TOKEN_FILE
# 2. Reads the role ARN from $AWS_ROLE_ARN environment variable
# 3. Calls assume_role_with_web_identity() transparently
# 4. Returns a boto3 session with the role's permissions

# You just write normal boto3 code — IRSA handles the credential chain:
import boto3

s3 = boto3.client('s3')              # automatically uses IRSA creds on EKS
response = s3.list_buckets()
print([b['Name'] for b in response['Buckets']])

# ── If you ever need to call it manually (rare) ──
sts = boto3.client('sts')

with open('/var/run/secrets/eks.amazonaws.com/serviceaccount/token') as f:
    web_identity_token = f.read()

response = sts.assume_role_with_web_identity(
    RoleArn              = 'arn:aws:iam::123456789012:role/SparkDriverRole',
    RoleSessionName      = 'spark-driver-pod',
    WebIdentityToken     = web_identity_token,
    DurationSeconds      = 3600
)

creds = response['Credentials']
spark_session = boto3.Session(
    aws_access_key_id     = creds['AccessKeyId'],
    aws_secret_access_key = creds['SecretAccessKey'],
    aws_session_token     = creds['SessionToken']
)

☸️ EKS Best Practice

On EKS, never pass AWS credentials as environment variables in pod specs. Always use IRSA. The Kubernetes service account is annotated with the IAM role ARN (eks.amazonaws.com/role-arn) and EKS automatically injects the token file. This is the approved zero-credential pattern for Spark on EKS.
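A pod can verify at startup that it is really running on IRSA credentials. In boto3's default credential chain, static keys in environment variables take precedence over the web identity token, so leftover AWS_ACCESS_KEY_ID variables silently override IRSA. The check below is a minimal sketch: detect_credential_source is a hypothetical helper name and the returned labels are illustrative, while the environment variable names are the standard ones.

```python
import os


def detect_credential_source(env=None):
    """Classify which credentials boto3 would most likely pick up in this pod."""
    env = os.environ if env is None else env
    has_static = bool(env.get('AWS_ACCESS_KEY_ID')) and bool(env.get('AWS_SECRET_ACCESS_KEY'))
    has_irsa   = bool(env.get('AWS_WEB_IDENTITY_TOKEN_FILE')) and bool(env.get('AWS_ROLE_ARN'))
    if has_static:
        # Environment keys are consulted before the web identity provider
        return 'static-env-keys'
    if has_irsa:
        return 'irsa'
    return 'other'


source = detect_credential_source()
if source == 'static-env-keys':
    print("WARNING: static AWS keys found in the environment; IRSA will be ignored")
```

A result of 'other' means neither mechanism is configured through the environment, in which case boto3 falls back to later providers such as instance metadata.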

📋

STS API Quick Reference for Data Engineers REFERENCE ▼

Complete STS API summary

API	What it does	Key Parameters	Returns
`get_caller_identity()`	Return current identity info	None	Account, UserId, Arn
`assume_role()`	Get temp creds for an IAM role	RoleArn, RoleSessionName, DurationSeconds, ExternalId	Credentials (AccessKeyId, SecretAccessKey, SessionToken, Expiration)
`assume_role_with_web_identity()`	Get temp creds using OIDC token (EKS/IRSA)	RoleArn, RoleSessionName, WebIdentityToken	Credentials + AssumedRoleUser
`assume_role_with_saml()`	Get temp creds using SAML SSO assertion	RoleArn, PrincipalArn, SAMLAssertion	Credentials
`get_session_token()`	MFA-based temp creds for IAM user	DurationSeconds, SerialNumber, TokenCode (MFA)	Credentials
`get_federation_token()`	Temp creds with policy scoping (for app users)	Name, Policy, DurationSeconds	Credentials + FederatedUser

💡 The Two You Will Use Most

As a data engineer, 99% of your STS usage is two APIs: get_caller_identity() for sanity-checking which identity is active, and assume_role() for cross-account access. The others (SAML, web identity, federation) are handled by infrastructure/platform teams.

Common STS Errors and Fixes

Error Code	Cause	Fix
`AccessDenied`	Your role does not have `sts:AssumeRole` permission on the target role ARN	Add `sts:AssumeRole` to your role's permission policy for the target ARN
`AccessDenied` (trust policy)	Target role's trust policy does not list your identity as a trusted principal	Add your role ARN to the target role's trust policy (done by target account admin)
`ExpiredTokenException`	Using a client built from expired temp credentials	Re-assume the role, rebuild the boto3 Session with fresh creds
`InvalidClientTokenId`	Access key in the credentials is wrong or stale	Ensure you are passing all three fields: AccessKeyId + SecretAccessKey + SessionToken
`ValidationError: RoleSessionName`	Session name has invalid characters (spaces, slashes)	Use only alphanumerics and the characters _ + = , . @ - in RoleSessionName (2 to 64 characters)

29.30.16 · Boto3 Deep Dive

EventBridge APIs

Amazon EventBridge is the event bus of AWS. As a data engineer you use it to trigger pipelines on a schedule, react to S3/Glue/EMR events, and publish your own custom pipeline events so downstream systems can react. All of this is scriptable via boto3.

📅

EventBridge Core Concepts Foundation ▼

What is EventBridge?

EventBridge is a serverless event router. Events flow in → rules filter them → targets (Lambda, SQS, Glue, Step Functions…) execute. Think of it as an intelligent if-this-then-that engine for AWS.

🧠 Analogy

EventBridge is like a post office with sorting rules. Letters (events) arrive at the post office (event bus). Sorting rules (rules) check the address pattern. Matching letters are delivered to the right mailbox (target). You never touch the letters manually.

EVENT FLOW

Source                  EventBridge                                      Target
──────                  ───────────                                      ──────
S3 (file arrived)    →  default bus → Rule: "S3 PutObject on raw/"
Glue Job (failed)    →  default bus → Rule: "Glue FAILED state"       →  SNS Alert
Your pipeline code   →  custom bus  → Rule: "pipeline.completed"      →  Lambda
Cron schedule        →  default bus → Rule: "rate(1 day)"             →  Glue Job

Key Terms

Term	What it is	Data Engineering Context
Event Bus	Channel that receives events. Default bus = AWS service events. Custom bus = your own events.	Use default bus for S3/Glue/EMR events. Create custom bus for pipeline events.
Event	JSON payload describing something that happened. Max 256 KB.	S3 object created, Glue job state change, your custom pipeline.completed event.
Rule	Pattern-match expression that filters events and routes to targets.	"Run Glue job daily at 06:00 UTC" or "Alert SNS when Glue job state = FAILED".
Target	AWS resource that EventBridge invokes when a rule matches. Up to 5 targets per rule.	Lambda, SQS, SNS, Step Functions, Kinesis, Glue workflows. Individual Glue jobs and EMR are started indirectly (via Lambda or Step Functions).
Schedule	Cron or rate expression embedded in a rule. No source event needed.	`rate(1 day)`, `cron(0 6 * * ? *)` — trigger Glue/EMR daily.

📤

put_events() — Publishing Custom Pipeline Events Most Used ▼

Why publish custom events?

put_events() lets your pipeline announce itself to the rest of the system. Instead of hardcoding "after step A, call step B", you publish an event like pipeline.completed and let downstream rules decide what to do. This decouples producers from consumers.

📦 Real Use Case

Your Glue job finishes writing Silver layer Parquet. It publishes {"detail-type": "silver.layer.ready", "source": "com.mycompany.pipeline"}. A rule triggers a Lambda that starts the Gold layer aggregation job — without the Silver job knowing anything about Gold.
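The rule side of this use case is an event pattern that matches on source and detail-type. The sketch below builds such a pattern and includes a deliberately simplified local matcher for unit tests. build_event_pattern and matches are hypothetical helper names, and the matcher only supports exact-value lists and nested objects, which is a small subset of what EventBridge patterns can express (no prefix, numeric, or anything-but matching).

```python
import json


def build_event_pattern(source, detail_type, **detail_equals):
    """Build an EventBridge event pattern as a dict (serialise with json.dumps)."""
    pattern = {'source': [source], 'detail-type': [detail_type]}
    if detail_equals:
        # Each pattern value is a list of accepted values
        pattern['detail'] = {key: [value] for key, value in detail_equals.items()}
    return pattern


def matches(pattern, event):
    """Simplified local check: every pattern key must match by exact value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


pattern = build_event_pattern('com.mycompany.pipeline', 'silver.layer.ready')
print(json.dumps(pattern))
```

The JSON string is what the boto3 events client expects in the EventPattern parameter of put_rule(). Testing patterns locally against sample events catches typos in source or detail-type before a rule silently never fires.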

put_events() API

Sends up to 10 events per API call. Each event has these key fields:

Field	Required	Description
`Source`	Yes	Who sent it. Convention: `com.yourcompany.service`
`DetailType`	Yes	Human-readable event category. Used in rule pattern matching.
`Detail`	Yes	JSON string with the event payload. Put your pipeline metadata here.
`EventBusName`	No	Omit = default bus. Specify name/ARN for custom bus.
`Time`	No	Event timestamp. Defaults to now.
`Resources`	No	List of ARNs related to this event (S3 bucket, table, etc.)

Python · boto3

import boto3, json
from datetime import datetime, timezone

events_client = boto3.client('events', region_name='us-east-1')

# ── Publish a custom pipeline completion event ──
response = events_client.put_events(
    Entries=[
        {
            'Source': 'com.mycompany.data-pipeline',           # your app identifier
            'DetailType': 'pipeline.silver.completed',         # event category
            'Detail': json.dumps({                            # payload as JSON string
                'pipeline_name': 'customer_silver_etl',
                'run_id': 'run_2024_01_15_001',
                'status': 'SUCCESS',
                'rows_written': 1_450_000,
                'output_s3_path': 's3://my-lake/silver/customers/',
                'completed_at': datetime.now(timezone.utc).isoformat()
            }),
            'EventBusName': 'data-platform-bus',            # custom bus (omit = default)
            'Resources': [
                'arn:aws:s3:::my-lake'
            ]
        }
    ]
)

# ── Check for failures (partial failure is possible) ──
failed = response.get('FailedEntryCount', 0)
if failed > 0:
    for entry in response['Entries']:
        if 'ErrorCode' in entry:
            print(f"Failed: {entry['ErrorCode']} — {entry['ErrorMessage']}")
else:
    event_id = response['Entries'][0]['EventId']
    print(f"Event published: {event_id}")

💡 Partial Failures

put_events() never throws an exception for rejected events — it returns them in Entries with an ErrorCode. Always check FailedEntryCount and loop through entries. The position in Entries matches your input order.
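Because result entries are positional, the rejected events can be picked out and resent on their own. This is a minimal sketch, assuming events_client is a boto3 'events' client and that the rejections are transient (throttling or internal errors); put_events_with_retry is a hypothetical helper name and the backoff values are illustrative. Entries rejected for validation reasons will keep failing and are returned to the caller.

```python
import time


def put_events_with_retry(events_client, entries, max_attempts=3,
                          backoff_secs=0.5, sleep=time.sleep):
    """Send up to 10 entries; resend only the rejected ones.

    Returns the entries that were still rejected after the last attempt.
    """
    pending = list(entries)
    for attempt in range(1, max_attempts + 1):
        response = events_client.put_events(Entries=pending)
        if response.get('FailedEntryCount', 0) == 0:
            return []
        # Result entries are positional: response['Entries'][i] describes pending[i]
        pending = [entry for entry, result in zip(pending, response['Entries'])
                   if 'ErrorCode' in result]
        if attempt < max_attempts:
            sleep(backoff_secs * 2 ** (attempt - 1))   # exponential backoff
    return pending
```

Usage: leftover = put_events_with_retry(events_client, entries). A non-empty result should be logged or sent to a dead-letter location rather than dropped.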

Batch Publishing (multiple events in one call)

Up to 10 events per call, total payload ≤ 256 KB. For high-volume scenarios, chunk your events.

Python · boto3

import boto3, json

events_client = boto3.client('events')

def publish_events_batch(events_list, bus_name='default'):
    """Publish events in chunks of 10 (EventBridge limit)."""
    chunk_size = 10
    total_failed = 0

    for i in range(0, len(events_list), chunk_size):
        chunk = events_list[i:i + chunk_size]
        entries = [
            {
                'Source': evt['source'],
                'DetailType': evt['detail_type'],
                'Detail': json.dumps(evt['detail']),
                'EventBusName': bus_name
            }
            for evt in chunk
        ]
        response = events_client.put_events(Entries=entries)
        total_failed += response['FailedEntryCount']

    print(f"Published {len(events_list) - total_failed} of {len(events_list)} events, {total_failed} failed")

# ── Usage ──
pipeline_events = [
    {'source': 'com.co.pipeline', 'detail_type': 'table.loaded', 'detail': {'table': 'orders', 'rows': 50000}},
    {'source': 'com.co.pipeline', 'detail_type': 'table.loaded', 'detail': {'table': 'customers', 'rows': 12000}},
]
publish_events_batch(pipeline_events, bus_name='data-platform-bus')
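The helper above chunks by count only. If your Detail payloads are large, the 256 KB limit can be reached before the 10-entry limit. Here is a minimal sketch of chunking by both, assuming entry size can be approximated by the UTF-8 length of Source, DetailType, Detail, and Resources. `entry_size` and `chunk_entries` are hypothetical helpers, and the size estimate is an approximation, not the service's exact accounting.

```python
import json

MAX_ENTRIES = 10            # entries per put_events call
MAX_BYTES = 256 * 1024      # total payload budget per call

def entry_size(entry):
    """Approximate an entry's size as the UTF-8 bytes of its string fields."""
    size = 0
    for key in ('Source', 'DetailType', 'Detail'):
        size += len(entry.get(key, '').encode('utf-8'))
    for resource in entry.get('Resources', []):
        size += len(resource.encode('utf-8'))
    return size

def chunk_entries(entries, max_entries=MAX_ENTRIES, max_bytes=MAX_BYTES):
    """Yield batches that respect both the count limit and the byte budget."""
    batch, batch_bytes = [], 0
    for entry in entries:
        size = entry_size(entry)
        if size > max_bytes:
            raise ValueError("single entry exceeds the size budget")
        # Close the current batch if adding this entry would break a limit
        if batch and (len(batch) >= max_entries or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(entry)
        batch_bytes += size
    if batch:
        yield batch

demo = [
    {'Source': 'com.co.pipeline', 'DetailType': 'table.loaded',
     'Detail': json.dumps({'table': f't{i}'})}
    for i in range(25)
]
print([len(b) for b in chunk_entries(demo)])
```

Each yielded batch can be passed directly as `Entries=` to put_events().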

📋

put_rule() — Schedule & Event Pattern Rules Rule Management ▼

Two Types of Rules

Type	When it fires	DE Use Case	Key Parameter
Schedule Rule	On a cron or rate schedule. No incoming event needed.	Trigger Glue job daily at 06:00 UTC	`ScheduleExpression`
Event Pattern Rule	When an event on the bus matches a JSON pattern.	Alert when Glue job state changes to FAILED	`EventPattern`

Creating a Schedule Rule (cron/rate)

Python · boto3

import boto3

events_client = boto3.client('events')

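# ── Rate-based schedule: every 1 day ──
# Note: the boto3 events client creates (or updates) rules with put_rule();
# it has no create_rule() method.
response = events_client.put_rule(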
    Name='daily-silver-etl-trigger',
    ScheduleExpression='rate(1 day)',       # runs every 24h
    State='ENABLED',
    Description='Triggers Silver ETL Glue job every day',
    EventBusName='default'                 # schedule rules always on default bus
)
rule_arn = response['RuleArn']
print(f"Created rule: {rule_arn}")

# ── Cron-based schedule: 06:00 UTC every weekday ──
events_client.put_rule(
    Name='weekday-gold-etl-trigger',
    ScheduleExpression='cron(0 6 ? * MON-FRI *)',  # Mon–Fri 06:00 UTC
    State='ENABLED',
    Description='Gold layer aggregation, weekdays only'
)

# ── Common cron expressions ──
# cron(Minutes Hours Day-of-month Month Day-of-week Year)
# cron(0 6 * * ? *)       → every day at 06:00 UTC
# cron(0 */6 * * ? *)     → every 6 hours
# cron(0 8 1 * ? *)       → 1st of every month at 08:00 UTC
# cron(0 6 ? * MON-FRI *) → weekdays at 06:00 UTC

⚠️ EventBridge Cron Quirk

EventBridge cron uses 6 fields (not 5 like Unix cron). The 6th field is Year. Also, Day-of-month and Day-of-week cannot both be specified — one must be ?. So cron(0 6 * * MON-FRI *) is INVALID; use cron(0 6 ? * MON-FRI *).
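A quick pre-flight check can catch both quirks before you call the API. The sketch below only validates the overall shape (6 fields, and exactly one of Day-of-month / Day-of-week set to `?`); it does not validate field values. `check_eventbridge_cron` is a hypothetical helper, not part of boto3.

```python
def check_eventbridge_cron(expression):
    """Return a list of problems; an empty list means the basic shape looks fine."""
    if not (expression.startswith('cron(') and expression.endswith(')')):
        return ['expression must look like cron(...)']

    fields = expression[len('cron('):-1].split()
    if len(fields) != 6:
        return [f'expected 6 fields, got {len(fields)}']

    problems = []
    day_of_month, day_of_week = fields[2], fields[4]
    # Exactly one of the two day fields must be the ? placeholder
    if (day_of_month == '?') == (day_of_week == '?'):
        problems.append('exactly one of Day-of-month / Day-of-week must be ?')
    return problems

for expr in ('cron(0 6 ? * MON-FRI *)', 'cron(0 6 * * MON-FRI *)', 'cron(0 6 * * *)'):
    print(expr, '->', check_eventbridge_cron(expr) or 'ok')
```

Running this in CI against your rule definitions is cheaper than discovering a ValidationException at deploy time.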

Creating an Event Pattern Rule

Event pattern rules match JSON structure of incoming events. You specify which fields must have which values. Partial match is sufficient — unspecified fields are ignored.

Python · boto3

import boto3, json

events_client = boto3.client('events')

# ── Rule: fire when any Glue job reaches FAILED or TIMEOUT state ──
glue_failure_pattern = {
    "source": ["aws.glue"],                        # only Glue events
    "detail-type": ["Glue Job State Change"],          # only state change events
    "detail": {
        "state": ["FAILED", "TIMEOUT", "ERROR"]       # only failure states
    }
}

events_client.put_rule(
    Name='glue-job-failure-alert',
    EventPattern=json.dumps(glue_failure_pattern),
    State='ENABLED',
    Description='Fires when any Glue job fails'
)

# ── Rule: fire only for a specific Glue job name ──
specific_job_pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {
        "jobName": ["customer-silver-etl"],           # specific job only
        "state": ["SUCCEEDED"]                        # only on success
    }
}

events_client.put_rule(
    Name='customer-etl-success-trigger',
    EventPattern=json.dumps(specific_job_pattern),
    State='ENABLED',
    Description='Triggers Gold job after Customer Silver ETL succeeds'
)

# ── Rule: match your own custom events ──
custom_event_pattern = {
    "source": ["com.mycompany.data-pipeline"],
    "detail-type": ["pipeline.silver.completed"]
}

events_client.put_rule(
    Name='silver-complete-to-gold-trigger',
    EventPattern=json.dumps(custom_event_pattern),
    EventBusName='data-platform-bus',              # custom bus
    State='ENABLED'
)
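To build intuition for "partial match is sufficient", here is a simplified local matcher that you could also use to unit-test patterns. It is a sketch covering only exact-value lists, nested objects, and the `prefix` operator. The real service supports more operators, so treat `matches` as a hypothetical teaching aid rather than a faithful reimplementation.

```python
def _value_matches(rule, actual):
    """One allowed value: either a literal or a {'prefix': ...} operator."""
    if isinstance(rule, dict) and 'prefix' in rule:
        return isinstance(actual, str) and actual.startswith(rule['prefix'])
    return rule == actual

def matches(pattern, event):
    """True if every field named in the pattern is satisfied by the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested object: recurse into the matching part of the event
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif not any(_value_matches(rule, actual) for rule in expected):
            return False
    return True  # fields not mentioned in the pattern are ignored

pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"state": ["FAILED", "TIMEOUT"]},
}
sample_event = {
    "source": "aws.glue",
    "detail-type": "Glue Job State Change",
    "region": "us-east-1",                       # extra field, ignored
    "detail": {"jobName": "customer-silver-etl", "state": "FAILED"},
}
print(matches(pattern, sample_event))
```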

Enabling / Disabling Rules

Python · boto3

# Disable a rule (e.g. maintenance window, non-business days)
events_client.disable_rule(Name='daily-silver-etl-trigger')

# Re-enable it
events_client.enable_rule(Name='daily-silver-etl-trigger')

# Delete rule (must remove targets first)
events_client.remove_targets(Rule='daily-silver-etl-trigger', Ids=['1'])
events_client.delete_rule(Name='daily-silver-etl-trigger')

🎯

put_targets() — Wiring Rules to Actions Targets ▼

What is a Target?

A target is what EventBridge invokes when a rule fires. Each rule can have up to 5 targets. As a data engineer your most common targets are Lambda (to orchestrate), SQS (to buffer), and SNS (to alert).

λ

Lambda

Trigger any logic — start Glue, EMR step, call API. Most flexible.

📨

SQS

Buffer events for reliable processing. Good for high-volume triggers.

📢

SNS

Fan-out alerts — email, Slack, PagerDuty on pipeline failures.

🔄

Step Functions

Start a state machine for complex multi-step pipeline orchestration.

put_targets() — Lambda Target

Python · boto3

import boto3, json

events_client = boto3.client('events')
lambda_client = boto3.client('lambda')

RULE_NAME = 'daily-silver-etl-trigger'
LAMBDA_ARN = 'arn:aws:lambda:us-east-1:123456789012:function:trigger-glue-job'

# ── Add Lambda as target ──
response = events_client.put_targets(
    Rule=RULE_NAME,
    Targets=[
        {
            'Id': '1',                              # unique ID within this rule (string)
            'Arn': LAMBDA_ARN,                      # target ARN
            'Input': json.dumps({                  # static JSON passed to Lambda as event
                'pipeline': 'customer-silver-etl',
                'trigger_source': 'eventbridge-schedule'
            })
        }
    ]
)
if response['FailedEntryCount'] > 0:
    print("Target registration failed:", response['FailedEntries'])

# ── Grant EventBridge permission to invoke Lambda ──
# (EventBridge needs lambda:InvokeFunction permission on the function)
try:
    lambda_client.add_permission(
        FunctionName=LAMBDA_ARN,
        StatementId='eventbridge-invoke-permission',
        Action='lambda:InvokeFunction',
        Principal='events.amazonaws.com',
        SourceArn=f'arn:aws:events:us-east-1:123456789012:rule/{RULE_NAME}'
    )
except lambda_client.exceptions.ResourceConflictException:
    pass  # permission already exists, that's fine

⚠️ Don't Forget the Permission

Just adding a Lambda as a target is NOT enough. You must also call lambda.add_permission() to grant EventBridge the right to invoke it. Otherwise the rule fires silently and nothing happens — a very common mistake.

put_targets() — SQS and SNS Targets

Python · boto3

import boto3, json

events_client = boto3.client('events')

SQS_ARN = 'arn:aws:sqs:us-east-1:123456789012:pipeline-trigger-queue'
SNS_ARN = 'arn:aws:sns:us-east-1:123456789012:pipeline-alerts'

# ── Route Glue failure events to BOTH SNS (alert) and SQS (retry queue) ──
events_client.put_targets(
    Rule='glue-job-failure-alert',
    Targets=[
        {
            'Id': 'alert-ops-team',
            'Arn': SNS_ARN,                          # SNS for email/Slack alert
            'InputTransformer': {                    # reshape event before sending
                'InputPathsMap': {
                    'job': '$.detail.jobName',
                    'state': '$.detail.state'
                },
                'InputTemplate': '"ALERT: Glue job <job> entered state <state>"'
            }
        },
        {
            'Id': 'queue-for-retry',
            'Arn': SQS_ARN,                          # SQS for retry handling
        }
    ]
)
# Note: EventBridge needs sqs:SendMessage on the queue and sns:Publish on the topic.
# Grant these in the queue policy / topic access policy (resource-based policies);
# Lambda-style add_permission() calls do not apply here.
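The queue policy mentioned in the note can be built as a plain dict and applied with `sqs.set_queue_attributes()`. This is a minimal sketch: `build_queue_policy` is a hypothetical helper, and the ARNs are the placeholder values used elsewhere in this section.

```python
import json

def build_queue_policy(queue_arn, rule_arn):
    """Resource policy that lets one EventBridge rule send messages to one queue."""
    return {
        'Version': '2012-10-17',
        'Statement': [{
            'Sid': 'AllowEventBridgeSendMessage',
            'Effect': 'Allow',
            'Principal': {'Service': 'events.amazonaws.com'},
            'Action': 'sqs:SendMessage',
            'Resource': queue_arn,
            # Restrict the grant to events coming from this specific rule
            'Condition': {'ArnEquals': {'aws:SourceArn': rule_arn}},
        }],
    }

policy = build_queue_policy(
    'arn:aws:sqs:us-east-1:123456789012:pipeline-trigger-queue',
    'arn:aws:events:us-east-1:123456789012:rule/glue-job-failure-alert',
)
policy_json = json.dumps(policy)
print(policy_json)
```

To apply it, pass the JSON string as the `Policy` attribute: `sqs_client.set_queue_attributes(QueueUrl=queue_url, Attributes={'Policy': policy_json})`. Setting this attribute overwrites any existing queue policy, so merge statements first if the queue already has one.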

💡 InputTransformer

InputTransformer lets you extract specific fields from the event and shape them before passing to the target. Use $.detail.fieldName (JSONPath) to pick fields. Wrap the template value in escaped quotes for string output.
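To see what the target actually receives, you can emulate the substitution locally. The sketch below handles only simple dotted paths such as `$.detail.jobName` and plain `<name>` placeholders. `resolve_path` and `render` are hypothetical helpers, and the real transformer has additional rules this does not model.

```python
def resolve_path(event, path):
    """Resolve a simple dotted JSONPath such as $.detail.jobName."""
    node = event
    for part in path.lstrip('$').strip('.').split('.'):
        node = node[part]
    return node

def render(input_paths_map, input_template, event):
    """Replace each <name> placeholder with the value found at its path."""
    text = input_template
    for name, path in input_paths_map.items():
        text = text.replace(f'<{name}>', str(resolve_path(event, path)))
    return text

sample_event = {
    'source': 'aws.glue',
    'detail': {'jobName': 'customer-silver-etl', 'state': 'FAILED'},
}
rendered = render(
    {'job': '$.detail.jobName', 'state': '$.detail.state'},
    '"ALERT: Glue job <job> entered state <state>"',
    sample_event,
)
print(rendered)
```

The surrounding quotes in the template survive into the output, which is why the rendered message is a valid JSON string.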

🔍

list_rules() · list_targets_by_rule() · delete_rule() Management ▼

Listing & Inspecting Rules

Python · boto3

import boto3

events_client = boto3.client('events')

# ── List all rules (paginated) ──
paginator = events_client.get_paginator('list_rules')
for page in paginator.paginate(EventBusName='default'):
    for rule in page['Rules']:
        print(f"{rule['Name']:40} {rule['State']:10} {rule.get('ScheduleExpression', rule.get('EventPattern', ''))[:60]}")

# ── List all rules with a name prefix ──
for page in paginator.paginate(NamePrefix='daily-', EventBusName='default'):
    for rule in page['Rules']:
        print(rule['Name'], rule['State'])

# ── List targets attached to a specific rule ──
response = events_client.list_targets_by_rule(Rule='daily-silver-etl-trigger')
for target in response['Targets']:
    print(f"Target ID: {target['Id']}  ARN: {target['Arn']}")

# ── Get full rule details ──
rule = events_client.describe_rule(Name='daily-silver-etl-trigger')
print("Schedule:", rule.get('ScheduleExpression'))
print("State:",    rule['State'])
print("ARN:",      rule['Arn'])

Safely Deleting a Rule

You cannot delete a rule that still has targets. Always remove targets first.

Python · boto3

import boto3

events_client = boto3.client('events')

def delete_rule_safely(rule_name, bus_name='default'):
    """Remove all targets then delete the rule."""
    # Step 1: get all target IDs
    response = events_client.list_targets_by_rule(Rule=rule_name, EventBusName=bus_name)
    target_ids = [t['Id'] for t in response['Targets']]

    # Step 2: remove targets
    if target_ids:
        events_client.remove_targets(
            Rule=rule_name,
            EventBusName=bus_name,
            Ids=target_ids
        )
        print(f"Removed {len(target_ids)} targets")

    # Step 3: delete rule
    events_client.delete_rule(Name=rule_name, EventBusName=bus_name)
    print(f"Deleted rule: {rule_name}")

delete_rule_safely('daily-silver-etl-trigger')

🏗️

Event-Driven Pipeline Trigger Patterns Production Pattern ▼

Pattern A — Schedule → Lambda → Glue

The most common DE pattern: a cron rule fires a Lambda that starts a Glue job with dynamic arguments.

EventBridge cron rule → Lambda (trigger-glue-job) → glue.start_job_run(JobName='silver-etl', Arguments={...}) → Glue job runs → on SUCCEEDED: publishes custom event to EventBridge → next rule fires Gold aggregation

Python · Lambda handler (trigger-glue-job)

import boto3, json
from datetime import datetime, timezone

glue_client = boto3.client('glue')
events_client = boto3.client('events')

def lambda_handler(event, context):
    """EventBridge schedule → Lambda → Glue start."""
    pipeline = event.get('pipeline', 'customer-silver-etl')
    run_date = datetime.now(timezone.utc).strftime('%Y-%m-%d')

    try:
        response = glue_client.start_job_run(
            JobName=pipeline,
            Arguments={
                '--run_date': run_date,
                '--trigger_source': 'eventbridge-schedule'
            }
        )
        run_id = response['JobRunId']
        print(f"Started {pipeline} run: {run_id}")
        return {'statusCode': 200, 'jobRunId': run_id}

    except Exception as e:
        # Publish failure event so another rule can alert ops
        events_client.put_events(Entries=[{
            'Source': 'com.co.pipeline-launcher',
            'DetailType': 'pipeline.launch.failed',
            'Detail': json.dumps({'pipeline': pipeline, 'error': str(e)})
        }])
        raise

Pattern B — S3 Event → EventBridge → Lambda

S3 sends object-created events to EventBridge (if enabled on the bucket). A rule filters for the right prefix and fires a Lambda to process the file.

Python · boto3 — Rule setup

import boto3, json

events_client = boto3.client('events')

# S3 sends events like: source=aws.s3, detail-type="Object Created"
# detail.bucket.name = your bucket, detail.object.key = the S3 key
s3_event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {
            "name": ["my-raw-data-lake"]               # specific bucket
        },
        "object": {
            "key": [{"prefix": "raw/customers/"}]    # specific prefix
        }
    }
}

events_client.put_rule(
    Name='raw-customer-file-arrived',
    EventPattern=json.dumps(s3_event_pattern),
    State='ENABLED',
    Description='Fires when a new file lands in raw/customers/'
)
# Then put_targets() → Lambda that reads the file and triggers Glue

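# Pre-req: enable EventBridge notifications on the S3 bucket.
# Caution: this call replaces the bucket's entire notification configuration,
# so include any existing SQS/SNS/Lambda notifications in the same request.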
s3_client = boto3.client('s3')
s3_client.put_bucket_notification_configuration(
    Bucket='my-raw-data-lake',
    NotificationConfiguration={
        'EventBridgeConfiguration': {}  # empty dict = enable all events to EventBridge
    }
)
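On the Lambda side, the handler reads the bucket and key from the `detail` object that the rule matched on. This is a minimal sketch of that parsing step, assuming the event carries `detail.bucket.name` and `detail.object.key` as described in the pattern comments above. `parse_s3_object_created` and the sample event are hypothetical.

```python
def parse_s3_object_created(event):
    """Extract the object location from an S3 'Object Created' EventBridge event."""
    detail = event['detail']
    bucket = detail['bucket']['name']
    key = detail['object']['key']
    return {
        'bucket': bucket,
        'key': key,
        's3_uri': f's3://{bucket}/{key}',
        'size': detail['object'].get('size'),   # may be absent; None if so
    }

def lambda_handler(event, context):
    """Hypothetical handler: parse the event, then hand off to the ETL launcher."""
    obj = parse_s3_object_created(event)
    print(f"New file: {obj['s3_uri']}")
    return obj

sample = {
    'source': 'aws.s3',
    'detail-type': 'Object Created',
    'detail': {
        'bucket': {'name': 'my-raw-data-lake'},
        'object': {'key': 'raw/customers/2024-01-15.csv', 'size': 1024},
    },
}
parsed = lambda_handler(sample, None)
```

From here the handler would typically pass `s3_uri` as a job argument to `glue.start_job_run()`, as in Pattern A.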

Pattern C — Chain of Pipeline Stages via Custom Events

Each stage publishes a completion event → the next stage is triggered by a rule. Fully decoupled, fully event-driven.

CHAINED EVENT-DRIVEN PIPELINE
[Bronze ETL completes] → put_events: "bronze.completed" → EventBridge rule: match "bronze.completed" → Lambda: trigger Silver Glue job
[Silver ETL completes] → put_events: "silver.completed" → EventBridge rule: match "silver.completed" → Lambda: trigger Gold aggregation Glue job
[Gold ETL completes] → put_events: "gold.completed" → EventBridge rule: match "gold.completed" → SNS: notify BI team that data is ready
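The chain stays consistent if every stage builds its completion event and its downstream rule pattern from the same naming convention. A minimal sketch follows, assuming the `pipeline.<stage>.completed` detail-type and the `com.mycompany.data-pipeline` source used in the custom-event rule earlier. `completion_entry` and `next_stage_pattern` are hypothetical helpers.

```python
import json
from datetime import datetime, timezone

SOURCE = 'com.mycompany.data-pipeline'   # assumed source name for all stages

def completion_entry(stage, run_id, bus_name='data-platform-bus'):
    """Build the put_events entry a stage publishes when it finishes."""
    return {
        'Source': SOURCE,
        'DetailType': f'pipeline.{stage}.completed',
        'Detail': json.dumps({
            'stage': stage,
            'run_id': run_id,
            'completed_at': datetime.now(timezone.utc).isoformat(),
        }),
        'EventBusName': bus_name,
    }

def next_stage_pattern(stage):
    """Build the event pattern for the rule that starts the following stage."""
    return {
        'source': [SOURCE],
        'detail-type': [f'pipeline.{stage}.completed'],
    }

entry = completion_entry('silver', 'run_2024_01_15_001')
pattern = next_stage_pattern('silver')
print(entry['DetailType'], json.dumps(pattern))
```

The entry goes into `put_events(Entries=[entry])` at the end of the stage, and the pattern goes into the rule definition as `EventPattern=json.dumps(pattern)`.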

💡 Why This Is Better Than Airflow for Simple Chains

For simple linear pipelines with no complex branching, EventBridge chains are zero-infrastructure — no scheduler process to maintain, no metadata DB, no webserver. Airflow is better when you need retries, backfill, complex dependencies, or visibility into run history. Use the right tool.

📊

EventBridge API Quick Reference Reference ▼

Complete API Cheat Sheet

API Call	What it does	Key Parameters	Returns
`put_events()`	Publish 1–10 custom events to a bus	Entries[]: Source, DetailType, Detail, EventBusName	FailedEntryCount, Entries[EventId or ErrorCode]
`put_rule()`	Create or update a schedule or event-pattern rule	Name, ScheduleExpression OR EventPattern, State, EventBusName	RuleArn
`put_targets()`	Attach targets (Lambda/SQS/SNS) to a rule	Rule, Targets[]: Id, Arn, Input/InputTransformer	FailedEntryCount
`list_rules()`	List all rules (paginated)	NamePrefix, EventBusName, NextToken	Rules[], NextToken
`describe_rule()`	Get full details of one rule	Name, EventBusName	Rule object with all fields
`list_targets_by_rule()`	Get all targets for a rule	Rule, EventBusName	Targets[]
`enable_rule()`	Enable a disabled rule	Name, EventBusName	—
`disable_rule()`	Pause a rule without deleting it	Name, EventBusName	—
`remove_targets()`	Detach targets from a rule (required before delete)	Rule, Ids[], EventBusName	FailedEntryCount
`delete_rule()`	Delete a rule (must have no targets)	Name, EventBusName	—
`create_event_bus()`	Create a custom event bus	Name	EventBusArn
`list_event_buses()`	List all event buses	NamePrefix	EventBuses[]
`delete_event_bus()`	Delete a custom event bus	Name	—

☁️ Common Errors

ResourceNotFoundException: the rule or bus name does not exist. ValidationException: invalid request parameters, such as a malformed schedule expression (remember the 6-field cron format). InvalidEventPatternException: the event pattern is not valid. LimitExceededException: more than 300 rules per bus (soft limit; an increase can be requested). ManagedRuleException: you tried to modify a rule created by an AWS service (not yours to touch).

29.30.17 · Boto3 Deep Dive

RDS / Redshift Data APIs

The RDS Data API and Redshift Data API let you run SQL against Aurora Serverless / Redshift over HTTPS using boto3 — no JDBC driver, no persistent connection, no VPC networking required from your client. This is the standard way Lambda functions and orchestration code run SQL without managing database connections.

🔌

Why a "Data API" instead of a normal DB connection? Foundation ▼

The Problem with Traditional Connections

Normally, to run SQL from Python you open a TCP connection with a driver like psycopg2 — this needs the database inside your VPC (or a public endpoint), a network path, a connection pool, and credentials passed at connect time. For short-lived compute like Lambda, opening/closing thousands of DB connections causes connection storms and exhausts the database's max-connections limit.

🧠 Analogy

A normal DB connection is like calling someone on a dedicated phone line you have to dial, hold open, and hang up every time. The Data API is like sending a text message through a receptionist (the HTTPS endpoint) — you send the SQL, the receptionist runs it on the database for you and texts back the result. No phone line to manage.

How the Data API Works

You call execute_statement() over the standard AWS API (HTTPS + IAM auth) — boto3 handles this like any other AWS service call. AWS internally manages a connection pool to the database. No driver, no VPC access needed from your Lambda/script, and authentication is via IAM or Secrets Manager instead of hardcoded DB passwords.

TRADITIONAL JDBC:           Lambda ──TCP:5432──▶ Aurora  (needs VPC, conn pooling, psycopg2 driver)
DATA API (RDS / Redshift):  Lambda ──HTTPS──▶ AWS Data API ──▶ Aurora/Redshift  (no VPC needed, IAM auth, no driver)

RDS Data API vs Redshift Data API

Aspect	RDS Data API	Redshift Data API
Applies to	Aurora Serverless (PostgreSQL / MySQL compatible)	Redshift provisioned clusters & Redshift Serverless
Auth	Secrets Manager ARN (`secretArn`)	Secrets Manager ARN OR temporary IAM credentials (`DbUser`)
Transactions	`begin_transaction()` / `commit_transaction()`	Not exposed the same way — each `execute_statement()` is its own unit
Async by default?	No — synchronous response	Yes — must poll `describe_statement()`
Typical DE use	Metadata/audit DB writes from Lambda	Running transforms/UNLOADs on the warehouse from orchestration code

💡 Key Point

The Redshift Data API is asynchronous: execute_statement() returns immediately with an Id, and you must poll until the status is FINISHED. The RDS Data API is synchronous: the result comes back in the same call. A call that does not finish within the API's timeout (45 seconds) is terminated unless you set continueAfterTimeout=True, in which case the statement keeps running but its result is not returned to you.

🐬

RDS Data API (Aurora Serverless) Core API ▼

execute_statement()

Runs a single SQL statement against an Aurora Serverless database. Requires the cluster ARN, the secret ARN holding credentials, the database name, and the SQL string. Use named parameters (:param) instead of string formatting to avoid SQL injection.

Python · boto3

import boto3

rds_data = boto3.client('rds-data', region_name='us-east-1')

CLUSTER_ARN = 'arn:aws:rds:us-east-1:123456789012:cluster:my-aurora-cluster'
SECRET_ARN  = 'arn:aws:secretsmanager:us-east-1:123456789012:secret:rds-creds-AbCdEf'
DATABASE    = 'pipeline_metadata'

# ── Run a parameterized INSERT (audit log row) ──
response = rds_data.execute_statement(
    resourceArn=CLUSTER_ARN,
    secretArn=SECRET_ARN,
    database=DATABASE,
    sql="""
        INSERT INTO pipeline_audit (run_id, pipeline_name, status, rows_processed)
        VALUES (:run_id, :pipeline_name, :status, :rows_processed)
    """,
    parameters=[
        {'name': 'run_id', 'value': {'stringValue': 'run_2024_01_15_001'}},
        {'name': 'pipeline_name', 'value': {'stringValue': 'customer_silver_etl'}},
        {'name': 'status', 'value': {'stringValue': 'SUCCESS'}},
        {'name': 'rows_processed', 'value': {'longValue': 1450000}},
    ]
)

print("Rows affected:", response['numberOfRecordsUpdated'])
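Writing the typed parameter list by hand gets verbose quickly. The sketch below converts a plain dict into that structure, assuming you only need the scalar types shown in this section (string, integer, float, boolean, NULL). `to_data_api_params` is a hypothetical helper, not part of boto3.

```python
def to_data_api_params(values):
    """Convert {'name': python_value} into the RDS Data API parameters list."""
    params = []
    for name, value in values.items():
        if value is None:
            typed = {'isNull': True}
        elif isinstance(value, bool):        # check bool first: bool is a subclass of int
            typed = {'booleanValue': value}
        elif isinstance(value, int):
            typed = {'longValue': value}
        elif isinstance(value, float):
            typed = {'doubleValue': value}
        else:
            typed = {'stringValue': str(value)}
        params.append({'name': name, 'value': typed})
    return params

params = to_data_api_params({
    'run_id': 'run_2024_01_15_001',
    'rows_processed': 1450000,
    'is_backfill': False,
    'error_message': None,
})
print(params)
```

The result can be passed directly as `parameters=params` to execute_statement(), and a list of such results works as `parameterSets` for batch_execute_statement().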

Reading results — SELECT with execute_statement()

For SELECT queries, set includeResultMetadata=True to get column names back, and parse response['records'] — each record is a list of typed value dicts (stringValue, longValue, doubleValue, booleanValue, isNull).

Python · boto3

# ── SELECT and parse results into list of dicts ──
response = rds_data.execute_statement(
    resourceArn=CLUSTER_ARN,
    secretArn=SECRET_ARN,
    database=DATABASE,
    sql="SELECT run_id, pipeline_name, status, rows_processed FROM pipeline_audit ORDER BY run_id DESC LIMIT 5",
    includeResultMetadata=True
)

# Build column name list from metadata
columns = [col['name'] for col in response['columnMetadata']]

# Helper to extract the actual value regardless of type key
def extract_value(field):
    if field.get('isNull'):
        return None
    for key in ('stringValue', 'longValue', 'doubleValue', 'booleanValue'):
        if key in field:
            return field[key]
    return None

rows = []
for record in response['records']:
    row = {columns[i]: extract_value(field) for i, field in enumerate(record)}
    rows.append(row)

print(rows)
# [{'run_id': 'run_2024_01_15_001', 'pipeline_name': 'customer_silver_etl', ...}]

batch_execute_statement()

Runs the same SQL statement multiple times with different parameter sets in one call — ideal for bulk inserts (e.g., writing many audit rows or control-table entries at once) without looping individual execute_statement() calls.

Python · boto3

# ── Bulk update the status of multiple audit rows in one call ──
param_sets = [
    [
        {'name': 'run_id', 'value': {'stringValue': 'run_001'}},
        {'name': 'status', 'value': {'stringValue': 'SUCCESS'}},
    ],
    [
        {'name': 'run_id', 'value': {'stringValue': 'run_002'}},
        {'name': 'status', 'value': {'stringValue': 'FAILED'}},
    ],
]

response = rds_data.batch_execute_statement(
    resourceArn=CLUSTER_ARN,
    secretArn=SECRET_ARN,
    database=DATABASE,
    sql="UPDATE pipeline_audit SET status = :status WHERE run_id = :run_id",
    parameterSets=param_sets
)
print(len(response['updateResults']), "statements executed")

📦 Real Use Case

A "Metadata-Driven Multi-Pipeline" run finishes processing 20 source tables. Instead of 20 separate execute_statement() calls to update the watermark table, one batch_execute_statement() updates all 20 rows in a single round trip.

begin_transaction() / commit_transaction()

For multi-statement atomic operations (e.g., update a watermark table AND write an audit row — both must succeed or both must fail), wrap calls in a transaction. Pass the returned transactionId into each execute_statement() call, then commit (or roll back) at the end.

Python · boto3

# ── Atomic: update watermark + insert audit row together ──
tx = rds_data.begin_transaction(
    resourceArn=CLUSTER_ARN, secretArn=SECRET_ARN, database=DATABASE
)
tx_id = tx['transactionId']

try:
    rds_data.execute_statement(
        resourceArn=CLUSTER_ARN, secretArn=SECRET_ARN, database=DATABASE,
        transactionId=tx_id,
        sql="UPDATE watermark_table SET last_value = :wm WHERE pipeline_id = :pid",
        parameters=[
            {'name': 'wm', 'value': {'stringValue': '2024-01-15T06:00:00Z'}},
            {'name': 'pid', 'value': {'longValue': 42}},
        ]
    )
    rds_data.execute_statement(
        resourceArn=CLUSTER_ARN, secretArn=SECRET_ARN, database=DATABASE,
        transactionId=tx_id,
        sql="INSERT INTO pipeline_audit (run_id, status) VALUES (:rid, 'SUCCESS')",
        parameters=[{'name': 'rid', 'value': {'stringValue': 'run_2024_01_15_001'}}]
    )
    rds_data.commit_transaction(resourceArn=CLUSTER_ARN, secretArn=SECRET_ARN, transactionId=tx_id)
except Exception as e:
    rds_data.rollback_transaction(resourceArn=CLUSTER_ARN, secretArn=SECRET_ARN, transactionId=tx_id)
    raise

💡 Key Point

If you never explicitly commit or roll back, the transaction stays open until it times out (after three minutes with no call that uses its transactionId) and is rolled back. Always wrap transaction logic in try/except and roll back on failure. This supports the idempotent-pipeline principle from Module 23.6.
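The begin / commit / rollback boilerplate fits naturally into a context manager, so no code path can forget to close the transaction. This is a minimal sketch: `data_api_transaction` is a hypothetical helper, and `RecordingClient` is a stand-in that only records call names so the example runs without AWS. In real code you would pass `boto3.client('rds-data')`.

```python
from contextlib import contextmanager

@contextmanager
def data_api_transaction(client, resource_arn, secret_arn, database):
    """Begin a transaction, yield its id, then commit on success or roll back on error."""
    tx = client.begin_transaction(
        resourceArn=resource_arn, secretArn=secret_arn, database=database
    )
    tx_id = tx['transactionId']
    try:
        yield tx_id
    except Exception:
        client.rollback_transaction(
            resourceArn=resource_arn, secretArn=secret_arn, transactionId=tx_id
        )
        raise
    else:
        client.commit_transaction(
            resourceArn=resource_arn, secretArn=secret_arn, transactionId=tx_id
        )

class RecordingClient:
    """Hypothetical stand-in for the rds-data client; records which calls were made."""
    def __init__(self):
        self.calls = []
    def begin_transaction(self, **kwargs):
        self.calls.append('begin')
        return {'transactionId': 'tx-demo'}
    def commit_transaction(self, **kwargs):
        self.calls.append('commit')
        return {}
    def rollback_transaction(self, **kwargs):
        self.calls.append('rollback')
        return {}

client = RecordingClient()
with data_api_transaction(client, 'arn:cluster', 'arn:secret', 'pipeline_metadata') as tx_id:
    pass  # real code: client.execute_statement(..., transactionId=tx_id)
print(client.calls)
```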

🟥

Redshift Data API Core API ▼

execute_statement() — submit SQL to Redshift

Identify the target with ClusterIdentifier (provisioned) or WorkgroupName (Redshift Serverless), the Database name, and either DbUser (temporary IAM credentials) or SecretArn (Secrets Manager). This call returns immediately with a statement Id — it does not wait for the query to finish.

Python · boto3

import boto3

redshift_data = boto3.client('redshift-data', region_name='us-east-1')

# ── Submit a SQL statement (returns immediately) ──
response = redshift_data.execute_statement(
    ClusterIdentifier='my-redshift-cluster',
    Database='analytics',
    DbUser='etl_service_user',             # IAM-based auth (no password)
    Sql="""
        SELECT pipeline_name, COUNT(*) AS run_count, SUM(rows_processed) AS total_rows
        FROM pipeline_audit
        WHERE run_date = CURRENT_DATE
        GROUP BY pipeline_name
    """
)

statement_id = response['Id']
print("Submitted, statement Id:", statement_id)

🧠 Analogy

Submitting a Redshift Data API statement is like dropping a letter in a mailbox — you get a tracking number (Id) immediately, but the letter (your query) is still being processed. You must check the tracking number later to see if it's "delivered."

describe_statement() — poll until FINISHED

Because execution is async, poll describe_statement() with a short delay until Status becomes FINISHED, FAILED, or ABORTED. This is the exact "manual waiter" pattern referenced in 29.30.4 — Redshift has no built-in waiter for this.

Python · boto3

import time

# ── Poll describe_statement until terminal state ──
while True:
    desc = redshift_data.describe_statement(Id=statement_id)
    status = desc['Status']   # SUBMITTED | PICKED | STARTED | FINISHED | FAILED | ABORTED
    print("Current status:", status)

    if status == 'FINISHED':
        break
    elif status in ('FAILED', 'ABORTED'):
        raise Exception(f"Query {status}: {desc.get('Error', 'unknown error')}")

    time.sleep(1)  # short backoff between polls

get_statement_result() — fetch results

Once FINISHED, call get_statement_result() to retrieve Records (paginated via NextToken) and ColumnMetadata (column names/types) — combine them the same way as the RDS Data API result parsing above.

Python · boto3

# ── Fetch and parse results into list of dicts ──
result = redshift_data.get_statement_result(Id=statement_id)

columns = [col['name'] for col in result['ColumnMetadata']]

def extract_value(field):
    if field.get('isNull'):
        return None
    for key in ('stringValue', 'longValue', 'doubleValue', 'booleanValue'):
        if key in field:
            return field[key]
    return None

rows = []
for record in result['Records']:
    rows.append({columns[i]: extract_value(f) for i, f in enumerate(record)})

# ── Handle pagination for large result sets ──
while 'NextToken' in result and result['NextToken']:
    result = redshift_data.get_statement_result(Id=statement_id, NextToken=result['NextToken'])
    for record in result['Records']:
        rows.append({columns[i]: extract_value(f) for i, f in enumerate(record)})

import pandas as pd
df = pd.DataFrame(rows)
print(df)

batch_execute_statement() — multiple SQL statements

Runs several different SQL statements sequentially as one logical unit (e.g., TRUNCATE then COPY then ANALYZE). Each sub-statement gets its own Id, queryable individually via describe_statement() with Id in the form parentId:index.

Python · boto3

# ── Truncate + COPY + ANALYZE as one batch ──
response = redshift_data.batch_execute_statement(
    ClusterIdentifier='my-redshift-cluster',
    Database='analytics',
    DbUser='etl_service_user',
    Sqls=[
        "TRUNCATE TABLE staging.customers",
        """
            COPY staging.customers
            FROM 's3://my-lake/silver/customers/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
            FORMAT AS PARQUET
        """,
        "ANALYZE staging.customers",
    ]
)
batch_id = response['Id']
print("Batch Id:", batch_id)
# Poll describe_statement(Id=batch_id) — when FINISHED, all sub-statements ran
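When a batch fails, the describe_statement() response for the parent Id carries a SubStatements list, which tells you which step broke. This sketch summarizes such a response. `summarize_batch` is a hypothetical helper, and the sample `desc` dict is made-up data shaped like that response.

```python
def summarize_batch(desc):
    """Summarize a describe_statement() response for a batch statement."""
    subs = desc.get('SubStatements', [])
    failed = [s for s in subs if s.get('Status') in ('FAILED', 'ABORTED')]
    return {
        'status': desc['Status'],
        'total': len(subs),
        'failed': [
            {
                'id': s['Id'],                           # parentId:index form
                'error': s.get('Error'),
                'sql': s.get('QueryString', '')[:60],
            }
            for s in failed
        ],
    }

# Made-up response: TRUNCATE succeeded, COPY failed, ANALYZE never ran
desc = {
    'Id': 'abc-123',
    'Status': 'FAILED',
    'SubStatements': [
        {'Id': 'abc-123:1', 'Status': 'FINISHED', 'QueryString': 'TRUNCATE TABLE staging.customers'},
        {'Id': 'abc-123:2', 'Status': 'FAILED', 'QueryString': 'COPY staging.customers ...',
         'Error': 'S3 access denied'},
        {'Id': 'abc-123:3', 'Status': 'ABORTED', 'QueryString': 'ANALYZE staging.customers'},
    ],
}
summary = summarize_batch(desc)
print(summary)
```

Logging this summary alongside the batch Id usually gives ops enough context to fix the COPY role or path without opening the Redshift console.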

📦 Real Use Case

This is the classic Module 14 → Redshift load pattern from boto3: Spark writes Silver Parquet to S3, then this single batch_execute_statement() call truncates the staging table, loads it via COPY, and refreshes statistics with ANALYZE — all orchestrated from Airflow with no JDBC connection.

list_statements() — statement history

Returns past statement executions for auditing/debugging — filter by ClusterIdentifier, Database, Status, or a StatementName you assigned. Useful for finding "what ran in the last hour and did it fail?" without a custom audit table.

Python · boto3

# ── Find recent failed statements for troubleshooting ──
response = redshift_data.list_statements(
    ClusterIdentifier='my-redshift-cluster',
    Status='FAILED',
    MaxResults=20
)

for stmt in response['Statements']:
    print(stmt['Id'], stmt.get('QueryString', '')[:60], stmt['Status'], stmt.get('Error'))

🔁

Full Redshift Data API Pattern (End-to-End) Pattern ▼

execute → poll → fetch, wrapped as a reusable function

Production code never repeats the submit/poll/fetch boilerplate inline — wrap it in a helper with exponential backoff (per 29.30.2) so any pipeline step can run SQL and get a DataFrame back in one call.

Python · boto3

import boto3, time
import pandas as pd

redshift_data = boto3.client('redshift-data', region_name='us-east-1')

def run_redshift_sql(sql, cluster_id, database, db_user, max_wait_seconds=300):
    """Execute SQL on Redshift Data API and return a pandas DataFrame."""
    # 1. Submit
    resp = redshift_data.execute_statement(
        ClusterIdentifier=cluster_id, Database=database, DbUser=db_user, Sql=sql
    )
    stmt_id = resp['Id']

    # 2. Poll with exponential backoff
    waited, delay = 0, 1
    while True:
        desc = redshift_data.describe_statement(Id=stmt_id)
        status = desc['Status']
        if status == 'FINISHED':
            break
        if status in ('FAILED', 'ABORTED'):
            raise RuntimeError(f"Redshift query {status}: {desc.get('Error')}")
        if waited >= max_wait_seconds:
            raise TimeoutError(f"Statement {stmt_id} did not finish in {max_wait_seconds}s")
        time.sleep(delay)
        waited += delay
        delay = min(delay * 2, 10)  # cap backoff at 10s

    # 3. Fetch (handles non-SELECT statements with no result set)
    if not desc.get('HasResultSet', False):
        return pd.DataFrame()  # e.g. COPY / TRUNCATE return nothing

    result = redshift_data.get_statement_result(Id=stmt_id)
    columns = [c['name'] for c in result['ColumnMetadata']]

    def _val(f):
        if f.get('isNull'): return None
        for k in ('stringValue','longValue','doubleValue','booleanValue'):
            if k in f: return f[k]
        return None

    rows = [{columns[i]: _val(f) for i, f in enumerate(rec)} for rec in result['Records']]
    while 'NextToken' in result and result['NextToken']:
        result = redshift_data.get_statement_result(Id=stmt_id, NextToken=result['NextToken'])
        rows += [{columns[i]: _val(f) for i, f in enumerate(rec)} for rec in result['Records']]

    return pd.DataFrame(rows)

# ── Usage ──
df = run_redshift_sql(
    sql="SELECT pipeline_name, run_count FROM v_daily_pipeline_summary",
    cluster_id='my-redshift-cluster', database='analytics', db_user='etl_service_user'
)
print(df)

💡 Why This Matters

This is the same poll pattern you already used for Glue (get_job_run), Athena (get_query_execution), and EMR steps in 29.30.6–29.30.8. boto3's async AWS APIs almost always follow submit → poll → fetch, so once you've internalized this pattern you can apply it everywhere.

📊

RDS / Redshift Data API Quick Reference Reference ▼

Complete API Cheat Sheet

API Call	Service	What it does	Key Parameters
`execute_statement()`	rds-data	Run one SQL statement (sync result)	resourceArn, secretArn, database, sql, parameters[]
`batch_execute_statement()`	rds-data	Run same SQL with multiple parameter sets	resourceArn, secretArn, database, sql, parameterSets[]
`begin_transaction()`	rds-data	Start a multi-statement transaction	resourceArn, secretArn, database
`commit_transaction()` / `rollback_transaction()`	rds-data	End the transaction	resourceArn, secretArn, transactionId
`execute_statement()`	redshift-data	Submit SQL asynchronously, returns Id	ClusterIdentifier/WorkgroupName, Database, DbUser/SecretArn, Sql
`describe_statement()`	redshift-data	Poll status of a submitted statement	Id → Status (SUBMITTED/PICKED/STARTED/FINISHED/FAILED/ABORTED)
`get_statement_result()`	redshift-data	Fetch result rows (paginated)	Id, NextToken → Records, ColumnMetadata
`batch_execute_statement()`	redshift-data	Run multiple different SQL statements as one unit	ClusterIdentifier, Database, DbUser, Sqls[]
`list_statements()`	redshift-data	List statement execution history	ClusterIdentifier, Status, StatementName
`cancel_statement()`	redshift-data	Cancel a running statement	Id

☁️ Common Errors

ValidationException — malformed SQL or wrong parameter type (e.g., sending stringValue for an integer column). BadRequestException (Redshift) — invalid ClusterIdentifier/DbUser combination, or the cluster is paused. StatementTimeoutException — query exceeded the Data API's max runtime (currently capped — long-running transforms should go through Spark/Glue, not the Data API). AccessDeniedException — the IAM role lacks redshift-data:* or rds-data:* permissions, or GetSecretValue on the linked secret.

💡 When NOT to Use the Data API

The Data API has payload size limits (~100 KB result per page, ~1 MB total for RDS) and a max execution time. For large transforms or bulk loads, use COPY/UNLOAD (Module 29.10) or run the heavy work in Spark/Glue — use the Data API for control-plane SQL: audit writes, watermark updates, small lookups, and triggering COPY/MERGE statements.
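As an illustration of the "control-plane SQL" use case above, the sketch below submits a COPY statement through the Redshift Data API and returns the statement Id for later polling. The function name `submit_copy` and all argument values are hypothetical; the client is passed in (for example `boto3.client('redshift-data')`), and the SQL is built by plain string formatting, so it assumes the table name, S3 URI, and role ARN come from trusted configuration, not user input.

```python
def submit_copy(client, *, table, s3_uri, iam_role_arn,
                cluster_id, database, db_user):
    """Submit a Parquet COPY via the Redshift Data API; return the statement Id."""
    # Assumption: all values come from trusted pipeline config (no escaping is done)
    sql = (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' FORMAT AS PARQUET;"
    )
    response = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )
    return response['Id']  # poll this Id with describe_statement()


# Hypothetical usage:
# import boto3
# stmt_id = submit_copy(
#     boto3.client('redshift-data'),
#     table='gold.orders', s3_uri='s3://my-lake/gold/orders/',
#     iam_role_arn='arn:aws:iam::123456789012:role/RedshiftCopyRole',
#     cluster_id='my-redshift-cluster', database='analytics',
#     db_user='etl_service_user',
# )
```

The heavy lifting (reading Parquet from S3) happens inside Redshift; the Data API call only carries the short SQL text and returns immediately.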

MODULE 29 — REAL WORLD PATTERNS

Pipeline Patterns P1 – P8

These are the 8 production-grade end-to-end pipeline architectures that every senior Data Engineer must know. Each pattern combines multiple AWS services with boto3, error handling, audit logging, and observability. Study these as complete blueprints.

🪣

Pattern 1 — File Arrival Batch Pipeline

S3 → SQS → Lambda → Glue → DynamoDB → SNS

Architecture Overview

A file lands in S3. That event triggers a chain reaction: SQS buffers the notification, Lambda validates and kicks off a Glue ETL job, Glue transforms and writes Parquet, a Crawler updates the Glue Catalog, Lambda writes an audit record to DynamoDB, and SNS sends a success notification. This is the most common batch ingestion pattern in AWS data platforms.

S3 file arrives
    │
    ▼
S3 Event Notification ──→ SQS Queue (buffer + decoupling)
    │
    ▼
Lambda (polls SQS)
  ├── head_object()   → validate file exists + get size
  ├── start_job_run() → trigger Glue ETL Job
  │     └── Glue reads S3 raw → transforms → writes Parquet to S3 silver/
  ├── start_crawler() → update Glue Catalog partitions
  ├── put_item()      → write audit record to DynamoDB
  └── publish()       → SNS success / failure notification

Step-by-Step Boto3 Code

python

# ── Pattern 1: File Arrival Batch Pipeline ──────────────────────────────
# Lambda handler — triggered by SQS which receives S3 event notifications

import boto3, json, time
from datetime import datetime, timezone
from urllib.parse import unquote_plus
from botocore.exceptions import ClientError

s3      = boto3.client('s3')
glue    = boto3.client('glue')
dynamo  = boto3.resource('dynamodb')
sns     = boto3.client('sns')

GLUE_JOB_NAME  = 'raw-to-silver-transform'
GLUE_CRAWLER   = 'silver-catalog-crawler'
AUDIT_TABLE    = 'pipeline-audit'
SNS_TOPIC_ARN  = 'arn:aws:sns:us-east-1:123456789:pipeline-alerts'

def lambda_handler(event, context):
    # ── Step 1: Parse S3 key from SQS message ──────────────────────────
    for record in event['Records']:
        body       = json.loads(record['body'])
        s3_event   = body['Records'][0]
        bucket     = s3_event['s3']['bucket']['name']
        # Keys in S3 event notifications are URL-encoded (a space arrives as '+'),
        # so decode before calling head_object()
        key        = unquote_plus(s3_event['s3']['object']['key'])
        run_id     = context.aws_request_id

        try:
            # ── Step 2: Validate file exists and is non-zero ────────────────
            head = s3.head_object(Bucket=bucket, Key=key)
            file_size = head['ContentLength']
            if file_size == 0:
                raise ValueError(f"Empty file: s3://{bucket}/{key}")
            print(f"✅ File validated: {key} ({file_size} bytes)")

            # ── Step 3: Start Glue ETL Job ──────────────────────────────────
            glue_response = glue.start_job_run(
                JobName=GLUE_JOB_NAME,
                Arguments={
                    '--source_bucket': bucket,
                    '--source_key':    key,
                    '--run_id':        run_id
                }
            )
            job_run_id = glue_response['JobRunId']
            print(f"🚀 Glue job started: {job_run_id}")

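            # ── Step 4: Poll Glue job until terminal state ──────────────────
            # A Lambda invocation ends after 15 minutes at most, so in-handler
            # polling only suits short Glue jobs
            while True:
                run_detail = glue.get_job_run(JobName=GLUE_JOB_NAME, RunId=job_run_id)
                state = run_detail['JobRun']['JobRunState']
                # TIMEOUT is also terminal; without it a timed-out job loops forever
                if state in ('SUCCEEDED', 'FAILED', 'STOPPED', 'ERROR', 'TIMEOUT'):
                    break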
                time.sleep(15)

            # ── Step 5: Start Glue Crawler to update catalog ─────────────── 
            if state == 'SUCCEEDED':
                glue.start_crawler(Name=GLUE_CRAWLER)
                # poll crawler until READY
                while True:
                    crawler_state = glue.get_crawler(Name=GLUE_CRAWLER)['Crawler']['State']
                    if crawler_state == 'READY':
                        break
                    time.sleep(10)

            # ── Step 6: Write audit record to DynamoDB ───────────────────── 
            table = dynamo.Table(AUDIT_TABLE)
            table.put_item(Item={
                'run_id':         run_id,
                'job_name':       GLUE_JOB_NAME,
                'source_key':     key,
                'status':         state,
                'glue_run_id':    job_run_id,
                'file_size_bytes': file_size,
                'timestamp':      datetime.now(timezone.utc).isoformat()
            })

            # ── Step 7: Publish SNS notification ─────────────────────────── 
            msg = f"Pipeline {'SUCCESS' if state == 'SUCCEEDED' else 'FAILURE'}\nFile: s3://{bucket}/{key}\nGlue run: {job_run_id}\nStatus: {state}"
            sns.publish(TopicArn=SNS_TOPIC_ARN, Subject=f"Pipeline {state}", Message=msg)

        except (ClientError, ValueError) as e:
            # ValueError comes from the empty-file check in Step 2
            print(f"❌ Pipeline error: {e}")
            sns.publish(TopicArn=SNS_TOPIC_ARN, Subject="Pipeline FAILED", Message=str(e))
            raise  # re-raise so SQS retries / routes to DLQ

💡 Key Design Decisions

SQS decouples S3 from Lambda — if Lambda is throttled, messages queue safely. Re-raising the exception lets SQS route to the DLQ after max retries. The Glue job run ID ties the audit record to the exact Glue execution for traceability.
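One consequence of re-raising is that a single bad record fails the whole SQS batch, so every message in it is retried. Lambda's SQS integration can instead retry only the failed messages if the event source mapping has `ReportBatchItemFailures` enabled and the handler returns a `batchItemFailures` list. The sketch below shows that response shape; `handle_batch` and `process_record` are hypothetical names, with `process_record` standing in for Steps 2–7 of the handler above.

```python
def handle_batch(event, process_record):
    """Process each SQS record independently and report only the failures.

    Assumes the event source mapping has ReportBatchItemFailures enabled;
    otherwise Lambda ignores the return value.
    """
    failures = []
    for record in event['Records']:
        try:
            process_record(record)
        except Exception as exc:
            print(f"record {record['messageId']} failed: {exc}")
            # Only this message becomes visible again for retry / DLQ routing
            failures.append({'itemIdentifier': record['messageId']})
    return {'batchItemFailures': failures}


# Hypothetical wiring inside a Lambda module:
# def lambda_handler(event, context):
#     return handle_batch(event, process_one_file)
```

An empty `batchItemFailures` list tells Lambda the whole batch succeeded, so the successfully processed messages are deleted from the queue.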

⏰

Pattern 2 — Daily Scheduled Batch on EMR

EventBridge → Lambda → EMR → Redshift

Architecture Overview

A cron-based EventBridge rule fires Lambda every morning. Lambda reads pipeline config from DynamoDB, spins up an EMR cluster, submits a Spark step that reads S3 and writes to Redshift, polls until complete, then terminates the cluster and publishes metrics + alerts. Cost-efficient because the cluster lives only for the job duration.

EventBridge cron (e.g. "0 3 * * ? *", 3 AM UTC daily)
    │
    ▼
Lambda
  ├── get_item()            → read pipeline config from DynamoDB
  ├── run_job_flow()        → spin up EMR cluster (Spark + Hadoop)
  ├── add_job_flow_steps()  → submit Spark step (S3 → Redshift)
  ├── poll describe_step() with waiter until COMPLETED
  ├── terminate_job_flows() → shut down cluster
  ├── put_metric_data()     → publish rows_processed to CloudWatch
  └── publish()             → SNS success / failure alert

Full Boto3 Code

python

# ── Pattern 2: Daily Scheduled Batch on EMR ─────────────────────────────
import boto3, time, json
from datetime import datetime, timezone
from botocore.exceptions import ClientError

emr    = boto3.client('emr')
dynamo = boto3.resource('dynamodb')
cw     = boto3.client('cloudwatch')
sns    = boto3.client('sns')

SNS_ARN   = 'arn:aws:sns:us-east-1:123456789:emr-alerts'
AUDIT_TBL = 'pipeline-audit'

def lambda_handler(event, context):
    run_id    = context.aws_request_id
    start_ts  = datetime.now(timezone.utc)
    cluster_id = None

    try:
        # ── Step 1: Read pipeline config from DynamoDB ──────────────────
        table  = dynamo.Table('pipeline-config')
        config = table.get_item(Key={'pipeline_id': 'daily-s3-to-redshift'})['Item']
        s3_input  = config['s3_input_path']
        rs_table  = config['redshift_table']
        emr_release = config.get('emr_release', 'emr-7.1.0')

        # ── Step 2: Spin up EMR cluster ─────────────────────────────────
        cluster_response = emr.run_job_flow(
            Name=f"daily-pipeline-{run_id[:8]}",
            ReleaseLabel=emr_release,
            Applications=[{'Name': 'Spark'}, {'Name': 'Hadoop'}],
            Instances={
                'MasterInstanceType':  'm5.xlarge',
                'SlaveInstanceType':   'm5.2xlarge',
                'InstanceCount':        3,
                'Ec2KeyName':          'my-keypair',
                'KeepJobFlowAliveWhenNoSteps': True  # keeps cluster up to add steps
            },
            JobFlowRole='EMR_EC2_DefaultRole',
            ServiceRole='EMR_DefaultRole',
            AutoTerminationPolicy={'IdleTimeout': 3600},  # auto-kill if idle 1 hour
            Configurations=[
                {'Classification': 'spark-defaults',
                 'Properties': {'spark.sql.shuffle.partitions': '200',
                                'spark.executor.memory': '8g'}}
            ],
            LogUri='s3://my-logs/emr/',
            VisibleToAllUsers=True
        )
        cluster_id = cluster_response['JobFlowId']
        print(f"🚀 Cluster started: {cluster_id}")

        # ── Step 3: Wait until the cluster is up (RUNNING or WAITING) ───
        waiter = emr.get_waiter('cluster_running')
        waiter.wait(ClusterId=cluster_id)

        # ── Step 4: Submit Spark step ────────────────────────────────────
        step_response = emr.add_job_flow_steps(
            JobFlowId=cluster_id,
            Steps=[{
                'Name': 's3-to-redshift-transform',
                'ActionOnFailure': 'CONTINUE',
                'HadoopJarStep': {
                    'Jar': 'command-runner.jar',
                    'Args': [
                        'spark-submit', '--deploy-mode', 'cluster',
                        '--py-files', 's3://my-bucket/deps/utils.zip',
                        's3://my-bucket/jobs/transform.py',
                        '--input', s3_input,
                        '--output-table', rs_table,
                        '--run-id', run_id
                    ]
                }
            }]
        )
        step_id = step_response['StepIds'][0]

        # ── Step 5: Poll step until complete ─────────────────────────────
        step_waiter = emr.get_waiter('step_complete')
        step_waiter.wait(ClusterId=cluster_id, StepId=step_id,
                         WaiterConfig={'Delay': 30, 'MaxAttempts': 120})

        step_detail = emr.describe_step(ClusterId=cluster_id, StepId=step_id)
        final_state = step_detail['Step']['Status']['State']
        print(f"Step final state: {final_state}")

        # ── Step 6: Terminate cluster ─────────────────────────────────── 
        emr.terminate_job_flows(JobFlowIds=[cluster_id])

        # ── Step 7: Publish CloudWatch metric ────────────────────────────
        duration_s = (datetime.now(timezone.utc) - start_ts).total_seconds()
        cw.put_metric_data(
            Namespace='DataPipelines',
            MetricData=[
                {'MetricName': 'PipelineDurationSeconds', 'Value': duration_s,
                 'Unit': 'Seconds', 'Dimensions': [{'Name': 'Pipeline', 'Value': 'daily-s3-redshift'}]},
                {'MetricName': 'PipelineSuccess',
                 'Value': 1 if final_state == 'COMPLETED' else 0,
                 'Unit': 'Count',
                 'Dimensions': [{'Name': 'Pipeline', 'Value': 'daily-s3-redshift'}]}
            ]
        )

        # ── Step 8: SNS alert ────────────────────────────────────────────
        sns.publish(
            TopicArn=SNS_ARN,
            Subject=f"EMR Pipeline {final_state}",
            Message=f"Cluster: {cluster_id}\nStep: {step_id}\nDuration: {duration_s:.0f}s\nState: {final_state}"
        )

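    except Exception as e:
        # Catch broadly: the step waiter raises WaiterError (not ClientError)
        # when the step fails, is cancelled, or the waiter runs out of attempts
        if cluster_id:
            emr.terminate_job_flows(JobFlowIds=[cluster_id])  # always clean up
        sns.publish(TopicArn=SNS_ARN, Subject="EMR Pipeline FAILED", Message=str(e))
        raise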

⚠️ Always Terminate

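Always terminate the cluster in the except block too. If you don't, a failed run leaves an idle cluster billing you until the AutoTerminationPolicy idle timeout fires (one hour in this example), or indefinitely if no such policy is set. Also note that a Lambda invocation ends after at most 15 minutes, so a 60-minute step waiter like the one above only completes when the same code runs somewhere without that limit, such as a Glue Python Shell job or an Airflow task.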
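An alternative that removes the need to terminate from code is a transient cluster: pass the step inside `run_job_flow` and set `KeepJobFlowAliveWhenNoSteps` to `False`, so EMR shuts the cluster down when the step finishes or fails. The sketch below only builds the request dictionary; `build_transient_cluster_request` is a hypothetical helper, and the instance types and role names are copied from the example above as placeholders.

```python
def build_transient_cluster_request(name, release_label, spark_args, log_uri):
    """Build run_job_flow kwargs for a cluster that terminates after its step."""
    return {
        'Name': name,
        'ReleaseLabel': release_label,
        'Applications': [{'Name': 'Spark'}],
        'Instances': {
            'MasterInstanceType': 'm5.xlarge',
            'SlaveInstanceType': 'm5.2xlarge',
            'InstanceCount': 3,
            # False = terminate automatically once all steps are done
            'KeepJobFlowAliveWhenNoSteps': False,
        },
        'Steps': [{
            'Name': 'spark-step',
            # A failed step also shuts the cluster down
            'ActionOnFailure': 'TERMINATE_CLUSTER',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',
                'Args': ['spark-submit', *spark_args],
            },
        }],
        'JobFlowRole': 'EMR_EC2_DefaultRole',
        'ServiceRole': 'EMR_DefaultRole',
        'LogUri': log_uri,
    }


# Hypothetical usage:
# import boto3
# request = build_transient_cluster_request(
#     'daily-pipeline', 'emr-7.1.0',
#     ['--deploy-mode', 'cluster', 's3://my-bucket/jobs/transform.py'],
#     's3://my-logs/emr/')
# cluster_id = boto3.client('emr').run_job_flow(**request)['JobFlowId']
```

With this shape the launcher can return immediately after `run_job_flow`, and a separate event or scheduled check can record the final state.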

🗂️

Pattern 3 — Metadata-Driven Multi-Pipeline

DynamoDB control table → dynamic Glue runs

What is Metadata-Driven ETL?

Instead of hardcoding source/target paths in each job, you store pipeline configurations in a DynamoDB control table. One orchestrator Lambda reads all active pipelines, loops through them, starts the correct Glue job with dynamic arguments, polls completion, and writes per-pipeline audit records. Adding a new pipeline = adding a row to DynamoDB — no code change needed.

DynamoDB: pipeline-config table
┌────────────────────────────────────────────────────────────┐
│ pipeline_id │ source_path       │ target_table │ is_active │
│ sales-etl   │ s3://raw/sales/   │ gold.sales   │ true      │
│ orders-etl  │ s3://raw/orders/  │ gold.orders  │ true      │
│ returns-etl │ s3://raw/returns/ │ gold.returns │ false     │
└────────────────────────────────────────────────────────────┘
    │
    ▼
Orchestrator Lambda (triggered by EventBridge)
  ├── scan() → get all is_active=true pipelines
  ├── for each pipeline:
  │     ├── start_job_run() with dynamic --source and --target args
  │     ├── poll get_job_run() until SUCCEEDED/FAILED
  │     ├── put_metric_data() → rows_processed per pipeline
  │     └── put_item() → audit record in DynamoDB
  └── on any failure → publish() to SNS DLQ topic

Boto3 Orchestrator Code

python

# ── Pattern 3: Metadata-Driven Multi-Pipeline Orchestrator ──────────────
import boto3, time
from boto3.dynamodb.conditions import Attr
from datetime import datetime, timezone
from botocore.exceptions import ClientError
import traceback

glue   = boto3.client('glue')
dynamo = boto3.resource('dynamodb')
cw     = boto3.client('cloudwatch')
sns    = boto3.client('sns')

GLUE_JOB     = 'generic-etl-job'  # one reusable Glue job, parameterized
CONFIG_TABLE = 'pipeline-config'
AUDIT_TABLE  = 'pipeline-audit'
SNS_ARN      = 'arn:aws:sns:us-east-1:123456789:pipeline-dlq'

def get_active_pipelines():
    """Scan DynamoDB config table for all active pipelines."""
    table   = dynamo.Table(CONFIG_TABLE)
    results = []
    response = table.scan(FilterExpression=Attr('is_active').eq(True))
    results.extend(response['Items'])
    while 'LastEvaluatedKey' in response:
        response = table.scan(
            FilterExpression=Attr('is_active').eq(True),
            ExclusiveStartKey=response['LastEvaluatedKey']
        )
        results.extend(response['Items'])
    return results

def run_pipeline(pipeline_cfg, run_id):
    """Start and poll one Glue job run. Return (state, error_message)."""
    pid     = pipeline_cfg['pipeline_id']
    start_t = datetime.now(timezone.utc)

    response = glue.start_job_run(
        JobName=GLUE_JOB,
        Arguments={
            '--pipeline_id':   pid,
            '--source_path':   pipeline_cfg['source_path'],
            '--target_table':  pipeline_cfg['target_table'],
            '--run_id':        run_id
        }
    )
    job_run_id = response['JobRunId']
    print(f"  [{pid}] Glue run started: {job_run_id}")

    # Poll until terminal state
    while True:
        detail = glue.get_job_run(JobName=GLUE_JOB, RunId=job_run_id)
        state  = detail['JobRun']['JobRunState']
        if state in ('SUCCEEDED', 'FAILED', 'STOPPED', 'ERROR', 'TIMEOUT'):
            break
        time.sleep(20)

    duration = (datetime.now(timezone.utc) - start_t).total_seconds()
    exec_s   = int(detail['JobRun'].get('ExecutionTime', 0))  # Glue-reported run time in seconds, not a row count; row counts need a custom metric
    err_msg  = detail['JobRun'].get('ErrorMessage', '')

    # Write per-pipeline audit record
    audit = dynamo.Table(AUDIT_TABLE)
    audit.put_item(Item={
        'run_id':          run_id + '#' + pid,
        'pipeline_id':     pid,
        'glue_run_id':     job_run_id,
        'status':          state,
        'duration_s':      str(duration),
        'error_message':   err_msg,
        'timestamp':       datetime.now(timezone.utc).isoformat()
    })

    # Publish pipeline-level CloudWatch metric
    cw.put_metric_data(
        Namespace='DataPipelines',
        MetricData=[{
            'MetricName': 'PipelineSuccess',
            'Value':      1 if state == 'SUCCEEDED' else 0,
            'Unit':       'Count',
            'Dimensions': [{'Name': 'PipelineId', 'Value': pid}]
        }]
    )

    return state, err_msg

def lambda_handler(event, context):
    run_id    = context.aws_request_id
    pipelines = get_active_pipelines()
    print(f"Found {len(pipelines)} active pipelines")
    failures  = []

    for cfg in pipelines:
        try:
            state, err = run_pipeline(cfg, run_id)
            if state != 'SUCCEEDED':
                failures.append({'pipeline_id': cfg['pipeline_id'], 'error': err})
        except Exception as e:
            failures.append({'pipeline_id': cfg['pipeline_id'], 'error': str(e)})

    if failures:
        sns.publish(TopicArn=SNS_ARN, Subject="Multi-Pipeline Failures",
                    Message=f"Failed pipelines:\n{failures}")

    print(f"Run complete. Failures: {len(failures)}/{len(pipelines)}")

🧠 Analogy

Think of the DynamoDB config table as a playlist. The orchestrator is a music player that plays each song (pipeline) in the playlist. To add a new song, you just add a row — you don't rewrite the player.
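The orchestrator above runs pipelines one after another, so total runtime is the sum of all job durations. A common variation is to start every run first and then poll them together. The sketch below shows only that control flow: `run_all`, `start_run`, and `get_state` are hypothetical names, the two callables stand in for `glue.start_job_run` and `glue.get_job_run`, and it assumes the Glue job's maximum concurrent runs setting is high enough to allow parallel runs of the same job.

```python
import time

TERMINAL = {'SUCCEEDED', 'FAILED', 'STOPPED', 'ERROR', 'TIMEOUT'}


def run_all(pipelines, start_run, get_state, delay_s=20, sleep=time.sleep):
    """Start every pipeline, then poll all runs until each is terminal.

    start_run(cfg) -> run id, get_state(run_id) -> state string.
    Returns {pipeline_id: final_state}.
    """
    # Fan out: submit all runs before waiting on any of them
    pending = {cfg['pipeline_id']: start_run(cfg) for cfg in pipelines}
    final = {}
    while pending:
        for pid, run_id in list(pending.items()):
            state = get_state(run_id)
            if state in TERMINAL:
                final[pid] = state
                del pending[pid]
        if pending:
            sleep(delay_s)
    return final


# Hypothetical wiring with the Glue client from the orchestrator:
# final = run_all(
#     get_active_pipelines(),
#     start_run=lambda cfg: glue.start_job_run(
#         JobName=GLUE_JOB,
#         Arguments={'--pipeline_id': cfg['pipeline_id'],
#                    '--source_path': cfg['source_path'],
#                    '--target_table': cfg['target_table']})['JobRunId'],
#     get_state=lambda rid: glue.get_job_run(
#         JobName=GLUE_JOB, RunId=rid)['JobRun']['JobRunState'],
# )
```

Total wall-clock time then approaches the duration of the slowest pipeline, which matters when the orchestrator itself has a runtime limit.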

🔄

Pattern 4 — CDC Streaming Pipeline

DMS/Debezium → MSK → Spark → Delta MERGE

Architecture Overview

CDC (Change Data Capture) captures every INSERT, UPDATE, and DELETE from a source database and streams them as events. Debezium or AWS DMS reads the database transaction log and publishes events to an MSK (Kafka) topic. Spark Structured Streaming consumes those events and uses MERGE INTO on a Delta table to apply inserts, updates, and deletes in near-real-time.

Source DB (PostgreSQL / MySQL / Oracle)
    │  Transaction log (WAL / binlog / redo log)
    ▼
DMS or Debezium Connector
    │  CDC events: {op: "c/u/d/r", before: {...}, after: {...}}
    ▼
MSK Topic: "db.public.orders" ← Schema Registry (Avro)
    │
    ▼
Spark Structured Streaming (EMR / Databricks)
  ├── readStream from Kafka topic
  ├── parse Avro / JSON payload
  ├── parse op field → INSERT / UPDATE / DELETE
  ├── foreachBatch:
  │     ├── filter inserts + updates → MERGE INTO delta.orders
  │     └── filter deletes → DELETE from delta.orders
  └── checkpoint to S3 (exactly-once guarantee)

PySpark CDC Streaming Code

python

# ── Pattern 4: CDC Streaming Pipeline with Delta MERGE ──────────────────
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, get_json_object
from pyspark.sql.types import StructType, StructField, StringType, LongType
from delta.tables import DeltaTable

spark = SparkSession.builder \
    .appName("cdc-streaming") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

KAFKA_BROKERS   = "b-1.mycluster.kafka.us-east-1.amazonaws.com:9092"
KAFKA_TOPIC     = "db.public.orders"
DELTA_TABLE_PATH = "s3://my-lake/silver/orders/"
CHECKPOINT_PATH  = "s3://my-lake/checkpoints/orders-cdc/"

# Schema for the CDC after-image payload
order_schema = StructType([
    StructField("order_id",   LongType(),   False),
    StructField("customer_id",LongType(),   True),
    StructField("amount",      StringType(), True),
    StructField("status",      StringType(), True),
    StructField("updated_at",  StringType(), True)
])

# Read from MSK (Kafka)
raw_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BROKERS) \
    .option("subscribe", KAFKA_TOPIC) \
    .option("startingOffsets", "latest") \
    .option("maxOffsetsPerTrigger", 10000) \
    .load()

# Parse CDC envelope (Debezium JSON format)
parsed = raw_stream.select(
    get_json_object(col("value").cast("string"), "$.op").alias("op"),
    from_json(
        get_json_object(col("value").cast("string"), "$.after"),
        order_schema
    ).alias("after"),
    get_json_object(col("value").cast("string"), "$.before.order_id").alias("delete_id")
)

def apply_cdc_batch(batch_df, batch_id):
    """Apply one micro-batch of CDC events to Delta table."""
    batch_df.cache()

    delta_tbl = DeltaTable.forPath(spark, DELTA_TABLE_PATH)

    # ── Apply INSERTS and UPDATES (op = 'c', 'u', 'r') ─────────────
    upsert_df = batch_df \
        .filter(col("op").isin('c', 'u', 'r')) \
        .select("after.*")

    if upsert_df.count() > 0:
        delta_tbl.alias("tgt").merge(
            upsert_df.alias("src"),
            "tgt.order_id = src.order_id"
        ).whenMatchedUpdateAll() \
         .whenNotMatchedInsertAll() \
         .execute()

    # ── Apply DELETES (op = 'd') ──────────────────────────────────── 
    delete_ids = batch_df \
        .filter(col("op") == 'd') \
        .select(col("delete_id").cast(LongType()).alias("order_id"))

    if delete_ids.count() > 0:
        delta_tbl.alias("tgt").merge(
            delete_ids.alias("src"),
            "tgt.order_id = src.order_id"
        ).whenMatchedDelete() \
         .execute()

    batch_df.unpersist()
    print(f"✅ Batch {batch_id} applied to Delta")

# Start the streaming query
query = parsed.writeStream \
    .foreachBatch(apply_cdc_batch) \
    .option("checkpointLocation", CHECKPOINT_PATH) \
    .trigger(processingTime="2 minutes") \
    .start()

query.awaitTermination()

💡 CDC op field values

c = create (INSERT), u = update (UPDATE), d = delete (DELETE), r = read (snapshot/initial load). Always handle all four.
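One detail worth handling alongside the four op values: a single micro-batch can contain several events for the same key (for example a create followed by an update), and a Delta MERGE fails when multiple source rows match the same target row. The usual remedy is to keep only the newest event per key before merging. The sketch below illustrates that logic in plain Python so it can run anywhere; the `offset` field is an assumption standing in for the Kafka offset, and in Spark the same idea is typically expressed with `row_number()` over a window partitioned by the key and ordered by offset descending.

```python
def collapse_to_latest(events, key='order_id'):
    """Keep only the newest CDC event per key within one micro-batch.

    Each event is a Debezium-style dict with 'op', 'before', 'after', plus an
    'offset' used for ordering (assumed: the Kafka offset; offsets are only
    comparable within one partition, so events for a key must share a partition).
    Returns (upsert_rows, delete_keys).
    """
    latest = {}
    for ev in events:
        # Deletes carry the key in the before-image; everything else in after
        row = ev['before'] if ev['op'] == 'd' else ev['after']
        k = row[key]
        if k not in latest or ev['offset'] > latest[k]['offset']:
            latest[k] = ev
    upserts = [ev['after'] for ev in latest.values() if ev['op'] in ('c', 'u', 'r')]
    deletes = [ev['before'][key] for ev in latest.values() if ev['op'] == 'd']
    return upserts, deletes


events = [
    {'op': 'c', 'before': None, 'after': {'order_id': 1, 'status': 'new'}, 'offset': 10},
    {'op': 'u', 'before': {'order_id': 1, 'status': 'new'},
     'after': {'order_id': 1, 'status': 'paid'}, 'offset': 11},
    {'op': 'c', 'before': None, 'after': {'order_id': 2, 'status': 'new'}, 'offset': 12},
    {'op': 'd', 'before': {'order_id': 2, 'status': 'new'}, 'after': None, 'offset': 13},
]
upserts, deletes = collapse_to_latest(events)
print(upserts)  # [{'order_id': 1, 'status': 'paid'}]
print(deletes)  # [2]
```

Collapsing first also settles the ordering question between upserts and deletes: each key ends up in exactly one of the two outputs, decided by its last event.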

🔍

Pattern 5 — Athena Query Automation

start → poll → parse → DataFrame

Why Automate Athena?

Athena has no built-in boto3 waiter. You must manually poll get_query_execution() until the state is SUCCEEDED or FAILED, then use a paginator to fetch results. This pattern is used in Lambda, Glue Python Shell jobs, and Airflow DAGs to run SQL on S3 data and convert results to a DataFrame for further processing.

Complete Athena Automation Pattern

python

# ── Pattern 5: Athena Query Automation ──────────────────────────────────
import boto3, time
from botocore.exceptions import ClientError

athena = boto3.client('athena')

OUTPUT_LOCATION = 's3://my-athena-results/query-results/'
DATABASE        = 'silver_db'
WORKGROUP       = 'primary'

def run_athena_query(sql: str, database: str = DATABASE) -> list[dict]:
    """
    Run an Athena query and return results as a list of dicts.
    Raises RuntimeError on query failure.
    """
    # ── Step 1: Start query ──────────────────────────────────────────
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={'OutputLocation': OUTPUT_LOCATION},
        WorkGroup=WORKGROUP
    )
    qid = response['QueryExecutionId']
    print(f"⏳ Athena query started: {qid}")

    # ── Step 2: Poll until terminal state (no built-in waiter!) ─────
    delay = 2
    while True:
        detail = athena.get_query_execution(QueryExecutionId=qid)
        state  = detail['QueryExecution']['Status']['State']

        if state == 'SUCCEEDED':
            print(f"✅ Query succeeded: {qid}")
            break
        elif state in ('FAILED', 'CANCELLED'):
            reason = detail['QueryExecution']['Status'].get('StateChangeReason', 'Unknown')
            raise RuntimeError(f"Athena query {state}: {reason}")

        # Exponential backoff capped at 30s
        time.sleep(delay)
        delay = min(delay * 1.5, 30)

    # ── Step 3: Paginate results ──────────────────────────────────── 
    paginator = athena.get_paginator('get_query_results')
    pages     = paginator.paginate(QueryExecutionId=qid)

    rows    = []
    headers = None

    for page in pages:
        result_set = page['ResultSet']

        if headers is None:
            # First row of first page = column names
            headers = [c['Label'] for c in result_set['ResultSetMetadata']['ColumnInfo']]
            data_rows = result_set['Rows'][1:]  # skip header row
        else:
            data_rows = result_set['Rows']

        for row in data_rows:
            # NULL cells have no 'VarCharValue' key; every other value arrives as a string
            values = [cell.get('VarCharValue') for cell in row['Data']]
            rows.append(dict(zip(headers, values)))

    print(f"📊 Fetched {len(rows)} rows")
    return rows

# ── Usage ────────────────────────────────────────────────────────────────
sql = """
    SELECT customer_id, SUM(amount) AS total_spent
    FROM silver_db.orders
    WHERE order_date >= DATE('2024-01-01')
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 1000
"""

results = run_athena_query(sql)

# Convert to pandas (in Lambda / Glue Python Shell)
import pandas as pd
df = pd.DataFrame(results)
print(df.head())

# Or write to S3 as Parquet
df.to_parquet('/tmp/top_customers.parquet')
s3 = boto3.client('s3')
s3.upload_file('/tmp/top_customers.parquet', 'my-bucket', 'gold/top_customers.parquet')

⚠️ Header Row Gotcha

Athena's get_query_results includes the column headers as the very first row in the first page. Always skip Rows[0] of the first page, or you'll have the column names mixed in with your data.
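A related gotcha is that every cell comes back as a string in `VarCharValue`, whatever the column's SQL type. The `ColumnInfo` metadata carries a `Type` for each column, which can drive a simple cast. The sketch below is a hypothetical helper (`cast_athena_rows` is not part of boto3) that covers only a handful of common type names; anything else is left as a string.

```python
def cast_athena_rows(column_info, rows):
    """Convert Athena result rows to dicts with basic Python types.

    column_info: ResultSetMetadata['ColumnInfo'] from get_query_results
    rows:        data rows only (header row already skipped)
    """
    casters = {
        'integer': int, 'bigint': int, 'smallint': int, 'tinyint': int,
        'double': float, 'float': float,
        'boolean': lambda v: v == 'true',
    }
    names = [c['Label'] for c in column_info]
    types = [c['Type'] for c in column_info]
    out = []
    for row in rows:
        record = {}
        for name, typ, cell in zip(names, types, row['Data']):
            raw = cell.get('VarCharValue')  # missing key means SQL NULL
            record[name] = None if raw is None else casters.get(typ, str)(raw)
        out.append(record)
    return out


info = [{'Label': 'customer_id', 'Type': 'bigint'},
        {'Label': 'total_spent', 'Type': 'double'}]
data = [{'Data': [{'VarCharValue': '42'}, {'VarCharValue': '19.5'}]},
        {'Data': [{'VarCharValue': '7'}, {}]}]
print(cast_athena_rows(info, data))
# [{'customer_id': 42, 'total_spent': 19.5}, {'customer_id': 7, 'total_spent': None}]
```

Casting at this point keeps the resulting DataFrame from holding object-typed columns where numeric aggregation is expected.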

🔀

Pattern 6 — Cross-Account Data Access

STS AssumeRole → multi-account S3/Glue

Why Cross-Account?

Large enterprises split data into multiple AWS accounts — a raw data account, a processing account, a consumers account. Your pipeline running in Account A needs to read from S3 in Account B and write results back. The solution is STS AssumeRole: your code assumes a role in the target account, gets temporary credentials, and uses them to build boto3 clients for that account.

Account A (your pipeline)               Account B (data source)
┌────────────────────────┐              ┌──────────────────────────┐
│ Lambda / Glue / EMR    │              │ S3: raw data bucket      │
│                        │              │ Glue Catalog: source db  │
│ sts.assume_role() ─────┼─────────────▶│ IAM Role: cross-acct-role│
│                        │              │ (trust policy allows A)  │
│ temp_creds = response  │◀─────────────┼─ returns temp credentials│
│                        │              └──────────────────────────┘
│ s3_client = Session() ─┼──reads──▶ s3://account-b-bucket/
│ glue_client = ...      │
└────────────────────────┘

Complete Cross-Account Pattern

python

# ── Pattern 6: Cross-Account Data Access via STS AssumeRole ─────────────
import boto3
from botocore.exceptions import ClientError

def get_cross_account_session(target_role_arn: str, session_name: str = 'CrossAccountSession'):
    """
    Assume a role in a different AWS account and return a boto3 Session
    with temporary credentials valid for up to 1 hour.
    """
    sts = boto3.client('sts')

    # Verify who we are (useful for debugging)
    identity = sts.get_caller_identity()
    print(f"Caller identity: {identity['Arn']}")

    try:
        assumed = sts.assume_role(
            RoleArn=target_role_arn,
            RoleSessionName=session_name,
            DurationSeconds=3600  # 1 hour max
        )
    except ClientError as e:
        if e.response['Error']['Code'] == 'AccessDenied':
            raise PermissionError(
                f"Cannot assume role {target_role_arn}. Check trust policy."
            ) from e
        raise

    creds = assumed['Credentials']
    session = boto3.Session(
        aws_access_key_id=creds['AccessKeyId'],
        aws_secret_access_key=creds['SecretAccessKey'],
        aws_session_token=creds['SessionToken'],
        region_name='us-east-1'
    )
    print(f"✅ Assumed role in target account. Expires: {creds['Expiration']}")
    return session


# ── Usage: read from Account B, write results to Account A ──────────────
TARGET_ROLE = 'arn:aws:iam::999999999999:role/cross-account-data-reader'
TARGET_BUCKET = 'account-b-raw-data'
SOURCE_PREFIX = 'orders/year=2024/month=01/'

# Get session with target account credentials
target_session = get_cross_account_session(TARGET_ROLE)

# Build clients in target account
s3_target    = target_session.client('s3')
glue_target  = target_session.client('glue')

# List files in Account B's S3
paginator = s3_target.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=TARGET_BUCKET, Prefix=SOURCE_PREFIX)

file_list = []
for page in pages:
    for obj in page.get('Contents', []):
        file_list.append(f"s3://{TARGET_BUCKET}/{obj['Key']}")
print(f"Found {len(file_list)} files in Account B")

# Read Glue table definition from Account B's catalog
glue_table = glue_target.get_table(DatabaseName='source_db', Name='orders')
print(f"Schema: {glue_table['Table']['StorageDescriptor']['Columns']}")

# Write results back to Account A (default boto3 session = Account A)
import json
s3_source = boto3.client('s3')  # uses default Account A creds
s3_source.put_object(
    Bucket='account-a-processed',
    Key='cross-account-results/manifest.json',
    Body=json.dumps(file_list).encode()  # str(file_list) would not be valid JSON
)
print("✅ Results written to Account A")

🔑 Trust Policy Required

The role in Account B must have a trust policy that allows Account A's principal (the Lambda/Glue/EMR role ARN) to call sts:AssumeRole, and Account A's role needs an identity policy that allows sts:AssumeRole on that role's ARN. If either side is missing, you get AccessDenied no matter what the role's permission policy says.
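The two policy documents involved can be sketched as Python dictionaries. This is an illustrative shape only: the helper names and the example ARNs are hypothetical, and real policies usually add conditions (such as an external ID) and tighter scoping.

```python
import json


def build_trust_policy(caller_role_arn):
    """Trust policy attached to the role in Account B (who may assume it)."""
    return {
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Principal': {'AWS': caller_role_arn},
            'Action': 'sts:AssumeRole',
        }],
    }


def build_caller_policy(target_role_arn):
    """Identity policy attached to the pipeline role in Account A."""
    return {
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Action': 'sts:AssumeRole',
            'Resource': target_role_arn,
        }],
    }


# Example ARNs are placeholders
trust = build_trust_policy('arn:aws:iam::111111111111:role/pipeline-role')
caller = build_caller_policy('arn:aws:iam::999999999999:role/cross-account-data-reader')
print(json.dumps(trust, indent=2))
print(json.dumps(caller, indent=2))
```

The trust policy answers "who may assume this role", while the caller's identity policy answers "which roles may this principal assume"; the S3 and Glue permissions themselves live in a separate permission policy on the Account B role.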

✅

Pattern 7 — Data Quality Gate Pipeline

Glue DQ → score check → DynamoDB → CloudWatch → SNS

Architecture Overview

After a Glue ETL job completes, a Data Quality gate runs to validate the output. If the DQ score is below threshold, the pipeline stops — preventing bad data from reaching downstream consumers. DQ results are stored in DynamoDB for auditing, a metric is published to CloudWatch, and SNS sends an alert on failure. This is a fail-fast, fail-loud design.

Glue ETL Job completes → writes to S3 silver/
    │
    ▼
Glue Data Quality Ruleset Evaluation
    │  Rules: Completeness ≥ 0.99
    │         Uniqueness (order_id) = 1.0
    │         RowCount ≥ 1000
    ▼
DQ score computed
  ├── score ≥ threshold → ✅ pipeline continues
  └── score < threshold → ❌ pipeline STOPS
        ├── put_item() → DQ result to DynamoDB
        ├── put_metric_data() → DQ score to CloudWatch
        └── publish() → SNS alert with failure details
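The three rules in the diagram would be written in Glue's Data Quality Definition Language (DQDL) when the ruleset is created. The sketch below builds such a ruleset string; `build_ruleset` is a hypothetical helper, the completeness column (`customer_id`) is an assumption because the diagram does not name one, and the exact DQDL syntax should be checked against the Glue documentation before use.

```python
def build_ruleset(key_column, completeness_column,
                  min_rows=1000, min_completeness=0.99):
    """Return a DQDL ruleset string covering the three rules in the diagram."""
    rules = [
        f'Completeness "{completeness_column}" >= {min_completeness}',
        f'Uniqueness "{key_column}" = 1.0',
        f'RowCount >= {min_rows}',
    ]
    return 'Rules = [\n    ' + ',\n    '.join(rules) + '\n]'


ruleset_text = build_ruleset('order_id', 'customer_id')
print(ruleset_text)

# Hypothetical registration (verify parameter names in the Glue API reference):
# import boto3
# boto3.client('glue').create_data_quality_ruleset(
#     Name='orders-silver-ruleset',
#     Ruleset=ruleset_text,
#     TargetTable={'DatabaseName': 'silver_db', 'TableName': 'orders'},
# )
```

Keeping the ruleset text in code lets thresholds come from the same configuration source as the rest of the pipeline.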

DQ Gate Boto3 Code

python

# ── Pattern 7: Data Quality Gate ────────────────────────────────────────
import boto3, time, json
from datetime import datetime, timezone
from botocore.exceptions import ClientError
from decimal import Decimal

glue   = boto3.client('glue')
dynamo = boto3.resource('dynamodb')
cw     = boto3.client('cloudwatch')
sns    = boto3.client('sns')

RULESET_NAME   = 'orders-silver-ruleset'
GLUE_DATABASE  = 'silver_db'
GLUE_TABLE     = 'orders'
DQ_THRESHOLD   = 0.95   # 95% rules must pass
AUDIT_TABLE    = 'pipeline-dq-audit'
SNS_ARN        = 'arn:aws:sns:us-east-1:123456789:dq-alerts'

def run_dq_gate(run_id: str, pipeline_name: str) -> bool:
    """
    Run Glue DQ evaluation. Returns True if passed, False if failed.
    Writes results to DynamoDB and CloudWatch.
    """
    # ── Step 1: Start DQ evaluation ─────────────────────────────────
    eval_response = glue.start_data_quality_ruleset_evaluation_run(
        DataSource={
            'GlueTable': {'DatabaseName': GLUE_DATABASE, 'TableName': GLUE_TABLE}
        },
        Role='arn:aws:iam::123456789:role/GlueServiceRole',
        RulesetNames=[RULESET_NAME]
    )
    eval_run_id = eval_response['RunId']
    print(f"⏳ DQ evaluation started: {eval_run_id}")

    # ── Step 2: Poll until complete ──────────────────────────────────
    while True:
        status = glue.get_data_quality_ruleset_evaluation_run(RunId=eval_run_id)
        state  = status['Status']
        if state in ('SUCCEEDED', 'FAILED', 'STOPPED', 'ERROR', 'TIMEOUT'):
            break
        time.sleep(15)

    # ── Step 3: Parse DQ results ─────────────────────────────────────
    result_ids = status.get('ResultIds', [])
    passed_rules  = 0
    total_rules   = 0
    failed_detail = []

    for result_id in result_ids:
        result = glue.get_data_quality_result(ResultId=result_id)
        rule_results = result.get('RuleResults', [])
        for rule in rule_results:
            total_rules += 1
            if rule['Result'] == 'PASS':
                passed_rules += 1
            else:
                failed_detail.append({
                    'rule':    rule.get('Name', 'unknown'),
                    'result':  rule['Result'],
                    'message': rule.get('EvaluationMessage', '')
                })

    dq_score = (passed_rules / total_rules) if total_rules > 0 else 0.0
    passed   = dq_score >= DQ_THRESHOLD
    print(f"DQ Score: {dq_score:.2%} ({passed_rules}/{total_rules} rules passed)")

    # ── Step 4: Write DQ audit record to DynamoDB ───────────────────
    table = dynamo.Table(AUDIT_TABLE)
    table.put_item(Item={
        'run_id':         run_id,
        'pipeline_name':  pipeline_name,
        'eval_run_id':    eval_run_id,
        'dq_score':       Decimal(str(round(dq_score, 4))),
        'passed_rules':   passed_rules,
        'total_rules':    total_rules,
        'passed':         passed,
        'failed_rules':   json.dumps(failed_detail),
        'timestamp':      datetime.now(timezone.utc).isoformat()
    })

    # ── Step 5: Publish DQ metric to CloudWatch ──────────────────────
    cw.put_metric_data(
        Namespace='DataPipelines',
        MetricData=[{
            'MetricName': 'DQScore',
            'Value':      dq_score * 100,
            'Unit':       'Percent',
            'Dimensions': [{'Name': 'Pipeline', 'Value': pipeline_name}]
        }]
    )

    # ── Step 6: Alert on failure ──────────────────────────────────────
    if not passed:
        msg = (
            f"❌ DQ GATE FAILED for {pipeline_name}\n"
            f"Score: {dq_score:.2%} (threshold: {DQ_THRESHOLD:.0%})\n"
            f"Failed rules:\n{json.dumps(failed_detail, indent=2)}"
        )
        sns.publish(TopicArn=SNS_ARN, Subject=f"DQ Failure: {pipeline_name}", Message=msg)

    return passed

# ── Usage in pipeline ─────────────────────────────────────────────────── 
import sys, uuid
run_id = str(uuid.uuid4())

dq_passed = run_dq_gate(run_id, 'orders-silver-pipeline')

if not dq_passed:
    print("🛑 Pipeline halted due to DQ failure. Check DynamoDB audit table.")
    sys.exit(1)  # non-zero exit fails the Glue job, so the downstream table is not updated

print("✅ DQ gate passed. Proceeding to Gold layer.")
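
The score in Step 3 is the fraction of rules that pass, so the threshold interacts with the number of rules. The sketch below isolates that arithmetic in a hypothetical helper, `dq_score`, fed with hand-written sample results shaped like Glue's `RuleResults` entries. With only three rules, a 0.95 threshold means a single failing rule stops the pipeline; one failure is tolerated only once a ruleset has 20 or more rules.

```python
def dq_score(rule_results):
    """Return (score, failed_rule_names) for RuleResults-style dicts."""
    total = len(rule_results)
    failed = [r.get("Name", "unknown")
              for r in rule_results if r.get("Result") != "PASS"]
    score = (total - len(failed)) / total if total else 0.0
    return score, failed


# Hand-written sample, not real Glue output: three rules, one failure.
sample = [
    {"Name": "Rule_1", "Result": "PASS"},
    {"Name": "Rule_2", "Result": "PASS"},
    {"Name": "Rule_3", "Result": "FAIL"},
]

score, failed = dq_score(sample)
print(f"{score:.2%}", failed, score >= 0.95)   # 66.67% ['Rule_3'] False
```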

🔁

Pattern 8 — Error Recovery Pipeline

catch → audit → CloudWatch → SNS → retry → DLQ

Architecture Overview

Any production pipeline will fail. The question is: what happens when it does? This pattern implements a complete error recovery architecture: classify the error, write it to DynamoDB, log it to CloudWatch, alert via SNS, retry with exponential backoff, and after max retries route to a Dead Letter Queue. Operations can inspect the DLQ and trigger manual or automated re-runs.

Pipeline Step Fails
        │
        ▼
except ClientError
        ├── parse error code (ThrottlingException / AccessDenied / etc.)
        │
        ├── 1. put_item()        → DynamoDB audit table (error_code, message, stack_trace)
        ├── 2. put_log_events()  → CloudWatch log stream
        ├── 3. put_metric_data() → CloudWatch metric (PipelineFailure count)
        ├── 4. publish()         → SNS failure topic (email / Slack / PagerDuty)
        │
        ├── Recoverable? (ThrottlingException, transient network)
        │       ├── YES → exponential backoff retry (2s, 4s, 8s, 16s, 30s max)
        │       │         Max 5 attempts total
        │       └── NO  → skip retry, go straight to DLQ
        │
        └── Max retries exceeded?
                └── send_message() → SQS DLQ with full error context
                        └── CloudWatch Alarm on DLQ depth → SNS on-call alert
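
The backoff schedule in the diagram can be checked in isolation. The sketch below uses a hypothetical helper, `backoff_delays`, to compute the waits between attempts for a doubling delay that starts at 2s and is capped at 30s. With 5 attempts there are only 4 waits, so the 30s cap is reached only if the attempt limit is raised. The `with_jitter` helper is an optional addition that is not part of the pattern as described; it randomises each wait so that many failing pipelines do not retry in lockstep.

```python
import random


def backoff_delays(max_attempts: int = 5, base: int = 2, cap: int = 30):
    """Seconds slept between attempts: one fewer wait than there are attempts."""
    delays, delay = [], base
    for _ in range(max_attempts - 1):
        delays.append(delay)
        delay = min(delay * 2, cap)   # double each time, never above the cap
    return delays


def with_jitter(delay: float, rng=random) -> float:
    """Optional 'full jitter': wait a uniform random time in [0, delay]."""
    return rng.uniform(0, delay)


print(backoff_delays())     # [2, 4, 8, 16]
print(backoff_delays(7))    # [2, 4, 8, 16, 30, 30]
```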

Complete Error Recovery Framework

python

# ── Pattern 8: Error Recovery Pipeline ──────────────────────────────────
import boto3, time, json, traceback, uuid
from datetime import datetime, timezone
from botocore.exceptions import ClientError

dynamo = boto3.resource('dynamodb')
logs   = boto3.client('logs')
cw     = boto3.client('cloudwatch')
sns    = boto3.client('sns')
sqs    = boto3.client('sqs')

AUDIT_TABLE   = 'pipeline-errors'
LOG_GROUP     = '/data-pipelines/errors'
LOG_STREAM    = 'pipeline-error-stream'
SNS_ARN       = 'arn:aws:sns:us-east-1:123456789:pipeline-oncall'
DLQ_URL       = 'https://sqs.us-east-1.amazonaws.com/123456789/pipeline-dlq'
MAX_RETRIES   = 5

# Errors that are recoverable (worth retrying)
RECOVERABLE_ERRORS = {
    'ThrottlingException', 'ServiceUnavailableException',
    'ProvisionedThroughputExceededException', 'RequestExpired',
    'InternalError', 'InternalServiceError'
}

def log_error_to_dynamo(run_id, pipeline_name, error_code, error_msg, attempt):
    """Write structured error record to DynamoDB."""
    table = dynamo.Table(AUDIT_TABLE)
    table.put_item(Item={
        'error_id':       str(uuid.uuid4()),
        'run_id':         run_id,
        'pipeline_name':  pipeline_name,
        'error_code':     error_code,
        'error_message':  error_msg,
        'attempt_number': attempt,
        'is_recoverable': error_code in RECOVERABLE_ERRORS,
        'timestamp':      datetime.now(timezone.utc).isoformat()
    })

def log_error_to_cloudwatch(pipeline_name, error_msg):
    """Push error message to CloudWatch Logs."""
    try:
        # Create the log stream on first use; ignore the error if it already exists.
        # The log group itself is assumed to exist.
        try:
            logs.create_log_stream(logGroupName=LOG_GROUP, logStreamName=LOG_STREAM)
        except logs.exceptions.ResourceAlreadyExistsException:
            pass

        # PutLogEvents no longer requires a sequence token, so none is passed.
        logs.put_log_events(
            logGroupName=LOG_GROUP,
            logStreamName=LOG_STREAM,
            logEvents=[{
                'timestamp': int(datetime.now(timezone.utc).timestamp() * 1000),
                'message':   json.dumps({'pipeline': pipeline_name, 'error': error_msg})
            }]
        )
    except Exception as e:
        print(f"Warning: CloudWatch log write failed: {e}")  # don't crash on logging failure

def send_to_dlq(run_id, pipeline_name, error_msg):
    """Send failed job to Dead Letter Queue for manual inspection / replay."""
    sqs.send_message(
        QueueUrl=DLQ_URL,
        MessageBody=json.dumps({
            'run_id':        run_id,
            'pipeline_name': pipeline_name,
            'error':         error_msg,
            'timestamp':     datetime.now(timezone.utc).isoformat(),
            'action':        'REQUIRES_MANUAL_REVIEW'
        }),
        MessageAttributes={
            'pipeline': {'DataType': 'String', 'StringValue': pipeline_name}
        }
    )
    print(f"📬 Sent to DLQ: {pipeline_name}")

def run_with_recovery(pipeline_fn, run_id: str, pipeline_name: str):
    """
    Execute pipeline_fn with full error recovery:
    retry recoverable errors with exponential backoff,
    route to DLQ after max retries.
    """
    delay = 2

    for attempt in range(1, MAX_RETRIES + 1):
        try:
            print(f"▶ Attempt {attempt}/{MAX_RETRIES}: {pipeline_name}")
            pipeline_fn()
            print(f"✅ Pipeline succeeded on attempt {attempt}")
            return True  # success

        except ClientError as e:
            error_code = e.response['Error']['Code']
            error_msg  = e.response['Error']['Message']
            tb         = traceback.format_exc()

            print(f"❌ Attempt {attempt} failed: [{error_code}] {error_msg}")

            # 1. Write to DynamoDB audit
            log_error_to_dynamo(run_id, pipeline_name, error_code, error_msg, attempt)

            # 2. Write to CloudWatch Logs
            log_error_to_cloudwatch(pipeline_name, f"[{error_code}] {error_msg}")

            # 3. Increment CloudWatch failure metric
            cw.put_metric_data(
                Namespace='DataPipelines',
                MetricData=[{'MetricName': 'PipelineFailure', 'Value': 1,
                             'Unit': 'Count',
                             'Dimensions': [{'Name': 'Pipeline', 'Value': pipeline_name}]}]
            )

            # 4. Check if recoverable and if retries remain
            if error_code not in RECOVERABLE_ERRORS:
                print(f"🛑 Non-recoverable error: {error_code}. Skipping retries.")
                sns.publish(TopicArn=SNS_ARN, Subject=f"Non-recoverable: {pipeline_name}",
                            Message=f"[{error_code}] {error_msg}\n\n{tb}")
                send_to_dlq(run_id, pipeline_name, error_msg)
                return False

            if attempt == MAX_RETRIES:
                print(f"🛑 Max retries ({MAX_RETRIES}) reached.")
                sns.publish(TopicArn=SNS_ARN, Subject=f"Max retries: {pipeline_name}",
                            Message=f"Gave up after {MAX_RETRIES} attempts.\n[{error_code}] {error_msg}")
                send_to_dlq(run_id, pipeline_name, error_msg)
                return False

            # 5. Wait with exponential backoff before retry
            print(f"⏳ Retrying in {delay}s...")
            time.sleep(delay)
            delay = min(delay * 2, 30)  # cap at 30s

# ── Usage ────────────────────────────────────────────────────────────────
def my_pipeline():
    # your actual boto3 / Spark code here
    glue = boto3.client('glue')
    glue.start_job_run(JobName='my-etl-job')

import sys

run_id = str(uuid.uuid4())
success = run_with_recovery(my_pipeline, run_id, 'my-etl-pipeline')
if not success:
    sys.exit(1)  # signal failure to orchestrator (Airflow / EventBridge)
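
The last step of the architecture, a CloudWatch alarm on DLQ depth, is not covered by the framework code. A minimal sketch of the alarm definition follows. The helper `dlq_alarm_kwargs` and the alarm name are hypothetical, and the period, threshold, and missing-data settings are assumptions to tune. The returned dict is meant to be passed to `put_metric_alarm` on a `boto3.client('cloudwatch')`.

```python
def dlq_alarm_kwargs(queue_name: str, sns_topic_arn: str) -> dict:
    """Build put_metric_alarm arguments that fire when the DLQ is non-empty."""
    return {
        "AlarmName": f"{queue_name}-depth",          # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,                               # assumption: 5-minute window
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",          # an idle queue is not an incident
        "AlarmActions": [sns_topic_arn],
    }


kwargs = dlq_alarm_kwargs(
    "pipeline-dlq", "arn:aws:sns:us-east-1:123456789:pipeline-oncall"
)
print(kwargs["AlarmName"])   # pipeline-dlq-depth
# With a real client: boto3.client('cloudwatch').put_metric_alarm(**kwargs)
```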

✅ Production Principle

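The run_with_recovery wrapper is reusable across all pipelines: pass any callable as pipeline_fn, and error classification, auditing, alerting, and DLQ routing all happen automatically. Note that it catches only botocore ClientError; any other exception propagates to the caller without being audited or routed to the DLQ. A shared framework like this, rather than ad hoc try/except blocks in each job, is the kind of work that differentiates senior engineers.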
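
Messages that reach the DLQ still need to be inspected or replayed. The following is a minimal sketch of a consume-process-delete loop for that purpose, assuming a hypothetical `drain_dlq` helper and a caller-supplied `handler` that returns True once a run has been replayed successfully. The SQS client is passed in as a parameter so the function can be exercised with a stub as well as with `boto3.client('sqs')`.

```python
import json


def drain_dlq(sqs_client, queue_url: str, handler, max_batches: int = 10) -> int:
    """Receive DLQ messages, hand each body to handler, delete on success.

    Returns the number of messages handled and deleted. Messages whose
    handler returns False are left on the queue and become visible again
    after the queue's visibility timeout.
    """
    replayed = 0
    for _ in range(max_batches):
        resp = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,   # SQS maximum per call
            WaitTimeSeconds=5,        # long polling
        )
        messages = resp.get("Messages", [])
        if not messages:
            break                     # queue drained
        for msg in messages:
            body = json.loads(msg["Body"])
            if handler(body):
                sqs_client.delete_message(
                    QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
                )
                replayed += 1
    return replayed
```

A handler might look up `body['pipeline_name']` and call `run_with_recovery` again; deleting only after the handler succeeds keeps a failed replay from silently dropping the record.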

Pattern Summary — All 8 Patterns at a Glance

#	Pattern	Trigger	Key Services	Use When
P1	File Arrival Batch	S3 Event → SQS	Lambda, Glue, DynamoDB, SNS	Files land unpredictably
P2	Scheduled EMR Batch	EventBridge cron	EMR, Redshift, CloudWatch, SNS	Daily/hourly large-scale Spark
P3	Metadata-Driven Multi	EventBridge	DynamoDB, Glue, CloudWatch	Many similar pipelines
P4	CDC Streaming	Continuous	MSK, Spark Streaming, Delta	Near-real-time DB sync
P5	Athena Automation	On-demand	Athena, S3, Pandas	SQL on S3 in Lambda/Glue
P6	Cross-Account Access	Any	STS, S3, Glue	Multi-account enterprise setup
P7	DQ Gate	Post-ETL	Glue DQ, DynamoDB, CloudWatch	Prevent bad data in Gold layer
P8	Error Recovery	On failure	DynamoDB, CW Logs, SNS, SQS DLQ	Every production pipeline

MODULE 29 — COMPLETE

Module 29 Summary

You have now covered the complete AWS + Boto3 toolkit for production Data Engineering. Here is a quick recap of every area covered.

MODULE 29 — WHAT YOU MASTERED

STORAGE & SECURITY
✅ S3 — data lake foundation, multipart upload, SSE-KMS, S3 Select
✅ IAM — least privilege, role assumption, cross-account patterns
✅ KMS — envelope encryption, CMK rotation
✅ Secrets Manager — credential injection in pipelines
✅ Parameter Store — pipeline config hierarchy

COMPUTE & ETL
✅ Glue — Data Catalog, Crawlers, ETL Jobs, Bookmarks, Data Quality
✅ EMR — Spark cluster lifecycle, Serverless, Spot, Bootstrap
✅ Athena — serverless SQL, partition pruning, Iceberg, Federated
✅ Lake Formation — fine-grained access, LF-Tags, column masking
✅ Redshift — COPY/UNLOAD, distribution styles, Spectrum, WLM

MESSAGING & EVENTING
✅ MSK — Managed Kafka, Schema Registry, MSK Connect
✅ DynamoDB — pipeline audit and metadata tables
✅ RDS — JDBC partitioned reads, Glue crawler source
✅ EventBridge — cron scheduling, event-driven triggers
✅ SQS — decoupling, DLQ, consume-process-delete
✅ SNS — fan-out, failure alerts
✅ Lambda — S3 trigger, Glue/EMR launcher, lightweight ETL
✅ CloudWatch — custom metrics, alarms, Log Insights
✅ VPC — private subnets, VPC Endpoints, security groups

BOTO3 API MASTERY
✅ Fundamentals — session/client/resource, auth, retry config
✅ Error Handling — ClientError parsing, exponential backoff, tenacity
✅ Paginators — never write NextToken loops again
✅ Waiters — built-in + custom WaiterModel + manual polling
✅ S3, Glue, Athena, EMR, Lambda APIs — complete reference
✅ Secrets Mgr, SQS, SNS, DynamoDB APIs — complete reference
✅ CloudWatch, STS, EventBridge, RDS/Redshift APIs — complete reference

REAL-WORLD PATTERNS
✅ 8 end-to-end pipeline architectures with full boto3 code
