AWS Cloud + Boto3
Data Engineering
This is the largest and most practical module in the entire course. It covers every AWS service a Data Engineer uses daily — from S3 and Glue to EMR, MSK, Lambda, and CloudWatch — plus a complete Boto3 API reference with production-grade code patterns, error handling, and 8 real-world pipeline architectures.
Amazon S3 — The Data Lake Foundation
S3 (Simple Storage Service) is where almost every modern data pipeline begins and ends. Every Spark job you write on EMR, Databricks, or Glue is ultimately reading from and writing to S3. Understanding S3 deeply — buckets, storage classes, partitioning layout, performance tricks, and security — is non-negotiable for a Data Engineer.
S3 stores data as objects inside buckets. A bucket is a globally unique top-level container (like s3://my-company-data-lake). An object is just a file — a Parquet file, a CSV, a JSON blob, an image — identified by a key (its full path inside the bucket). There is no real "folder" structure; S3 is a flat key-value store, but tools like the console and Spark simulate folders using the / character in the key name.
Analogy: picture S3 as a giant warehouse where every box carries a label such as raw/sales/2024/01/15/data.parquet. The warehouse robot (S3) can instantly find any box by its label, but it doesn't actually organize boxes into physical shelves; the "folder" look is just how the label is written.
import boto3
s3 = boto3.client("s3")
# An object's "path" is really just its key string
bucket = "my-company-data-lake"
key = "raw/sales/2024/01/15/transactions.parquet"
# Upload a local file as an object
s3.upload_file("transactions.parquet", bucket, key)
# s3://my-company-data-lake/raw/sales/2024/01/15/transactions.parquet
print(f"s3://{bucket}/{key}")
Since S3 has no real folders, the console and APIs use a prefix (everything before the last /) plus a delimiter (usually /) to group keys and make them look like folders. This is critical when listing objects — using Prefix + Delimiter in list_objects_v2 lets you list only "files in this folder" instead of every object in the entire bucket.
response = s3.list_objects_v2(
Bucket="my-company-data-lake",
Prefix="raw/sales/2024/01/", # acts like a "folder"
Delimiter="/" # stops at the next "folder" level
)
# CommonPrefixes lists the "sub-folders" one level down (e.g. raw/sales/2024/01/15/)
for p in response.get("CommonPrefixes", []):
    print(p["Prefix"])
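A single list_objects_v2 call returns at most 1,000 entries, so listings over real prefixes usually go through a paginator. The following is a minimal sketch, assuming `s3_client` is a boto3 S3 client such as `boto3.client("s3")`; the helper name `list_subfolders` is illustrative and not part of boto3.

```python
def list_subfolders(s3_client, bucket, prefix, delimiter="/"):
    """Return every "sub-folder" (common prefix) directly under `prefix`.

    `s3_client` is assumed to be a boto3 S3 client. The paginator follows
    continuation tokens, so entries beyond the first page are not dropped.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    subfolders = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter=delimiter):
        # A page with no grouped keys simply has no "CommonPrefixes" entry
        for entry in page.get("CommonPrefixes", []):
            subfolders.append(entry["Prefix"])
    return subfolders
```

Called as `list_subfolders(s3, "my-company-data-lake", "raw/sales/2024/01/")`, it would return one prefix string per day-level "folder".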
Every object has system metadata (size, last modified, ETag, storage class) and can have custom metadata (key-value pairs you attach, like source-system: salesforce). Tags are separate from metadata — they're used for cost allocation, lifecycle rules, and access control, and can be changed without re-uploading the object (metadata changes require a copy).
# Custom metadata is set at upload time
s3.put_object(
Bucket=bucket, Key=key, Body=open("data.parquet","rb"),
Metadata={"source-system": "salesforce", "pipeline": "sales-etl"}
)
# Tags can be added/changed anytime — used for cost allocation & lifecycle
s3.put_object_tagging(
Bucket=bucket, Key=key,
Tagging={"TagSet": [
{"Key": "environment", "Value": "production"},
{"Key": "team", "Value": "data-engineering"}
]}
)
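Because custom metadata is fixed at write time, changing it means copying the object onto itself with a replace directive. A minimal sketch, assuming `s3_client` is a boto3 S3 client; `replace_metadata` is a hypothetical helper name.

```python
def replace_metadata(s3_client, bucket, key, new_metadata):
    """Rewrite an object's custom metadata by copying the object onto itself.

    `s3_client` is assumed to be a boto3 S3 client. With
    MetadataDirective="REPLACE", the metadata passed here replaces the old
    set entirely, so include every pair you want to keep.
    """
    return s3_client.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        Metadata=new_metadata,
        MetadataDirective="REPLACE",
    )
```

A single copy_object call only handles objects up to 5 GB; larger objects need a multipart copy. Tags, by contrast, can be changed with put_object_tagging alone.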
S3 offers multiple storage classes that trade retrieval speed for cost. As a DE, choosing the right class for raw, processed, and archival data is one of the easiest ways to cut storage cost by 50-90%.
| Storage Class | Use Case | Retrieval | Relative Cost |
|---|---|---|---|
| Standard | Frequently accessed data (Bronze/Silver layers) | Instant | Highest |
| Intelligent-Tiering | Unknown/changing access patterns | Instant | Auto-optimized |
| Standard-IA | Monthly reports, less-frequent reads | Instant | Lower |
| One Zone-IA | Re-creatable data, single-AZ ok | Instant | Lower still |
| Glacier Instant Retrieval | Archives accessed quarterly | Instant | Low |
| Glacier Flexible Retrieval | Compliance archives, rare access | Minutes-hours | Very low |
| Glacier Deep Archive | 7+ year regulatory retention | Within 12 hours (standard) to 48 hours (bulk) | Cheapest |
A lifecycle policy automatically moves objects between storage classes (or deletes them) based on age, without any pipeline code. A common DE pattern: keep raw data on Standard for the first 30 days, move it to Standard-IA at day 30, transition it to Glacier Deep Archive after a year, then expire it after 7 years for compliance.
s3.put_bucket_lifecycle_configuration(
Bucket="my-company-data-lake",
LifecycleConfiguration={
"Rules": [{
"ID": "raw-zone-tiering",
"Filter": {"Prefix": "raw/"},
"Status": "Enabled",
"Transitions": [
{"Days": 30, "StorageClass": "STANDARD_IA"},
{"Days": 365, "StorageClass": "GLACIER_DEEP_ARCHIVE"}
],
"Expiration": {"Days": 2555} # ~7 years
}]
}
)
Even though S3 has no real folders, a consistent key-naming convention is essential so Spark, Glue, and Athena can discover and partition data correctly. A typical convention separates the zone, domain/table name, and partition values.
s3://my-company-data-lake/
├── bronze/
│   └── sales/orders/
│       └── year=2024/month=01/day=15/part-0001.parquet
├── silver/
│   └── sales/orders_cleaned/
│       └── year=2024/month=01/day=15/part-0001.parquet
└── gold/
    └── sales/daily_revenue/
        └── year=2024/month=01/part-0001.parquet
Hive-style partitioning encodes both the column name and its value in the path: year=2024/month=01/day=15/. Glue Crawlers and Spark automatically recognize this pattern and turn year, month, day into queryable columns — without you needing to read the file to know its date.
df.write \
.partitionBy("year", "month", "day") \
.mode("append") \
.parquet("s3://my-company-data-lake/silver/sales/orders_cleaned/")
# Resulting path: .../orders_cleaned/year=2024/month=01/day=15/part-xxxx.parquet
When a query filters on the partition columns, for example WHERE year=2024 AND month=1, Athena/Spark can skip reading every other partition entirely. This is partition pruning, and it's one of the biggest cost and speed wins in a data lake.
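To make the Hive-style convention concrete, here is a small pure-Python sketch that builds a partition prefix from a date and reads the partition values back out of a key. Both function names are illustrative, and the year/month/day column names follow the layout used in this section.

```python
from datetime import date


def partition_prefix(table_prefix, d):
    """Build the Hive-style partition prefix for one calendar day."""
    base = table_prefix.rstrip("/")
    return f"{base}/year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/"


def parse_partition_values(key):
    """Extract {column: value} pairs from the name=value segments of a key."""
    values = {}
    for segment in key.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name and value:
            values[name] = value
    return values


prefix = partition_prefix("silver/sales/orders_cleaned/", date(2024, 1, 15))
print(prefix)
print(parse_partition_values(prefix + "part-0001.parquet"))
```

This mirrors what engines do during partition discovery: the values come from the path alone, without opening the file.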
Dynamic partitioning means Spark decides the partition values from the data itself (e.g., each row's year/month column determines where it lands) rather than you hardcoding a single partition. Partition discovery is the process by which Glue Catalog or Spark scans the S3 prefix tree and registers each year=.../month=.../day=... combination as a partition in the metastore so queries can use it.
# After writing new partitions, refresh metastore so Athena/Spark SQL sees them
spark.sql("MSCK REPAIR TABLE silver.orders_cleaned")
# Or, more efficiently with Glue, add only the new partition via boto3
glue = boto3.client("glue")
glue.batch_create_partition(
DatabaseName="silver", TableName="orders_cleaned",
PartitionInputList=[{
"Values": ["2024", "01", "15"],
"StorageDescriptor": {"Location": "s3://my-company-data-lake/silver/sales/orders_cleaned/year=2024/month=01/day=15/"}
}]
)
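When registering partitions by hand, query engines generally expect the partition's storage descriptor to carry the same formats and SerDe as the table, not just a location. One hedged way to do that is to copy the table's descriptor and override only the location. In this sketch `glue_client` is assumed to be a boto3 Glue client and `register_partition` is a hypothetical helper.

```python
import copy


def register_partition(glue_client, database, table, values, location):
    """Register one partition, inheriting formats and SerDe from the table.

    `glue_client` is assumed to be boto3.client("glue").
    """
    table_sd = glue_client.get_table(DatabaseName=database, Name=table)["Table"]["StorageDescriptor"]
    partition_sd = copy.deepcopy(table_sd)
    partition_sd["Location"] = location  # only the location differs per partition
    response = glue_client.batch_create_partition(
        DatabaseName=database,
        TableName=table,
        PartitionInputList=[{"Values": list(values), "StorageDescriptor": partition_sd}],
    )
    # Per-partition failures are reported in the response, not raised
    if response.get("Errors"):
        raise RuntimeError(f"Partition registration failed: {response['Errors']}")
    return response
```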
The bronze/silver/gold layout is the medallion architecture applied to S3 zones. Each zone is a separate top-level prefix (often even a separate bucket for stricter access control) representing a stage of data quality.
A single-zone lake puts bronze/silver/gold as prefixes inside one bucket — simpler to manage, but harder to apply different access policies per zone. A multi-zone (multi-bucket) design gives each zone its own bucket — company-raw, company-silver, company-gold — enabling stricter IAM policies (e.g., only the ingestion role can write to raw; analysts can only read gold).
| Aspect | Single-Zone (one bucket) | Multi-Zone (per-zone buckets) |
|---|---|---|
| Access control granularity | Prefix-based IAM policies | Bucket-level IAM, simpler policies |
| Lifecycle policies | Per-prefix rules | Per-bucket rules, cleaner |
| Operational simplicity | Fewer buckets to manage | More resources, more Terraform |
| Common at | Small-medium teams | Large enterprises, multi-account |
For files larger than ~100 MB, AWS recommends multipart upload — splitting the file into parts (5 MB - 5 GB each) and uploading them in parallel, then telling S3 to assemble them. This is faster, more resilient (a failed part can be retried alone), and required for files over 5 GB.
# 1. Initiate
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
upload_id = mpu["UploadId"]
parts = []
try:
    # 2. Upload each part (can be done in parallel with ThreadPoolExecutor).
    # file_chunks is assumed to be an iterable of bytes objects, each at
    # least 5 MB except possibly the last.
    for i, chunk in enumerate(file_chunks, start=1):
        resp = s3.upload_part(
            Bucket=bucket, Key=key, UploadId=upload_id,
            PartNumber=i, Body=chunk
        )
        parts.append({"PartNumber": i, "ETag": resp["ETag"]})
    # 3. Complete: S3 assembles the object from parts
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": parts}
    )
except Exception:
    # 4. Always abort on failure to avoid storage charges for orphaned parts
    s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
    raise
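The multipart loop needs an iterable of byte chunks. A minimal sketch of one way to produce it from a local file is shown here; `iter_file_chunks` is a hypothetical helper, and the 8 MiB default part size is an arbitrary choice above S3's 5 MiB minimum.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for every part except the last


def iter_file_chunks(path, part_size=8 * 1024 * 1024):
    """Yield successive byte chunks of `path`, each `part_size` long
    (the final chunk may be shorter).

    This is a generator, so the size check runs on first iteration.
    """
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size must be at least 5 MiB")
    with open(path, "rb") as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            yield chunk
```

With this in place, `file_chunks = iter_file_chunks("transactions.parquet")` would feed the upload_part loop one part at a time, without loading the whole file into memory.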
In practice, you rarely write raw multipart code — boto3's high-level upload_file() / TransferConfig handles multipart and threading automatically. But for custom pipelines (e.g., uploading hundreds of small files), wrapping uploads in a ThreadPoolExecutor dramatically speeds things up since each upload is mostly waiting on network I/O.
from concurrent.futures import ThreadPoolExecutor
import boto3, glob
s3 = boto3.client("s3")
bucket = "my-company-data-lake"
def upload_one(local_path):
    key = f"bronze/sales/{local_path.split('/')[-1]}"
    s3.upload_file(local_path, bucket, key)
    return key
files = glob.glob("data/*.parquet")
# Upload up to 10 files concurrently
with ThreadPoolExecutor(max_workers=10) as ex:
    results = list(ex.map(upload_one, files))
print(f"Uploaded {len(results)} files")
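With ex.map, the first exception is re-raised when results are consumed, so one bad file hides the outcome of the others. A hedged variant that collects successes and failures separately might look like this; `upload_all` is a hypothetical helper and `upload_fn` stands in for any per-file upload function such as the one above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_all(paths, upload_fn, max_workers=10):
    """Run upload_fn(path) for every path concurrently.

    Returns (succeeded, failed): a list of paths that uploaded and a dict
    mapping each failed path to the exception it raised.
    """
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        futures = {ex.submit(upload_fn, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                fut.result()
                succeeded.append(path)
            except Exception as exc:  # keep going; report at the end
                failed[path] = exc
    return succeeded, failed
```

The caller can then retry only the entries in `failed`, or fail the pipeline run with a complete list of what went wrong.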
S3 scales request rates automatically: each prefix supports at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second. Workloads that burst well beyond that still benefit from spreading keys across multiple prefixes rather than writing everything under one hot prefix. For such very high-throughput write patterns, avoid funnelling sequential key names like 00001.parquet, 00002.parquet into a single prefix; use a hash prefix, or several date- or source-based prefixes, to spread the load.
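One way to spread keys, shown as a sketch rather than a recommendation for every bucket, is to prepend a short hash of the key. The function name `hashed_key` and the two-character prefix length are assumptions made for illustration.

```python
import hashlib


def hashed_key(key, prefix_chars=2):
    """Prepend a short, deterministic hex prefix derived from the key.

    Two hex characters give 256 possible prefixes; the same key always
    maps to the same prefix, so the object can be located again.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_chars]}/{key}"


print(hashed_key("bronze/events/00001.parquet"))
```

The trade-off is that hashed prefixes no longer sort by date or table, so this pattern tends to suit high-volume landing zones more than partitioned tables that rely on prefix-based discovery.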
Spark reads S3 files in splits. Very large single files (multi-GB CSVs) can become a bottleneck if they aren't splittable: a gzip-compressed CSV is NOT splittable, so a single task reads the whole file. Prefer splittable formats (Parquet or ORC, which stay splittable even when compressed internally with Snappy or gzip, or text that is uncompressed or bzip2-compressed) and write data as many medium-sized files rather than one giant file.
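S3 Select lets you run a simple SQL-like filter directly on a CSV/JSON/Parquet object stored in S3, and S3 returns only the matching rows/columns without downloading the whole object. This is useful for lightweight Lambda functions that need a small slice of a large file. Note that AWS has closed S3 Select to new customers (existing users can keep using it), so check availability in your account before designing around it.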
resp = s3.select_object_content(
Bucket=bucket, Key="bronze/sales/orders.csv",
ExpressionType="SQL",
Expression="SELECT s.order_id, s.amount FROM S3Object s WHERE s.amount > 1000",
InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
OutputSerialization={"CSV": {}}
)
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
Transfer Acceleration routes uploads/downloads through Amazon's global CloudFront edge network instead of going directly to the bucket's region — useful when uploading from locations geographically far from your bucket's region (e.g., uploading from Asia to a US bucket). It's enabled per-bucket and used via a special endpoint (bucket.s3-accelerate.amazonaws.com).
The small files problem happens when a pipeline writes thousands of tiny files (a few KB each) instead of fewer, larger ones — often from streaming jobs with frequent micro-batches, or over-partitioned Spark writes. Each small file has metadata overhead (S3 LIST/GET calls, Spark task scheduling overhead), so listing and reading thousands of small files is dramatically slower than reading a handful of large ones.
Two common fixes: compact existing small files by reading them, applying coalesce()/repartition(), and rewriting them as fewer large files; and call repartition(n) before writing so each output file is the right size.
For Parquet on S3 read by Spark/Athena, the sweet spot is typically 128 MB to 1 GB per file. Files this size align well with Spark's default split size and HDFS-style block sizes, giving each task a meaningful chunk of work without making any single task too slow.
# Estimate target partitions so each output file ≈ 256MB
total_size_mb = 5120 # e.g., 5 GB dataset
target_file_mb = 256
num_files = max(1, total_size_mb // target_file_mb)
df.repartition(num_files).write.mode("overwrite").parquet(
"s3://my-company-data-lake/silver/sales/orders_cleaned/"
)
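Rather than hardcoding the dataset size, the file count can be derived from what is actually stored under the input prefix. A minimal sketch, assuming `s3_client` is a boto3 S3 client; both helper names are illustrative, and stored input size is only a rough proxy for the size of the output.

```python
import math


def prefix_size_bytes(s3_client, bucket, prefix):
    """Sum the sizes of all objects under a prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        total += sum(obj["Size"] for obj in page.get("Contents", []))
    return total


def target_file_count(total_bytes, target_file_mb=256):
    """Number of output files so that each is roughly target_file_mb."""
    target_bytes = target_file_mb * 1024 * 1024
    # Round up so a partial file's worth of data still gets its own file
    return max(1, math.ceil(total_bytes / target_bytes))


print(target_file_count(5 * 1024**3))  # 5 GiB at 256 MiB per file
```

The result would then be passed to `df.repartition(...)` before the write.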
A bucket policy is a resource-based JSON policy attached directly to the bucket, controlling who (which principals — users, roles, accounts) can perform which actions (read, write, delete) on the bucket and its objects. Bucket policies are essential for cross-account access and for blanket rules like "deny all unencrypted uploads."
{
"Version": "2012-10-17",
"Statement": [{
"Sid": "DenyUnencryptedUploads",
"Effect": "Deny",
"Principal": "*",
"Action": "s3:PutObject",
"Resource": "arn:aws:s3:::my-company-data-lake/*",
"Condition": {
"StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
}
}]
}
While bucket policies are attached to the resource (the bucket), IAM policies are attached to the principal (a user, role, or group) and define what S3 actions that principal can perform — across any bucket the policy allows. A Glue job's execution role, for example, gets an IAM policy granting s3:GetObject on the raw bucket and s3:PutObject on the silver bucket.
S3 Block Public Access is an account/bucket-level setting (now on by default for new buckets) that overrides any policy or ACL that would make objects public. For data lakes — which almost always contain sensitive business data — this should remain enabled at the account level, with access granted only via IAM roles, never public URLs.
S3 supports encryption at rest in three main flavors:
| Method | Key Management | Use Case |
|---|---|---|
| SSE-S3 | AWS manages keys entirely (AES-256) | Default, simplest, free |
| SSE-KMS | You manage keys via AWS KMS (CMK) | Audit trail, key rotation, fine-grained access control |
| Client-Side | You encrypt before upload, S3 stores ciphertext | Maximum control, used for highly regulated data |
s3.put_object(
Bucket=bucket, Key=key, Body=data,
ServerSideEncryption="aws:kms",
SSEKMSKeyId="arn:aws:kms:us-east-1:123456789012:key/abcd-1234"
)
When versioning is enabled on a bucket, every PUT to the same key creates a new version instead of overwriting — old versions remain accessible (and deletable separately). This protects against accidental overwrites/deletes, and is required for cross-region replication and certain compliance needs.
s3.put_bucket_versioning(
Bucket="my-company-data-lake",
VersioningConfiguration={"Status": "Enabled"}
)
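Every old version is billed as stored data, so pair versioning with a lifecycle rule that expires noncurrent versions (NoncurrentVersionExpiration); without that rule, costs silently grow.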
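A rule of that kind can be sketched as a small builder that returns the rule dictionary. The function name, the rule ID format, and the 30-day default are assumptions chosen for illustration.

```python
def noncurrent_cleanup_rule(prefix="", keep_days=30):
    """Build a lifecycle rule that deletes noncurrent object versions
    `keep_days` after they stop being the current version."""
    return {
        "ID": f"expire-noncurrent-after-{keep_days}d",
        "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
        "Status": "Enabled",
        "NoncurrentVersionExpiration": {"NoncurrentDays": keep_days},
    }


rule = noncurrent_cleanup_rule("raw/", keep_days=30)
# With a boto3 client this would be applied roughly as:
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-company-data-lake",
#     LifecycleConfiguration={"Rules": [rule]},
# )
print(rule)
```

Keep in mind that put_bucket_lifecycle_configuration replaces the bucket's entire lifecycle configuration, so existing rules need to be included in the same call.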
Cross-Region Replication (CRR) automatically copies new objects from a bucket in one region to a bucket in another region — used for disaster recovery, data residency requirements, or serving compute in multiple regions. Same-Region Replication (SRR) does the same within a region — common for separating production data into a compliance/audit account.
S3 can emit an event notification whenever an object is created, deleted, or restored. These events can trigger a Lambda function directly, or be sent to SQS (for buffering/decoupling) or SNS (for fan-out to multiple subscribers). This is the backbone of event-driven ingestion — "file lands → pipeline starts automatically."
s3.put_bucket_notification_configuration(
Bucket="my-company-data-lake",
NotificationConfiguration={
"LambdaFunctionConfigurations": [{
"LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:start-ingest-pipeline",
"Events": ["s3:ObjectCreated:*"],
"Filter": {"Key": {"FilterRules": [
{"Name": "prefix", "Value": "bronze/sales/"},
{"Name": "suffix", "Value": ".parquet"}
]}}
}]
}
)
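On the receiving side, the Lambda function gets a JSON event with one record per object. A minimal handler sketch follows; the helper name `extract_s3_objects` is illustrative, and the print call stands in for whatever starts the pipeline. Object keys arrive URL-encoded in the event, so they are decoded before use.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Keys are URL-encoded in the event (spaces arrive as "+")
        key = unquote_plus(s3_info["object"]["key"])
        objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    objects = extract_s3_objects(event)
    for bucket, key in objects:
        print(f"New object: s3://{bucket}/{key}")
    return {"processed": len(objects)}
```

S3 must also be granted permission to invoke the function (a resource-based permission on the Lambda) before the notification configuration is accepted.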
S3 Inventory generates a daily/weekly report (a CSV/ORC/Parquet file) listing all objects in a bucket along with metadata (size, storage class, encryption status, last modified). For buckets with millions of objects, this is far cheaper and faster than calling list_objects_v2 repeatedly — and the report itself can be queried with Athena to audit storage usage, find unencrypted objects, or detect old files for cleanup.
AWS IAM — Identity and Access Management
IAM controls who can do what across every AWS service your pipelines touch. Every Glue job, EMR cluster, Lambda function, and Airflow worker runs as an IAM role — not as "you." Getting IAM right is the difference between a pipeline that works in dev and breaks in prod with AccessDenied, and a pipeline that's secure, auditable, and portable across accounts.
IAM has four core building blocks. A user represents a person or application with long-term credentials. A group is a collection of users that share permissions. A role is an identity without permanent credentials — it's assumed temporarily by a user, service, or another account. A policy is the actual JSON document that defines permissions (allow/deny on specific actions and resources), and it's attached to users, groups, or roles.
| Concept | Has Long-Term Credentials? | Typical Use in Data Engineering |
|---|---|---|
| User | Yes (access key/secret) | Local dev only — avoid in production |
| Group | N/A (contains users) | Organizing human users by team |
| Role | No — temporary creds via STS | Glue/EMR/Lambda execution, cross-account access |
| Policy | N/A (a document) | Attached to roles to grant least-privilege access |
A managed policy is a standalone, reusable policy document (AWS-managed like AmazonS3ReadOnlyAccess, or customer-managed) that can be attached to multiple roles. An inline policy is embedded directly inside a single role/user/group — it exists only as part of that identity and is deleted if the identity is deleted.
| Aspect | Managed Policy | Inline Policy |
|---|---|---|
| Reusability | Attach to many roles | Tied to one identity |
| Versioning | Has version history | No version history |
| Best for | Standard permission sets shared across teams | One-off, tightly-scoped exceptions for a single role |
import json
import boto3
iam = boto3.client("iam")
# Managed policy — reusable, attach by ARN
iam.attach_role_policy(
RoleName="glue-etl-role",
PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
)
# Inline policy — embedded directly in this one role only
iam.put_role_policy(
RoleName="glue-etl-role",
PolicyName="AllowWriteToSilverBucket",
PolicyDocument=json.dumps({
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": ["s3:PutObject", "s3:DeleteObject"],
"Resource": "arn:aws:s3:::company-silver/*"
}]
})
)
Most IAM policies are identity-based: attached to a user/role, they define what that identity can do. A resource-based policy is attached to the resource itself (S3 bucket policies, KMS key policies, Lake Formation resource shares, SQS queue policies) and defines who can access that resource, including identities from other AWS accounts. Both types are evaluated together: any explicit deny wins, and otherwise access requires at least one explicit allow. Within a single account an allow in either policy type is generally enough; for cross-account access, both the caller's identity-based policy and the resource-based policy must allow the action.
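The deny-overrides-allow rule can be illustrated with a deliberately simplified evaluator. This is a teaching sketch, not how AWS implements evaluation: it ignores principals, conditions, permission boundaries, service control policies, and cross-account rules, and it uses shell-style wildcard matching as an approximation.

```python
from fnmatch import fnmatchcase


def _matches(patterns, value):
    """True if value matches any pattern (a string or a list of strings)."""
    if isinstance(patterns, str):
        patterns = [patterns]
    return any(fnmatchcase(value, p) for p in patterns)


def is_allowed(statements, action, resource):
    """Simplified decision: explicit deny wins, otherwise at least one
    matching allow is required, otherwise the request is implicitly denied."""
    allowed = False
    for stmt in statements:
        if _matches(stmt["Action"], action) and _matches(stmt["Resource"], resource):
            if stmt["Effect"] == "Deny":
                return False  # explicit deny overrides any allow
            allowed = True
    return allowed


statements = [
    {"Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::company-raw/*"},
    {"Effect": "Deny", "Action": "s3:*", "Resource": "arn:aws:s3:::company-raw/secret/*"},
]
print(is_allowed(statements, "s3:GetObject", "arn:aws:s3:::company-raw/sales/a.parquet"))
print(is_allowed(statements, "s3:GetObject", "arn:aws:s3:::company-raw/secret/a.parquet"))
```

The statements from identity-based and resource-based policies would be pooled into the same list, which is why a single deny anywhere blocks the request.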
Every AWS compute service that runs your code needs an execution role — a role the service assumes on your behalf to act with specific permissions. You never put access keys inside a Glue job or Lambda function; instead, you attach an IAM role, and AWS automatically injects temporary credentials into the running environment.
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {"Service": "glue.amazonaws.com"},
"Action": "sts:AssumeRole"
}]
}
Inside the Glue job you simply call boto3.client("s3") with no credentials specified; boto3 automatically finds the temporary credentials injected by the assumed glue-etl-role via the instance metadata / environment.
Least privilege means granting only the exact permissions a role needs — nothing more. Instead of giving a Glue job's role s3:* on every bucket, scope it precisely: s3:GetObject on the raw bucket prefix it reads, s3:PutObject on the silver bucket prefix it writes, and nothing else.
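Scoped policies are easier to keep consistent when they are generated rather than hand-edited. The sketch below builds a policy document for one read prefix and one write prefix. The function name and Sid values are illustrative; it also narrows s3:ListBucket with an s3:prefix condition, since listing is a bucket-level action that would otherwise cover the whole bucket.

```python
import json


def least_privilege_policy(read_bucket, read_prefix, write_bucket, write_prefix):
    """Build an IAM policy document scoped to one read and one write prefix."""
    read_prefix = read_prefix.strip("/")
    write_prefix = write_prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListOnlyReadPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{read_bucket}",
                # Restrict listing to keys under the read prefix
                "Condition": {"StringLike": {"s3:prefix": [f"{read_prefix}/*"]}},
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{read_bucket}/{read_prefix}/*",
            },
            {
                "Sid": "WriteObjects",
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{write_bucket}/{write_prefix}/*",
            },
        ],
    }


policy = least_privilege_policy("company-raw", "sales/", "company-silver", "sales/")
print(json.dumps(policy, indent=2))
```

The resulting JSON string could be passed as the PolicyDocument of an inline or customer-managed policy.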
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ReadRawSalesOnly",
"Effect": "Allow",
"Action": ["s3:GetObject", "s3:ListBucket"],
"Resource": [
"arn:aws:s3:::company-raw",
"arn:aws:s3:::company-raw/sales/*"
]
},
{
"Sid": "WriteSilverSalesOnly",
"Effect": "Allow",
"Action": ["s3:PutObject"],
"Resource": "arn:aws:s3:::company-silver/sales/*"
}
]
}
Granting "Action": "s3:*" and "Resource": "*" to "make it work" is the #1 IAM mistake. A single compromised job credential then has access to every bucket in the account, including production databases' backups.
Large organizations often separate environments into different AWS accounts (e.g., data-raw-account, data-processing-account, analytics-account) for blast-radius isolation. Cross-account access lets a role in Account B assume a role in Account A — Account A's role trust policy explicitly lists Account B as a trusted principal.
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {"AWS": "arn:aws:iam::222222222222:role/processing-glue-role"},
"Action": "sts:AssumeRole",
"Condition": {"StringEquals": {"sts:ExternalId": "shared-secret-id"}}
}]
}
The ExternalId condition is a security best practice for third-party cross-account access: it prevents the "confused deputy" problem, where another customer of the same third party could trick it into assuming your role.
A service-linked role is a special pre-defined role that an AWS service creates and manages automatically — you can't edit its permissions, and only that service can assume it. Examples relevant to DEs: the role AWS creates for EMR managed scaling, or the role behind Lake Formation's data access. They simplify setup because AWS guarantees the permissions are exactly what the service needs — no more, no less.
STS (Security Token Service) issues short-lived credentials (default 1 hour, configurable up to the role's max session duration) when a principal "assumes" a role. sts.assume_role() returns an AccessKeyId, SecretAccessKey, and SessionToken — all three are required together and expire automatically.
sts = boto3.client("sts")
response = sts.assume_role(
RoleArn="arn:aws:iam::333333333333:role/cross-account-reader",
RoleSessionName="de-pipeline-session",
DurationSeconds=3600
)
creds = response["Credentials"]
# Build a new session using the temporary credentials
session = boto3.Session(
aws_access_key_id=creds["AccessKeyId"],
aws_secret_access_key=creds["SecretAccessKey"],
aws_session_token=creds["SessionToken"]
)
s3_other_account = session.client("s3")
s3_other_account.list_objects_v2(Bucket="other-account-bucket")
This combines the cross-account trust policy (above) with assume_role(): Account B's pipeline calls sts.assume_role() targeting Account A's role ARN. Because Account A's trust policy lists Account B as trusted, STS issues temporary credentials scoped to whatever permissions Account A's role has — letting Account B's Spark job read Account A's S3 bucket or Glue Catalog without ever holding Account A's long-term credentials.
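Because assumed-role credentials expire, long-running pipelines usually check the expiry before reusing a cached session. A minimal sketch, assuming `credentials` is the "Credentials" dictionary returned by sts.assume_role() (its "Expiration" value is a timezone-aware datetime); the function name and the five-minute margin are illustrative choices.

```python
from datetime import datetime, timedelta, timezone


def needs_refresh(credentials, margin=timedelta(minutes=5), now=None):
    """True if the temporary credentials expire within `margin`.

    `now` can be injected for testing; it defaults to the current UTC time.
    """
    now = now or datetime.now(timezone.utc)
    return credentials["Expiration"] - now <= margin


# Example with a hand-built credentials dict (values are placeholders)
fake_creds = {"Expiration": datetime.now(timezone.utc) + timedelta(minutes=2)}
print(needs_refresh(fake_creds))
```

When it returns True, the pipeline would call assume_role() again and rebuild the session from the new credentials.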
Session tags are key-value pairs passed during assume_role() that get attached to the resulting temporary session. Policies can then reference aws:PrincipalTag/... in their Condition blocks, enabling attribute-based access control (ABAC). For example, a single shared role can be assumed by multiple teams, but a session tag like team=sales restricts that specific session to only the sales team's S3 prefix. The role's trust policy must also allow sts:TagSession for the caller; otherwise the tagged assume_role() call is rejected.
response = sts.assume_role(
RoleArn="arn:aws:iam::444444444444:role/shared-team-role",
RoleSessionName="sales-team-session",
Tags=[{"Key": "team", "Value": "sales"}]
)
# A policy condition like:
# "Resource": "arn:aws:s3:::company-data/${aws:PrincipalTag/team}/*"
# automatically scopes this session to company-data/sales/* only
AWS KMS — Key Management Service
KMS is how AWS handles encryption at rest for almost every data service — S3, Glue, Redshift, DynamoDB, Secrets Manager, and more. As a Data Engineer, you don't write cryptography code — you tell AWS which key to use and KMS does all the heavy lifting. Understanding how KMS works lets you build compliant, auditable, encrypted data pipelines without touching low-level crypto APIs.
When you enable encryption on an AWS service (e.g. S3, Glue, CloudWatch Logs) without specifying your own key, AWS automatically creates and manages a key on your behalf. These are called AWS managed keys. You can't rotate them manually, can't restrict who uses them beyond service-level controls, and can't share them across accounts. They are free and zero-configuration — great for default encryption, but limited for compliance requirements.
AWS managed keys at a glance: aliases such as aws/s3 and aws/glue. No cost. No manual rotation. No cross-account use. Limited audit.
For production data pipelines, you should always create customer managed keys (CMKs) for sensitive data. CMKs let you: (1) restrict which IAM roles can decrypt data, (2) audit every key usage via CloudTrail, (3) rotate keys on a schedule, (4) disable a key to immediately revoke access to all encrypted data, (5) share keys cross-account for data mesh architectures.
import boto3
kms = boto3.client("kms", region_name="us-east-1")
# Create a symmetric CMK for encrypting data lake data
response = kms.create_key(
Description="CMK for production data lake encryption",
KeyUsage="ENCRYPT_DECRYPT", # default — symmetric AES-256 key
KeySpec="SYMMETRIC_DEFAULT", # AES-256 GCM
Policy="""
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Enable IAM Root",
"Effect": "Allow",
"Principal": {"AWS": "arn:aws:iam::123456789012:root"},
"Action": "kms:*",
"Resource": "*"
},
{
"Sid": "Allow Glue and EMR roles to use the key",
"Effect": "Allow",
"Principal": {
"AWS": [
"arn:aws:iam::123456789012:role/GlueExecutionRole",
"arn:aws:iam::123456789012:role/EMRJobRole"
]
},
"Action": ["kms:Decrypt", "kms:GenerateDataKey"],
"Resource": "*"
}
]
}
"""
)
key_id = response["KeyMetadata"]["KeyId"]
key_arn = response["KeyMetadata"]["Arn"]
print(f"Created CMK: {key_arn}")
# Create a human-readable alias so you don't have to remember the UUID
kms.create_alias(
AliasName="alias/data-lake-cmk",
TargetKeyId=key_id
)
# Now you can reference it as alias/data-lake-cmk in all service configs
An alias such as alias/data-lake-cmk is far easier to work with than a raw key ID like mrk-abc123... in your configs, and the alias can be pointed to a different key if you ever need to rotate manually.
KMS is designed for small payloads (up to 4 KB per API call). You obviously can't send a 10 GB Parquet file to KMS to encrypt it — that would be impossibly slow and expensive. This is why AWS uses envelope encryption — a two-key system where the actual data is encrypted locally with a fast symmetric key, and only that small key is sent to KMS for protection.
Here is exactly what happens when S3, Glue, or Redshift encrypts your data with a CMK:
import boto3
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
kms = boto3.client("kms")
CMK_ID = "alias/data-lake-cmk"
# ─── ENCRYPT ───────────────────────────────────────────────
# Step 1: Ask KMS for a data key
dk_resp = kms.generate_data_key(KeyId=CMK_ID, KeySpec="AES_256")
plaintext_data_key = dk_resp["Plaintext"] # 32 bytes — USE then DESTROY
encrypted_data_key = dk_resp["CiphertextBlob"] # store alongside your data
# Step 2: Encrypt your actual data LOCALLY (fast, no KMS call)
data_to_encrypt = b"sensitive customer data here"
nonce = os.urandom(12) # 96-bit nonce for AES-GCM
aesgcm = AESGCM(plaintext_data_key)
ciphertext = aesgcm.encrypt(nonce, data_to_encrypt, None)
# Step 3: Drop the plaintext key as soon as possible (del only removes the reference; Python does not guarantee the bytes are wiped from memory)
del plaintext_data_key
# Store: ciphertext + nonce + encrypted_data_key together
# ─── DECRYPT ───────────────────────────────────────────────
# Step 1: Call KMS to decrypt the encrypted data key
dk_plain = kms.decrypt(CiphertextBlob=encrypted_data_key)["Plaintext"]
# Step 2: Decrypt locally
aesgcm2 = AESGCM(dk_plain)
decrypted = aesgcm2.decrypt(nonce, ciphertext, None)
print(decrypted) # b"sensitive customer data here"
# Step 3: Drop the plaintext key reference again
del dk_plain
# Note: In practice S3/Glue/Redshift do ALL of this for you automatically!
When you upload an object to S3 with SSE-KMS, S3 calls KMS to get a data key, encrypts your object with it, and stores only the encrypted data key in the object's metadata. On download, S3 calls kms:Decrypt on your behalf. The caller needs both an S3 read permission AND a kms:Decrypt permission — this double gate is powerful for access control in a data lake.
import boto3
s3 = boto3.client("s3")
CMK_ID = "arn:aws:kms:us-east-1:123456789012:alias/data-lake-cmk"
# Upload with SSE-KMS using your CMK
s3.upload_file(
Filename="sales_data.parquet",
Bucket="my-data-lake",
Key="gold/sales/2024/sales_data.parquet",
ExtraArgs={
"ServerSideEncryption": "aws:kms",
"SSEKMSKeyId": CMK_ID
}
)
# Alternatively via put_object
s3.put_object(
Bucket="my-data-lake",
Key="gold/users/users.parquet",
Body=b"parquet bytes here",
ServerSideEncryption="aws:kms",
SSEKMSKeyId=CMK_ID
)
Best practice is to deny all uploads that don't use SSE-KMS via a bucket policy. This ensures no developer accidentally stores raw unencrypted data in your secure data lake bucket — the upload will simply fail with a 403.
import json, boto3
s3 = boto3.client("s3")
BUCKET = "my-data-lake"
CMK_ARN = "arn:aws:kms:us-east-1:123456789012:key/abcd-1234"
# Bucket policy that DENIES any upload without SSE-KMS using our CMK
policy = {
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DenyNonKMSUploads",
"Effect": "Deny",
"Principal": "*",
"Action": "s3:PutObject",
"Resource": f"arn:aws:s3:::{BUCKET}/*",
"Condition": {
"StringNotEquals": {
"s3:x-amz-server-side-encryption": "aws:kms"
}
}
},
{
"Sid": "DenyWrongKMSKey",
"Effect": "Deny",
"Principal": "*",
"Action": "s3:PutObject",
"Resource": f"arn:aws:s3:::{BUCKET}/*",
"Condition": {
"StringNotEquals": {
"s3:x-amz-server-side-encryption-aws-kms-key-id": CMK_ARN
}
}
}
]
}
s3.put_bucket_policy(
Bucket=BUCKET,
Policy=json.dumps(policy)
)
print("Bucket policy applied — unencrypted uploads will now be denied.")
Roles that upload to this bucket also need kms:GenerateDataKey and kms:Decrypt on the CMK, or their writes will fail with AccessDenied.
Through a Security Configuration, AWS Glue can encrypt three things with your CMK: (1) the data a job writes to S3, (2) CloudWatch logs emitted by the Glue job, and (3) job bookmarks (state tracking for incremental loads). A Security Configuration is a reusable encryption config you attach to Glue jobs and crawlers. Metadata stored in the Glue Data Catalog can also be encrypted with a CMK, but that is configured separately in the catalog's own encryption settings.
import boto3
glue = boto3.client("glue")
CMK_ARN = "arn:aws:kms:us-east-1:123456789012:alias/data-lake-cmk"
# Create a security configuration
glue.create_security_configuration(
Name="data-lake-security-config",
EncryptionConfiguration={
"S3Encryption": [
{
"S3EncryptionMode": "SSE-KMS", # encrypt all S3 output
"KmsKeyArn": CMK_ARN
}
],
"CloudWatchEncryption": {
"CloudWatchEncryptionMode": "SSE-KMS", # encrypt job logs
"KmsKeyArn": CMK_ARN
},
"JobBookmarksEncryption": {
"JobBookmarksEncryptionMode": "CSE-KMS", # encrypt incremental state
"KmsKeyArn": CMK_ARN
}
}
)
# Attach security configuration when creating / updating a Glue job
glue.create_job(
Name="sales-etl-job",
Role="arn:aws:iam::123456789012:role/GlueExecutionRole",
Command={"Name": "glueetl", "ScriptLocation": "s3://scripts/etl.py", "PythonVersion": "3"},
SecurityConfiguration="data-lake-security-config", # ← attach here
GlueVersion="4.0",
NumberOfWorkers=10,
WorkerType="G.1X"
)
The Glue job's execution role needs kms:GenerateDataKey, kms:Decrypt, and kms:DescribeKey on the CMK. If these are missing, the Glue job will fail with AccessDeniedException before it even reads a single row.
Redshift encrypts data at the block level: every data block on disk is encrypted with a hierarchy of keys, in which a cluster encryption key (CEK) wraps block-level keys and your CMK wraps the CEK. The result is that no Redshift data is readable without a live, enabled CMK. Encryption is normally chosen at cluster creation time. An existing unencrypted cluster can be modified to use KMS encryption later, but Redshift then migrates its data to a new encrypted cluster, so plan for that migration window.
import boto3
redshift = boto3.client("redshift")
CMK_ARN = "arn:aws:kms:us-east-1:123456789012:alias/data-lake-cmk"
redshift.create_cluster(
ClusterIdentifier="prod-dwh",
NodeType="ra3.4xlarge",
NumberOfNodes=4,
MasterUsername="admin",
MasterUserPassword="ChangeMe123!", # use Secrets Manager in prod!
DBName="analytics",
Encrypted=True, # enable encryption
KmsKeyId=CMK_ARN, # use our CMK — not AWS managed key
VpcSecurityGroupIds=["sg-abc123"],
ClusterSubnetGroupName="redshift-subnet-group"
)
print("Encrypted Redshift cluster creation started.")
By default, Secrets Manager encrypts secrets with an AWS managed key (aws/secretsmanager). For production pipelines, use your own CMK: every decryption of the secret then shows up as a KMS event in CloudTrail alongside the GetSecretValue call, the key policy gives you a second layer of access control, and you can revoke access by disabling the key.
import boto3, json
sm = boto3.client("secretsmanager")
CMK_ARN = "arn:aws:kms:us-east-1:123456789012:alias/data-lake-cmk"
# Create a secret encrypted with our CMK
sm.create_secret(
Name="prod/postgresql/credentials",
Description="PostgreSQL DB credentials for ETL pipeline",
KmsKeyId=CMK_ARN, # ← associate our CMK here
SecretString=json.dumps({
"host": "prod-db.cluster-xyz.us-east-1.rds.amazonaws.com",
"port": 5432,
"username": "etl_user",
"password": "SuperSecret123!",
"dbname": "analytics"
})
)
print("Secret stored with CMK encryption.")
KMS can automatically rotate a CMK on a schedule you choose, from every 90 days to every 2560 days. When rotation happens, AWS generates new cryptographic material for the key but keeps the same Key ID and ARN, so your configs don't need to change. KMS retains every previous version of the key material and uses it transparently to decrypt data that was encrypted before the rotation; existing data is not re-encrypted. New encryption requests use the newest material. This is the recommended approach for most use cases.
import boto3
kms = boto3.client("kms")
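# enable_key_rotation, get_key_rotation_status and rotate_key_on_demand take a
# key ID or key ARN, not an alias name, so resolve the alias first
KEY_ID = kms.describe_key(KeyId="alias/data-lake-cmk")["KeyMetadata"]["KeyId"]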
# Enable automatic rotation (rotates every 365 days by default)
kms.enable_key_rotation(
KeyId=KEY_ID,
RotationPeriodInDays=365 # annual rotation (90–2560 days)
)
# Check rotation status
status = kms.get_key_rotation_status(KeyId=KEY_ID)
print(status["KeyRotationEnabled"]) # True
print(status.get("NextRotationDate")) # e.g. 2025-01-15T00:00:00Z
# You can also trigger a manual on-demand rotation right now
kms.rotate_key_on_demand(KeyId=KEY_ID)
print("On-demand rotation triggered.")
Automatic rotation keeps the same Key ID. But sometimes you need a completely new key, for example when a key is compromised or when compliance requires a new key for each year's data. In that case you create a new CMK and update your alias to point to it. New data is encrypted with the new key. Old data is still decryptable only because its ciphertext records the old key, so the old key must stay enabled (and must not be deleted) until everything encrypted under it has been re-encrypted or retired.
import boto3
kms = boto3.client("kms")
# Step 1: Create a brand new CMK
new_key = kms.create_key(
Description="Data lake CMK v2 — 2025 rotation",
KeyUsage="ENCRYPT_DECRYPT",
KeySpec="SYMMETRIC_DEFAULT"
)
new_key_id = new_key["KeyMetadata"]["KeyId"]
# Step 2: Move the alias to the new key
# (all configs referencing "alias/data-lake-cmk" now use the new key)
kms.update_alias(
AliasName="alias/data-lake-cmk",
TargetKeyId=new_key_id
)
print(f"Alias now points to new key: {new_key_id}")
# Old data stays readable only while the old key exists and is enabled.
# New data is encrypted with new_key_id.
# Schedule the old key for deletion only after all data under it has been re-encrypted.
# Step 3: Schedule old key for deletion (7–30 day waiting period; irreversible once the window ends)
OLD_KEY_ID = "mrk-previous-key-id"
kms.schedule_key_deletion(
KeyId=OLD_KEY_ID,
PendingWindowInDays=30 # 30-day grace period to recover if needed
)
| AWS Service | How KMS is Used | Config Location | IAM Permission Needed |
|---|---|---|---|
| Amazon S3 | SSE-KMS on each object upload | Upload ExtraArgs or bucket default encryption | kms:GenerateDataKey, kms:Decrypt |
| AWS Glue | Job bookmarks, logs, S3 output, catalog | Security Configuration attached to job | kms:GenerateDataKey, kms:Decrypt, kms:DescribeKey |
| Amazon Redshift | Block-level cluster encryption | Cluster creation parameter KmsKeyId | IAM role association at cluster level |
| Secrets Manager | Encrypts secret value at rest | KmsKeyId on create_secret() | kms:Decrypt, kms:GenerateDataKey |
| CloudWatch Logs | Encrypt log group at rest | associate-kms-key on log group | kms:GenerateDataKey, kms:Decrypt |
| Amazon MSK | Encrypts data at rest on broker | Cluster encryption config at creation | Automatic via MSK service role |
| Amazon RDS | Storage volume encryption | KmsKeyId at DB creation | Automatic via RDS service role |
import boto3
kms = boto3.client("kms")
# List all CMKs and their aliases
for page in kms.get_paginator("list_keys").paginate():
    for key in page["Keys"]:
        detail = kms.describe_key(KeyId=key["KeyId"])["KeyMetadata"]
        if detail["KeyManager"] == "CUSTOMER": # only our CMKs
            print(detail["KeyId"], detail["Description"], detail["KeyState"])
# disable_key, enable_key and get_key_policy need a key ID or key ARN (alias names are rejected)
KEY_ID = kms.describe_key(KeyId="alias/data-lake-cmk")["KeyMetadata"]["KeyId"]
# Disable a key (emergency: data becomes unreadable until the key is re-enabled)
kms.disable_key(KeyId=KEY_ID)
# Re-enable a key
kms.enable_key(KeyId=KEY_ID)
# Check if a key policy allows your Glue role to use it
policy = kms.get_key_policy(
    KeyId=KEY_ID,
    PolicyName="default"
)["Policy"]
print(policy) # JSON string; inspect it for your role ARN
KMS charges $1/month per CMK plus $0.03 per 10,000 API calls. For a busy data lake with thousands of Glue jobs writing millions of objects to S3, KMS costs can add up. Use S3 Bucket Keys to dramatically reduce KMS API calls — instead of calling KMS for every object, S3 generates a bucket-level key locally and reuses it for many objects, reducing KMS calls by up to 99%.
import boto3
s3 = boto3.client("s3")
BUCKET = "my-data-lake"
CMK_ARN = "arn:aws:kms:us-east-1:123456789012:alias/data-lake-cmk"
# Enable default SSE-KMS with Bucket Key on the bucket
s3.put_bucket_encryption(
Bucket=BUCKET,
ServerSideEncryptionConfiguration={
"Rules": [{
"ApplyServerSideEncryptionByDefault": {
"SSEAlgorithm": "aws:kms",
"KMSMasterKeyID": CMK_ARN
},
"BucketKeyEnabled": True # ← this is the cost saver
}]
}
)
print("Bucket Key enabled; KMS request costs can drop by up to 99%.")
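To make the cost argument concrete, here is a rough arithmetic sketch. It uses the $0.03 per 10,000 requests figure quoted above and treats the 99% reduction as a best case; the workload size and function names are hypothetical, and real savings depend on your access pattern.

```python
# Price quoted in the text above; verify against current AWS pricing before use.
KMS_PRICE_PER_10K_REQUESTS = 0.03  # USD


def kms_request_cost(num_requests: int) -> float:
    """Request cost in USD for num_requests KMS API calls."""
    return num_requests / 10_000 * KMS_PRICE_PER_10K_REQUESTS


def cost_with_bucket_key(num_requests: int, reduction: float = 0.99) -> float:
    """Cost after Bucket Keys remove a fraction `reduction` (0..1) of the KMS calls."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return kms_request_cost(num_requests) * (1.0 - reduction)


# Hypothetical workload: roughly one KMS call per object written in a month
MONTHLY_REQUESTS = 50_000_000
print(f"Without Bucket Key: ${kms_request_cost(MONTHLY_REQUESTS):.2f}")
print(f"With Bucket Key:    ${cost_with_bucket_key(MONTHLY_REQUESTS):.2f}")
```

The $1/month per-key charge is unaffected by Bucket Keys, so it is left out of the calculation.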
AWS Secrets Manager
Every pipeline needs credentials — database passwords, API keys, Snowflake tokens, Kafka credentials. Hard-coding them is a security disaster. AWS Secrets Manager is the production solution: store secrets centrally, retrieve them in Python at runtime, rotate them automatically, and audit every access via CloudTrail. It integrates natively with Glue, Lambda, Airflow, and any boto3-based pipeline.
Secrets Manager stores any sensitive string — most commonly a JSON object with multiple fields like {"username": "db_user", "password": "db_pass", "host": "db.example.com"}. You can also store plain strings (e.g. a raw API token). Every secret has a name (the lookup key), a value (the secret payload), and optional metadata like a description, tags, and KMS key.
Never write DB_PASSWORD = "my_super_secret_123". Hardcoded credentials in code or environment variables get committed to git, and this is how data breaches happen. Always fetch from Secrets Manager at runtime.
import boto3, json
sm = boto3.client("secretsmanager", region_name="us-east-1")
# Store a database credential as JSON — the standard pattern
secret_value = json.dumps({
"username": "pipeline_user",
"password": "S3cr3tP@ssw0rd!",
"host": "prod-rds.cluster-abc.us-east-1.rds.amazonaws.com",
"port": 5432,
"dbname": "analytics"
})
response = sm.create_secret(
Name="prod/rds/pipeline-user", # hierarchical name
Description="RDS credentials for the nightly ETL pipeline",
SecretString=secret_value,
KmsKeyId="alias/data-lake-cmk", # encrypt with CMK (not default key)
Tags=[
{"Key": "environment", "Value": "production"},
{"Key": "team", "Value": "data-engineering"}
]
)
print(f"Secret ARN: {response['ARN']}")
# arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/rds/pipeline-user-AbCdEf
Like Parameter Store, Secrets Manager supports slash-delimited hierarchical names. This is not just cosmetic — IAM policies can grant access to entire subtrees like prod/rds/* or prod/*, enabling fine-grained role-based access where the staging Glue role can only read staging/* secrets and the production role can only read prod/*.
Example names: prod/rds/pipeline-user, prod/snowflake/etl-role, staging/kafka/schema-registry, datalake/prod/redshift, datalake/prod/s3-keys, datalake/staging/rds.
Grant prod-glue-role access to prod/* only. Grant staging-emr-role access to staging/* only. Zero cross-env risk.
This is the most important boto3 pattern in this section: you will write this in every pipeline that connects to a database, Kafka cluster, or any external API. The response has a SecretString field (for text secrets); always json.loads() it to extract individual fields like host, port, password.
import boto3, json
from botocore.exceptions import ClientError
def get_secret(secret_name: str, region: str = "us-east-1") -> dict:
    """Retrieve a JSON secret from AWS Secrets Manager."""
    sm = boto3.client("secretsmanager", region_name=region)
    try:
        response = sm.get_secret_value(SecretId=secret_name)
    except ClientError as e:
        code = e.response["Error"]["Code"]
        if code == "ResourceNotFoundException":
            raise ValueError(f"Secret '{secret_name}' does not exist.")
        elif code == "AccessDeniedException":
            raise PermissionError(f"IAM role lacks permission to read '{secret_name}'.")
        else:
            raise # re-raise unexpected errors
    # SecretString for text secrets (most common), SecretBinary for binary
    return json.loads(response["SecretString"])
# ── Usage — RDS connection ───────────────────────────────────
creds = get_secret("prod/rds/pipeline-user")
# creds is now a plain dict — unpack what you need
DB_HOST = creds["host"]
DB_PORT = creds["port"]
DB_USER = creds["username"]
DB_PASS = creds["password"]
DB_NAME = creds["dbname"]
# Use them to build a JDBC URL for Spark
jdbc_url = f"jdbc:postgresql://{DB_HOST}:{DB_PORT}/{DB_NAME}"
df = spark.read \
.format("jdbc") \
.option("url", jdbc_url) \
.option("user", DB_USER) \
.option("password", DB_PASS) \
.option("dbtable", "public.orders") \
.load()
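A secret that is missing a field tends to fail late, inside the JDBC driver, with an unhelpful message. One possible guard, sketched below, is to validate the dict before building the URL. The helper name build_jdbc_url and the REQUIRED_KEYS tuple are illustrative assumptions that mirror the fields used in the example above; they are not part of boto3.

```python
# Fields the example secret above is expected to contain (assumption for this sketch)
REQUIRED_KEYS = ("host", "port", "username", "password", "dbname")


def build_jdbc_url(creds: dict) -> str:
    """Validate a credentials dict and return a PostgreSQL JDBC URL."""
    missing = [key for key in REQUIRED_KEYS if key not in creds]
    if missing:
        raise KeyError(f"Secret is missing required fields: {missing}")
    # int() rejects a malformed port early; the port may be stored as str or int
    port = int(creds["port"])
    return f"jdbc:postgresql://{creds['host']}:{port}/{creds['dbname']}"


# Sample dict standing in for the result of get_secret("prod/rds/pipeline-user")
sample = {
    "host": "db.example.com",
    "port": "5432",
    "username": "etl_user",
    "password": "not-a-real-password",
    "dbname": "analytics",
}
print(build_jdbc_url(sample))
```

Note that the password never becomes part of the URL; it is still passed separately through the connector's password option.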
In a Glue job, your script runs on a Glue executor. It assumes the Glue execution role automatically — so as long as that role has secretsmanager:GetSecretValue on the secret ARN, calling get_secret_value() inside the Glue script requires no extra configuration. No passing credentials as job parameters — just call the API.
# glue_etl_job.py — this script runs on a Glue executor
import sys, json, boto3
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
# ── Fetch Snowflake creds from Secrets Manager at runtime ──
sm = boto3.client("secretsmanager")
sf_creds = json.loads(
sm.get_secret_value(SecretId="prod/snowflake/etl-role")["SecretString"]
)
# ── Use them in the Spark Snowflake connector ──
sf_options = {
"sfURL": sf_creds["sfURL"],
"sfUser": sf_creds["sfUser"],
"sfPassword": sf_creds["sfPassword"],
"sfDatabase": "ANALYTICS",
"sfWarehouse":"ETL_WH",
"sfSchema": "SILVER",
}
df = spark.read \
.format("net.snowflake.spark.snowflake") \
.options(**sf_options) \
.option("dbtable", "orders") \
.load()
# creds never appear in logs, git, or job parameters — they are fetched live
Never pass credentials as Glue job parameters (e.g. --db_password=mypass): those appear in the Glue console, CloudTrail logs, and are visible to anyone who can describe the job. Always use Secrets Manager.
Lambda is a natural consumer of Secrets Manager. The key best practice in Lambda is to cache the secret outside the handler function — Lambda reuses the same container for multiple invocations, so fetching the secret once and caching it in a module-level variable avoids a Secrets Manager API call on every single invocation.
import json, boto3
sm = boto3.client("secretsmanager")
# ── Module-level cache: fetched once per container, on its first (cold) invocation ──
_DB_CREDS = None
def get_db_creds():
    global _DB_CREDS
    if _DB_CREDS is None:
        raw = sm.get_secret_value(SecretId="prod/rds/pipeline-user")
        _DB_CREDS = json.loads(raw["SecretString"])
    return _DB_CREDS
def lambda_handler(event, context):
    creds = get_db_creds() # uses cached value after first call
    # ... use creds to connect to RDS and process the event ...
    return {"statusCode": 200, "body": "OK"}
# First invocation on a new container: one Secrets Manager API call (a network round trip)
# Subsequent invocations on the same container: read from _DB_CREDS, no API call
# Note: a warm container keeps the cached value even after the secret is rotated
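A module-level cache that never expires can keep serving an old password after a rotation. One way to bound that staleness is a small time-to-live cache, sketched here. The class name TTLSecretCache, its parameters, and the five-minute default are assumptions for illustration; in a real Lambda the fetch callable would wrap the get_secret_value() call shown above.

```python
import time
from typing import Callable, Optional


class TTLSecretCache:
    """Cache a secret dict in memory and refetch it after ttl_seconds."""

    def __init__(self, fetch: Callable[[], dict], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch          # e.g. a function that calls get_secret_value()
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested without sleeping
        self._value: Optional[dict] = None
        self._expires_at = 0.0

    def get(self) -> dict:
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value

    def invalidate(self) -> None:
        """Drop the cached value, e.g. after an authentication failure."""
        self._value = None


# Demo with a fake fetcher so the sketch runs without AWS access
fetch_count = 0


def fake_fetch() -> dict:
    global fetch_count
    fetch_count += 1
    return {"password": f"version-{fetch_count}"}


cache = TTLSecretCache(fake_fetch, ttl_seconds=300)
cache.get()
cache.get()
print(f"Fetches so far: {fetch_count}")
```

The trade-off is between API cost and how long a warm container may hold an outdated credential; a shorter TTL means more calls but faster pickup of rotated values.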
Airflow has a built-in Secrets Manager backend. When configured, Airflow automatically fetches connections and variables from Secrets Manager instead of its metadata database. This means you manage all your production credentials in one place (Secrets Manager) rather than in the Airflow UI — much more secure and auditable.
# In airflow.cfg or MWAA environment variables:
# [secrets]
# backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
# backend_kwargs = {"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables"}
# Then a connection stored as: airflow/connections/my_rds_conn
# and a variable stored as: airflow/variables/my_s3_bucket
# are automatically available in DAGs as:
# BaseHook.get_connection("my_rds_conn")
# Variable.get("my_s3_bucket")
# Manual retrieval inside a DAG task (if not using the backend):
import boto3, json
def my_task(**context):
    sm = boto3.client("secretsmanager")
    creds = json.loads(
        sm.get_secret_value(SecretId="prod/rds/pipeline-user")["SecretString"]
    )
    # build JDBC URL and run Spark job...
Secrets Manager can automatically rotate a secret on a schedule (e.g. every 30 days). Under the hood, it triggers an AWS Lambda function that: (1) creates a new version of the secret containing a new password, (2) sets that password in the target service (e.g. RDS), (3) tests that the new credentials work, and (4) finalizes the rotation by marking the new version as current. Your pipeline code doesn't need to change: the next time it calls get_secret_value(), it gets the new credentials automatically.
import boto3
sm = boto3.client("secretsmanager")
# Enable rotation — AWS has built-in Lambda functions for RDS, Redshift, etc.
sm.rotate_secret(
SecretId="prod/rds/pipeline-user",
RotationLambdaARN="arn:aws:lambda:us-east-1:123456789012:function:SecretsManagerRDSRotation",
RotationRules={
"AutomaticallyAfterDays": 30 # rotate every 30 days
}
)
# For supported services (RDS, Aurora, Redshift, DocumentDB), AWS provides
# pre-built rotation Lambda functions in the Serverless Application Repository.
# For custom services (Snowflake, APIs), you write your own rotation Lambda.
# Check rotation status
secret_meta = sm.describe_secret(SecretId="prod/rds/pipeline-user")
print("Rotation enabled:", secret_meta.get("RotationEnabled"))
print("Last rotated:", secret_meta.get("LastRotatedDate"))
print("Next rotation:", secret_meta.get("NextRotationDate"))
Secrets Manager keeps multiple versions of a secret using staging labels. During rotation, the new version is staged as AWSPENDING, then promoted to AWSCURRENT when tests pass. The old version is moved to AWSPREVIOUS and kept temporarily. Your pipeline always fetches AWSCURRENT by default, but can explicitly request AWSPREVIOUS if needed for rollback.
import boto3, json
sm = boto3.client("secretsmanager")
# Default: fetches AWSCURRENT (latest valid secret)
current = json.loads(sm.get_secret_value(SecretId="prod/rds/pipeline-user")["SecretString"])
# Explicitly fetch the previous version (rollback scenario)
previous = json.loads(sm.get_secret_value(
SecretId="prod/rds/pipeline-user",
VersionStage="AWSPREVIOUS"
)["SecretString"])
# List all versions of a secret to see the labels
versions = sm.list_secret_version_ids(SecretId="prod/rds/pipeline-user")
for v in versions["Versions"]:
    print(v["VersionId"], v.get("VersionStages"))
# e.g:
# abc123 ['AWSCURRENT']
# def456 ['AWSPREVIOUS']
If a connection fails with an authentication error while a rotation is in progress, call get_secret_value() again (to get the freshly rotated current value), and retry, a pattern Secrets Manager's own documentation calls graceful retry on rotation.
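The refetch-and-retry idea can be sketched as a small wrapper. Everything named here is hypothetical: AuthError stands in for whatever authentication exception your database driver raises, get_creds is a callable that returns cached credentials unless asked to force a refresh, and connect is your own connection function.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class AuthError(Exception):
    """Hypothetical stand-in for a driver's authentication failure."""


def connect_with_refresh(get_creds: Callable[[bool], dict],
                         connect: Callable[[dict], T]) -> T:
    """Try cached credentials first; on an auth failure, refetch once and retry."""
    creds = get_creds(False)      # cached value is fine for the first attempt
    try:
        return connect(creds)
    except AuthError:
        fresh = get_creds(True)   # force a new get_secret_value() call
        return connect(fresh)     # a second failure propagates to the caller


# Demo with fakes: the cached password is stale, the refreshed one works
def demo_get_creds(force_refresh: bool) -> dict:
    return {"password": "new"} if force_refresh else {"password": "old"}


def demo_connect(creds: dict) -> str:
    if creds["password"] == "old":
        raise AuthError("password rejected")
    return "connected"


print(connect_with_refresh(demo_get_creds, demo_connect))
```

Retrying only once keeps a genuinely wrong credential from turning into a loop of failed logins that could lock the account.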
For services not supported out of the box (Snowflake, Kafka, external APIs), you write a Lambda with four specific handler cases that Secrets Manager calls in sequence: createSecret, setSecret, testSecret, and finishSecret.
import boto3, json, string, secrets
sm = boto3.client("secretsmanager")
def lambda_handler(event, context):
    step = event["Step"]
    secret_id = event["SecretId"]
    token = event["ClientRequestToken"] # new version ID
    if step == "createSecret":
        # Generate a new password and store as AWSPENDING version
        new_pass = "".join(secrets.choice(string.ascii_letters + string.digits) for _ in range(32))
        current = json.loads(sm.get_secret_value(SecretId=secret_id)["SecretString"])
        current["password"] = new_pass
        sm.put_secret_value(
            SecretId=secret_id, ClientRequestToken=token,
            SecretString=json.dumps(current), VersionStages=["AWSPENDING"]
        )
    elif step == "setSecret":
        # Apply the new password to the actual service (e.g. Snowflake ALTER USER)
        pending = json.loads(sm.get_secret_value(
            SecretId=secret_id, VersionStage="AWSPENDING")["SecretString"])
        # ... call Snowflake / RDS / API to set the new password ...
    elif step == "testSecret":
        # Verify the pending credentials actually work
        pending = json.loads(sm.get_secret_value(
            SecretId=secret_id, VersionStage="AWSPENDING")["SecretString"])
        # ... try connecting with pending creds; raise exception if it fails ...
    elif step == "finishSecret":
        # Promote AWSPENDING → AWSCURRENT: look up which version holds AWSCURRENT now
        stages = sm.describe_secret(SecretId=secret_id)["VersionIdsToStages"]
        current_id = next(v for v, labels in stages.items() if "AWSCURRENT" in labels)
        if current_id != token: # nothing to do if this version was already promoted
            sm.update_secret_version_stage(
                SecretId=secret_id, VersionStage="AWSCURRENT",
                MoveToVersionId=token, RemoveFromVersionId=current_id
            )
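The label movement performed in the finishSecret step can be modelled locally to see what happens to each version. This is a simplified model, not the service's implementation: promote_pending is a hypothetical helper, the version IDs are made up, and the real service moves AWSPREVIOUS on its own and only later cleans up versions that hold no label.

```python
def promote_pending(version_stages: dict, token: str) -> dict:
    """Return a new VersionId -> labels mapping after promoting `token` to AWSCURRENT."""
    if "AWSPENDING" not in version_stages.get(token, []):
        raise ValueError(f"Version {token} is not staged as AWSPENDING")
    result = {}
    for version_id, labels in version_stages.items():
        if version_id == token:
            result[version_id] = ["AWSCURRENT"]       # pending becomes current
        elif "AWSCURRENT" in labels:
            result[version_id] = ["AWSPREVIOUS"]      # old current becomes previous
        else:
            # The older AWSPREVIOUS version loses its label; in this model a
            # version with no labels left is dropped immediately
            remaining = [label for label in labels if label != "AWSPREVIOUS"]
            if remaining:
                result[version_id] = remaining
    return result


# Made-up version IDs in the shape of describe_secret()["VersionIdsToStages"]
before = {
    "v1": ["AWSPREVIOUS"],
    "v2": ["AWSCURRENT"],
    "v3": ["AWSPENDING"],
}
after = promote_pending(before, "v3")
for version_id, labels in after.items():
    print(version_id, labels)
```

Because the function returns a new mapping instead of mutating its input, the before and after states can be compared side by side in a test.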
Both services store sensitive config, but they have different design centers. Secrets Manager is purpose-built for credentials that rotate and where every access must be audited. Parameter Store is cheaper and better for static config that rarely changes.
| Feature | Secrets Manager | Parameter Store |
|---|---|---|
| Primary Use | Rotating credentials (DB, API keys) | Pipeline config, non-rotating values |
| Cost | $0.40/secret/month + $0.05 per 10K API calls | Free (Standard), $0.05/10K calls (Advanced) |
| Automatic Rotation | ✅ Built-in with Lambda integration | ❌ No — you'd build it manually |
| Versioning | ✅ Full version history with staging labels | ✅ Up to 100 versions of history per parameter |
| Secret Size | Up to 64 KB (65,536 bytes) | 4 KB (Standard), 8 KB (Advanced) |
| Cross-Account | ✅ Resource policy enables cross-account access | ⚠️ Advanced-tier parameters only, shared through AWS RAM |
| Audit Logging | Every GetSecretValue call in CloudTrail | Every GetParameter call in CloudTrail |
| API | secretsmanager.* | ssm.get_parameter() |
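A quick way to weigh the cost row is to plug in your own numbers. The sketch below uses only the prices quoted in this section ($0.40 per secret per month plus $0.05 per 10,000 API calls for Secrets Manager, and $0.05 per advanced parameter per month for Parameter Store); the function names and the workload figures are hypothetical, and current AWS pricing should be checked before budgeting.

```python
def secrets_manager_monthly_cost(num_secrets: int, api_calls: int) -> float:
    """USD per month: $0.40 per secret plus $0.05 per 10,000 API calls."""
    return num_secrets * 0.40 + api_calls / 10_000 * 0.05


def advanced_parameter_monthly_cost(num_parameters: int) -> float:
    """USD per month for storing advanced-tier parameters at $0.05 each."""
    return num_parameters * 0.05


# Hypothetical workload: 25 credentials read 2 million times a month
secrets = 25
calls = 2_000_000
print(f"Secrets Manager:          ${secrets_manager_monthly_cost(secrets, calls):.2f}")
print(f"Parameter Store Advanced: ${advanced_parameter_monthly_cost(secrets):.2f} (storage only)")
print("Parameter Store Standard: $0.00 (storage)")
```

The price gap is usually small next to the value of rotation and auditing for real credentials, which is why cost mainly matters for the large volume of non-secret config.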
import boto3, json
sm = boto3.client("secretsmanager", region_name="us-east-1")
# ── 1. CREATE a secret ───────────────────────────────────────
sm.create_secret(
Name="prod/kafka/sasl-credentials",
SecretString=json.dumps({"username": "kafka-user", "password": "s3cr3t"}),
KmsKeyId="alias/data-lake-cmk"
)
# ── 2. GET a secret (the most common call) ───────────────────
creds = json.loads(
sm.get_secret_value(SecretId="prod/kafka/sasl-credentials")["SecretString"]
)
# ── 3. UPDATE (rotate/change) a secret value ─────────────────
sm.put_secret_value(
SecretId="prod/kafka/sasl-credentials",
SecretString=json.dumps({"username": "kafka-user", "password": "n3wP@ss!"})
)
# ── 4. DESCRIBE a secret (metadata, rotation status) ─────────
meta = sm.describe_secret(SecretId="prod/kafka/sasl-credentials")
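print(meta.get("RotationEnabled", False)) # False / True (key may be absent if rotation was never configured)
print(meta["LastChangedDate"]) # datetime
print(meta.get("Tags", [])) # list of {Key, Value}; may be absent when the secret has no tags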
# ── 5. LIST secrets (with paginator) ─────────────────────────
paginator = sm.get_paginator("list_secrets")
for page in paginator.paginate(
    Filters=[{"Key": "tag-key", "Values": ["environment"]}]
):
    for secret in page["SecretList"]:
        print(secret["Name"], secret["LastChangedDate"])
# ── 6. DELETE a secret ──────────────────────────────────────
sm.delete_secret(
SecretId="prod/kafka/sasl-credentials",
RecoveryWindowInDays=30 # 30-day window before permanent deletion
# ForceDeleteWithoutRecovery=True ← immediate deletion (no recovery!)
)
# ── 7. RESTORE a deleted secret (within recovery window) ─────
sm.restore_secret(SecretId="prod/kafka/sasl-credentials")
# ── 8. TAG a secret ─────────────────────────────────────────
sm.tag_resource(
SecretId="prod/kafka/sasl-credentials",
Tags=[{"Key": "cost-center", "Value": "data-platform"}]
)
# ── 9. REPLICATE to another region (DR pattern) ──────────────
sm.replicate_secret_to_regions(
SecretId="prod/rds/pipeline-user",
AddReplicaRegions=[{"Region": "us-west-2", "KmsKeyId": "alias/data-lake-cmk-west"}]
)
# Now your pipelines in us-west-2 read from the local replica — faster & resilient
Grant each role access to only its own secrets using IAM resource ARN patterns. A Glue job for the sales pipeline should never be able to read the finance pipeline's Redshift credentials. A KMS alias cannot be used in the Resource element of an IAM policy, so the KMS statement below matches the key by its alias through the kms:ResourceAliases condition key.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret"
      ],
      "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/sales/*"
    },
    {
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
      "Resource": "arn:aws:kms:us-east-1:123456789012:key/*",
      "Condition": {
        "ForAnyValue:StringEquals": {"kms:ResourceAliases": "alias/data-lake-cmk"}
      }
    }
  ]
}
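When many pipelines each need their own scoped policy, generating the document from a template avoids copy-paste mistakes in the ARN. The following is a minimal sketch under stated assumptions: secret_read_policy is a hypothetical helper, it covers only the Secrets Manager statement, and the account ID and prefix are the placeholder values used in this section.

```python
import json


def secret_read_policy(region: str, account_id: str, name_prefix: str) -> dict:
    """Build a read-only policy limited to secrets under one name prefix."""
    prefix = name_prefix.strip("/")
    if not prefix:
        raise ValueError("name_prefix must not be empty (that would match every secret)")
    resource = f"arn:aws:secretsmanager:{region}:{account_id}:secret:{prefix}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret",
                ],
                "Resource": resource,
            }
        ],
    }


policy = secret_read_policy("us-east-1", "123456789012", "prod/sales")
print(json.dumps(policy, indent=2))
```

Rejecting an empty prefix is a deliberate choice: it stops the helper from silently producing a policy that grants access to all secrets in the account.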
The core pattern is get_secret_value(SecretId=name) → json.loads(SecretString). Use automatic rotation for database passwords; AWS has pre-built Lambda rotators for RDS, Redshift, and DocumentDB. Cache the secret at module level in Lambda to avoid per-invocation API calls. Choose Secrets Manager over Parameter Store when credentials rotate, need cross-account sharing, or require per-access auditing. Always scope IAM to prod/service/* patterns and never grant wildcard access to all secrets.
AWS Systems Manager — Parameter Store
Parameter Store is the lightweight, cost-effective sibling of Secrets Manager. While Secrets Manager is best for rotating credentials, Parameter Store is perfect for pipeline configuration — environment flags, S3 bucket names, Glue job parameters, feature toggles, and anything that drives pipeline behaviour but doesn't need auto-rotation. It integrates natively with Glue, Lambda, EMR, and any boto3 code for free (Standard tier).
AWS Systems Manager Parameter Store is a centralized key-value configuration store. Instead of hard-coding S3 bucket names, database hosts, or feature flags into your Glue job code, you store them in Parameter Store and fetch them at runtime. This means changing a config value doesn't require redeploying code — you update the parameter and the next job run picks it up automatically.
Think of Parameter Store as a shared configuration sheet: every job calls ssm.get_parameter(Name="/myproject/prod/s3_bucket") and gets the latest value. Change the sheet (parameter) once, and all jobs see the update next time they run.
There are two tiers. Standard is free and sufficient for most data engineering use cases. Advanced unlocks larger values, parameter policies (auto-expiry), and higher API throughput — useful for complex frameworks with many parameters.
| Feature | Standard | Advanced |
|---|---|---|
| Cost | Free | $0.05 per advanced parameter/month |
| Value Size | Up to 4 KB | Up to 8 KB |
| Parameters per Account | 10,000 | 100,000 |
| Parameter Policies | ❌ Not available | ✅ Expiration, notification, no-change alerts |
| Throughput | 40 transactions/sec (default) | 1,000 transactions/sec (configurable) |
| SecureString | ✅ Supported (with KMS) | ✅ Supported |
| Best For | Pipeline config, feature flags, S3 paths | Large configs, auto-expiry, high-volume APIs |
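The size limits in the table can be turned into a small pre-flight check that tells you which tier a value needs before you call put_parameter(). This is only a sketch: required_tier is a hypothetical helper, and it assumes the 4 KB and 8 KB limits are measured on the UTF-8 encoded value.

```python
STANDARD_MAX_BYTES = 4 * 1024   # 4 KB limit from the table above
ADVANCED_MAX_BYTES = 8 * 1024   # 8 KB limit from the table above


def required_tier(value: str) -> str:
    """Return the smallest Parameter Store tier that can hold `value`."""
    size = len(value.encode("utf-8"))  # assumption: limit applies to encoded bytes
    if size <= STANDARD_MAX_BYTES:
        return "Standard"
    if size <= ADVANCED_MAX_BYTES:
        return "Advanced"
    raise ValueError(
        f"Value is {size} bytes, larger than the {ADVANCED_MAX_BYTES}-byte Advanced limit"
    )


print(required_tier("my-company-data-lake-prod"))
print(required_tier("x" * 5000))
```

A value that exceeds both limits is a hint that the config belongs in S3 or another store, with only its location kept in Parameter Store.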
Parameter Store has three value types. String is a plain text value. StringList is a comma-separated list of strings. SecureString is a KMS-encrypted value — the AWS equivalent of a secret for values that are mildly sensitive but don't need rotation.
String: s3://my-data-lake, us-east-1, glue-job-v2. Non-sensitive config.
StringList: us-east-1,eu-west-1 or bronze,silver,gold. Multi-value configs.
SecureString: mildly sensitive values, pipeline environment keys, internal API URLs. Not for rotating passwords; use Secrets Manager for those.
import boto3
ssm = boto3.client("ssm", region_name="us-east-1")
# ── String: plain non-sensitive config ──────────────────────────
ssm.put_parameter(
Name="/datalake/prod/s3_bucket",
Value="my-company-data-lake-prod",
Type="String",
Description="Main data lake S3 bucket for production",
Overwrite=True
)
# ── StringList: comma-separated list of values ──────────────────
ssm.put_parameter(
Name="/datalake/prod/active_regions",
Value="us-east-1,eu-west-1,ap-southeast-1",
Type="StringList",
Description="Regions where the pipeline runs",
Overwrite=True
)
# ── SecureString: encrypted with KMS (for mildly sensitive config)
ssm.put_parameter(
Name="/datalake/prod/internal_api_key",
Value="int-api-key-abc123",
Type="SecureString",
KeyId="alias/data-lake-cmk", # optional: use CMK (default: AWS-managed key)
Description="Internal monitoring API key",
Overwrite=True
)
print("All parameters stored.")
A SecureString parameter is stored encrypted using KMS. When you call get_parameter(WithDecryption=True), SSM calls KMS to decrypt the value and returns it in plaintext to your code. If you call it without WithDecryption=True, you get back the raw encrypted blob — which is useless. The caller's IAM role needs both ssm:GetParameter AND kms:Decrypt on the key used to encrypt it.
Think of a SecureString as a message written in invisible ink: calling get_parameter() with WithDecryption=True is you using the UV lamp.
import boto3
ssm = boto3.client("ssm", region_name="us-east-1")
# ✅ CORRECT — WithDecryption=True decrypts the SecureString
response = ssm.get_parameter(
Name="/datalake/prod/internal_api_key",
WithDecryption=True
)
api_key = response["Parameter"]["Value"]
print(api_key) # "int-api-key-abc123" — actual value
# ❌ WRONG — without WithDecryption you get an encrypted blob
bad_response = ssm.get_parameter(
Name="/datalake/prod/internal_api_key"
# WithDecryption defaults to False!
)
print(bad_response["Parameter"]["Value"])
# AQICAHh7... (encrypted gibberish — DO NOT use this)
Forgetting WithDecryption=True on a SecureString is one of the most common boto3 bugs. Your code will run without error but get an encrypted blob as the value, causing downstream connection failures that look mysterious. Always set it explicitly.
A Glue job or Lambda that reads SecureString parameters needs permissions at both layers: SSM to read the parameter and KMS to decrypt it. Here's a minimal IAM policy that grants both. A KMS alias cannot appear in the Resource element, so the KMS statement matches the key by alias through the kms:ResourceAliases condition key:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadSSMParameters",
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameter",
        "ssm:GetParameters",
        "ssm:GetParametersByPath"
      ],
      "Resource": "arn:aws:ssm:us-east-1:123456789012:parameter/datalake/prod/*"
    },
    {
      "Sid": "DecryptSSMSecureString",
      "Effect": "Allow",
      "Action": "kms:Decrypt",
      "Resource": "arn:aws:kms:us-east-1:123456789012:key/*",
      "Condition": {
        "ForAnyValue:StringEquals": {"kms:ResourceAliases": "alias/data-lake-cmk"}
      }
    }
  ]
}
Data pipelines have dozens of configurable values: which S3 bucket to write to, how many Spark partitions to use, whether a feature flag is enabled, the name of the Glue database, the Kafka topic to read from, the DQ threshold percentage. Storing these in Parameter Store means you can change pipeline behaviour without redeploying code — a huge operational advantage in production.
Example: a job reads the 10 tables it must ingest from /pipeline/prod/source_tables (StringList). When the business adds an 11th table, you update the parameter, and the next job run automatically includes it with zero code change or deployment.
Instead of making one API call per parameter, use get_parameters_by_path() to fetch all parameters under a path prefix in a single (paginated) call. This is the production pattern for loading all pipeline config at startup.
import boto3
ssm = boto3.client("ssm", region_name="us-east-1")
def load_pipeline_config(path_prefix: str) -> dict:
    """Load all parameters under a path prefix as a flat dict."""
    config = {}
    paginator = ssm.get_paginator("get_parameters_by_path")
    for page in paginator.paginate(
        Path=path_prefix,
        Recursive=True, # include all nested sub-paths
        WithDecryption=True # decrypt SecureStrings automatically
    ):
        for param in page["Parameters"]:
            # Strip only the leading path prefix to get the short key name
            # (nested parameters keep their sub-path, e.g. "spark/shuffle_partitions")
            short_name = param["Name"][len(path_prefix):].lstrip("/")
            config[short_name] = param["Value"]
    return config
# ── Usage in a Glue job ───────────────────────────────────────────
cfg = load_pipeline_config("/datalake/prod")
print(cfg)
# {
# "s3_bucket": "my-company-data-lake-prod",
# "glue_database": "prod_analytics",
# "spark_partitions": "200",
# "dq_threshold_pct": "95",
# "kafka_topic": "prod.sales.events",
# "active_regions": "us-east-1,eu-west-1,ap-southeast-1",
# "internal_api_key": "int-api-key-abc123" ← SecureString, auto-decrypted
# }
# Now use config values naturally
s3_bucket = cfg["s3_bucket"]
partitions = int(cfg["spark_partitions"])
dq_pct = float(cfg["dq_threshold_pct"])
regions = cfg["active_regions"].split(",") # StringList → Python list
print(f"Writing to: {s3_bucket}")
print(f"Partitions: {partitions}, DQ threshold: {dq_pct}%")
print(f"Active regions: {regions}")
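Because every Parameter Store value arrives as a string, the int() and float() conversions above tend to get scattered through a job. One option is to convert once into a typed object at startup, as sketched here. PipelineConfig and its field list are illustrative assumptions that mirror the sample output above, not an established API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PipelineConfig:
    """Typed view over the flat string dict loaded from Parameter Store."""
    s3_bucket: str
    spark_partitions: int
    dq_threshold_pct: float
    active_regions: Tuple[str, ...]

    @classmethod
    def from_params(cls, params: dict) -> "PipelineConfig":
        try:
            return cls(
                s3_bucket=params["s3_bucket"],
                spark_partitions=int(params["spark_partitions"]),
                dq_threshold_pct=float(params["dq_threshold_pct"]),
                # StringList values are comma-separated; drop stray whitespace and blanks
                active_regions=tuple(
                    r.strip() for r in params["active_regions"].split(",") if r.strip()
                ),
            )
        except KeyError as exc:
            raise ValueError(f"Missing required parameter: {exc.args[0]}") from exc


# Sample dict standing in for the result of load_pipeline_config("/datalake/prod")
raw = {
    "s3_bucket": "my-company-data-lake-prod",
    "spark_partitions": "200",
    "dq_threshold_pct": "95",
    "active_regions": "us-east-1,eu-west-1,ap-southeast-1",
}
config = PipelineConfig.from_params(raw)
print(config)
```

Failing at startup with a clear message about a missing or malformed parameter is usually cheaper than discovering it halfway through a long Spark stage.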
For fetching individual parameters in the middle of a job (e.g., checking a feature flag before an optional step), use get_parameter(). Always use WithDecryption=True even for String types — it's a no-op for non-SecureStrings but saves bugs if the type ever changes.
import boto3
from botocore.exceptions import ClientError
ssm = boto3.client("ssm")
def get_param(name: str, default=None):
    """Get a single parameter, returning default if it doesn't exist."""
    try:
        resp = ssm.get_parameter(Name=name, WithDecryption=True)
        return resp["Parameter"]["Value"]
    except ClientError as e:
        if e.response["Error"]["Code"] == "ParameterNotFound":
            return default
        raise # re-raise unexpected errors
# ── Feature flag check in pipeline ───────────────────────────────
run_dq_checks = get_param("/datalake/prod/feature/run_dq_checks", default="true")
if run_dq_checks.lower() == "true":
    print("Running data quality checks...")
    # ... run checks ...
else:
    print("DQ checks disabled via feature flag, skipping.")
# ── Batch fetch multiple specific parameters ──────────────────────
resp = ssm.get_parameters(
Names=[
"/datalake/prod/s3_bucket",
"/datalake/prod/glue_database",
"/datalake/prod/kafka_topic"
],
WithDecryption=True
)
params = {p["Name"].split("/")[-1]: p["Value"] for p in resp["Parameters"]}
# {"s3_bucket": "...", "glue_database": "...", "kafka_topic": "..."}
# Check for invalid/missing names
if resp.get("InvalidParameters"):
    print(f"WARNING: Parameters not found: {resp['InvalidParameters']}")
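The feature flag check above compares against the literal string "true", so a value such as "True ", "1" or "yes" typed into the console would silently disable the step. A slightly more forgiving parser is sketched below. parse_flag and the accepted spellings are assumptions chosen for illustration; pick whatever set your team agrees on.

```python
TRUE_VALUES = {"true", "1", "yes", "on"}
FALSE_VALUES = {"false", "0", "no", "off"}


def parse_flag(value, default: bool = False) -> bool:
    """Convert a Parameter Store string into a bool, failing loudly on typos."""
    if value is None:
        return default            # parameter not set: fall back to the default
    normalized = str(value).strip().lower()
    if normalized in TRUE_VALUES:
        return True
    if normalized in FALSE_VALUES:
        return False
    raise ValueError(f"Unrecognized feature flag value: {value!r}")


# Values as they might come back from get_param()
print(parse_flag("TRUE"))
print(parse_flag(" off "))
print(parse_flag(None, default=True))
```

Raising on an unknown value, instead of treating it as false, makes a mistyped flag visible in the job logs on the first run.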
Parameter Store allows slash-delimited hierarchical names — not just for organisation but because get_parameters_by_path() lets you fetch all parameters under a prefix in one call. More importantly, IAM policies can scope access to a subtree — your staging Glue role can access /datalake/staging/* but never /datalake/prod/*. This is how you enforce environment isolation without complex per-parameter policies.
Think of it like a file system: just as you can run ls /project/prod/ to see all prod configs, you can call get_parameters_by_path(Path="/project/prod") to get all production parameters in one call. And just as you can set folder permissions, IAM policies can restrict access to entire subtrees.
Here are the most common naming schemes used in production data engineering teams — pick one and apply it consistently across your entire organisation:
Put the environment near the front of the path (/project/env/...), not last. This lets you write IAM policies like Resource: "arn:...parameter/datalake/prod/*" that grant or deny access to all prod parameters but no staging ones.
| Config Type | Recommended Path | Type | Example Value |
|---|---|---|---|
| S3 bucket name | /proj/prod/s3/bucket_name | String | my-lake-prod |
| Glue database | /proj/prod/glue/database | String | prod_analytics |
| Spark partitions | /proj/prod/spark/shuffle_partitions | String | 200 |
| Kafka topic | /proj/prod/kafka/topic | String | prod.sales |
| Source table list | /proj/prod/etl/source_tables | StringList | orders,customers,products |
| Feature flag | /proj/prod/feature/dq_enabled | String | true |
| DQ threshold | /proj/prod/dq/threshold_pct | String | 95.0 |
| Internal API key | /proj/prod/api/monitoring_key | SecureString | key-xyz-123 (encrypted) |
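A naming convention only helps if every job builds paths the same way. One lightweight approach is a pair of helpers that assemble the parameter name and the matching IAM resource pattern from the same inputs. Both function names are hypothetical, and the account ID is the placeholder used throughout this section.

```python
def param_path(project: str, env: str, *parts: str) -> str:
    """Build a /project/env/... parameter name from its segments."""
    cleaned = [segment.strip("/") for segment in (project, env, *parts)]
    if any(not segment for segment in cleaned):
        raise ValueError("path segments must be non-empty")
    return "/" + "/".join(cleaned)


def param_arn_pattern(region: str, account_id: str, project: str, env: str) -> str:
    """IAM resource pattern covering every parameter under /project/env/."""
    # The parameter name's leading slash is not repeated after "parameter"
    return f"arn:aws:ssm:{region}:{account_id}:parameter/{project}/{env}/*"


print(param_path("proj", "prod", "s3", "bucket_name"))
print(param_arn_pattern("us-east-1", "123456789012", "proj", "prod"))
```

Keeping the environment as the second segment in both helpers is what lets a single wildcard cover all of prod while excluding staging.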
import boto3
from botocore.exceptions import ClientError
ssm = boto3.client("ssm", region_name="us-east-1")
# ── 1. CREATE or UPDATE a parameter (Overwrite=True = upsert) ────
ssm.put_parameter(
Name="/datalake/prod/spark/shuffle_partitions",
Value="200",
Type="String",
Description="Spark shuffle partition count for production",
Overwrite=True
)
# ── 2. GET a single parameter ─────────────────────────────────────
resp = ssm.get_parameter(
Name="/datalake/prod/spark/shuffle_partitions",
WithDecryption=True # always set True; no-op for String type
)
value = resp["Parameter"]["Value"] # "200"
ptype = resp["Parameter"]["Type"] # "String"
ver = resp["Parameter"]["Version"] # version number (increments on update)
# ── 3. GET multiple specific parameters in one API call ───────────
multi = ssm.get_parameters(
Names=[
"/datalake/prod/s3/raw_bucket",
"/datalake/prod/glue/database",
"/datalake/prod/kafka/topic"
],
WithDecryption=True
)
# Build a dict: strip full path, keep just the last key segment
params = {p["Name"].rsplit("/", 1)[-1]: p["Value"] for p in multi["Parameters"]}
# Check if any requested names were invalid/missing
if multi["InvalidParameters"]:
    raise ValueError(f"Missing parameters: {multi['InvalidParameters']}")
# ── 4. GET ALL parameters under a path (bulk config load) ─────────
paginator = ssm.get_paginator("get_parameters_by_path")
all_params = {}
for page in paginator.paginate(
    Path="/datalake/prod",
    Recursive=True,
    WithDecryption=True
):
    for p in page["Parameters"]:
        all_params[p["Name"]] = p["Value"]
# ── 5. GET a specific version of a parameter ──────────────────────
# Useful for rollback: fetch a known-good previous version
history_resp = ssm.get_parameter_history(
Name="/datalake/prod/spark/shuffle_partitions",
WithDecryption=True
)
for h in history_resp["Parameters"]:
    print(h["Version"], h["Value"], h["LastModifiedDate"])
# Fetch a specific version directly
v2 = ssm.get_parameter(
Name="/datalake/prod/spark/shuffle_partitions:2", # :version suffix
WithDecryption=True
)["Parameter"]["Value"]
# ── 6. DELETE a parameter ─────────────────────────────────────────
ssm.delete_parameter(Name="/datalake/staging/temp_flag")
# Delete multiple at once (up to 10)
ssm.delete_parameters(
Names=[
"/datalake/staging/old_flag",
"/datalake/staging/deprecated_key"
]
)
# ── 7. DESCRIBE (list parameters with metadata) ───────────────────
desc_paginator = ssm.get_paginator("describe_parameters")
for page in desc_paginator.paginate(
    ParameterFilters=[{
        "Key": "Path",
        "Option": "Recursive",
        "Values": ["/datalake/prod"]
    }]
):
    for p in page["Parameters"]:
        print(p["Name"], p["Type"], p.get("Description", ""))
# ── 8. ADD TAGS to a parameter ────────────────────────────────────
ssm.add_tags_to_resource(
ResourceType="Parameter",
ResourceId="/datalake/prod/spark/shuffle_partitions",
Tags=[
{"Key": "team", "Value": "data-engineering"},
{"Key": "owner", "Value": "pipeline-team"}
]
)
This is the canonical production pattern for a Glue job that reads all its configuration from Parameter Store at startup:
import sys, boto3, json
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
# ── Step 1: Get job arguments (just the env) ──────────────────────
args = getResolvedOptions(sys.argv, ["JOB_NAME", "ENV"])
env = args["ENV"] # "prod" or "staging" — passed at job launch
# ── Step 2: Load ALL config for this environment in one call ──────
ssm = boto3.client("ssm")
def load_config(env: str) -> dict:
    paginator = ssm.get_paginator("get_parameters_by_path")
    cfg = {}
    for page in paginator.paginate(
        Path=f"/datalake/{env}",
        Recursive=True,
        WithDecryption=True
    ):
        for p in page["Parameters"]:
            short = p["Name"].replace(f"/datalake/{env}/", "")
            cfg[short] = p["Value"]
    return cfg
cfg = load_config(env)
# ── Step 3: Use config values in the job ─────────────────────────
RAW_BUCKET = cfg["s3/raw_bucket"]
SILVER_BUCKET = cfg["s3/silver_bucket"]
GLUE_DB = cfg["glue/database"]
PARTITIONS = int(cfg["spark/shuffle_partitions"])
DQ_ENABLED = cfg.get("feature/dq_enabled", "true") == "true"
print(f"[{env.upper()}] Loading from {RAW_BUCKET} → {SILVER_BUCKET}")
print(f"Shuffle partitions: {PARTITIONS}, DQ: {DQ_ENABLED}")
# ── Step 4: Spark job runs with config ────────────────────────────
sc = SparkContext()
glue = GlueContext(sc)
spark = glue.spark_session
spark.conf.set("spark.sql.shuffle.partitions", PARTITIONS)
df = spark.read.parquet(f"s3://{RAW_BUCKET}/bronze/sales/")
# ... transformations ...
df.write.mode("overwrite").parquet(f"s3://{SILVER_BUCKET}/silver/sales/")
The same script runs in every environment: pass ENV=staging or ENV=prod as a job argument. All configuration differences live in Parameter Store, not in the code, so promoting from staging to prod means updating parameters rather than redeploying code.
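Parameter Store returns every value as a string, including StringList entries and numeric settings, so a job usually converts them once at startup. The sketch below shows one way to do that for the key names used in the table earlier. The function name `parse_config` and the default values are assumptions for illustration.

```python
def parse_config(raw: dict) -> dict:
    """Convert raw Parameter Store strings into typed Python values."""
    return {
        # StringList values arrive as one comma-separated string.
        "source_tables": [
            t.strip() for t in raw["etl/source_tables"].split(",") if t.strip()
        ],
        "shuffle_partitions": int(raw["spark/shuffle_partitions"]),
        # Feature flags are plain strings; compare case-insensitively.
        "dq_enabled": raw.get("feature/dq_enabled", "true").strip().lower() == "true",
        "dq_threshold_pct": float(raw.get("dq/threshold_pct", "95.0")),
    }


# Example input shaped like the output of load_config("prod").
raw_cfg = {
    "etl/source_tables": "orders,customers,products",
    "spark/shuffle_partitions": "200",
    "feature/dq_enabled": "True",
    "dq/threshold_pct": "95.0",
}
typed_cfg = parse_config(raw_cfg)
print(typed_cfg)
```

Doing the conversion in one place means a malformed value fails fast at startup instead of halfway through the Spark job.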
Advanced-tier parameters support parameter policies: lifecycle rules that can automatically delete a parameter on a given date, emit an Amazon EventBridge event a set number of days before it expires, or emit an event if it hasn't been updated within a certain time. You can route those events to SNS or another target for alerting. This is useful for API keys or internal tokens that need to be refreshed periodically (but where you manage rotation manually rather than via Secrets Manager).
import boto3, json
ssm = boto3.client("ssm")
# ── Advanced parameter with a fixed expiry date ──────────────────
# (requires Tier="Advanced")
ssm.put_parameter(
Name="/datalake/prod/api/partner_token",
Value="token-abc-xyz-123",
Type="SecureString",
Tier="Advanced", # required for parameter policies
Overwrite=True,
Policies=json.dumps([
{
"Type": "Expiration", # auto-delete after this date
"Version": "1.0",
"Attributes": {
"Timestamp": "2025-12-31T23:59:59.000Z"
}
},
{
"Type": "ExpirationNotification", # alert 14 days before expiry
"Version": "1.0",
"Attributes": {
"Before": "14",
"Unit": "Days"
}
},
{
"Type": "NoChangeNotification", # alert if not updated for 30 days
"Version": "1.0",
"Attributes": {
"After": "30",
"Unit": "Days"
}
}
])
)
print("Advanced parameter with policies created.")
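The example above hard-codes the expiry timestamp. If you want a rolling expiry such as "90 days from now", the policy JSON can be generated instead. This is a sketch under the assumption that the same policy schema applies; the helper name `build_expiry_policies` is hypothetical, and the resulting string would be passed as the Policies argument of put_parameter.

```python
import json
from datetime import datetime, timedelta, timezone


def build_expiry_policies(now: datetime, ttl_days: int, notify_days_before: int) -> str:
    """Build the Policies JSON string for an expiring Advanced-tier parameter."""
    expires_at = now.astimezone(timezone.utc) + timedelta(days=ttl_days)
    policies = [
        {
            "Type": "Expiration",
            "Version": "1.0",
            "Attributes": {"Timestamp": expires_at.strftime("%Y-%m-%dT%H:%M:%S.000Z")},
        },
        {
            "Type": "ExpirationNotification",
            "Version": "1.0",
            "Attributes": {"Before": str(notify_days_before), "Unit": "Days"},
        },
    ]
    return json.dumps(policies)


# A fixed "now" keeps the example reproducible; a real job would pass datetime.now(timezone.utc).
policies_json = build_expiry_policies(
    datetime(2025, 1, 1, tzinfo=timezone.utc), ttl_days=90, notify_days_before=14
)
print(policies_json)
```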
| Scenario | Use | Why |
|---|---|---|
| Database password that rotates every 90 days | Secrets Manager | Built-in auto-rotation with Lambda |
| S3 bucket name for the pipeline | Parameter Store String | Non-sensitive, free, simple |
| Kafka SASL password | Secrets Manager | Sensitive, needs rotation and audit trail |
| Spark shuffle.partitions value | Parameter Store String | Just config, completely non-sensitive |
| Internal monitoring API key (mildly sensitive) | Parameter Store SecureString | Sensitive enough to encrypt, but doesn't rotate |
| List of source tables to process | Parameter Store StringList | Non-sensitive, easy to update |
| Snowflake private key for key-pair auth | Secrets Manager | Highly sensitive, cross-account sharing possible |
| Feature flag (true/false) | Parameter Store String | Free, instant to update, simple to read |
Use hierarchical names (/project/env/component/key) to enable bulk loading and IAM scoping. get_parameters_by_path() loads all pipeline config through one paginated call; use it at job startup. Always set WithDecryption=True on get calls, or SecureString parameters come back as encrypted ciphertext. Use Parameter Store for config; use Secrets Manager for rotating credentials.
AWS Glue — Serverless ETL & Data Catalog
AWS Glue is the central ETL and metadata hub for most AWS data lakes. It provides a fully managed Apache Spark environment (ETL jobs), a central metadata store (Data Catalog), automated schema discovery (Crawlers), and a built-in data quality framework. As a data engineer you'll use Glue daily: running Spark jobs without managing clusters, registering table schemas that Athena and Redshift Spectrum can query, and tracking incremental data loads via job bookmarks.
The Glue Data Catalog is a fully managed metadata store, one per AWS account per Region: essentially a Hive Metastore as a service. It stores schema definitions (databases, tables, columns, data types), partition metadata, and table location (S3 path). Athena, Redshift Spectrum, Glue jobs, and Spark or Hive on EMR all look up table schemas from the same Glue Catalog, making it the single source of truth for your entire data lake. You never re-define schemas in each tool separately.
A Glue database is just a namespace: a logical grouping of tables. A Glue table is a schema definition: column names + types, table location, SerDe (serialization library), input/output formats, and partition keys. The actual data stays in S3; the catalog stores only metadata. Managed tables (rare in Glue) keep their data under the database's default location, and the engine that created them treats the data as its own, so dropping the table can delete the files. External tables (standard practice) point to your own S3 prefix, and dropping them removes only the metadata.
import boto3
glue = boto3.client("glue", region_name="us-east-1")
# ── 1. Create a database ─────────────────────────────────────────
glue.create_database(DatabaseInput={
"Name": "prod_analytics",
"Description": "Production analytics — silver and gold tables",
"LocationUri": "s3://my-data-lake/silver/"
})
# ── 2. Create an external Parquet table ──────────────────────────
glue.create_table(
DatabaseName="prod_analytics",
TableInput={
"Name": "sales_transactions",
"Description": "Cleaned and validated sales transactions",
"StorageDescriptor": {
"Columns": [
{"Name": "txn_id", "Type": "string"},
{"Name": "customer_id", "Type": "bigint"},
{"Name": "amount", "Type": "decimal(18,2)"},
{"Name": "txn_date", "Type": "date"},
{"Name": "product_id", "Type": "string"}
],
"Location": "s3://my-data-lake/silver/sales/",
"InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
"OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
"SerdeInfo": {
"SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
},
"Compressed": True
},
"PartitionKeys": [
{"Name": "year", "Type": "string"},
{"Name": "month", "Type": "string"}
],
"TableType": "EXTERNAL_TABLE"
}
)
print("Table 'sales_transactions' registered in Glue Catalog.")
Every time a Glue Crawler (or update_table()) changes a table's schema, Glue stores the previous schema as a version. You can retrieve any historical version and compare schemas — useful when a bug introduced a wrong column type. Schema evolution works by adding columns (safe — old files are read without the new column) or changing compatible types (e.g. int → bigint). Dropping or renaming columns is a breaking change that can confuse existing queries.
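import boto3
glue = boto3.client("glue")
# 1. Get the current table definition
resp = glue.get_table(DatabaseName="prod_analytics", Name="sales_transactions")
table = resp["Table"]
# 2. Build the TableInput. get_table also returns read-only fields
#    (CreateTime, VersionId, ...) that update_table rejects, so copy only
#    the keys TableInput accepts rather than listing every field to strip.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "ViewOriginalText", "ViewExpandedText",
    "TableType", "Parameters", "TargetTable"
}
table_input = {k: v for k, v in table.items() if k in TABLE_INPUT_KEYS}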
# 3. Add the new column
table_input["StorageDescriptor"]["Columns"].append(
{"Name": "discount_pct", "Type": "double", "Comment": "Applied discount percentage"}
)
# 4. Push the update — Glue saves the old schema as a version
glue.update_table(DatabaseName="prod_analytics", TableInput=table_input)
print("Column 'discount_pct' added. Old schema saved as a version.")
# 5. List schema versions to see history
versions = glue.get_table_versions(
DatabaseName="prod_analytics", TableName="sales_transactions"
)
for v in versions["TableVersions"]:
    ncols = len(v["Table"]["StorageDescriptor"]["Columns"])
    print(f"  Version {v['VersionId']}: {ncols} columns ({v['Table']['UpdateTime']})")
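Counting columns tells you that a schema changed, but not how. A small diff over the Columns lists of two table versions shows which columns were added, removed, or retyped. The helper below is a sketch; `diff_columns` is a hypothetical name, and its inputs are plain lists shaped like StorageDescriptor["Columns"] from get_table_versions.

```python
def diff_columns(old_cols: list, new_cols: list) -> dict:
    """Compare two Glue column lists and report added, removed, and retyped columns."""
    old = {c["Name"]: c["Type"] for c in old_cols}
    new = {c["Name"]: c["Type"] for c in new_cols}
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "type_changed": {
            name: (old[name], new[name])
            for name in sorted(set(old) & set(new))
            if old[name] != new[name]
        },
    }


# Illustrative column lists for two versions of the same table.
v1_cols = [
    {"Name": "txn_id", "Type": "string"},
    {"Name": "customer_id", "Type": "int"},
    {"Name": "amount", "Type": "decimal(18,2)"},
]
v2_cols = [
    {"Name": "txn_id", "Type": "string"},
    {"Name": "customer_id", "Type": "bigint"},
    {"Name": "amount", "Type": "decimal(18,2)"},
    {"Name": "discount_pct", "Type": "double"},
]
schema_diff = diff_columns(v1_cols, v2_cols)
print(schema_diff)
```

An empty "removed" list and only widening entries in "type_changed" indicate a backward-compatible change; anything else deserves a review before downstream queries run.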
Any Spark session on EMR or a self-managed cluster can use the Glue Catalog as its Hive Metastore, with no separate Hive installation needed. You configure SparkSession with two settings: one to select the Hive catalog implementation, and one to replace the Hive metastore client with the Glue Data Catalog client factory. (EMR ships the Glue catalog client; on a self-managed cluster that library must also be on the classpath.) After that, spark.sql("SHOW TABLES IN prod_analytics") queries exactly the same tables that Athena sees.
from pyspark.sql import SparkSession
spark = (SparkSession.builder
.appName("GlueCatalogExample")
# Tell Spark to use Glue Catalog instead of local Hive Metastore
.config("spark.sql.catalogImplementation", "hive")
.config("hive.metastore.client.factory.class",
"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
.enableHiveSupport()
.getOrCreate()
)
# Now Spark SQL sees the same tables as Athena
spark.sql("SHOW DATABASES").show()
spark.sql("USE prod_analytics")
spark.sql("SHOW TABLES").show()
# Read directly using catalog table name — no path needed!
df = spark.sql("SELECT * FROM sales_transactions WHERE year='2024'")
df.show(5)
When you write new partitioned data to S3, the Glue Catalog doesn't know about the new partitions automatically — you must register them. The fastest way is batch_create_partition() (up to 100 partitions per call) or running MSCK REPAIR TABLE via Athena (slow for large tables). Always register new partitions at the end of your ETL job — otherwise Athena queries won't see today's data.
import boto3
from datetime import date, timedelta
glue = boto3.client("glue")
DB = "prod_analytics"
TBL = "sales_transactions"
BASE = "s3://my-data-lake/silver/sales"
# Collect the distinct (year, month) partitions touched in the last 7 days.
# The table is partitioned by year and month only, so several days usually
# map to the same partition; a set avoids submitting duplicates.
today = date.today()
year_months = sorted({
    (str(d.year), f"{d.month:02d}")
    for d in (today - timedelta(days=i) for i in range(7))
})
partitions = []
for yr, mo in year_months:
    partitions.append({
        "Values": [yr, mo],  # must match PartitionKeys order
        "StorageDescriptor": {
            "Location": f"{BASE}/year={yr}/month={mo}/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {"SerializationLibrary":
                "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"},
            "Compressed": True
        }
    })
# batch_create_partition accepts up to 100 partitions per call
for start in range(0, len(partitions), 100):
    chunk = partitions[start:start + 100]
    resp = glue.batch_create_partition(
        DatabaseName=DB, TableName=TBL,
        PartitionInputList=chunk
    )
    for err in resp.get("Errors", []):
        # AlreadyExistsException is fine: the partition is already registered
        if err["ErrorDetail"]["ErrorCode"] != "AlreadyExistsException":
            print(f"ERROR: {err}")
print(f"Submitted {len(partitions)} partitions to the Glue Catalog.")
Make partition registration idempotent: call batch_create_partition() and suppress AlreadyExistsException errors. Your job may retry on failure, and re-registering an existing partition should never cause the job to fail.
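One way to apply this rule is to separate the harmless "already exists" entries from real failures and raise only on the latter. The sketch below works on a response dictionary shaped like the one batch_create_partition returns. The helper name `split_partition_errors` and the sample response are made up for illustration.

```python
def split_partition_errors(response: dict):
    """Split batch_create_partition errors into already-registered and real failures."""
    already_registered, failed = [], []
    for err in response.get("Errors", []):
        code = err.get("ErrorDetail", {}).get("ErrorCode")
        target = already_registered if code == "AlreadyExistsException" else failed
        target.append(err["PartitionValues"])
    return already_registered, failed


# Hypothetical response: one partition already existed, one genuinely failed.
sample_response = {
    "Errors": [
        {"PartitionValues": ["2024", "01"],
         "ErrorDetail": {"ErrorCode": "AlreadyExistsException",
                         "ErrorMessage": "Partition already exists."}},
        {"PartitionValues": ["2024", "02"],
         "ErrorDetail": {"ErrorCode": "InvalidInputException",
                         "ErrorMessage": "Illustrative failure."}},
    ]
}
already, failed = split_partition_errors(sample_response)
print("already registered:", already)
print("failed:", failed)
```

A job would typically log the first list and raise an exception if the second one is not empty.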
A Glue Crawler inspects a data source — S3 prefix, JDBC database, Redshift, DynamoDB — reads sample files, infers column names and data types, and writes (or updates) table definitions in the Glue Catalog automatically. Instead of manually defining every table schema, you point a crawler at your S3 prefix and it builds the catalog entry for you. Crawlers also detect new partitions and schema changes in subsequent runs.
Crawlers support multiple source types. For S3 they sample Parquet/ORC/Avro/CSV/JSON files. For RDS/Redshift they read the JDBC schema directly — column definitions come straight from the database, not file inference. You can have one crawler cover multiple data sources (S3 + RDS) in a single run, writing all discovered tables to the same target database in the Catalog.
import boto3, time
glue = boto3.client("glue")
# ── Create the crawler ───────────────────────────────────────────
glue.create_crawler(
Name="silver-sales-crawler",
Role="arn:aws:iam::123456789012:role/glue-crawler-role",
DatabaseName="prod_analytics", # target Catalog database
Targets={
"S3Targets": [{
"Path": "s3://my-data-lake/silver/sales/",
"Exclusions": ["**/_temporary/**", "**/_spark_metadata/**"]
}]
},
TablePrefix="silver_", # tables created as silver_sales, etc.
SchemaChangePolicy={
# Incremental crawls (CRAWL_NEW_FOLDERS_ONLY) require LOG for both behaviors
"UpdateBehavior": "LOG", # log schema changes, don't rewrite the table
"DeleteBehavior": "LOG" # log deleted tables, don't auto-delete
},
RecrawlPolicy={"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"}, # incremental
Schedule="cron(0 6 * * ? *)" # daily at 06:00 UTC
)
# ── Start the crawler on-demand (outside of schedule) ───────────
glue.start_crawler(Name="silver-sales-crawler")
# ── Poll until READY (boto3 has no Glue waiter, so poll manually) ──
def wait_for_crawler(name: str, poll_sec: int = 15, timeout: int = 600):
    start = time.time()
    while time.time() - start < timeout:
        crawler = glue.get_crawler(Name=name)["Crawler"]
        state = crawler["State"]  # RUNNING -> STOPPING -> READY
        print(f"  Crawler state: {state}")
        if state == "READY":
            # READY only means the crawler is idle; check how the last run ended
            last_status = crawler.get("LastCrawl", {}).get("Status")
            if last_status != "SUCCEEDED":
                raise RuntimeError(f"Crawler {name} ended with status {last_status}")
            print("✅ Crawler finished.")
            return
        time.sleep(poll_sec)
    raise TimeoutError(f"Crawler did not finish within {timeout}s")
wait_for_crawler("silver-sales-crawler")
By default (RecrawlBehavior: "CRAWL_EVERYTHING"), a crawler re-scans the whole target on every run, which is wasteful for large data lakes where only a new partition arrives daily. Setting RecrawlBehavior: "CRAWL_NEW_FOLDERS_ONLY" makes the crawler skip already-catalogued folders and process only S3 prefixes added since the last run, which can cut crawler runtime dramatically on mature data lakes.
CRAWL_NEW_FOLDERS_ONLY: daily incremental runs that pick up new partitions. Fast, but it will not notice schema changes in folders it has already crawled.
CRAWL_EVERYTHING: monthly/weekly full refresh to catch schema drift or file format changes. Slow but thorough.
Crawlers automatically detect Hive-style partitions (year=2024/month=01/) and register them. For non-standard formats (custom CSV with unusual delimiters, nested JSON, proprietary log formats), you attach a Custom Classifier: a grok pattern, JSON path expression, XML row tag, or CSV definition that tells the crawler how to read the file and what schema it has.
import boto3
glue = boto3.client("glue")
# Custom CSV classifier — pipe-delimited, quoted, with header
glue.create_classifier(
CsvClassifier={
"Name": "pipe-delimited-csv",
"Delimiter": "|",
"QuoteSymbol": '"',
"ContainsHeader": "PRESENT", # first row is header
"AllowSingleColumn": False
}
)
# Attach to a crawler
glue.create_crawler(
Name="legacy-csv-crawler",
Role="arn:aws:iam::123456789012:role/glue-crawler-role",
DatabaseName="raw_landing",
Targets={"S3Targets": [{"Path": "s3://my-lake/raw/legacy-exports/"}]},
Classifiers=["pipe-delimited-csv"] # applies before built-in classifiers
)
Glue's primary job type is a Spark ETL job — Apache Spark running on AWS-managed infrastructure. You provide a PySpark script (uploaded to S3), configure DPUs (Data Processing Units — each DPU = 4 vCPUs + 16 GB RAM), and Glue manages the cluster for you. There's no EC2 to manage, no YARN to configure. Glue also provides the GlueContext wrapper which adds Catalog integration, job bookmarks, and DynamicFrame support on top of the standard SparkContext.
import sys
import boto3
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql import functions as F
# ── 1. Parse job arguments ───────────────────────────────────────
args = getResolvedOptions(sys.argv, [
"JOB_NAME", # always required for job bookmarks
"source_bucket",
"target_bucket",
"env"
])
# ── 2. Initialize Glue and Spark contexts ────────────────────────
sc = SparkContext()
glue = GlueContext(sc)
spark = glue.spark_session
job = Job(glue)
job.init(args["JOB_NAME"], args) # init required for bookmarks
# ── 3. Read data (standard Spark or DynamicFrame) ────────────────
df = spark.read.parquet(
f"s3://{args['source_bucket']}/bronze/orders/"
)
# ── 4. Transform ─────────────────────────────────────────────────
df_clean = (df
.dropDuplicates(["order_id"])
.filter(F.col("amount") > 0)
.withColumn("load_date", F.current_date())
)
# ── 5. Write ─────────────────────────────────────────────────────
df_clean.write.mode("overwrite").parquet(
f"s3://{args['target_bucket']}/silver/orders/"
)
# ── 6. Commit job (required — marks bookmark checkpoint) ─────────
job.commit()
print("Job committed successfully.")
Glue jobs accept user-defined parameters (key-value strings prefixed with --) that your script reads via getResolvedOptions(). You also pass Spark configuration overrides via --conf arguments — for example overriding shuffle partitions, executor memory, or enabling dynamic partition overwrite. Parameters set at job creation are defaults; you can override them per-run via start_job_run(Arguments={...}).
import boto3
glue = boto3.client("glue")
response = glue.start_job_run(
JobName="silver-orders-etl",
Arguments={
# User-defined parameters — read by getResolvedOptions()
"--source_bucket": "my-lake-raw",
"--target_bucket": "my-lake-silver",
"--env": "prod",
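# Spark configuration overrides for this run. Glue takes a single --conf
# key, so a common workaround chains further settings inside its value.
"--conf": (
"spark.sql.shuffle.partitions=200 "
"--conf spark.sql.sources.partitionOverwriteMode=dynamic "
"--conf spark.serializer=org.apache.spark.serializer.KryoSerializer"
),
# Glue-specific options
"--enable-metrics": "true", # push metrics to CloudWatch
"--enable-continuous-cloudwatch-log": "true", # real-time log streaming
"--job-bookmark-option": "job-bookmark-enable"
},
# Glue 2.0+ jobs are sized by worker type and count, not MaxCapacity
WorkerType="G.1X", # 1 DPU per worker (4 vCPUs + 16 GB RAM)
NumberOfWorkers=10 # 10 workers = 10 DPUs (40 vCPUs + 160 GB RAM)
)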
run_id = response["JobRunId"]
print(f"Started job run: {run_id}")
The Glue Job Bookmark is the most powerful and most misunderstood Glue feature. When enabled, Glue tracks which S3 files (or JDBC rows) the job has already processed. On the next run, it automatically skips previously processed data and reads only new files — giving you incremental ETL without any watermark code. The bookmark state is stored in AWS-managed storage and linked to the job name + run ID chain.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue = GlueContext(sc)
job = Job(glue)
job.init(args["JOB_NAME"], args) # ← bookmark state restored here
# Read using DynamicFrame — bookmark tracking works automatically
# Glue will only read files it hasn't seen in previous successful runs
raw_dyf = glue.create_dynamic_frame.from_options(
connection_type="s3",
connection_options={
"paths": ["s3://my-lake/raw/landing/"],
"recurse": True
},
format="parquet",
transformation_ctx="raw_data" # ← name used as bookmark key
)
print(f"New records this run: {raw_dyf.count()}")
# ... transform and write ...
job.commit() # ← bookmark state saved — next run picks up from here
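Reset the bookmark after a full reload, using glue.reset_job_bookmark(JobName=...) in boto3 or aws glue reset-job-bookmark from the CLI. The --job-bookmark-option argument itself only accepts job-bookmark-enable, job-bookmark-disable, and job-bookmark-pause.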
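To build intuition for how bookmarks behave on S3 sources, it helps to model them as a checkpoint on object modification times. The toy sketch below is a deliberately simplified model, not Glue's actual implementation: the function name `select_new_files`, the file keys, and the timestamps are all invented for illustration.

```python
from datetime import datetime, timezone


def select_new_files(listing: dict, checkpoint):
    """Return the keys modified after the checkpoint (every key if there is none)."""
    return sorted(
        key for key, modified in listing.items()
        if checkpoint is None or modified > checkpoint
    )


# Simulated S3 listing: object key -> last-modified time.
listing = {
    "raw/landing/a.parquet": datetime(2024, 1, 15, 6, 0, tzinfo=timezone.utc),
    "raw/landing/b.parquet": datetime(2024, 1, 16, 6, 0, tzinfo=timezone.utc),
}

# Run 1: no checkpoint yet, so everything is read.
first_run = select_new_files(listing, checkpoint=None)
checkpoint = max(listing.values())  # what job.commit() would persist in this model

# A new file lands before run 2.
listing["raw/landing/c.parquet"] = datetime(2024, 1, 17, 6, 0, tzinfo=timezone.utc)
second_run = select_new_files(listing, checkpoint)

print(first_run)
print(second_run)
```

The model also shows why a file that is overwritten in place after the checkpoint gets picked up again, and why a failed run that never commits leaves the checkpoint where it was.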
When a Glue Spark job writes partitioned Parquet to S3, the new partition folders appear in S3 but the Glue Catalog doesn't know about them yet. You must register the partitions at the end of the job using either batch_create_partition() (boto3) or spark.sql("MSCK REPAIR TABLE prod_analytics.sales_transactions") — the latter is slower but simpler for many partitions.
# After writing partitioned data in the Glue job, repair the table
# (this scans S3 and adds any missing partitions to the Glue Catalog)
spark.sql("MSCK REPAIR TABLE prod_analytics.sales_transactions")
print("Partitions registered in Glue Catalog.")
# OR — more efficient for daily jobs: only add today's partition
from datetime import date
today = date.today()
spark.sql(f"""
ALTER TABLE prod_analytics.sales_transactions
ADD IF NOT EXISTS PARTITION (year='{today.year}', month='{today.month:02d}')
LOCATION 's3://my-lake/silver/sales/year={today.year}/month={today.month:02d}/'
""")
In production, source schemas change: new columns appear, types widen. Glue's DynamicFrame copes with this better than a plain Spark DataFrame because each record is self-describing, so no schema has to be fixed before the data is read. For DataFrame-based pipelines, set the mergeSchema option to true when reading Parquet and handle missing columns explicitly with withColumn defaults.
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
# Read with mergeSchema — handles files with different column sets
df = spark.read.option("mergeSchema", True).parquet(
"s3://my-lake/raw/sales/"
)
# Safely handle columns that may not exist in older files
if "discount_pct" not in df.columns:
    df = df.withColumn("discount_pct", F.lit(0.0).cast(DoubleType()))
# Cast to expected types in case source widened (e.g. int → bigint)
df = df.withColumn("order_id", F.col("order_id").cast("bigint"))
print(f"Schema after evolution handling: {df.schema.simpleString()}")
Each DPU costs money, so you want to allocate enough to run fast without over-provisioning. With job metrics enabled, Glue emits metrics to CloudWatch: glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors reports how many executors the job would need to run its current load at full parallelism, and glue.driver.ExecutorAllocationManager.executors.numberAllExecutors reports how many are actually running. If the job never needs more than 3 executors but you allocated 10 DPUs (with G.1X workers, 9 executors + 1 driver), you're over-provisioned. Profile first with 2 DPUs, then scale based on actual usage.
| DPU Count | vCPUs | RAM | Typical Use Case |
|---|---|---|---|
| 2 DPU | 8 | 32 GB | Dev/test, small files (<500 MB) |
| 5 DPU | 20 | 80 GB | Medium daily ETL (1–10 GB) |
| 10 DPU | 40 | 160 GB | Large daily ETL (10–100 GB) |
| 20+ DPU | 80+ | 320+ GB | Full loads, large joins, >100 GB data |
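The figures in the table follow directly from the 4 vCPUs + 16 GB per DPU rule, and the same arithmetic gives a rough cost estimate. In the sketch below the price per DPU-hour is a placeholder assumption (check current pricing for your Region), the executor count assumes G.1X workers with one worker acting as the driver, and `dpu_resources` is a hypothetical helper name.

```python
VCPUS_PER_DPU = 4
RAM_GB_PER_DPU = 16


def dpu_resources(dpus: int, price_per_dpu_hour: float, runtime_minutes: float) -> dict:
    """Estimate resources and cost for a Glue job run of the given size."""
    return {
        "vcpus": dpus * VCPUS_PER_DPU,
        "ram_gb": dpus * RAM_GB_PER_DPU,
        # Assumption: G.1X workers, one of which hosts the Spark driver.
        "executors": max(dpus - 1, 0),
        "cost_usd": round(dpus * price_per_dpu_hour * runtime_minutes / 60, 2),
    }


# 0.44 USD per DPU-hour is an illustrative placeholder, not a quoted price.
estimate = dpu_resources(dpus=10, price_per_dpu_hour=0.44, runtime_minutes=30)
print(estimate)
```

Running the same estimate for 5 and 10 DPUs next to the measured runtimes makes it easy to see whether doubling capacity actually halves the duration or just doubles the bill.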
To read from or write to RDS, Redshift, or any JDBC source, Glue uses a Connection — a named config object that stores the JDBC URL, credentials reference (Secrets Manager), and VPC network settings. When you run a Glue job that uses a Connection, Glue provisions an elastic network interface in your VPC subnet so the Spark executors can reach the private RDS/Redshift endpoint without traffic leaving AWS.
import boto3
glue = boto3.client("glue")
glue.create_connection(
ConnectionInput={
"Name": "prod-rds-postgres",
"Description": "Production RDS PostgreSQL analytics database",
"ConnectionType": "JDBC",
"ConnectionProperties": {
"JDBC_CONNECTION_URL": "jdbc:postgresql://prod-rds.cluster-abc.us-east-1.rds.amazonaws.com:5432/analytics",
# Reference the Secrets Manager secret by name; Glue reads the username
# and password from it at run time. (The {{resolve:...}} syntax is a
# CloudFormation feature and is not expanded by the boto3 API.)
"SECRET_ID": "prod/rds/glue-user"
},
"PhysicalConnectionRequirements": {
"SubnetId": "subnet-0abc123", # private subnet in your VPC
"SecurityGroupIdList": ["sg-0xyz789"], # SG allowing outbound to RDS
"AvailabilityZone": "us-east-1a"
}
}
)
print("Glue Connection 'prod-rds-postgres' created.")
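# Inside the Glue ETL job script: read from RDS via the JDBC connection.
# Here `glue` is the GlueContext created in the job script, not the boto3 client.
customers_dyf = glue.create_dynamic_frame.from_catalog(
database="prod_analytics",
table_name="raw_customers", # Glue Catalog table pointing to RDS
additional_options={"jobBookmarkKeys": ["updated_at"]},
transformation_ctx="raw_customers" # required for bookmarks to track this read
)
# OR: direct JDBC read with partitioning for parallelism.
# db_user and db_pass are assumed to have been loaded earlier in the job,
# for example from Secrets Manager; never hard-code them.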
customers_df = (spark.read
.format("jdbc")
.option("url", "jdbc:postgresql://prod-rds:5432/analytics")
.option("dbtable", "public.customers")
.option("user", db_user)
.option("password", db_pass)
.option("partitionColumn", "customer_id") # enables parallel reads
.option("lowerBound", "1")
.option("upperBound", "10000000")
.option("numPartitions", "10") # 10 parallel JDBC tasks
.option("driver", "org.postgresql.Driver")
.load()
)
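The partitionColumn, lowerBound, upperBound, and numPartitions options can be confusing: the bounds do not filter rows, they only decide how the column range is cut into per-task WHERE clauses. The sketch below is a simplified model of that splitting, written in plain Python. Spark's real implementation handles rounding and more column types, and the function name is hypothetical.

```python
def jdbc_partition_predicates(column: str, lower_bound: int, upper_bound: int,
                              num_partitions: int) -> list:
    """Approximate the WHERE clauses a partitioned JDBC read issues, one per task."""
    if num_partitions < 2:
        raise ValueError("partitioned reads need at least 2 partitions")
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    boundary = lower_bound
    for i in range(num_partitions):
        lower_clause = f"{column} >= {boundary}" if i > 0 else None
        boundary += stride
        upper_clause = f"{column} < {boundary}" if i < num_partitions - 1 else None
        if lower_clause and upper_clause:
            predicates.append(f"{lower_clause} AND {upper_clause}")
        elif upper_clause:
            # First partition: open-ended below, and it also collects NULL keys.
            predicates.append(f"{upper_clause} OR {column} IS NULL")
        else:
            # Last partition: open-ended above.
            predicates.append(lower_clause)
    return predicates


for clause in jdbc_partition_predicates("customer_id", 0, 100, 4):
    print(clause)
```

Because the first and last ranges are open-ended, rows outside the bounds are still read; badly chosen bounds only make those two tasks much larger than the rest.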
When your source is in a private subnet (no internet access), you need a Glue Connection with the correct subnet and security group configuration. Glue creates an ENI (Elastic Network Interface) in your VPC, which means your Glue executors have a private IP in your VPC and can reach RDS, Redshift, or Kafka on private IP addresses — exactly as if the Spark cluster were physically inside your network.
AWS Glue has a native Data Quality service — no third-party library needed. You define rulesets (collections of rules) using DQDL (Data Quality Definition Language), a simple English-like syntax. Rules run against a Glue table or DynamicFrame and produce a score (0–1). You can fail the pipeline if the score drops below your threshold, or just log results to an audit table.
| Rule | What it checks |
|---|---|
| IsComplete "customer_id" | Checks for nulls in a column. |
| IsUnique "txn_id" | Checks for duplicate values. |
| DataFreshness "load_dt" <= 24 hours | Checks recency of data. |
| ColumnValues "amount" > 0 | Checks value range rules. |
# DQDL — Data Quality Definition Language
# This is a string you pass to Glue, not Python
Rules = [
# Completeness — no nulls in critical columns
IsComplete "customer_id",
IsComplete "txn_id",
IsComplete "amount",
# Uniqueness — no duplicate transaction IDs
IsUnique "txn_id",
# Value ranges — business rules
ColumnValues "amount" between 0.01 and 1000000,
ColumnValues "discount_pct" between 0 and 1,
# Regex matching: basic format checks (DQDL uses ColumnValues ... matches)
ColumnValues "email" matches "^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z]{2,}$",
# Allowed values: status must be one of a fixed set
ColumnValues "status" in ["PENDING","COMPLETED","CANCELLED","REFUNDED"],
# Row count — detect data drops
RowCount >= 1000,
# Completeness ratio — allow up to 2% nulls in optional fields
Completeness "notes" >= 0.98,
# Data freshness — data should be from last 48 hours
DataFreshness "created_at" <= 48 hours
]
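Because a ruleset is just a string, it can be generated from configuration instead of being written by hand for every table. The sketch below assembles a small ruleset from lists of column names; `build_dqdl` is a hypothetical helper, and it covers only three of the rule types shown above.

```python
def build_dqdl(required_columns: list, unique_columns: list, min_rows: int) -> str:
    """Assemble a DQDL ruleset string from simple per-table settings."""
    rules = [f'IsComplete "{col}"' for col in required_columns]
    rules += [f'IsUnique "{col}"' for col in unique_columns]
    rules.append(f"RowCount >= {min_rows}")
    body = ",\n    ".join(rules)
    return f"Rules = [\n    {body}\n]"


ruleset = build_dqdl(
    required_columns=["customer_id", "txn_id"],
    unique_columns=["txn_id"],
    min_rows=1000,
)
print(ruleset)
```

The resulting string can be passed as the Ruleset argument when creating or updating a ruleset, which keeps per-table thresholds in config rather than scattered across scripts.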
import boto3, time, json
glue = boto3.client("glue")
dynamo = boto3.resource("dynamodb")
tbl = dynamo.Table("pipeline-dq-audit")
DQDL_RULES = """
Rules = [
IsComplete "customer_id",
IsComplete "txn_id",
IsUnique "txn_id",
ColumnValues "amount" between 0.01 and 1000000,
RowCount >= 1000
]
"""
# ── 1. Create the ruleset (idempotent) ──────────────────────────
try:
    glue.create_data_quality_ruleset(
        Name="sales-transactions-dq",
        Ruleset=DQDL_RULES,
        TargetTable={
            "TableName": "sales_transactions",
            "DatabaseName": "prod_analytics"
        }
    )
except glue.exceptions.AlreadyExistsException:
    glue.update_data_quality_ruleset(
        Name="sales-transactions-dq", Ruleset=DQDL_RULES
    )
# ── 2. Run the evaluation ────────────────────────────────────────
run_resp = glue.start_data_quality_ruleset_evaluation_run(
DataSource={"GlueTable": {
"TableName": "sales_transactions", "DatabaseName": "prod_analytics"
}},
Role="arn:aws:iam::123456789012:role/glue-dq-role",
RulesetNames=["sales-transactions-dq"],
NumberOfWorkers=5
)
run_id = run_resp["RunId"]
# ── 3. Poll until finished ───────────────────────────────────────
DQ_TERMINAL = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}
for _ in range(40):  # 40 polls x 15 s = 10 minutes at most
    run_detail = glue.get_data_quality_ruleset_evaluation_run(RunId=run_id)
    status = run_detail["Status"]
    print(f"DQ run status: {status}")
    if status in DQ_TERMINAL:
        break
    time.sleep(15)
else:
    raise TimeoutError(f"DQ run {run_id} did not finish within 10 minutes")
# ── 4. Check rule results ────────────────────────────────────────
results = run_detail.get("ResultIds", [])
overall_pass = True
dq_score = 0.0
if results:
    result_detail = glue.get_data_quality_result(ResultId=results[0])
    rule_results = result_detail["RuleResults"]
    dq_score = result_detail.get("Score", 0.0)
    print(f"\nOverall DQ Score: {dq_score:.2%}")
    for r in rule_results:
        icon = "✅" if r["Result"] == "PASS" else "❌"
        print(f"  {icon} {r['Name']}: {r['Result']} ({r.get('Description','')})")
        if r["Result"] != "PASS":  # a rule can FAIL or ERROR
            overall_pass = False
# ── 5. Write DQ results to DynamoDB audit table ──────────────────
from datetime import datetime, timezone
from decimal import Decimal
tbl.put_item(Item={
"run_id": run_id,
"table": "sales_transactions",
"ts": datetime.now(timezone.utc).isoformat(), # utcnow() is deprecated in recent Python
"score": Decimal(str(round(dq_score, 4))),
"passed": overall_pass
})
# ── 6. Fail the pipeline if score below threshold ────────────────
DQ_THRESHOLD = 0.95
if dq_score < DQ_THRESHOLD:
    raise ValueError(
        f"DQ score {dq_score:.2%} below threshold {DQ_THRESHOLD:.0%}. Pipeline halted."
    )
The cleanest pattern is to run DQ checks inside the Glue Spark job after writing data to S3 but before registering partitions in the Catalog or sending success alerts. If DQ fails, the partition is never registered — downstream queries simply don't see the bad data, and the job fails with a clear error that triggers an alert.
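The ordering described above (write, evaluate, then register) can be captured in a small gate function. The sketch below uses injected callables so it runs without AWS; `run_with_dq_gate` and the stub functions are hypothetical names, and in a real job the three steps would be the Spark write, the Glue Data Quality evaluation, and the partition registration.

```python
def run_with_dq_gate(write_data, evaluate_quality, register_partitions,
                     threshold: float = 0.95) -> float:
    """Write data, check its DQ score, and register partitions only if it passes."""
    write_data()
    score = evaluate_quality()
    if score < threshold:
        # Partitions stay unregistered, so downstream queries never see the bad data.
        raise ValueError(
            f"DQ score {score:.2%} below threshold {threshold:.0%}; partitions not registered."
        )
    register_partitions()
    return score


# Stubs that just record the order in which the steps run.
calls = []
good_score = run_with_dq_gate(
    write_data=lambda: calls.append("write"),
    evaluate_quality=lambda: (calls.append("dq"), 0.98)[1],
    register_partitions=lambda: calls.append("register"),
)
print(calls, good_score)

bad_calls = []
try:
    run_with_dq_gate(
        write_data=lambda: bad_calls.append("write"),
        evaluate_quality=lambda: (bad_calls.append("dq"), 0.80)[1],
        register_partitions=lambda: bad_calls.append("register"),
    )
except ValueError as exc:
    print("halted:", exc)
```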
import boto3, time
from botocore.exceptions import ClientError
glue = boto3.client("glue")
# ── CREATE a Glue Spark job ───────────────────────────────────────
glue.create_job(
Name="silver-orders-etl",
Role="arn:aws:iam::123456789012:role/glue-etl-role",
Command={
"Name": "glueetl", # Spark job type
"ScriptLocation": "s3://my-scripts/etl/silver_orders.py",
"PythonVersion": "3"
},
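GlueVersion="4.0", # Spark 3.3 + Python 3.10
WorkerType="G.1X", # Glue 2.0+ jobs use WorkerType + NumberOfWorkers, not MaxCapacity
NumberOfWorkers=10, # 10 G.1X workers = 10 DPUs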
Timeout=60, # minutes before auto-kill
DefaultArguments={
"--job-bookmark-option": "job-bookmark-enable",
"--enable-metrics": "",
"--env": "prod"
},
Connections={"Connections": ["prod-rds-postgres"]}
)
# ── START a job run ───────────────────────────────────────────────
resp = glue.start_job_run(JobName="silver-orders-etl")
run_id = resp["JobRunId"]
print(f"Started: {run_id}")
# ── POLL until terminal state ─────────────────────────────────────
TERMINAL = {"SUCCEEDED", "FAILED", "ERROR", "TIMEOUT", "STOPPED"}
while True:
    run = glue.get_job_run(JobName="silver-orders-etl", RunId=run_id)
    state = run["JobRun"]["JobRunState"]
    print(f" State: {state}")
    if state in TERMINAL:
        break
    time.sleep(30)
if state != "SUCCEEDED":
    error = run["JobRun"].get("ErrorMessage", "unknown")
    raise RuntimeError(f"Glue job {run_id} {state}: {error}")
# ── LIST run history with paginator ──────────────────────────────
paginator = glue.get_paginator("get_job_runs")
for page in paginator.paginate(JobName="silver-orders-etl"):
    for run in page["JobRuns"]:
        print(f" {run['JobRunId']}: {run['JobRunState']} - {run.get('CompletedOn','')}")
# ── STOP a running job ────────────────────────────────────────────
glue.batch_stop_job_run(
JobName="silver-orders-etl",
JobRunIds=[run_id]
)
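The polling loop above runs forever if the job never reaches a terminal state. A minimal sketch of a reusable poller with a deadline follows; wait_for_terminal is a hypothetical helper (not part of boto3), and the state lookup is passed in as a callable so the same function can serve Glue, EMR, or Athena.

```python
import time
from typing import Callable, Iterable


def wait_for_terminal(
    get_state: Callable[[], str],
    terminal_states: Iterable[str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_state() until it returns a terminal state or the deadline passes."""
    terminal = set(terminal_states)
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state in terminal:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {state!r} after {timeout_seconds}s")
        sleep(poll_seconds)
```

With the Glue client from the block above, the state lookup would be something like `lambda: glue.get_job_run(JobName="silver-orders-etl", RunId=run_id)["JobRun"]["JobRunState"]`. The `sleep` parameter exists only so the loop can be tested without waiting.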
CRAWL_NEW_FOLDERS_ONLY for daily incremental runs. Spark ETL Jobs run managed Spark — no cluster management; pass params via -- arguments, override Spark conf inline. Job Bookmarks = incremental processing without watermark code — but understand their file-timestamp-based mechanics. Connections = JDBC + VPC config for private RDS/Redshift access; always add self-referencing SG rule. Glue Data Quality = write DQDL rulesets, run evaluations after every ETL load, fail the pipeline on score drop, write results to DynamoDB audit table.
AWS EMR — Spark in the Cloud
EMR (Elastic MapReduce) is AWS's managed big-data platform. It lets you run Apache Spark (and Hadoop, Hive, Presto, etc.) on auto-provisioned EC2 clusters — or completely serverlessly — without managing JVMs, YARN configs, or OS patches yourself. As a Data Engineer, EMR is where your PySpark jobs run at production scale on AWS.
Every EMR cluster is made up of three kinds of nodes, each playing a distinct role:
| Node Type | Role | What Runs Here | Can You Lose It? |
|---|---|---|---|
| Master Node | Brain of the cluster | YARN ResourceManager, HDFS NameNode, Spark Driver (client mode), EMR control daemons | No — cluster dies |
| Core Nodes | Workers + HDFS storage | YARN NodeManager, HDFS DataNode, Spark Executors | No — HDFS data loss |
| Task Nodes | Extra compute only | YARN NodeManager, Spark Executors — no HDFS | Yes — safe for Spot |
AWS offers three ways to run Spark with EMR (EMR on EC2, EMR on EKS, and EMR Serverless), each with a different trade-off between control and operational overhead.
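When you submit a Spark job to EMR, you choose where the Spark Driver runs: inside a YARN container on the cluster (cluster mode) or on the machine doing the submission (client mode). For an EMR step, the submitting machine is the master node, which is why the node table lists the driver there for client mode: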
| Deploy Mode | Driver Location | Stdout/Logs | Use When |
|---|---|---|---|
| cluster | Runs inside YARN on a cluster node | Must fetch from YARN logs | Production — submitting and disconnecting |
| client | Runs on the machine submitting the job | Streams to your terminal | Dev/debug — interactive feedback needed |
# Submit a PySpark job to a running EMR cluster
aws emr add-steps \
--cluster-id j-2AXXXXXXGAPLF \
--steps '[{
"Name": "Daily Sales ETL",
"ActionOnFailure": "CONTINUE",
"HadoopJarStep": {
"Jar": "command-runner.jar",
"Args": [
"spark-submit",
"--deploy-mode", "cluster",
"--conf", "spark.executor.memory=8g",
"--conf", "spark.executor.cores=4",
"--conf", "spark.sql.shuffle.partitions=200",
"s3://my-code-bucket/scripts/daily_sales_etl.py",
"--date", "2024-01-15",
"--env", "prod"
]
}
}]'
command-runner.jar is a special EMR jar that runs a command or script on the cluster as a step. When you pass spark-submit as the first Arg, EMR runs the actual spark-submit binary on the master node. It's the standard way to submit Spark steps on EMR, and you'll see it in most real pipelines.
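The same step can be built in Python rather than hand-written JSON. The sketch below only assembles the step dictionary in the shape shown in the CLI example; spark_step is a hypothetical helper name, and the resulting dict is what you would pass in the Steps list of emr.add_job_flow_steps(JobFlowId=..., Steps=[...]).

```python
def spark_step(name, script, script_args=(), conf=None,
               deploy_mode="cluster", on_failure="CONTINUE"):
    """Build one EMR step dict that runs spark-submit via command-runner.jar."""
    args = ["spark-submit", "--deploy-mode", deploy_mode]
    # Each Spark setting becomes its own "--conf key=value" pair
    for key, value in (conf or {}).items():
        args += ["--conf", f"{key}={value}"]
    args.append(script)          # the script must come after all spark-submit options
    args += list(script_args)    # everything after the script goes to the script itself
    return {
        "Name": name,
        "ActionOnFailure": on_failure,
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


step = spark_step(
    "Daily Sales ETL",
    "s3://my-code-bucket/scripts/daily_sales_etl.py",
    script_args=["--date", "2024-01-15", "--env", "prod"],
    conf={"spark.executor.memory": "8g", "spark.executor.cores": 4},
)
```

Keeping the argument order in one function avoids the most common mistake with command-runner steps: placing a --conf flag after the script path, where spark-submit treats it as a script argument.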
On regular Hadoop, data lives in HDFS, which means it's inside the cluster and disappears when the cluster terminates. EMR still runs HDFS on its core nodes, but it adds EMRFS (EMR File System), a connector that exposes S3 through the same Hadoop filesystem interface Spark already uses. Your PySpark code reads s3://... paths exactly like it would read HDFS or local paths, and data on S3 persists even after the cluster is terminated.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("SalesETL").getOrCreate()
# On EMR, s3:// paths work exactly like local paths
# EMRFS handles the translation transparently
df = spark.read.parquet("s3://my-data-lake/bronze/sales/year=2024/")
# Alias the aggregate so the output column is not literally named "sum(revenue)"
result = df.groupBy("region").agg(F.sum("revenue").alias("total_revenue"))
# Write results back to S3 — persists after cluster terminates
result.write.mode("overwrite").parquet(
"s3://my-data-lake/gold/sales_by_region/year=2024/"
)
spark.stop()
You can tune Spark settings at cluster launch time using EMR's Configurations — structured JSON that overrides config files like spark-defaults.conf:
[
{
"Classification": "spark-defaults",
"Properties": {
"spark.sql.shuffle.partitions": "400",
"spark.default.parallelism": "400",
"spark.executor.memory": "8g",
"spark.executor.cores": "4",
"spark.driver.memory": "4g",
"spark.dynamicAllocation.enabled": "true",
"spark.sql.adaptive.enabled": "true",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer"
}
},
{
"Classification": "spark-env",
"Configurations": [{
"Classification": "export",
"Properties": {
"PYSPARK_PYTHON": "/usr/bin/python3"
}
}]
}
]
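When this JSON is built in Python for run_job_flow(Configurations=[...]), every value under Properties has to be a string, so integers and booleans need converting first. A minimal sketch, assuming a hypothetical helper named spark_defaults:

```python
def spark_defaults(properties):
    """Return a spark-defaults classification with every value rendered as a string."""
    def as_text(value):
        # Spark expects lowercase "true"/"false", not Python's "True"/"False"
        if isinstance(value, bool):
            return "true" if value else "false"
        return str(value)

    return {
        "Classification": "spark-defaults",
        "Properties": {key: as_text(value) for key, value in properties.items()},
    }


config = spark_defaults({
    "spark.sql.shuffle.partitions": 400,
    "spark.sql.adaptive.enabled": True,
    "spark.executor.memory": "8g",
})
```

The bool check comes before the generic str() call on purpose: str(True) would produce "True", which is easy to miss in a config review.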
With classic EMR on EC2, you provision a cluster, wait for it to start (3–8 mins), run jobs, then terminate it. EMR Serverless eliminates all of that — you create an Application (a logical container), then submit job runs. AWS automatically allocates vCPUs and memory for each run, scales them during execution, and releases them when the job ends. You pay only for actual vCPU-seconds and GB-seconds used.
| Aspect | EMR on EC2 (Classic) | EMR Serverless |
|---|---|---|
| Cluster startup | 3–8 minutes | ~30 seconds |
| Node management | You choose instance types | Fully managed by AWS |
| Pricing | Pay per EC2 instance-hour | Pay per vCPU-second + GB-second |
| Custom libs / bootstrap | Full control via bootstrap | Via custom image or --py-files |
| Best for | Predictable, large batch workloads | Variable, on-demand Spark jobs |
import boto3, time
emr = boto3.client("emr-serverless", region_name="us-east-1")
# ── Step 1: Create an Application (one-time setup) ──────────────
app = emr.create_application(
name="spark-etl-app",
releaseLabel="emr-6.15.0", # EMR runtime version
type="SPARK",
autoStartConfiguration={"enabled": True},
autoStopConfiguration={
"enabled": True,
"idleTimeoutMinutes": 15 # auto-stop if idle 15 min
},
maximumCapacity={ # cost guard-rail
"cpu": "200 vCPU",
"memory": "1000 GB"
}
)
app_id = app["applicationId"]
print(f"Application created: {app_id}")
# ── Step 2: Start the Application ───────────────────────────────
emr.start_application(applicationId=app_id)
# Wait until it's in STARTED state
while True:
    state = emr.get_application(applicationId=app_id)["application"]["state"]
    if state == "STARTED":
        break
    time.sleep(5)
# ── Step 3: Submit a Job Run ─────────────────────────────────────
job = emr.start_job_run(
applicationId=app_id,
executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",
jobDriver={
"sparkSubmit": {
"entryPoint": "s3://my-code-bucket/scripts/daily_etl.py",
"entryPointArguments": ["--date", "2024-01-15"],
"sparkSubmitParameters": (
"--conf spark.executor.cores=4 "
"--conf spark.executor.memory=8g "
"--conf spark.sql.shuffle.partitions=200"
)
}
},
configurationOverrides={
"monitoringConfiguration": {
"s3MonitoringConfiguration": {
"logUri": "s3://my-emr-logs/serverless/"
}
}
},
name="daily-etl-2024-01-15"
)
job_run_id = job["jobRunId"]
print(f"Job submitted: {job_run_id}")
# ── Step 4: Poll until complete ──────────────────────────────────
while True:
    run = emr.get_job_run(applicationId=app_id, jobRunId=job_run_id)
    state = run["jobRun"]["state"]
    print(f"State: {state}")
    if state in ("SUCCESS", "FAILED", "CANCELLED"):
        break
    time.sleep(20)
print(f"Final state: {state}")
EMR's Managed Scaling continuously monitors YARN metrics and automatically adds or removes instances to match workload demand. You set a min and max, and EMR does the rest — no custom CloudWatch alarms or scale-in policies needed.
import boto3
emr = boto3.client("emr")
emr.put_managed_scaling_policy(
ClusterId="j-2AXXXXXXGAPLF",
ManagedScalingPolicy={
"ComputeLimits": {
"UnitType": "Instances",
"MinimumCapacityUnits": 2, # minimum 2 core/task nodes
"MaximumCapacityUnits": 20, # scale up to 20 nodes
"MaximumOnDemandCapacityUnits": 5, # max on-demand (rest Spot)
"MaximumCoreCapacityUnits": 5 # core nodes stay small
}
}
)
Keep MaximumOnDemandCapacityUnits low (e.g. 5) and let the rest scale with Spot instances. This caps your on-demand cost while allowing burst capacity at Spot pricing (~70% cheaper).
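To make the on-demand/Spot split concrete, the sketch below builds the ComputeLimits dict and reports how many capacity units can only be filled by Spot. Both function names are hypothetical, and the range checks are sanity checks I am assuming are sensible, not a statement of what the EMR API itself validates.

```python
def compute_limits(minimum, maximum, max_on_demand, max_core):
    """Build a ComputeLimits dict after basic consistency checks (assumed, not API rules)."""
    if not 0 <= minimum <= maximum:
        raise ValueError("minimum must be between 0 and maximum")
    if max_on_demand > maximum:
        raise ValueError("on-demand cap cannot exceed the overall maximum")
    if max_core > maximum:
        raise ValueError("core cap cannot exceed the overall maximum")
    return {
        "UnitType": "Instances",
        "MinimumCapacityUnits": minimum,
        "MaximumCapacityUnits": maximum,
        "MaximumOnDemandCapacityUnits": max_on_demand,
        "MaximumCoreCapacityUnits": max_core,
    }


def spot_only_units(limits):
    """Units above the on-demand cap, which scaling can only fill with Spot."""
    return limits["MaximumCapacityUnits"] - limits["MaximumOnDemandCapacityUnits"]


limits = compute_limits(minimum=2, maximum=20, max_on_demand=5, max_core=5)
burst = spot_only_units(limits)
```

With the values from the policy above, 15 of the 20 units sit above the on-demand cap, so that is the burst range priced at Spot rates.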
EMR offers two ways to define node pools. Instance Groups use a single instance type per group (simpler). Instance Fleets let you specify multiple instance types, and EMR picks whichever has Spot availability (more resilient, better Spot fulfillment).
AWS Spot instances are spare EC2 capacity sold at up to 90% discount versus on-demand pricing. The catch: AWS can reclaim them with 2-minute notice if capacity is needed elsewhere. For task nodes (no HDFS data), this is perfectly safe — if a task node is reclaimed, YARN simply reschedules those tasks onto surviving nodes. You lose some compute time but no data.
import boto3
emr = boto3.client("emr")
response = emr.run_job_flow(
Name="production-spark-cluster",
ReleaseLabel="emr-7.1.0",
Applications=[{"Name": "Spark"}],
Instances={
"InstanceFleets": [
{
"Name": "MasterFleet",
"InstanceFleetType": "MASTER",
"TargetOnDemandCapacity": 1, # Master always on-demand
"InstanceTypeConfigs": [
{"InstanceType": "m5.xlarge"}
]
},
{
"Name": "CoreFleet",
"InstanceFleetType": "CORE",
"TargetOnDemandCapacity": 2, # Core on-demand
"InstanceTypeConfigs": [
{"InstanceType": "r5.4xlarge", "WeightedCapacity": 1},
{"InstanceType": "r5a.4xlarge", "WeightedCapacity": 1}
]
},
{
"Name": "TaskFleet",
"InstanceFleetType": "TASK",
"TargetSpotCapacity": 10, # Task nodes on Spot!
"InstanceTypeConfigs": [
# Multiple types = better Spot availability
{"InstanceType": "r5.4xlarge", "WeightedCapacity": 1},
{"InstanceType": "r5a.4xlarge", "WeightedCapacity": 1},
{"InstanceType": "r4.4xlarge", "WeightedCapacity": 1},
{"InstanceType": "m5.8xlarge", "WeightedCapacity": 2}
],
"LaunchSpecifications": {
"SpotSpecification": {
"TimeoutDurationMinutes": 5,
"TimeoutAction": "SWITCH_TO_ON_DEMAND" # fallback
}
}
}
],
"Ec2SubnetIds": ["subnet-abc123", "subnet-def456"], # multi-AZ
"KeepJobFlowAliveWhenNoSteps": False, # auto-terminate after steps
},
JobFlowRole="EMR_EC2_DefaultRole",
ServiceRole="EMR_DefaultRole",
AutoTerminationPolicy={"IdleTimeout": 3600} # terminate if idle 1 hr
)
cluster_id = response["JobFlowId"]
print(f"Cluster started: {cluster_id}")
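WeightedCapacity in the fleet config above decides how many capacity units each instance contributes toward the fleet's target. A minimal sketch of that arithmetic, assuming the whole target is filled by a single instance type (in practice EMR may mix types); instances_to_fill is a hypothetical helper, not an EMR API.

```python
import math


def instances_to_fill(target_units: int, weighted_capacity: int) -> int:
    """Instances of one type needed to reach (or just exceed) the target capacity."""
    if target_units < 0 or weighted_capacity < 1:
        raise ValueError("target must be >= 0 and weight must be >= 1")
    return math.ceil(target_units / weighted_capacity)


# Weights copied from the TaskFleet definition; TargetSpotCapacity is 10
task_fleet_weights = {
    "r5.4xlarge": 1,
    "r5a.4xlarge": 1,
    "r4.4xlarge": 1,
    "m5.8xlarge": 2,
}
counts = {itype: instances_to_fill(10, weight)
          for itype, weight in task_fleet_weights.items()}
```

The rounding up matters: when the weight does not divide the target evenly, the last instance overshoots the target rather than leaving it unfilled.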
When AWS reclaims a Spot instance, it gives a 2-minute interruption notice. YARN detects the node disappearing and automatically reschedules tasks that were running on it. To minimize disruption:
- Keep Spark's external shuffle service enabled. It lets shuffle files outlive an individual executor (for example when dynamic allocation removes it), but it runs on the same node, so shuffle data on a reclaimed Spot node is still lost and the affected stage is recomputed.
- Set spark.stage.maxConsecutiveAttempts higher (default 4) if stages frequently fail due to Spot interruptions.
- Use multiple Spot instance types in Instance Fleets: if one type's pool is exhausted, EMR tries another.
- Set TimeoutAction: SWITCH_TO_ON_DEMAND so the fleet falls back to on-demand if Spot is unavailable at launch.
Bootstrap Actions are shell scripts that run on every node in the cluster before YARN and Spark start. They're your chance to install Python packages, set OS-level config, download files from S3, or configure the environment. Think of it like apt-get install + pip install that runs on every machine in the cluster at launch time.
Common uses: installing Python packages (pandas, boto3, pyarrow, great-expectations) · Mounting EFS or NFS · Copying config files from S3 · Installing system packages · Setting environment variables
#!/bin/bash
# Bootstrap script — runs on every node before EMR services start
set -ex # exit on error, print each command
# Update pip
sudo pip3 install --upgrade pip
# Install Python dependencies your PySpark job needs
sudo pip3 install \
pyarrow==14.0.1 \
great-expectations==0.18.0 \
boto3==1.34.0 \
tenacity==8.2.3 \
psycopg2-binary==2.9.9
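# Download a shared config file from S3
# (/etc is root-owned and /etc/pipeline does not exist yet, so create it with sudo)
sudo mkdir -p /etc/pipeline
sudo aws s3 cp s3://my-config-bucket/pipeline.yaml /etc/pipeline/pipeline.yaml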
# Set environment variable for all processes
echo 'export PIPELINE_ENV=prod' | sudo tee -a /etc/environment
echo "Bootstrap complete!"
import boto3
# First, upload your bootstrap script to S3
s3 = boto3.client("s3")
s3.upload_file("bootstrap.sh", "my-code-bucket", "bootstrap/bootstrap.sh")
emr = boto3.client("emr")
response = emr.run_job_flow(
Name="cluster-with-bootstrap",
ReleaseLabel="emr-7.1.0",
Applications=[{"Name": "Spark"}],
BootstrapActions=[
{
"Name": "Install Python Dependencies",
"ScriptBootstrapAction": {
"Path": "s3://my-code-bucket/bootstrap/bootstrap.sh",
"Args": [] # optional args passed to the script
}
}
],
Instances={
"InstanceGroups": [
{"Name": "Master", "InstanceRole": "MASTER",
"InstanceType": "m5.xlarge", "InstanceCount": 1},
{"Name": "Core", "InstanceRole": "CORE",
"InstanceType": "r5.4xlarge", "InstanceCount": 4}
],
"Ec2SubnetId": "subnet-abc123",
"KeepJobFlowAliveWhenNoSteps": False
},
JobFlowRole="EMR_EC2_DefaultRole",
ServiceRole="EMR_DefaultRole",
LogUri="s3://my-emr-logs/"
)
print(f"Cluster: {response['JobFlowId']}")
EMR Studio is a managed Jupyter-based IDE hosted by AWS. You connect it to a running EMR cluster (or EMR Serverless application) and run PySpark code interactively — exactly like a local Jupyter notebook, but with the full cluster compute behind it. It's invaluable for:
- Exploring large datasets on S3 interactively
- Debugging Spark jobs — checking DataFrames mid-transformation
- Profiling slow queries with the built-in Spark UI link
- Prototyping logic before packaging it into a production script
| Strategy | How to Implement | Typical Saving |
|---|---|---|
| Auto-terminate clusters | Set KeepJobFlowAliveWhenNoSteps=False and AutoTerminationPolicy | Eliminates idle cluster cost |
| Spot task nodes | Instance Fleets with TargetSpotCapacity on task fleet | 50–80% off task node cost |
| Right-size instances | Monitor YARN container utilization, pick r5/r6 for memory-heavy Spark | 20–40% off wasted capacity |
| Reserved Instances | Use 1-year RIs for core nodes in always-on clusters | ~30% vs on-demand |
| EMR Serverless | For variable/infrequent jobs — pay only during job execution | Eliminates idle time entirely |
import boto3, time
emr = boto3.client("emr", region_name="us-east-1")
# ── Launch cluster with embedded steps ──────────────────────────
response = emr.run_job_flow(
Name="daily-etl-2024-01-15",
ReleaseLabel="emr-7.1.0",
Applications=[{"Name": "Spark"}],
Instances={
"InstanceGroups": [
{
"Name": "Master", "InstanceRole": "MASTER",
"InstanceType": "m5.xlarge", "InstanceCount": 1,
"Market": "ON_DEMAND"
},
{
"Name": "Core", "InstanceRole": "CORE",
"InstanceType": "r5.4xlarge", "InstanceCount": 4,
"Market": "ON_DEMAND"
},
{
"Name": "Task", "InstanceRole": "TASK",
"InstanceType": "r5.4xlarge", "InstanceCount": 8,
"Market": "SPOT",
"BidPrice": "0.50" # max Spot bid
}
],
"Ec2SubnetId": "subnet-abc123",
"KeepJobFlowAliveWhenNoSteps": False, # ← auto-terminate!
"TerminationProtected": False
},
Steps=[
{
"Name": "Bronze ETL",
"ActionOnFailure": "TERMINATE_CLUSTER", # fail fast
"HadoopJarStep": {
"Jar": "command-runner.jar",
"Args": [
"spark-submit", "--deploy-mode", "cluster",
"--conf", "spark.sql.shuffle.partitions=400",
"s3://my-code/scripts/bronze_etl.py",
"--date", "2024-01-15"
]
}
},
{
"Name": "Silver Transform",
"ActionOnFailure": "TERMINATE_CLUSTER",
"HadoopJarStep": {
"Jar": "command-runner.jar",
"Args": [
"spark-submit", "--deploy-mode", "cluster",
"s3://my-code/scripts/silver_transform.py",
"--date", "2024-01-15"
]
}
}
],
BootstrapActions=[{
"Name": "Install libs",
"ScriptBootstrapAction": {
"Path": "s3://my-code/bootstrap/setup.sh"
}
}],
Configurations=[{
"Classification": "spark-defaults",
"Properties": {
"spark.sql.adaptive.enabled": "true",
"spark.dynamicAllocation.enabled": "true"
}
}],
JobFlowRole="EMR_EC2_DefaultRole",
ServiceRole="EMR_DefaultRole",
LogUri="s3://my-emr-logs/clusters/",
AutoTerminationPolicy={"IdleTimeout": 3600},
Tags=[
{"Key": "Project", "Value": "DataLake"},
{"Key": "Environment", "Value": "prod"},
{"Key": "CostCenter", "Value": "DE-Team"} # for billing reports
]
)
cluster_id = response["JobFlowId"]
print(f"Cluster launched: {cluster_id}")
print(f"Will auto-terminate after steps complete.")
# ── Poll until cluster terminates ───────────────────────────────
while True:
    cluster = emr.describe_cluster(ClusterId=cluster_id)
    state = cluster["Cluster"]["Status"]["State"]
    print(f" Cluster state: {state}")
    if state in ("TERMINATED", "TERMINATED_WITH_ERRORS"):
        break
    time.sleep(30)
print("Done.")
| Operation | boto3 Call | Key Parameter |
|---|---|---|
| Launch cluster | emr.run_job_flow() | Instances, Steps, BootstrapActions |
| Add steps | emr.add_job_flow_steps() | JobFlowId, Steps |
| Check cluster state | emr.describe_cluster() | ClusterId → Status.State |
| Check step state | emr.describe_step() | ClusterId, StepId → Status.State |
| List clusters | emr.list_clusters() + paginator | ClusterStates=["RUNNING"] |
| Terminate cluster | emr.terminate_job_flows() | JobFlowIds=[cluster_id] |
| Enable managed scaling | emr.put_managed_scaling_policy() | ManagedScalingPolicy |
| Serverless — submit job | emr_serverless.start_job_run() | applicationId, jobDriver |
| Serverless — poll state | emr_serverless.get_job_run() | applicationId, jobRunId |
Amazon Athena — Serverless SQL on S3
Athena is AWS's serverless, interactive query service. You point it at files on S3 — Parquet, ORC, JSON, CSV — and run standard SQL against them with no cluster to manage, no data to load, and no infrastructure to provision. You pay only for the bytes scanned. For Data Engineers, Athena is the fastest way to query and validate data in your lake, run ad-hoc analytics, and automate SQL-based pipelines via boto3.
Athena is built on top of Trino (formerly PrestoSQL) — a massively parallel SQL engine. When you submit a query, Athena spins up a fleet of compute workers, reads the relevant S3 files in parallel, processes the query, writes results to an S3 output location, and then releases all compute. You never see any of this — it's fully managed and scales automatically with query size.
To query S3 data, you register it as a table in the Glue Data Catalog (or use a Glue Crawler to auto-discover it). Then Athena can query it with standard SQL. Here's what a full setup looks like:
-- Run this DDL in Athena console or via start_query_execution API
CREATE EXTERNAL TABLE IF NOT EXISTS sales_db.transactions (
transaction_id STRING,
customer_id STRING,
product_id STRING,
amount DOUBLE,
status STRING,
created_at TIMESTAMP
)
PARTITIONED BY (
year STRING,
month STRING,
day STRING
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/silver/transactions/'
TBLPROPERTIES (
'parquet.compress' = 'SNAPPY',
'projection.enabled' = 'false'
);
-- Load partition metadata (tells Athena about existing partitions)
MSCK REPAIR TABLE sales_db.transactions;
-- Now query with partition pruning (only scans year=2024/month=01/day=15)
SELECT
customer_id,
SUM(amount) AS total_spend
FROM sales_db.transactions
WHERE
year = '2024'
AND month = '01'
AND day = '15'
AND status = 'COMPLETED'
GROUP BY customer_id
ORDER BY total_spend DESC
LIMIT 100;
Partition pruning means Athena skips S3 prefixes entirely when your WHERE clause filters on partition columns. Instead of listing and reading every file in the table, Athena only reads files under the matching partition paths. This is the single biggest cost and performance lever in Athena.
-- ❌ BAD: Full table scan — reads ALL partitions regardless of date
SELECT * FROM sales_db.transactions
WHERE created_at >= TIMESTAMP '2024-01-15 00:00:00';
-- Athena scans ALL files → slow + expensive
-- created_at is a data column, not a partition column
-- ✅ GOOD: Partition pruning — only reads year=2024/month=01/day=15
SELECT * FROM sales_db.transactions
WHERE year = '2024' AND month = '01' AND day = '15'
AND created_at >= TIMESTAMP '2024-01-15 00:00:00';
-- Athena skips all other partitions → fast + cheap
-- ✅ ALSO GOOD: Range on partition columns
SELECT * FROM sales_db.transactions
WHERE year = '2024' AND month IN ('01', '02', '03');
-- Scans only Q1 2024 data
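In pipelines the partition filter is usually generated from a run date rather than typed by hand. A minimal sketch, assuming the year/month/day string partitions used in the table above; both function names are hypothetical.

```python
from datetime import date, timedelta


def partition_predicate(day: date) -> str:
    """WHERE fragment that pins a query to one year/month/day partition."""
    return f"(year = '{day:%Y}' AND month = '{day:%m}' AND day = '{day:%d}')"


def partition_predicate_range(start: date, end: date) -> str:
    """OR together one predicate per day, inclusive of both endpoints."""
    days = (end - start).days
    if days < 0:
        raise ValueError("end must not be before start")
    return " OR ".join(
        partition_predicate(start + timedelta(days=offset))
        for offset in range(days + 1)
    )


single = partition_predicate(date(2024, 1, 15))
month_edge = partition_predicate_range(date(2024, 1, 31), date(2024, 2, 1))
```

Zero-padding comes from the strftime codes, which keeps the values aligned with keys such as month=01. For long ranges the generated SQL grows by one clause per day, so a single date-typed partition column (as in the projection example later in this section) is often the simpler layout.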
With large tables, MSCK REPAIR TABLE (which scans S3 to discover partitions) can take minutes or even time out. Partition Projection is an Athena feature where you declare the partition schema mathematically — Athena computes valid partition paths on the fly without any metadata lookup. This makes partition-heavy tables query-ready instantly, even with years of daily data.
CREATE EXTERNAL TABLE sales_db.events (
event_id STRING,
user_id STRING,
event_type STRING,
payload STRING
)
PARTITIONED BY (dt STRING) -- single date partition column
STORED AS PARQUET
LOCATION 's3://my-data-lake/events/'
TBLPROPERTIES (
-- Enable partition projection
'projection.enabled' = 'true',
'projection.dt.type' = 'date',
'projection.dt.format' = 'yyyy-MM-dd',
'projection.dt.range' = '2023-01-01,NOW',
'projection.dt.interval' = '1',
'projection.dt.interval.unit' = 'DAYS',
-- Tell Athena how to build the S3 path from partition value
'storage.location.template' = 's3://my-data-lake/events/dt=${dt}/'
);
-- No MSCK REPAIR needed! Just query directly:
SELECT event_type, COUNT(*) AS cnt
FROM sales_db.events
WHERE dt BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY 1
ORDER BY 2 DESC;
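To see what "computing partition paths on the fly" means, the sketch below expands the storage.location.template for a date range the way the table properties above describe. It is an illustration of the idea only, not Athena's implementation; projected_locations is a hypothetical function.

```python
from datetime import date, timedelta


def projected_locations(template: str, start: date, end: date,
                        placeholder: str = "${dt}") -> list:
    """Substitute each day in [start, end] into the location template."""
    locations = []
    day = start
    while day <= end:
        # isoformat() yields yyyy-MM-dd, matching projection.dt.format
        locations.append(template.replace(placeholder, day.isoformat()))
        day += timedelta(days=1)
    return locations


paths = projected_locations(
    "s3://my-data-lake/events/dt=${dt}/",
    date(2024, 1, 30),
    date(2024, 2, 1),
)
```

No catalog lookup is involved: the candidate paths follow from the template, the range, and the interval alone, which is why new daily partitions are queryable as soon as their files land.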
Even with partition pruning, small files are expensive because Athena has to open and process many file handles. The sweet spot is 128 MB – 1 GB Parquet files with Snappy compression. This is why Spark's OPTIMIZE (Delta) or manual compaction matters so much for Athena workloads.
| Format | Typical Cost | Query Speed | Recommendation |
|---|---|---|---|
| CSV (uncompressed) | Very High | Slow | Never use for large tables |
| JSON (uncompressed) | High | Slow | Raw landing only |
| Parquet + Snappy | Low | Fast | ✅ Default choice |
| ORC + Zlib | Very Low | Fast | ✅ Good alternative |
| Parquet + Zstd | Lowest | Fastest | ✅ Best for large tables |
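When compacting a partition with Spark, the number of output files can be derived from the partition's total size and a target file size inside the 128 MB to 1 GB range. A minimal sketch of that arithmetic; target_file_count is a hypothetical helper and 256 MB is just one reasonable default within the range.

```python
import math


def target_file_count(total_bytes: int, target_file_mb: int = 256) -> int:
    """Number of output files so each lands near target_file_mb (at least one file)."""
    if target_file_mb <= 0:
        raise ValueError("target_file_mb must be positive")
    target_bytes = target_file_mb * 1024 ** 2
    return max(1, math.ceil(total_bytes / target_bytes))


# A 10 GiB partition compacted toward 256 MiB files
files = target_file_count(10 * 1024 ** 3)
```

In a Spark job the result would typically feed something like df.repartition(files) before the write. Note that total_bytes here is on-disk size, so the estimate is only as good as the assumption that the rewritten data compresses about as well as the input.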
Workgroups let you isolate and control Athena usage by team, project, or environment. Each workgroup can have its own: output S3 location, data scan limit per query (cost guard-rail), query history, and IAM-controlled access. Without workgroups, a single runaway query from any user could scan a petabyte and generate a huge bill.
import boto3
athena = boto3.client("athena", region_name="us-east-1")
athena.create_work_group(
Name="de-team-prod",
Description="Data Engineering team production workgroup",
Configuration={
"ResultConfiguration": {
"OutputLocation": "s3://my-athena-results/de-team-prod/",
"EncryptionConfiguration": {
"EncryptionOption": "SSE_KMS",
"KmsKey": "alias/data-lake-cmk"
}
},
"EnforceWorkGroupConfiguration": True, # users can't override below
"BytesScannedCutoffPerQuery": 10 * 1024**3, # 10 GB max per query
"PublishCloudWatchMetricsEnabled": True, # send metrics to CW
"RequesterPaysEnabled": False
},
Tags=[{"Key": "Team", "Value": "DataEngineering"}]
)
print("Workgroup created.")
Create separate workgroups for dev, staging, and prod environments, each with a different scan limit (e.g. 1 GB for dev, 100 GB for prod pipelines). This prevents a dev query from accidentally scanning the full production dataset and costing hundreds of dollars.
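A scan limit translates directly into a worst-case cost per query. The sketch below does that arithmetic under loudly stated assumptions: the 5.0 USD per TB rate and the 10 MB minimum billed per query are placeholder defaults to be checked against current Athena pricing for your region, a TB is treated as 1024**4 bytes, and estimated_query_cost is a hypothetical helper.

```python
def estimated_query_cost(bytes_scanned: int,
                         usd_per_tb: float = 5.0,           # assumed rate, verify for your region
                         min_billed_bytes: int = 10 * 1024 ** 2) -> float:  # assumed minimum
    """Approximate cost of one query from the bytes it scanned."""
    billed = max(bytes_scanned, min_billed_bytes)
    return billed / 1024 ** 4 * usd_per_tb


# Worst case for a workgroup capped at 10 GB per query
worst_case_prod = estimated_query_cost(10 * 1024 ** 3)
```

The same function works on the DataScannedInBytes statistic returned by get_query_execution, which makes it easy to log an approximate cost next to every automated query.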
Athena can cache query results for up to 7 days. If the same query (or one with the same hash) is submitted again within the cache window, Athena returns the previous result instantly at zero cost — no S3 scan. This is especially valuable for dashboard queries that run the same aggregations repeatedly.
import boto3
athena = boto3.client("athena")
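# Result reuse is requested per query via ResultReuseConfiguration
# (it is not a workgroup setting; requires Athena engine version 3)
athena.start_query_execution(
    QueryString="""
        SELECT status, COUNT(*) AS cnt
        FROM sales_db.transactions
        WHERE year = '2024' AND month = '01'
        GROUP BY status
    """,
    WorkGroup="de-team-prod",
    ResultReuseConfiguration={
        "ResultReuseByAgeConfiguration": {
            "Enabled": True,
            "MaxAgeInMinutes": 60 * 24 # reuse results up to 24 hours old
        }
    }
)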
Standard Athena external tables are effectively append-only: you can INSERT INTO them, but you can't UPDATE or DELETE rows. Apache Iceberg changes this. Athena has native Iceberg support, which means you can run full ACID DML (INSERT, UPDATE, DELETE, MERGE INTO) directly on S3 data via SQL, without Spark. This is ideal for lightweight CDC landing, small corrective updates, and data deletion for GDPR compliance.
-- Create an Iceberg table (stored on S3, managed via Glue Catalog)
CREATE TABLE sales_db.customers_iceberg (
customer_id STRING,
name STRING,
email STRING,
country STRING,
updated_at TIMESTAMP
)
LOCATION 's3://my-data-lake/iceberg/customers/'
TBLPROPERTIES (
'table_type' = 'ICEBERG',
'format' = 'parquet',
'write_compression' = 'snappy'
);
-- Standard INSERT
INSERT INTO sales_db.customers_iceberg
VALUES ('C001', 'Alice', 'alice@example.com', 'IN', NOW());
-- MERGE INTO — upsert from a staging table
MERGE INTO sales_db.customers_iceberg t
USING sales_db.customers_staging s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
UPDATE SET name = s.name, email = s.email, updated_at = s.updated_at
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, country, updated_at)
VALUES (s.customer_id, s.name, s.email, s.country, s.updated_at);
-- DELETE rows (e.g. GDPR right-to-erasure)
DELETE FROM sales_db.customers_iceberg
WHERE customer_id = 'C001';
-- Time travel — query historical snapshot
SELECT * FROM sales_db.customers_iceberg
FOR TIMESTAMP AS OF TIMESTAMP '2024-01-15 12:00:00 UTC';
Athena Federated Query lets you run SQL that joins S3 data with data in other sources, such as RDS (PostgreSQL/MySQL), DynamoDB, Redshift, and OpenSearch, in a single query. It uses Lambda-based Data Source Connectors that translate Athena queries into calls against the target service.
-- After installing the RDS connector Lambda and registering it as a catalog:
-- "my_rds_catalog" points to an RDS PostgreSQL instance
-- "AwsDataCatalog" points to Glue Catalog (S3 data)
SELECT
t.transaction_id,
t.amount,
c.name AS customer_name,
c.email AS customer_email,
c.country
FROM
-- S3 data via Glue Catalog
AwsDataCatalog.sales_db.transactions t
-- RDS PostgreSQL data via federated connector
JOIN my_rds_catalog.public.customers c
ON t.customer_id = c.customer_id
WHERE
t.year = '2024'
AND t.month = '01'
AND c.country = 'IN';
Named queries are saved SQL statements stored in Athena (per workgroup). They appear in the Athena console as saved queries — useful for standard DQ checks, daily reports, or DDL templates that multiple team members run. You can also create and retrieve them via boto3 to build query libraries in code.
import boto3
athena = boto3.client("athena")
# Save a frequently-used DQ check as a named query
response = athena.create_named_query(
Name="daily_null_check",
Description="Check null counts in transactions table for a given date",
Database="sales_db",
QueryString="""
SELECT
COUNT(*) AS total_rows,
COUNT(*) - COUNT(transaction_id) AS null_transaction_id,
COUNT(*) - COUNT(customer_id) AS null_customer_id,
COUNT(*) - COUNT(amount) AS null_amount
FROM transactions
WHERE year = '2024' AND month = '01' AND day = '15'
""",
WorkGroup="de-team-prod"
)
named_query_id = response["NamedQueryId"]
print(f"Saved named query: {named_query_id}")
# Retrieve it later to get the SQL string
nq = athena.get_named_query(NamedQueryId=named_query_id)
sql = nq["NamedQuery"]["QueryString"]
print(f"SQL: {sql}")
Athena is asynchronous — you submit a query, get back a QueryExecutionId, then poll until it's done, then fetch results. This 4-step pattern is what every production Athena automation looks like:
import boto3, time
athena = boto3.client("athena", region_name="us-east-1")
# ─────────────────────────────────────────────────────────────────
# Step 1: Submit the query
# ─────────────────────────────────────────────────────────────────
response = athena.start_query_execution(
QueryString="""
SELECT
region,
SUM(revenue) AS total_revenue,
COUNT(*) AS num_orders
FROM sales_db.transactions
WHERE year = '2024' AND month = '01'
GROUP BY region
ORDER BY total_revenue DESC
""",
QueryExecutionContext={
"Database": "sales_db",
"Catalog": "AwsDataCatalog"
},
ResultConfiguration={
"OutputLocation": "s3://my-athena-results/de-team-prod/"
},
WorkGroup="de-team-prod"
)
query_execution_id = response["QueryExecutionId"]
print(f"Query submitted: {query_execution_id}")
# ─────────────────────────────────────────────────────────────────
# Step 2: Poll until SUCCEEDED or FAILED
# ─────────────────────────────────────────────────────────────────
def wait_for_query(athena_client, qeid, poll_interval=2):
    terminal = {"SUCCEEDED", "FAILED", "CANCELLED"}
    while True:
        result = athena_client.get_query_execution(QueryExecutionId=qeid)
        status = result["QueryExecution"]["Status"]
        state = status["State"]
        print(f" State: {state}")
        if state == "FAILED":
            reason = status.get("StateChangeReason", "Unknown")
            raise RuntimeError(f"Athena query FAILED: {reason}")
        if state == "CANCELLED":
            raise RuntimeError("Athena query was CANCELLED.")
        if state == "SUCCEEDED":
            stats = result["QueryExecution"]["Statistics"]
            scanned_mb = stats["DataScannedInBytes"] / 1024**2
            print(f" ✅ SUCCEEDED - Scanned: {scanned_mb:.1f} MB")
            return
        time.sleep(poll_interval)
wait_for_query(athena, query_execution_id)
# ─────────────────────────────────────────────────────────────────
# Step 3: Fetch and parse results with paginator
# ─────────────────────────────────────────────────────────────────
def fetch_results(athena_client, qeid):
    paginator = athena_client.get_paginator("get_query_results")
    pages = paginator.paginate(QueryExecutionId=qeid)
    rows = []
    headers = None
    for page in pages:
        result_set = page["ResultSet"]
        if headers is None:
            # First row in first page is the header row
            headers = [
                col["VarCharValue"]
                for col in result_set["Rows"][0]["Data"]
            ]
            data_rows = result_set["Rows"][1:] # skip header
        else:
            data_rows = result_set["Rows"]
        for row in data_rows:
            values = [
                cell.get("VarCharValue", None)
                for cell in row["Data"]
            ]
            rows.append(dict(zip(headers, values)))
    return rows
results = fetch_results(athena, query_execution_id)
# Step 4: Use results — print or convert to DataFrame
for row in results:
    print(row)
# → {'region': 'South', 'total_revenue': '1234567.89', 'num_orders': '4521'}
# → {'region': 'North', 'total_revenue': '987654.32', 'num_orders': '3200'}
# Convert to Pandas DataFrame if needed
import pandas as pd
df = pd.DataFrame(results)
df["total_revenue"] = df["total_revenue"].astype(float)
print(df)
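Every value in the parsed rows is a string, which is why the DataFrame step above needs an explicit astype(float). Each page of get_query_results also carries column metadata under ResultSet.ResultSetMetadata.ColumnInfo, and that can drive the conversion. A minimal sketch; coerce_row and the cast table are my own illustration and cover only a few common types, with everything else left as a string.

```python
from decimal import Decimal

# Partial mapping from Athena type names to Python converters (an assumption,
# extend as needed); unknown types fall back to str
_CASTS = {
    "tinyint": int, "smallint": int, "integer": int, "bigint": int,
    "float": float, "real": float, "double": float,
    "decimal": Decimal,
    "boolean": lambda text: text.lower() == "true",
}


def coerce_row(row, column_info):
    """Turn one ResultSet row into a dict of typed values, keeping NULLs as None."""
    typed = {}
    for cell, column in zip(row["Data"], column_info):
        raw = cell.get("VarCharValue")  # key is absent for NULL
        cast = _CASTS.get(column["Type"], str)
        typed[column["Name"]] = None if raw is None else cast(raw)
    return typed


sample_info = [
    {"Name": "region", "Type": "varchar"},
    {"Name": "total_revenue", "Type": "double"},
    {"Name": "num_orders", "Type": "bigint"},
]
sample_row = {"Data": [{"VarCharValue": "South"},
                       {"VarCharValue": "1234567.89"},
                       {}]}
typed_row = coerce_row(sample_row, sample_info)
```

Inside fetch_results, the metadata would be read once per page as page["ResultSet"]["ResultSetMetadata"]["ColumnInfo"].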
import boto3
athena = boto3.client("athena")
# Cancel a running query (e.g. if it's scanning too much data)
athena.stop_query_execution(
QueryExecutionId="abc12345-1234-1234-1234-abc123456789"
)
print("Query cancellation requested.")
| Operation | boto3 Call | Key Parameters |
|---|---|---|
| Run a query | start_query_execution() | QueryString, ResultConfiguration (OutputLocation), WorkGroup |
| Check query state | get_query_execution() | QueryExecutionId → Status.State |
| Fetch results | get_query_results() + paginator | QueryExecutionId |
| Cancel a query | stop_query_execution() | QueryExecutionId |
| List past queries | list_query_executions() + paginator | WorkGroup |
| Save a query | create_named_query() | Name, QueryString, WorkGroup |
| Load a saved query | get_named_query() | NamedQueryId |
| Create workgroup | create_work_group() | BytesScannedCutoffPerQuery |
- Always filter on partition columns in WHERE clauses — this is the #1 cost and speed lever.
- Use Parquet or ORC format, never CSV or JSON for production tables.
- Keep files 128 MB – 1 GB — compact small files with Spark or Delta OPTIMIZE before Athena queries them.
- Use Workgroups with scan byte limits — a single runaway query can cost hundreds of dollars.
- Enable Partition Projection for large time-series tables — eliminates MSCK REPAIR and makes partition discovery instant.
AWS Lake Formation — Fine-Grained Data Governance
Lake Formation is AWS's centralised data lake governance service. Before Lake Formation, controlling who could access which tables, columns, or rows in your data lake required a patchwork of S3 bucket policies and IAM policies that quickly became unmanageable. Lake Formation gives you a single place to grant and revoke table-level, column-level, and row-level permissions across your entire Glue Catalog — enforced for Athena, Glue, EMR, and Redshift Spectrum automatically.
Without Lake Formation, controlling data lake access looks like this: you give a team an IAM role that grants S3 read access to a specific prefix. But S3 permissions are bucket/prefix-level — you can't say "this team can read the customers table but only the name and region columns, not email or phone". For column-level or row-level access, you need custom views, masking logic in every query tool, or separate physical copies of data per team — all unmaintainable at scale.
| Scenario | Use IAM | Use Lake Formation |
|---|---|---|
| Control S3 bucket access broadly | ✅ Yes | Not designed for this |
| Table-level access in Glue Catalog | Possible but complex | ✅ Preferred |
| Column-level access control | Not possible | ✅ Yes — column grants |
| Row-level filtering | Not possible | ✅ Yes — row filters |
| Cross-account data sharing | Complex RAM setup | ✅ Built-in cross-account |
| Tag-based access policies | IAM attribute-based access control (ABAC) | ✅ LF-Tags (simpler) |
Before Lake Formation can govern data in an S3 bucket, you must register that S3 path with Lake Formation. Registration transfers ownership of that path's access control from S3/IAM to Lake Formation. After registration, services like Athena and Glue get data access through a Lake Formation service-linked role — not through the user's own IAM role directly touching S3. This is how Lake Formation enforces its column and row filters: it intercepts the data access at the service level.
import boto3
lf = boto3.client("lakeformation", region_name="us-east-1")
# Register the S3 path — Lake Formation now controls access here
lf.register_resource(
ResourceArn="arn:aws:s3:::my-data-lake",
UseServiceLinkedRole=True # Lake Formation uses its own IAM role to access S3
)
print("S3 location registered with Lake Formation.")
# List all registered locations
response = lf.list_resources()
for r in response["ResourceInfoList"]:
    print(r["ResourceArn"], r["RoleArn"])
The most common Lake Formation operation: grant a role the ability to SELECT from a specific table in the Glue Catalog. Once granted, that role can query the table via Athena or read it in a Glue job — without needing direct S3 IAM permissions.
import boto3
lf = boto3.client("lakeformation")
# Grant SELECT on the transactions table to the Data Analyst role
lf.grant_permissions(
Principal={
"DataLakePrincipalIdentifier":
"arn:aws:iam::123456789012:role/DataAnalystRole"
},
Resource={
"Table": {
"CatalogId": "123456789012",
"DatabaseName": "sales_db",
"Name": "transactions"
}
},
Permissions=["SELECT"],
PermissionsWithGrantOption=[] # cannot re-grant to others
)
print("Table permission granted.")
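# Grant ALL database-level permissions (e.g. CREATE_TABLE, ALTER, DROP) on a database.
# Note: this does not grant SELECT on the tables inside it; for that, grant on a
# "Table" resource with "TableWildcard": {} instead.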
lf.grant_permissions(
Principal={
"DataLakePrincipalIdentifier":
"arn:aws:iam::123456789012:role/DataEngineerRole"
},
Resource={
"Database": {
"CatalogId": "123456789012",
"Name": "sales_db"
}
},
Permissions=["ALL"] # full access to the database
)
# Revoke a permission
lf.revoke_permissions(
Principal={
"DataLakePrincipalIdentifier":
"arn:aws:iam::123456789012:role/DataAnalystRole"
},
Resource={
"Table": {
"CatalogId": "123456789012",
"DatabaseName": "sales_db",
"Name": "transactions"
}
},
Permissions=["SELECT"]
)
Column-level security is one of Lake Formation's most valuable features for PII protection. You can grant SELECT on specific columns only. The principal still sees the table in Athena, but only the granted columns: SELECT * returns just those columns, and a query that explicitly references an excluded column fails.
import boto3
lf = boto3.client("lakeformation")
# customers table has: customer_id, name, email, phone, country, segment
# Analyst role should NOT see email or phone (PII)
lf.grant_permissions(
Principal={
"DataLakePrincipalIdentifier":
"arn:aws:iam::123456789012:role/DataAnalystRole"
},
Resource={
"TableWithColumns": {
"CatalogId": "123456789012",
"DatabaseName": "sales_db",
"Name": "customers",
"ColumnNames": [
"customer_id",
"name",
"country",
"segment"
# email and phone are NOT listed → access denied
]
}
},
Permissions=["SELECT"]
)
print("Column-level permission granted (email and phone excluded).")
# Alternatively, use ColumnWildcard with Excluded columns
# (grant all columns EXCEPT the listed ones)
lf.grant_permissions(
Principal={
"DataLakePrincipalIdentifier":
"arn:aws:iam::123456789012:role/DataAnalystRole"
},
Resource={
"TableWithColumns": {
"CatalogId": "123456789012",
"DatabaseName": "sales_db",
"Name": "customers",
"ColumnWildcard": {
"ExcludedColumnNames": ["email", "phone"]
}
}
},
Permissions=["SELECT"]
)
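The two grant styles above expose the same columns today, but they behave differently when the table gains a new column. The sketch below models that in plain Python; visible_columns is a hypothetical helper that mimics how the grant resolves, not a Lake Formation API, and date_of_birth is an invented column.

```python
def visible_columns(all_columns, table_with_columns):
    """Resolve which columns a TableWithColumns grant exposes (simplified model)."""
    if "ColumnNames" in table_with_columns:
        allowed = set(table_with_columns["ColumnNames"])
        return [c for c in all_columns if c in allowed]
    wildcard = table_with_columns.get("ColumnWildcard", {})
    excluded = set(wildcard.get("ExcludedColumnNames", []))
    return [c for c in all_columns if c not in excluded]


customers = ["customer_id", "name", "email", "phone", "country", "segment"]
include_grant = {"ColumnNames": ["customer_id", "name", "country", "segment"]}
exclude_grant = {"ColumnWildcard": {"ExcludedColumnNames": ["email", "phone"]}}

print(visible_columns(customers, include_grant))
print(visible_columns(customers, exclude_grant))

# Schema evolution: a new (possibly sensitive) column appears on the table.
customers_v2 = customers + ["date_of_birth"]
print(visible_columns(customers_v2, include_grant))   # new column stays hidden
print(visible_columns(customers_v2, exclude_grant))   # new column becomes visible
```

For PII-heavy tables the explicit include list is usually the safer default, because a newly added column stays hidden until someone grants it on purpose.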
Row-level security lets you restrict which rows a principal can see. You create a Data Filter with a filter expression (a SQL WHERE clause fragment), then attach it to a table grant. When the principal queries the table, Lake Formation automatically applies the filter — they only see rows that match the expression.
import boto3
lf = boto3.client("lakeformation")
# Step 1: Create a row filter (a named WHERE clause)
lf.create_data_cells_filter(
TableData={
"TableCatalogId": "123456789012",
"DatabaseName": "sales_db",
"TableName": "transactions",
"Name": "india_only_filter", # filter name
"RowFilter": {
"FilterExpression": "country = 'IN'" # SQL WHERE clause
},
# Optionally restrict columns too in the same filter
"ColumnWildcard": {} # all columns allowed in this filter
}
)
print("Row filter created.")
# Step 2: Grant the filter to an IAM role
# The India Analytics team can only see rows where country = 'IN'
lf.grant_permissions(
Principal={
"DataLakePrincipalIdentifier":
"arn:aws:iam::123456789012:role/IndiaAnalyticsRole"
},
Resource={
"DataCellsFilter": {
"TableCatalogId": "123456789012",
"DatabaseName": "sales_db",
"TableName": "transactions",
"Name": "india_only_filter"
}
},
Permissions=["SELECT"]
)
print("Row-level filter grant applied.")
# Now when IndiaAnalyticsRole queries transactions in Athena,
# they ONLY see rows where country = 'IN' — automatically enforced.
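When several regional teams need the same kind of row filter, it helps to generate the TableData payload rather than hand-write each one. This is a small sketch under stated assumptions: country_filter and its naming scheme are hypothetical, and the country column is assumed to hold two-letter codes as in the example above.

```python
def country_filter(catalog_id, database, table, country_code):
    """Build the TableData payload for lf.create_data_cells_filter()."""
    # Validate the code first so untrusted input never ends up inside the SQL fragment.
    if not (len(country_code) == 2 and country_code.isalpha()):
        raise ValueError(f"not a two-letter country code: {country_code!r}")
    code = country_code.upper()
    return {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": f"{code.lower()}_only_filter",        # hypothetical naming scheme
        "RowFilter": {"FilterExpression": f"country = '{code}'"},
        "ColumnWildcard": {},                          # all columns allowed
    }


payloads = [
    country_filter("123456789012", "sales_db", "transactions", code)
    for code in ("IN", "US", "DE")
]
for p in payloads:
    print(p["Name"], "->", p["RowFilter"]["FilterExpression"])
```

Each payload would then be passed as lf.create_data_cells_filter(TableData=payload) and granted to the matching team role exactly as in Step 2.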
If you have 500 tables in your lake and need to grant 20 teams access to different subsets, individual table grants stop scaling: that's potentially 10,000 grant statements to maintain. LF-Tags (Lake Formation Tag-Based Access Control) solve this with attribute-based access. You tag databases, tables, and columns with key-value labels, then grant access to a tag expression rather than to specific resources. When you add a new table with the right tags, it's automatically included in existing grants.
Analogy: instead of cutting a separate key for every filing cabinet, you label each cabinet with tags such as department=finance or sensitivity=public and give the employee a badge that works on all cabinets with department=finance AND sensitivity=public. When you add a new cabinet, just tag it and the employee's badge automatically works on it.
import boto3
lf = boto3.client("lakeformation")
# ── Step 1: Create LF-Tag keys and their allowed values ─────────
lf.create_lf_tag(
TagKey="sensitivity",
TagValues=["public", "internal", "confidential", "restricted"]
)
lf.create_lf_tag(
TagKey="domain",
TagValues=["sales", "finance", "marketing", "hr"]
)
print("LF-Tags created.")
# ── Step 2: Assign tags to a database ───────────────────────────
lf.add_lf_tags_to_resource(
Resource={
"Database": {
"CatalogId": "123456789012",
"Name": "sales_db"
}
},
LFTags=[
{"TagKey": "domain", "TagValues": ["sales"]},
{"TagKey": "sensitivity", "TagValues": ["internal"]}
]
)
# ── Step 3: Assign a more restrictive tag to a specific table ────
lf.add_lf_tags_to_resource(
Resource={
"Table": {
"CatalogId": "123456789012",
"DatabaseName": "sales_db",
"Name": "customers" # has PII
}
},
LFTags=[
{"TagKey": "sensitivity", "TagValues": ["confidential"]}
]
)
# ── Step 4: Grant access via tag expression ──────────────────────
# Sales team can access all tables tagged domain=sales AND sensitivity=internal
lf.grant_permissions(
Principal={
"DataLakePrincipalIdentifier":
"arn:aws:iam::123456789012:role/SalesAnalyticsRole"
},
Resource={
"LFTagPolicy": {
"CatalogId": "123456789012",
"ResourceType": "TABLE",
"Expression": [
{"TagKey": "domain", "TagValues": ["sales"]},
{"TagKey": "sensitivity", "TagValues": ["public", "internal"]}
]
}
},
Permissions=["SELECT"]
)
print("Tag-based grant applied.")
# SalesAnalyticsRole can now query any table tagged
# domain=sales AND (sensitivity=public OR sensitivity=internal)
# The 'customers' table (tagged confidential) is automatically excluded.
# Any NEW table added to sales_db with these tags is automatically accessible.
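The matching rule behind that grant can be modelled in a few lines: a table inherits its database's tags, a tag assigned directly on the table overrides the inherited value for that key, and an expression matches when every key matches (AND) on any of its listed values (OR). The functions and the orders table below are illustrative stand-ins, not Lake Formation APIs.

```python
def effective_tags(database_tags, table_tags):
    """Tags on the table override inherited database tags, key by key."""
    merged = dict(database_tags)
    merged.update(table_tags)
    return merged


def matches(expression, tags):
    """AND across tag keys, OR across the values listed for each key."""
    return all(tags.get(cond["TagKey"]) in cond["TagValues"] for cond in expression)


database_tags = {"domain": "sales", "sensitivity": "internal"}
tables = {
    "transactions": {},                               # inherits both database tags
    "customers": {"sensitivity": "confidential"},     # overrides sensitivity
    "orders": {},                                     # hypothetical table added later
}
expression = [
    {"TagKey": "domain", "TagValues": ["sales"]},
    {"TagKey": "sensitivity", "TagValues": ["public", "internal"]},
]

access = {
    name: matches(expression, effective_tags(database_tags, tags))
    for name, tags in tables.items()
}
print(access)
```

The output shows why no new grant is needed for orders: it inherits the database tags and therefore satisfies the existing expression, while customers drops out because its own sensitivity tag wins.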
Lake Formation makes cross-account data sharing straightforward. Account A (the data producer) grants Lake Formation permissions to Account B's IAM principal. Account B then creates a resource link in their own Glue Catalog that points to Account A's table. Teams in Account B can query Account A's data via Athena as if it were a local table — without any S3 cross-account policy complexity.
import boto3
# Run in ACCOUNT A (data producer)
lf_producer = boto3.client("lakeformation", region_name="us-east-1")
# Grant SELECT on a specific table to Account B's consumer role
lf_producer.grant_permissions(
Principal={
"DataLakePrincipalIdentifier":
"arn:aws:iam::222222222222:role/ConsumerRole"
},
Resource={
"Table": {
"CatalogId": "111111111111",
"DatabaseName": "sales_db",
"Name": "transactions"
}
},
Permissions=["SELECT"],
PermissionsWithGrantOption=["SELECT"] # allow B to re-share
)
print("Cross-account grant sent to Account B.")
# ────────────────────────────────────────────────────────────────
# In ACCOUNT B: create a resource link to access the shared table
# ────────────────────────────────────────────────────────────────
glue_consumer = boto3.client("glue", region_name="us-east-1")
# Create a local resource link pointing to Account A's table
glue_consumer.create_table(
DatabaseName="shared_data",
TableInput={
"Name": "transactions_from_a", # local alias
"TargetTable": {
"CatalogId": "111111111111", # Account A's ID
"DatabaseName": "sales_db",
"Name": "transactions"
}
}
)
print("Resource link created. Account B can now query Account A's table.")
Lake Formation permissions are enforced by the integrated services that read through the Glue Data Catalog. Once a table is governed by Lake Formation, Athena, Glue jobs, and Redshift Spectrum honour its grants without per-query configuration; EMR needs Lake Formation integration enabled on the cluster:
| Service | How Lake Formation Enforces Permissions |
|---|---|
| Amazon Athena | Before executing a query, Athena checks Lake Formation permissions. Columns/rows excluded by grants are invisible in query results. |
| AWS Glue Jobs | Glue ETL jobs running with a role that has Lake Formation grants can only read the allowed columns/rows from the data source. |
| Amazon Redshift Spectrum | Spectrum queries on Glue Catalog tables respect Lake Formation column and row permissions. |
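| Amazon EMR | EMR clusters launched with Lake Formation integration enabled (set in the cluster's security configuration) enforce Lake Formation permissions for Spark SQL. Apache Ranger is a separate authorization option, not the Lake Formation mechanism. |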
import boto3
lf = boto3.client("lakeformation")
# Designate a role as a Data Lake Administrator
# (admins can grant any permission to anyone)
lf.put_data_lake_settings(
DataLakeSettings={
"DataLakeAdmins": [
{"DataLakePrincipalIdentifier":
"arn:aws:iam::123456789012:role/DataLakeAdminRole"},
{"DataLakePrincipalIdentifier":
"arn:aws:iam::123456789012:user/de-lead"}
],
"CreateDatabaseDefaultPermissions": [], # no implicit grants
"CreateTableDefaultPermissions": [], # no implicit grants
"TrustedResourceOwners": []
}
)
print("Data lake admins configured.")
# List all permissions currently in effect
paginator = lf.get_paginator("list_permissions")
for page in paginator.paginate():
    for perm in page["PrincipalResourcePermissions"]:
        principal = perm["Principal"]["DataLakePrincipalIdentifier"]
        resource = perm["Resource"]
        perms = perm["Permissions"]
        print(f" {principal} → {perms} on {resource}")
| Operation | boto3 Call |
|---|---|
| Register S3 location | lf.register_resource() |
| Grant table/column/row permission | lf.grant_permissions() |
| Revoke permission | lf.revoke_permissions() |
| List all permissions | lf.list_permissions() + paginator |
| Create LF-Tag | lf.create_lf_tag() |
| Assign tag to resource | lf.add_lf_tags_to_resource() |
| Grant via tag expression | lf.grant_permissions() with LFTagPolicy resource |
| Create row filter | lf.create_data_cells_filter() |
| Set data lake admins | lf.put_data_lake_settings() |
| Cross-account share | lf.grant_permissions() with cross-account principal ARN |
Amazon Redshift
Redshift is AWS's fully managed, petabyte-scale cloud data warehouse. As a data engineer, you'll use it as the serving layer for analytics — loading transformed data from S3 via COPY, exporting results via UNLOAD, optimising query performance with distribution and sort keys, and federating queries to S3 with Redshift Spectrum. Every topic below is something you'll encounter in a real production warehouse.
The leader node is the entry point for every query. It receives SQL from your client (BI tool, psql, JDBC), builds an execution plan, divides the work across compute nodes, aggregates partial results, and returns the final answer. You never directly interact with compute nodes — all connections go through the leader node on port 5439. The leader node does not store any actual table data — it only stores query plans and cluster metadata.
Compute nodes store slices of data and execute the actual query tasks. Each compute node is divided into node slices — on a ds2.xlarge node there are 2 slices; on a dc2.8xlarge there are 16. Data is distributed across slices, and each slice processes its portion of a query in parallel. More nodes = more parallelism = faster queries on large data.
RA3 nodes decouple compute and storage. The local SSD is a high-speed cache — Redshift automatically keeps hot data on local SSD and cold data in S3-backed managed storage. You can scale compute nodes up or down without migrating data, and you pay for compute and storage separately. This makes RA3 the recommended node type for all new Redshift clusters.
The COPY command is the fastest and most efficient way to load data into Redshift. It loads in parallel — each compute node slice reads directly from S3 independently. A single COPY loading 1 TB from 200 Parquet files is orders of magnitude faster than equivalent INSERTs. Never use INSERT for bulk loads — always COPY.
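Because each slice loads files independently, the number of input files relative to the number of slices decides how evenly the work is spread; AWS guidance is to make the file count a multiple of the slice count. The sketch below does that arithmetic using the slice counts quoted above. It assumes, as a simplification, that whole files are dealt out evenly across slices; the helper names are made up for illustration.

```python
# Slices per node for the two node types mentioned in the text above.
SLICES_PER_NODE = {"ds2.xlarge": 2, "dc2.8xlarge": 16}


def total_slices(node_type, node_count):
    """Total number of slices in a cluster of identical nodes."""
    return SLICES_PER_NODE[node_type] * node_count


def files_per_slice(file_count, slices):
    """Return (fewest, most) files any slice loads if files are spread evenly."""
    base, extra = divmod(file_count, slices)
    return base, base + (1 if extra else 0)


slices = total_slices("dc2.8xlarge", 4)
print(slices)                         # 64
print(files_per_slice(200, slices))   # (3, 4): some slices load one extra file
print(files_per_slice(192, slices))   # (3, 3): perfectly balanced
```

With 200 files on 64 slices, some slices finish early and wait for those loading a fourth file; repartitioning the Spark output to 192 files removes that tail.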
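Parquet is the recommended format for loading into Redshift: it is columnar, compressed, and carries its own schema. With Parquet you don't need to specify a delimiter or parsing options, because Redshift reads the column types from the Parquet metadata. The target table must already exist, and its columns must line up with the file's columns in order and compatible type.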
-- Load all Parquet files under a prefix in parallel
COPY analytics.orders
FROM 's3://my-data-lake/silver/orders/year=2024/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS PARQUET;
-- Load a specific manifest file (list of exact S3 paths)
COPY analytics.orders
FROM 's3://my-data-lake/manifests/orders_2024_load.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS PARQUET
MANIFEST;
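A manifest is a small JSON document listing the exact objects to load. For columnar formats such as Parquet, each entry also needs a meta.content_length with the object's size in bytes. The following sketch builds one from (url, size) pairs; the function name and file names are placeholders, and in a real pipeline the sizes would come from the Size field of an S3 listing.

```python
import json


def build_copy_manifest(objects):
    """Build a COPY manifest from an iterable of (s3_url, size_in_bytes) pairs."""
    return {
        "entries": [
            {
                "url": url,
                "mandatory": True,                     # fail the COPY if this file is missing
                "meta": {"content_length": size},      # required for Parquet/ORC loads
            }
            for url, size in objects
        ]
    }


# Hypothetical objects produced by an upstream Spark job
objects = [
    ("s3://my-data-lake/silver/orders/year=2024/part-0000.parquet", 134217728),
    ("s3://my-data-lake/silver/orders/year=2024/part-0001.parquet", 130023424),
]
manifest = build_copy_manifest(objects)
print(json.dumps(manifest, indent=2))
```

Uploading this JSON to the manifest path used in the COPY statement pins the load to exactly these files, which avoids accidentally picking up partial output from a job that is still writing.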
For CSV, COPY lets you set the delimiter, the quote character, and whether to skip a header row. Use IGNOREHEADER 1 when the CSV has a column header on row 1. Prefer GZIP-compressed CSV to reduce S3 transfer time.
COPY analytics.customers
FROM 's3://my-data-lake/raw/customers/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS CSV
DELIMITER ','
QUOTE '"'
IGNOREHEADER 1
GZIP
TIMEFORMAT 'auto'
DATEFORMAT 'auto'
TRUNCATECOLUMNS -- silently truncate strings that exceed column length
MAXERROR 10; -- allow up to 10 bad rows before failing
-- Check what was rejected (stl_load_errors is Redshift's error log)
SELECT filename, err_reason, raw_line
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 20;
In a data pipeline you'll run COPY programmatically, either via psycopg2 (a PostgreSQL driver for Python that opens a direct database connection to the cluster) or via the Redshift Data API (boto3, no driver required). The Data API is preferred in serverless environments like Lambda and Glue because it is called over HTTPS and needs no persistent database connection or VPC networking to the cluster.
import boto3, time
redshift_data = boto3.client("redshift-data", region_name="us-east-1")
copy_sql = """
COPY analytics.orders
FROM 's3://my-data-lake/silver/orders/year=2024/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS PARQUET;
"""
# Submit the COPY statement
resp = redshift_data.execute_statement(
ClusterIdentifier="my-redshift-cluster",
Database="analytics",
DbUser="etl_user",
Sql=copy_sql
)
stmt_id = resp["Id"]
# Poll until FINISHED or FAILED
while True:
    status = redshift_data.describe_statement(Id=stmt_id)["Status"]
    print(f"COPY status: {status}")
    if status in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(5)
if status == "FINISHED":
    print("✅ COPY completed successfully")
else:
    err = redshift_data.describe_statement(Id=stmt_id).get("Error")
    raise RuntimeError(f"COPY failed: {err}")
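A polling loop like the one above has no upper bound, so a stuck statement would block the pipeline forever. One possible refinement, sketched here, adds a timeout and exponential backoff. wait_for_statement is a hypothetical helper; it takes the describe function as a parameter so it can be exercised with a fake client, as in the demo below, instead of a real cluster.

```python
import time

TERMINAL_STATES = {"FINISHED", "FAILED", "ABORTED"}


def wait_for_statement(describe, statement_id, timeout_s=600.0,
                       first_delay_s=1.0, max_delay_s=30.0, sleep=time.sleep):
    """Poll describe(Id=...) until the statement reaches a terminal state."""
    delay, waited = first_delay_s, 0.0
    while True:
        desc = describe(Id=statement_id)
        status = desc["Status"]
        if status in TERMINAL_STATES:
            if status != "FINISHED":
                raise RuntimeError(f"statement {status}: {desc.get('Error')}")
            return desc
        if waited >= timeout_s:
            raise TimeoutError(f"statement still {status} after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay_s)   # back off: 1s, 2s, 4s, ... capped


# Demo with a fake client that finishes on the third poll
_statuses = iter(["SUBMITTED", "STARTED", "FINISHED"])


def fake_describe(Id):
    return {"Id": Id, "Status": next(_statuses)}


delays = []
result = wait_for_statement(fake_describe, "stmt-1", sleep=delays.append)
print(result["Status"], delays)
```

Against a real client the call would be wait_for_statement(redshift_data.describe_statement, stmt_id).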
The spark-redshift connector (open-source, also available as Databricks built-in) lets you write a PySpark DataFrame directly to Redshift. Under the hood it: (1) writes the DataFrame to S3 as Avro/Parquet, (2) runs a COPY command from that S3 path into Redshift. It uses S3 as a staging area, so you need an IAM role that grants both S3 write and Redshift COPY permissions.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("RedshiftWriter") \
.config("spark.jars.packages",
"io.github.spark-redshift-community:spark-redshift_2.12:6.2.0-spark_3.5") \
.getOrCreate()
df = spark.read.parquet("s3://my-data-lake/silver/orders/")
# Write DataFrame → S3 staging → Redshift COPY (all automatic)
# db_password is assumed to have been loaded earlier (e.g. from Secrets Manager).
# mode("append") adds rows; mode("overwrite") replaces the existing table contents.
# Note: a comment cannot follow a trailing backslash, so the notes live up here.
df.write \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", "jdbc:redshift://my-cluster.abc.us-east-1.redshift.amazonaws.com:5439/analytics") \
    .option("dbtable", "analytics.orders") \
    .option("tempdir", "s3://my-temp-bucket/redshift-staging/") \
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftS3ReadRole") \
    .option("user", "etl_user") \
    .option("password", db_password) \
    .mode("append") \
    .save()
print("DataFrame written to Redshift via COPY")
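Flow: df.write → the connector stages the DataFrame under the tempdir path (for example s3://temp/staging/) as Parquet → connector issues COPY analytics.orders FROM 's3://temp/staging/' FORMAT AS PARQUET → data lands in Redshift. The connector does not delete the staged files itself, so put an S3 lifecycle rule on the staging prefix to expire them.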
UNLOAD exports the results of a SELECT query to S3. By default it writes in parallel, producing one or more files per slice. Use it to: export data for downstream consumers (reporting, ML training), create data extracts for sharing, or move aggregated results to S3 for Athena queries. Because the default output is a set of files rather than a single file, use a manifest to track what was written.
-- Export aggregated sales summary to S3 as compressed Parquet
UNLOAD ('
SELECT order_date,
product_category,
SUM(amount) AS total_amount,
COUNT(*) AS order_count
FROM analytics.orders
WHERE order_date >= ''2024-01-01''
GROUP BY 1, 2
')
TO 's3://my-data-lake/exports/sales_summary/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS PARQUET
ALLOWOVERWRITE -- overwrite existing files
PARALLEL ON -- write one file per slice (default, fast)
MANIFEST; -- write a manifest listing all output files
-- Output: s3://my-data-lake/exports/sales_summary/0000_part_00.parquet
-- s3://my-data-lake/exports/sales_summary/0001_part_00.parquet
-- s3://my-data-lake/exports/sales_summary/manifest (manifest JSON)
-- Export to a single CSV (slow on large tables — only use for small result sets)
UNLOAD ('SELECT * FROM analytics.dim_product')
TO 's3://my-data-lake/exports/dim_product/product_export_'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
DELIMITER ','
ADDQUOTES -- wrap string values in quotes
HEADER -- add column header row
GZIP -- compress output
PARALLEL OFF -- write serially to ONE file (split only if it exceeds 6.2 GB; slow, so only for small tables)
ALLOWOVERWRITE;
-- Output: s3://my-data-lake/exports/dim_product/product_export_000.gz
PARALLEL OFF writes the data serially instead of one stream per slice, so it is dramatically slower on large tables, and the output is still split into several files once it exceeds the 6.2 GB per-file limit. Only use it when downstream systems require exactly one file (some legacy ETL tools). For everything else, use PARALLEL ON (the default).
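Since the SELECT is passed to UNLOAD as a quoted string, every single quote inside the query has to be doubled. When you generate UNLOAD statements from Python it is easy to get this wrong, so a small builder helps. This is a sketch only: build_unload is a hypothetical helper covering just the Parquet options used above, and the role ARN is the placeholder from the earlier examples.

```python
def build_unload(select_sql, s3_prefix, iam_role, parallel=True):
    """Wrap a SELECT in an UNLOAD statement, doubling embedded single quotes."""
    escaped = select_sql.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS PARQUET\n"
        f"PARALLEL {'ON' if parallel else 'OFF'};"
    )


statement = build_unload(
    "SELECT order_date, SUM(amount) AS total_amount "
    "FROM analytics.orders WHERE order_date >= '2024-01-01' GROUP BY 1",
    "s3://my-data-lake/exports/sales_summary/",
    "arn:aws:iam::123456789012:role/RedshiftS3ReadRole",
)
print(statement)
```

The printed statement contains ''2024-01-01'' inside the quoted query, which is the form Redshift expects.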
Redshift distributes rows of each table across node slices. The distribution style controls which rows go to which slice. Choosing the right distribution style eliminates or minimises data redistribution during joins — the most expensive operation in a distributed query. If two large tables are joined on a column and their rows are on the same slice, the join is local (fast). If rows are on different slices, Redshift must redistribute data over the network (slow).
Rows are distributed round-robin across slices regardless of content. Every slice gets an equal number of rows — no skew. Use EVEN for tables that are not frequently joined on a consistent key, or when you don't know what the join pattern will be. It's the safe default but doesn't optimise any specific join.
CREATE TABLE analytics.pipeline_audit (
run_id VARCHAR(64),
pipeline_name VARCHAR(128),
status VARCHAR(20),
row_count BIGINT,
run_date DATE
)
DISTSTYLE EVEN; -- round-robin; good for tables not joined on a key
Rows with the same distribution key value always go to the same slice. If two large tables are both distributed on the same join column (e.g. customer_id), their matching rows are co-located on the same slice — the join requires no network redistribution. This is the most impactful distribution choice for star schema fact-dimension joins.
-- Fact table distributed on customer_id
CREATE TABLE analytics.fact_orders (
order_id BIGINT,
customer_id BIGINT,
product_id BIGINT,
order_date DATE,
amount DECIMAL(18,2)
)
DISTSTYLE KEY
DISTKEY (customer_id) -- distribute by customer_id
SORTKEY (order_date); -- sort within each slice by date
-- Dimension table also distributed on customer_id
CREATE TABLE analytics.dim_customer (
customer_id BIGINT,
customer_name VARCHAR(200),
country VARCHAR(50)
)
DISTSTYLE KEY
DISTKEY (customer_id); -- same key → co-located with fact_orders
-- This join requires ZERO network redistribution → very fast
SELECT c.country, SUM(o.amount)
FROM analytics.fact_orders o
JOIN analytics.dim_customer c USING (customer_id)
GROUP BY 1;
Warning: if the distribution key is skewed (for example, a large share of rows have customer_id = 1), one slice will be overloaded and queries will be slow despite co-location. Always check skew after loading, for example with the skew_rows column of svv_table_info.
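Key skew is easy to see in a small simulation. The sketch below assigns rows to slices by hashing the distribution key and compares a uniform key with one where 80% of rows share a single value. zlib.crc32 is only a stand-in for Redshift's internal hash, and the four-slice cluster and row counts are invented for the demo.

```python
import zlib
from collections import Counter


def slice_for(key, slices):
    """Map a distribution key value to a slice (crc32 is a stand-in hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % slices


def rows_per_slice(keys, slices):
    """Count how many rows land on each slice under KEY distribution."""
    counts = Counter(slice_for(k, slices) for k in keys)
    return [counts.get(s, 0) for s in range(slices)]


def skew_ratio(counts):
    """Rows on the fullest slice divided by rows on the emptiest slice."""
    return max(counts) / max(min(counts), 1)


SLICES = 4
uniform_keys = list(range(10_000))                        # every customer_id distinct
skewed_keys = [1] * 8_000 + list(range(10_001, 12_001))   # 80% of rows share one key

uniform = rows_per_slice(uniform_keys, SLICES)
skewed = rows_per_slice(skewed_keys, SLICES)
print("uniform:", uniform, "ratio", round(skew_ratio(uniform), 2))
print("skewed: ", skewed, "ratio", round(skew_ratio(skewed), 2))
```

In the skewed case one slice holds at least 8,000 of the 10,000 rows, so every query on the table waits for that single slice no matter how many nodes the cluster has.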
A full copy of the table is placed on every node. This means joining a large fact table to a small dimension table is always local — the dimension rows are already on every slice. Use ALL only for small, rarely-updated dimension tables (under a few million rows). The downside is that writes are 4–16× slower because every node must be updated.
-- Small lookup table — copy to ALL nodes for fast local joins
CREATE TABLE analytics.dim_country (
country_code CHAR(2),
country_name VARCHAR(100),
region VARCHAR(50)
)
DISTSTYLE ALL; -- full copy on every node; tiny table, updated rarely
| Style | Use When | Join Performance | Write Speed |
|---|---|---|---|
| EVEN | No dominant join key; staging tables | Medium (may redistribute) | Fast |
| KEY | Large fact tables joined on a consistent column | Best (co-located) | Fast |
| ALL | Small dimensions (<5M rows, rarely updated) | Best (local broadcast) | Slow |
Redshift stores rows on disk in sort key order within each slice. When a query filters on the sort key column, Redshift uses zone maps — per-block min/max metadata — to skip entire 1 MB blocks that can't contain matching rows. This is called zone map pruning and it means a well-sorted table can answer a filtered query by reading 1% of the data instead of 100%.
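Zone map pruning can be illustrated with a toy model: split a column into fixed-size blocks, keep each block's min and max, and count how many blocks a range filter has to read. The sketch uses 1,000 integer "days" in blocks of 10 rather than real 1 MB blocks, and both function names are made up for the example.

```python
import random


def build_zone_map(values, block_size):
    """Split values into blocks and record (min, max) for each block."""
    blocks = (values[i:i + block_size] for i in range(0, len(values), block_size))
    return [(min(block), max(block)) for block in blocks]


def blocks_to_scan(zone_map, lo, hi):
    """Count blocks whose [min, max] range overlaps the filter range [lo, hi]."""
    return sum(1 for bmin, bmax in zone_map if bmax >= lo and bmin <= hi)


days = list(range(1000))                      # column stored in sort key order
sorted_map = build_zone_map(days, block_size=10)

shuffled = days[:]
random.Random(42).shuffle(shuffled)           # same data, stored unsorted
unsorted_map = build_zone_map(shuffled, block_size=10)

# Filter: WHERE day BETWEEN 500 AND 509
print(blocks_to_scan(sorted_map, 500, 509), "of", len(sorted_map), "blocks (sorted)")
print(blocks_to_scan(unsorted_map, 500, 509), "of", len(unsorted_map), "blocks (unsorted)")
```

On the sorted column the filter touches a single block out of 100, which is the 1% case described above; on the shuffled column almost every block's min/max range straddles the filter, so little can be skipped.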
A compound sort key creates a multi-column sort order — similar to an ORDER BY clause. The first column in the sort key is the primary sort, and subsequent columns only help when the query also filters on earlier columns. Best for tables that are consistently filtered on the same leading columns (e.g. always filter by order_date first, then optionally region).
CREATE TABLE analytics.fact_orders (
order_id BIGINT,
customer_id BIGINT,
order_date DATE,
region VARCHAR(50),
amount DECIMAL(18,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)
COMPOUND SORTKEY (order_date, region);
-- Queries filtering on order_date benefit most
-- Queries filtering on order_date AND region benefit even more
-- Queries filtering on ONLY region (not order_date) get no benefit
An interleaved sort key gives equal weight to all sort key columns — a query filtering on any one of them benefits equally. The trade-off is slower VACUUM and COPY performance because the interleaved z-ordering index is expensive to maintain. Use interleaved only when queries have diverse filter combinations with no dominant leading column. In practice, compound is almost always preferred.
CREATE TABLE analytics.fact_events (
event_id BIGINT,
user_id BIGINT,
event_date DATE,
event_type VARCHAR(50)
)
INTERLEAVED SORTKEY (user_id, event_date, event_type);
-- Any single-column filter benefits equally
-- But VACUUM takes 3-5× longer — use sparingly
After large loads or deletes, rows may be unsorted (Redshift appends new rows at the end regardless of sort key). VACUUM re-sorts rows and reclaims space from deleted rows. ANALYZE updates table statistics used by the query planner. Run both after major data loads in production.
-- Re-sort unsorted rows and reclaim space from deletes
VACUUM analytics.fact_orders;
-- Re-sort rows only, without reclaiming space from deleted rows (faster than a full VACUUM)
VACUUM SORT ONLY analytics.fact_orders;
-- VACUUM only reclaim space without re-sorting
VACUUM DELETE ONLY analytics.fact_orders;
-- Update query planner statistics (always run after VACUUM)
ANALYZE analytics.fact_orders;
-- Check unsorted percentage before deciding to VACUUM
SELECT "table", unsorted, stats_off, size
FROM svv_table_info
WHERE "table" = 'fact_orders';
Redshift Spectrum allows Redshift to query data directly in S3 without loading it into Redshift storage. You define an external table that points to an S3 path (via Glue Catalog), then query it with regular SQL from Redshift. Spectrum spins up a fleet of Spectrum nodes that scan S3 in parallel, push filtering down to S3, and return results to Redshift for final aggregation. You pay per terabyte scanned.
-- Step 1: Create an external schema backed by Glue Catalog
CREATE EXTERNAL SCHEMA ext_silver
FROM DATA CATALOG
DATABASE 'silver_db' -- Glue Catalog database name
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
-- Step 2: Now query Glue Catalog tables from Redshift directly
-- (tables defined in Glue appear automatically in ext_silver schema)
SELECT COUNT(*), order_date
FROM ext_silver.orders -- this table lives in S3, not Redshift!
WHERE year = '2024' -- partition pruning — only reads 2024 data
GROUP BY order_date;
-- Step 3: Join S3 data with internal Redshift table in ONE query
SELECT c.customer_name,
SUM(o.amount) AS total_spend
FROM ext_silver.orders o -- S3 via Spectrum
JOIN analytics.dim_customer c -- internal Redshift table
ON o.customer_id = c.customer_id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20;
Workload Management (WLM) controls how Redshift allocates memory and query slots across different types of queries. Without WLM, a single long-running ETL query could consume all cluster memory and block all dashboard queries. WLM lets you create queues with dedicated memory percentages and concurrency limits so ETL and BI queries don't starve each other.
Automatic WLM lets Redshift dynamically allocate memory to queries based on their size and priority. Redshift classifies queries as short, medium, or long and allocates more memory to complex queries automatically. This is the recommended mode for most clusters — you don't need to manually tune queue memory percentages.
-- See the current WLM queue (service class) configuration
SELECT * FROM stv_wlm_service_class_config;
-- Check queue wait times (is a queue a bottleneck?)
SELECT service_class,
COUNT(*) AS query_count,
AVG(total_queue_time) / 1000000.0 AS avg_wait_seconds -- total_queue_time is in microseconds
FROM stl_wlm_query
WHERE queue_start_time > GETDATE() - INTERVAL '1 hour'
GROUP BY service_class;
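-- Route this session's queries to the queue that matches the 'etl_low_priority' query group
-- (with automatic WLM, the priority configured on that queue then applies)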
SET query_group TO 'etl_low_priority';
In manual WLM you define queues with explicit memory % and concurrency limits. A typical production setup has: a BI queue (high concurrency, low memory each — fast dashboard queries), an ETL queue (low concurrency, high memory each — large transformations), and a default queue for everything else.
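A manual WLM setup of this shape is expressed as a JSON array, one object per queue, stored in the wlm_json_configuration cluster parameter. The sketch below builds such an array and checks it; the concurrency and memory numbers are illustrative assumptions, not recommendations, and the two helper functions are hypothetical.

```python
import json


def validate_queues(queues):
    """Ensure the queues do not request more than 100% of cluster memory."""
    total = sum(q["memory_percent_to_use"] for q in queues)
    if total > 100:
        raise ValueError(f"queues request {total}% of memory")
    return total


def memory_per_slot(queue):
    """Share of cluster memory each concurrent query in the queue receives."""
    return round(queue["memory_percent_to_use"] / queue["query_concurrency"], 1)


queues = [
    # BI queue: many slots, little memory each
    {"query_group": ["bi"], "query_concurrency": 15, "memory_percent_to_use": 30},
    # ETL queue: few slots, lots of memory each
    {"query_group": ["etl"], "query_concurrency": 3, "memory_percent_to_use": 50},
    # Default queue: catches everything else and is listed last
    {"query_concurrency": 5, "memory_percent_to_use": 20},
]

validate_queues(queues)
wlm_json = json.dumps(queues)
for q in queues:
    print(q.get("query_group", ["default"])[0], memory_per_slot(q), "% per query")
```

The resulting string would be applied with redshift.modify_cluster_parameter_group(), passing a parameter named wlm_json_configuration whose value is wlm_json. A session then opts into a queue with SET query_group TO 'etl'.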
Concurrency Scaling automatically adds transient Redshift clusters when the main cluster has queued queries. The additional capacity handles burst demand and is billed per second. It's transparent to users: queries simply spend less time queued during peak load. A cluster accrues up to one hour of free concurrency scaling credits for each day it runs.
A materialized view stores the result of a query physically on disk. Instead of re-computing a complex aggregation on every dashboard refresh, the MV pre-computes it and users query the pre-built result. Use materialized views for expensive recurring aggregations that don't need real-time freshness.
-- Create a materialized view of daily sales (expensive to compute live)
CREATE MATERIALIZED VIEW analytics.mv_daily_sales AS
SELECT order_date,
region,
product_category,
SUM(amount) AS total_amount,
COUNT(DISTINCT customer_id) AS unique_customers,
COUNT(*) AS order_count
FROM analytics.fact_orders
GROUP BY 1, 2, 3;
-- Dashboard queries now hit the pre-built MV — instant results
SELECT * FROM analytics.mv_daily_sales
WHERE order_date >= CURRENT_DATE - 30;
-- Refresh the MV after each ETL load (or on a schedule)
REFRESH MATERIALIZED VIEW analytics.mv_daily_sales;
-- Auto-refresh option (Redshift refreshes automatically when base table changes)
ALTER MATERIALIZED VIEW analytics.mv_daily_sales AUTO REFRESH YES;
The simplest and most reliable incremental load for daily partitioned data: delete the target partition's rows, then COPY the fresh partition from S3. Wrapping both statements in one transaction makes the load atomic: if the COPY fails and the transaction is rolled back, the DELETE is undone too, so the table never ends up with the partition missing.
BEGIN;
-- Delete today's partition from target (idempotent — safe to re-run)
DELETE FROM analytics.fact_orders
WHERE order_date = '2024-06-15';
-- Load fresh data from S3 (today's Parquet partition)
COPY analytics.fact_orders
FROM 's3://my-data-lake/silver/orders/year=2024/month=06/day=15/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS PARQUET;
COMMIT; -- atomic: both succeed or both rolled back
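Older Redshift versions have no native UPSERT, and the MERGE statement is a recent addition. The classic pattern is: COPY new data into a staging table → DELETE matching rows from the target → INSERT from staging. Run the DELETE and INSERT in one transaction so readers never see the target with rows deleted but not yet re-inserted.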
-- Step 1: COPY incoming data into a staging table
CREATE TEMP TABLE stg_orders (LIKE analytics.fact_orders);
COPY stg_orders
FROM 's3://my-data-lake/silver/orders/incremental/2024-06-15/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS PARQUET;
BEGIN;
-- Step 2: Delete existing rows that match incoming keys
DELETE FROM analytics.fact_orders
USING stg_orders
WHERE analytics.fact_orders.order_id = stg_orders.order_id;
-- Step 3: Insert all rows from staging (includes new + updated)
INSERT INTO analytics.fact_orders
SELECT * FROM stg_orders;
COMMIT;
-- Staging table auto-dropped when session ends (TEMP table)
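The semantics of the delete-then-insert upsert are easiest to check on a tiny in-memory model. The sketch below mirrors Steps 2 and 3 with lists of dictionaries; delete_then_insert is an illustrative function and the rows are invented sample data.

```python
def delete_then_insert(target, staging, key="order_id"):
    """Mirror the SQL pattern: drop target rows whose key is in staging, then append staging."""
    incoming_keys = {row[key] for row in staging}
    survivors = [row for row in target if row[key] not in incoming_keys]
    return survivors + [dict(row) for row in staging]


target = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 20}]
staging = [{"order_id": 2, "amount": 25}, {"order_id": 3, "amount": 30}]

once = delete_then_insert(target, staging)
twice = delete_then_insert(once, staging)   # simulate re-running the same load

print(once)
print(once == twice)
```

Order 2 is updated, order 3 is inserted, and re-running the same batch leaves the table unchanged, which is what makes the pattern safe to retry. Note that the model, like the SQL, does not deduplicate: if the staging data itself contains the same order_id twice, both rows are inserted.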
import boto3, time, json
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, current_date
spark = SparkSession.builder.appName("OrdersPipeline").getOrCreate()
# ── 1. Read from source and transform ──────────────────────────
df = spark.read.parquet("s3://my-lake/bronze/orders/date=2024-06-15/")
df_clean = df.filter(df.amount > 0) \
.withColumn("load_date", current_date()) \
.dropDuplicates(["order_id"])
# ── 2. Write to S3 Silver as Parquet ───────────────────────────
s3_path = "s3://my-lake/silver/orders/date=2024-06-15/"
df_clean.write.mode("overwrite").parquet(s3_path)
print(f"Wrote {df_clean.count()} rows to S3")
# ── 3. Load from S3 into Redshift via Data API ─────────────────
sm = boto3.client("secretsmanager")
creds = json.loads(sm.get_secret_value(SecretId="prod/redshift/etl")["SecretString"])
rdclient = boto3.client("redshift-data", region_name="us-east-1")
# batch_execute_statement is the Data API call designed for several statements:
# it runs the list in order as one transaction, so no explicit BEGIN/COMMIT is needed.
load_sqls = [
    "DELETE FROM analytics.fact_orders WHERE order_date = '2024-06-15'",
    f"""COPY analytics.fact_orders
        FROM '{s3_path}'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
        FORMAT AS PARQUET""",
]
resp = rdclient.batch_execute_statement(
    ClusterIdentifier="prod-redshift",
    Database="analytics",
    DbUser=creds["username"],
    Sqls=load_sqls,
)
stmt_id = resp["Id"]
# ── 4. Poll until complete ─────────────────────────────────────
for _ in range(60):
    desc = rdclient.describe_statement(Id=stmt_id)
    status = desc["Status"]
    if status in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(10)
print(f"Redshift COPY status: {status}")
if status != "FINISHED":
    # status can still be STARTED here if the 10-minute polling window ran out
    raise RuntimeError(
        f"Redshift load did not finish ({status}): {desc.get('Error', 'no error text')}"
    )
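The polling step can be factored into a helper that separates "failed" from "still running when the timeout expired" and that can be tested without AWS. This is a sketch under assumptions: wait_for_statement and its describe argument are hypothetical names, and in a real job describe would wrap the Data API describe_statement call.

```python
import time


def wait_for_statement(describe, statement_id, timeout_s=600, poll_s=10,
                       sleep=time.sleep):
    """Poll until the statement reaches a terminal status or the timeout expires.

    `describe` is any callable that takes a statement id and returns a dict
    with a "Status" key and optionally an "Error" key.
    """
    terminal = {"FINISHED", "FAILED", "ABORTED"}
    waited = 0
    while True:
        result = describe(statement_id)
        status = result["Status"]
        if status in terminal:
            if status != "FINISHED":
                detail = result.get("Error", "no error text")
                raise RuntimeError(f"Statement {statement_id} {status}: {detail}")
            return result
        if waited >= timeout_s:
            raise TimeoutError(
                f"Statement {statement_id} still {status} after {waited}s"
            )
        sleep(poll_s)
        waited += poll_s


# Demo with a fake describe function that finishes on the third poll.
_responses = iter(
    [{"Status": "SUBMITTED"}, {"Status": "STARTED"}, {"Status": "FINISHED"}]
)
final = wait_for_statement(
    lambda _id: next(_responses), "stmt-123", sleep=lambda s: None
)
print(final["Status"])
```

Injecting the sleep function keeps unit tests instant while production code uses the real time.sleep.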
Amazon MSK
Managed Streaming for Apache Kafka
Amazon MSK is AWS's fully managed Apache Kafka service. As a data engineer you'll use MSK as the central event backbone — ingesting real-time data from producers, feeding Spark Structured Streaming consumers, and connecting to S3 via MSK Connect. Every topic below maps directly to what you configure and debug in production streaming pipelines.
A Kafka broker is a single server in the Kafka cluster. It stores topic partitions on disk and serves read/write requests from producers and consumers. In MSK, AWS manages the broker fleet — you choose the number of brokers and the instance type. A typical production MSK cluster has 3 brokers (one per Availability Zone) for HA. Each broker hosts a subset of all topic partitions.
A topic is a named category of messages — e.g. prod.orders, prod.clicks. Each topic is split into partitions — ordered, immutable sequences of messages. Partitions are the unit of parallelism: more partitions = more consumers reading in parallel = higher throughput. Each partition lives on exactly one broker (its leader) but is replicated to other brokers for fault tolerance.
Every message in a partition has a unique, monotonically increasing offset (0, 1, 2, …). Consumers track their position by offset — they can replay from any offset, read only new messages, or start from the beginning. Kafka retains messages for a configurable period (default 7 days) or until a size limit — after that, old messages are deleted. Retention must be long enough to cover your recovery window.
A consumer group is a set of consumers that collectively read a topic. Kafka assigns each partition to exactly one consumer in the group — no two consumers in the same group read the same partition simultaneously. This enables parallel processing without duplicate reads. Different consumer groups can independently read the same topic — each group has its own offset tracking. Your Spark Structured Streaming job is one consumer group; a separate Flink job can be another.
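The one-partition-one-consumer rule can be illustrated with a toy assignment function. This is a simplified sketch, not Kafka's actual assignor (the real range, round-robin, and sticky strategies are more involved), and assign_partitions is a hypothetical name. It also shows why consumers beyond the partition count sit idle.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers round-robin (simplified illustration).

    Each partition gets exactly one owner within the group.
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


six_partitions = list(range(6))

# 3 consumers share 6 partitions: 2 partitions each
three = assign_partitions(six_partitions, ["c1", "c2", "c3"])
print(three)

# 8 consumers for 6 partitions: the extra consumers receive nothing
eight = assign_partitions(six_partitions, [f"c{i}" for i in range(1, 9)])
idle = [name for name, owned in eight.items() if not owned]
print("idle consumers:", idle)
```

The practical consequence: scaling a consumer group past the topic's partition count adds no throughput.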
The replication factor controls how many copies of each partition exist across brokers. Replication factor 3 means every partition has 1 leader + 2 followers. The In-Sync Replicas (ISR) set is the leader plus every follower that is fully caught up with it. With acks=all, a message is acknowledged to the producer only after all current ISR members have written it. Combined with min.insync.replicas=2, this means an acknowledged message survives the loss of any single broker.
from kafka.admin import KafkaAdminClient, NewTopic
admin = KafkaAdminClient(
bootstrap_servers="b-1.my-msk.abc.kafka.us-east-1.amazonaws.com:9092,"
"b-2.my-msk.abc.kafka.us-east-1.amazonaws.com:9092",
client_id="topic-creator"
)
topic = NewTopic(
name="prod.orders",
num_partitions=12, # 12 partitions = 12 parallel Spark tasks
replication_factor=3, # 1 leader + 2 followers per partition
topic_configs={
"retention.ms": "604800000", # 7 days in ms
"compression.type":"lz4", # broker-side compression
"min.insync.replicas": "2" # need 2 ISR to accept writes
}
)
admin.create_topics([topic])
print("Topic prod.orders created with 12 partitions")
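The interaction between replication_factor=3 and min.insync.replicas=2 in the topic above comes down to simple arithmetic. The sketch below uses hypothetical function names and models only the acks=all rule: the leader rejects a write when the ISR (leader included) is smaller than min.insync.replicas.

```python
def write_accepted(isr_size, min_insync_replicas, acks="all"):
    """Would the partition leader accept a produce request?

    With acks=all the leader rejects writes when the in-sync replica set,
    leader included, is smaller than min.insync.replicas.
    With acks=0 or acks=1 the setting is not enforced.
    """
    if acks != "all":
        return isr_size >= 1  # only a live leader is needed
    return isr_size >= min_insync_replicas


def tolerated_broker_failures(replication_factor, min_insync_replicas):
    """Replicas that can be lost while acks=all writes keep succeeding."""
    return replication_factor - min_insync_replicas


failures_ok = tolerated_broker_failures(3, 2)
print("brokers that can fail:", failures_ok)
print("ISR=2 accepted:", write_accepted(isr_size=2, min_insync_replicas=2))
print("ISR=1 accepted:", write_accepted(isr_size=1, min_insync_replicas=2))
```

With these settings one broker can be down for maintenance or failure while producers using acks=all keep writing; losing a second replica blocks writes instead of risking data loss.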
MSK Provisioned gives you dedicated Kafka brokers on EC2-backed instances. You choose the instance type and number of brokers. Key sizing considerations: throughput (MB/s in + out × replication factor) determines network requirements; storage (retention period × daily data volume) determines EBS volume size; partition count drives CPU and memory requirements for ZooKeeper / KRaft metadata.
| Instance Type | Network | Use Case |
|---|---|---|
| kafka.t3.small | Up to 5 Gbps | Dev/test only, not for production |
| kafka.m5.large | Up to 10 Gbps | Low-throughput production (<50 MB/s) |
| kafka.m5.4xlarge | Up to 10 Gbps | Medium production (50–300 MB/s) |
| kafka.m5.16xlarge | 20 Gbps | High-throughput production (>300 MB/s) |
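The storage and throughput sizing rules can be turned into a back-of-the-envelope calculation. This is a rough sketch with hypothetical function names; the 20% headroom is an assumption, and real sizing should also account for compression and uneven partition placement.

```python
def storage_gb(ingest_mb_per_s, retention_days, replication_factor, headroom=0.2):
    """Rough cluster-wide EBS storage need in GB (1 GB = 1000 MB here)."""
    seconds_retained = retention_days * 24 * 60 * 60
    replicated_mb = ingest_mb_per_s * seconds_retained * replication_factor
    return replicated_mb * (1 + headroom) / 1000


def broker_write_mb_per_s(ingest_mb_per_s, replication_factor, brokers):
    """Disk write rate per broker, assuming partitions are spread evenly."""
    return ingest_mb_per_s * replication_factor / brokers


total = storage_gb(ingest_mb_per_s=10, retention_days=7, replication_factor=3)
per_broker = total / 3
print(f"cluster storage: {total:,.0f} GB, per broker: {per_broker:,.0f} GB")
print(f"write rate per broker: {broker_write_mb_per_s(10, 3, 3):.0f} MB/s")
```

Note how replication multiplies both numbers: 10 MB/s of producer traffic becomes 30 MB/s of disk writes across the cluster.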
MSK Serverless automatically scales capacity based on traffic — you don't choose broker count or instance type. Pay per partition-hour and GB throughput. Ideal for variable or unpredictable workloads where you don't want to over-provision brokers. Limitation: does not support all Kafka configurations (e.g. log compaction is limited) and has higher per-unit cost at sustained high throughput than provisioned.
Without a schema registry, every producer and consumer must agree on the message format out-of-band. When a producer changes the format, for example by adding or renaming a field, consumers that were not updated in step can fail to deserialise. A Schema Registry is a central repository for message schemas. Producers register a schema and embed only a small schema ID in each message. Consumers fetch the schema by ID and deserialise correctly, even across schema versions. This decouples producers from consumers and enables schema evolution without downtime.
The Schema Registry enforces compatibility rules when a new schema version is registered. Choose the mode based on your deployment strategy:
| Mode | Rule | Producer/Consumer Upgrade Order | Use Case |
|---|---|---|---|
| BACKWARD | New schema can read data written with old schema | Upgrade consumers first, then producers | Most common — add optional fields |
| FORWARD | Old schema can read data written with new schema | Upgrade producers first, then consumers | Add new fields or remove optional ones |
| FULL | Both backward AND forward compatible | Either order | Strictest: only add or remove optional fields (fields with defaults) |
| NONE | No compatibility check | Any order (risky) | Dev only — never production |
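The rules in the table can be made concrete with a toy compatibility checker. This is a deliberately simplified sketch: schemas are plain dicts of field name to spec, function names are hypothetical, and Avro features such as type promotion, aliases, and unions are ignored. A real registry performs the full check.

```python
def can_read(reader_fields, writer_fields):
    """True if a reader schema can decode data written with the writer schema.

    Simplified Avro-style rule: shared fields must have the same type, and any
    field the reader expects but the writer did not write needs a default.
    """
    for name, spec in reader_fields.items():
        if name in writer_fields:
            if writer_fields[name]["type"] != spec["type"]:
                return False
        elif "default" not in spec:
            return False
    return True


def check_compatibility(old, new):
    """Evaluate a proposed new schema version against the old one."""
    backward = can_read(reader_fields=new, writer_fields=old)
    forward = can_read(reader_fields=old, writer_fields=new)
    return {"BACKWARD": backward, "FORWARD": forward, "FULL": backward and forward}


v1 = {"order_id": {"type": "long"}, "amount": {"type": "double"}}
v2_optional = {**v1, "region": {"type": "string", "default": None}}
v2_required = {**v1, "currency": {"type": "string"}}

optional_result = check_compatibility(v1, v2_optional)
required_result = check_compatibility(v1, v2_required)
print("add optional field:", optional_result)
print("add required field:", required_result)
```

Adding a field with a default passes every mode, while adding a required field breaks BACKWARD: a consumer on the new schema cannot fill the field when reading old messages.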
AWS provides a Glue Schema Registry that integrates with MSK. Producers register schemas in Glue and embed the schema version ID in each message. Consumers look up the schema from Glue to deserialise. Below is the full pattern using the AWS Glue Schema Registry serialiser.
import boto3
from kafka import KafkaProducer
from aws_schema_registry import SchemaRegistryClient
from aws_schema_registry.avro import AvroSchema
from aws_schema_registry.adapter.kafka import KafkaSerializer  # kafka-python adapter
# ── Define the Avro schema ─────────────────────────────────────
ORDER_SCHEMA = AvroSchema("""{
"type": "record",
"name": "Order",
"namespace": "com.mycompany.events",
"fields": [
{"name": "order_id", "type": "long"},
{"name": "customer_id", "type": "long"},
{"name": "amount", "type": "double"},
{"name": "order_date", "type": "string"},
{"name": "region", "type": ["null", "string"], "default": null}
]
}""")
# ── Create Glue Schema Registry client ───────────────────────
glue_client = boto3.client("glue", region_name="us-east-1")
registry_client = SchemaRegistryClient(glue_client, registry_name="prod-registry")
serializer = KafkaSerializer(registry_client)
# ── Kafka producer with Avro serialiser ──────────────────────
producer = KafkaProducer(
    bootstrap_servers="b-1.my-msk.abc.kafka.us-east-1.amazonaws.com:9092",
    value_serializer=serializer  # expects each value as a (data, schema) tuple
)
order_event = {
    "order_id": 1001,
    "customer_id": 42,
    "amount": 299.99,
    "order_date": "2024-06-15",
    "region": "us-east"
}
# The value must be a (data, schema) tuple when KafkaSerializer is used
producer.send(
    "prod.orders",
    key=str(order_event["order_id"]).encode(),
    value=(order_event, ORDER_SCHEMA)
)
producer.flush()
print("Order event published with Avro schema")
IAM authentication lets Kafka clients authenticate using their AWS IAM role or user credentials — no separate Kafka username/password needed. The Kafka client signs requests with AWS Signature V4. This is the recommended authentication method for Lambda, Glue, EMR, and EKS workloads because they already have IAM roles.
from kafka import KafkaProducer
from aws_msk_iam_sasl_signer import MSKAuthTokenProvider
try:
    from kafka.sasl.oauth import AbstractTokenProvider  # newer kafka-python releases
except ImportError:
    from kafka.oauth import AbstractTokenProvider       # kafka-python 2.0.x
class MSKTokenProvider(AbstractTokenProvider):
    """kafka-python calls token() whenever it needs a fresh IAM token."""
    def token(self):
        auth_token, _expiry_ms = MSKAuthTokenProvider.generate_auth_token("us-east-1")
        return auth_token
producer = KafkaProducer(
    bootstrap_servers=[
        "b-1.my-msk.abc.kafka.us-east-1.amazonaws.com:9098",  # IAM port = 9098
        "b-2.my-msk.abc.kafka.us-east-1.amazonaws.com:9098"
    ],
    security_protocol="SASL_SSL",
    sasl_mechanism="OAUTHBEARER",
    sasl_oauth_token_provider=MSKTokenProvider(),  # a provider object, not a bare function
    value_serializer=lambda v: v.encode("utf-8")
)
producer.send("prod.orders", value='{"order_id": 1001}')
producer.flush()
print("Message sent with IAM auth")
MSK encrypts data in transit using TLS by default. Clients connect on port 9094 (TLS), 9096 (SASL/SCRAM over TLS), or 9098 (IAM over TLS). Enable TLS-only mode on your MSK cluster to reject plaintext connections on port 9092. Download the Amazon CA certificate for client trust store configuration. The consumer below authenticates with SASL/SCRAM on port 9096.
from kafka import KafkaConsumer
import json
consumer = KafkaConsumer(
"prod.orders",
bootstrap_servers=[
"b-1.my-msk.abc.kafka.us-east-1.amazonaws.com:9096" # SASL_SSL port
],
security_protocol="SASL_SSL",
sasl_mechanism="SCRAM-SHA-512",
sasl_plain_username="kafka-user",
sasl_plain_password="kafka-password", # fetched from Secrets Manager
ssl_cafile="/etc/kafka/amazon-root-ca.pem", # AWS CA certificate
group_id="spark-etl-cg",
auto_offset_reset="earliest",
value_deserializer=lambda m: json.loads(m.decode("utf-8"))
)
for msg in consumer:
    print(f"Partition={msg.partition} Offset={msg.offset} Value={msg.value}")
Consumer lag is the difference between the latest message offset in a partition (the log end offset) and the last offset the consumer group has committed. Lag = 0 means the consumer is fully caught up. Lag = 10,000 means the consumer is 10,000 messages behind. Rising lag is the most important signal of a struggling consumer — your Spark job is processing slower than messages are arriving.
import boto3
from datetime import datetime, timedelta
cw = boto3.client("cloudwatch", region_name="us-east-1")
# MSK publishes SumOffsetLag metric per consumer group + topic
response = cw.get_metric_statistics(
Namespace="AWS/Kafka",
MetricName="SumOffsetLag",
Dimensions=[
{"Name": "Cluster Name", "Value": "prod-msk-cluster"},
{"Name": "Consumer Group", "Value": "spark-etl-cg"},
{"Name": "Topic", "Value": "prod.orders"}
],
StartTime=datetime.utcnow() - timedelta(minutes=30),
EndTime=datetime.utcnow(),
Period=60,
Statistics=["Maximum"]
)
for dp in sorted(response["Datapoints"], key=lambda x: x["Timestamp"]):
    print(f"{dp['Timestamp'].strftime('%H:%M')} lag={dp['Maximum']:,.0f}")
# ── Create a CloudWatch alarm if lag exceeds 100k messages ─────
cw.put_metric_alarm(
AlarmName="MSK-spark-etl-lag-high",
MetricName="SumOffsetLag",
Namespace="AWS/Kafka",
Dimensions=[
{"Name": "Cluster Name", "Value": "prod-msk-cluster"},
{"Name": "Consumer Group", "Value": "spark-etl-cg"},
{"Name": "Topic", "Value": "prod.orders"}
],
Statistic="Maximum",
Period=300,
EvaluationPeriods=2,
Threshold=100000,
ComparisonOperator="GreaterThanThreshold",
AlarmActions=["arn:aws:sns:us-east-1:123456789012:data-eng-alerts"],
TreatMissingData="notBreaching"
)
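The lag definition is easy to verify by hand. The sketch below computes per-partition and total lag from two offset maps; consumer_lag is a hypothetical helper, and treating a partition with no committed offset as fully unread is a simplifying assumption.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition and total lag for one consumer group on one topic.

    Both arguments map partition number -> offset. A partition without a
    committed offset is treated as fully unread (lag = log end offset).
    """
    per_partition = {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in log_end_offsets.items()
    }
    return per_partition, sum(per_partition.values())


log_end = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1500, 1: 1000}  # partition 2 has never been committed

per_partition, total_lag = consumer_lag(log_end, committed)
print(per_partition)
print("sum lag:", total_lag, "max lag:", max(per_partition.values()))
```

The sum corresponds to what SumOffsetLag reports for the group and topic; the maximum shows whether one slow partition is responsible for most of it.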
Spark Structured Streaming reads from MSK (Kafka) using readStream with the kafka format. Each Kafka partition maps to one Spark task. The value column comes as raw bytes — you must cast/parse it. Checkpoint location stores the committed offsets so the job can resume from where it left off after a restart.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, current_timestamp
from pyspark.sql.types import StructType, StructField, LongType, DoubleType, StringType
spark = SparkSession.builder \
.appName("MSKOrdersConsumer") \
.config("spark.sql.shuffle.partitions", "12") \
.getOrCreate()
# ── Schema for the JSON payload inside Kafka value ─────────────
order_schema = StructType([
StructField("order_id", LongType(), nullable=False),
StructField("customer_id", LongType(), nullable=False),
StructField("amount", DoubleType(), nullable=True),
StructField("order_date", StringType(), nullable=True),
StructField("region", StringType(), nullable=True)
])
MSK_BROKERS = ("b-1.my-msk.abc.kafka.us-east-1.amazonaws.com:9098,"
"b-2.my-msk.abc.kafka.us-east-1.amazonaws.com:9098")
# ── Read stream from MSK ───────────────────────────────────────
# maxOffsetsPerTrigger caps the records read per micro-batch (rate limiting).
# Note: a trailing comment after a backslash continuation is a syntax error,
# so the explanation lives here instead of on the option line.
raw_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", MSK_BROKERS) \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.mechanism", "AWS_MSK_IAM") \
    .option("kafka.sasl.jaas.config",
            "software.amazon.msk.auth.iam.IAMLoginModule required;") \
    .option("kafka.sasl.client.callback.handler.class",
            "software.amazon.msk.auth.iam.IAMClientCallbackHandler") \
    .option("subscribe", "prod.orders") \
    .option("startingOffsets", "latest") \
    .option("maxOffsetsPerTrigger", 50000) \
    .load()
# ── Parse JSON value column ────────────────────────────────────
orders = raw_stream \
.select(
col("key").cast("string").alias("msg_key"),
from_json(col("value").cast("string"), order_schema).alias("data"),
col("partition"),
col("offset"),
col("timestamp").alias("kafka_timestamp")
) \
.select("data.*", "partition", "offset", "kafka_timestamp") \
.withColumn("processed_at", current_timestamp())
# ── Write to Delta Lake (S3) ───────────────────────────────────
query = orders.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "s3://my-lake/checkpoints/orders-msk/") \
.option("path", "s3://my-lake/bronze/orders/") \
.trigger(processingTime="60 seconds") \
.start()
query.awaitTermination()
Use writeStream with kafka format to publish enriched or transformed events back to an MSK topic. The DataFrame must have a value column (and optionally key and topic columns). Convert complex types to JSON string with to_json().
from pyspark.sql.functions import to_json, struct
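# Enrich the orders stream: join with the static product dimension.
# Assumes the order payload carries a product_id field; it is not part of the
# order_schema defined earlier, so add it there before running this join.
dim_product = spark.read.parquet("s3://my-lake/silver/dim_product/")
enriched = orders.join(dim_product, "product_id", "left")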
# Prepare Kafka output: must have 'value' column as bytes/string
kafka_output = enriched.select(
col("order_id").cast("string").alias("key"),
to_json(struct("*")).alias("value")
)
# Write enriched events to a new MSK topic
kafka_output.writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", MSK_BROKERS) \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.mechanism", "AWS_MSK_IAM") \
.option("kafka.sasl.jaas.config",
"software.amazon.msk.auth.iam.IAMLoginModule required;") \
.option("kafka.sasl.client.callback.handler.class",
"software.amazon.msk.auth.iam.IAMClientCallbackHandler") \
.option("topic", "prod.orders-enriched") \
.option("checkpointLocation", "s3://my-lake/checkpoints/orders-enriched/") \
.outputMode("append") \
.trigger(processingTime="60 seconds") \
.start() \
.awaitTermination()
MSK Connect is a fully managed service that runs Kafka Connect connectors without you managing workers. You define a connector configuration, upload the connector plugin, and MSK Connect runs it for you — auto-scaling, monitoring, and patching included. The two most important connectors for data engineers are: S3 Sink (Kafka → S3) and JDBC Source (database → Kafka).
The Confluent S3 Sink Connector reads messages from Kafka topics and writes them to S3 as Parquet, Avro, or JSON — automatically, without writing any code. Use it to archive all Kafka events to your data lake for batch processing.
{
  "connector.class": "io.confluent.connect.s3.S3SinkConnector",
  "tasks.max": "6",
  "topics": "prod.orders,prod.clicks",
  "s3.region": "us-east-1",
  "s3.bucket.name": "my-data-lake",
  "s3.part.size": "67108864",
  "topics.dir": "bronze",
  "flush.size": "100000",
  "rotate.interval.ms": "300000",
  "storage.class": "io.confluent.connect.s3.storage.S3Storage",
  "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
  "parquet.codec": "snappy",
  "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
  "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
  "partition.duration.ms": "3600000",
  "locale": "en-US",
  "timezone": "UTC",
  "timestamp.extractor": "RecordField",
  "timestamp.field": "order_date",
  "schema.compatibility": "FULL"
}
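// Writes under: s3://my-data-lake/bronze/prod.orders/year=2024/month=06/day=15/hour=10/
// Object names follow <topic>+<kafka partition>+<start offset>.snappy.parquet
// A file is committed after 100k records, or once record timestamps span 5 minutes (rotate.interval.ms), whichever comes first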
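Batch jobs that read the connector's output need to derive the same prefix from a timestamp. A minimal sketch, assuming the path.format and topics.dir shown above; s3_prefix is a hypothetical helper, and timestamps are assumed to be in UTC.

```python
from datetime import datetime, timezone


def s3_prefix(topics_dir, topic, record_time):
    """Directory prefix for a record under the year=/month=/day=/hour= layout."""
    return (
        f"{topics_dir}/{topic}/"
        f"year={record_time:%Y}/month={record_time:%m}/"
        f"day={record_time:%d}/hour={record_time:%H}/"
    )


ts = datetime(2024, 6, 15, 10, 42, tzinfo=timezone.utc)
prefix = s3_prefix("bronze", "prod.orders", ts)
print(f"s3://my-data-lake/{prefix}")
```

Because the layout uses key=value directory names, Spark and Athena can treat year, month, day, and hour as partition columns.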
The JDBC Source Connector polls a relational database table on a schedule and publishes new/changed rows to Kafka. Use it for lightweight CDC when full log-based CDC (Debezium) is overkill. Configure mode=timestamp+incrementing to capture rows updated since the last poll using an updated_at column and auto-increment primary key.
{
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"tasks.max": "3",
"connection.url": "jdbc:postgresql://my-rds.abc.us-east-1.rds.amazonaws.com:5432/mydb",
"connection.user": "kafka_reader",
"connection.password": "${file:/opt/kafka/secrets.properties:db.password}",
"mode": "timestamp+incrementing",
"timestamp.column.name": "updated_at",
"incrementing.column.name": "order_id",
"table.whitelist": "public.orders,public.customers",
"topic.prefix": "jdbc.rds.",
"poll.interval.ms": "60000",
"batch.max.rows": "5000",
"numeric.mapping": "best_fit"
}
// Publishes to topics: jdbc.rds.orders, jdbc.rds.customers
// Polls every 60 seconds for rows newer than the last (updated_at, order_id) pair it has already published
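The selection logic behind mode=timestamp+incrementing can be sketched in plain Python. This illustrates the idea only and is not the connector's code: rows_since is a hypothetical name, and timestamps are ISO 8601 strings so that string comparison matches time order.

```python
def rows_since(rows, last_ts, last_id):
    """Select rows a timestamp+incrementing poll would pick up next.

    A row qualifies if its timestamp is newer than the stored one, or equal
    with a larger id. Results are ordered by (timestamp, id).
    """
    fresh = [
        r for r in rows
        if r["updated_at"] > last_ts
        or (r["updated_at"] == last_ts and r["order_id"] > last_id)
    ]
    return sorted(fresh, key=lambda r: (r["updated_at"], r["order_id"]))


rows = [
    {"order_id": 1, "updated_at": "2024-06-15T09:00:00"},
    {"order_id": 2, "updated_at": "2024-06-15T10:00:00"},
    {"order_id": 3, "updated_at": "2024-06-15T10:00:00"},
    {"order_id": 4, "updated_at": "2024-06-15T11:30:00"},
]

# Stored position after the previous poll: timestamp 10:00:00, id 2
picked = rows_since(rows, last_ts="2024-06-15T10:00:00", last_id=2)
picked_ids = [r["order_id"] for r in picked]
print(picked_ids)
```

The id tie-breaker is what lets row 3 through even though it shares a timestamp with the last published row. Deleted rows never appear in such a query, which is one reason log-based CDC exists.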
Amazon DynamoDB
DynamoDB is AWS's fully managed, serverless NoSQL database built for single-digit millisecond performance at any scale. For Data Engineers it is not a primary data warehouse — it is the operational backbone of pipelines: storing job audit records, pipeline control tables, checkpoint state, metadata configs, and watermarks. It requires zero server management and scales to millions of requests per second automatically.
Every DynamoDB table requires a partition key (also called a hash key). DynamoDB hashes this value and uses it to decide which physical partition stores the item. A good partition key has high cardinality — many distinct values — so items spread evenly across partitions and you avoid "hot partitions" that throttle reads/writes.
For example, an audit table is better keyed on a run_id UUID than on a fixed job_name string.
import boto3
dynamodb = boto3.client("dynamodb", region_name="us-east-1")
# Partition key only — simple primary key
dynamodb.create_table(
TableName="pipeline_audit",
KeySchema=[
{"AttributeName": "run_id", "KeyType": "HASH"} # partition key
],
AttributeDefinitions=[
{"AttributeName": "run_id", "AttributeType": "S"} # S = String
],
BillingMode="PAY_PER_REQUEST" # on-demand — no capacity planning needed
)
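The effect of key cardinality on data distribution can be simulated locally. DynamoDB's real hash function and partition count are internal, so the MD5 hash and the 8 partitions below are purely illustrative assumptions, and the function names are hypothetical.

```python
import hashlib
import uuid
from collections import Counter


def partition_for(key, partitions=8):
    """Map a key to a partition with a stable hash (MD5 here, for illustration)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % partitions


def spread(keys, partitions=8):
    """Count how many items land on each partition."""
    return Counter(partition_for(k, partitions) for k in keys)


# Low cardinality: every item shares one job_name, so one partition takes it all
hot = spread(["sales-etl"] * 10_000)

# High cardinality: a fresh UUID per run spreads items across partitions
even = spread(str(uuid.uuid4()) for _ in range(10_000))

print("fixed job_name:", dict(hot))
print("run_id UUIDs  :", dict(sorted(even.items())))
```

The fixed key lands all 10,000 items on one partition, which is the "hot partition" scenario that leads to throttling.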
An optional sort key (range key) combined with the partition key forms a composite primary key. Items with the same partition key are stored together and sorted by the sort key — enabling efficient range queries like "all runs for pipeline X between date A and date B". This is the most useful design for pipeline audit tables where you want to query all runs of a given pipeline.
# Composite primary key — very useful for pipeline audit tables
dynamodb.create_table(
TableName="pipeline_runs",
KeySchema=[
{"AttributeName": "pipeline_id", "KeyType": "HASH"}, # partition key
{"AttributeName": "run_timestamp", "KeyType": "RANGE"} # sort key
],
AttributeDefinitions=[
{"AttributeName": "pipeline_id", "AttributeType": "S"},
{"AttributeName": "run_timestamp", "AttributeType": "S"}
],
BillingMode="PAY_PER_REQUEST"
)
# Now you can query: "all runs for sales-etl, sorted newest-first"
# → KeyConditionExpression: pipeline_id = "sales-etl"
# → ScanIndexForward = False (descending by sort key)
Use pipeline_id as the partition key and run_timestamp (ISO 8601 string like 2024-01-15T08:00:00Z) as the sort key. This lets you fetch the last N runs of any pipeline with a single query() call using ScanIndexForward=False and Limit=N.
DynamoDB offers two billing and scaling models. For most DE use cases (pipeline metadata, audit tables with bursty writes at job completion), On-Demand (PAY_PER_REQUEST) is the right choice: no capacity planning, throttling only under extreme sudden spikes, and cost proportional to actual usage.
| Mode | How It Works | Best For | Cost Model |
|---|---|---|---|
| On-Demand (PAY_PER_REQUEST) | AWS scales capacity automatically with traffic | Bursty, unpredictable workloads such as pipeline audit writes | Pay per read/write request |
| Provisioned | You specify RCU and WCU; AWS holds that capacity reserved | Steady, predictable throughput at high volume | Pay per provisioned capacity unit per hour |
# RCU = Read Capacity Unit  → 1 strongly consistent read per second of an item up to 4 KB
#       (an eventually consistent read costs half as much)
# WCU = Write Capacity Unit → 1 write per second of an item up to 1 KB
# On-Demand skips all this; AWS handles it automatically
# Provisioned example — only if you have steady predictable load
dynamodb.create_table(
TableName="high_volume_events",
KeySchema=[{"AttributeName": "event_id", "KeyType": "HASH"}],
AttributeDefinitions=[{"AttributeName": "event_id", "AttributeType": "S"}],
BillingMode="PROVISIONED",
ProvisionedThroughput={
"ReadCapacityUnits": 100,
"WriteCapacityUnits": 50
}
)
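Provisioned capacity follows from item size and request rate. The sketch below applies the RCU and WCU definitions; the function names are hypothetical, and transactional reads and writes, which cost double, are left out.

```python
import math


def read_capacity_units(item_size_bytes, reads_per_second, strongly_consistent=True):
    """RCUs needed: item size rounds up to 4 KB blocks; eventual consistency halves the cost."""
    units_per_read = math.ceil(item_size_bytes / 4096)
    total = units_per_read * reads_per_second
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_size_bytes, writes_per_second):
    """WCUs needed: item size rounds up to 1 KB blocks."""
    return math.ceil(item_size_bytes / 1024) * writes_per_second


# A 6 KB item read 100 times per second, and a 2.5 KB item written 50 times per second
rcu_strong = read_capacity_units(6 * 1024, 100)
rcu_eventual = read_capacity_units(6 * 1024, 100, strongly_consistent=False)
wcu = write_capacity_units(2560, 50)
print(rcu_strong, rcu_eventual, wcu)
```

Rounding happens per item, so many small items are cheaper to read than a few items that barely exceed a 4 KB boundary.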
DynamoDB is the ideal store for pipeline config metadata — a table that describes every active pipeline (its source, target, load type, schedule, watermark column). The ETL orchestrator reads this table at startup and drives execution dynamically. This pattern is called metadata-driven ETL.
import boto3
from boto3.dynamodb.conditions import Key
# Use the resource (higher-level) API for cleaner item access
ddb = boto3.resource("dynamodb", region_name="us-east-1")
table = ddb.Table("pipeline_config")
# Fetch config for a specific pipeline by its primary key
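response = table.get_item(Key={"pipeline_id": "sales-orders-etl"})
config = response.get("Item")  # "Item" is absent when the key does not exist
if config is None:
    raise LookupError("No config found for pipeline sales-orders-etl")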
print(config["source_table"]) # → "raw.orders"
print(config["target_table"]) # → "silver.orders_cleaned"
print(config["load_type"]) # → "incremental"
print(config["watermark_column"]) # → "updated_at"
print(config["is_active"]) # → True
Typical fields in a pipeline_config item: pipeline_id, source_system (e.g. "salesforce"), source_table, target_table, load_type (full/incremental), watermark_column, schedule, is_active (bool), owner_team. Any field can be updated in DynamoDB without changing pipeline code.
Every pipeline run should write an audit record to DynamoDB — both at the start (status: RUNNING) and end (status: SUCCEEDED or FAILED). This gives you a complete operational log of every run, queryable without spinning up a database server. DynamoDB's low-latency writes make this a negligible overhead even in tight pipelines.
import boto3, uuid
from datetime import datetime, timezone
ddb = boto3.resource("dynamodb")
table = ddb.Table("pipeline_audit")
run_id = str(uuid.uuid4())
pipeline = "sales-orders-etl"
start_time = datetime.now(timezone.utc).isoformat()
# ① Write RUNNING record at pipeline start
table.put_item(Item={
"run_id": run_id,
"pipeline_id": pipeline,
"status": "RUNNING",
"start_time": start_time,
"end_time": None,
"rows_read": 0,
"rows_written":0,
"error_msg": None
})
# … run your ETL logic here …
rows_written = 142_830
# ② Update to SUCCEEDED at pipeline end
table.update_item(
Key={"run_id": run_id},
UpdateExpression="SET #s = :s, end_time = :et, rows_written = :rw",
ExpressionAttributeNames={"#s": "status"}, # "status" is a reserved word
ExpressionAttributeValues={
":s": "SUCCEEDED",
":et": datetime.now(timezone.utc).isoformat(),
":rw": rows_written
}
)
For pipeline-level configuration that changes between environments (dev/staging/prod) or between runs, such as S3 bucket names, JDBC URLs, batch sizes, or feature flags, DynamoDB is a fast, cheap, serverless alternative to hardcoding values or reading from S3. A single get_item() typically returns in single-digit milliseconds and costs a fraction of a cent.
def get_config(env: str, key: str) -> str:
    """Fetch a single config value for a given environment."""
    ddb = boto3.resource("dynamodb")
    table = ddb.Table("etl_config")
    resp = table.get_item(Key={"env": env, "config_key": key})
    return resp["Item"]["config_value"]
# Usage in your Glue / EMR job
s3_output = get_config("prod", "silver_bucket") # → "s3://my-co-silver"
batch_size = get_config("prod", "jdbc_batch_size") # → "50000"
A control table stores the current watermark (last successfully processed timestamp or offset) for each pipeline. Before each run, the pipeline reads the watermark to know where to start; after a successful run, it updates the watermark. This enables safe, idempotent incremental processing — if a run fails, the watermark is not updated so the next run re-processes from the last safe point.
import boto3
from datetime import datetime, timezone  # used for the updated_at timestamp below
ddb = boto3.resource("dynamodb")
wm_table = ddb.Table("pipeline_watermarks")
pipeline = "sales-orders-etl"
# ① Read last watermark before ETL run
resp = wm_table.get_item(Key={"pipeline_id": pipeline})
last_wm = resp["Item"]["last_watermark"] # e.g. "2024-01-14T23:59:59Z"
print(f"Reading records updated after {last_wm}")
# ② Run incremental ETL — read only new rows from source
# df = spark.read.jdbc(...).filter(f"updated_at > '{last_wm}'")
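# ③ After a successful ETL run, advance the watermark. The literal below is for
#    illustration; in practice use the max updated_at of the rows just processed.
new_wm = "2024-01-15T23:59:59Z"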
wm_table.update_item(
Key={"pipeline_id": pipeline},
UpdateExpression="SET last_watermark = :wm, updated_at = :ts",
ExpressionAttributeValues={
":wm": new_wm,
":ts": datetime.now(timezone.utc).isoformat()
}
)
print(f"Watermark updated to {new_wm}")
Only call update_item after all downstream writes (S3, Delta, Redshift) have been confirmed. If the pipeline fails mid-run, the unchanged watermark ensures the next run starts from the last safe point.
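The read-process-advance ordering can be exercised without AWS by swapping the DynamoDB table for a dict. This is a sketch under assumptions: run_incremental is a hypothetical function, the store is any dict-like object, and timestamps are ISO 8601 strings in one format so that string comparison matches time order.

```python
def run_incremental(store, pipeline_id, source_rows, process):
    """Process rows newer than the stored watermark, then advance it.

    `process` is the ETL step and may raise. The watermark moves only after
    `process` returns, and only to the newest timestamp actually processed.
    """
    last_wm = store[pipeline_id]
    batch = [r for r in source_rows if r["updated_at"] > last_wm]
    if not batch:
        return 0
    process(batch)  # if this raises, the watermark update below never runs
    store[pipeline_id] = max(r["updated_at"] for r in batch)
    return len(batch)


store = {"sales-orders-etl": "2024-01-14T23:59:59Z"}
rows = [
    {"order_id": 1, "updated_at": "2024-01-14T10:00:00Z"},
    {"order_id": 2, "updated_at": "2024-01-15T08:30:00Z"},
    {"order_id": 3, "updated_at": "2024-01-15T21:15:00Z"},
]


def failing_step(batch):
    raise RuntimeError("simulated write failure")


# First attempt fails: the watermark must stay where it was
try:
    run_incremental(store, "sales-orders-etl", rows, failing_step)
except RuntimeError:
    pass
after_failure = store["sales-orders-etl"]

# Retry succeeds: the same two rows are picked up again, then the watermark moves
loaded = []
count = run_incremental(store, "sales-orders-etl", rows, loaded.extend)
print(after_failure, count, store["sales-orders-etl"])
```

Deriving the new watermark from the processed rows, rather than from the wall clock, avoids skipping rows that arrive while the job is running.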
On EMR, AWS ships the EMR DynamoDB Connector — a Hadoop InputFormat that lets Spark read a DynamoDB table as an RDD/DataFrame. It handles parallel scanning across multiple Spark tasks, segment-level reads, and throughput throttle management. This is useful when you need to join pipeline metadata or config stored in DynamoDB with large datasets in S3.
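from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DynamoDB-Read").getOrCreate()
sc = spark.sparkContext
# EMR DynamoDB connector: a Hadoop InputFormat configured through job properties.
# Sketch only: it needs the connector JAR on the cluster, and PySpark may need a
# value converter for DynamoDBItemWritable; verify against your EMR release.
ddb_conf = {
    "dynamodb.input.tableName": "pipeline_config",
    "dynamodb.endpoint": "dynamodb.us-east-1.amazonaws.com",
    "dynamodb.splits": "4",  # number of parallel segments
}
rdd = sc.hadoopRDD(
    inputFormatClass="org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.dynamodb.DynamoDBItemWritable",
    conf=ddb_conf,
)
# Alternative for smaller tables: scan with boto3 and follow pagination
def scan_dynamo_partition(_):
    import boto3
    ddb = boto3.resource("dynamodb")
    table = ddb.Table("pipeline_config")
    page = table.scan()
    yield from page["Items"]
    while "LastEvaluatedKey" in page:  # scan returns at most 1 MB per call
        page = table.scan(ExclusiveStartKey=page["LastEvaluatedKey"])
        yield from page["Items"]
# For small config/metadata tables this is the simplest approach
config_items = sc.parallelize([1]).flatMap(scan_dynamo_partition).collect()
config_df = spark.createDataFrame(config_items)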
The recommended pattern for writing Spark results to DynamoDB is not a direct connector — it's using foreachPartition to batch-write items from each Spark partition using boto3 inside the executor. This is far more efficient than writing one item per row, since DynamoDB's batch_write_item supports up to 25 items per call.
import boto3
from decimal import Decimal
def _to_dynamo(value):
    """The resource API rejects float, so convert floats to Decimal."""
    return Decimal(str(value)) if isinstance(value, float) else value
def write_partition_to_dynamo(rows):
    """Called once per Spark partition; batch-writes its rows to DynamoDB."""
    # Create the client inside the function so it is built on the executor
    ddb = boto3.resource("dynamodb", region_name="us-east-1")
    table = ddb.Table("pipeline_audit")
    # batch_writer buffers items and sends them in batches of up to 25,
    # so no manual chunking is needed
    with table.batch_writer() as bw:
        for row in rows:
            item = {k: _to_dynamo(v) for k, v in row.asDict().items()}
            bw.put_item(Item=item)
# Write a small summary DataFrame (e.g. per-pipeline row counts) to DynamoDB
summary_df.foreachPartition(write_partition_to_dynamo)
The batch_writer() context manager automatically splits writes into batches of up to 25 items and resends any unprocessed items, which is far safer than manually calling batch_write_item() and handling UnprocessedItems yourself.
DynamoDB is schemaless: you only define the primary key attributes at table creation. Every other attribute is defined per-item and can vary between items. But each attribute value is typed: S (String), N (Number), B (Binary), BOOL (Boolean), NULL, L (List), M (Map), SS (String Set), NS (Number Set), BS (Binary Set). The resource API handles Python → DynamoDB type conversion automatically.
from decimal import Decimal  # the resource API returns and expects Decimal for non-integer numbers
table.put_item(Item={
    "run_id": "run-abc-123",               # S: partition key
    "pipeline_id": "sales-orders-etl",     # S
    "status": "SUCCEEDED",                 # S
    "rows_written": Decimal("142830"),     # N: int also works, float is rejected
    "dq_score": Decimal("98.5"),           # N
    "is_backfill": False,                  # BOOL
    "tags": ["sales", "incremental"],      # L: List
    "meta": {                              # M: Map (nested dict)
        "source_table": "raw.orders",
        "target_table": "silver.orders"
    }
})
The resource API rejects Python float values, so always use Decimal("98.5") for non-integer numeric attributes. The client API uses DynamoDB's low-level type notation ({"N": "98.5"}), where numbers are passed as strings, which avoids this issue.
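Since the resource API works with Decimal rather than float, nested items coming from Spark rows or JSON usually need a conversion pass before put_item. A minimal sketch with a hypothetical helper name; it uses only the standard library, and NaN or infinity values are not handled.

```python
from decimal import Decimal


def to_dynamo(value):
    """Recursively replace floats with Decimal so the resource API accepts the item."""
    if isinstance(value, float):
        return Decimal(str(value))  # str() avoids binary floating-point noise
    if isinstance(value, dict):
        return {k: to_dynamo(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_dynamo(v) for v in value]
    return value  # str, int, bool, None, Decimal pass through unchanged


raw_item = {
    "run_id": "run-abc-123",
    "dq_score": 98.5,
    "rows_written": 142830,
    "is_backfill": False,
    "meta": {"reject_ratio": 0.1, "tags": ["sales", 1.5]},
}
clean_item = to_dynamo(raw_item)
print(clean_item)
```

Going through str() matters: Decimal(0.1) would capture the float's binary approximation, while Decimal(str(0.1)) yields exactly 0.1.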
DynamoDB supports conditional expressions that make a write succeed only if a condition on the current item is true. For pipeline state machines (e.g., only update status to FAILED if current status is RUNNING), this prevents race conditions when multiple processes might update the same record.
from botocore.exceptions import ClientError
try:
    table.update_item(
        Key={"run_id": run_id},
        UpdateExpression="SET #s = :failed, error_msg = :err",
        # Condition: only succeed if the item currently has status = RUNNING
        ConditionExpression="#s = :running",
        ExpressionAttributeNames={"#s": "status"},
        ExpressionAttributeValues={
            ":failed": "FAILED",
            ":running": "RUNNING",
            ":err": "Out of memory on executor 3"
        }
    )
except ClientError as e:
    if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
        print("Run was already marked SUCCEEDED or FAILED, skipping")
    else:
        raise
A query() uses the partition key to read only items in one partition — fast, cheap, O(result size). A scan() reads every item in the table — slow, expensive, O(table size), and consumes capacity proportional to the full table. Always design your table's primary key so that your most common access pattern maps to a query() not a scan().
from boto3.dynamodb.conditions import Key, Attr
# ✅ QUERY — fast, targeted: all runs for a specific pipeline, newest first
resp = table.query(
KeyConditionExpression=Key("pipeline_id").eq("sales-orders-etl"),
ScanIndexForward=False, # descending by sort key (run_timestamp)
Limit=10 # last 10 runs only
)
runs = resp["Items"]
# ❌ SCAN — reads entire table, never use in production for lookups
resp = table.scan(
FilterExpression=Attr("status").eq("FAILED") # filter happens AFTER reading all items!
)
failed_runs = resp["Items"] # costs as if you read everything
scan() is fine for small tables (under ~1,000 items) like a config table or a pipeline registry with a few dozen rows. For audit tables with millions of rows per month — always design for query().
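Both query() and scan() return at most 1 MB of data per call and signal further pages through LastEvaluatedKey, so reading only resp["Items"] can silently drop results. The sketch below follows the pagination token; query_all is a hypothetical helper, and FakeTable is a stand-in that lets the example run without AWS.

```python
def query_all(table, **query_kwargs):
    """Yield every item from a paginated query by following LastEvaluatedKey."""
    while True:
        page = table.query(**query_kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        query_kwargs["ExclusiveStartKey"] = last_key


class FakeTable:
    """Stand-in for a boto3 Table that returns two items per page."""

    def __init__(self, items, page_size=2):
        self.items = items
        self.page_size = page_size

    def query(self, ExclusiveStartKey=None, **_ignored):
        start = ExclusiveStartKey["index"] if ExclusiveStartKey else 0
        end = start + self.page_size
        page = {"Items": self.items[start:end]}
        if end < len(self.items):
            page["LastEvaluatedKey"] = {"index": end}
        return page


items = [{"run": i} for i in range(5)]
fetched = list(query_all(FakeTable(items), KeyConditionExpression="ignored"))
print(len(fetched), "items across 3 pages")
```

With a real table, pass the same keyword arguments you would give table.query(). Combine it with Limit only when you want to cap the page size rather than the total.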
If you need to query your audit table by a different key — for example, your primary key is run_id (partition) + run_timestamp (sort), but you also want to query all FAILED runs regardless of pipeline — you add a Global Secondary Index (GSI) with status as the partition key. A GSI is a separate automatically maintained projection of your table with a different key structure.
dynamodb.create_table(
TableName="pipeline_audit",
KeySchema=[
{"AttributeName": "run_id", "KeyType": "HASH"},
{"AttributeName": "run_timestamp", "KeyType": "RANGE"}
],
AttributeDefinitions=[
{"AttributeName": "run_id", "AttributeType": "S"},
{"AttributeName": "run_timestamp", "AttributeType": "S"},
{"AttributeName": "status", "AttributeType": "S"}
],
GlobalSecondaryIndexes=[{
    "IndexName": "status-index",
    "KeySchema": [
        {"AttributeName": "status", "KeyType": "HASH"},
        {"AttributeName": "run_timestamp", "KeyType": "RANGE"}
    ],
    "Projection": {"ProjectionType": "ALL"}
    # No BillingMode key here: it is not a valid GSI setting, and the index follows the table's billing mode
}],
BillingMode="PAY_PER_REQUEST"
)
# Now query the GSI for all FAILED runs today
resp = table.query(
IndexName="status-index",
KeyConditionExpression=(
Key("status").eq("FAILED") &
Key("run_timestamp").begins_with("2024-01-15")
)
)
Every production DE team uses some form of pipeline audit table. Here is the recommended schema that covers all operational needs — queryable by run, by pipeline, and by status via GSIs.
Encapsulate the audit table logic into a reusable class that your Glue jobs, Lambda functions, and EMR scripts can all import from a shared library. This ensures consistent audit records across every pipeline in your platform.
import boto3, uuid
from decimal import Decimal
from datetime import datetime, timezone

class PipelineAudit:
    """Reusable audit writer for all DE pipelines."""

    def __init__(self, pipeline_id: str, table_name: str = "pipeline_audit"):
        ddb = boto3.resource("dynamodb")
        self.table = ddb.Table(table_name)
        self.pipeline_id = pipeline_id
        self.run_id = str(uuid.uuid4())
        self.start_ts = datetime.now(timezone.utc)

    def start(self, trigger_type: str = "SCHEDULED"):
        """Write RUNNING record at pipeline start."""
        ts = self.start_ts.isoformat()
        self.table.put_item(Item={
            "pipeline_id": self.pipeline_id,
            "run_timestamp": ts,
            "run_id": self.run_id,
            "status": "RUNNING",
            "trigger_type": trigger_type,
            "start_time": ts,
        })
        return self

    def succeed(self, rows_read=0, rows_written=0, rows_rejected=0, dq_score=100):
        """Update to SUCCEEDED with metrics."""
        end_ts = datetime.now(timezone.utc)
        # total_seconds() covers the whole interval; .seconds alone wraps after 24 hours
        duration = int((end_ts - self.start_ts).total_seconds())
        self.table.update_item(
            Key={"pipeline_id": self.pipeline_id, "run_timestamp": self.start_ts.isoformat()},
            UpdateExpression=("SET #s=:s, end_time=:et, duration_secs=:d,"
                              "rows_read=:rr, rows_written=:rw, rows_rejected=:rej, dq_score=:dq"),
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={
                ":s": "SUCCEEDED",
                ":et": end_ts.isoformat(),
                ":d": Decimal(str(duration)),
                ":rr": Decimal(str(rows_read)),
                ":rw": Decimal(str(rows_written)),
                ":rej": Decimal(str(rows_rejected)),
                ":dq": Decimal(str(dq_score))
            }
        )

    def fail(self, error_msg: str, error_type: str = "UNKNOWN"):
        """Update to FAILED with error details."""
        self.table.update_item(
            Key={"pipeline_id": self.pipeline_id, "run_timestamp": self.start_ts.isoformat()},
            UpdateExpression="SET #s=:s, end_time=:et, error_msg=:em, error_type=:et2",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={
                ":s": "FAILED",
                ":et": datetime.now(timezone.utc).isoformat(),
                ":em": error_msg[:1000],  # truncate long stack traces
                ":et2": error_type
            }
        )

# ─── Usage in any pipeline ───────────────────────────────────────────
audit = PipelineAudit("sales-orders-etl").start("SCHEDULED")
try:
    # … your ETL code …
    rows = 142_830
    audit.succeed(rows_read=rows, rows_written=rows, dq_score=99)
except Exception as e:
    audit.fail(error_msg=str(e), error_type="RUNTIME")
    raise
Amazon RDS
Amazon RDS (Relational Database Service) is AWS's managed relational database — you pick the engine, AWS handles patching, backups, failover, and scaling. For Data Engineers, RDS is almost always a source system, not a destination. Your job is to efficiently extract data from RDS into your data lake (S3 / Delta / Iceberg) using Spark JDBC reads, Glue crawlers, or CDC tools — while being careful not to hammer the production database with full table scans.
PostgreSQL is the default choice for most modern data-engineering-adjacent applications. It supports JSONB columns, arrays, window functions, and logical replication (which Debezium and DMS use for CDC). If your source system runs Postgres, you have the richest set of extraction options available.
Driver: org.postgresql.Driver
JAR: postgresql-42.x.x.jar
URL: jdbc:postgresql://host:5432/dbname
Incremental loads: use an updated_at or id column as the watermark for timestamp-based or offset-based incrementals.
MySQL is widely used in OLTP applications (especially older stacks and e-commerce). Its binary log (binlog) is the CDC source, and DMS and Debezium both read it. Spark JDBC reads work identically to Postgres, just with a different driver and URL format.
# PostgreSQL
jdbc:postgresql://mydb.abc123.us-east-1.rds.amazonaws.com:5432/sales_db
# MySQL
jdbc:mysql://mydb.abc123.us-east-1.rds.amazonaws.com:3306/sales_db
# Aurora PostgreSQL (same driver as PostgreSQL)
jdbc:postgresql://mycluster.cluster-abc123.us-east-1.rds.amazonaws.com:5432/sales_db
The simplest JDBC read uses a single connection: Spark sends one query and receives all results through one task. This is fine for small lookup tables (under ~1 million rows), but on large tables it becomes a bottleneck and can run the executor out of memory, because all data flows through one partition.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("RDS-Extract") \
.config("spark.jars", "/opt/jars/postgresql-42.6.0.jar") \
.getOrCreate()
# ⚠️ Single partition — only for small tables
df = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql://mydb.abc123.us-east-1.rds.amazonaws.com:5432/sales_db") \
.option("dbtable", "public.orders") \
.option("user", "etl_reader") \
.option("password", "secret") \
.load()
df.printSchema()
df.show(5)
This is the most important JDBC pattern for large tables. Spark splits the read into numPartitions parallel queries, each fetching a range of the partitionColumn. The queries run concurrently, one per Spark task (as many at once as there are free executor cores), which dramatically reduces extraction time. Each task opens its own JDBC connection to RDS.
import boto3, json
# ① Fetch credentials from Secrets Manager (always)
sm = boto3.client("secretsmanager")
secret = json.loads(sm.get_secret_value(SecretId="prod/rds/sales-db")["SecretString"])
jdbc_url = f"jdbc:postgresql://{secret['host']}:{secret['port']}/{secret['dbname']}"
# ② Find the actual min/max of the partition column first
bounds_df = spark.read \
.format("jdbc") \
.option("url", jdbc_url) \
.option("dbtable", "(SELECT MIN(order_id) AS mn, MAX(order_id) AS mx FROM public.orders) t") \
.option("user", secret["username"]) \
.option("password",secret["password"]) \
.load()
bounds = bounds_df.collect()[0]         # run the MIN/MAX query once
mn, mx = bounds["mn"], bounds["mx"]     # e.g. 1 and 10_000_000
# ③ Partitioned read: Spark issues numPartitions parallel JDBC queries
# (parentheses allow trailing comments, which backslash continuations do not)
df = (
    spark.read
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "public.orders")
    .option("user", secret["username"])
    .option("password", secret["password"])
    .option("partitionColumn", "order_id")   # numeric, date, or timestamp column
    .option("lowerBound", str(mn))           # min value of column
    .option("upperBound", str(mx))           # max value of column
    .option("numPartitions", "20")           # 20 parallel queries → 20 RDS connections
    .option("fetchsize", "10000")            # rows fetched per JDBC round-trip
    .load()
)
print(f"Partitions: {df.rdd.getNumPartitions()}") # → 20
print(f"Row count: {df.count()}")
With lowerBound=1, upperBound=10_000_000, numPartitions=20, Spark generates 20 WHERE clauses:
Partition 0: WHERE order_id < 500_001
Partition 1: WHERE order_id >= 500_001 AND order_id < 1_000_001
…
Partition 19: WHERE order_id >= 9_500_001 (catches everything above upperBound too)
⚠️ lowerBound and upperBound are only used to calculate split ranges; they do NOT filter data. Rows outside these bounds are still read (in the first/last partition).
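To make the range splitting concrete, here is a small pure-Python sketch that reproduces the clauses listed above. It is a simplified model with an integer stride, not Spark's actual code; newer Spark versions adjust the stride for rounding, so the exact boundaries you see in practice may differ slightly. The function name jdbc_partition_predicates is hypothetical.

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Approximate the per-partition WHERE clauses of a partitioned JDBC read.

    Simplified illustration (integer stride), not Spark's implementation.
    """
    if num_partitions < 2:
        return []  # a single partition reads the table with no WHERE clause
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        lower = f"{column} >= {current}" if i > 0 else None
        current += stride
        upper = f"{column} < {current}" if i < num_partitions - 1 else None
        if lower is None:
            # First partition is open-ended below and also picks up NULLs
            predicates.append(f"{upper} OR {column} IS NULL")
        elif upper is None:
            # Last partition is open-ended above
            predicates.append(lower)
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


clauses = jdbc_partition_predicates("order_id", 1, 10_000_000, 20)
print(clauses[0])    # first partition
print(clauses[1])    # a middle partition
print(clauses[-1])   # last partition
```

Because the first and last clauses are open-ended, bounds that are far from the real MIN/MAX do not lose rows; they only make some partitions much larger than others.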
For incremental loads, you don't want the full table — just rows updated since the last run. Combine a custom SQL query as the dbtable (wrapped in a subquery alias) with the partitioned read options. This pushes the WHERE updated_at > last_watermark filter down to RDS, so Spark only fetches new rows.
import boto3, json
from boto3.dynamodb.conditions import Key
# ① Read last watermark from DynamoDB control table
ddb = boto3.resource("dynamodb")
wm_table = ddb.Table("pipeline_watermarks")
last_wm = wm_table.get_item(Key={"pipeline_id": "orders-incremental"})["Item"]["last_watermark"]
# last_wm = "2024-01-14 23:59:59"
# ② Build a subquery — only new/updated rows
query = f"""(
SELECT order_id, customer_id, amount, status, updated_at
FROM public.orders
WHERE updated_at > '{last_wm}'
) incremental_orders"""
# ③ Partitioned incremental read
df = spark.read \
.format("jdbc") \
.option("url", jdbc_url) \
.option("dbtable", query) \
.option("user", secret["username"]) \
.option("password", secret["password"]) \
.option("partitionColumn", "order_id") \
.option("lowerBound", "1") \
.option("upperBound", "99999999") \
.option("numPartitions", "10") \
.load()
print(f"New/updated rows: {df.count()}")
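The query above interpolates the watermark into SQL with an f-string, which is only safe if the stored value is trustworthy. One way to reduce that risk is to validate the pieces before building the subquery. The sketch below is an illustration under assumptions: build_incremental_query is a hypothetical helper, identifiers are restricted to simple name or schema.name forms, and the watermark is assumed to be an ISO-style timestamp string such as "2024-01-14 23:59:59".

```python
import re
from datetime import datetime

# Accepts "name" or "schema.name" made of letters, digits, and underscores
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def build_incremental_query(table, columns, watermark_col, last_watermark, alias="delta"):
    """Build a '(SELECT ...) alias' string for the JDBC dbtable option.

    Hypothetical helper: rejects unexpected identifiers and any watermark
    that does not parse as a timestamp.
    """
    for name in [table, watermark_col, alias, *columns]:
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    # fromisoformat raises ValueError for anything that is not a plain timestamp
    ts = datetime.fromisoformat(last_watermark)
    literal = ts.isoformat(sep=" ")
    cols = ", ".join(columns)
    return (f"(SELECT {cols} FROM {table} "
            f"WHERE {watermark_col} > '{literal}') {alias}")
```

The returned string can be passed as-is to .option("dbtable", ...). A malformed watermark fails fast with a ValueError before any connection to RDS is opened.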
Predicate pushdown means Spark sends filter conditions down to the database so RDS executes them — returning only matching rows instead of the full table. For JDBC sources, Spark pushes simple filters (=, >, <, IN, IS NULL) automatically. You can verify pushdown is happening by checking the Spark UI's SQL tab — look for PushedFilters in the scan node.
# Spark automatically pushes this filter to RDS — RDS runs:
# SELECT * FROM orders WHERE status = 'COMPLETED'
df_filtered = df.filter("status = 'COMPLETED'")
# Verify pushdown in the physical plan
df_filtered.explain("extended")
# Look for: PushedFilters: [IsNotNull(status), EqualTo(status,COMPLETED)]
# For filters Spark cannot push down, put them in a subquery passed as dbtable:
# .option("dbtable", "(SELECT * FROM orders WHERE status='COMPLETED') t")
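The fetchsize option controls how many rows Spark retrieves from RDS in each network round-trip. The default depends on the JDBC driver: some drivers fetch very few rows at a time (Oracle's default is 10), while the PostgreSQL driver by default buffers the entire result set in memory. Setting fetchsize to 10,000–50,000 keeps memory bounded and sharply reduces the number of round-trips, which can cut extraction time several-fold on large tables.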
# Parentheses allow the trailing comment, which a backslash continuation does not
df = (
    spark.read
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "public.orders")
    .option("user", secret["username"])
    .option("password", secret["password"])
    .option("partitionColumn", "order_id")
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "20")
    .option("fetchsize", "50000")    # ← key performance option
    .option("driver", "org.postgresql.Driver")
    .load()
)
| Option | What It Controls | Recommended Value |
|---|---|---|
| numPartitions | Parallel JDBC connections to RDS | 10–50 (don't exceed RDS max connections) |
| fetchsize | Rows per JDBC network round-trip | 10,000–50,000 |
| partitionColumn | Column used to split range | Numeric, indexed, low-skew (primary key is ideal) |
| lowerBound | Range start (not a filter) | Actual MIN of partitionColumn |
| upperBound | Range end (not a filter) | Actual MAX of partitionColumn |
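A quick back-of-the-envelope calculation shows why fetchsize matters. The sketch below simply divides the row count by the batch size, assuming the driver honors fetchsize and streams rows in batches; jdbc_round_trips is a hypothetical helper and the 10-million-row table is an illustrative figure.

```python
def jdbc_round_trips(total_rows, fetchsize):
    """Number of network round-trips needed to stream total_rows rows."""
    if fetchsize <= 0:
        raise ValueError("fetchsize must be positive")
    # Ceiling division using integers only
    return -(-total_rows // fetchsize)


rows = 10_000_000
for size in (10, 1_000, 50_000):
    print(f"fetchsize={size:>6}: {jdbc_round_trips(rows, size):>9,} round-trips")
```

With a partitioned read, each of the numPartitions connections makes its own share of these round-trips, so the per-connection count is roughly the total divided by numPartitions.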
A read replica is a continuously synced copy of the primary RDS instance that accepts read-only queries. This is the most important RDS concept for Data Engineers: never run your Spark JDBC extracts against the production primary database. Heavy parallel JDBC reads (20 connections doing full table scans) can degrade application performance or cause connection pool exhaustion. Always point your ETL at a read replica.
# ❌ Primary endpoint — never use for heavy ETL reads
# jdbc:postgresql://mydb.abc123.us-east-1.rds.amazonaws.com:5432/sales_db
# ✅ Read replica endpoint — use this for all Spark JDBC extracts
jdbc_url_replica = "jdbc:postgresql://mydb-replica.abc123.us-east-1.rds.amazonaws.com:5432/sales_db"
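df = (
    spark.read
    .format("jdbc")
    .option("url", jdbc_url_replica)          # ← replica, not primary
    .option("dbtable", "public.orders")
    .option("user", secret["username"])
    .option("password", secret["password"])
    # numPartitions only parallelizes the read together with the next three options
    .option("partitionColumn", "order_id")
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "20")
    .load()
)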
Multi-AZ is an RDS feature that keeps a synchronous standby replica in a different Availability Zone. If the primary fails, RDS automatically promotes the standby — typical failover is 60–120 seconds. As a DE, Multi-AZ is mostly invisible to you (your JDBC connection reconnects after failover), but you need to understand it for production RDS sizing discussions and for explaining why your pipeline might see a brief connection error during an RDS maintenance window.
| Feature | Read Replica | Multi-AZ Standby |
|---|---|---|
| Purpose | Read scaling + ETL offload | High availability / failover |
| Accepts reads? | Yes — separate endpoint | No — standby only |
| Replication type | Asynchronous | Synchronous |
| Failover | Manual promotion required | Automatic — 60–120s |
| DE relevance | Point ETL here | Transparent — just handle reconnect |
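"Just handle reconnect" usually means wrapping the database call in a retry with exponential backoff, so a failover window does not fail the whole pipeline. The following is a minimal sketch, not a prescribed implementation: retry_with_backoff is a hypothetical helper, and the defaults (2-second base delay, 6 retries, roughly two minutes of total waiting) are assumptions chosen to cover the 60–120 second failover window mentioned above.

```python
import time


def retry_with_backoff(operation, retries=6, base_delay=2.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call operation() and retry on failure with exponentially growing delays.

    Hypothetical helper. `sleep` is injectable so the wait can be skipped in tests.
    """
    for attempt in range(retries + 1):
        try:
            return operation()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the original error
            # Delays grow as base_delay * 1, 2, 4, 8, ...
            sleep(base_delay * (2 ** attempt))
```

In a pipeline you might wrap the connection step, for example retry_with_backoff(lambda: psycopg2.connect(...), retry_on=(psycopg2.OperationalError,)), so that only connection-level errors are retried and genuine SQL errors still fail immediately.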
AWS Glue Crawlers can connect directly to an RDS instance via JDBC and automatically discover all tables and their schemas, registering them in the Glue Data Catalog. Once catalogued, your Glue ETL jobs and Athena can reference these tables by name without you writing schema definitions manually.
import boto3
glue = boto3.client("glue", region_name="us-east-1")
# First, create a Glue Connection for the RDS instance
glue.create_connection(
ConnectionInput={
    "Name": "rds-sales-db-connection",
    "ConnectionType": "JDBC",
    "ConnectionProperties": {
        "JDBC_CONNECTION_URL": "jdbc:postgresql://mydb-replica.abc123.us-east-1.rds.amazonaws.com:5432/sales_db",
        # Reference the secret by ID. "{{resolve:secretsmanager:...}}" is CloudFormation
        # syntax and is not expanded when passed through boto3.
        "SECRET_ID": "prod/rds/sales-db"   # secret must contain "username" and "password" keys
    },
"PhysicalConnectionRequirements": {
"SubnetId": "subnet-0abc123", # must be in same VPC as RDS
"SecurityGroupIdList": ["sg-0abc123"],
"AvailabilityZone": "us-east-1a"
}
}
)
# Create a crawler to scan the public schema of the RDS database
glue.create_crawler(
Name="rds-sales-db-crawler",
Role="arn:aws:iam::123456789:role/GlueCrawlerRole",
DatabaseName="raw_rds_sales", # Glue Catalog database to write to
Targets={"JdbcTargets": [{
"ConnectionName": "rds-sales-db-connection",
"Path": "sales_db/public/%" # db/schema/table — % wildcard = all tables
}]},
Schedule="cron(0 1 * * ? *)" # run nightly at 01:00 UTC
)
# Start the crawler
glue.start_crawler(Name="rds-sales-db-crawler")
For teams that are more comfortable with SQL than DynamoDB, RDS PostgreSQL (typically Aurora Serverless for cost) can serve as the pipeline metadata and audit database. You store run history, watermarks, control tables, and data quality results in relational tables — and query them with standard SQL joins. The trade-off vs DynamoDB: RDS requires connection management (pool size, VPC placement) and is not serverless in the same zero-management sense.
import boto3, json, psycopg2
# ① Fetch credentials
sm = boto3.client("secretsmanager")
secret = json.loads(sm.get_secret_value(SecretId="prod/rds/metadata-db")["SecretString"])
# ② Connect to RDS PostgreSQL
conn = psycopg2.connect(
host= secret["host"],
port= secret["port"],
dbname= secret["dbname"],
user= secret["username"],
password= secret["password"],
sslmode= "require" # always enforce SSL on RDS
)
cursor = conn.cursor()
# ③ Insert audit record
cursor.execute("""
INSERT INTO pipeline_audit
(run_id, pipeline_id, status, start_time, rows_written, error_msg)
VALUES (%s, %s, %s, NOW(), %s, %s)
""", (run_id, "sales-orders-etl", "SUCCEEDED", 142830, None))
conn.commit()
cursor.close()
conn.close()
This is the standard end-to-end pattern for a daily incremental extract from RDS to the Bronze layer of your data lake on S3:
import boto3, json, uuid
from datetime import datetime, timezone
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, lit
spark = SparkSession.builder.appName("RDS-Bronze-Extract").getOrCreate()
# ── Config ─────────────────────────────────────────────────────────
PIPELINE = "rds-orders-incremental"
S3_OUTPUT = "s3://company-bronze/rds/sales/orders/"
RUN_ID = str(uuid.uuid4())
# ── Credentials ────────────────────────────────────────────────────
sm = boto3.client("secretsmanager")
secret = json.loads(sm.get_secret_value(SecretId="prod/rds/sales-db")["SecretString"])
jdbc_url = f"jdbc:postgresql://{secret['replica_host']}:5432/{secret['dbname']}"
# ── Watermark ──────────────────────────────────────────────────────
ddb = boto3.resource("dynamodb")
wm_tbl = ddb.Table("pipeline_watermarks")
last_wm = wm_tbl.get_item(Key={"pipeline_id": PIPELINE})["Item"]["last_watermark"]
# ── Extract ────────────────────────────────────────────────────────
query = f"(SELECT * FROM public.orders WHERE updated_at > '{last_wm}') orders_delta"
df = spark.read.format("jdbc") \
.option("url", jdbc_url) \
.option("dbtable", query) \
.option("user", secret["username"]) \
.option("password", secret["password"]) \
.option("partitionColumn", "order_id") \
.option("lowerBound", "1") \
.option("upperBound", "99999999") \
.option("numPartitions", "20") \
.option("fetchsize", "50000") \
.load()
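df.cache()   # reuse one extract for count, write, and max instead of re-querying RDS
rows = df.count()
# ── Bronze Audit Columns ───────────────────────────────────────────
load_ts = datetime.now(timezone.utc)
df_bronze = df \
    .withColumn("_load_timestamp", current_timestamp()) \
    .withColumn("_batch_id", lit(RUN_ID)) \
    .withColumn("_pipeline_name", lit(PIPELINE)) \
    .withColumn("year", lit(load_ts.year)) \
    .withColumn("month", lit(load_ts.month)) \
    .withColumn("day", lit(load_ts.day))
# ── Write to S3 Bronze ─────────────────────────────────────────────
# year/month/day are load-date columns added above; partitionBy fails if they are missing
df_bronze.write \
    .partitionBy("year", "month", "day") \
    .mode("append") \
    .parquet(S3_OUTPUT)
# ── Update Watermark ───────────────────────────────────────────────
new_wm = df.agg({"updated_at": "max"}).collect()[0][0]
if new_wm is not None:   # no new rows: keep the old watermark instead of storing "None"
    wm_tbl.update_item(
        Key={"pipeline_id": PIPELINE},
        UpdateExpression="SET last_watermark = :wm",
        ExpressionAttributeValues={":wm": str(new_wm)}
    )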
print(f"✅ Extracted {rows:,} rows → {S3_OUTPUT}")
Amazon EventBridge
EventBridge is AWS's serverless event bus — it routes events from sources (AWS services, your own application code, SaaS tools) to targets (Lambda, SQS, SNS, Glue, Step Functions, and more) based on rules you define. For Data Engineers it serves two critical roles: cron-style pipeline scheduling (replacing simple time-based triggers) and event-driven pipeline triggers (reacting to S3 file arrivals, DMS task completions, Glue job state changes, and custom application events).
An event is a JSON object describing something that happened — a state change, a file arrival, a job completion, or a custom business event your code publishes. Every event has a standard envelope with source, detail-type, detail (the payload), time, and account/region. EventBridge receives events and routes them to matching rules.
{
"version": "0",
"id": "abc-123-def-456",
"source": "com.mycompany.data-platform", // who sent it
"detail-type": "PipelineCompleted", // what kind of event
"time": "2024-01-15T08:30:00Z",
"account": "123456789012",
"region": "us-east-1",
"detail": { // your custom payload
"pipeline_id": "sales-orders-etl",
"status": "SUCCEEDED",
"rows_written": 142830,
"output_path": "s3://company-silver/orders/year=2024/month=01/day=15/"
}
}
A rule is the routing logic — it matches incoming events against a pattern (or a schedule) and sends matching events to one or more targets. There are two types of rules: Schedule rules (time-based — run at 06:00 UTC every day) and Event pattern rules (content-based — trigger when a Glue job state changes to FAILED).
{
"source": ["aws.glue"],
"detail-type": ["Glue Job State Change"],
"detail": {
"state": ["FAILED", "ERROR", "TIMEOUT"],
"jobName": ["sales-orders-etl", "customer-dim-etl"]
}
}
A target is where EventBridge sends the matched event. One rule can have up to 5 targets — so a single event can simultaneously trigger a Lambda, send a message to SQS, and notify an SNS topic. Common DE targets:
Every AWS account has a default event bus that receives all AWS service events (S3, Glue, EMR, DMS state changes, etc.). You can also create custom event buses for your own application events — isolating them from AWS service events and enabling cross-account event routing. For pipeline platforms, a custom bus named something like data-platform-events keeps your events organized and separate.
import boto3
eb = boto3.client("events", region_name="us-east-1")
# Create a custom event bus for all data platform events
eb.create_event_bus(Name="data-platform-events")
# Default bus is always named "default"
# Custom buses are referenced by ARN or name in rules and put_events
EventBridge supports cron expressions for precise scheduling — run at exactly 06:00 UTC every weekday, or at midnight on the first of every month. EventBridge cron uses a 6-field format: cron(minutes hours day-of-month month day-of-week year). Note: EventBridge does not support seconds-level granularity — minimum is 1 minute.
# Format: cron(minutes hours day-of-month month day-of-week year)
cron(0 6 * * ? *) # Daily at 06:00 UTC
cron(0 1 * * ? *) # Daily at 01:00 UTC (typical nightly batch)
cron(0 0 1 * ? *) # Monthly — 1st of every month at midnight
cron(0 6 ? * MON-FRI *) # Weekdays only at 06:00 UTC
cron(0 */4 * * ? *) # Every 4 hours
cron(0 8 ? * MON *) # Every Monday at 08:00 UTC (weekly batch)
# Note: use ? in day-of-month OR day-of-week (not both)
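The two rules stated above (six fields, and ? in either day-of-month or day-of-week but not both) are easy to get wrong, so a small pre-flight check before calling put_rule can save a failed deployment. This is a minimal sketch covering only those two rules; check_eventbridge_cron is a hypothetical helper, and it does not validate the individual field values.

```python
import re

_CRON = re.compile(r"^cron\((.+)\)$")


def check_eventbridge_cron(expression):
    """Return a list of problems found; an empty list means the basic checks pass."""
    match = _CRON.match(expression.strip())
    if not match:
        return ["expression must look like cron(...)"]
    fields = match.group(1).split()
    if len(fields) != 6:
        return [f"expected 6 fields, found {len(fields)}"]
    day_of_month, day_of_week = fields[2], fields[4]
    problems = []
    # Exactly one of the two day fields should be the '?' placeholder
    if (day_of_month == "?") == (day_of_week == "?"):
        problems.append("exactly one of day-of-month / day-of-week must be '?'")
    return problems
```

For example, check_eventbridge_cron("cron(0 6 ? * MON-FRI *)") returns an empty list, while the standard five-field Unix form "cron(0 6 * * *)" is reported as having the wrong number of fields.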
Rate expressions are simpler than cron — they just say "run every N minutes/hours/days." Use them when you don't need a specific clock time, just a regular interval. They start running immediately when the rule is created.
rate(5 minutes) # every 5 minutes — lightweight polling pipeline
rate(1 hour) # every hour
rate(1 day) # every day (from rule creation time, not midnight)
rate(12 hours) # twice a day
The full pattern: create a rule with a schedule, create a target (e.g. Lambda that triggers a Glue job), and add the permission for EventBridge to invoke the Lambda.
import boto3, json
eb = boto3.client("events")
lam = boto3.client("lambda")
RULE_NAME = "nightly-sales-etl-trigger"
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789:function:trigger-sales-etl"
LAMBDA_FUNC = "trigger-sales-etl"
# ① Create the schedule rule
rule_resp = eb.put_rule(
Name= RULE_NAME,
ScheduleExpression= "cron(0 1 * * ? *)", # every night at 01:00 UTC
State= "ENABLED",
Description= "Triggers the nightly sales orders ETL pipeline"
)
rule_arn = rule_resp["RuleArn"]
# ② Add Lambda as target — pass pipeline config in the input JSON
eb.put_targets(
Rule=RULE_NAME,
Targets=[{
"Id": "sales-etl-lambda-target",
"Arn": LAMBDA_ARN,
"Input": json.dumps({ # static JSON sent to Lambda as event
"pipeline_id": "sales-orders-etl",
"trigger_type": "SCHEDULED",
"environment": "prod"
})
}]
)
# ③ Grant EventBridge permission to invoke the Lambda
lam.add_permission(
FunctionName= LAMBDA_FUNC,
StatementId= "allow-eventbridge-invoke",
Action= "lambda:InvokeFunction",
Principal= "events.amazonaws.com",
SourceArn= rule_arn
)
print(f"✅ Scheduled rule created: {rule_arn}")
The most common event-driven DE pattern: a file lands in S3, EventBridge detects it, triggers a Lambda, which starts a Glue job to process the file. This eliminates polling — your pipeline reacts within seconds of file arrival rather than waiting for a fixed schedule.
import boto3, json
eb = boto3.client("events")
# Rule: match S3 ObjectCreated events for a specific prefix
eb.put_rule(
Name= "s3-salesforce-feed-arrival",
EventPattern= json.dumps({
"source": ["aws.s3"],
"detail-type": ["Object Created"],
"detail": {
"bucket": {"name": ["company-raw-landing"]},
"object": {"key": [{"prefix": "feeds/salesforce/"}]}
}
}),
State= "ENABLED",
Description= "Trigger Glue when Salesforce feed lands in S3"
)
# Lambda handler that EventBridge calls — triggers Glue
# (This goes in your Lambda function code, not boto3 call)
LAMBDA_CODE = """
import boto3, json

def handler(event, context):
    glue = boto3.client("glue")
    detail = event["detail"]
    bucket = detail["bucket"]["name"]
    key = detail["object"]["key"]
    glue.start_job_run(
        JobName="salesforce-accounts-etl",
        Arguments={
            "--source-bucket": bucket,
            "--source-key": key,
            "--trigger-type": "S3_EVENT"
        }
    )
    return {"status": "Glue job started", "key": key}
"""
print("Lambda code ready — deploy as trigger-salesforce-etl function")
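Note: S3 delivers events to EventBridge only after EventBridge notifications are enabled on the bucket: s3.put_bucket_notification_configuration(Bucket=..., NotificationConfiguration={"EventBridgeConfiguration": {}}).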
If your orchestration is Airflow (MWAA), you can trigger a DAG run on file arrival by routing the EventBridge event to a Lambda that calls the Airflow REST API to trigger the DAG. This gives you the best of both worlds: event-driven file detection + full Airflow orchestration for the downstream pipeline.
import boto3, json, requests

def handler(event, context):
    # Extract file details from EventBridge S3 event
    detail = event["detail"]
    s3_key = detail["object"]["key"]
    # Fetch Airflow credentials from Secrets Manager
    sm = boto3.client("secretsmanager")
    secret = json.loads(sm.get_secret_value(SecretId="prod/airflow/api-creds")["SecretString"])
    # Trigger the Airflow DAG via REST API
    # Assumes a webserver with basic-auth REST access enabled; MWAA uses
    # IAM-based tokens instead, so adapt the auth step there
    airflow_url = secret["webserver_url"]
    dag_id = "salesforce_feed_processor"
    resp = requests.post(
        f"{airflow_url}/api/v1/dags/{dag_id}/dagRuns",
        auth=(secret["username"], secret["password"]),
        json={"conf": {"s3_key": s3_key, "trigger": "s3_event"}},
        timeout=10
    )
    resp.raise_for_status()
    return {"dag_run_id": resp.json()["dag_run_id"], "s3_key": s3_key}
AWS Glue automatically emits state-change events to EventBridge when a job transitions to SUCCEEDED, FAILED, TIMEOUT, or STOPPED. You can create a rule that routes FAILED events directly to SNS — giving you instant alerting without any polling code.
import boto3, json
eb = boto3.client("events")
sns = boto3.client("sns")
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789:de-pipeline-failures"
EB_ROLE_ARN = "arn:aws:iam::123456789:role/EventBridgeSNSPublishRole"
# Rule: any Glue job in FAILED or TIMEOUT state
eb.put_rule(
Name= "glue-job-failure-alert",
EventPattern= json.dumps({
"source": ["aws.glue"],
"detail-type": ["Glue Job State Change"],
"detail": {"state": ["FAILED", "TIMEOUT"]}
}),
State= "ENABLED"
)
# Target: SNS topic. EventBridge sends the full event JSON as the message.
# No RoleArn here: SNS targets are authorized by the topic's resource policy (set below).
eb.put_targets(
    Rule="glue-job-failure-alert",
    Targets=[{
        "Id": "sns-failure-alert",
        "Arn": SNS_TOPIC_ARN
    }]
)
# Allow EventBridge to publish to this SNS topic
sns.set_topic_attributes(
TopicArn= SNS_TOPIC_ARN,
AttributeName= "Policy",
AttributeValue= json.dumps({
"Statement": [{
"Effect": "Allow",
"Principal": {"Service": "events.amazonaws.com"},
"Action": "SNS:Publish",
"Resource": SNS_TOPIC_ARN
}]
})
)
put_events() lets your pipeline code publish its own events to EventBridge — enabling downstream pipelines to react automatically. For example, when the Bronze orders ETL completes, it publishes a BronzeLayerReady event, which triggers the Silver transformation pipeline. This creates a fully event-driven, decoupled pipeline chain without any polling or hardcoded dependencies.
import boto3, json
from datetime import datetime, timezone
eb = boto3.client("events")
def publish_pipeline_event(pipeline_id: str, status: str,
                           rows_written: int, output_path: str):
    """Publish a pipeline completion event to EventBridge."""
    resp = eb.put_events(
        Entries=[{
            "EventBusName": "data-platform-events",  # custom bus
            "Source": "com.mycompany.data-platform",
            "DetailType": "PipelineStateChange",
            "Detail": json.dumps({
                "pipeline_id": pipeline_id,
                "status": status,
                "rows_written": rows_written,
                "output_path": output_path,
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "layer": "bronze"
            })
        }]
    )
    failed = resp["FailedEntryCount"]
    if failed > 0:
        raise RuntimeError(f"EventBridge put_events failed: {resp['Entries']}")
    return resp["Entries"][0]["EventId"]
# At end of Bronze ETL pipeline — publish event to trigger Silver
event_id = publish_pipeline_event(
pipeline_id= "rds-orders-bronze",
status= "SUCCEEDED",
rows_written= 142830,
output_path= "s3://company-bronze/rds/sales/orders/year=2024/month=01/day=15/"
)
print(f"✅ Event published: {event_id}")
By combining put_events() with EventBridge rules, you can chain pipeline stages so each stage automatically triggers the next on success — with no orchestrator polling between them. This is the foundation of a reactive data platform.
put_events() accepts up to 10 events per call, with each event up to 256 KB. If you need to publish more than 10 events at once (e.g. one event per table in a metadata-driven pipeline with 50 tables), batch them into groups of 10 and check FailedEntryCount on every response.
import boto3, json
from datetime import datetime, timezone
eb = boto3.client("events")
def publish_events_batch(events: list):
    """Publish a list of events, batching into groups of 10."""
    failures = []
    for i in range(0, len(events), 10):  # chunk into 10s
        batch = events[i:i+10]
        resp = eb.put_events(Entries=batch)
        if resp["FailedEntryCount"] > 0:
            failures.extend([
                e for e in resp["Entries"] if "ErrorCode" in e
            ])
    if failures:
        raise RuntimeError(f"Failed to publish {len(failures)} events: {failures}")
# Build one event per completed pipeline table
pipeline_results = [
{"table": "orders", "rows": 142830},
{"table": "customers", "rows": 85200},
# ... up to 50 tables
]
entries = [{
"EventBusName": "data-platform-events",
"Source": "com.mycompany.data-platform",
"DetailType": "TableLoadComplete",
"Detail": json.dumps({
"table_name": r["table"],
"rows": r["rows"],
"timestamp": datetime.now(timezone.utc).isoformat()
})
} for r in pipeline_results]
publish_events_batch(entries)
print(f"✅ Published {len(entries)} table completion events")
EventBridge event patterns support rich filtering beyond simple equality — prefix matching, suffix matching, numeric ranges, existence checks, and anything-but (negation). This lets you route events with surgical precision.
{
"source": ["com.mycompany.data-platform"],
"detail": {
// Exact match
"status": ["SUCCEEDED"],
// Multiple values — OR logic
"layer": ["bronze", "silver"],
// Prefix match — key starts with "orders"
"table_name": [{ "prefix": "orders" }],
// Anything-but — NOT these values
"environment": [{ "anything-but": ["dev", "test"] }],
// Exists — only match if field is present
"output_path": [{ "exists": true }],
// Numeric range — rows_written between 1000 and 10_000_000
"rows_written": [{ "numeric": [">", 1000, "<=", 10000000] }]
}
}
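To build intuition for how these operators combine (AND across fields, OR within each list), the sketch below evaluates a pattern against an event locally. It is a simplified model written for illustration, not EventBridge's matching engine: it covers only the operators shown above, ignores array-valued event fields, and the names matches, _match_value, and _numeric_ok are hypothetical.

```python
import operator

_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
        ">=": operator.ge, "=": operator.eq}


def _numeric_ok(value, spec):
    """spec is a flat list such as [">", 1000, "<=", 10000000]."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return all(_OPS[op](value, bound) for op, bound in zip(spec[0::2], spec[1::2]))


def _match_value(value, present, candidates):
    """True if any candidate in the pattern list accepts the field (OR logic)."""
    for cand in candidates:
        if not isinstance(cand, dict):
            if present and value == cand:
                return True
        elif "exists" in cand:
            if cand["exists"] == present:
                return True
        elif not present:
            continue  # remaining operators need the field to be there
        elif "prefix" in cand:
            if isinstance(value, str) and value.startswith(cand["prefix"]):
                return True
        elif "anything-but" in cand:
            blocked = cand["anything-but"]
            blocked = blocked if isinstance(blocked, list) else [blocked]
            if value not in blocked:
                return True
        elif "numeric" in cand:
            if _numeric_ok(value, cand["numeric"]):
                return True
    return False


def matches(pattern, event):
    """True if every field in the pattern is satisfied (AND logic)."""
    for key, expected in pattern.items():
        if isinstance(expected, dict):
            sub = event.get(key)
            if not isinstance(sub, dict) or not matches(expected, sub):
                return False
        elif not _match_value(event.get(key), key in event, expected):
            return False
    return True


pattern = {
    "source": ["com.mycompany.data-platform"],
    "detail": {
        "status": ["SUCCEEDED"],
        "layer": ["bronze", "silver"],
        "table_name": [{"prefix": "orders"}],
        "environment": [{"anything-but": ["dev", "test"]}],
        "output_path": [{"exists": True}],
        "rows_written": [{"numeric": [">", 1000, "<=", 10000000]}],
    },
}
event = {
    "source": "com.mycompany.data-platform",
    "detail": {
        "status": "SUCCEEDED", "layer": "bronze", "table_name": "orders_daily",
        "environment": "prod", "rows_written": 142830,
        "output_path": "s3://company-bronze/rds/sales/orders/year=2024/month=01/day=15/",
    },
}
print(matches(pattern, event))
```

For authoritative results against the real service, the boto3 events client provides test_event_pattern, which evaluates a pattern against a sample event without creating a rule.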
The full lifecycle of EventBridge rules — listing active rules, temporarily disabling them during maintenance windows, removing stale rules, and listing their targets — is managed via boto3.
import boto3
eb = boto3.client("events")
# List all rules (paginate for > 100 rules)
paginator = eb.get_paginator("list_rules")
for page in paginator.paginate(EventBusName="data-platform-events"):
    for rule in page["Rules"]:
        print(f"{rule['Name']:40s} {rule['State']:10s} {rule.get('ScheduleExpression','')}")
# Disable a rule during a maintenance window
eb.disable_rule(Name="nightly-sales-etl-trigger", EventBusName="default")
# Re-enable after maintenance
eb.enable_rule(Name="nightly-sales-etl-trigger", EventBusName="default")
# List targets for a rule
targets = eb.list_targets_by_rule(Rule="nightly-sales-etl-trigger")
for t in targets["Targets"]:
    print(f"  Target: {t['Id']} → {t['Arn']}")
# Remove a target (required before deleting the rule)
eb.remove_targets(
Rule= "nightly-sales-etl-trigger",
Ids= ["sales-etl-lambda-target"]
)
# Delete the rule (only after removing all targets)
eb.delete_rule(Name="nightly-sales-etl-trigger")
Remove all targets before calling delete_rule(). Calling delete_rule() on a rule that still has targets will raise a ValidationException. Always: remove_targets() → then delete_rule().
You may see references to CloudWatch Events in older AWS documentation and Terraform resources. EventBridge is CloudWatch Events — it was renamed and expanded in 2019. The same API, same underlying service. All new development should use the EventBridge name and console; CloudWatch Events still works but points to the same service.
| Feature | EventBridge (current) | CloudWatch Events (old name) |
|---|---|---|
| AWS service events | ✅ | ✅ |
| Custom event buses | ✅ | ❌ |
| SaaS partner events | ✅ | ❌ |
| Schema registry | ✅ | ❌ |
| Same API calls? | Yes, boto3 client("events") | Yes, boto3 client("events") |
Amazon SQS — Simple Queue Service
SQS is AWS's fully managed message queue service. For Data Engineers, it acts as a buffer and decoupler between pipeline stages — absorbing bursts of messages so your downstream processors (Lambda, Glue, Spark) don't get overwhelmed. It's the backbone of event-driven architectures: file arrivals land in SQS, pipelines consume from SQS, failures get routed to a Dead Letter Queue (DLQ). Understanding SQS deeply — visibility timeouts, polling strategies, DLQ design, and the consume-process-delete pattern — is essential for reliable production pipelines.
A Standard Queue gives you nearly unlimited throughput — it can handle thousands of messages per second. The trade-off is that it offers at-least-once delivery (a message might be delivered more than once) and best-effort ordering (messages may arrive out of order). For most data pipeline use cases — triggering Glue jobs, notifying Lambda of file arrivals, buffering pipeline events — Standard Queue is the right choice. Your consumer code simply needs to be idempotent (processing the same message twice produces the same result).
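To make "idempotent" concrete, here is a minimal sketch of a consumer-side guard. It assumes the message body itself identifies the unit of work, and it uses an in-memory set as a stand-in for a durable store (in production this would more likely be a database or DynamoDB table). The names `idempotency_key` and `process_once` are hypothetical.

```python
import hashlib
import json


def idempotency_key(body: str) -> str:
    """Derive a stable key from the raw message body."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def process_once(body: str, seen_keys: set, handler) -> bool:
    """Run handler(parsed_body) unless this exact body was already handled.

    Returns True if the handler ran, False if the message was a duplicate.
    seen_keys is an in-memory stand-in for a durable store (assumption).
    """
    key = idempotency_key(body)
    if key in seen_keys:
        return False  # duplicate delivery: safe to delete without reprocessing
    handler(json.loads(body))
    seen_keys.add(key)  # record only after the handler succeeded
    return True
```

Because the key is recorded only after the handler succeeds, a crash mid-processing leaves the message eligible for a normal retry.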
A FIFO Queue (First-In-First-Out) guarantees exactly-once processing and strict message ordering within a message group. It prevents duplicates through a deduplication ID — if you send two messages with the same deduplication ID within a 5-minute window, the second is discarded. FIFO queues have a throughput limit of 3,000 messages/second with batching (300 without). Use FIFO when order matters — for example, processing database CDC events where an INSERT must be processed before the UPDATE of the same row.
| Feature | Standard Queue | FIFO Queue |
|---|---|---|
| Delivery guarantee | At-least-once | Exactly-once |
| Ordering | Best-effort | Strict (per group) |
| Throughput | Nearly unlimited | 3,000 msg/sec (batched) |
| Deduplication | No | Yes (5-min window) |
| Queue name suffix | Any name | Must end in .fifo |
| Use case in DE | File triggers, pipeline events, alerts | CDC events, ordered state transitions |
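For the CDC case, a FIFO send needs a message group ID (the ordering scope) and, unless content-based deduplication is enabled on the queue, a deduplication ID. The sketch below only builds the keyword arguments for `send_message()`; the helper name `build_fifo_message` and the event fields `table`, `pk`, and `op` are assumptions made for illustration.

```python
import hashlib
import json


def build_fifo_message(queue_url: str, cdc_event: dict) -> dict:
    """Build send_message() kwargs for one CDC event on a FIFO queue."""
    # sort_keys makes the body (and therefore the hash) independent of dict order
    body = json.dumps(cdc_event, sort_keys=True)
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        # All events for the same row share a group, so they stay in order
        "MessageGroupId": f"{cdc_event['table']}#{cdc_event['pk']}",
        # Identical events sent within the deduplication window are dropped
        "MessageDeduplicationId": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
```

Usage would look like `sqs.send_message(**build_fifo_message(fifo_queue_url, event))`. Grouping by row rather than by table keeps ordering where it matters while letting unrelated rows be processed in parallel.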
A Dead Letter Queue is just another SQS queue that receives messages that have been delivered too many times without being successfully processed. You configure a maxReceiveCount in the redrive policy of your main queue; once a message has been received more than that many times without being deleted, SQS automatically moves it to the DLQ. This prevents a "poison message" (a malformed record that keeps crashing your consumer) from blocking all other processing.
import boto3, json
sqs = boto3.client("sqs", region_name="us-east-1")
# ① Create the Dead Letter Queue first
dlq_resp = sqs.create_queue(
QueueName="pipeline-events-dlq",
Attributes={
"MessageRetentionPeriod": "1209600" # 14 days — max retention for DLQ
}
)
dlq_url = dlq_resp["QueueUrl"]
# Get the DLQ ARN (needed for redrive policy)
dlq_arn = sqs.get_queue_attributes(
QueueUrl=dlq_url,
AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]
# ② Create the main queue with a redrive policy pointing to the DLQ
main_resp = sqs.create_queue(
QueueName="pipeline-events",
Attributes={
"VisibilityTimeout": "300", # 5 min — must be > job runtime
"MessageRetentionPeriod":"86400", # 1 day
"ReceiveMessageWaitTimeSeconds": "20", # long polling (cost-saving)
# After 3 failed attempts, move message to DLQ
"RedrivePolicy": json.dumps({
"deadLetterTargetArn": dlq_arn,
"maxReceiveCount": 3
})
}
)
main_queue_url = main_resp["QueueUrl"]
print(f"Main queue: {main_queue_url}")
print(f"DLQ: {dlq_url}")
When a consumer receives a message, SQS makes that message invisible to all other consumers for the duration of the visibility timeout. The consumer has until the timeout expires to finish processing and delete the message. If it fails to delete within the timeout window, SQS makes the message visible again so another consumer (or the same one on retry) can pick it up. This is the core mechanism behind SQS's at-least-once delivery guarantee.
import boto3, threading
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events"
# Receive a message (long polling)
resp = sqs.receive_message(
QueueUrl=QUEUE_URL,
MaxNumberOfMessages=1,
WaitTimeSeconds=20, # long poll — wait up to 20s for a message
VisibilityTimeout=300 # initially invisible for 5 minutes
)
messages = resp.get("Messages", [])
if not messages:
    print("No messages available")
else:
    msg = messages[0]
    handle = msg["ReceiptHandle"]  # needed to delete or extend visibility
    body = msg["Body"]
    # ── Heartbeat: extend visibility timeout every 4 minutes ───────────
    # Useful for jobs that might run longer than the initial timeout
    def extend_visibility(queue_url, receipt_handle, stop_event):
        while not stop_event.is_set():
            stop_event.wait(240)  # wait 4 minutes
            if not stop_event.is_set():
                sqs.change_message_visibility(
                    QueueUrl=queue_url,
                    ReceiptHandle=receipt_handle,
                    VisibilityTimeout=300  # reset to 5 more minutes
                )
                print("Visibility timeout extended by 5 minutes")
    stop_event = threading.Event()
    heartbeat = threading.Thread(
        target=extend_visibility,
        args=(QUEUE_URL, handle, stop_event),
        daemon=True
    )
    heartbeat.start()
    try:
        # ── Process the message (e.g. start a Glue job) ────────────────
        print(f"Processing: {body}")
        # ... do actual work here ...
        # ── On success: DELETE the message from the queue ──────────────
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=handle)
        print("✅ Message processed and deleted")
    except Exception as e:
        print(f"❌ Processing failed: {e}; message will become visible again")
        # Do NOT delete: let SQS re-deliver (up to maxReceiveCount times)
    finally:
        stop_event.set()  # stop the heartbeat thread
SQS stores messages for a configurable retention period — from 1 minute to 14 days (default is 4 days). If a message is not consumed and deleted within the retention period, SQS permanently deletes it. For your DLQ, set the maximum 14-day retention so your ops team has ample time to investigate failures and replay messages. For the main queue, 1–4 days is usually enough.
Short polling (default) immediately returns — even if the queue is empty — and you are charged for the API call. Long polling (WaitTimeSeconds=20) waits up to 20 seconds for a message to arrive before returning an empty response. Long polling dramatically reduces costs (fewer empty API calls) and reduces latency for message consumers. Always use long polling in production — set WaitTimeSeconds to 20 at the queue level or on each receive_message() call.
import boto3
sqs = boto3.client("sqs")
# Enable long polling on an existing queue
sqs.set_queue_attributes(
QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events",
Attributes={
"ReceiveMessageWaitTimeSeconds": "20" # 20 seconds = maximum long poll
}
)
# Now every receive_message() call on this queue will automatically long-poll
# (up to 20 seconds) without needing to set WaitTimeSeconds each time
# Check current queue settings
attrs = sqs.get_queue_attributes(
QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events",
AttributeNames=["All"]
)["Attributes"]
print(f"VisibilityTimeout: {attrs['VisibilityTimeout']}s")
print(f"MessageRetentionPeriod: {attrs['MessageRetentionPeriod']}s")
print(f"WaitTimeSeconds (polling): {attrs['ReceiveMessageWaitTimeSeconds']}s")
print(f"ApproxMessages in queue: {attrs['ApproximateNumberOfMessages']}")
print(f"ApproxMessages in-flight: {attrs['ApproximateNumberOfMessagesNotVisible']}")
Message attributes are key-value metadata attached to a message — separate from the message body. They let consumers route or filter messages without parsing the body. For example, you can attach source_table=orders and environment=prod as attributes, and consumers can check these before deciding whether to process the message. SQS supports up to 10 message attributes per message, with String, Number, and Binary types.
import boto3, json
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events"
sqs.send_message(
QueueUrl= QUEUE_URL,
MessageBody= json.dumps({
"s3_bucket": "company-raw",
"s3_key": "uploads/orders/2024-01-15/orders.csv",
"file_size": 15728640 # 15 MB
}),
MessageAttributes={
"source_table": {
"DataType": "String",
"StringValue": "orders"
},
"environment": {
"DataType": "String",
"StringValue": "prod"
},
"file_count": {
"DataType": "Number",
"StringValue": "1"
}
}
)
# Receive with attributes
resp = sqs.receive_message(
QueueUrl= QUEUE_URL,
MaxNumberOfMessages= 1,
WaitTimeSeconds= 20,
MessageAttributeNames= ["All"] # request all attributes
)
for msg in resp.get("Messages", []):
    env = msg.get("MessageAttributes", {}).get("environment", {}).get("StringValue")
    tbl = msg.get("MessageAttributes", {}).get("source_table", {}).get("StringValue")
    print(f"env={env} table={tbl} body={msg['Body']}")
The core SQS consumer pattern is always: receive → process → delete on success only. Never delete before processing. Never skip the delete on success. This three-step pattern, combined with the visibility timeout, is what gives SQS its reliability guarantee — if your process crashes between receive and delete, SQS automatically re-delivers the message after the timeout expires.
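This is the production-grade pattern for a long-running Python worker (for example on EC2, or any process that can run a loop) that continuously polls SQS and starts a Glue job for each message. Lambda does not need this loop: with an SQS event source mapping, the Lambda service polls the queue and invokes your handler with batches of messages.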
import boto3, json, logging, time
from botocore.exceptions import ClientError
logger = logging.getLogger(__name__)
sqs = boto3.client("sqs")
glue = boto3.client("glue")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events"
def process_message(body: dict) -> None:
    """Trigger a Glue job for the S3 file described in the SQS message."""
    s3_key = body["s3_key"]
    table = body["source_table"]
    logger.info(f"Starting Glue job for {table} file: {s3_key}")
    # start_job_run() is asynchronous: it returns as soon as the run is submitted
    glue.start_job_run(
        JobName=f"ingest-{table}",
        Arguments={
            "--s3_key": s3_key,
            "--table_name": table
        }
    )
def consumer_loop(max_iterations: int = None):
    """Main polling loop: runs indefinitely in a worker process."""
    iteration = 0
    while True:
        if max_iterations and iteration >= max_iterations:
            break
        try:
            # ① RECEIVE: long poll, up to 10 messages at once
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL,
                MaxNumberOfMessages=10,   # batch up to 10
                WaitTimeSeconds=20,       # long poll: ALWAYS use this
                VisibilityTimeout=600,    # 10 min: must cover handling the whole batch
                MessageAttributeNames=["All"]
            )
            messages = resp.get("Messages", [])
            if not messages:
                logger.debug("Queue empty, polling again")
                iteration += 1
                continue
            for msg in messages:
                receipt_handle = msg["ReceiptHandle"]
                try:
                    # ② PROCESS
                    body = json.loads(msg["Body"])
                    process_message(body)
                    # ③ DELETE: only on success
                    sqs.delete_message(
                        QueueUrl=QUEUE_URL,
                        ReceiptHandle=receipt_handle
                    )
                    logger.info(f"✅ Message processed and deleted: {body.get('s3_key')}")
                except ClientError as e:
                    code = e.response["Error"]["Code"]
                    logger.error(f"❌ AWS error processing message: {code}; will retry")
                    # Do NOT delete: visibility timeout expires → SQS re-delivers
                except Exception as e:
                    logger.error(f"❌ Unexpected error: {e}; will retry (up to maxReceiveCount times)")
                    # Do NOT delete: let SQS handle retry → eventually goes to DLQ
        except ClientError as e:
            logger.error(f"SQS receive failed: {e}; sleeping 30s")
            time.sleep(30)
        iteration += 1
if __name__ == "__main__":
    consumer_loop()
When processing messages in batches of 10, use delete_message_batch() to delete all successfully processed messages in a single API call instead of 10 separate calls. This reduces API costs and latency. Always check Failed in the response — partial batch failures are possible.
import boto3, json
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events"
# Receive up to 10 messages
resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
messages = resp.get("Messages", [])
successful_handles = [] # collect receipt handles of successfully processed messages
for i, msg in enumerate(messages):
    try:
        body = json.loads(msg["Body"])
        # process body...
        successful_handles.append({
            "Id": str(i),  # unique ID within this batch
            "ReceiptHandle": msg["ReceiptHandle"]
        })
    except Exception as e:
        print(f"Failed to process message {i}: {e}; leaving in queue for retry")
# Batch delete all successfully processed messages in ONE API call
if successful_handles:
    delete_resp = sqs.delete_message_batch(
        QueueUrl=QUEUE_URL,
        Entries=successful_handles
    )
    if delete_resp.get("Failed"):
        print(f"⚠️ Some deletes failed: {delete_resp['Failed']}")
    else:
        print(f"✅ Batch deleted {len(successful_handles)} messages")
The most common data engineering SQS pattern: when a file lands in S3, an S3 event notification sends a message to SQS. A Lambda function polls the queue, validates the file, and starts a Glue ETL job. The queue acts as a buffer — if Lambda is throttled or Glue is at its concurrent job limit, messages wait safely in SQS instead of being lost.
import boto3, json
s3 = boto3.client("s3")
sqs = boto3.client("sqs")
BUCKET = "company-raw"
QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:file-arrival-queue"
# ① Attach a resource policy to SQS allowing S3 to send messages to it
sqs.set_queue_attributes(
QueueUrl= "https://sqs.us-east-1.amazonaws.com/123456789012/file-arrival-queue",
Attributes={
"Policy": json.dumps({
"Version": "2012-10-17",
"Statement": [{
"Sid": "AllowS3SendMessage",
"Effect": "Allow",
"Principal": {"Service": "s3.amazonaws.com"},
"Action": "SQS:SendMessage",
"Resource": QUEUE_ARN,
"Condition": {"ArnLike": {"aws:SourceArn": f"arn:aws:s3:::{BUCKET}"}}
}]
})
}
)
# ② Configure S3 to send notifications to SQS on object creation
s3.put_bucket_notification_configuration(
Bucket= BUCKET,
NotificationConfiguration={
"QueueConfigurations": [{
"QueueArn": QUEUE_ARN,
"Events": ["s3:ObjectCreated:*"],
"Filter": {
"Key": {"FilterRules": [
{"Name": "prefix", "Value": "orders/"}, # only orders/ prefix
{"Name": "suffix", "Value": ".csv"} # only .csv files
]}
}
}]
}
)
print("S3 → SQS notification configured")
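On the consuming side, each SQS message body carries the S3 event notification as JSON. The sketch below shows one way to pull the bucket and key out of it, assuming the standard S3 notification layout (`Records[].s3.bucket.name` and `Records[].s3.object.key`). The helper name `extract_s3_objects` is hypothetical. Object keys arrive URL-encoded, so they are decoded before use, and the `s3:TestEvent` message that S3 sends when the notification is first configured is skipped.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(sqs_body: str) -> list:
    """Return (bucket, key, size) tuples for created objects in one SQS body."""
    event = json.loads(sqs_body)
    # S3 sends a one-off test message when the notification is configured
    if event.get("Event") == "s3:TestEvent":
        return []
    objects = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Keys are URL-encoded in the notification (spaces arrive as '+')
        key = unquote_plus(s3_info["object"]["key"])
        size = s3_info["object"].get("size", 0)
        objects.append((bucket, key, size))
    return objects
```

A consumer would call this on `msg["Body"]` and pass the resulting bucket and key to the Glue job arguments.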
SQS decouples fast producers from slow consumers. If your Bronze ETL processes 100 tables per hour but the Silver transformation can only handle 20 tables per hour, an SQS queue between them absorbs the difference. Bronze finishes and enqueues all 100 completion events; Silver processes them at its own pace over 5 hours without dropping any. This is the classic producer-consumer decoupling pattern.
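The arithmetic behind the 5-hour figure is just backlog divided by net drain rate. A small sketch, with the hypothetical helper `hours_to_drain`, makes the assumption explicit: the producer enqueues one burst and then stops, so the arrival rate while draining is zero.

```python
def hours_to_drain(backlog: int, arrival_per_hour: float, service_per_hour: float) -> float:
    """Hours until the queue is empty, given a backlog and steady rates."""
    net_rate = service_per_hour - arrival_per_hour
    if net_rate <= 0:
        # The consumer never catches up: the queue grows without bound
        raise ValueError("consumer is not faster than producer; queue will not drain")
    return backlog / net_rate


# 100 completion events enqueued at once, Silver handles 20 per hour
print(hours_to_drain(backlog=100, arrival_per_hour=0, service_per_hour=20))  # 5.0
```

If Bronze kept producing 100 events every hour, the net rate would be negative and the queue would never drain, which is the signal to scale the consumer rather than rely on the buffer.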
A DLQ is only useful if you monitor it and act on messages in it. Set up a CloudWatch alarm on the ApproximateNumberOfMessagesVisible metric of your DLQ — any message in the DLQ means a pipeline step failed. After fixing the root cause, use the Start DLQ Redrive feature to move messages back to the main queue for reprocessing.
import boto3
sqs = boto3.client("sqs")
cw = boto3.client("cloudwatch")
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events-dlq"
SNS_ARN = "arn:aws:sns:us-east-1:123456789012:data-eng-alerts"
# Check DLQ depth right now
attrs = sqs.get_queue_attributes(
QueueUrl= DLQ_URL,
AttributeNames= ["ApproximateNumberOfMessages",
"ApproximateNumberOfMessagesNotVisible"]
)["Attributes"]
dlq_depth = int(attrs["ApproximateNumberOfMessages"])
in_flight = int(attrs["ApproximateNumberOfMessagesNotVisible"])
print(f"DLQ depth: {dlq_depth} visible, {in_flight} in-flight")
# Create CloudWatch alarm: alert if DLQ has ANY messages
cw.put_metric_alarm(
AlarmName= "pipeline-events-DLQ-not-empty",
AlarmDescription= "Messages in DLQ — pipeline processing failures detected",
MetricName= "ApproximateNumberOfMessagesVisible",
Namespace= "AWS/SQS",
Dimensions=[{"Name": "QueueName", "Value": "pipeline-events-dlq"}],
Statistic= "Sum",
Period= 60, # 1-minute evaluation
EvaluationPeriods= 1,
Threshold= 0, # alarm if ANY message appears
ComparisonOperator= "GreaterThanThreshold",
AlarmActions= [SNS_ARN],
TreatMissingData= "notBreaching"
)
print("DLQ alarm created — will alert on first failure")
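The redrive step can also be started from code through the SQS `start_message_move_task()` API. The sketch below is hedged: the wrapper name `redrive_dlq` is hypothetical, and it takes an existing `boto3.client("sqs")` as an argument. When no destination is given, SQS moves the messages back to the queue they originally came from.

```python
def redrive_dlq(sqs_client, dlq_arn, destination_arn=None, max_per_second=None):
    """Start moving messages out of a DLQ and return the task handle.

    sqs_client is assumed to be boto3.client("sqs") or a compatible object.
    """
    params = {"SourceArn": dlq_arn}
    if destination_arn:
        # Optional: send messages to a specific queue instead of their source
        params["DestinationArn"] = destination_arn
    if max_per_second:
        # Optional: throttle the move so the consumer is not flooded
        params["MaxNumberOfMessagesPerSecond"] = max_per_second
    response = sqs_client.start_message_move_task(**params)
    return response["TaskHandle"]
```

Run this only after the root cause is fixed; otherwise the same messages will fail again and land back in the DLQ.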
When you need to enqueue many messages at once (e.g. one message per table at the start of a metadata-driven pipeline run), use send_message_batch() to send up to 10 messages per API call instead of 10 separate calls. Always check Failed in the response — retry any failed entries.
import boto3, json
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events"
# Build one message per table (simulating 25 tables)
tables = ["orders", "customers", "products", "returns", "inventory",
          # ... more tables
          ]
def send_in_batches(queue_url: str, table_list: list):
    """Send one SQS message per table, batched in groups of 10."""
    failed_tables = []
    for i in range(0, len(table_list), 10):
        batch = table_list[i:i+10]
        entries = [
            {
                "Id": str(j),
                "MessageBody": json.dumps({
                    "table_name": t,
                    "run_date": "2024-01-15",
                    "pipeline": "daily-incremental"
                })
            }
            for j, t in enumerate(batch)
        ]
        resp = sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
        if resp.get("Failed"):
            # Map each failed entry ID back to the message body that was sent
            failed = [entries[int(f["Id"])]["MessageBody"] for f in resp["Failed"]]
            failed_tables.extend(failed)
            print(f"⚠️ {len(resp['Failed'])} sends failed in this batch")
    print(f"✅ Enqueued {len(table_list) - len(failed_tables)} tables")
    return failed_tables
failed = send_in_batches(QUEUE_URL, tables)
if failed:
    print(f"Retry these: {failed}")
| Setting | Recommended Value | Why |
|---|---|---|
| VisibilityTimeout | 1.5× job runtime | Prevents re-delivery while processing is still running |
| WaitTimeSeconds | 20 (long polling) | Reduces empty polls → lower cost, lower latency |
| MessageRetentionPeriod (main) | 86400–345600 (1–4 days) | Enough buffer for delayed consumers |
| MessageRetentionPeriod (DLQ) | 1209600 (14 days) | Maximum time to investigate failures |
| maxReceiveCount (redrive) | 3–5 | Retry transient failures; avoid infinite loops |
| MaxNumberOfMessages | 10 | Always batch receive for cost efficiency |
Use change_message_visibility() to extend the timeout for very long-running jobs.
Amazon SNS — Simple Notification Service
SNS is AWS's managed pub/sub messaging service. You publish one message to a topic and SNS fans it out to every subscriber simultaneously — email inboxes, SQS queues, Lambda functions, HTTPS endpoints, SMS, and more. For Data Engineers SNS is the standard way to alert on pipeline failures, trigger multiple downstream consumers from a single event, and integrate CloudWatch alarms with on-call tooling.
An SNS topic is a named communication channel. Producers publish messages to a topic; SNS immediately delivers copies to every active subscriber. Topics are regional and identified by an ARN: arn:aws:sns:us-east-1:123456789012:pipeline-alerts.
A single topic can have many subscriptions of different types simultaneously. When you publish one message, SNS delivers to all of them in parallel — this is the fan-out pattern.
| Protocol | Endpoint | Data Engineering Use Case |
|---|---|---|
| email | Email address | Alert on-call engineer on pipeline failure |
| sqs | SQS queue ARN | Fan-out to multiple processing queues |
| lambda | Lambda function ARN | Trigger automated remediation on alert |
| https | Webhook URL | Post alert to Slack / PagerDuty / Teams |
| sms | Phone number | Critical on-call SMS for SLA breaches |
Example: the topic prod-pipeline-failures has three subscriptions: (1) SQS queue consumed by a retry Lambda, (2) email to the data engineering team, (3) HTTPS endpoint to PagerDuty. One publish triggers all three simultaneously.
SNS on its own is push-based: it delivers each message as it is published and retries failed deliveries only for a limited time, so a subscriber that stays down can miss messages. By subscribing SQS queues to SNS topics you get the best of both worlds: SNS handles the fan-out, SQS provides durable buffering so downstream consumers can process at their own pace and survive outages.
import boto3, json
sns = boto3.client("sns", region_name="us-east-1")
sqs = boto3.client("sqs", region_name="us-east-1")
# ── 1. Create the SNS topic ───────────────────────────────────────
topic_resp = sns.create_topic(
Name="prod-pipeline-alerts",
Attributes={
"DisplayName": "Prod Pipeline Alerts"
},
Tags=[{"Key": "Env", "Value": "prod"}]
)
topic_arn = topic_resp["TopicArn"]
print(f"Topic ARN: {topic_arn}")
# ── 2. Subscribe an SQS queue to the topic ───────────────────────
queue_url = sqs.get_queue_url(QueueName="pipeline-retry-queue")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
QueueUrl=queue_url,
AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]
sub_resp = sns.subscribe(
TopicArn=topic_arn,
Protocol="sqs",
Endpoint=queue_arn,
Attributes={
"RawMessageDelivery": "true" # skip the SNS JSON wrapper
}
)
print(f"SQS subscription ARN: {sub_resp['SubscriptionArn']}")
# ── 3. Subscribe an email address ─────────────────────────────────
# Note: email subscriptions require manual confirmation via inbox link
sns.subscribe(
TopicArn=topic_arn,
Protocol="email",
Endpoint="data-engineering@company.com"
)
print("Email subscription created — confirm via inbox link")
# ── 4. Update SQS queue policy to allow SNS to send ──────────────
# (Without this, SNS cannot write to the SQS queue)
policy = {
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {"Service": "sns.amazonaws.com"},
"Action": "sqs:SendMessage",
"Resource": queue_arn,
"Condition": {
"ArnEquals": {"aws:SourceArn": topic_arn}
}
}]
}
sqs.set_queue_attributes(
QueueUrl=queue_url,
Attributes={"Policy": json.dumps(policy)}
)
print("SQS policy updated — SNS can now deliver messages")
Without the sns.amazonaws.com allow policy on the SQS queue, messages from SNS are silently dropped, with no error on the publish side.
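The RawMessageDelivery setting changes what the SQS consumer sees. With raw delivery on, the SQS body is exactly what was published. With it off, the body is an SNS JSON envelope whose `Message` field holds the published text. A consumer that may meet both shapes can unwrap defensively, as in this sketch (the helper name `unwrap_sns_message` is hypothetical, and it assumes publishers send JSON payloads):

```python
import json


def unwrap_sns_message(sqs_body: str) -> dict:
    """Return the published payload whether or not SNS wrapped it."""
    outer = json.loads(sqs_body)
    # The SNS envelope is a JSON object with Type == "Notification"
    if isinstance(outer, dict) and outer.get("Type") == "Notification" and "Message" in outer:
        # The original payload is carried as a JSON string inside "Message"
        return json.loads(outer["Message"])
    # Raw delivery: the body already is the payload
    return outer
```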
Every production pipeline should publish to an SNS topic on failure. This is far better than hardcoding email logic in your Spark code because SNS decouples the pipeline from the notification mechanism — you can add Slack, PagerDuty, or SMS alerts later without touching pipeline code.
import boto3, json, traceback
from datetime import datetime, timezone
sns = boto3.client("sns")
ALERT_ARN = "arn:aws:sns:us-east-1:123456789012:prod-pipeline-alerts"
def publish_failure(pipeline_name: str, run_id: str, error: Exception) -> None:
    """Publish a structured failure event to SNS."""
    payload = {
        "status": "FAILED",
        "pipeline": pipeline_name,
        "run_id": run_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "error_type": type(error).__name__,
        "error_message": str(error)[:500],  # truncate long messages
        "stack_trace": traceback.format_exc()[:1000]
    }
    sns.publish(
        TopicArn=ALERT_ARN,
        # Subject must be ASCII text and shorter than 100 characters (no emoji)
        Subject=f"Pipeline FAILED: {pipeline_name}"[:99],
        Message=json.dumps(payload, indent=2),
        MessageAttributes={
            "pipeline_name": {
                "DataType": "String",
                "StringValue": pipeline_name
            },
            "severity": {
                "DataType": "String",
                "StringValue": "HIGH"
            }
        }
    )
# ── Wrap your entire pipeline in try/except ───────────────────────
import uuid
run_id = str(uuid.uuid4())
pipeline_name = "silver-orders-etl"
try:
    # ... your actual Spark / Glue / EMR job logic here ...
    print("Running pipeline...")
    # simulate an error:
    raise ValueError("Source table orders has 0 rows: possible upstream failure")
except Exception as e:
    publish_failure(pipeline_name, run_id, e)
    raise  # re-raise so Glue/EMR marks the job as FAILED
After calling publish_failure(), always re-raise the exception. If you swallow it, Glue/EMR marks the job as SUCCEEDED even though it failed, so CloudWatch alarms won't fire and the on-call engineer won't be paged.
The most common production pattern is: CloudWatch detects a metric breach → fires an alarm → alarm sends to SNS topic → SNS fans out to email + PagerDuty. You don't even need to write any Python for this path — it's pure AWS configuration.
import boto3
cw = boto3.client("cloudwatch")
ALERT_ARN = "arn:aws:sns:us-east-1:123456789012:prod-pipeline-alerts"
# Fire SNS alert if the Glue job runs longer than 90 minutes
cw.put_metric_alarm(
AlarmName="GlueJobDurationBreached-silver-orders",
AlarmDescription="silver-orders-etl exceeded 90 min SLA",
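    # Assumption: Glue job metrics are enabled for this job. Confirm the exact
    # metric name, dimensions, and statistic in the CloudWatch console first.
    MetricName="glue.driver.aggregate.elapsedTime",
    Namespace="Glue",
    Dimensions=[
        {"Name": "JobName", "Value": "silver-orders-etl"},
        {"Name": "JobRunId", "Value": "ALL"},
        {"Name": "Type", "Value": "count"}
    ],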
Statistic="Maximum",
Period=300, # check every 5 minutes
EvaluationPeriods=1,
Threshold=5400000, # 90 min in milliseconds
ComparisonOperator="GreaterThanThreshold",
AlarmActions=[ALERT_ARN], # SNS topic to notify
OKActions=[ALERT_ARN], # also notify when it recovers
TreatMissingData="notBreaching"
)
print("CloudWatch alarm wired to SNS")
When a pipeline finishes successfully, you often need to notify several downstream systems: update a data catalog, trigger a downstream transformation job, send a Slack message to the business team, and update a dashboard. Rather than calling each from your pipeline code, publish one success event to SNS and let each subscriber handle its own action independently.
import boto3, json
from datetime import datetime, timezone
sns = boto3.client("sns")
SUCCESS_ARN = "arn:aws:sns:us-east-1:123456789012:pipeline-success-events"
def publish_success(pipeline_name, run_id, rows_written, s3_path, duration_sec):
    payload = {
        "status": "SUCCEEDED",
        "pipeline": pipeline_name,
        "run_id": run_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "rows_written": rows_written,
        "output_s3_path": s3_path,
        "duration_sec": duration_sec
    }
    sns.publish(
        TopicArn=SUCCESS_ARN,
        # Subject must be ASCII text and shorter than 100 characters (no emoji)
        Subject=f"Pipeline SUCCEEDED: {pipeline_name}"[:99],
        Message=json.dumps(payload, indent=2),
        MessageAttributes={
            "pipeline_name": {
                "DataType": "String",
                "StringValue": pipeline_name
            },
            "status": {
                "DataType": "String",
                "StringValue": "SUCCEEDED"
            }
        }
    )
# Call at the end of your job
import uuid
publish_success(
    pipeline_name="silver-orders-etl",
    run_id=str(uuid.uuid4()),  # or reuse the run_id generated at job start
    rows_written=4_823_441,
    s3_path="s3://data-lake/silver/orders/year=2024/month=06/day=15/",
    duration_sec=312
)
Subscriber 2 — Email → business BI team: "Silver orders table updated, 4.8M rows"
Subscriber 3 — HTTPS → Slack webhook → posts to #data-engineering channel
All triggered by the single sns.publish() call above.
By default, every subscriber on a topic receives every message. Filter policies let you attach a JSON rule to a subscription so that subscriber only receives messages whose MessageAttributes match the rule. This lets you have one topic serve many use cases without each subscriber processing irrelevant messages.
import boto3, json
sns = boto3.client("sns")
# Existing subscription ARN (from subscribe() call)
subscription_arn = "arn:aws:sns:us-east-1:123456789012:prod-pipeline-alerts:abc123"
# ── Filter: only deliver messages where severity = "HIGH" or "CRITICAL"
# AND pipeline_name starts with "silver-"
filter_policy = {
"severity": ["HIGH", "CRITICAL"],
"pipeline_name": [{"prefix": "silver-"}]
}
sns.set_subscription_attributes(
SubscriptionArn=subscription_arn,
AttributeName="FilterPolicy",
AttributeValue=json.dumps(filter_policy)
)
print("Filter policy applied — subscriber now only gets HIGH/CRITICAL silver-* alerts")
# ── To remove the filter (receive all messages again) ────────────
sns.set_subscription_attributes(
SubscriptionArn=subscription_arn,
AttributeName="FilterPolicy",
AttributeValue="{}" # an empty JSON object clears the filter
)
Filter policies are configured per subscription with set_subscription_attributes(). The publisher doesn't change anything: it just populates MessageAttributes on the publish() call, and each subscription's filter decides whether to accept or drop the message.
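Before attaching a policy, it can help to check locally which messages it would let through. The function below is a deliberately simplified stand-in for the SNS matching logic, not the real implementation: it supports only exact string values and the `{"prefix": ...}` operator, and it treats attributes as a plain name-to-string mapping. The name `matches_filter` is hypothetical.

```python
def matches_filter(policy: dict, attributes: dict) -> bool:
    """Simplified check: would a message with these attributes pass the policy?

    Every policy key must match (AND); any listed value may match (OR).
    Only exact strings and {"prefix": "..."} rules are handled here.
    """
    for name, allowed in policy.items():
        value = attributes.get(name)
        if value is None:
            return False  # attribute missing from the message
        matched = False
        for rule in allowed:
            if isinstance(rule, str) and value == rule:
                matched = True
            elif isinstance(rule, dict) and "prefix" in rule and value.startswith(rule["prefix"]):
                matched = True
        if not matched:
            return False
    return True
```

With the policy shown above, a message with severity HIGH and pipeline_name silver-orders-etl passes, while the same message from bronze-orders-etl is dropped.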
import boto3
sns = boto3.client("sns", region_name="us-east-1")
# ── CREATE ────────────────────────────────────────────────────────
resp = sns.create_topic(
Name="prod-pipeline-alerts",
Attributes={"DisplayName": "Prod Pipeline Alerts"},
Tags=[{"Key": "Env", "Value": "prod"}]
)
topic_arn = resp["TopicArn"] # idempotent — same ARN if topic already exists
# ── LIST (use paginator) ──────────────────────────────────────────
paginator = sns.get_paginator("list_topics")
for page in paginator.paginate():
    for topic in page["Topics"]:
        print(topic["TopicArn"])
# ── DELETE ────────────────────────────────────────────────────────
# Shown here for reference only: run it last, because the subscribe and
# publish calls below still need the topic to exist.
sns.delete_topic(TopicArn=topic_arn)
print("Topic deleted")
# ── SUBSCRIBE ─────────────────────────────────────────────────────
# SQS
sub = sns.subscribe(
TopicArn=topic_arn,
Protocol="sqs",
Endpoint="arn:aws:sqs:us-east-1:123456789012:retry-queue",
Attributes={"RawMessageDelivery": "true"},
ReturnSubscriptionArn=True # get ARN immediately (no email confirmation needed)
)
sub_arn = sub["SubscriptionArn"]
# Lambda
sns.subscribe(
TopicArn=topic_arn,
Protocol="lambda",
Endpoint="arn:aws:lambda:us-east-1:123456789012:function:failure-handler"
)
# HTTPS (Slack webhook via proxy)
sns.subscribe(
TopicArn=topic_arn,
Protocol="https",
Endpoint="https://hooks.slack.com/services/T.../B.../..."
)
# ── LIST subscriptions for a topic ───────────────────────────────
paginator = sns.get_paginator("list_subscriptions_by_topic")
for page in paginator.paginate(TopicArn=topic_arn):
    for s in page["Subscriptions"]:
        print(f"{s['Protocol']:10} → {s['Endpoint']}")
# ── UNSUBSCRIBE ───────────────────────────────────────────────────
sns.unsubscribe(SubscriptionArn=sub_arn)
import uuid
# ── SINGLE PUBLISH ────────────────────────────────────────────────
sns.publish(
TopicArn=topic_arn,
Subject="Pipeline FAILED: silver-orders-etl", # email subject line (ASCII only, under 100 characters)
Message='{"status":"FAILED","pipeline":"silver-orders-etl","rows":0}',
MessageAttributes={
"severity": {
"DataType": "String",
"StringValue": "HIGH"
},
"pipeline_name": {
"DataType": "String",
"StringValue": "silver-orders-etl"
}
}
)
# ── BATCH PUBLISH (up to 10 messages per call) ───────────────────
# Useful when you need to notify multiple pipeline failures at once
entries = [
{
"Id": str(i), # unique ID within batch
"Message": f'Pipeline {i} failed',
"Subject": f'Alert: Pipeline {i}',
"MessageAttributes": {
"severity": {"DataType": "String", "StringValue": "HIGH"}
}
}
for i in range(3)
]
batch_resp = sns.publish_batch(
TopicArn=topic_arn,
PublishBatchRequestEntries=entries
)
print(f"Successful: {len(batch_resp['Successful'])}")
print(f"Failed: {len(batch_resp['Failed'])}")
# ── Handle partial failures in batch ─────────────────────────────
for failure in batch_resp.get("Failed", []):
    print(f"Failed ID {failure['Id']}: {failure['Code']}: {failure['Message']}")
A single publish() call delivers to every subscriber simultaneously: email, SQS, Lambda, HTTPS, SMS. The critical production pattern is SNS → SQS fan-out: SNS provides the fan-out, SQS provides durable buffering so subscribers don't miss messages if they're temporarily down. Always update the SQS queue policy to allow sns.amazonaws.com to write. Use MessageAttributes + filter policies to route different message types to different subscribers on the same topic. In every pipeline, wrap job logic in try/except, call publish() on failure, and always re-raise so the job is marked FAILED in Glue/EMR.
AWS Lambda — Serverless Functions for Data Pipelines
Lambda lets you run Python code without managing any servers. You upload a function, configure a trigger, and AWS runs it in milliseconds whenever the trigger fires — you pay only for the milliseconds of execution time. For Data Engineers, Lambda is the glue between pipeline stages: it reacts to file arrivals, triggers Glue jobs, updates metadata, sends alerts, and handles lightweight transformations.
Every Lambda function has a handler — a Python function that AWS calls when the trigger fires. It receives two arguments: event (the input data from the trigger) and context (runtime metadata like the function name, remaining time, and request ID). The handler's return value becomes the response for synchronous invocations.
import json, logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def lambda_handler(event: dict, context) -> dict:
    """
    event:   dict containing the trigger payload (S3 event, SQS message, etc.)
    context: LambdaContext object with runtime metadata
    """
    # ── Context object useful fields ──────────────────────────────
    logger.info(f"Function name: {context.function_name}")
    logger.info(f"Function version: {context.function_version}")
    logger.info(f"Request ID: {context.aws_request_id}")
    logger.info(f"Memory (MB): {context.memory_limit_in_mb}")
    logger.info(f"Remaining ms: {context.get_remaining_time_in_millis()}")
    # ── Log the incoming event ────────────────────────────────────
    logger.info(f"Event received: {json.dumps(event, default=str)}")
    # ── Your business logic here ──────────────────────────────────
    result = {"status": "ok", "processed": True}
    return {
        "statusCode": 200,
        "body": json.dumps(result)
    }
Lambda allocates CPU in proportion to configured memory, so raising memory also raises available CPU (a function gets roughly one full vCPU at 1,769 MB and up to six at 10,240 MB). For data engineering tasks (parsing files, calling boto3 APIs, triggering jobs), 256–512 MB is usually sufficient. Timeout can be up to 15 minutes: long enough to poll a Glue job status a few times, but not for heavy Spark work.
| Setting | Range | DE Recommendation |
|---|---|---|
| Memory | 128 MB – 10 GB | 256–512 MB for orchestration; 1–3 GB for in-memory data processing |
| Timeout | 1 sec – 15 min | 30 sec for simple triggers; 5–10 min if polling an async job status |
| Runtime | Python 3.9 / 3.10 / 3.11 / 3.12 | Prefer the newest supported Python runtime for new functions (3.12 among those listed); older runtimes are deprecated on a schedule |
| Ephemeral Storage | 512 MB – 10 GB | /tmp directory; increase if you need to stage files before uploading to S3 |
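The timeout guidance in the table matters most when a handler polls an asynchronous job. The following is a minimal sketch, not a definitive implementation: poll_until_done is a hypothetical helper, get_state is a caller-supplied callable (for example a thin wrapper around glue.get_job_run), and both the terminal-state set and the 10-second safety margin are assumptions. It stops polling while time remains instead of letting Lambda kill the invocation mid-loop.

```python
import time

# Assumed set of terminal Glue job run states; adjust for the service you poll
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}

def poll_until_done(get_state, context, interval_sec=15.0,
                    safety_margin_ms=10_000, sleep=time.sleep):
    """Poll get_state() until a terminal state is seen or time is nearly up.

    Returns the terminal state, or "STILL_RUNNING" if the Lambda would not
    have enough time left for another polling interval.
    """
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        remaining_ms = context.get_remaining_time_in_millis()
        # Stop early if one more sleep would eat into the safety margin
        if remaining_ms - interval_sec * 1000 < safety_margin_ms:
            return "STILL_RUNNING"
        sleep(interval_sec)
```

When the helper returns "STILL_RUNNING", the handler can hand the job ID to a later invocation (for example through an SQS message) rather than failing on timeout.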
A Layer is a ZIP archive of Python packages that Lambda mounts at /opt/python before your function runs. Multiple functions can share the same layer. This keeps your deployment package small and lets you manage library versions centrally — e.g. a single boto3-latest layer shared across all 40 pipeline Lambda functions.
# ── 1. Install packages into a folder ────────────────────────────
mkdir -p layer/python
pip install pandas pyarrow tenacity -t layer/python/
# ── 2. Zip the layer ──────────────────────────────────────────────
cd layer && zip -r ../my-de-layer.zip python/
cd ..
# ── 3. Publish the layer to AWS ───────────────────────────────────
aws lambda publish-layer-version \
--layer-name my-de-layer \
--description "pandas + pyarrow + tenacity for DE lambdas" \
--zip-file fileb://my-de-layer.zip \
--compatible-runtimes python3.12
# ── 4. Attach layer to a function ─────────────────────────────────
aws lambda update-function-configuration \
--function-name file-arrival-handler \
--layers arn:aws:lambda:us-east-1:123456789012:layer:my-de-layer:1
The most common data engineering Lambda pattern: a new file lands in S3, S3 fires an event notification, and Lambda immediately processes it — validates the file, kicks off a Glue job, or writes metadata. The event object contains the bucket name, object key, size, and event time.
import boto3, json, logging
from urllib.parse import unquote_plus
logger = logging.getLogger()
logger.setLevel(logging.INFO)
glue = boto3.client("glue")
s3 = boto3.client("s3")
def lambda_handler(event, context):
    # ── Parse S3 event ────────────────────────────────────────────
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"]["size"]
        logger.info(f"New file: s3://{bucket}/{key} ({size} bytes)")
        # ── Validate: reject empty files ──────────────────────────
        if size == 0:
            logger.warning("Empty file, skipping")
            continue
        # ── Validate: check expected prefix ───────────────────────
        if not key.startswith("landing/orders/"):
            logger.info("File not in orders prefix, ignoring")
            continue
        # ── Trigger Glue job with file path as argument ───────────
        run = glue.start_job_run(
            JobName="bronze-orders-ingest",
            Arguments={
                "--source_bucket": bucket,
                "--source_key": key,
                "--file_size": str(size)
            }
        )
        logger.info(f"Started Glue job run: {run['JobRunId']}")
    return {"statusCode": 200, "body": "done"}
S3 event notifications URL-encode the object key: spaces become + and special chars become %XX. Always call unquote_plus(key) before using the key in any boto3 call; otherwise NoSuchKey errors will appear on files with spaces or special characters.
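To make the encoding concrete, here is a small standalone illustration. The file name is hypothetical; the point is only how unquote_plus turns the key as it arrives in the event back into the real object key.

```python
from urllib.parse import unquote_plus

# Key as it would arrive inside an S3 event notification (URL-encoded)
encoded_key = "landing/orders/daily+report+%282024%29.csv"

# Decoded key: the form boto3 calls such as get_object expect
decoded_key = unquote_plus(encoded_key)
print(decoded_key)  # landing/orders/daily report (2024).csv
```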
When Lambda is triggered by SQS, AWS polls the queue for you and delivers batches of messages to the handler. Lambda automatically deletes successfully processed messages from the queue. If the handler raises an exception, the message stays in the queue and becomes visible again after the visibility timeout — naturally enabling retries.
import boto3, json, logging
from botocore.exceptions import ClientError
logger = logging.getLogger()
logger.setLevel(logging.INFO)
glue = boto3.client("glue")
ddb = boto3.resource("dynamodb").Table("pipeline_audit")
def lambda_handler(event, context):
    """Process a batch of SQS messages; each is a pipeline trigger."""
    failed_ids = []
    for record in event["Records"]:
        message_id = record["messageId"]
        try:
            body = json.loads(record["body"])
            logger.info(f"Processing message {message_id}: {body}")
            # ── Trigger Glue job ──────────────────────────────────────
            run = glue.start_job_run(
                JobName=body["job_name"],
                Arguments={"--run_date": body["run_date"]}
            )
            logger.info(f"Glue run started: {run['JobRunId']}")
        except (KeyError, json.JSONDecodeError) as e:
            # Non-retryable: bad message format → report as failed so it ends up in the DLQ
            logger.error(f"Bad message format {message_id}: {e}")
            failed_ids.append({"itemIdentifier": message_id})
        except ClientError as e:
            # Retryable: Glue API error → keep in queue for retry
            logger.error(f"Glue error for {message_id}: {e}")
            failed_ids.append({"itemIdentifier": message_id})
    # ── Return partial failure report ─────────────────────────────
    # Successfully processed records are auto-deleted by Lambda
    # Failed record IDs are kept in queue for retry / DLQ routing
    return {"batchItemFailures": failed_ids}
Return {"batchItemFailures": [{"itemIdentifier": msg_id}]} to tell Lambda which specific messages failed. Lambda honors this response only when ReportBatchItemFailures is enabled in the event source mapping (FunctionResponseTypes). Successfully processed messages are deleted; only failed ones go back to the queue (or to the DLQ after max retries). Without this, any exception causes the entire batch to be retried, including messages you already processed successfully.
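The partial-failure contract is easy to check locally before deploying. Below is a hedged sketch: make_sqs_event and handle_batch are hypothetical names, the event carries only the two fields the handler reads (real SQS events contain more), and the Glue call is replaced by an injected start_job callable so nothing touches AWS.

```python
import json

def make_sqs_event(bodies):
    """Build a minimal SQS-shaped event; real events carry more fields."""
    return {
        "Records": [
            {"messageId": f"msg-{i}", "body": body}
            for i, body in enumerate(bodies)
        ]
    }

def handle_batch(event, start_job):
    """Same reporting shape as the handler above, with the job call injected."""
    failed = []
    for record in event["Records"]:
        try:
            body = json.loads(record["body"])
            start_job(body["job_name"], body["run_date"])
        except Exception:
            # Any failure marks only this message for retry
            failed.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failed}

started = []
event = make_sqs_event([
    json.dumps({"job_name": "daily-silver-etl", "run_date": "2024-06-15"}),
    "not-json",  # malformed body: should be the only reported failure
])
response = handle_batch(event, lambda name, date: started.append((name, date)))
print(response)  # {'batchItemFailures': [{'itemIdentifier': 'msg-1'}]}
```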
EventBridge can trigger Lambda on a cron or rate schedule — replacing traditional cron jobs entirely. The Lambda event contains the scheduled time and rule ARN. Use this for lightweight scheduled tasks: checking watermarks, sending daily summary emails, pruning old S3 files, or triggering a Glue crawler.
import boto3, logging
from datetime import datetime, timezone, timedelta
logger = logging.getLogger()
logger.setLevel(logging.INFO)
glue = boto3.client("glue")
sns = boto3.client("sns")
ALERT_ARN = "arn:aws:sns:us-east-1:123456789012:prod-pipeline-alerts"
def lambda_handler(event, context):
    """Daily 06:00 UTC trigger: start Glue crawler then kick off ETL."""
    logger.info(f"Scheduled trigger fired: {event.get('time', 'unknown')}")
    run_date = (datetime.now(timezone.utc) - timedelta(days=1)).strftime("%Y-%m-%d")
    logger.info(f"Processing run_date: {run_date}")
    try:
        # ── Start Glue crawler to pick up yesterday's landed files ───
        # start_crawler() is asynchronous: it returns before the crawl finishes,
        # so the job below must not depend on the crawl having completed
        glue.start_crawler(Name="landing-zone-crawler")
        logger.info("Crawler started")
        # ── Start the ETL job with the run date ───────────────────
        run = glue.start_job_run(
            JobName="daily-silver-etl",
            Arguments={"--run_date": run_date}
        )
        logger.info(f"ETL job started: {run['JobRunId']}")
        return {"status": "ok", "run_date": run_date, "run_id": run["JobRunId"]}
    except Exception as e:
        logger.error(f"Scheduled trigger failed: {e}")
        sns.publish(
            TopicArn=ALERT_ARN,
            Subject="❌ Daily ETL trigger failed",
            Message=str(e)
        )
        raise
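The handler above assumes a schedule already exists. One way to create it with boto3 is sketched here; create_daily_schedule and the rule name are hypothetical, and the two clients are passed in (boto3.client("events") and boto3.client("lambda")) so the wiring can be exercised with stand-ins. The cron expression uses EventBridge's six-field format and fires daily at 06:00 UTC.

```python
def create_daily_schedule(events, lm, function_name, function_arn,
                          rule_name="daily-etl-0600-utc"):
    """Create a daily 06:00 UTC rule and point it at a Lambda function.

    events: boto3 EventBridge client, lm: boto3 Lambda client.
    """
    # Fields: minute hour day-of-month month day-of-week year
    rule = events.put_rule(
        Name=rule_name,
        ScheduleExpression="cron(0 6 * * ? *)",
        State="ENABLED",
    )
    # Allow EventBridge to invoke the function, scoped to this rule only
    lm.add_permission(
        FunctionName=function_name,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )
    # Attach the function as the rule's target
    events.put_targets(Rule=rule_name, Targets=[{"Id": "1", "Arn": function_arn}])
    return rule["RuleArn"]
```

Without the add_permission step the rule fires but the invocation is rejected, which is a common reason a schedule appears to do nothing.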
Lambda can spin up an EMR cluster and submit a Spark step — or just add a step to an existing long-running cluster. This is useful when file arrival should kick off a Spark job that's too large for Glue (needs custom libraries, specific instance types, etc.).
import boto3, os, logging
from urllib.parse import unquote_plus
logger = logging.getLogger()
logger.setLevel(logging.INFO)
emr = boto3.client("emr")
CLUSTER = os.environ["EMR_CLUSTER_ID"] # pass via env var, not hardcoded
def lambda_handler(event, context):
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    # S3 event keys are URL-encoded; decode before passing the key on
    key = unquote_plus(event["Records"][0]["s3"]["object"]["key"])
    resp = emr.add_job_flow_steps(
        JobFlowId=CLUSTER,
        Steps=[{
            "Name": "Process landed file",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit", "--deploy-mode", "cluster",
                    "s3://my-scripts/process_file.py",
                    "--bucket", bucket,
                    "--key", key
                ]
            }
        }]
    )
    step_id = resp["StepIds"][0]
    logger.info(f"EMR step submitted: {step_id}")
    return {"step_id": step_id}
After a pipeline completes (triggered by SNS success event), Lambda writes a structured audit record to DynamoDB — capturing run ID, status, row count, S3 output path, and duration. This builds up a complete run history that's queryable for SLA tracking and debugging.
import boto3, json, logging
from datetime import datetime, timezone
from decimal import Decimal
logger = logging.getLogger()
logger.setLevel(logging.INFO)
ddb_table = boto3.resource("dynamodb").Table("pipeline_audit")
def lambda_handler(event, context):
    """SNS success event → write audit record to DynamoDB."""
    # SNS wraps the message in event["Records"][0]["Sns"]["Message"]
    payload = json.loads(event["Records"][0]["Sns"]["Message"])
    ddb_table.put_item(Item={
        "run_id": payload["run_id"],
        "pipeline_name": payload["pipeline"],
        "status": payload["status"],
        "rows_written": payload.get("rows_written", 0),
        "output_s3_path": payload.get("output_s3_path", ""),
        "duration_sec": Decimal(str(payload.get("duration_sec", 0))),
        "recorded_at": datetime.now(timezone.utc).isoformat()
    })
    logger.info(f"Audit record written for run {payload['run_id']}")
    return {"statusCode": 200}
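The handler above converts duration_sec by hand because the boto3 DynamoDB resource rejects Python floats and expects Decimal. When a payload has several fractional fields, one alternative is to let the JSON parser produce Decimal values directly. A minimal sketch, with a made-up message body:

```python
import json
from decimal import Decimal

# Hypothetical SNS message body; field names mirror the handler above
raw_message = (
    '{"run_id": "run-001", "rows_written": 4823441, '
    '"duration_sec": 312.7, "dq_score": 0.987}'
)

# parse_float is called with the original text of every JSON float,
# so values become exact Decimals instead of binary floats
payload = json.loads(raw_message, parse_float=Decimal)

print(payload["duration_sec"], type(payload["duration_sec"]).__name__)
```

Integers are left as int, which DynamoDB accepts as is.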
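For small files (under ~500 MB), Lambda can convert CSV to Parquet in-memory using pandas and pyarrow, with no Spark cluster needed. The file is downloaded to /tmp, converted, and uploaded back to S3, so size the function accordingly: /tmp defaults to 512 MB, and a pandas DataFrame usually needs several times the CSV's size in memory. For anything larger, use Glue or EMR.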
import boto3, pandas as pd, logging, os
from urllib.parse import unquote_plus
logger = logging.getLogger()
logger.setLevel(logging.INFO)
s3 = boto3.client("s3")
def lambda_handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = unquote_plus(record["s3"]["object"]["key"])
    # ── Only process CSV files ─────────────────────────────────────
    if not key.endswith(".csv"):
        logger.info("Not a CSV, skipping")
        return
    local_csv = f"/tmp/{os.path.basename(key)}"
    local_parquet = local_csv.replace(".csv", ".parquet")
    out_key = key.replace("landing/", "bronze/").replace(".csv", ".parquet")
    # ── Download CSV from S3 ───────────────────────────────────────
    s3.download_file(bucket, key, local_csv)
    logger.info(f"Downloaded {key} to {local_csv}")
    # ── Convert with pandas ────────────────────────────────────────
    df = pd.read_csv(local_csv)
    df.to_parquet(local_parquet, index=False, engine="pyarrow", compression="snappy")
    logger.info(f"Converted to Parquet: {df.shape[0]} rows, {df.shape[1]} cols")
    # ── Upload Parquet to S3 bronze zone ──────────────────────────
    s3.upload_file(local_parquet, bucket, out_key)
    logger.info(f"Uploaded to s3://{bucket}/{out_key}")
    return {"statusCode": 200, "output_key": out_key}
Lambda invocations are either synchronous or asynchronous, and event source mappings add a third, poll-based path; each handles errors very differently. Understanding this is critical: silent message loss in production is often caused by not knowing which mode is in use.
| Mode | Triggered By | On Error | DLQ Support |
|---|---|---|---|
| Synchronous | API Gateway, boto3 invoke (RequestResponse), Cognito | Error returned to caller immediately — no automatic retry | ❌ No |
| Asynchronous | S3 events, SNS, EventBridge | AWS retries 2 more times (total 3 attempts) with delays between | ✅ Yes |
| Poll-Based | SQS, Kinesis, DynamoDB Streams | Message stays in queue / stream until visibility timeout; routes to DLQ after max retries | ✅ SQS DLQ |
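For async-triggered Lambda (S3 events, SNS), configure a failure target: an SQS queue that receives event payloads Lambda couldn't process after all retries. This can be a classic DLQ (DeadLetterConfig on the function) or, as in the code below, an on-failure destination, which also records the error details alongside the original event. Without one, failed events are silently discarded after 3 attempts, and you'd have no record that a file arrived but failed to trigger your pipeline.
import boto3
lm = boto3.client("lambda")
# Attach an on-failure destination (an SQS queue acting as the DLQ) to the Lambda function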
lm.put_function_event_invoke_config(
FunctionName="file-arrival-handler",
MaximumRetryAttempts=2, # 0, 1, or 2 retries on async failure
MaximumEventAgeInSeconds=3600, # discard event if older than 1 hour
DestinationConfig={
"OnFailure": {
"Destination": "arn:aws:sqs:us-east-1:123456789012:lambda-dlq"
},
"OnSuccess": {
"Destination": "arn:aws:sns:us-east-1:123456789012:pipeline-success-events"
}
}
)
print("DLQ and success destination configured")
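Once failures land in the queue, something has to read them back. The sketch below assumes the on-failure destination message is a JSON invocation record with requestPayload (the original event) and responsePayload (the error) fields, which is my understanding of the record format; verify against a real message before relying on it. failed_s3_objects is a hypothetical helper name.

```python
import json
from urllib.parse import unquote_plus

def failed_s3_objects(message_body: str):
    """Return (bucket, key, error) tuples from one on-failure destination message."""
    record = json.loads(message_body)
    original_event = record.get("requestPayload") or {}
    error = (record.get("responsePayload") or {}).get("errorMessage", "unknown")
    return [
        (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]), error)
        for r in original_event.get("Records", [])
    ]

# Trimmed, hand-written sample of such a message (assumed shape)
sample = json.dumps({
    "requestContext": {"condition": "RetriesExhausted", "approximateInvokeCount": 3},
    "requestPayload": {"Records": [
        {"s3": {"bucket": {"name": "my-data-lake"},
                "object": {"key": "landing/orders/june+15.csv"}}}
    ]},
    "responsePayload": {"errorMessage": "Glue job not found"},
})
print(failed_s3_objects(sample))
```

The returned tuples are enough to re-drive the pipeline for exactly the files that were missed.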
Lambda automatically sends all print() and logging output to CloudWatch Logs. Use structured JSON logging so CloudWatch Logs Insights can query individual fields with its pipe-based query language, for example finding all runs that processed over 1M rows, or all failures for a specific pipeline in the last 24 hours.
import json, logging, time
from datetime import datetime, timezone
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def log(level: str, message: str, **kwargs):
    """Emit a structured JSON log line, queryable in CloudWatch Logs Insights."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        **kwargs
    }
    print(json.dumps(entry)) # Lambda sends print() to CloudWatch automatically
def lambda_handler(event, context):
    start = time.time()
    log("INFO", "Lambda started",
        request_id=context.aws_request_id,
        function=context.function_name)
    try:
        # ... pipeline logic ...
        rows = 4_823_441
        log("INFO", "Pipeline completed",
            rows_processed=rows,
            duration_ms=int((time.time() - start) * 1000))
        return {"statusCode": 200, "rows": rows}
    except Exception as e:
        log("ERROR", "Pipeline failed",
            error_type=type(e).__name__,
            error_message=str(e),
            duration_ms=int((time.time() - start) * 1000))
        raise
# CloudWatch Logs Insights query to find all failures in last 24h:
# fields @timestamp, error_type, error_message
# | filter level = "ERROR"
# | sort @timestamp desc
# | limit 50
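If you would rather keep using the standard logging module than a custom log() helper, the same JSON shape can come from a formatter. This is a sketch under assumptions: JsonFormatter and the "fields" key passed through extra are my own conventions, not part of Lambda or the logging module, and the demo writes to an in-memory stream so the output can be inspected.

```python
import io
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Extra fields arrive via logger.info(..., extra={"fields": {...}})
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, default=str)

# Demo wiring: an isolated logger writing to a string buffer
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
demo_logger = logging.getLogger("pipeline-demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.addHandler(handler)

demo_logger.info("Pipeline completed", extra={"fields": {"rows_processed": 4_823_441}})
print(stream.getvalue().strip())
```

Inside a Lambda function the root logger usually already has a handler, so a likely approach is to set this formatter on logging.getLogger().handlers instead of adding a new handler.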
import boto3, json
lm = boto3.client("lambda")
# ── SYNCHRONOUS invoke — waits for response ───────────────────────
resp = lm.invoke(
FunctionName="file-arrival-handler",
InvocationType="RequestResponse", # wait for result
Payload=json.dumps({
"bucket": "my-data-lake",
"key": "landing/orders/2024-06-15.csv"
}).encode()
)
status_code = resp["StatusCode"] # 200 = Lambda ran (not your function's return code)
result = json.loads(resp["Payload"].read())
func_error = resp.get("FunctionError") # "Handled" or "Unhandled" if function threw
print(f"Status: {status_code}, FunctionError: {func_error}")
print(f"Result: {result}")
if func_error:
    raise RuntimeError(f"Lambda function failed: {result.get('errorMessage')}")
# ── ASYNCHRONOUS invoke — fire and forget ─────────────────────────
lm.invoke(
FunctionName="daily-report-generator",
InvocationType="Event", # async: returns 202, no payload back
Payload=json.dumps({"run_date": "2024-06-15"}).encode()
)
print("Async invocation fired — not waiting for result")
resp["StatusCode"] == 200 means Lambda received your request and ran the function — not that the function logic succeeded. Always check resp.get("FunctionError") to detect runtime exceptions inside the handler.
import boto3, zipfile, io
lm = boto3.client("lambda")
# ── Package code into a ZIP in memory ─────────────────────────────
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("handler.py")
zip_bytes = buf.getvalue()
# ── CREATE a new function ─────────────────────────────────────────
lm.create_function(
FunctionName="file-arrival-handler",
Runtime="python3.12",
Role="arn:aws:iam::123456789012:role/lambda-de-role",
Handler="handler.lambda_handler", # filename.function_name
Code={"ZipFile": zip_bytes},
Description="Triggered on S3 landing file arrival",
Timeout=300, # 5 minutes
MemorySize=512,
Environment={
"Variables": {
"EMR_CLUSTER_ID": "j-ABCDEF123456",
"ALERT_TOPIC_ARN": "arn:aws:sns:us-east-1:123456789012:prod-alerts"
}
}
)
# ── UPDATE code (redeploy) ────────────────────────────────────────
lm.update_function_code(
FunctionName="file-arrival-handler",
ZipFile=zip_bytes,
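Publish=True # publish a new numbered version
)
# Wait until the code update has finished; changing configuration while an
# update is still in progress raises ResourceConflictException
lm.get_waiter("function_updated").wait(FunctionName="file-arrival-handler")
# ── UPDATE configuration (env vars, memory, timeout) ─────────────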
lm.update_function_configuration(
FunctionName="file-arrival-handler",
Timeout=600,
MemorySize=1024,
Environment={
"Variables": {
"EMR_CLUSTER_ID": "j-NEWCLUSTER",
"ALERT_TOPIC_ARN": "arn:aws:sns:us-east-1:123456789012:prod-alerts"
}
}
)
# ── GET function info ─────────────────────────────────────────────
info = lm.get_function(FunctionName="file-arrival-handler")
print(f"Runtime: {info['Configuration']['Runtime']}")
print(f"Memory: {info['Configuration']['MemorySize']} MB")
print(f"Timeout: {info['Configuration']['Timeout']} sec")
# ── LIST functions with paginator ─────────────────────────────────
paginator = lm.get_paginator("list_functions")
for page in paginator.paginate():
    for fn in page["Functions"]:
        print(f"{fn['FunctionName']:40} {fn['Runtime']:12} {fn['MemorySize']} MB")
# ── ADD S3 trigger permission (so S3 can invoke Lambda) ───────────
lm.add_permission(
FunctionName="file-arrival-handler",
StatementId="s3-invoke-permission",
Action="lambda:InvokeFunction",
Principal="s3.amazonaws.com",
SourceArn="arn:aws:s3:::my-data-lake",
SourceAccount="123456789012"
)
# ── DELETE function ───────────────────────────────────────────────
lm.delete_function(FunctionName="file-arrival-handler")
Every Lambda handler receives event (trigger payload) and context (runtime metadata). Memory and CPU scale together, so tune memory first for performance. The three DE trigger patterns are: S3 event (file arrival → Glue job), SQS (queue-driven pipeline orchestration with automatic retries), and EventBridge schedule (replacing cron jobs). Always configure a DLQ for async-triggered functions so failed events are never silently lost. Use batchItemFailures for SQS triggers to allow partial batch success. Structure logs as JSON so CloudWatch Logs Insights can query them. For invoke(), always check FunctionError: StatusCode 200 only means Lambda ran, not that your code succeeded.
Amazon CloudWatch — Observability for Data Pipelines
CloudWatch is AWS's unified observability platform. It collects metrics (numbers over time), logs (text events), and fires alarms when thresholds are breached. For Data Engineers, CloudWatch is how you answer: "Did my pipeline finish on time? How many rows did it process? Why did it fail at 3 AM?" — without SSHing into any server.
CloudWatch's three components work as a complete observability loop: Metrics tell you what is happening (numbers over time). Logs tell you why it happened (text events with details). Alarms tell you when something needs attention (thresholds on metrics that trigger SNS notifications).
AWS services automatically publish metrics to CloudWatch — you get Glue job duration, EMR step status, Lambda invocation count, and MSK consumer lag for free without writing any code. But these built-in metrics don't know your business logic. Custom metrics — rows processed, DQ score, pipeline SLA status — must be published by your pipeline code using put_metric_data().
| Type | Examples | How to Get Them | Cost |
|---|---|---|---|
| Built-In | Glue.driver.ExecutorRunTime, Lambda.Duration, EMR.StepState | Automatic — no code needed | Free (Basic Monitoring) |
| Custom | pipeline_rows_processed, dq_score, pipeline_duration_sec | put_metric_data() from your code | About $0.30 per metric per month at the first pricing tier; each unique dimension combination counts as a separate metric |
Every custom metric belongs to a Namespace (a folder-like grouping), has a MetricName, a numeric Value, a Unit, and optional Dimensions (key-value labels that let you slice the metric — e.g. filter by pipeline name or environment). CloudWatch stores metrics at 1-second resolution (high-res) or 1-minute resolution (standard).
import boto3
from datetime import datetime, timezone
cw = boto3.client("cloudwatch", region_name="us-east-1")
# ── Publish multiple pipeline metrics in ONE API call ─────────────
# (batch up to 1000 metrics per call — saves cost and latency)
cw.put_metric_data(
Namespace="DataPlatform/Pipelines", # your custom namespace
MetricData=[
# ① Rows written by this pipeline run
{
"MetricName": "RowsProcessed",
"Value": 4_823_441,
"Unit": "Count",
"Timestamp": datetime.now(timezone.utc),
"Dimensions": [
{"Name": "PipelineName", "Value": "silver-orders-etl"},
{"Name": "Environment", "Value": "prod"}
]
},
# ② Pipeline run duration in seconds
{
"MetricName": "DurationSeconds",
"Value": 312,
"Unit": "Seconds",
"Timestamp": datetime.now(timezone.utc),
"Dimensions": [
{"Name": "PipelineName", "Value": "silver-orders-etl"},
{"Name": "Environment", "Value": "prod"}
]
},
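# ③ Data quality score (0.0 – 1.0)
{
"MetricName": "DQScore",
"Value": 0.987,
"Unit": "None", # use "None" for dimensionless ratios
"Timestamp": datetime.now(timezone.utc),
# Use the same dimension set everywhere: alarms and dashboards
# only see a metric when every dimension matches exactly
"Dimensions": [
{"Name": "PipelineName", "Value": "silver-orders-etl"},
{"Name": "Environment", "Value": "prod"}
]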
},
# ④ Pipeline success/failure flag (1 = success, 0 = failure)
{
"MetricName": "PipelineSuccess",
"Value": 1,
"Unit": "Count",
"Timestamp": datetime.now(timezone.utc),
# Same dimension set as the other metrics so alarms can find it
"Dimensions": [
{"Name": "PipelineName", "Value": "silver-orders-etl"},
{"Name": "Environment", "Value": "prod"}
]
}
]
)
print("Custom metrics published to CloudWatch")
Use a consistent namespace hierarchy such as DataPlatform/Pipelines, DataPlatform/DQ, DataPlatform/SLA. This groups your metrics together in the CloudWatch console, making it easy to build dashboards per domain. Keep custom metrics out of the AWS/ namespaces, which are reserved for metrics published by AWS services.
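When a run produces more metrics than fit in one request, the list has to be split. A minimal sketch, assuming the 1,000-metrics-per-call figure quoted above and a hypothetical helper name publish_in_batches; cw is a boto3 CloudWatch client passed in by the caller. It does not check the request size limit, which can also apply to very large payloads.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def publish_in_batches(cw, namespace, metric_data, batch_size=1000):
    """Send metric_data through put_metric_data in batches; return the call count."""
    calls = 0
    for batch in chunked(metric_data, batch_size):
        cw.put_metric_data(Namespace=namespace, MetricData=batch)
        calls += 1
    return calls
```

For example, 2,500 metric entries would go out as three calls of 1,000, 1,000 and 500 entries.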
Rather than scattering put_metric_data() calls throughout your code, build a small helper class and call it at the end of every pipeline run. This ensures consistent metric names, dimensions, and error handling across all pipelines.
import boto3, logging
from datetime import datetime, timezone
from botocore.exceptions import ClientError
logger = logging.getLogger(__name__)
class PipelineMetrics:
    """Publish standard pipeline metrics to CloudWatch."""
    def __init__(self, pipeline_name: str, environment: str = "prod"):
        self.cw = boto3.client("cloudwatch")
        self.namespace = "DataPlatform/Pipelines"
        self.pipeline_name = pipeline_name
        self.environment = environment
        self.base_dims = [
            {"Name": "PipelineName", "Value": pipeline_name},
            {"Name": "Environment", "Value": environment}
        ]
    def _put(self, name: str, value: float, unit: str = "Count"):
        try:
            self.cw.put_metric_data(
                Namespace=self.namespace,
                MetricData=[{
                    "MetricName": name,
                    "Value": value,
                    "Unit": unit,
                    "Timestamp": datetime.now(timezone.utc),
                    "Dimensions": self.base_dims
                }]
            )
        except ClientError as e:
            # Never let metric publishing crash your pipeline
            logger.warning(f"CloudWatch metric failed: {e}")
    def record_success(self, rows: int, duration_sec: float, dq_score: float = 1.0):
        self._put("RowsProcessed", rows, "Count")
        self._put("DurationSeconds", duration_sec, "Seconds")
        self._put("DQScore", dq_score, "None")
        self._put("PipelineSuccess", 1, "Count")
        self._put("PipelineFailure", 0, "Count")
    def record_failure(self, duration_sec: float):
        self._put("PipelineSuccess", 0, "Count")
        self._put("PipelineFailure", 1, "Count")
        self._put("DurationSeconds", duration_sec, "Seconds")
# ── Usage in any pipeline ─────────────────────────────────────────
import time
metrics = PipelineMetrics("silver-orders-etl")
start = time.time()
try:
    # ... your Spark / Glue logic ...
    rows_written = 4_823_441
    dq_score = 0.987
    metrics.record_success(rows_written, time.time() - start, dq_score)
except Exception as e:
    metrics.record_failure(time.time() - start)
    raise
A CloudWatch alarm watches one metric over a time window and transitions between three states based on whether the metric crosses a threshold. The AlarmActions list (SNS ARNs) is triggered on every state transition into ALARM — not on every data point.
| State | Meaning | Action List Triggered on Entering This State |
|---|---|---|
| OK | Metric is within threshold | OKActions list (optional — for recovery alerts) |
| ALARM | Metric has breached threshold | AlarmActions list → SNS → email/Slack/PagerDuty |
| INSUFFICIENT_DATA | Not enough data points yet | InsufficientDataActions list (optional) |
import boto3
cw = boto3.client("cloudwatch")
ALERT_ARN = "arn:aws:sns:us-east-1:123456789012:prod-pipeline-alerts"
# ── ALARM 1: Pipeline failure detected ───────────────────────────
cw.put_metric_alarm(
AlarmName="PipelineFailure-silver-orders-etl",
AlarmDescription="silver-orders-etl reported a failure",
Namespace="DataPlatform/Pipelines",
MetricName="PipelineFailure",
Dimensions=[
{"Name": "PipelineName", "Value": "silver-orders-etl"},
{"Name": "Environment", "Value": "prod"}
],
Statistic="Sum",
Period=300, # 5-minute evaluation window
EvaluationPeriods=1,
Threshold=1, # any failure triggers alarm
ComparisonOperator="GreaterThanOrEqualToThreshold",
AlarmActions=[ALERT_ARN],
OKActions=[ALERT_ARN],
TreatMissingData="notBreaching" # no data = assume OK (pipeline not running)
)
# ── ALARM 2: DQ score dropped below 95% ──────────────────────────
cw.put_metric_alarm(
AlarmName="DQScoreLow-silver-orders-etl",
AlarmDescription="Data quality score below 95%",
Namespace="DataPlatform/Pipelines",
MetricName="DQScore",
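# Dimensions must match the published metric exactly; PipelineMetrics
# publishes both PipelineName and Environment
Dimensions=[
{"Name": "PipelineName", "Value": "silver-orders-etl"},
{"Name": "Environment", "Value": "prod"}
],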
Statistic="Minimum",
Period=300,
EvaluationPeriods=1,
Threshold=0.95,
ComparisonOperator="LessThanThreshold",
AlarmActions=[ALERT_ARN],
TreatMissingData="notBreaching"
)
# ── ALARM 3: Pipeline duration SLA breach (> 60 min) ─────────────
cw.put_metric_alarm(
AlarmName="SLABreach-silver-orders-etl",
AlarmDescription="silver-orders-etl exceeded 60 min SLA",
Namespace="DataPlatform/Pipelines",
MetricName="DurationSeconds",
Dimensions=[
{"Name": "PipelineName", "Value": "silver-orders-etl"},
{"Name": "Environment", "Value": "prod"}
],
Statistic="Maximum",
Period=300,
EvaluationPeriods=1,
Threshold=3600, # 60 minutes in seconds
ComparisonOperator="GreaterThanThreshold",
AlarmActions=[ALERT_ARN],
TreatMissingData="notBreaching"
)
# ── ALARM 4: DLQ depth > 0 (messages stuck in dead letter queue) ─
cw.put_metric_alarm(
AlarmName="DLQNotEmpty-pipeline-dlq",
AlarmDescription="Messages in DLQ — pipeline events failed all retries",
Namespace="AWS/SQS", # built-in AWS namespace
MetricName="ApproximateNumberOfMessagesVisible",
Dimensions=[{"Name": "QueueName", "Value": "pipeline-dlq"}],
Statistic="Sum",
Period=60,
EvaluationPeriods=1,
Threshold=1,
ComparisonOperator="GreaterThanOrEqualToThreshold",
AlarmActions=[ALERT_ARN],
TreatMissingData="notBreaching"
)
print("All pipeline alarms created")
A composite alarm combines multiple child alarms with AND/OR logic. Use them to consolidate alerting: the example below pages the on-call engineer once when either the failure alarm or the DQ alarm fires, instead of sending two separate pages. Swap OR for AND to page only when both alarms are in ALARM at the same time.
# Composite alarm fires when EITHER child alarm is in ALARM state
cw.put_composite_alarm(
AlarmName="CriticalPipelineAlert-silver-orders",
AlarmDescription="Page on-call: failure OR DQ breach in silver-orders",
AlarmRule=(
'ALARM("PipelineFailure-silver-orders-etl") OR '
'ALARM("DQScoreLow-silver-orders-etl")'
),
AlarmActions=[ALERT_ARN]
)
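Rule strings get error-prone once more than two child alarms are involved. A small hypothetical helper, alarm_rule, can build them; it only covers the flat all-AND or all-OR case, not nested expressions or NOT.

```python
def alarm_rule(alarm_names, operator="OR"):
    """Build a flat AlarmRule expression such as ALARM("a") AND ALARM("b")."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)

# AND variant: page only when both child alarms are in ALARM
both_rule = alarm_rule(
    ["PipelineFailure-silver-orders-etl", "DQScoreLow-silver-orders-etl"],
    operator="AND",
)
print(both_rule)
```

The resulting string can be passed as AlarmRule to put_composite_alarm in place of the hand-written expression.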
CloudWatch Logs organizes log data into log groups (one per service or application — e.g. /aws/glue/jobs/silver-orders-etl) and log streams (one per run/instance — e.g. the job run ID). AWS services like Lambda and Glue create these automatically. For custom applications (e.g. pipeline orchestration scripts), you create them yourself.
import boto3, json, time, logging
from datetime import datetime, timezone
logger = logging.getLogger(__name__)
cw = boto3.client("logs")
LOG_GROUP = "/dataplatform/pipelines/silver-orders-etl"
LOG_STREAM = f"run-{datetime.now(timezone.utc).strftime('%Y-%m-%dT%H-%M-%S')}"
# ── 1. Create log group (idempotent — safe to call if already exists) ─
try:
    cw.create_log_group(logGroupName=LOG_GROUP)
    cw.put_retention_policy( # keep logs for 30 days
        logGroupName=LOG_GROUP,
        retentionInDays=30
    )
except cw.exceptions.ResourceAlreadyExistsException:
    pass
# ── 2. Create log stream for this run ─────────────────────────────
try:
    cw.create_log_stream(logGroupName=LOG_GROUP, logStreamName=LOG_STREAM)
except cw.exceptions.ResourceAlreadyExistsException:
    pass
# ── 3. Publish log events ─────────────────────────────────────────
# Each event: {"timestamp": epoch_ms, "message": str}
# Must be in chronological order within one put_log_events call
def now_ms() -> int:
    return int(time.time() * 1000)
log_events = [
{"timestamp": now_ms(), "message": json.dumps({
"level": "INFO", "event": "pipeline_started",
"pipeline": "silver-orders-etl", "run_date": "2024-06-15"
})},
{"timestamp": now_ms() + 1, "message": json.dumps({
"level": "INFO", "event": "rows_written",
"rows": 4_823_441, "table": "silver.orders"
})},
{"timestamp": now_ms() + 2, "message": json.dumps({
"level": "INFO", "event": "pipeline_completed",
"duration_sec": 312, "dq_score": 0.987
})}
]
cw.put_log_events(
logGroupName=LOG_GROUP,
logStreamName=LOG_STREAM,
logEvents=log_events
# sequenceToken is no longer required: PutLogEvents now ignores it,
# so repeated calls to the same stream can omit it
)
print(f"Logs published to {LOG_GROUP}/{LOG_STREAM}")
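A long-running pipeline can accumulate more events than one put_log_events call accepts. The sketch below splits a list into compliant batches, based on the documented limits as I understand them: at most 10,000 events per call, and at most 1,048,576 bytes counted as UTF-8 message size plus 26 bytes per event. batch_log_events is a hypothetical helper, and it does not enforce the rule that a batch may not span more than 24 hours.

```python
MAX_BATCH_BYTES = 1_048_576   # per-call payload limit
MAX_BATCH_EVENTS = 10_000     # per-call event count limit
PER_EVENT_OVERHEAD = 26       # bytes added per event when sizing a batch

def batch_log_events(events):
    """Sort events chronologically and split them into compliant batches."""
    batches, current, current_bytes = [], [], 0
    for event in sorted(events, key=lambda e: e["timestamp"]):
        size = len(event["message"].encode("utf-8")) + PER_EVENT_OVERHEAD
        too_big = current_bytes + size > MAX_BATCH_BYTES
        too_many = len(current) >= MAX_BATCH_EVENTS
        if current and (too_big or too_many):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(event)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch can be passed as logEvents to a separate put_log_events call.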
Logs Insights lets you query across all log streams in a log group using a purpose-built query language. If you structured your logs as JSON (as shown above), you can filter by any field: finding all pipeline runs that processed over 1M rows, all ERROR-level events in the last 24 hours, or the average duration per pipeline over the last week.
import boto3, time
from datetime import datetime, timezone, timedelta
cw = boto3.client("logs")
# ── Start the query ───────────────────────────────────────────────
now = datetime.now(timezone.utc)
start_time = now - timedelta(hours=24)
query_resp = cw.start_query(
logGroupName="/dataplatform/pipelines/silver-orders-etl",
startTime=int(start_time.timestamp()),
endTime=int(now.timestamp()),
queryString="""
fields @timestamp, event, rows, duration_sec, dq_score
| filter level = "INFO" and event = "pipeline_completed"
| sort @timestamp desc
| limit 20
"""
)
query_id = query_resp["queryId"]
# ── Poll until the query finishes ─────────────────────────────────
while True:
    result = cw.get_query_results(queryId=query_id)
    status = result["status"]
    print(f"Query status: {status}")
    # "Timeout" is also a terminal status; without it the loop could never exit
    if status in ["Complete", "Failed", "Cancelled", "Timeout"]:
        break
    time.sleep(1)
# ── Parse results ─────────────────────────────────────────────────
# Each result row is a list of {"field": name, "value": val} dicts
for row in result["results"]:
    record = {item["field"]: item["value"] for item in row}
    print(f" {record.get('@timestamp')} | rows={record.get('rows')} | dq={record.get('dq_score')}")
All ERROR-level events, newest first:
filter level = "ERROR" | fields @timestamp, error_type, error_message | sort @timestamp desc
Average pipeline duration per run date:
filter event = "pipeline_completed" | stats avg(duration_sec) by run_date
Pipelines with DQ score below 95%:
filter event = "pipeline_completed" and dq_score < 0.95 | fields @timestamp, pipeline, dq_score
For simple keyword searches without the query language, use filter_log_events(). It scans all streams in a log group for a pattern match. Useful for quickly finding a specific run ID or error string in production.
import boto3
from datetime import datetime, timezone, timedelta
cw = boto3.client("logs")
now = datetime.now(timezone.utc)
paginator = cw.get_paginator("filter_log_events")
pages = paginator.paginate(
logGroupName="/dataplatform/pipelines/silver-orders-etl",
startTime=int((now - timedelta(hours=6)).timestamp() * 1000), # epoch ms
endTime=int(now.timestamp() * 1000),
filterPattern='"pipeline_failed"' # exact phrase match
)
for page in pages:
    for event in page["events"]:
        print(f"Stream: {event['logStreamName']}")
        print(f"Time: {datetime.fromtimestamp(event['timestamp']/1000, tz=timezone.utc)}")
        print(f"Msg: {event['message']}\n")
A CloudWatch Dashboard is a JSON-defined collection of widgets — line charts, number widgets, alarm status panels. You define the dashboard body as a JSON string and call put_dashboard(). The result is a live monitoring page in the AWS console that your team can bookmark.
import boto3, json
cw = boto3.client("cloudwatch")
dashboard_body = {
"widgets": [
# ── Widget 1: Rows processed over time ─────────────────────
{
"type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
"properties": {
"title": "Rows Processed Per Run",
"view": "timeSeries",
"region": "us-east-1", # metric widgets need a region; matches the alarm ARNs below
"period": 300,
"metrics": [[
"DataPlatform/Pipelines", "RowsProcessed",
"PipelineName", "silver-orders-etl",
"Environment", "prod"
]]
}
},
# ── Widget 2: DQ score over time ────────────────────────────
{
"type": "metric", "x": 12, "y": 0, "width": 12, "height": 6,
"properties": {
"title": "Data Quality Score",
"view": "timeSeries",
"region": "us-east-1",
"period": 300,
"metrics": [[
"DataPlatform/Pipelines", "DQScore",
"PipelineName", "silver-orders-etl"
]]
}
},
# ── Widget 3: Alarm status panel ────────────────────────────
{
"type": "alarm", "x": 0, "y": 6, "width": 24, "height": 3,
"properties": {
"title": "Pipeline Alarm Status",
"alarms": [
"arn:aws:cloudwatch:us-east-1:123456789012:alarm:PipelineFailure-silver-orders-etl",
"arn:aws:cloudwatch:us-east-1:123456789012:alarm:DQScoreLow-silver-orders-etl",
"arn:aws:cloudwatch:us-east-1:123456789012:alarm:DLQNotEmpty-pipeline-dlq"
]
}
}
]
}
cw.put_dashboard(
DashboardName="DataPlatform-Pipeline-Health",
DashboardBody=json.dumps(dashboard_body)
)
print("Dashboard created: DataPlatform-Pipeline-Health")
Publish a PipelineSuccess metric (value = 1) when the pipeline finishes. Set a CloudWatch alarm that fires if the sum of PipelineSuccess over the expected completion window is zero — meaning the pipeline didn't run at all. This catches the silent failure: a pipeline that simply didn't execute rather than crashed.
# Alarm: if no PipelineSuccess data point arrives within a 24-hour period,
# the daily run did not complete → SLA breach.
# CloudWatch evaluates alarms over rolling periods, not fixed clock windows:
# a 1-hour period would sit in ALARM for most of the day on a daily pipeline.
# Enforcing an exact "by 07:00 UTC" deadline needs a separate scheduled check.
cw.put_metric_alarm(
    AlarmName="SLAMiss-silver-orders-etl-daily",
    AlarmDescription="silver-orders-etl reported no successful run in the last 24 hours",
    Namespace="DataPlatform/Pipelines",
    MetricName="PipelineSuccess",
    Dimensions=[
        {"Name": "PipelineName", "Value": "silver-orders-etl"},
        {"Name": "Environment", "Value": "prod"}
    ],
    Statistic="Sum",
    Period=86400, # 24-hour window for a daily pipeline
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=[ALERT_ARN],
    TreatMissingData="breaching" # ← KEY: no data = SLA breach
)
The key setting is TreatMissingData="breaching". This means "if no metric data arrived in this window, treat it as a threshold breach." Without it, a pipeline that simply never ran would publish no data points at all, so the alarm would never move into ALARM for the silent failure case.
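The alarm only helps if the pipeline actually emits the heartbeat. The following is a minimal sketch of the publishing side, assuming the same namespace and dimensions as the alarm above. The helper name `build_success_metric` is hypothetical, and the boto3 call is left as a comment so the snippet runs without AWS credentials.

```python
from datetime import datetime, timezone


def build_success_metric(pipeline_name: str, environment: str) -> dict:
    """Build keyword arguments for cloudwatch.put_metric_data() (hypothetical helper)."""
    return {
        "Namespace": "DataPlatform/Pipelines",
        "MetricData": [{
            "MetricName": "PipelineSuccess",
            # Dimensions must match the alarm's dimensions exactly,
            # otherwise CloudWatch treats this as a different metric.
            "Dimensions": [
                {"Name": "PipelineName", "Value": pipeline_name},
                {"Name": "Environment", "Value": environment},
            ],
            "Timestamp": datetime.now(timezone.utc),
            "Value": 1.0,
            "Unit": "Count",
        }],
    }


params = build_success_metric("silver-orders-etl", "prod")
# As the very last step of a successful run:
# boto3.client("cloudwatch").put_metric_data(**params)
```

Publishing the heartbeat as the final step matters: if it is emitted before the last write, a late failure would still look like a success to the alarm.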
Publish RowsProcessed after every run. If today's run wrote 10M rows but yesterday wrote 5M, that's either a data explosion or a bug — both worth alerting on. Use CloudWatch Anomaly Detection to automatically set dynamic thresholds based on historical patterns, rather than hard-coding a fixed number.
# Put an anomaly detector on the RowsProcessed metric
cw.put_anomaly_detector(
Namespace="DataPlatform/Pipelines",
MetricName="RowsProcessed",
Dimensions=[
{"Name": "PipelineName", "Value": "silver-orders-etl"},
{"Name": "Environment", "Value": "prod"}
],
Stat="Sum",
Configuration={
"ExcludedTimeRanges": [] # optionally exclude known anomalies from training
}
)
print("Anomaly detector training started — takes ~2 weeks of data")
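An anomaly detector on its own only trains a model; nothing alerts until an alarm references its band. Below is a hedged sketch of the keyword arguments such an alarm could use with `put_metric_alarm()`. The helper name, the alarm name, the band width of 2, and the 1-hour period are illustrative assumptions, not values from the course.

```python
def build_anomaly_alarm(pipeline_name: str, environment: str,
                        alert_arn: str, band_width: float = 2.0) -> dict:
    """Build keyword arguments for cloudwatch.put_metric_alarm() (hypothetical helper)."""
    dimensions = [
        {"Name": "PipelineName", "Value": pipeline_name},
        {"Name": "Environment", "Value": environment},
    ]
    return {
        "AlarmName": f"RowCountAnomaly-{pipeline_name}",  # assumed naming scheme
        # Fire when the metric leaves the expected band in either direction
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 1,
        "ThresholdMetricId": "band",  # points at the band expression below
        "AlarmActions": [alert_arn],
        "Metrics": [
            {
                "Id": "rows",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "DataPlatform/Pipelines",
                        "MetricName": "RowsProcessed",
                        "Dimensions": dimensions,
                    },
                    "Period": 3600,  # assumption; tune to the pipeline cadence
                    "Stat": "Sum",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(rows, {band_width})",
                "Label": "RowsProcessed (expected)",
                "ReturnData": True,
            },
        ],
    }


alarm = build_anomaly_alarm(
    "silver-orders-etl", "prod",
    "arn:aws:sns:us-east-1:123456789012:pipeline-alerts",  # placeholder ARN
)
# boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

A wider band (a larger second argument to `ANOMALY_DETECTION_BAND`) tolerates more variation before alerting, which is usually the first knob to turn if the alarm is noisy.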
import boto3
cw = boto3.client("cloudwatch") # metrics + alarms + dashboards
logs = boto3.client("logs") # log groups, streams, events, insights
# ════════════════════════════════════════════════════════════════
# METRICS
# ════════════════════════════════════════════════════════════════
# Publish custom metric data points
cw.put_metric_data(Namespace="...", MetricData=[...])
# Get metric statistics (average, sum, min, max over a time range)
cw.get_metric_statistics(
Namespace="DataPlatform/Pipelines", MetricName="DurationSeconds",
Dimensions=[{"Name": "PipelineName", "Value": "silver-orders-etl"}],
StartTime=..., EndTime=..., Period=86400, Statistics=["Average", "Maximum"]
)
# Batch query multiple metrics simultaneously (more efficient than get_metric_statistics)
cw.get_metric_data(
MetricDataQueries=[
{"Id": "rows", "MetricStat": {
"Metric": {"Namespace": "DataPlatform/Pipelines", "MetricName": "RowsProcessed",
"Dimensions": [{"Name": "PipelineName", "Value": "silver-orders-etl"}]},
"Period": 86400, "Stat": "Sum"
}}
],
StartTime=..., EndTime=...
)
# ════════════════════════════════════════════════════════════════
# ALARMS
# ════════════════════════════════════════════════════════════════
cw.put_metric_alarm(AlarmName="...", ...) # create or update alarm
cw.put_composite_alarm(AlarmName="...", AlarmRule="...", ...)
cw.describe_alarms(AlarmNames=["..."]) # get current state + config
cw.describe_alarm_history(AlarmName="...") # state transition history
cw.set_alarm_state( # manually force state (testing)
AlarmName="test-alarm",
StateValue="ALARM",
StateReason="Manual test"
)
cw.delete_alarms(AlarmNames=["..."])
# ════════════════════════════════════════════════════════════════
# LOGS
# ════════════════════════════════════════════════════════════════
logs.create_log_group(logGroupName="...")
logs.put_retention_policy(logGroupName="...", retentionInDays=30)
logs.create_log_stream(logGroupName="...", logStreamName="...")
logs.put_log_events(logGroupName="...", logStreamName="...", logEvents=[...])
logs.get_paginator("filter_log_events").paginate(logGroupName="...", filterPattern="...")
logs.describe_log_streams(logGroupName="...")
# Log Insights
logs.start_query(logGroupName="...", queryString="...", startTime=..., endTime=...)
logs.get_query_results(queryId="...") # poll until status = "Complete"
# ════════════════════════════════════════════════════════════════
# DASHBOARDS
# ════════════════════════════════════════════════════════════════
cw.put_dashboard(DashboardName="...", DashboardBody=json.dumps({...}))
cw.get_dashboard(DashboardName="...")
cw.list_dashboards()
cw.delete_dashboards(DashboardNames=["..."])
Publish custom metrics (put_metric_data()) to track rows processed, DQ score, duration, and success/failure; AWS built-ins don't know your business logic. Wire alarms to SNS topics for automated alerting on failure, DQ drops, SLA breaches, and DLQ depth spikes. Use TreatMissingData="breaching" on SLA alarms so a pipeline that simply never ran triggers the alarm. Structure all logs as JSON so CloudWatch Log Insights can query them by field. Batch metrics together in a single put_metric_data() call (up to 1000 per call) to reduce cost. Use set_alarm_state() in staging to test your SNS alerting pipeline before going to production.
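The batching advice can be sketched as a small chunking helper. This is an illustration only: `chunk_metric_data` is a hypothetical name, the sample data points are made up, and the 1000-per-call figure is the limit quoted in the summary.

```python
from typing import Iterator

MAX_METRICS_PER_CALL = 1000  # per-request limit quoted in the summary


def chunk_metric_data(metric_data: list, size: int = MAX_METRICS_PER_CALL) -> Iterator[list]:
    """Yield successive slices of metric_data, each small enough for one API call."""
    for start in range(0, len(metric_data), size):
        yield metric_data[start:start + size]


# 2,500 made-up data points to show the batching behaviour
datums = [{"MetricName": "RowsProcessed", "Value": float(i), "Unit": "Count"}
          for i in range(2500)]
batches = list(chunk_metric_data(datums))
print([len(batch) for batch in batches])

# for batch in batches:
#     cloudwatch.put_metric_data(Namespace="DataPlatform/Pipelines", MetricData=batch)
```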
AWS VPC for Data Engineers
A VPC (Virtual Private Cloud) is the private network every Glue job, EMR cluster, RDS database, and Lambda function lives inside. As a Data Engineer you rarely build VPCs from scratch — but you must understand subnets, route tables, security groups, and especially VPC Endpoints, because half of all "it works in my notebook but fails in Glue" problems are network connectivity issues, not code issues.
A VPC is an isolated, private slice of the AWS network — your own virtual data center with its own IP address range. Every resource that needs network connectivity — EMR clusters, RDS databases, Glue jobs (when configured with a VPC connection), Lambda functions (when accessing private resources) — runs inside a VPC, even though Glue and Lambda look "serverless" from the outside.
A VPC is split into subnets, each tied to a single Availability Zone (AZ) with its own slice of the VPC's IP range (CIDR block). Every subnet is classified as public or private based on one thing only: does its route table send 0.0.0.0/0 traffic to an Internet Gateway?
| Subnet Type | Route for 0.0.0.0/0 | Internet Access | Typical Resources |
|---|---|---|---|
| Public | → Internet Gateway (IGW) | Direct, two-way | NAT Gateways, Bastion hosts, ALBs |
| Private | → NAT Gateway (or none) | Outbound only (via NAT), no inbound | EMR clusters, RDS, Redshift, Glue ENIs, Lambda (VPC mode) |
| Isolated | No internet route at all | None — only VPC Endpoints | Highly sensitive databases, internal-only services |
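The classification rule in the table can be expressed as a short function over a route table's routes. This is a sketch under assumptions: the input is the `Routes` list in the shape returned by `ec2.describe_route_tables()`, the function name is hypothetical, and any default route that does not point at an Internet Gateway is treated as private.

```python
def classify_subnet(routes: list) -> str:
    """Classify a subnet as public, private, or isolated from its route table routes."""
    for route in routes:
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue  # only the default route decides the classification
        if route.get("GatewayId", "").startswith("igw-"):
            return "public"   # default route goes straight to an Internet Gateway
        return "private"      # default route exists but goes elsewhere (e.g. a NAT Gateway)
    return "isolated"         # no default route at all


# Illustrative route lists; the IDs are placeholders
public_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0abc"},
]
private_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0def"},
]
isolated_routes = [{"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"}]

for name, routes in [("public", public_routes), ("private", private_routes),
                     ("isolated", isolated_routes)]:
    print(name, "->", classify_subnet(routes))
```

Running a check like this against the subnet a Glue connection or EMR cluster uses is a quick way to confirm whether a connectivity failure is a routing problem.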
A CIDR block (e.g. 10.0.0.0/16) defines the range of private IP addresses available to a VPC or subnet. The number after the slash is the prefix length: smaller numbers mean larger ranges. A /16 gives 65,536 addresses for the whole VPC; each subnet typically gets a /24 (256 addresses, of which AWS reserves 5 in every subnet).
| CIDR Notation | Number of IPs | Typical Use |
|---|---|---|
| 10.0.0.0/16 | 65,536 | Entire VPC |
| 10.0.1.0/24 | 256 | One public subnet (AZ-a) |
| 10.0.10.0/24 | 256 | One private subnet (AZ-a) for EMR/RDS |
| 10.0.11.0/24 | 256 | Private subnet (AZ-b) for HA / Multi-AZ |
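Size subnets for peak cluster size: every EMR node takes an IP address, and a large cluster can exhaust a small /28 subnet, causing scale-out failures with no obvious error in Spark itself, only in the EMR cluster provisioning logs.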
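The subnet sizes in the CIDR table can be checked with Python's standard `ipaddress` module. This sketch subtracts the 5 addresses AWS reserves in every subnet; the function name is hypothetical.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves the first four addresses and the last one


def usable_ips(cidr: str) -> int:
    """Return how many addresses in a subnet are actually assignable to resources."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


for cidr in ("10.0.10.0/24", "10.0.10.0/28"):
    print(f"{cidr}: {usable_ips(cidr)} usable addresses")
```

A /28 leaves only 11 assignable addresses, which is why a cluster that scales past a handful of nodes cannot fit in one.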
import boto3
ec2 = boto3.client("ec2", region_name="us-east-1")
# ── Create the VPC ────────────────────────────────────────────────
vpc = ec2.create_vpc(
CidrBlock="10.0.0.0/16",
TagSpecifications=[{
"ResourceType": "vpc",
"Tags": [{"Key": "Name", "Value": "data-platform-vpc"}]
}]
)
vpc_id = vpc["Vpc"]["VpcId"]
# ── Create a PRIVATE subnet for EMR / RDS / Glue ENIs ───────────────
private_subnet = ec2.create_subnet(
VpcId=vpc_id,
CidrBlock="10.0.10.0/24",
AvailabilityZone="us-east-1a",
TagSpecifications=[{
"ResourceType": "subnet",
"Tags": [{"Key": "Name", "Value": "private-subnet-emr-rds-a"}]
}]
)
private_subnet_id = private_subnet["Subnet"]["SubnetId"]
# ── Create a PUBLIC subnet for the NAT Gateway ──────────────────────
public_subnet = ec2.create_subnet(
VpcId=vpc_id,
CidrBlock="10.0.1.0/24",
AvailabilityZone="us-east-1a",
TagSpecifications=[{
"ResourceType": "subnet",
"Tags": [{"Key": "Name", "Value": "public-subnet-nat-a"}]
}]
)
print(f"VPC: {vpc_id} | Private subnet: {private_subnet_id}")
AWS Cost Optimization for Data Engineers
Cloud costs spiral quickly in data pipelines. This section covers every lever a Data Engineer controls — from Spot Instances and Reserved capacity to S3 lifecycle automation, Glue DPU tuning, Athena query cost control, Lambda memory right-sizing, and Redshift pause/resume. Master these and you can cut pipeline costs by 40–70%.
Spot Instances are spare EC2 capacity that AWS sells at up to 90% discount compared to On-Demand pricing. The catch: AWS can reclaim them with a 2-minute warning when that capacity is needed elsewhere. For data pipelines this is usually fine — your EMR or EKS cluster just retries the failed tasks.
In EMR, always keep the Master node and Core nodes On-Demand (they hold HDFS data and the YARN resource manager). Put all Task nodes on Spot — they only run tasks, hold no HDFS data, and losing them causes a retry, not a cluster failure.
import boto3
emr = boto3.client('emr', region_name='us-east-1')
response = emr.run_job_flow(
    Name='cost-optimised-cluster',
    ReleaseLabel='emr-6.15.0',
    Applications=[{'Name': 'Spark'}],
    Instances={
        # Declare every role as an instance group rather than mixing
        # InstanceGroups with the MasterInstanceType / SlaveInstanceType /
        # InstanceCount shorthand
        'InstanceGroups': [
            {   # Master: On-Demand (controls the cluster)
                'Name': 'Master',
                'Market': 'ON_DEMAND',
                'InstanceRole': 'MASTER',
                'InstanceType': 'm5.xlarge',
                'InstanceCount': 1,
            },
            {   # Core: On-Demand (holds HDFS blocks)
                'Name': 'Core',
                'Market': 'ON_DEMAND',
                'InstanceRole': 'CORE',
                'InstanceType': 'm5.2xlarge',
                'InstanceCount': 1,
            },
            {   # Task: Spot (pure compute, no HDFS data)
                'Name': 'SpotTaskNodes',
                'Market': 'SPOT',
                'InstanceRole': 'TASK',
                'InstanceType': 'm5.2xlarge',
                'InstanceCount': 8,
                'BidPrice': '0.20',  # max price; AWS charges the market price
            },
        ],
        'KeepJobFlowAliveWhenNoSteps': False,  # terminate once all steps finish
        'TerminationProtected': False,
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print(f"Cluster: {response['JobFlowId']}")
When AWS reclaims a Spot node, the 2-minute notification triggers EMR's graceful decommission — it attempts to move in-progress tasks to other nodes. Spark's built-in task retry (default 4 retries) handles the rest. Configure spark.task.maxFailures to be tolerant.
# In your EMR Configurations override:
configurations = [
    {
        'Classification': 'spark-defaults',
        'Properties': {
            'spark.task.maxFailures': '8', # default 4 → raise to 8 for Spot
            'spark.stage.maxConsecutiveAttempts': '8',
            # Spark 3.1+ renamed spark.blacklist.* to spark.excludeOnFailure.*
            'spark.excludeOnFailure.enabled': 'true', # stop scheduling on repeatedly failing nodes
            'spark.excludeOnFailure.task.maxTaskAttemptsPerNode': '2',
        }
    }
]
A reliable production pattern: provision 20% On-Demand task nodes as a base, and 80% Spot task nodes for burst capacity. Even if all Spot nodes are reclaimed, the job continues (slowly) on On-Demand nodes.
| Node Type | Market | Why |
|---|---|---|
| Master (1x) | On-Demand | YARN ResourceManager and HDFS NameNode: cannot be lost |
| Core (2x) | On-Demand | HDFS DataNodes: losing one risks losing HDFS blocks |
| Task — base (2x) | On-Demand | Fallback if all Spot reclaimed |
| Task — burst (8x) | Spot | 60–80% cheaper; retried on interruption |
With Reserved Instances (RIs), you commit to using a specific EC2 instance type in a specific region for 1 or 3 years in exchange for a 30–60% discount. Use RIs for your always-on baseline infrastructure: the EC2 instances behind EMR master/core nodes that run every day, and persistent Redshift clusters (through Redshift reserved nodes). Glue workers are billed per DPU-hour and are not covered by RIs.
| RI Type | Flexibility | Discount | Use For |
|---|---|---|---|
| Standard RI | Fixed instance type | ~60% | Predictable, same-type workloads |
| Convertible RI | Can change instance family | ~40% | When you may resize later |
| Scheduled RI | Specific time windows | ~30% | Daily batch jobs at fixed times (no longer offered for new purchases) |
Savings Plans are more flexible than RIs. You commit to a dollar amount per hour (e.g., $5/hr) rather than a specific instance type. AWS applies the discount automatically to any matching usage. Compute Savings Plans cover EC2, Lambda, and Fargate — ideal for Data Engineers who run varied workloads.
S3 has multiple storage tiers — the colder the tier, the cheaper the storage but the higher the retrieval cost. Lifecycle policies automatically move objects between tiers based on age, so you never pay Standard prices for year-old raw data.
| Storage Class | Cost (per GB/mo) | Retrieval | Best For |
|---|---|---|---|
| Standard | ~$0.023 | Instant, free | Active data (last 30 days) |
| Intelligent-Tiering | ~$0.023 + monitoring | Instant | Unknown access patterns |
| Standard-IA | ~$0.0125 | Instant, per-GB fee | 30–90 day old data, infrequent access |
| Glacier Instant Retrieval | ~$0.004 | Milliseconds | 90–180 day data, rare access |
| Glacier Flexible Retrieval | ~$0.0036 | Minutes–hours | Compliance archives |
| Glacier Deep Archive | ~$0.00099 | Hours | 7+ year regulatory data |
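To make the price gaps concrete, here is a small worked calculation using the approximate per-GB prices from the table. It is a sketch only: it covers storage charges alone and ignores retrieval fees, request fees, monitoring fees, and minimum storage durations.

```python
# Approximate storage prices per GB-month, copied from the table above
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "GLACIER": 0.0036,
    "DEEP_ARCHIVE": 0.00099,
}


def monthly_storage_cost(size_gb: float, storage_class: str) -> float:
    """Storage-only monthly cost in USD, rounded to cents."""
    return round(size_gb * PRICE_PER_GB_MONTH[storage_class], 2)


# Cost of keeping 10,000 GB (about 10 TB) in each class for one month
for storage_class in PRICE_PER_GB_MONTH:
    cost = monthly_storage_cost(10_000, storage_class)
    print(f"{storage_class:13s} ${cost:,.2f}/month")
```

The same 10 TB costs $230 per month in Standard and under $10 in Deep Archive, which is the motivation for the lifecycle rules that follow.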
The typical data lake lifecycle: keep active data in Standard, move raw/bronze data after 30 days to IA, archive after 90 days, expire after your retention window. Configure this once per bucket/prefix and AWS handles it automatically forever.
import boto3
s3 = boto3.client('s3')
# Apply tiered lifecycle to raw/ prefix (Bronze zone)
s3.put_bucket_lifecycle_configuration(
Bucket='my-data-lake',
LifecycleConfiguration={
'Rules': [
{
'ID': 'raw-zone-tiering',
'Status': 'Enabled',
'Filter': {'Prefix': 'raw/'}, # only Bronze/raw data
'Transitions': [
{
'Days': 30,
'StorageClass': 'STANDARD_IA' # after 30 days → IA
},
{
'Days': 90,
'StorageClass': 'GLACIER_IR' # after 90 days → Glacier IR
},
{
'Days': 365,
'StorageClass': 'DEEP_ARCHIVE' # after 1 year → cheapest tier
},
],
'Expiration': {
'Days': 2555 # delete after 7 years (compliance)
},
# Also clean up incomplete multipart uploads
'AbortIncompleteMultipartUpload': {
'DaysAfterInitiation': 7
}
},
{
'ID': 'gold-zone-no-archive',
'Status': 'Enabled',
'Filter': {'Prefix': 'gold/'}, # Gold stays hot for dashboards
'Transitions': [
{'Days': 180, 'StorageClass': 'STANDARD_IA'}
]
}
]
}
)
print("Lifecycle policy applied ✅")
When a large file upload fails mid-way, the partial chunks remain in S3 and you are billed for them. They are invisible in the console and can accumulate to gigabytes. The AbortIncompleteMultipartUpload lifecycle rule deletes them automatically after N days.
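Before relying on the lifecycle rule, it can be useful to see how many abandoned uploads a bucket already holds. The sketch below filters upload records by age. It assumes each record has the `Key`, `UploadId`, and `Initiated` fields found in the `Uploads` list of `list_multipart_uploads` responses; the function name and sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_uploads(uploads: list, now: datetime, max_age_days: int = 7) -> list:
    """Return (key, upload_id) pairs for multipart uploads older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return [(u["Key"], u["UploadId"]) for u in uploads if u["Initiated"] < cutoff]


# Made-up records in the same shape as the API response
now = datetime(2024, 1, 15, tzinfo=timezone.utc)
sample = [
    {"Key": "raw/big.parquet", "UploadId": "abc",
     "Initiated": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"Key": "raw/new.parquet", "UploadId": "def",
     "Initiated": datetime(2024, 1, 14, tzinfo=timezone.utc)},
]
print(stale_uploads(sample, now))

# Against a real bucket, the records would come from:
# for page in s3.get_paginator("list_multipart_uploads").paginate(Bucket="my-data-lake"):
#     stale = stale_uploads(page.get("Uploads", []), datetime.now(timezone.utc))
```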
The single biggest EMR cost mistake: leaving a cluster running after the Spark job finishes. An idle m5.2xlarge node costs ~$0.38/hr in EC2 charges alone, which is about $274 per idle node over a 720-hour month. Always set KeepJobFlowAliveWhenNoSteps=False for job-scoped clusters, configure an auto-termination idle timeout for long-running ones, or use a Lambda to terminate the cluster after the last step succeeds.
# In run_job_flow: set auto-terminate
response = emr.run_job_flow(
Name='daily-batch-job',
Instances={
'KeepJobFlowAliveWhenNoSteps': False, # terminate when steps finish
# ... other config
},
# For clusters that stay alive between steps, use an idle timeout instead:
# AutoTerminationPolicy={'IdleTimeout': 3600}, # seconds of idleness before shutdown
Steps=[
{
'Name': 'SparkJob',
'ActionOnFailure': 'TERMINATE_CLUSTER', # also terminate on failure
'HadoopJarStep': {
'Jar': 'command-runner.jar',
'Args': ['spark-submit', '--py-files', 's3://mybucket/jobs.zip',
's3://mybucket/main.py']
}
}
]
)
Over-provisioning is silent waste. Use the Spark UI to check executor utilisation after a job run — if executors are mostly idle, you have too many. Start with a smaller cluster, measure, then scale. EMR's managed scaling does this automatically.
# Managed scaling: EMR auto-resizes based on YARN pending containers
response = emr.run_job_flow(
Name='auto-scaled-cluster',
ManagedScalingPolicy={
'ComputeLimits': {
'UnitType': 'Instances',
'MinimumCapacityUnits': 2, # minimum 2 nodes
'MaximumCapacityUnits': 20, # scale up to 20 nodes
'MaximumOnDemandCapacityUnits': 4, # max On-Demand nodes
'MaximumCoreCapacityUnits': 2, # core nodes stay small
}
},
# rest of config...
)
With EMR Serverless, you pay only for the vCPU-seconds and GB-memory-seconds your Spark job actually uses. No idle cluster cost. No under/over-provisioning. For intermittent jobs (a few times per day), Serverless is almost always cheaper than a persistent cluster.
A Data Processing Unit (DPU) is the billing unit for Glue. One DPU = 4 vCPUs + 16 GB RAM. Glue charges $0.44 per DPU-hour. Glue 2.0 and later bill per second with a 1-minute minimum; older Glue versions have a 10-minute minimum. By default, Glue allocates 10 DPUs to every job, whether your job needs them or not. This is where most teams waste money.
For small jobs (under 1 GB of data), 2 DPUs is often enough. For medium jobs (1–50 GB), try 5 DPUs first. Use the Glue Job Metrics (CloudWatch) to see actual executor utilisation, then reduce DPUs accordingly. Also consider G.1X vs G.2X workers — G.1X gives 4 vCPUs / 16 GB and costs less per worker than G.2X (8 vCPUs / 32 GB).
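The effect of right-sizing is easy to estimate before touching the job. The sketch below assumes the $0.44 per DPU-hour price quoted above, 1 DPU per G.1X worker and 2 per G.2X worker, and per-second billing with a 1-minute minimum as on Glue 2.0 and later. The function name is hypothetical.

```python
DPU_HOUR_PRICE = 0.44                      # USD per DPU-hour, as quoted above
DPUS_PER_WORKER = {"G.1X": 1, "G.2X": 2}   # G.2X has twice the vCPU and memory


def glue_run_cost(worker_type: str, workers: int, minutes: float) -> float:
    """Estimate the cost of one Glue Spark job run in USD."""
    billed_minutes = max(minutes, 1.0)     # 1-minute minimum per run
    dpu_hours = DPUS_PER_WORKER[worker_type] * workers * billed_minutes / 60
    return round(dpu_hours * DPU_HOUR_PRICE, 2)


default_cost = glue_run_cost("G.1X", 10, 30)   # default-sized job, 30-minute run
tuned_cost = glue_run_cost("G.1X", 4, 30)      # same run on 4 workers
print(f"10 workers: ${default_cost:.2f} | 4 workers: ${tuned_cost:.2f}")
```

The comparison assumes the job takes the same 30 minutes on fewer workers, which only holds if the extra workers were idle. Measure the real duration after resizing.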
import boto3
glue = boto3.client('glue')
glue.create_job(
Name='optimised-etl-job',
Role='arn:aws:iam::123456789012:role/GlueRole',
Command={
'Name': 'glueetl',
'ScriptLocation': 's3://my-scripts/etl.py',
'PythonVersion': '3',
},
# Worker type: G.1X = 4 vCPU / 16 GB (cost-efficient for most jobs)
# G.2X = 8 vCPU / 32 GB (for memory-intensive transforms)
WorkerType='G.1X',
NumberOfWorkers=4, # was 10 by default — start small!
GlueVersion='4.0',
DefaultArguments={
'--enable-metrics': '', # publish to CloudWatch for monitoring
'--enable-job-insights': 'true', # Glue recommends right DPU count
'--job-bookmark-option': 'job-bookmark-enable',
},
# Timeout protects against runaway cost
Timeout=60, # max 60 minutes; kills job after this
)
# Start the job
glue.start_job_run(JobName='optimised-etl-job')
Set --enable-job-insights in your job arguments. After a run, Glue will show a recommendation like "You could use 3 workers instead of 4" in the CloudWatch Logs. This is free automated right-sizing advice.
Not everything needs a full Spark cluster. If you're running a small metadata update, a file rename, or an API call, use a Glue Python Shell job — it runs on a single small VM (0.0625 DPU) and costs a fraction of a Spark ETL job.
glue.create_job(
Name='lightweight-metadata-job',
Role='arn:aws:iam::123456789012:role/GlueRole',
Command={
'Name': 'pythonshell', # NOT 'glueetl'
'ScriptLocation': 's3://my-scripts/metadata_update.py',
'PythonVersion': '3',
},
MaxCapacity=0.0625, # 1/16 of a DPU — minimal cost
)
Athena charges $5 per TB of data scanned. That means a query that scans 10 TB costs $50 every time it runs. The two most powerful cost levers are: (1) partition pruning so Athena only scans relevant partitions, and (2) columnar format (Parquet/ORC) so it only reads relevant columns.
| Scenario | Data Scanned | Cost per Query |
|---|---|---|
| Full table scan on CSV, no partitions | 10 TB | $50.00 |
| Full table scan on Parquet (column pruning) | 2 TB | $10.00 |
| Parquet + partition filter (WHERE dt='2024-01-01') | 50 GB | $0.25 |
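The figures in the table follow directly from the $5 per TB price. This sketch reproduces them, assuming decimal units (1 TB = 1,000 GB) and Athena's 10 MB minimum charge per query. The function name is hypothetical.

```python
PRICE_PER_TB = 5.00                    # USD per TB scanned, as quoted above
BYTES_PER_TB = 10**12                  # assumption: decimal terabytes
MIN_BYTES_PER_QUERY = 10 * 10**6       # each query is billed for at least 10 MB


def athena_query_cost(bytes_scanned: int) -> float:
    """Estimate the cost in USD of a single Athena query."""
    billed = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billed / BYTES_PER_TB * PRICE_PER_TB


scenarios = {
    "CSV, full scan (10 TB)": 10 * 10**12,
    "Parquet, column pruning (2 TB)": 2 * 10**12,
    "Parquet + partition filter (50 GB)": 50 * 10**9,
}
for name, scanned in scenarios.items():
    print(f"{name}: ${athena_query_cost(scanned):.2f}")
```

Multiplying the per-query figure by how often a dashboard or scheduled report runs the query shows why the unpartitioned CSV case becomes expensive so quickly.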
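Always store data partitioned by date (or another column that most queries filter on and that has a modest number of distinct values) and always include the partition column in your WHERE clause. Without this filter, Athena scans every partition. With it, Athena reads only the matching partitions, which in a date-partitioned table is often a tiny fraction of the files.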
-- BAD: scans ALL partitions = $50/query
SELECT order_id, amount
FROM sales
WHERE customer_id = 12345; -- customer_id is NOT the partition key
-- GOOD: partition filter reduces scan to 1 day = $0.25/query
SELECT order_id, amount
FROM sales
WHERE dt = '2024-01-15' -- dt IS the partition key
AND customer_id = 12345;
Athena Workgroups let you set a per-query data scan limit and workgroup-wide data usage alerts per team. If a running query exceeds the per-query limit, Athena cancels it, and you are billed only for the data scanned up to that point. This prevents runaway analytical queries from your data consumers from causing surprise bills.
import boto3
athena = boto3.client('athena')
athena.create_work_group(
Name='analytics-team',
Configuration={
'ResultConfiguration': {
'OutputLocation': 's3://my-athena-results/analytics-team/'
},
'EnforceWorkGroupConfiguration': True,
'BytesScannedCutoffPerQuery': 10 * 1024**3, # 10 GB max per query
'PublishCloudWatchMetricsEnabled': True, # track usage in CloudWatch
'EngineVersion': {
'SelectedEngineVersion': 'Athena engine version 3'
}
},
Description='Analytics team — 10 GB scan limit per query'
)
# Run query scoped to this workgroup
response = athena.start_query_execution(
QueryString='SELECT * FROM sales WHERE dt = \'2024-01-15\'',
QueryExecutionContext={'Database': 'prod_db'},
WorkGroup='analytics-team', # 🔑 always specify workgroup
)
Athena can reuse query results for up to 7 days. If the same query is run again within the maximum age you set, Athena returns the stored result without scanning any data, so the repeat costs nothing. Enable it per query with ResultReuseConfiguration (this requires Athena engine version 3). It suits dashboard queries that run every 5 minutes on static data.
response = athena.start_query_execution(
QueryString='SELECT region, SUM(revenue) FROM sales GROUP BY 1',
QueryExecutionContext={'Database': 'prod_db'},
WorkGroup='analytics-team',
ResultReuseConfiguration={
'ResultReuseByAgeConfiguration': {
'Enabled': True,
'MaxAgeInMinutes': 60 # reuse cached result if < 60 mins old
}
}
)
Lambda charges on two dimensions: number of invocations ($0.20 per 1M requests) and GB-seconds of duration ($0.0000166667 per GB-second). GB-seconds = memory allocated (GB) × duration (seconds). Allocating more memory means higher cost per second but faster execution — you need to find the sweet spot.
AWS provides an open-source Lambda Power Tuning tool, deployed as a Step Functions state machine, that runs your function at multiple memory settings (128 MB to 10,240 MB) and plots cost vs speed. Use it to find the optimal memory for your specific function. CPU-bound data engineering Lambda functions often benefit from 512 MB–1024 MB, because Lambda allocates CPU in proportion to memory.
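The memory trade-off can be checked with simple arithmetic. The sketch below uses the two prices quoted above and ignores the free tier. The function name and the two duration figures are hypothetical, chosen only to show how a faster run at higher memory can cost less overall.

```python
REQUEST_PRICE = 0.20 / 1_000_000   # USD per invocation
GB_SECOND_PRICE = 0.0000166667     # USD per GB-second of duration


def lambda_monthly_cost(invocations: int, memory_mb: int, avg_duration_s: float) -> float:
    """Estimate monthly Lambda cost in USD (free tier ignored)."""
    gb_seconds = invocations * (memory_mb / 1024) * avg_duration_s
    return invocations * REQUEST_PRICE + gb_seconds * GB_SECOND_PRICE


# Hypothetical profile: 4x the memory makes the function run 5x faster
slow = lambda_monthly_cost(1_000_000, 128, 4.0)
fast = lambda_monthly_cost(1_000_000, 512, 0.8)
print(f"128 MB @ 4.0 s: ${slow:.2f}/month")
print(f"512 MB @ 0.8 s: ${fast:.2f}/month")
```

Whether more memory pays off depends entirely on how much the duration actually drops, which is what the power tuning run measures.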
import boto3
lam = boto3.client('lambda')
# Right-size Lambda memory after profiling
lam.update_function_configuration(
FunctionName='file-arrival-trigger',
MemorySize=512, # MB — tuned from default 128 MB
Timeout=300, # 5 minutes max (fail fast)
Environment={
'Variables': {
'GLUE_JOB_NAME': 'my-etl-job',
}
}
)
# Check current config
config = lam.get_function_configuration(FunctionName='file-arrival-trigger')
print(f"Memory: {config['MemorySize']}MB, Timeout: {config['Timeout']}s")
Redshift provisioned clusters charge by the hour, even when idle. For dev/staging clusters or analytics clusters only needed during business hours, pause them overnight and on weekends. A paused cluster retains all data and configuration; you stop paying for compute and pay only for storage. You can automate this with boto3 or Redshift's built-in scheduler.
import boto3
redshift = boto3.client('redshift')
def pause_cluster(cluster_id: str):
    """Call this at end of business hours (e.g., 7 PM via EventBridge)."""
    redshift.pause_cluster(ClusterIdentifier=cluster_id)
    print(f"Cluster {cluster_id} paused ✅")

def resume_cluster(cluster_id: str):
    """Call this at start of business hours (e.g., 7 AM via EventBridge)."""
    redshift.resume_cluster(ClusterIdentifier=cluster_id)
    print(f"Cluster {cluster_id} resuming... ⏳")

# Check cluster status
def get_cluster_status(cluster_id: str) -> str:
    resp = redshift.describe_clusters(ClusterIdentifier=cluster_id)
    # Returns: 'available', 'paused', 'pausing', 'resuming'
    return resp['Clusters'][0]['ClusterStatus']
# Usage
pause_cluster('my-analytics-cluster')
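How much a pause schedule saves is a matter of counting hours. A minimal sketch, assuming the cluster runs 12 hours a day (7 AM to 7 PM) on weekdays only; the function name is hypothetical.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def compute_savings_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of compute hours saved by pausing outside the running window."""
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


savings = compute_savings_fraction(12, 5)   # 60 running hours out of 168
print(f"Compute hours saved: {savings:.0%}")  # prints "Compute hours saved: 64%"
```

This covers compute only; storage for the paused cluster is still billed.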
For unpredictable or low-frequency query workloads, Redshift Serverless charges per RPU-second (Redshift Processing Unit). You pay nothing when idle. For dev environments or occasional reporting, Serverless is almost always cheaper than a provisioned cluster.
| # | Action | Service | Typical Saving |
|---|---|---|---|
| 1 | Use Spot for EMR task nodes | EMR | 60–80% compute |
| 2 | Auto-terminate clusters after job | EMR | Eliminate idle cost |
| 3 | S3 lifecycle: move raw data to IA after 30d | S3 | 45–95% storage |
| 4 | Always filter on partition column in Athena | Athena | 80–99% scan cost |
| 5 | Right-size Glue DPUs (start with 4, not 10) | Glue | 40–60% DPU cost |
| 6 | Pause Redshift dev clusters at night/weekends | Redshift | 60–70% compute |
| 7 | Use Compute Savings Plans for baseline EC2/Lambda | All | 30–40% overall |
| 8 | Enable Athena result caching for dashboards | Athena | Near-zero repeat queries |
| 9 | AbortIncompleteMultipartUpload lifecycle rule | S3 | Small but free |
| 10 | Use Python Shell for lightweight Glue work | Glue | 94% vs Spark job |
AWS Data Governance
Data Governance is about knowing who can access what data, where it came from, and whether it can be trusted. On AWS, the governance stack centers on Lake Formation (permissions), Glue Catalog (metadata), and patterns for lineage, PII detection, and cross-account sharing. A senior Data Engineer must be able to design and implement this stack — not just write pipelines.
AWS Lake Formation is a centralized permission layer that sits on top of S3 + Glue Catalog. Instead of writing complex bucket policies and IAM statements for every user and table, you grant permissions at the database, table, column, or row level through a single Lake Formation console or API call. Glue, Athena, Redshift Spectrum, and EMR all honor these permissions automatically.
Before Lake Formation can govern data, you must register the S3 location with Lake Formation. This tells Lake Formation "I am taking ownership of permissions for this S3 path — no longer controlled by raw bucket policies alone." After registration, access to this location requires a Lake Formation permission grant, not just an IAM policy.
import boto3
lf = boto3.client("lakeformation")
# Register the S3 data lake root with Lake Formation
# UseServiceLinkedRole=True lets LF access S3 through its service-linked role
# (pass RoleArn instead to use a custom role)
lf.register_resource(
    ResourceArn="arn:aws:s3:::company-data-lake",
    UseServiceLinkedRole=True
)
print("✅ S3 location registered with Lake Formation")
# List all registered S3 locations
resp = lf.list_resources()
for resource in resp["ResourceInfoList"]:
    print(resource["ResourceArn"], "→", resource.get("RoleArn", "service-linked"))
Register the root of the data lake (e.g. s3://company-data-lake). All sub-paths (bronze/, silver/, gold/) are automatically governed by Lake Formation once the root is registered.
After registering S3 and crawling data into the Glue Catalog, you grant Lake Formation permissions to principals such as IAM users and roles. The most common grants are SELECT (read), INSERT (write), ALTER (schema changes), and DROP (delete table).
import boto3
lf = boto3.client("lakeformation")
# Grant SELECT on a table to a data analyst IAM role
lf.grant_permissions(
Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/DataAnalystRole"},
Resource={
"Table": {
"CatalogId": "123456789012",
"DatabaseName": "gold_layer",
"Name": "sales_summary"
}
},
Permissions=["SELECT"],
PermissionsWithGrantOption=[] # analyst cannot re-grant to others
)
print("✅ SELECT granted on gold_layer.sales_summary")
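# Grant SELECT on all tables in a database to an ETL role
# (SELECT is a table-level permission, so target a table wildcard;
# database-level grants cover CREATE_TABLE, ALTER, DROP and DESCRIBE)
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/GlueETLRole"},
    Resource={
        "Table": {
            "CatalogId": "123456789012",
            "DatabaseName": "silver_layer",
            "TableWildcard": {}
        }
    },
    Permissions=["SELECT", "DESCRIBE"]
)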
# Revoke permissions
lf.revoke_permissions(
Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/OldRole"},
Resource={
"Table": {
"CatalogId": "123456789012",
"DatabaseName": "gold_layer",
"Name": "sales_summary"
}
},
Permissions=["SELECT"]
)
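Lake Formation can also grant at column level, which is how sensitive columns are hidden from analysts. The sketch below builds the keyword arguments for such a grant using the `TableWithColumns` resource with an exclusion list. The helper name and the excluded column names are hypothetical.

```python
def build_column_grant(principal_arn: str, catalog_id: str, database: str,
                       table: str, excluded_columns: list) -> dict:
    """Build kwargs for lakeformation.grant_permissions() that hide some columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "CatalogId": catalog_id,
                "DatabaseName": database,
                "Name": table,
                # Grant every column except the ones listed here
                "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


grant = build_column_grant(
    "arn:aws:iam::123456789012:role/DataAnalystRole",
    "123456789012", "gold_layer", "sales_summary",
    ["customer_email", "customer_phone"],   # hypothetical PII columns
)
# boto3.client("lakeformation").grant_permissions(**grant)
```

An exclusion list keeps working when new non-sensitive columns are added, whereas an explicit `ColumnNames` list would need updating each time the schema grows.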
The AWS Glue Data Catalog is a fully managed Hive-compatible metadata store. It stores the schema, partition info, table location, and data format for every table in your data lake. Athena, EMR Spark, Redshift Spectrum, Glue ETL jobs, and Lake Formation all share this single catalog — meaning a table created by Glue is instantly queryable by Athena without any extra registration.
Typical databases, one per layer: raw_layer, bronze_layer, silver_layer, gold_layer. Typical partition layout: year=2024/month=01/day=15.
Use the Glue client to programmatically inspect the catalog: list all tables, get their schemas, check partition counts. This is useful in governance audits, data discovery tools, and metadata-driven pipeline config generation.
import boto3
glue = boto3.client("glue")
# List all databases in the catalog
paginator = glue.get_paginator("get_databases")
for page in paginator.paginate():
    for db in page["DatabaseList"]:
        print(f"Database: {db['Name']}")
# Get all tables in a database with their schema
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="gold_layer"):
    for table in page["TableList"]:
        name = table["Name"]
        location = table["StorageDescriptor"]["Location"]
        fmt = table["StorageDescriptor"]["InputFormat"]
        columns = table["StorageDescriptor"]["Columns"]
        print(f"Table: {name} | S3: {location} | Format: {fmt}")
        for col in columns:
            print(f" {col['Name']:30s} {col['Type']}")
# Get partition count for a table
part_paginator = glue.get_paginator("get_partitions")
count = 0
for page in part_paginator.paginate(
        DatabaseName="gold_layer", TableName="sales_summary"):
    count += len(page["Partitions"])
print(f"Partition count: {count}")
Data lineage is the ability to answer: "Where did this data come from? What transformations happened to it? Where did it go?" Without lineage, when a column value looks wrong, you cannot trace back to the root cause. Lineage is also required for regulatory compliance (GDPR, HIPAA) — to prove you can find and delete every copy of a person's PII across the entire data platform.
Example: a dashboard shows total_revenue = $0 for January. With lineage, you trace: Gold sales_summary ← Silver orders_clean ← Bronze orders_raw ← RDS orders table. You find the Glue job that ran Bronze → Silver had a filter bug that dropped all January rows. Without lineage, you'd spend hours guessing.
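The trace in this example amounts to walking a graph from target back to source. A minimal sketch, assuming each dataset has a single upstream source and that lineage is available as a target-to-source mapping; the function name and dataset labels are illustrative.

```python
def trace_upstream(target: str, edges: dict) -> list:
    """Follow target -> source links until a dataset with no recorded source is reached."""
    chain = []
    seen = {target}
    current = target
    while current in edges:
        current = edges[current]
        if current in seen:      # guard against accidental cycles in the records
            break
        seen.add(current)
        chain.append(current)
    return chain


# target -> source, mirroring the Gold ← Silver ← Bronze ← RDS chain
edges = {
    "gold.sales_summary": "silver.orders_clean",
    "silver.orders_clean": "bronze.orders_raw",
    "bronze.orders_raw": "rds.orders",
}
print(" ← ".join(["gold.sales_summary"] + trace_upstream("gold.sales_summary", edges)))
```

Real lineage graphs usually have several sources per target (joins, unions), in which case the mapping would hold lists and the walk becomes a breadth-first search.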
The most practical way to implement lineage for a mid-size data platform is to write a lineage record to a DynamoDB table at the start and end of every pipeline job. Each record captures the source, target, transformation job, run time, and row counts. Tools like a simple internal portal or Athena queries can then visualize the lineage graph.
import uuid
from datetime import datetime, timezone
from decimal import Decimal
from typing import Optional
import boto3
from boto3.dynamodb.conditions import Key
dynamo = boto3.resource("dynamodb")
table = dynamo.Table("data_lineage")
def record_lineage(
    job_name: str,
    source: str,    # e.g. "rds.company_db.orders"
    target: str,    # e.g. "s3://company-lake/silver/orders_clean/"
    rows_in: int,
    rows_out: int,
    status: str,    # SUCCEEDED / FAILED
    run_id: Optional[str] = None,
    error_msg: Optional[str] = None,
) -> str:
    """Write one lineage record to DynamoDB and return its run_id."""
    run_id = run_id or str(uuid.uuid4())
    now = datetime.now(timezone.utc).isoformat()
    table.put_item(Item={
        "run_id": run_id,
        "job_name": job_name,
        "source": source,
        "target": target,
        "rows_in": Decimal(rows_in),
        "rows_out": Decimal(rows_out),
        "status": status,
        "run_time": now,
        "error_msg": error_msg or ""
    })
    return run_id
# Usage: call at start and end of every Glue/EMR/Lambda job
run_id = str(uuid.uuid4())
# At pipeline end, once the final status and row counts are known
record_lineage(
    job_name="orders-bronze-to-silver",
    source="s3://company-lake/raw/rds/orders/",
    target="s3://company-lake/delta/silver/orders_clean/",
    rows_in=142830,
    rows_out=142830,
    status="SUCCEEDED",
    run_id=run_id,
)
print(f"✅ Lineage recorded: {run_id}")
# Query lineage for a specific target table
resp = table.query(
    IndexName="target-index",  # GSI on the target attribute
    KeyConditionExpression=Key("target").eq(
        "s3://company-lake/delta/silver/orders_clean/"
    ),
)
for item in resp["Items"]:
    print(f"  {item['run_time']}  {item['job_name']}  rows_in={item['rows_in']}  status={item['status']}")
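The root-cause trace described earlier (Gold ← Silver ← Bronze ← RDS) can be sketched as a backwards walk over lineage records. This is a minimal illustration, assuming the records have already been read from the lineage table into a list of dicts with the same job_name, source, and target fields. The trace_upstream helper, the shortened dataset names, and the sample records are hypothetical.

```python
from collections import defaultdict

def trace_upstream(records, target):
    """Walk lineage records backwards from target; return (target, job, source) hops."""
    producers = defaultdict(list)
    for rec in records:
        producers[rec["target"]].append(rec)
    hops, seen, frontier = [], set(), [target]
    while frontier:
        current = frontier.pop()
        if current in seen:  # guard against cycles in the lineage graph
            continue
        seen.add(current)
        for rec in producers.get(current, []):
            hops.append((rec["target"], rec["job_name"], rec["source"]))
            frontier.append(rec["source"])
    return hops

# Hypothetical records, shaped like the items written by record_lineage()
records = [
    {"job_name": "orders-rds-to-bronze", "source": "rds.company_db.orders",
     "target": "bronze.orders_raw"},
    {"job_name": "orders-bronze-to-silver", "source": "bronze.orders_raw",
     "target": "silver.orders_clean"},
    {"job_name": "sales-silver-to-gold", "source": "silver.orders_clean",
     "target": "gold.sales_summary"},
]

hops = trace_upstream(records, "gold.sales_summary")
for tgt, job, src in hops:
    print(f"{tgt} <- [{job}] <- {src}")
```

Each hop names the job to inspect next, so a wrong value in Gold can be followed back one job at a time instead of guessing.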
OpenLineage is an open standard (backed by Astronomer, Databricks, and others) that defines a common JSON event format for emitting lineage from any pipeline tool — Spark, Airflow, dbt, Flink. Marquez is an open-source lineage server that collects OpenLineage events and provides a UI to visualize the full lineage graph. For large platforms, adopting OpenLineage is far better than building a custom DynamoDB lineage table.
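The OpenLineage Spark listener emits START and COMPLETE lineage events for every Spark job, with no manual code changes needed in your ETL scripts. The openlineage-spark JAR must be on the Spark classpath for the listener class to load.
# Add to spark-submit --conf or SparkSession config
from pyspark.sql import SparkSession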
spark = SparkSession.builder \
.appName("orders-silver-etl") \
.config("spark.extraListeners",
"io.openlineage.spark.agent.OpenLineageSparkListener") \
.config("spark.openlineage.transport.type", "http") \
.config("spark.openlineage.transport.url",
"http://marquez-server:5000") \
.config("spark.openlineage.namespace", "company-data-platform") \
.getOrCreate()
# Everything after this is automatically tracked by OpenLineage
# reads, writes, transformations — all captured as lineage events
df = spark.read.parquet("s3://company-lake/raw/orders/")
df_clean = df.filter(df.status != "CANCELLED")
df_clean.write.format("delta").save("s3://company-lake/silver/orders_clean/")
# → OpenLineage emits: raw/orders → [filter transform] → silver/orders_clean
PII (Personally Identifiable Information) includes names, email addresses, phone numbers, SSNs, IP addresses, and anything that can identify a person. Under GDPR, HIPAA, and PCI-DSS, you must know where all PII lives in your data lake, control who can see it, and be able to delete it on request (GDPR right to erasure). A Data Engineer must build PII detection into the ingestion pipeline — not as an afterthought.
AWS Glue's sensitive data detection feature (the Detect Sensitive Data transform in AWS Glue Studio) can automatically scan datasets for PII patterns: emails, credit card numbers, SSNs, phone numbers, and more. It ships with built-in detectors for common PII entity types, and you attach it to a Glue job to scan data as it flows through the pipeline. The snippet that follows shows the same idea as a hand-rolled regex scanner in plain PySpark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, when
from pyspark.sql.types import StringType
spark = SparkSession.builder.appName("pii-scanner").getOrCreate()
# Load raw data
df = spark.read.parquet("s3://company-lake/raw/customers/")
# PII regex patterns
EMAIL_REGEX = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
PHONE_REGEX = r'(\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
SSN_REGEX = r'\b\d{3}-\d{2}-\d{4}\b'
CC_REGEX = r'\b(?:\d{4}[-\s]?){3}\d{4}\b'
# Scan every string column for PII matches (one Spark job per column)
pii_report = {}
for field in df.schema.fields:
    # isinstance is safer than comparing str(field.dataType),
    # whose text form differs between PySpark versions
    if isinstance(field.dataType, StringType):
        col_name = field.name
        pii_count = df.select(
            count(when(col(col_name).rlike(EMAIL_REGEX), 1)).alias("email"),
            count(when(col(col_name).rlike(PHONE_REGEX), 1)).alias("phone"),
            count(when(col(col_name).rlike(SSN_REGEX), 1)).alias("ssn"),
            count(when(col(col_name).rlike(CC_REGEX), 1)).alias("cc"),
        ).collect()[0]
        detected = {k: v for k, v in pii_count.asDict().items() if v > 0}
        if detected:
            pii_report[col_name] = detected
            print(f"⚠️ PII detected in column '{col_name}': {detected}")
if not pii_report:
    print("✅ No PII detected in dataset")
# Output: ⚠️ PII detected in column 'email_address': {'email': 85200}
# Output: ⚠️ PII detected in column 'phone': {'phone': 85200}
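A 16-digit pattern such as CC_REGEX also matches order numbers and other long numeric IDs. One common way to cut those false positives is to run the Luhn checksum, which valid payment card numbers satisfy, on each regex hit. The sketch below is plain Python. The luhn_valid and find_card_numbers helper names are assumptions for illustration, and in Spark the check would typically be wrapped in a UDF.

```python
import re

CC_REGEX = re.compile(r'\b(?:\d{4}[-\s]?){3}\d{4}\b')

def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in candidate pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text: str):
    """Return regex hits that also pass the Luhn check."""
    return [m.group(0) for m in CC_REGEX.finditer(text) if luhn_valid(m.group(0))]

sample = "card 4111 1111 1111 1111, order ref 1234 5678 9012 3456"
print(find_card_numbers(sample))
```

Here 4111 1111 1111 1111 is a widely used test card number that passes the checksum, while the order reference matches the regex but fails it.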
A fast first-pass approach before regex scanning is to check column names for PII keywords. If a column is called email, phone_number, ssn, date_of_birth, or credit_card, it almost certainly contains PII. This heuristic lets you flag PII instantly during schema discovery without scanning any data.
import boto3
# PII keyword hints in column names
PII_KEYWORDS = {
"email", "phone", "mobile", "ssn", "social_security",
"credit_card", "card_number", "dob", "date_of_birth",
"ip_address", "passport", "national_id", "tax_id",
"first_name", "last_name", "full_name", "address", "postcode"
}
glue = boto3.client("glue")
def scan_catalog_for_pii(database: str):
    """Scan all tables in a Glue Catalog database for PII column names."""
    pii_findings = []
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        for table in page["TableList"]:
            tbl_name = table["Name"]
            # Some catalog entries (e.g. resource links) have no StorageDescriptor
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            for column in columns:
                col_lower = column["Name"].lower()
                if any(kw in col_lower for kw in PII_KEYWORDS):
                    pii_findings.append({
                        "database": database,
                        "table": tbl_name,
                        "column": column["Name"],
                        "type": column["Type"]
                    })
    return pii_findings
findings = scan_catalog_for_pii("bronze_layer")
for f in findings:
    print(f"⚠️ {f['database']}.{f['table']}.{f['column']} ({f['type']}) → likely PII")
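Substring matching on column names is fast but can over-flag: the keyword dob, for instance, is a substring of adobe. A stricter variant compares whole name tokens instead. The sketch below is one possible refinement, not part of the Glue API. The is_pii_column helper is hypothetical, and it assumes snake_case column names (camelCase names would need extra splitting).

```python
import re

PII_KEYWORDS = {
    "email", "phone", "mobile", "ssn", "social_security",
    "credit_card", "card_number", "dob", "date_of_birth",
    "ip_address", "passport", "national_id", "tax_id",
    "first_name", "last_name", "full_name", "address", "postcode",
}

def is_pii_column(column_name: str) -> bool:
    """Match keywords against whole tokens of the column name, not substrings."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", column_name.lower()) if t]
    for keyword in PII_KEYWORDS:
        kw_tokens = keyword.split("_")
        n = len(kw_tokens)
        # Look for the keyword's tokens as a contiguous run inside the name
        if any(tokens[i:i + n] == kw_tokens for i in range(len(tokens) - n + 1)):
            return True
    return False

for name in ["customer_email", "date_of_birth", "adobe_plan", "shipping_address_line1"]:
    print(f"{name}: {is_pii_column(name)}")
```

The trade-off is recall: token matching misses names like emailaddr, so it works best alongside the regex content scan.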
Once PII is detected, you must decide how to protect it. There are four primary strategies, each with different trade-offs between security, usability, and reversibility.
| Strategy | What It Does | Reversible? | Use Case |
|---|---|---|---|
| Nullification | Replace PII with NULL | No | When the value is never needed downstream |
| Static Masking | Replace with fixed placeholder (***-**-1234) | No | Display in BI dashboards — readable format, no real value |
| Tokenization | Replace with a token that maps to original value in a secure lookup table | Yes (with key) | Need to join on PII key, but not expose the raw value |
| Hashing (SHA-256) | Deterministic one-way hash of PII | No | Consistent anonymised join key (same email always same hash) |
Apply masking transformations in your Bronze → Silver ETL job. Raw PII stays in Bronze (with restricted access). Silver and above contain only masked values accessible to analysts.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sha2, concat_ws, lit, regexp_replace
spark = SparkSession.builder.appName("pii-masking").getOrCreate()
# Bronze layer: contains raw PII (restricted access in Lake Formation)
df_bronze = spark.read.format("delta").load("s3://company-lake/delta/bronze/customers/")
# Parentheses (not backslashes) so that inline comments are legal on each line
df_silver = (
    df_bronze
    .withColumn(
        "email_masked",
        # Static masking: keep the domain, mask the local part
        regexp_replace(col("email"), r'^[^@]+', "****")
        # Result: ****@gmail.com (domain visible, local part masked)
    )
    .withColumn(
        "phone_masked",
        # Mask every digit that still has at least four digits after it
        regexp_replace(col("phone"), r'\d(?=(?:\D*\d){4})', "*")
        # Result: 555-123-4567 becomes ***-***-4567
    )
    .withColumn(
        "customer_hash",
        # Deterministic hash for joining without exposing raw email.
        # In production, load the salt from a secret store instead of hardcoding it.
        sha2(concat_ws("|", col("email"), lit("SALT_2024")), 256)
    )
    .withColumn("ssn", lit(None).cast("string"))  # nullify SSN completely
    .withColumn("dob", lit(None).cast("date"))    # nullify DOB
    .drop("email", "phone")                       # drop raw PII columns from silver
)
df_silver.write \
    .format("delta") \
    .mode("overwrite") \
    .save("s3://company-lake/delta/silver/customers_masked/")
print("✅ PII masked — Silver layer written")
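The masking job covers nullification, static masking, and hashing. Tokenization, the one reversible strategy in the table, can be sketched as a lookup that hands out random tokens and keeps the mapping in a protected store. The in-memory TokenVault class below is a hypothetical illustration only. A real deployment would keep the mapping in an encrypted, access-controlled table rather than a Python dict.

```python
import secrets

class TokenVault:
    """Toy token vault: the same value always maps to the same random token."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value: str) -> str:
        if value not in self._token_by_value:
            # Random token: unlike a hash, it carries no information about the value
            token = "tok_" + secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def detokenize(self, token: str) -> str:
        # Only callers with access to the vault can recover the original
        return self._value_by_token[token]

vault = TokenVault()
t1 = vault.tokenize("alice@example.com")
t2 = vault.tokenize("alice@example.com")
print(t1 == t2)
print(vault.detokenize(t1))
```

Because the token is stable per value, downstream tables can still join on it, and reversal is possible only through the vault.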
Lake Formation supports column-level security: you can hide an entire column from certain roles without physically removing it from the table. When a principal without permission on a column queries the table via Athena or Redshift Spectrum, that column is left out of the results. This is column-level filtering, and Lake Formation enforces it at query time.
import boto3
lf = boto3.client("lakeformation")
# Grant SELECT only on specific non-PII columns to the analyst role
# Columns granted: customer_id, customer_hash, country, segment, created_at (no email, phone, ssn, dob)
lf.grant_permissions(
Principal={
"DataLakePrincipalIdentifier":
"arn:aws:iam::123456789012:role/DataAnalystRole"
},
Resource={
"TableWithColumns": {
"CatalogId": "123456789012",
"DatabaseName": "silver_layer",
"Name": "customers",
"ColumnNames": [
"customer_id",
"customer_hash",
"country",
"segment",
"created_at"
# email, phone, ssn, dob NOT in this list → hidden from analyst
]
}
},
Permissions=["SELECT"]
)
print("✅ Column-level grant set: analyst sees only safe columns")
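Listing every safe column by hand means a newly added non-PII column stays hidden until someone updates the grant. The grant_permissions API also accepts a ColumnWildcard with ExcludedColumnNames inside TableWithColumns, which grants everything except the named columns. The sketch below only builds the request dict so it can be inspected without AWS access. The build_exclusion_grant helper is hypothetical.

```python
def build_exclusion_grant(principal_arn, catalog_id, database, table, excluded_columns):
    """Build grant_permissions kwargs granting SELECT on all columns except the excluded ones."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "CatalogId": catalog_id,
                "DatabaseName": database,
                "Name": table,
                # Everything not listed here is visible to the principal
                "ColumnWildcard": {"ExcludedColumnNames": sorted(excluded_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }

grant = build_exclusion_grant(
    principal_arn="arn:aws:iam::123456789012:role/DataAnalystRole",
    catalog_id="123456789012",
    database="silver_layer",
    table="customers",
    excluded_columns={"email", "phone", "ssn", "dob"},
)
print(grant["Resource"]["TableWithColumns"]["ColumnWildcard"])
# To apply (not run here): boto3.client("lakeformation").grant_permissions(**grant)
```

The exclusion form fails open for new columns, so it suits tables where PII columns are known and stable. The explicit ColumnNames list fails closed and is the safer default for raw layers.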
LF-Tags (Lake Formation Tags) are key-value pairs you attach to databases, tables, and columns to classify data. Instead of writing individual permission grants per table, you write a tag-based grant once: "DataAnalystRole can SELECT all tables tagged with classification=public". As new tables are crawled, you simply tag them, and they automatically inherit the right permissions — no per-table grant needed.
import boto3
lf = boto3.client("lakeformation")
# Step 1: Create LF-Tag keys and allowed values
lf.create_lf_tag(TagKey="classification", TagValues=["public", "internal", "confidential", "pii"])
lf.create_lf_tag(TagKey="data_layer", TagValues=["bronze", "silver", "gold"])
lf.create_lf_tag(TagKey="domain", TagValues=["sales", "finance", "hr", "ops"])
# Step 2: Tag a table
lf.add_lf_tags_to_resource(
Resource={
"Table": {
"DatabaseName": "gold_layer",
"Name": "sales_summary"
}
},
LFTags=[
{"TagKey": "classification", "TagValues": ["internal"]},
{"TagKey": "data_layer", "TagValues": ["gold"]},
{"TagKey": "domain", "TagValues": ["sales"]},
]
)
print("✅ Table tagged: classification=internal, data_layer=gold, domain=sales")
# Step 3: Grant access by LF-Tag (attribute-based access control)
# "SalesAnalystRole can SELECT all tables where classification is internal or public AND domain=sales"
lf.grant_permissions(
Principal={
"DataLakePrincipalIdentifier":
"arn:aws:iam::123456789012:role/SalesAnalystRole"
},
Resource={
"LFTagPolicy": {
"ResourceType": "TABLE",
"Expression": [
{"TagKey": "classification", "TagValues": ["internal", "public"]},
{"TagKey": "domain", "TagValues": ["sales"]},
]
}
},
Permissions=["SELECT", "DESCRIBE"]
)
print("✅ Tag-based grant: SalesAnalystRole can read all internal/public sales tables")
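An LF-Tag expression combines its entries with AND across tag keys and OR across the values listed for each key. The pure-Python sketch below mimics that matching rule so you can reason about which tables a grant would cover. It is a simplified model for illustration (one value per key per resource), not Lake Formation's own evaluator, and the matches_lf_tag_expression helper is hypothetical.

```python
def matches_lf_tag_expression(resource_tags, expression):
    """True if every TagKey in the expression is present with one of its allowed values."""
    return all(
        resource_tags.get(cond["TagKey"]) in cond["TagValues"]
        for cond in expression
    )

expression = [
    {"TagKey": "classification", "TagValues": ["internal", "public"]},
    {"TagKey": "domain", "TagValues": ["sales"]},
]

tables = {
    "gold_layer.sales_summary": {"classification": "internal", "data_layer": "gold", "domain": "sales"},
    "gold_layer.payroll": {"classification": "confidential", "domain": "hr"},
    "gold_layer.untagged": {},
}

for name, tags in tables.items():
    print(f"{name}: {matches_lf_tag_expression(tags, expression)}")
```

An untagged table matches nothing, which is why tagging new tables is the step that makes them visible to tag-based grants.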
You can also tag individual columns with sensitivity labels, for example tagging all email columns with sensitivity=pii, and then grant column permissions based on those tags. Once a new PII column carries the tag, the existing tag-based grants restrict it, with no per-column permission updates.
import boto3
lf = boto3.client("lakeformation")
# The sensitivity tag key must exist before it can be attached to a resource
lf.create_lf_tag(TagKey="sensitivity", TagValues=["pii"])
# Tag the PII columns of customers_raw
lf.add_lf_tags_to_resource(
Resource={
"TableWithColumns": {
"DatabaseName": "bronze_layer",
"Name": "customers_raw",
"ColumnNames": ["email", "phone", "ssn", "date_of_birth"]
}
},
LFTags=[
{"TagKey": "sensitivity", "TagValues": ["pii"]}
]
)
print("✅ PII columns tagged — access now requires sensitivity=pii permission grant")
In large enterprises, the data lake (Account A) is separate from the analytics/BI account (Account B) for cost separation, security isolation, or organisational boundaries. Data engineers in Account A produce Gold layer tables; analysts in Account B need to query them via Athena. Lake Formation handles this elegantly via cross-account data sharing — no data copying required.
Cross-account sharing uses AWS Resource Access Manager (RAM) to share the Glue Catalog resource, and then Lake Formation to grant table-level permissions to the consumer account. The consumer account then creates a resource link in their own catalog pointing to the shared table.
import boto3
lf = boto3.client("lakeformation")
CONSUMER_ACCOUNT_ID = "123456789099" # Account B
# Step 1: Grant permission to the consumer ACCOUNT (not a role inside it)
# This makes the catalog entry visible to the consumer account
lf.grant_permissions(
Principal={
"DataLakePrincipalIdentifier": CONSUMER_ACCOUNT_ID
},
Resource={
"Table": {
"CatalogId": "111111111111", # producer account (Account A)
"DatabaseName": "gold_layer",
"Name": "sales_summary"
}
},
Permissions=["SELECT", "DESCRIBE"],
PermissionsWithGrantOption=["SELECT", "DESCRIBE"]
# PermissionsWithGrantOption allows Account B to further share with its own roles
)
print("✅ Cross-account share granted to Account B")
import boto3
# In Account B — create a resource link in the local Glue Catalog
# pointing to the shared table in Account A
glue_b = boto3.client("glue")
glue_b.create_table(
DatabaseName="shared_gold", # local database in Account B's catalog
TableInput={
"Name": "sales_summary_link",
"TargetTable": {
"CatalogId": "111111111111", # producer's account ID
"DatabaseName": "gold_layer",
"Name": "sales_summary"
}
}
)
print("✅ Resource link created — Athena in Account B can now query sales_summary")
# Athena query in Account B now works:
# SELECT * FROM shared_gold.sales_summary_link LIMIT 10;
# → reads from Account A's S3 directly, LF enforces permissions
- Lake Formation is the central permission enforcement layer — governs who can access which databases, tables, columns, and rows.
- Glue Catalog is the central metadata store — all AWS analytics services (Athena, Glue, EMR, Redshift Spectrum) share it.
- Data Lineage answers "where did this data come from?" — implement via DynamoDB records per pipeline run, or use OpenLineage + Marquez for an open-standard approach.
- PII Detection must happen at ingestion — use column name heuristics first, then regex scanning. Flag PII before data reaches the catalog.
- Masking strategies: nullify for unused PII, hash for join keys, static mask for display, tokenize for reversible lookup.
- LF-Tags enable attribute-based access control — tag tables and columns once, write grants once, and new tables automatically inherit correct permissions.
- Cross-account sharing via Lake Formation + RAM eliminates data copying — consumer accounts get permission-based access to producer account S3 data.
Terraform for Data Engineers (IaC)
Infrastructure as Code (IaC) means your S3 buckets, Glue jobs, EMR clusters, Lambda functions, and IAM roles are defined in code — version-controlled, repeatable, and promotable across environments. Terraform is the industry-standard IaC tool for AWS data platforms. A senior Data Engineer must be able to write and own Terraform for every service in their pipeline stack.
Without IaC, infrastructure is created by hand through the AWS console. This creates three serious problems: snowflake environments (prod and staging drift apart over time because changes are made manually), no audit trail (you don't know who created what or when), and slow recovery (if you need to rebuild the data platform in a new account or region, you have to re-click everything from memory). IaC solves all three.
| Without IaC (Console Clicking) | With Terraform (IaC) |
|---|---|
| Environments drift apart | Dev/staging/prod are identical by definition |
| No audit trail of changes | Every change is a git commit with author + reason |
| Rebuilding takes days/weeks | terraform apply rebuilds in minutes |
| New team member must learn console | New member reads .tf files to understand the whole stack |
| Hard to review changes before applying | terraform plan shows exact diff before any change |
Every Terraform configuration has four building blocks. A provider tells Terraform which cloud to talk to (AWS, GCP, Azure). A resource is an actual infrastructure object you want to create (an S3 bucket, a Glue job, an IAM role). A variable is an input parameter so the same code works across environments. An output exposes values from created resources for use by other modules or scripts.
# provider.tf — tell Terraform to use AWS
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
provider "aws" {
region = var.aws_region
}
# variables.tf — input parameters
variable "aws_region" { default = "us-east-1" }
variable "environment" { default = "dev" } # dev / staging / prod
variable "project_name" { default = "company-datalake" }
# main.tf — create an S3 bucket resource
resource "aws_s3_bucket" "data_lake" {
bucket = "${var.project_name}-${var.environment}"
# Result: "company-datalake-dev" or "company-datalake-prod"
tags = {
Environment = var.environment
Project = var.project_name
ManagedBy = "Terraform"
}
}
# outputs.tf — expose values for other modules
output "data_lake_bucket_name" {
value = aws_s3_bucket.data_lake.bucket
}
output "data_lake_bucket_arn" {
value = aws_s3_bucket.data_lake.arn
}
Terraform tracks everything it has created in a state file (terraform.tfstate). This file maps your .tf resource definitions to the real AWS resources. If you delete the state file, Terraform loses track of what exists. For teams, the state file must be stored remotely (in S3) so everyone shares the same state, and a DynamoDB table is used to prevent two people from running terraform apply simultaneously (state locking).
# backend.tf — store state remotely in S3, lock with DynamoDB
terraform {
backend "s3" {
bucket = "company-terraform-state" # dedicated state bucket
key = "data-platform/terraform.tfstate"
region = "us-east-1"
encrypt = true # SSE-S3 encryption
dynamodb_table = "terraform-state-lock" # prevents concurrent applies
}
}
# Create the lock table (run once manually or in a bootstrap Terraform)
resource "aws_dynamodb_table" "tf_lock" {
name = "terraform-state-lock"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
tags = { Purpose = "TerraformStateLock" }
}
Terraform workspaces allow a single set of .tf files to manage multiple environments. Each workspace has its own state file in S3, so terraform workspace select prod followed by terraform apply only affects production resources. Combine workspaces with one .tfvars file per environment (dev.tfvars, staging.tfvars, prod.tfvars), passed with -var-file, for environment-specific variable values.
# Create workspaces for each environment
terraform workspace new dev
terraform workspace new staging
terraform workspace new prod
# Switch to prod and apply
terraform workspace select prod
terraform plan -var-file=prod.tfvars
terraform apply -var-file=prod.tfvars
# List all workspaces
terraform workspace list
# default
# dev
# * prod ← current
# staging
# Resources automatically get env-specific names
resource "aws_s3_bucket" "data_lake" {
bucket = "company-datalake-${terraform.workspace}"
# dev → company-datalake-dev
# staging → company-datalake-staging
# prod → company-datalake-prod
}
A Terraform module is a reusable block of resources grouped together. Instead of copy-pasting Glue job definitions for every pipeline, you write a glue_job module once and call it with different parameters. Modules are the equivalent of functions in programming — they promote reuse and consistency across the data platform.
# modules/glue_job/main.tf — the reusable module
variable "job_name" {}
variable "script_location" {}
variable "role_arn" {}
variable "glue_version" { default = "4.0" }
variable "worker_type" { default = "G.1X" }
variable "num_workers" { default = 5 }
resource "aws_glue_job" "this" {
name = var.job_name
role_arn = var.role_arn
glue_version = var.glue_version
worker_type = var.worker_type
number_of_workers = var.num_workers
command {
script_location = var.script_location
python_version = "3"
}
default_arguments = {
"--job-language" = "python"
"--enable-glue-datacatalog" = "true"
"--enable-metrics" = "true"
"--enable-continuous-cloudwatch-log" = "true"
}
}
output "job_name" { value = aws_glue_job.this.name }
##################################################################
# main.tf — calling the module for multiple pipelines
module "orders_bronze" {
source = "./modules/glue_job"
job_name = "orders-bronze-etl-${terraform.workspace}"
script_location = "s3://company-scripts/glue/orders_bronze.py"
role_arn = aws_iam_role.glue_role.arn
num_workers = 10
}
module "customers_bronze" {
source = "./modules/glue_job"
job_name = "customers-bronze-etl-${terraform.workspace}"
script_location = "s3://company-scripts/glue/customers_bronze.py"
role_arn = aws_iam_role.glue_role.arn
num_workers = 5
}
The four commands are your daily Terraform workflow. init downloads provider plugins. plan shows what will change without touching anything. apply executes the changes. destroy tears everything down. In production CI/CD, a human reviews the plan output before apply is triggered.
# Step 1: Download providers and initialise backend
terraform init
# Step 2: Preview what will be created/changed/destroyed
terraform plan -var-file=dev.tfvars -out=tfplan
# Output shows: + create ~ update - destroy
# ALWAYS review this before applying in prod
# Step 3: Apply the plan
terraform apply tfplan
# or interactively: terraform apply -var-file=dev.tfvars
# Step 4: Tear down (only for dev/test environments)
terraform destroy -var-file=dev.tfvars
# Import an existing resource into Terraform state
# (for resources created manually before Terraform was adopted)
terraform import aws_s3_bucket.data_lake company-datalake-dev
# Show current state
terraform show
terraform state list
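A production data lake S3 bucket needs versioning (for accidental deletion recovery), lifecycle rules (to auto-transition old data to cheaper storage classes), encryption (SSE-KMS), and a bucket policy that denies non-HTTPS access. The Terraform definition here covers the bucket, its public access block, versioning, encryption, and lifecycle rules. It assumes aws_kms_key.data_lake and local.common_tags are defined elsewhere in the configuration, and the HTTPS-only policy would be attached through a separate aws_s3_bucket_policy resource.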
resource "aws_s3_bucket" "data_lake" {
bucket = "company-datalake-${terraform.workspace}"
tags = local.common_tags
}
# Block all public access
resource "aws_s3_bucket_public_access_block" "data_lake" {
bucket = aws_s3_bucket.data_lake.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
# Enable versioning
resource "aws_s3_bucket_versioning" "data_lake" {
bucket = aws_s3_bucket.data_lake.id
versioning_configuration { status = "Enabled" }
}
# SSE-KMS encryption
resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake" {
bucket = aws_s3_bucket.data_lake.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
kms_master_key_id = aws_kms_key.data_lake.arn
}
bucket_key_enabled = true # can reduce KMS request costs by up to 99%
}
}
# Lifecycle: move old data to cheaper storage
resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
bucket = aws_s3_bucket.data_lake.id
rule {
id = "bronze-lifecycle"
status = "Enabled"
filter { prefix = "bronze/" }
transition {
days = 90
storage_class = "STANDARD_IA" # after 90 days
}
transition {
days = 365
storage_class = "GLACIER" # after 1 year
}
expiration { days = 2555 } # delete after 7 years
}
rule {
id = "abort-incomplete-multipart"
status = "Enabled"
filter {}
abort_incomplete_multipart_upload { days_after_initiation = 7 }
}
}
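The bucket requirements above include a policy that denies non-HTTPS access. The standard form is a Deny statement conditioned on aws:SecureTransport being false. The sketch below builds that policy document in Python so it can be inspected or tested locally. The https_only_policy helper is hypothetical, and the same JSON could equally be supplied to an aws_s3_bucket_policy resource in Terraform.

```python
import json

def https_only_policy(bucket_name: str) -> str:
    """Return a bucket policy JSON string that denies any request not made over TLS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            # Both the bucket itself and every object in it
            "Resource": [bucket_arn, f"{bucket_arn}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }
    return json.dumps(policy, indent=2)

policy_json = https_only_policy("company-datalake-dev")
print(policy_json)
# To apply with Boto3 (not run here):
#   boto3.client("s3").put_bucket_policy(Bucket="company-datalake-dev", Policy=policy_json)
```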
Every AWS service (Glue, EMR, Lambda) needs an IAM execution role that grants it permission to access S3, CloudWatch, Secrets Manager, etc. Here are the three most common roles a Data Engineer provisions.
# ── Glue ETL Execution Role ──────────────────────────────
resource "aws_iam_role" "glue_role" {
name = "GlueETLRole-${terraform.workspace}"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = { Service = "glue.amazonaws.com" }
Action = "sts:AssumeRole"
}]
})
}
resource "aws_iam_role_policy_attachment" "glue_service" {
role = aws_iam_role.glue_role.name
policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}
resource "aws_iam_role_policy" "glue_s3" {
name = "GlueS3Access"
role = aws_iam_role.glue_role.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"]
Resource = [
aws_s3_bucket.data_lake.arn,
"${aws_s3_bucket.data_lake.arn}/*"
]
},
{
Effect = "Allow"
Action = ["secretsmanager:GetSecretValue"]
Resource = "arn:aws:secretsmanager:*:*:secret:company/*"
}
]
})
}
# ── Lambda Execution Role ─────────────────────────────────
resource "aws_iam_role" "lambda_role" {
name = "LambdaDataPipelineRole-${terraform.workspace}"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = { Service = "lambda.amazonaws.com" }
Action = "sts:AssumeRole"
}]
})
}
resource "aws_iam_role_policy_attachment" "lambda_basic" {
role = aws_iam_role.lambda_role.name
policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}
# ── EMR Service Role (the EC2 instance profile for cluster nodes is defined separately) ──
resource "aws_iam_role" "emr_service_role" {
name = "EMRServiceRole-${terraform.workspace}"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = { Service = "elasticmapreduce.amazonaws.com" }
Action = "sts:AssumeRole"
}]
})
}
resource "aws_iam_role_policy_attachment" "emr_service" {
role = aws_iam_role.emr_service_role.name
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole"
}
Provision the entire Glue stack — catalog database, crawler that discovers S3 data, and ETL jobs — as Terraform resources. This means adding a new pipeline is a git commit + terraform apply, not a console session.
# Glue Catalog Databases
resource "aws_glue_catalog_database" "bronze" {
name = "bronze_layer_${terraform.workspace}"
}
resource "aws_glue_catalog_database" "silver" {
name = "silver_layer_${terraform.workspace}"
}
resource "aws_glue_catalog_database" "gold" {
name = "gold_layer_${terraform.workspace}"
}
# Glue Crawler — discovers schema from S3 Bronze path
resource "aws_glue_crawler" "bronze_orders" {
name = "bronze-orders-crawler-${terraform.workspace}"
role = aws_iam_role.glue_role.arn
database_name = aws_glue_catalog_database.bronze.name
s3_target {
path = "s3://${aws_s3_bucket.data_lake.bucket}/bronze/rds/orders/"
}
schedule = "cron(0 6 * * ? *)" # daily at 06:00 UTC
table_prefix = "rds_"
schema_change_policy {
update_behavior = "UPDATE_IN_DATABASE"
delete_behavior = "LOG"
}
}
# Glue ETL Job
resource "aws_glue_job" "orders_bronze_etl" {
name = "orders-bronze-etl-${terraform.workspace}"
role_arn = aws_iam_role.glue_role.arn
glue_version = "4.0"
worker_type = "G.1X"
number_of_workers = 10
command {
script_location = "s3://${aws_s3_bucket.data_lake.bucket}/scripts/orders_bronze.py"
python_version = "3"
}
default_arguments = {
"--SOURCE_TABLE" = "orders"
"--TARGET_PATH" = "s3://${aws_s3_bucket.data_lake.bucket}/bronze/rds/orders/"
"--ENVIRONMENT" = terraform.workspace
"--enable-metrics" = "true"
"--job-bookmark-option" = "job-bookmark-enable"
}
execution_property { max_concurrent_runs = 1 }
}
Terraform can deploy a Lambda function from a ZIP file, attach its IAM role, and wire up its S3 trigger — all in one terraform apply. This is the standard pattern for deploying pipeline trigger Lambdas.
# Package the Lambda code into a ZIP
data "archive_file" "lambda_zip" {
type = "zip"
source_dir = "${path.module}/lambda_src/"
output_path = "${path.module}/lambda.zip"
}
# Deploy the Lambda function
resource "aws_lambda_function" "pipeline_trigger" {
function_name = "data-pipeline-trigger-${terraform.workspace}"
role = aws_iam_role.lambda_role.arn
handler = "handler.lambda_handler"
runtime = "python3.12"
filename = data.archive_file.lambda_zip.output_path
source_code_hash = data.archive_file.lambda_zip.output_base64sha256
timeout = 300 # 5 minutes
memory_size = 512
environment {
variables = {
ENVIRONMENT = terraform.workspace
GLUE_JOB = aws_glue_job.orders_bronze_etl.name
AUDIT_TABLE = aws_dynamodb_table.pipeline_audit.name
}
}
}
# Allow S3 to invoke the Lambda
resource "aws_lambda_permission" "s3_trigger" {
statement_id = "AllowS3Invoke"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.pipeline_trigger.function_name
principal = "s3.amazonaws.com"
source_arn = aws_s3_bucket.data_lake.arn
}
# Wire up S3 event notification → Lambda on new file arrival
resource "aws_s3_bucket_notification" "landing_trigger" {
bucket = aws_s3_bucket.data_lake.id
lambda_function {
lambda_function_arn = aws_lambda_function.pipeline_trigger.arn
events = ["s3:ObjectCreated:*"]
filter_prefix = "landing/"
filter_suffix = ".parquet"
}
depends_on = [aws_lambda_permission.s3_trigger]
}
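The Terraform above points the function at handler.lambda_handler. A minimal sketch of what that handler might look like follows, limited to the part that can be shown without AWS access: reading the bucket and key out of the S3 event. Object keys arrive URL-encoded in S3 event notifications, so they need decoding before use. The extract_s3_objects helper and the sample event are assumptions for illustration.

```python
from urllib.parse import unquote_plus

def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys are URL-encoded in the event (e.g. '=' arrives as %3D, space as '+')
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects

def lambda_handler(event, context):
    objects = extract_s3_objects(event)
    # A real handler would start the Glue job named in the GLUE_JOB
    # environment variable here, for example via glue.start_job_run(...)
    return {"received": [f"s3://{b}/{k}" for b, k in objects]}

# Hypothetical event, trimmed to the fields the handler reads
sample_event = {"Records": [{"s3": {
    "bucket": {"name": "company-datalake-dev"},
    "object": {"key": "landing/orders/dt%3D2024-01-15/part+0001.parquet"},
}}]}
result = lambda_handler(sample_event, None)
print(result)
```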
Every data platform needs a DynamoDB table for pipeline audit logs and watermark tracking. Here is the complete Terraform definition with a GSI on pipeline_name for efficient queries by pipeline.
resource "aws_dynamodb_table" "pipeline_audit" {
name = "pipeline-audit-${terraform.workspace}"
billing_mode = "PAY_PER_REQUEST"
hash_key = "run_id"
attribute {
name = "run_id"
type = "S"
}
attribute {
name = "pipeline_name"
type = "S"
}
attribute {
name = "run_time"
type = "S"
}
global_secondary_index {
name = "pipeline-name-index"
hash_key = "pipeline_name"
range_key = "run_time"
projection_type = "ALL"
}
ttl {
attribute_name = "expires_at"
enabled = true # items expire once the epoch timestamp the writer stores in expires_at has passed (e.g. now + 90 days)
}
point_in_time_recovery { enabled = true }
tags = local.common_tags
}
# Watermark table — tracks incremental processing state
resource "aws_dynamodb_table" "pipeline_watermarks" {
name = "pipeline-watermarks-${terraform.workspace}"
billing_mode = "PAY_PER_REQUEST"
hash_key = "pipeline_id"
attribute {
name = "pipeline_id"
type = "S"
}
tags = local.common_tags
}
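Enabling TTL on expires_at only tells DynamoDB which attribute to read. The retention period itself is decided by whoever writes the item, and the attribute must be a Number holding a Unix epoch time in seconds. A minimal sketch of building an audit item with a 90-day expiry, assuming the attribute names from the table definition; the build_audit_item helper is hypothetical.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 90

def ttl_epoch(now: datetime, retention_days: int = RETENTION_DAYS) -> int:
    """Epoch seconds at which DynamoDB may delete the item."""
    return int((now + timedelta(days=retention_days)).timestamp())

def build_audit_item(run_id, pipeline_name, status, now=None):
    """Build an item for the pipeline audit table, including its TTL attribute."""
    now = now or datetime.now(timezone.utc)
    return {
        "run_id": run_id,
        "pipeline_name": pipeline_name,
        "run_time": now.isoformat(),    # range key of the pipeline-name GSI
        "status": status,
        "expires_at": ttl_epoch(now),   # must be a number, not an ISO string
    }

fixed_now = datetime(2024, 1, 1, tzinfo=timezone.utc)
item = build_audit_item("run-001", "orders-bronze-etl", "SUCCEEDED", now=fixed_now)
print(item)
# To write (not run here): boto3.resource("dynamodb").Table(...).put_item(Item=item)
```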
EMR clusters and RDS instances run inside a VPC, and Glue jobs join it through a network connection when they need to reach them. VPC Endpoints are critical: they keep traffic between Glue/EMR and S3/Secrets Manager on the AWS private network, not the public internet. Here is the essential VPC configuration for a data platform.
resource "aws_vpc" "data_platform" {
cidr_block = "10.0.0.0/16"
enable_dns_hostnames = true
enable_dns_support = true
tags = { Name = "data-platform-${terraform.workspace}" }
}
resource "aws_subnet" "private_a" {
vpc_id = aws_vpc.data_platform.id
cidr_block = "10.0.1.0/24"
availability_zone = "us-east-1a"
tags = { Name = "private-a-${terraform.workspace}", Tier = "Private" }
}
resource "aws_subnet" "private_b" {
vpc_id = aws_vpc.data_platform.id
cidr_block = "10.0.2.0/24"
availability_zone = "us-east-1b"
tags = { Name = "private-b-${terraform.workspace}", Tier = "Private" }
}
# S3 Gateway Endpoint — free, keeps S3 traffic private
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.data_platform.id
service_name = "com.amazonaws.us-east-1.s3"
vpc_endpoint_type = "Gateway"
route_table_ids = [aws_route_table.private.id]
tags = { Name = "s3-endpoint" }
}
# Secrets Manager Interface Endpoint — keeps secrets retrieval private
resource "aws_vpc_endpoint" "secretsmanager" {
vpc_id = aws_vpc.data_platform.id
service_name = "com.amazonaws.us-east-1.secretsmanager"
vpc_endpoint_type = "Interface"
subnet_ids = [aws_subnet.private_a.id, aws_subnet.private_b.id]
security_group_ids = [aws_security_group.endpoints.id]
private_dns_enabled = true
tags = { Name = "secretsmanager-endpoint" }
}
# Glue security group — allow Glue jobs to reach RDS, MSK, S3 endpoint
resource "aws_security_group" "glue" {
name = "glue-sg-${terraform.workspace}"
vpc_id = aws_vpc.data_platform.id
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
# Glue requires self-referencing rule for Spark driver↔executor communication
ingress {
from_port = 0
to_port = 65535
protocol = "tcp"
self = true
}
}
Provision monitoring resources in Terraform so every environment gets the same alarms automatically, with no manual alarm creation in the console. Here is an alarm on failed Glue tasks wired to SNS.
# SNS topic for pipeline alerts
resource "aws_sns_topic" "pipeline_alerts" {
name = "pipeline-alerts-${terraform.workspace}"
}
resource "aws_sns_topic_subscription" "email" {
topic_arn = aws_sns_topic.pipeline_alerts.arn
protocol = "email"
endpoint = "data-team@company.com"
}
# CloudWatch alarm: Glue failed task count > 0
resource "aws_cloudwatch_metric_alarm" "glue_job_failure" {
  alarm_name          = "glue-job-failure-${terraform.workspace}"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "glue.driver.aggregate.numFailedTasks"
  namespace           = "Glue"
  period              = 300
  statistic           = "Sum"
  threshold           = 0
  # Glue publishes job metrics under these dimensions; without them the alarm matches no data
  dimensions = {
    JobName  = aws_glue_job.orders_bronze_etl.name
    JobRunId = "ALL"
    Type     = "count"
  }
  alarm_description  = "Glue job has failed tasks"
  alarm_actions      = [aws_sns_topic.pipeline_alerts.arn]
  ok_actions         = [aws_sns_topic.pipeline_alerts.arn]
  treat_missing_data = "notBreaching"
}
In production, nobody runs terraform apply locally against the prod account. Instead, all Terraform changes go through a pull request workflow: PR opened → GitHub Actions runs terraform plan → human reviews the plan → PR merged → GitHub Actions runs terraform apply. This prevents unreviewed changes from hitting production infrastructure.
name: Terraform Data Platform
on:
  pull_request:
    paths: ["terraform/**"]
  push:
    branches: [main]
    paths: ["terraform/**"]
env:
  TF_VERSION: "1.7.0"
  AWS_REGION: "us-east-1"
  # Named TARGET_WORKSPACE rather than TF_WORKSPACE: Terraform treats TF_WORKSPACE
  # as a workspace override, which can conflict with the explicit
  # `terraform workspace select` calls below.
  TARGET_WORKSPACE: "prod" # or use branch-based logic
jobs:
  terraform:
    runs-on: ubuntu-latest
    permissions:
      id-token: write # for OIDC auth to AWS, no static keys needed
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials (OIDC, no stored keys)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/GitHubActionsRole
          aws-region: ${{ env.AWS_REGION }}
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - name: Terraform Init
        working-directory: terraform/
        run: terraform init
      - name: Terraform Format Check
        working-directory: terraform/
        run: terraform fmt -check -recursive
      - name: Terraform Validate
        working-directory: terraform/
        run: terraform validate
      - name: Terraform Plan (on PR)
        if: github.event_name == 'pull_request'
        working-directory: terraform/
        run: |
          set -o pipefail # otherwise tee hides a failed plan and the step passes
          terraform workspace select ${{ env.TARGET_WORKSPACE }}
          terraform plan -var-file=prod.tfvars -out=tfplan -no-color 2>&1 | tee plan.txt
          echo "## Terraform Plan" >> "$GITHUB_STEP_SUMMARY"
          cat plan.txt >> "$GITHUB_STEP_SUMMARY"
      - name: Terraform Apply (on merge to main)
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        working-directory: terraform/
        run: |
          terraform workspace select ${{ env.TARGET_WORKSPACE }}
          terraform apply -var-file=prod.tfvars -auto-approve
Authenticate with OIDC (id-token: write permission + configure-aws-credentials with role-to-assume) instead of storing AWS access keys as GitHub secrets. OIDC issues short-lived credentials automatically for each workflow run, which is far more secure.
If your team created resources manually before adopting Terraform, you can bring them under Terraform management using terraform import. After importing, Terraform knows the resource exists and will include it in future plans and applies without recreating it.
# Import an existing S3 bucket
terraform import aws_s3_bucket.data_lake company-datalake-prod
# Import an existing Glue job
terraform import aws_glue_job.orders_bronze_etl orders-bronze-etl-prod
# Import an existing DynamoDB table
terraform import aws_dynamodb_table.pipeline_audit pipeline-audit-prod
# Import an existing Lambda function
terraform import aws_lambda_function.pipeline_trigger data-pipeline-trigger-prod
# Import an existing IAM role
terraform import aws_iam_role.glue_role GlueETLRole-prod
# After import, run plan to confirm zero drift
terraform plan
# Goal: "No changes. Your infrastructure matches the configuration."
- IaC with Terraform makes infrastructure repeatable, version-controlled, and reviewable — essential for maintaining identical dev/staging/prod environments.
- Remote state in S3 + DynamoDB lock table is mandatory for team use — never store state locally in a shared project.
- Workspaces let one set of .tf files manage multiple environments with workspace-aware resource naming.
- Modules are reusable infrastructure components: write a glue_job module once, call it for every pipeline.
- Every DE must be able to Terraform: S3 buckets (versioning, lifecycle, encryption), IAM roles (Glue, EMR, Lambda), Glue resources (database, crawler, job), Lambda (with S3 trigger), DynamoDB (audit + watermark tables), VPC + endpoints, CloudWatch alarms.
- CI/CD pattern: PR → terraform plan as PR comment → human review → merge → terraform apply. Use OIDC for AWS auth, with no stored keys.
- terraform import brings manually-created resources under Terraform management without recreating them.
Data Quality Engineering
Data Quality (DQ) is not an afterthought — it is a first-class engineering concern built into every layer of the pipeline. Bad data that silently flows into Gold tables destroys trust in the entire data platform. A senior Data Engineer builds DQ checks at ingestion, validates between layers, quarantines bad records, and publishes DQ metrics to dashboards so the team can catch problems before analysts do.
DQ checks must run at every layer transition — not just at the end. Catching a problem at the Bronze → Silver boundary is far cheaper than discovering it after Gold tables have been built on top of bad Silver data. The rule is: validate early, validate often, never let bad data silently pass through.
Great Expectations (GX) is one of the most widely adopted open-source Python DQ frameworks. You write Expectations: assertions about what your data should look like (column X should never be null, column Y should be between 0 and 100, etc.). GX evaluates those expectations against your actual data and produces a validation result with a pass/fail per expectation and summary statistics. It integrates with Pandas, Spark, and SQL databases.
Think of an expectation as a unit test for your data: expect_column_values_to_not_be_null("order_id") is the data equivalent of assert order_id is not None.
An Expectation Suite is a named collection of expectations for a specific dataset. You define one suite per table or dataset. Suites are stored as JSON files and versioned alongside your pipeline code.
import great_expectations as gx
from great_expectations.core import ExpectationConfiguration

# Written against the GX 0.17/0.18 API, the same generation as the
# context.sources / get_validator / run_checkpoint calls used in this section.
context = gx.get_context()  # loads GX config from gx/ directory

# Create (or replace) the expectation suite for the orders Bronze table
suite = context.add_or_update_expectation_suite("orders_bronze_suite")

def expect(expectation_type: str, **kwargs) -> None:
    """Append one expectation to the suite."""
    suite.add_expectation(
        ExpectationConfiguration(expectation_type=expectation_type, kwargs=kwargs)
    )

# ── Schema expectations ──────────────────────────────────
expect("expect_table_columns_to_match_ordered_list",
       column_list=["order_id", "customer_id", "amount", "status", "created_at"])
# ── Not-null expectations ────────────────────────────────
for column in ("order_id", "customer_id", "amount"):
    expect("expect_column_values_to_not_be_null", column=column)
# ── Uniqueness expectation ───────────────────────────────
expect("expect_column_values_to_be_unique", column="order_id")
# ── Value range expectation ──────────────────────────────
expect("expect_column_values_to_be_between",
       column="amount", min_value=0.01, max_value=1_000_000)
# ── Allowed values expectation ───────────────────────────
expect("expect_column_values_to_be_in_set",
       column="status",
       value_set=["PENDING", "CONFIRMED", "SHIPPED", "DELIVERED", "CANCELLED"])
# ── Row count expectation ────────────────────────────────
expect("expect_table_row_count_to_be_between",
       min_value=1000, max_value=10_000_000)

# Persist the suite to the configured expectations store
context.add_or_update_expectation_suite(expectation_suite=suite)
print("✅ Expectation suite saved: orders_bronze_suite")
A Validator connects an Expectation Suite to an actual dataset (Pandas DataFrame, Spark DataFrame, or SQL table) and runs the validations. The result tells you which expectations passed and which failed, with detailed statistics per column.
import great_expectations as gx
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("gx-validation").getOrCreate()
context = gx.get_context()
# Load the data to validate
df = spark.read.parquet("s3://company-lake/bronze/rds/orders/")
# Convert Spark DF to a GX Spark Datasource batch
datasource = context.sources.add_or_update_spark("spark_datasource")
data_asset = datasource.add_dataframe_asset("orders_asset")
batch_request = data_asset.build_batch_request(dataframe=df)
# Run validation
validator = context.get_validator(
batch_request=batch_request,
expectation_suite_name="orders_bronze_suite"
)
results = validator.validate()
# Check overall pass/fail
if results.success:
    print("✅ All DQ checks passed, proceeding to Silver transform")
else:
    failed = [r for r in results.results if not r.success]
    print(f"❌ {len(failed)} DQ checks FAILED, stopping pipeline")
    for r in failed:
        print(f"  FAILED: {r.expectation_config.expectation_type}"
              f" | {r.result}")
    raise ValueError("DQ validation failed: pipeline halted")
A Checkpoint bundles a batch request + expectation suite + actions into a single reusable object. Actions define what happens after validation — save data docs to S3, send a Slack alert on failure, update a DQ score in DynamoDB. Checkpoints are the right way to run GX in production pipelines and Airflow DAGs.
import great_expectations as gx
context = gx.get_context()
# batch_request is the Spark batch request built in the Validator example
# Create a checkpoint with post-validation actions
checkpoint = context.add_or_update_checkpoint(
    name="orders_bronze_checkpoint",
    validations=[{
        "batch_request": batch_request,
        "expectation_suite_name": "orders_bronze_suite",
    }],
    action_list=[
        # Persist the validation result to the validations store
        {
            "name": "store_validation_result",
            "action": {"class_name": "StoreValidationResultAction"},
        },
        # Rebuild the HTML Data Docs (published to S3) after every run
        {
            "name": "update_data_docs",
            "action": {"class_name": "UpdateDataDocsAction"},
        },
        # Slack notification on failure
        {
            "name": "slack_on_failure",
            "action": {
                "class_name": "SlackNotificationAction",
                # Placeholder URL; load the real webhook from a secret store
                "slack_webhook": "https://hooks.slack.com/services/XXX/YYY/ZZZ",
                "notify_on": "failure",
                # The Slack action needs a renderer to build the message body
                "renderer": {
                    "module_name": "great_expectations.render.renderer.slack_renderer",
                    "class_name": "SlackRenderer",
                },
            },
        },
    ],
)
# Run the checkpoint; returns a CheckpointResult
result = context.run_checkpoint("orders_bronze_checkpoint")
if not result.success:
    raise ValueError("DQ Checkpoint failed: pipeline halted")
Data Docs are auto-generated HTML reports that show every expectation, its result (pass/fail), and statistics. They are published to S3 after each checkpoint run and hosted as a static website, giving the data team a live dashboard of DQ health across all datasets without any extra tooling.
# great_expectations/great_expectations.yml
data_docs_sites:
  s3_site:
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: company-dq-reports
      prefix: data-docs/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
# After UpdateDataDocsAction runs, open:
# https://company-dq-reports.s3.amazonaws.com/data-docs/index.html
# → shows all suites, all checkpoint runs, pass/fail history
In production, GX checkpoints run as a PythonOperator task in the Airflow DAG — positioned between the extraction task (Bronze write) and the transformation task (Silver write). If the checkpoint fails, Airflow marks the task as failed and stops the downstream Silver/Gold tasks automatically.
from datetime import datetime

import boto3  # module-level import so every task can use it
import great_expectations as gx
from airflow.decorators import dag, task

# `schedule` is the Airflow 2.4+ name for schedule_interval.
# The fixed start_date is an illustrative value; a static date is safer than days_ago().
@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def orders_pipeline():
    @task
    def extract_to_bronze():
        # Glue job or Spark job writes raw data to Bronze S3
        glue = boto3.client("glue")
        run = glue.start_job_run(JobName="orders-bronze-etl")
        # poll until complete (omitted for brevity)
        return run["JobRunId"]

    @task
    def validate_bronze():
        """Run GX checkpoint; fails the task if DQ checks don't pass."""
        context = gx.get_context()
        result = context.run_checkpoint("orders_bronze_checkpoint")
        if not result.success:
            raise ValueError("Bronze DQ validation failed: Silver transform blocked")
        print("✅ Bronze DQ passed")

    @task
    def transform_to_silver():
        # Only runs if validate_bronze task succeeded
        glue = boto3.client("glue")
        glue.start_job_run(JobName="orders-silver-transform")

    # DAG structure: Bronze → DQ Gate → Silver
    bronze = extract_to_bronze()
    dq = validate_bronze()
    silver = transform_to_silver()
    bronze >> dq >> silver

orders_pipeline()
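The extract task above leaves the "poll until complete" step out. A minimal sketch of such a loop follows, written against an injected status function so it runs without AWS. The names wait_for_terminal_state and get_status and the two state sets are illustrative assumptions; in a real task, get_status would typically wrap glue.get_job_run(JobName=..., RunId=...)["JobRun"]["JobRunState"].

```python
import time
from typing import Callable

# Assumed terminal states for a Glue job run; adjust for other services.
SUCCESS_STATES = {"SUCCEEDED"}
FAILURE_STATES = {"FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def wait_for_terminal_state(
    get_status: Callable[[], str],
    poll_seconds: float = 15.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns a terminal state or the timeout passes."""
    waited = 0.0
    while True:
        status = get_status()
        if status in SUCCESS_STATES:
            return status
        if status in FAILURE_STATES:
            raise RuntimeError(f"Run ended in state {status}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still {status} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with a fake status source (no AWS call is made here)
fake_states = iter(["STARTING", "RUNNING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_terminal_state(lambda: next(fake_states), sleep=lambda s: None)
print(final_state)
```

Injecting both the status function and the sleep function keeps the loop unit-testable, and the explicit timeout keeps a stuck run from blocking an Airflow worker slot indefinitely.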
AWS Glue Data Quality is a native AWS service that lets you define DQ rules using a declarative language called DQDL (Data Quality Definition Language) and attach them directly to Glue ETL jobs or run them standalone against Glue Catalog tables. It produces a DQ score (returned by the API as a fraction from 0.0 to 1.0 and shown as a percentage in the console) and a per-rule pass/fail result, all stored in the Glue Catalog without any external framework to install.
DQDL rules cover four categories: Completeness (not-null rate), Uniqueness (no duplicates), Freshness (data is recent enough), and Accuracy (values in expected range or set). You write rules in a simple declarative syntax.
import boto3
glue = boto3.client("glue")
# DQDL ruleset: rules for the orders Bronze table.
# The rules live in a Python list so the explanations stay Python comments
# and the DQDL text sent to Glue contains nothing but rules.
RULES = [
    # Completeness: critical columns must be 100% populated
    'IsComplete "order_id"',
    'IsComplete "customer_id"',
    'IsComplete "amount"',
    # Uniqueness: order_id must be a primary key (no duplicates)
    'IsUnique "order_id"',
    # Accuracy: amount must be positive and realistic
    'ColumnValues "amount" between 0.01 and 1000000',
    # Accuracy: status must be one of the allowed values
    'ColumnValues "status" in ["PENDING", "CONFIRMED", "SHIPPED", "DELIVERED", "CANCELLED"]',
    # Completeness: allow up to 5% null in non-critical columns
    'Completeness "description" >= 0.95',
    # Freshness: some rows must have created_at within the last 2 days
    # (row-level date check; the threshold makes a single fresh row enough)
    'ColumnValues "created_at" > (now() - 2 days) with threshold > 0.0',
    # Volume: row count sanity check
    'RowCount >= 1000',
]
RULESET = "Rules = [\n    " + ",\n    ".join(RULES) + "\n]"
# Create the ruleset attached to the Glue Catalog table
glue.create_data_quality_ruleset(
Name="orders-bronze-dq-ruleset",
Description="DQ rules for orders Bronze table",
Ruleset=RULESET,
TargetTable={
"TableName": "rds_orders",
"DatabaseName": "bronze_layer"
}
)
print("✅ Glue DQ ruleset created")
You can run a DQ evaluation standalone or attach it to a Glue ETL job so it runs automatically after the job writes data. The key production pattern is: run the DQ evaluation → poll until complete → check the score → stop the pipeline if score is below threshold.
import boto3, time
glue = boto3.client("glue")
# Start the DQ evaluation run
resp = glue.start_data_quality_ruleset_evaluation_run(
DataSource={
"GlueTable": {
"DatabaseName": "bronze_layer",
"TableName": "rds_orders"
}
},
Role="arn:aws:iam::123456789012:role/GlueETLRole",
RulesetNames=["orders-bronze-dq-ruleset"]
)
run_id = resp["RunId"]
print(f"🔍 DQ evaluation started: {run_id}")
# Poll until the evaluation reaches a terminal status
TERMINAL_STATUSES = ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT")
while True:
    status_resp = glue.get_data_quality_ruleset_evaluation_run(RunId=run_id)
    status = status_resp["Status"]
    if status in TERMINAL_STATUSES:
        break
    print(f"  Status: {status}, waiting...")
    time.sleep(15)
if status != "SUCCEEDED":
    raise RuntimeError(f"DQ evaluation run failed with status: {status}")
# Inspect per-rule results
DQ_THRESHOLD = 0.95  # fail the pipeline if the score is below 95%
result_ids = status_resp.get("ResultIds", [])
if not result_ids:
    raise RuntimeError("DQ evaluation run returned no results")
for result_id in result_ids:
    result = glue.get_data_quality_result(ResultId=result_id)
    score = result["Score"]        # overall 0.0 – 1.0
    rules = result["RuleResults"]  # per-rule pass/fail
    print(f"DQ Score: {score:.2%}")
    # Treat ERROR like FAIL: the rule could not confirm the data is good
    failed_rules = [r for r in rules if r["Result"] != "PASS"]
    for r in failed_rules:
        print(f"  ❌ FAILED: {r['Name']}: {r.get('EvaluationMessage', '')}")
    if score < DQ_THRESHOLD:
        raise ValueError(
            f"DQ score {score:.2%} below threshold {DQ_THRESHOLD:.2%}: pipeline halted"
        )
print("✅ DQ passed, proceeding to Silver transform")
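The pass/fail decision itself is plain data handling, so it can be pulled into a small function and unit-tested without calling Glue. The sketch below assumes a dict carrying the same fields the code above reads (Score, RuleResults, Name, Result); the function name evaluate_dq_result and the sample values are hypothetical, not real service output.

```python
from typing import Any


def evaluate_dq_result(result: dict[str, Any], threshold: float = 0.95) -> dict[str, Any]:
    """Summarise one DQ result dict into a pass/fail decision."""
    score = float(result["Score"])
    # Anything other than PASS (FAIL or ERROR) counts against the dataset
    failed = [
        rule["Name"]
        for rule in result.get("RuleResults", [])
        if rule.get("Result") != "PASS"
    ]
    return {"score": score, "passed": score >= threshold, "failed_rules": failed}


# Hand-written sample shaped like the fields used above (illustrative only)
sample = {
    "Score": 0.75,
    "RuleResults": [
        {"Name": "Rule_1", "Result": "PASS"},
        {"Name": "Rule_2", "Result": "PASS"},
        {"Name": "Rule_3", "Result": "PASS"},
        {"Name": "Rule_4", "Result": "FAIL", "EvaluationMessage": "example message"},
    ],
}
summary = evaluate_dq_result(sample)
print(summary)
```

Keeping the threshold as a parameter lets each layer pick its own bar, for example a stricter one for Gold than for Bronze.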
Every DQ evaluation run should write its score and per-rule results to the pipeline audit DynamoDB table. This gives you a historical record of DQ health per dataset per day — essential for trend analysis and SLA reporting.
import boto3, uuid
from datetime import datetime, timezone
from decimal import Decimal
dynamo = boto3.resource("dynamodb")
table = dynamo.Table("pipeline-audit-prod")
def write_dq_audit(pipeline_name: str, table_name: str,
                   score: float, passed: bool,
                   failed_rules: list, run_id: str = None):
    """Write one DQ result row to the pipeline audit table."""
    table.put_item(Item={
        "run_id": run_id or str(uuid.uuid4()),
        "pipeline_name": pipeline_name,
        "table_name": table_name,
        "dq_score": Decimal(str(round(score, 4))),  # DynamoDB needs Decimal, not float
        "dq_passed": passed,
        "failed_rules": failed_rules,  # list of failed rule names
        "run_time": datetime.now(timezone.utc).isoformat(),
        "record_type": "DQ_RESULT"
    })
# Call after every DQ evaluation
write_dq_audit(
pipeline_name = "orders-bronze-etl",
table_name = "bronze_layer.rds_orders",
score = score,
passed = score >= 0.95,
failed_rules = [r["Name"] for r in failed_rules]
)
print("✅ DQ result written to audit table")
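The audit writer converts dq_score to Decimal because the boto3 DynamoDB resource does not accept Python floats. When the audit item grows nested numeric fields, converting each one by hand gets error-prone. Below is a small sketch of a recursive converter; to_dynamo_item and the sample item are hypothetical names and values for illustration.

```python
from decimal import Decimal
from typing import Any


def to_dynamo_item(value: Any) -> Any:
    """Recursively replace floats with Decimal so the item is DynamoDB-safe."""
    if isinstance(value, bool):       # bool is a subclass of int; leave it alone
        return value
    if isinstance(value, float):
        return Decimal(str(value))    # str() avoids binary float artefacts
    if isinstance(value, dict):
        return {key: to_dynamo_item(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_dynamo_item(item) for item in value]
    return value


item = to_dynamo_item({
    "run_id": "example-run",          # hypothetical values
    "dq_score": 0.9731,
    "dq_passed": True,
    "rule_scores": [{"rule": "IsComplete order_id", "pass_rate": 1.0}],
    "row_count": 1200,
})
print(item["dq_score"], type(item["dq_score"]).__name__)
```

The resulting dict can be passed as the Item argument of table.put_item.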
Before writing any data to Bronze, validate that the incoming schema matches the expected schema. New or missing columns in the source are common causes of silent pipeline corruption. Fail fast at schema validation so the problem is caught immediately.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType
# Define the expected schema for the orders table
EXPECTED_SCHEMA = StructType([
StructField("order_id", StringType(), nullable=False),
StructField("customer_id", StringType(), nullable=False),
StructField("amount", DoubleType(), nullable=False),
StructField("status", StringType(), nullable=False),
StructField("created_at", TimestampType(), nullable=True),
])
def validate_schema(df, expected_schema: StructType):
    """Fail fast if incoming schema doesn't match expected schema."""
    actual_cols = {f.name: f.dataType for f in df.schema.fields}
    expected_cols = {f.name: f.dataType for f in expected_schema.fields}
    missing = set(expected_cols) - set(actual_cols)
    extra = set(actual_cols) - set(expected_cols)
    type_err = {
        col: (expected_cols[col], actual_cols[col])
        for col in expected_cols
        if col in actual_cols and expected_cols[col] != actual_cols[col]
    }
    errors = []
    if missing:
        errors.append(f"Missing columns: {missing}")
    if extra:
        errors.append(f"Unexpected columns: {extra}")
    if type_err:
        errors.append(f"Type mismatches: {type_err}")
    if errors:
        raise ValueError("Schema validation FAILED:\n" + "\n".join(errors))
    print("✅ Schema validation passed")
# Call before writing to Bronze
df_source = spark.read.jdbc(url=JDBC_URL, table="orders", properties=JDBC_PROPS)
validate_schema(df_source, EXPECTED_SCHEMA)
df_source.write.parquet("s3://company-lake/bronze/rds/orders/")
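The comparison logic does not need a Spark session to be tested. A DataFrame's dtypes attribute is a list of (column name, type string) pairs, so the same diff can be sketched over plain tuples. The function name diff_schema, the EXPECTED list, and the sample incoming schema below are assumptions made for illustration.

```python
def diff_schema(
    actual: list[tuple[str, str]], expected: list[tuple[str, str]]
) -> list[str]:
    """Return human-readable schema differences; an empty list means a match."""
    actual_cols, expected_cols = dict(actual), dict(expected)
    missing = sorted(set(expected_cols) - set(actual_cols))
    extra = sorted(set(actual_cols) - set(expected_cols))
    mismatched = {
        name: (expected_cols[name], actual_cols[name])
        for name in expected_cols
        if name in actual_cols and expected_cols[name] != actual_cols[name]
    }
    errors = []
    if missing:
        errors.append(f"Missing columns: {missing}")
    if extra:
        errors.append(f"Unexpected columns: {extra}")
    if mismatched:
        errors.append(f"Type mismatches: {mismatched}")
    return errors


# Expected layout, written the way df.dtypes would report it
EXPECTED = [
    ("order_id", "string"), ("customer_id", "string"), ("amount", "double"),
    ("status", "string"), ("created_at", "timestamp"),
]
# Hypothetical drifted source: amount arrives as text, created_at is gone,
# and a new coupon_code column has appeared
incoming = [
    ("order_id", "string"), ("customer_id", "string"), ("amount", "string"),
    ("status", "string"), ("coupon_code", "string"),
]
problems = diff_schema(incoming, EXPECTED)
for line in problems:
    print(line)
```

In a pipeline this would be called as diff_schema(df.dtypes, EXPECTED), raising if the returned list is non-empty.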
These are the three most common DQ checks a Data Engineer implements directly in PySpark — before or after each layer write.
from pyspark.sql.functions import col

def check_null_rates(df, columns: list, max_null_pct: float = 0.05):
    """Fail if any column has more than max_null_pct nulls."""
    total = df.count()
    if total == 0:
        raise ValueError("NULL check FAILED: DataFrame is empty")
    for col_name in columns:
        null_count = df.filter(col(col_name).isNull()).count()
        null_pct = null_count / total
        if null_pct > max_null_pct:
            raise ValueError(
                f"NULL check FAILED: '{col_name}' has {null_pct:.1%} nulls "
                f"(max allowed: {max_null_pct:.1%})"
            )
        print(f"  ✅ {col_name}: {null_pct:.2%} nulls (OK)")

def check_value_range(df, col_name: str, min_val: float, max_val: float):
    """Fail if any non-null value is outside [min_val, max_val]."""
    out_of_range = df.filter(
        (col(col_name) < min_val) | (col(col_name) > max_val)
    ).count()
    if out_of_range > 0:
        raise ValueError(
            f"Range check FAILED: '{col_name}' has {out_of_range} "
            f"values outside [{min_val}, {max_val}]"
        )
    print(f"  ✅ {col_name}: all values in range [{min_val}, {max_val}]")

def check_referential_integrity(df_fact, df_dim, fact_key: str, dim_key: str):
    """Fail if fact rows have keys that don't exist in the dimension table."""
    orphans = df_fact.join(df_dim, df_fact[fact_key] == df_dim[dim_key], "left_anti")
    orphan_count = orphans.count()
    if orphan_count > 0:
        raise ValueError(
            f"Referential integrity FAILED: {orphan_count} rows in '{fact_key}' "
            f"have no matching '{dim_key}' in dimension"
        )
    print(f"  ✅ Referential integrity OK: all {fact_key} keys found in dimension")

# Usage (PySpark reads Delta through format("delta"); there is no read.delta())
df_orders = spark.read.format("delta").load("s3://company-lake/delta/bronze/orders/")
df_customers = spark.read.format("delta").load("s3://company-lake/delta/bronze/customers/")
check_null_rates(df_orders, ["order_id", "customer_id", "amount"], max_null_pct=0.0)
check_value_range(df_orders, "amount", min_val=0.01, max_val=1_000_000)
check_referential_integrity(df_orders, df_customers, "customer_id", "customer_id")
Instead of failing the entire pipeline on the first bad record, the quarantine pattern separates good and bad records. Good records proceed to Bronze/Silver as normal. Bad records are written to a quarantine/ S3 path with an added rejection_reason column for debugging. The pipeline still succeeds as long as the quarantine rate stays under a threshold (1% in the example below); above that the run fails, and a CloudWatch alarm on the quarantine count alerts the team.
from pyspark.sql.functions import col, lit, when, current_timestamp
df_raw = spark.read.parquet("s3://company-lake/landing/orders/")
# Tag each row with a rejection reason (null = valid record)
df_tagged = df_raw.withColumn(
"rejection_reason",
when(col("order_id").isNull(), lit("NULL_ORDER_ID"))
.when(col("amount") <= 0, lit("AMOUNT_NOT_POSITIVE"))
.when(col("amount") > 1_000_000, lit("AMOUNT_EXCEEDS_MAX"))
    # A NULL status makes isin() evaluate to NULL, so check for it explicitly
    .when(col("status").isNull() | ~col("status").isin(
        "PENDING", "CONFIRMED", "SHIPPED",
        "DELIVERED", "CANCELLED"), lit("INVALID_STATUS"))
.otherwise(lit(None))
).withColumn("quarantine_timestamp", current_timestamp())
# Split into valid and quarantine records
df_valid = df_tagged.filter(col("rejection_reason").isNull()) \
.drop("rejection_reason", "quarantine_timestamp")
df_quarantine = df_tagged.filter(col("rejection_reason").isNotNull())
valid_count = df_valid.count()
reject_count = df_quarantine.count()
print(f"Valid: {valid_count} | Quarantined: {reject_count}")
# Write valid records to Bronze
df_valid.write.format("delta").mode("append") \
.save("s3://company-lake/delta/bronze/orders/")
# Write bad records to quarantine path
if reject_count > 0:
    df_quarantine.write.format("delta").mode("append") \
        .partitionBy("rejection_reason") \
        .save("s3://company-lake/quarantine/orders/")
    print(f"⚠️ {reject_count} records quarantined")
# Fail pipeline if quarantine exceeds 1% of total
total = valid_count + reject_count
quarantine_rate = reject_count / total if total > 0 else 0.0  # empty batch: nothing to reject
if quarantine_rate > 0.01:
    raise ValueError(
        f"Quarantine rate {quarantine_rate:.2%} exceeds 1% threshold"
    )
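The tagging rules are easiest to reason about, and to unit-test, as a plain function over a single row. The sketch below mirrors the rule order of the Spark when-chain on Python dicts. It is an illustration under assumptions: the function names are hypothetical, and the NULL_AMOUNT reason is an addition that the Spark version does not have (there, a null amount makes both comparisons evaluate to NULL, so such a row is not caught by the amount rules).

```python
from typing import Optional

VALID_STATUSES = {"PENDING", "CONFIRMED", "SHIPPED", "DELIVERED", "CANCELLED"}
MAX_AMOUNT = 1_000_000


def rejection_reason(row: dict) -> Optional[str]:
    """Return the first failing rule for a row, or None if the row is valid."""
    if row.get("order_id") is None:
        return "NULL_ORDER_ID"
    amount = row.get("amount")
    if amount is None:
        return "NULL_AMOUNT"          # assumption: not part of the Spark chain
    if amount <= 0:
        return "AMOUNT_NOT_POSITIVE"
    if amount > MAX_AMOUNT:
        return "AMOUNT_EXCEEDS_MAX"
    if row.get("status") not in VALID_STATUSES:   # also catches a missing status
        return "INVALID_STATUS"
    return None


def quarantine_rate(valid_count: int, reject_count: int) -> float:
    """Share of rejected rows; an empty batch counts as 0.0 rather than dividing by zero."""
    total = valid_count + reject_count
    return reject_count / total if total else 0.0


rows = [
    {"order_id": "o-1", "amount": 25.0, "status": "SHIPPED"},
    {"order_id": None, "amount": 10.0, "status": "PENDING"},
    {"order_id": "o-3", "amount": -5.0, "status": "PENDING"},
    {"order_id": "o-4", "amount": 40.0, "status": None},
    {"order_id": "o-5", "amount": None, "status": "PENDING"},
]
reasons = [rejection_reason(row) for row in rows]
rejected = sum(reason is not None for reason in reasons)
rate = quarantine_rate(len(rows) - rejected, rejected)
print(reasons, rate)
```

Because only the first matching rule is recorded, the order of the checks decides which reason a row with several problems is filed under.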
Row count reconciliation is one of the simplest and most effective DQ checks: after writing data, compare the row count of the source against the row count written to the target. Any mismatch indicates dropped or duplicated records somewhere in the pipeline.
from delta.tables import DeltaTable

def reconcile_row_counts(
    spark, source_df, target_path: str,
    pipeline_name: str, tolerance_pct: float = 0.0,
    target_rows_before: int = 0
):
    """Compare source row count to the rows this run added to the target."""
    source_count = source_df.count()
    # Subtract rows that were already in the table, so an append-mode load
    # is compared against what this run wrote rather than the whole table
    target_count = spark.read.format("delta").load(target_path).count() - target_rows_before
    diff = abs(source_count - target_count)
    if source_count > 0:
        diff_pct = diff / source_count
    else:
        diff_pct = 0.0 if diff == 0 else 1.0  # empty source must mean empty write
    print(f"Reconciliation: source={source_count:,} target={target_count:,} diff={diff:,} ({diff_pct:.3%})")
    if diff_pct > tolerance_pct:
        raise ValueError(
            f"Row count mismatch in {pipeline_name}: "
            f"source={source_count} target={target_count} diff={diff_pct:.3%}"
        )
    print(f"  ✅ Row counts reconciled for {pipeline_name}")
    return {"source": source_count, "target": target_count, "diff_pct": diff_pct}

# Usage after Bronze write
target_path = "s3://company-lake/delta/bronze/orders/"
df_source = spark.read.jdbc(url=JDBC_URL, table="orders", properties=JDBC_PROPS)
# Row count before this run (0 if the table does not exist yet)
rows_before = (
    spark.read.format("delta").load(target_path).count()
    if DeltaTable.isDeltaTable(spark, target_path) else 0
)
df_source.write.format("delta").mode("append").save(target_path)
reconcile_row_counts(
    spark,
    source_df = df_source,
    target_path = target_path,
    pipeline_name = "orders-rds-to-bronze",
    tolerance_pct = 0.0,  # zero tolerance for Bronze: every row must land
    target_rows_before = rows_before
)
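To make the tolerance arithmetic concrete, here is the comparison on its own as a pure function, with a worked example. The function name within_tolerance and the counts are illustrative assumptions; unlike a plain ratio, this sketch treats an empty source as reconciled only when the target is empty too.

```python
def within_tolerance(source_count: int, target_count: int, tolerance_pct: float = 0.0) -> bool:
    """True if the relative row-count difference is at most tolerance_pct."""
    if source_count == 0:
        return target_count == 0
    return abs(source_count - target_count) / source_count <= tolerance_pct


# 1,000,000 source rows, 999,500 landed: 500 / 1,000,000 = 0.05% difference
strict = within_tolerance(1_000_000, 999_500, tolerance_pct=0.0)     # Bronze: must match exactly
relaxed = within_tolerance(1_000_000, 999_500, tolerance_pct=0.001)  # 0.1% allowed
print(strict, relaxed)
```

A non-zero tolerance is only appropriate where the transformation is expected to drop rows, for example a Silver step that filters out quarantined records.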
After writing to Bronze or Silver, verify that no duplicate primary keys were introduced. This catches bugs in MERGE logic, append-mode re-runs, and source systems that send the same records twice.
def check_no_duplicates(df, primary_key_cols: list):
    """Verify that the given columns form a unique key."""
    total = df.count()
    distinct = df.select(primary_key_cols).distinct().count()
    if total != distinct:
        duplicates = total - distinct
        raise ValueError(
            f"Deduplication check FAILED: {duplicates:,} duplicate keys detected "
            f"on columns {primary_key_cols} (total={total:,} distinct={distinct:,})"
        )
    print(f"  ✅ No duplicates on {primary_key_cols} ({total:,} rows, all unique)")

# Check Silver orders after MERGE
# (PySpark reads Delta through format("delta"); there is no read.delta())
df_silver = spark.read.format("delta").load("s3://company-lake/delta/silver/orders_clean/")
check_no_duplicates(df_silver, ["order_id"])
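The check above reports how many duplicates exist but not which keys are affected, which is usually the first question when debugging. The same idea on plain Python rows is sketched below; find_duplicate_keys and the sample rows are hypothetical. In Spark the equivalent query is a groupBy on the key columns followed by count() and a filter on count > 1.

```python
from collections import Counter


def find_duplicate_keys(rows: list[dict], key_cols: list[str]) -> dict[tuple, int]:
    """Map each duplicated key tuple to the number of rows that share it."""
    counts = Counter(tuple(row[col] for col in key_cols) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


rows = [
    {"order_id": "o-1", "amount": 10.0},
    {"order_id": "o-2", "amount": 20.0},
    {"order_id": "o-1", "amount": 10.0},   # same order delivered twice
    {"order_id": "o-3", "amount": 30.0},
]
duplicates = find_duplicate_keys(rows, ["order_id"])
print(duplicates)
```

Logging a small sample of the offending keys alongside the failure message makes it much quicker to tell a re-run append from a source that sends the same record twice.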
Every DQ check result should be published as a custom CloudWatch metric so you can build dashboards and alarms on DQ health — alerting the team when DQ score drops below threshold or quarantine rate spikes.
import boto3
cw = boto3.client("cloudwatch")
def publish_dq_metrics(pipeline_name: str, table_name: str,
                       dq_score: float, quarantine_count: int,
                       source_rows: int):
    """Publish DQ score, quarantine count, and source row count as custom metrics."""
    cw.put_metric_data(
        Namespace="DataPlatform/DQ",
        MetricData=[
            {
                "MetricName": "DQScore",
                "Dimensions": [
                    {"Name": "PipelineName", "Value": pipeline_name},
                    {"Name": "TableName", "Value": table_name}
                ],
                "Value": dq_score * 100,  # publish as 0–100 for readability
                "Unit": "Percent"
            },
            {
                "MetricName": "QuarantineCount",
                "Dimensions": [
                    {"Name": "PipelineName", "Value": pipeline_name},
                    {"Name": "TableName", "Value": table_name}
                ],
                "Value": quarantine_count,
                "Unit": "Count"
            },
            {
                "MetricName": "SourceRowCount",
                "Dimensions": [
                    {"Name": "PipelineName", "Value": pipeline_name}
                ],
                "Value": source_rows,
                "Unit": "Count"
            }
        ]
    )
    print("✅ DQ metrics published to CloudWatch")
# CloudWatch Alarm: alert when DQ score drops below 95%
cw.put_metric_alarm(
AlarmName = "dq-score-below-threshold-orders-bronze",
MetricName = "DQScore",
Namespace = "DataPlatform/DQ",
Dimensions = [
{"Name": "PipelineName", "Value": "orders-bronze-etl"},
{"Name": "TableName", "Value": "bronze_layer.rds_orders"}
],
ComparisonOperator = "LessThanThreshold",
Threshold = 95.0,
Period = 86400, # 24 hours
EvaluationPeriods = 1,
Statistic = "Minimum",
AlarmActions = ["arn:aws:sns:us-east-1:123456789012:pipeline-alerts"],
TreatMissingData = "breaching" # alarm if pipeline didn't run either
)
- DQ gates belong at every layer transition — Bronze ingest, Bronze→Silver, Silver→Gold. Never let bad data silently flow through.
- Great Expectations: write Expectation Suites per dataset, validate against Spark DataFrames with Validators, run via Checkpoints in Airflow DAGs, publish Data Docs to S3 for visibility.
- Glue Data Quality: AWS-native DQDL rulesets attached to Glue Catalog tables, covering Completeness, Uniqueness, Freshness, Accuracy. Poll get_data_quality_ruleset_evaluation_run() and fail pipeline if score below threshold.
- Schema validation: check expected columns and types before writing to Bronze; fail fast on schema drift.
- Null, range, referential integrity: implement directly in PySpark as DQ gate functions between pipeline stages.
- Quarantine pattern: split valid vs bad records; bad records go to quarantine/ S3 path with rejection_reason column; fail pipeline only if quarantine rate exceeds threshold.
- Row count reconciliation: source count must equal target count, with zero tolerance at Bronze boundary.
- Deduplication check: verify primary key uniqueness after every MERGE or append write.
- CloudWatch DQ metrics: publish DQScore and QuarantineCount per pipeline run; set alarms to alert when DQ degrades.
Streaming Pipelines
Production streaming pipelines on AWS combine Kafka / MSK as the event backbone, Spark Structured Streaming for stateful processing, and S3 / Delta Lake / Iceberg as the durable sink. This section covers every concept you need — from trigger modes and watermarking to exactly-once guarantees and foreachBatch custom sink patterns.
Spark Structured Streaming treats a Kafka topic as an unbounded table. Each new message is a new row appended to that table. You use spark.readStream with the Kafka source, specifying your MSK bootstrap servers, topic name, and starting offset position.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, schema_of_json
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType
spark = SparkSession.builder \
.appName("MSK-Streaming-Pipeline") \
.config("spark.sql.adaptive.enabled", "true") \
.getOrCreate()
# ── Define schema of your JSON Kafka message payload ─────────────
order_schema = StructType([
StructField("order_id", StringType(), True),
StructField("customer_id", StringType(), True),
StructField("amount", DoubleType(), True),
StructField("event_time", StringType(), True), # ISO-8601 string
StructField("status", StringType(), True)
])
# ── Connect to MSK and read the stream ───────────────────────────
# Parentheses instead of backslashes: a comment after a trailing "\" is a syntax error
raw_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "b-1.mycluster.kafka.us-east-1.amazonaws.com:9092,b-2.mycluster.kafka.us-east-1.amazonaws.com:9092")
    .option("subscribe", "orders-topic")        # single topic
    .option("startingOffsets", "latest")        # "earliest" to replay all
    .option("maxOffsetsPerTrigger", 100000)     # rate-limit per micro-batch
    .option("failOnDataLoss", "false")          # don't fail if offsets gap
    .load()
)
# ── Kafka gives you: key, value, topic, partition, offset, timestamp
# value is always binary: cast to string first, then parse JSON ──
parsed_stream = (
    raw_stream
    .select(
        col("value").cast("string").alias("json_str"),
        col("partition"),
        col("offset"),
        col("timestamp").alias("kafka_ts"),
    )
    .withColumn("data", from_json(col("json_str"), order_schema))
    .select("data.*", "kafka_ts")
)
# parsed_stream now has: order_id, customer_id, amount, event_time, status, kafka_ts
parsed_stream.printSchema()
Kafka delivers value as binary, so always apply .cast("string") before from_json(). MSK with IAM auth requires additional Kafka SASL options; see the MSK section (29.11) for auth config.
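The cast-then-parse step can be illustrated without Spark. The sketch below does the same thing to a single message in plain Python: decode the bytes, parse the JSON, and keep the expected fields. The function name parse_order and the sample payloads are assumptions for illustration; returning None for unparseable input is meant to mirror how from_json yields null rather than raising.

```python
import json
from typing import Optional

EXPECTED_FIELDS = ("order_id", "customer_id", "amount", "event_time", "status")


def parse_order(value: bytes) -> Optional[dict]:
    """Decode a Kafka message value and keep only the expected fields."""
    try:
        payload = json.loads(value.decode("utf-8"))   # bytes -> str -> JSON
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None                                   # unparseable message
    if not isinstance(payload, dict):
        return None
    # Missing fields become None, like nullable columns in the schema
    return {field: payload.get(field) for field in EXPECTED_FIELDS}


good = parse_order(
    b'{"order_id": "o-1", "customer_id": "c-9", "amount": 12.5, '
    b'"event_time": "2024-01-15T09:30:00", "status": "PENDING"}'
)
partial = parse_order(b'{"order_id": "o-2"}')
bad = parse_order(b"not json at all")
print(good, partial, bad, sep="\n")
```

Rows that come back empty are worth routing to a dead-letter or quarantine path rather than dropping silently, so malformed producers can be found.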
Writing to S3 as Parquet is the most common pattern on AWS. You partition by year/month/day derived from your event timestamp so downstream Athena and Glue queries can do efficient partition pruning.
from pyspark.sql.functions import to_timestamp, year, month, dayofmonth
# ── Add partition columns from event_time ────────────────────────
enriched = parsed_stream \
.withColumn("event_ts", to_timestamp(col("event_time"))) \
.withColumn("year", year(col("event_ts"))) \
.withColumn("month", month(col("event_ts"))) \
.withColumn("day", dayofmonth(col("event_ts")))
# ── Write to S3 with Parquet + partitioning ──────────────────────
query = enriched.writeStream \
.format("parquet") \
.option("path", "s3://my-data-lake/bronze/orders/") \
.option("checkpointLocation", "s3://my-checkpoints/orders/") \
.partitionBy("year", "month", "day") \
.outputMode("append") \
.trigger(processingTime="5 minutes") \
.start()
query.awaitTermination() # blocks — keep the driver alive
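With partitionBy("year", "month", "day"), each row lands in a directory derived from its event time, not from the time it arrived. The following sketch computes that relative path for one event in plain Python; partition_path is a hypothetical helper, and it assumes event_time is an ISO-8601 string without a timezone suffix.

```python
from datetime import datetime


def partition_path(event_time: str) -> str:
    """Build the Hive-style partition directory for one event."""
    ts = datetime.fromisoformat(event_time)
    # Integer partition columns are written without zero padding
    return f"year={ts.year}/month={ts.month}/day={ts.day}"


print(partition_path("2024-01-15T09:30:00"))  # year=2024/month=1/day=15
```

A consequence worth keeping in mind: an event that arrives a day late is still written under its original day, so already-processed partitions can receive new files.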
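Delta Lake is the preferred sink over raw Parquet for streaming because it provides ACID transactions, commits each micro-batch atomically so a partial failure leaves no half-written data visible, and enables downstream MERGE for CDC. Combined with the streaming checkpoint, Delta's transaction log gives Spark Structured Streaming exactly-once writes: a micro-batch that is retried after a failure is committed only once.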
# ── Delta Lake dependency must be on the cluster ─────────────────
# EMR: --packages io.delta:delta-core_2.12:2.4.0
# Databricks: built-in
query = enriched.writeStream \
.format("delta") \
.option("path", "s3://my-data-lake/bronze/orders_delta/") \
.option("checkpointLocation", "s3://my-checkpoints/orders_delta/") \
.outputMode("append") \
.trigger(processingTime="5 minutes") \
.start()
# ── Time travel — you can query any version after the fact ────────
# spark.read.format("delta").option("versionAsOf", 5).load("s3://...")
# spark.read.format("delta").option("timestampAsOf", "2024-01-15").load("s3://...")
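The exactly-once behaviour comes from making a micro-batch commit idempotent: the table log remembers which batch each streaming query committed last, so a batch replayed after a crash is recognized and skipped. The toy model below illustrates only that idea. It is not Delta's implementation, and ToyTransactionLog, the writer id, and the rows are invented for the example.

```python
class ToyTransactionLog:
    """Toy stand-in for a table log that remembers the last batch per writer."""

    def __init__(self) -> None:
        self.rows: list[str] = []
        self.last_batch: dict[str, int] = {}   # writer_id -> highest committed batch_id

    def commit(self, writer_id: str, batch_id: int, rows: list[str]) -> bool:
        """Append rows once per (writer_id, batch_id); return False on a replay."""
        if batch_id <= self.last_batch.get(writer_id, -1):
            return False                        # already committed: skip, no duplicates
        self.rows.extend(rows)
        self.last_batch[writer_id] = batch_id   # recorded together with the data
        return True


log = ToyTransactionLog()
log.commit("orders-query", 0, ["o-1", "o-2"])
log.commit("orders-query", 1, ["o-3"])
# The driver crashes before checkpointing batch 1, restarts, and replays it
replayed = log.commit("orders-query", 1, ["o-3"])
print(replayed, log.rows)
```

This is also why the checkpoint location must stay stable across restarts: the query identity stored there is what lets the sink recognize a replayed batch.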
Iceberg is the other major open table format. It's the preferred choice on AWS when using Athena, Glue, or EMR — it has native Athena support and the Glue catalog treats Iceberg tables as first-class objects without needing manifest files (unlike Delta).
# ── Spark config for Iceberg + Glue Catalog (set at SparkSession) ─
spark = SparkSession.builder \
.appName("Iceberg-Streaming") \
.config("spark.sql.extensions",
"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
.config("spark.sql.catalog.glue_catalog",
"org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.glue_catalog.catalog-impl",
"org.apache.iceberg.aws.glue.GlueCatalog") \
.config("spark.sql.catalog.glue_catalog.warehouse",
"s3://my-data-lake/") \
.getOrCreate()
# ── The target Iceberg table must exist first ─────────────────────
# CREATE TABLE glue_catalog.bronze.orders (
# order_id STRING, customer_id STRING, amount DOUBLE,
# event_time TIMESTAMP, status STRING)
# USING iceberg PARTITIONED BY (days(event_time));
query = enriched.writeStream \
.format("iceberg") \
.outputMode("append") \
.option("checkpointLocation", "s3://my-checkpoints/orders_iceberg/") \
.toTable("glue_catalog.bronze.orders") # write directly to catalog table
query.awaitTermination()
In distributed streaming systems, events don't always arrive in the order they were generated. A mobile app might buffer events and send them in bulk minutes later. A network hiccup might delay messages. Without watermarking, Spark would have to hold all state forever waiting for late events — this causes unbounded memory growth and kills your streaming job.
Watermarking tells Spark: "I'm willing to wait up to X time units for late data. After that, drop it and release the state." It's a tradeoff between data completeness (waiting longer) and state size (dropping sooner).
withWatermark(eventTimeCol, delayThreshold) tells Spark two things: (1) use eventTimeCol as the clock for measuring event time, and (2) track the maximum event time seen so far and subtract delayThreshold to get the current watermark. State older than the watermark gets evicted.
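The rule "maximum event time seen so far minus delayThreshold" can be checked with a few lines of plain Python. This is a toy model for intuition only, not Spark's implementation: the WatermarkTracker class and its method names are made up for this sketch, and real Spark advances the watermark once per micro-batch rather than per event.

```python
from datetime import datetime, timedelta


class WatermarkTracker:
    """Toy model of an event-time watermark (hypothetical helper, not a Spark API)."""

    def __init__(self, delay: timedelta):
        self.delay = delay
        self.max_event_time = None  # highest event time observed so far

    def observe(self, event_time: datetime) -> None:
        # Only a newer event time can move the watermark forward
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time

    @property
    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.delay

    def is_too_late(self, event_time: datetime) -> bool:
        wm = self.watermark
        return wm is not None and event_time < wm


tracker = WatermarkTracker(delay=timedelta(minutes=10))
tracker.observe(datetime(2024, 1, 15, 12, 0))
tracker.observe(datetime(2024, 1, 15, 12, 30))  # max event time jumps ahead
tracker.observe(datetime(2024, 1, 15, 12, 5))   # out of order: max is unchanged

current_watermark = tracker.watermark            # 12:30 minus 10 minutes
late = tracker.is_too_late(datetime(2024, 1, 15, 12, 15))
on_time = tracker.is_too_late(datetime(2024, 1, 15, 12, 25))
```

Here an event stamped 12:15 falls behind the 12:20 watermark and would be dropped from stateful aggregations, while one stamped 12:25 is still accepted even though it arrived out of order.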
from pyspark.sql.functions import window, sum, count, to_timestamp, col
# ── Cast event_time string → proper timestamp ─────────────────────
with_ts = parsed_stream \
.withColumn("event_ts", to_timestamp(col("event_time")))
# ── Apply watermark: tolerate up to 10 minutes of late data ───────
watermarked = with_ts.withWatermark("event_ts", "10 minutes")
# ── Tumbling window aggregation: revenue per 5-min window ─────────
windowed_agg = watermarked \
.groupBy(
window(col("event_ts"), "5 minutes"), # 5-minute tumbling window
col("status")
) \
.agg(
sum("amount").alias("total_revenue"),
count("*").alias("order_count")
)
# ── window column gives: window.start, window.end ─────────────────
result = windowed_agg.select(
col("window.start").alias("window_start"),
col("window.end").alias("window_end"),
col("status"),
col("total_revenue"),
col("order_count")
)
# ── With watermark + window, Append mode is allowed ───────────────
# Append: each window is written once, after the watermark passes its end.
# (A comment cannot follow a trailing backslash, so it lives up here.)
query = result.writeStream \
.format("delta") \
.option("path", "s3://my-data-lake/silver/order_windows/") \
.option("checkpointLocation", "s3://my-checkpoints/order_windows/") \
.outputMode("append") \
.trigger(processingTime="1 minute") \
.start()
query.awaitTermination()
withWatermark + window aggregation + outputMode("append"): Spark writes a window's result only after the watermark has passed that window's end time. Results are therefore delayed by roughly your watermark threshold, but state stays bounded. Without a watermark, an aggregation cannot use append mode: you must use outputMode("update") or outputMode("complete") (which rewrites ALL results every batch), and state grows without bound.
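To make the emission delay concrete, here is a small worked example in plain Python. It assumes the simplified rule that a window becomes final once the watermark reaches its end time; window_is_final is a hypothetical helper written for this sketch, not a Spark function.

```python
from datetime import datetime, timedelta


def window_is_final(window_end, max_event_time, delay):
    """True once the watermark (max event time - delay) has reached the window end."""
    return max_event_time - delay >= window_end


delay = timedelta(minutes=10)
window_end = datetime(2024, 1, 15, 12, 5)  # the window [12:00, 12:05)

# Newest event seen is 12:10 -> watermark is 12:00 -> window is still open
still_open = not window_is_final(window_end, datetime(2024, 1, 15, 12, 10), delay)

# Newest event seen is 12:15 -> watermark is 12:05 -> window can be emitted
emitted = window_is_final(window_end, datetime(2024, 1, 15, 12, 15), delay)
```

With a 10-minute threshold, the 12:00 to 12:05 window is written only after an event stamped 12:15 or later has been seen, which is the delay the paragraph above describes.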
A tumbling window divides time into fixed non-overlapping buckets (e.g. every 5 minutes). A sliding window overlaps — a 10-minute window sliding every 2 minutes means each event falls into 5 different windows simultaneously.
from pyspark.sql.functions import window
# ── Tumbling: 5-min non-overlapping buckets ───────────────────────
tumbling = watermarked.groupBy(
window(col("event_ts"), "5 minutes") # windowDuration only
).agg(sum("amount").alias("revenue"))
# ── Sliding: 10-min window, slides every 2 minutes ───────────────
sliding = watermarked.groupBy(
window(col("event_ts"), "10 minutes", "2 minutes") # windowDuration, slideDuration
).agg(sum("amount").alias("revenue"))
# ── Session: gap-based (events within 30 min of each other) ──────
# Available in Spark 3.2+ via session_window()
from pyspark.sql.functions import session_window
session = watermarked.groupBy(
session_window(col("event_ts"), "30 minutes"),
col("customer_id")
).agg(sum("amount").alias("session_revenue"))
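The claim that a 10-minute window sliding every 2 minutes places each event in 5 windows can be verified without Spark. The sketch below assumes windows are aligned to the Unix epoch (the default when no start offset is given); windows_for is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def windows_for(event_time, window, slide=None):
    """Return the [start, end) windows that contain event_time."""
    slide = slide or window                    # tumbling window: slide == window
    epoch = datetime(1970, 1, 1)
    offset = (event_time - epoch) % slide      # time since the latest slide boundary
    start = event_time - offset                # latest window start <= event_time
    result = []
    while start + window > event_time:         # walk back while the window still covers the event
        result.append((start, start + window))
        start -= slide
    return sorted(result)


event = datetime(2024, 1, 15, 12, 7)
tumbling_windows = windows_for(event, timedelta(minutes=5))
sliding_windows = windows_for(event, timedelta(minutes=10), timedelta(minutes=2))
```

For an event at 12:07, the tumbling case yields the single window 12:05 to 12:10, while the sliding case yields five windows starting at 11:58, 12:00, 12:02, 12:04, and 12:06. This is also why sliding windows multiply state size by windowDuration / slideDuration.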
ProcessingTime is the most common trigger. Spark waits for the specified interval, collects all data that has arrived since the last batch, and processes it. If processing takes longer than the interval, the next batch starts as soon as the previous one finishes; no data is skipped.
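# PySpark sets triggers via keyword arguments to DataStreamWriter.trigger();
# there is no Trigger class to import (Trigger.ProcessingTime is the Scala/Java API).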
# ── 1. ProcessingTime — run every 5 minutes ───────────────────────
query = df.writeStream \
.trigger(processingTime="5 minutes") \
.format("delta") \
.option("checkpointLocation", "s3://...") \
.start()
# ── 2. Once — process all available data, then stop ───────────────
# Older approach: processes everything in one large batch, then exits.
# Still works, but deprecated in favour of AvailableNow.
query = df.writeStream \
.trigger(once=True) \
.format("delta") \
.option("checkpointLocation", "s3://...") \
.start()
query.awaitTermination()
# ── 3. AvailableNow — modern replacement for Once ─────────────────
# Processes ALL available data in multiple micro-batches, then stops
# Much better for large backlogs: avoids one huge batch (Spark 3.3+)
query = df.writeStream \
.trigger(availableNow=True) \
.format("delta") \
.option("checkpointLocation", "s3://...") \
.start()
query.awaitTermination() # blocks until all data processed and job exits
# ── 4. Continuous: ~1 ms end-to-end latency (experimental) ────────
# Not micro-batch: uses epoch-based commits. Limited operations only.
# continuous="1 second" is the checkpoint (commit) interval, not the latency.
query = df.writeStream \
.trigger(continuous="1 second") \
.option("checkpointLocation", "s3://...") \
.format("kafka") \
.option("kafka.bootstrap.servers", "...") \
.option("topic", "output-topic") \
.start()
| Trigger | Use Case | Latency | Cost on AWS |
|---|---|---|---|
| processingTime="5 minutes" | Near-real-time dashboards | ~5 min | Medium: cluster stays running |
| availableNow=True | Scheduled batch-style streaming | Minutes (run on schedule) | Low: cluster auto-terminates |
| once=True | Legacy scheduled runs | Minutes | Low |
| continuous="1 second" | Low-latency Kafka-to-Kafka | ~1 ms | High: dedicated executors |
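The ProcessingTime behaviour described above (wait for the interval, but start immediately when a batch overruns) can be sketched as a small scheduling simulation. This is a simplified model under the assumption that triggers align to multiples of the interval; batch_start_times is a hypothetical helper and the durations are invented.

```python
def batch_start_times(interval_s, batch_durations_s):
    """Start time (in seconds) of each micro-batch under a fixed-interval trigger."""
    starts, now = [], 0
    for duration in batch_durations_s:
        starts.append(now)
        finished = now + duration
        # Next interval boundary after this batch's start
        next_boundary = (now // interval_s + 1) * interval_s
        # Wait for the boundary, unless the batch already ran past it
        now = max(finished, next_boundary)
    return starts


# 5-minute trigger; the second batch takes 400 s and overruns the interval
starts = batch_start_times(300, [60, 400, 100, 50])
```

The second batch starts at 300 s and finishes at 700 s, so the third batch starts at 700 s instead of waiting for a boundary, and the schedule then settles back onto the 900 s boundary.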
Output mode controls which rows are written to the sink per micro-batch. The choice depends on whether you have aggregations and whether you need historical results or only new ones.
| Mode | What Gets Written | Requires | Common Use |
|---|---|---|---|
| append | Only NEW rows added since last batch. Rows already written are never modified. | Watermark if aggregating | Raw event landing to S3/Delta Bronze |
| update | Only rows that CHANGED since last batch (new + updated aggregates). | Aggregation or stateful op | JDBC upserts via foreachBatch |
| complete | ALL result rows rewritten every batch — full result table. | Aggregation (no watermark needed) | Real-time leaderboards, small lookups |
# ── APPEND: raw events, no aggregation — write each event once ────
raw_events.writeStream \
.outputMode("append") \
.format("delta") \
.option("checkpointLocation", "s3://ckpt/raw/") \
.start()
# ── UPDATE: streaming aggregation, only changed rows written ──────
# Good for JDBC sinks that support upsert
running_totals = parsed_stream \
.groupBy("status") \
.agg(sum("amount").alias("total"), count("*").alias("cnt"))
running_totals.writeStream \
.outputMode("update") \
.foreachBatch(lambda df, batch_id: upsert_to_rds(df, batch_id)) \
.option("checkpointLocation", "s3://ckpt/totals/") \
.start()
# ── COMPLETE: rewrite entire result table every batch ─────────────
# Good for small aggregations like top-10 lists
leaderboard = parsed_stream \
.groupBy("customer_id") \
.agg(sum("amount").alias("total_spent")) \
.orderBy(col("total_spent").desc()) \
.limit(10)
# The "memory" sink keeps results in an in-memory table for dashboards and debugging
leaderboard.writeStream \
.outputMode("complete") \
.format("memory") \
.queryName("top_customers") \
.start()
# Query the in-memory table via SQL:
# spark.sql("SELECT * FROM top_customers").show()
outputMode("complete") with Delta rewrites the entire table every micro-batch. For large aggregations this causes massive write amplification. Use update + foreachBatch with MERGE INTO Delta instead for production aggregation pipelines.
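The write amplification is easy to see by counting emitted rows for a running count-by-key aggregation. The following is a plain-Python illustration with invented data; rows_written is a hypothetical helper and only models which rows each mode emits, not how Spark stores state.

```python
from collections import Counter


def rows_written(batches, mode):
    """Rows emitted per micro-batch for a running count-by-key aggregation."""
    totals, written = Counter(), []
    for batch in batches:
        changed = set(batch)          # keys whose aggregate changed in this batch
        totals.update(batch)
        if mode == "complete":
            written.append(len(totals))   # the whole result table, every batch
        elif mode == "update":
            written.append(len(changed))  # only the keys touched in this batch
        else:
            raise ValueError(f"unsupported mode: {mode}")
    return written


batches = [["a", "b", "c"], ["a"], ["d"], ["a", "d"]]
complete_rows = rows_written(batches, "complete")
update_rows = rows_written(batches, "update")
```

With only four distinct keys the difference is small (14 rows versus 7), but complete mode grows with the size of the result table while update mode grows with the size of each batch.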
A Spark Structured Streaming checkpoint is a directory (on AWS, typically an S3 prefix) that stores three things: the offset log (which Kafka offsets each batch planned to read), the commit log (which batches were successfully written), and the state store (aggregation/join state for stateful operations). Together these let a restarted query resume exactly where the failed run left off.
# ── Rule 1: Each query needs its OWN checkpoint path ─────────────
# Never share a checkpoint between two different queries!
query1 = stream_a.writeStream \
.option("checkpointLocation", "s3://my-ckpt/pipeline-orders/") \
.format("delta").start()
query2 = stream_b.writeStream \
.option("checkpointLocation", "s3://my-ckpt/pipeline-events/") \
.format("delta").start()
# ── Rule 2: Use S3 on a reliable prefix (not temp/scratch) ────────
# Good: s3://my-checkpoints/prod/orders/
# Bad: /tmp/ckpt/ (lost on restart)
# Bad: s3://my-data-lake/bronze/ (mixed with data)
# ── Rule 3: On restart, reuse the SAME checkpoint path ───────────
# Spark will read offset log, find last committed batch,
# and resume from the next un-processed offset automatically.
# ── Rule 4: To reset (replay from scratch), DELETE the checkpoint
# But this will cause duplicate data unless you also clear the sink!
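import boto3
# A checkpoint is many objects under a prefix, so delete_object on the prefix
# string alone does not remove it. Delete every object under the prefix instead:
# boto3.resource("s3").Bucket("my-ckpt").objects.filter(Prefix="pipeline-orders/").delete()  # careful!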
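How the offset log and commit log combine on restart (Rule 3) can be sketched in plain Python. This is a conceptual model only: resume_plan and the dictionary layout are assumptions made for illustration, not the on-disk checkpoint format.

```python
def resume_plan(offset_log, commit_log):
    """Decide what to run first after a restart.

    offset_log: {batch_id: (start_offset, end_offset)}, written BEFORE a batch runs
    commit_log: set of batch_ids, written AFTER a batch's sink write succeeds
    Returns (batch_id, offset_range_or_None, is_rerun).
    """
    if not offset_log:
        return 0, None, False                     # fresh checkpoint: start at batch 0
    last_planned = max(offset_log)
    if last_planned in commit_log:
        return last_planned + 1, None, False      # plan a new batch from the last end offset
    # Planned but never committed: redo the SAME offset range
    return last_planned, offset_log[last_planned], True


offset_log = {0: (0, 100), 1: (100, 250), 2: (250, 400)}
crashed_mid_batch = resume_plan(offset_log, commit_log={0, 1})
clean_shutdown = resume_plan(offset_log, commit_log={0, 1, 2})
```

Because an uncommitted batch is re-run with the identical offset range, the sink may see that batch twice, which is why sink idempotency matters for end-to-end exactly-once behaviour.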
foreachBatch gives you access to each micro-batch as a static DataFrame. This unlocks everything that native writeStream sinks can't do: writing to multiple destinations, JDBC upserts, REST API calls, custom Delta MERGE operations, and fan-out patterns.
from delta.tables import DeltaTable
def upsert_to_delta(batch_df, batch_id):
    """
    Called once per micro-batch.
    batch_df: static DataFrame of this batch's data
    batch_id: monotonically increasing integer (0, 1, 2, ...)
    """
    print(f"Processing batch {batch_id}, rows: {batch_df.count()}")
    # ── Deduplicate within this batch first ──────────────────────────
    # A batch might have duplicate order_ids if the producer sent retries
    deduped = batch_df.dropDuplicates(["order_id"])
    # ── MERGE INTO Delta: upsert by order_id ─────────────────────────
    target = DeltaTable.forPath(spark, "s3://my-data-lake/silver/orders/")
    target.alias("t").merge(
        deduped.alias("s"),
        condition="t.order_id = s.order_id"
    ).whenMatchedUpdateAll() \
     .whenNotMatchedInsertAll() \
     .execute()
# ── Wire foreachBatch into the streaming query ────────────────────
# (outputMode can be "update" or "append" with foreachBatch)
query = parsed_stream.writeStream \
.foreachBatch(upsert_to_delta) \
.option("checkpointLocation", "s3://my-ckpt/silver-orders/") \
.outputMode("update") \
.trigger(processingTime="5 minutes") \
.start()
query.awaitTermination()
One of the most powerful patterns: write each micro-batch to multiple sinks simultaneously. Cache the batch DataFrame first so Spark doesn't re-compute it for each write.
import boto3
from pyspark.sql.functions import col
dynamodb = boto3.resource("dynamodb")
sns = boto3.client("sns")
audit_table = dynamodb.Table("pipeline-audit")
def multi_sink_write(batch_df, batch_id):
    # ── Cache! Re-used across multiple actions ────────────────────────
    batch_df.cache()
    try:
        row_count = batch_df.count()
        # ── Sink 1: Write to Delta Lake Bronze ───────────────────────────
        batch_df.write \
            .format("delta") \
            .mode("append") \
            .option("mergeSchema", "true") \
            .save("s3://my-data-lake/bronze/orders/")
        # ── Sink 2: Write high-value orders to separate table ─────────────
        high_value = batch_df.filter(col("amount") > 1000)
        high_value.write \
            .format("delta") \
            .mode("append") \
            .save("s3://my-data-lake/bronze/high_value_orders/")
        # ── Sink 3: Write audit record to DynamoDB ────────────────────────
        audit_table.put_item(Item={
            "run_id": f"streaming-batch-{batch_id}",
            "batch_id": batch_id,
            "row_count": row_count,
            "status": "success"
        })
        # ── Sink 4: Alert on large batches via SNS ────────────────────────
        if row_count > 500000:
            sns.publish(
                TopicArn="arn:aws:sns:us-east-1:123456789012:pipeline-alerts",
                Message=f"Large batch detected: batch_id={batch_id}, rows={row_count}",
                Subject="Streaming Pipeline - Large Batch Alert"  # SNS subjects must be ASCII
            )
    finally:
        # ── Always unpersist, even if one of the sinks fails ──────────────
        batch_df.unpersist()
query = parsed_stream.writeStream \
.foreachBatch(multi_sink_write) \
.option("checkpointLocation", "s3://my-ckpt/multi-sink/") \
.trigger(processingTime="2 minutes") \
.start()
query.awaitTermination()
If the job restarts mid-batch, foreachBatch may be called again for the same batch_id. Make your sink logic idempotent using batch_id as a guard — check if that batch was already written before writing again.
from datetime import datetime, timezone
import boto3
dynamodb = boto3.resource("dynamodb")
def idempotent_write(batch_df, batch_id):
    # ── Check if this batch was already committed ─────────────────────
    control_table = dynamodb.Table("streaming-control")
    existing = control_table.get_item(
        Key={"pipeline_id": "orders-pipeline", "batch_id": str(batch_id)}
    )
    if "Item" in existing:
        print(f"Batch {batch_id} already committed, skipping (replay safety)")
        return
    # ── Write the data ────────────────────────────────────────────────
    batch_df.write.format("delta").mode("append") \
        .save("s3://my-data-lake/bronze/orders/")
    # ── Mark this batch as done in control table ──────────────────────
    control_table.put_item(Item={
        "pipeline_id": "orders-pipeline",
        "batch_id": str(batch_id),
        "committed_at": datetime.now(timezone.utc).isoformat()
    })
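One caveat of the guard pattern above is worth seeing explicitly: the data write and the "done" marker are two separate steps, so a crash between them still produces duplicates on retry. The in-memory sketch below demonstrates both the protected and the unprotected case; ControlledSink is a hypothetical stand-in for the Delta table plus control table, not a real API.

```python
class ControlledSink:
    """In-memory stand-in for a data sink plus a batch-id control table."""

    def __init__(self):
        self.rows = []
        self.committed = set()

    def write(self, rows, batch_id, crash_before_mark=False):
        if batch_id in self.committed:
            return "skipped"              # replay of a fully committed batch
        self.rows.extend(rows)            # step 1: write the data
        if crash_before_mark:
            raise RuntimeError("simulated crash after write, before marker")
        self.committed.add(batch_id)      # step 2: record the batch as done
        return "written"


sink = ControlledSink()
first = sink.write(["o1", "o2"], batch_id=0)
replayed = sink.write(["o1", "o2"], batch_id=0)     # safe: guard skips it

try:
    sink.write(["o3"], batch_id=1, crash_before_mark=True)
except RuntimeError:
    pass
retried = sink.write(["o3"], batch_id=1)            # guard cannot know o3 was written
```

After the retry, "o3" appears twice. Closing this gap requires the write itself to be idempotent (for example a MERGE keyed on order_id) or the marker to be committed atomically with the data.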
In Python pipelines, you often need to produce events to Kafka from a Spark job (e.g. after processing, publish enriched events downstream). Use the kafka-python library or write a Spark DataFrame back to Kafka via writeStream.
from pyspark.sql.functions import to_json, struct
# ── Kafka requires "value" (and optionally "key") columns ─────────
# Serialize the enriched DataFrame as JSON → Kafka value
kafka_ready = enriched.select(
col("order_id").alias("key"), # partition key — same order → same partition
to_json(struct("*")).alias("value") # full row as JSON string
)
# ── Write enriched events back to a downstream Kafka topic ────────
query = kafka_ready.writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "b-1.mycluster.kafka.us-east-1.amazonaws.com:9092") \
.option("topic", "orders-enriched") \
.option("checkpointLocation", "s3://my-ckpt/kafka-output/") \
.outputMode("append") \
.start()
query.awaitTermination()
For production pipelines on MSK, use Avro serialization with AWS Glue Schema Registry (a managed schema registry that is part of AWS Glue and integrates with MSK). Avro is more compact than JSON and enforces a schema on both the producer and consumer sides, which prevents schema drift from breaking downstream pipelines.
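# ── Dependencies ──────────────────────────────────────────────────
# from_avro() needs spark-avro: --packages org.apache.spark:spark-avro_2.12:<your Spark version>
# Glue Schema Registry serde:   --packages software.amazon.glue:schema-registry-serde:1.1.14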
from pyspark.sql.functions import col
# ── Read raw Avro bytes from MSK ──────────────────────────────────
raw = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "b-1.mycluster.kafka.us-east-1.amazonaws.com:9092") \
.option("subscribe", "orders-avro") \
.option("startingOffsets", "latest") \
.load()
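# ── Deserialize Avro value using from_avro() ──────────────────────
# Assumes the value is plain Avro-encoded bytes. Records written with a
# schema-registry serializer carry a registry header before the Avro body,
# which must be stripped (or decoded with the registry's deserializer) first.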
from pyspark.sql.avro.functions import from_avro
avro_schema = """
{
"type": "record",
"name": "Order",
"fields": [
{"name": "order_id", "type": "string"},
{"name": "customer_id", "type": "string"},
{"name": "amount", "type": "double"},
{"name": "event_time", "type": "string"},
{"name": "status", "type": "string"}
]
}
"""
parsed = raw.select(
from_avro(col("value"), avro_schema).alias("data")
).select("data.*")
parsed.printSchema()
# root
# |-- order_id: string
# |-- customer_id: string
# |-- amount: double
# |-- event_time: string
# |-- status: string
In production, some Kafka messages will be malformed — bad JSON, wrong schema, or null values where non-null is expected. Instead of crashing the streaming job, route these records to a dead letter topic (DLT) in Kafka for later inspection and replay.
from pyspark.sql.functions import from_json, col, lit, to_json, struct
def process_with_dlt(batch_df, batch_id):
    # ── Parse JSON: invalid records get null in "data" column ─────────
    parsed = batch_df.select(
        col("value").cast("string").alias("raw"),
        from_json(col("value").cast("string"), order_schema).alias("data")
    )
    # ── Split: good records vs bad (null data = parse failure) ────────
    good = parsed.filter(col("data.order_id").isNotNull()).select("data.*")
    bad = parsed.filter(col("data.order_id").isNull()) \
        .select(col("raw").alias("value"))  # send raw string to DLT
    # ── Write good records to Delta ───────────────────────────────────
    if good.count() > 0:
        good.write.format("delta").mode("append") \
            .save("s3://my-data-lake/bronze/orders/")
    # ── Route bad records to Kafka dead letter topic ──────────────────
    if bad.count() > 0:
        # "orders-topic-dlq" is the dead letter topic
        bad.write \
            .format("kafka") \
            .option("kafka.bootstrap.servers", "b-1.mycluster.kafka.us-east-1.amazonaws.com:9092") \
            .option("topic", "orders-topic-dlq") \
            .save()
query = raw_stream.writeStream \
.foreachBatch(process_with_dlt) \
.option("checkpointLocation", "s3://my-ckpt/orders-dlt/") \
.trigger(processingTime="1 minute") \
.start()
query.awaitTermination()
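The same good/bad split can be prototyped in plain Python before wiring it into Spark, which is handy for unit-testing the validation rules. In this sketch the REQUIRED field list and the split_records helper are assumptions chosen for illustration.

```python
import json

REQUIRED = ("order_id", "amount")   # assumed mandatory fields for an order


def split_records(raw_values):
    """Separate parseable, complete records from ones bound for the dead letter topic."""
    good, dead = [], []
    for raw in raw_values:
        try:
            record = json.loads(raw)
        except (TypeError, ValueError):       # None payloads and malformed JSON
            dead.append(raw)
            continue
        if isinstance(record, dict) and all(record.get(f) is not None for f in REQUIRED):
            good.append(record)
        else:
            dead.append(raw)                  # valid JSON, but wrong shape or null fields
    return good, dead


messages = [
    '{"order_id": "O-1", "amount": 10.5}',
    '{"order_id": null, "amount": 3}',
    "not json",
    None,
    "[1, 2]",
]
good, dead = split_records(messages)
```

Only the first message survives; the other four keep their original payload so they can be inspected and replayed later, mirroring what the Spark version sends to the dead letter topic.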
This is the canonical AWS streaming architecture. MSK receives events from producers (apps, microservices). EMR runs a persistent Spark Structured Streaming job that reads from MSK, applies watermarking, does light transformations, and lands data into a Delta Lake Bronze table on S3. Downstream batch jobs then refine Bronze → Silver → Gold.
import boto3, json
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
col, from_json, to_timestamp, year, month, dayofmonth,
current_timestamp, lit
)
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from delta.tables import DeltaTable
# ──────────────────────────────────────────────────────────────────
# 1. SparkSession with Delta support
# ──────────────────────────────────────────────────────────────────
spark = SparkSession.builder \
.appName("MSK-to-Delta-Bronze") \
.config("spark.sql.extensions",
"io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog",
"org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.config("spark.sql.adaptive.enabled", "true") \
.getOrCreate()
spark.sparkContext.setLogLevel("WARN")
# ──────────────────────────────────────────────────────────────────
# 2. Load config from Secrets Manager (never hardcode creds)
# ──────────────────────────────────────────────────────────────────
sm = boto3.client("secretsmanager", region_name="us-east-1")
secret = json.loads(
sm.get_secret_value(SecretId="prod/msk/streaming-pipeline")["SecretString"]
)
bootstrap_servers = secret["bootstrap_servers"]
topic = secret["topic"]
delta_path = secret["delta_path"]
checkpoint_path = secret["checkpoint_path"]
# ──────────────────────────────────────────────────────────────────
# 3. Schema
# ──────────────────────────────────────────────────────────────────
order_schema = StructType([
StructField("order_id", StringType(), True),
StructField("customer_id", StringType(), True),
StructField("amount", DoubleType(), True),
StructField("event_time", StringType(), True),
StructField("status", StringType(), True)
])
# ──────────────────────────────────────────────────────────────────
# 4. Read from MSK
# ──────────────────────────────────────────────────────────────────
raw = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", bootstrap_servers) \
.option("subscribe", topic) \
.option("startingOffsets", "latest") \
.option("maxOffsetsPerTrigger", 200000) \
.option("failOnDataLoss", "false") \
.load()
# ──────────────────────────────────────────────────────────────────
# 5. Parse + watermark + enrich
# ──────────────────────────────────────────────────────────────────
parsed = raw \
.select(from_json(col("value").cast("string"), order_schema).alias("d")) \
.select("d.*") \
.withColumn("event_ts", to_timestamp(col("event_time"))) \
.withColumn("ingested_at", current_timestamp()) \
.withColumn("year", year(col("event_ts"))) \
.withColumn("month", month(col("event_ts"))) \
.withColumn("day", dayofmonth(col("event_ts"))) \
.withWatermark("event_ts", "10 minutes") # tolerate 10-min late events
cw = boto3.client("cloudwatch", region_name="us-east-1")
# ──────────────────────────────────────────────────────────────────
# 6. foreachBatch: write to Delta + publish metrics
# ──────────────────────────────────────────────────────────────────
def write_batch(batch_df, batch_id):
    batch_df.cache()
    try:
        row_count = batch_df.count()
        if row_count > 0:
            batch_df.write \
                .format("delta") \
                .mode("append") \
                .partitionBy("year", "month", "day") \
                .option("mergeSchema", "true") \
                .save(delta_path)
        # Publish rows_processed metric to CloudWatch (including zero-row batches)
        cw.put_metric_data(
            Namespace="DataPipeline/Streaming",
            MetricData=[{
                "MetricName": "RowsProcessed",
                "Value": row_count,
                "Unit": "Count",
                "Dimensions": [{"Name": "Pipeline", "Value": "orders-bronze"}]
            }]
        )
    finally:
        batch_df.unpersist()  # release the cache even if the write fails
    print(f"Batch {batch_id}: wrote {row_count} rows")
# ──────────────────────────────────────────────────────────────────
# 7. Start the streaming query
# ──────────────────────────────────────────────────────────────────
query = parsed.writeStream \
.foreachBatch(write_batch) \
.option("checkpointLocation", checkpoint_path) \
.trigger(processingTime="5 minutes") \
.start()
query.awaitTermination()
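maxOffsetsPerTrigger caps how many Kafka offsets each micro-batch reads, which keeps batch size (and executor memory) bounded during traffic bursts or when catching up on a backlog.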
CDC (Change Data Capture) Pipelines
Change Data Capture lets you stream every INSERT, UPDATE, and DELETE from a source database into your data lake in near real-time — without expensive full table scans. This section covers CDC fundamentals, AWS DMS, Debezium on MSK, and landing CDC events into Delta Lake and Iceberg using MERGE INTO.
There are three ways to move data from a source database to your data lake. Each has a different cost, latency, and completeness profile.
| Strategy | How It Works | Captures Deletes? | Latency | Cost |
|---|---|---|---|---|
| Full Load | Read the entire table every run. Overwrite the target. | Yes (missing rows = deleted) | Hours | High — full scan each time |
| Incremental | Read only rows where updated_at > last_watermark. Append or upsert. | No — soft deletes only | Minutes | Medium |
| CDC | Capture every change event (I/U/D) from the database transaction log. Stream in real-time. | Yes — DELETE events captured | Seconds | Low — log-based, no table scan |
Log-based CDC reads the database's internal transaction log (WAL in PostgreSQL, binlog in MySQL, redo log in Oracle). It captures every change with minimal impact on the source: no extra queries and no table locks, although the source must retain log data until the CDC tool has consumed it. Query-based CDC polls the source table with WHERE updated_at > ?. It is simpler to set up, but it cannot capture hard deletes and it adds query load to the source.
Every CDC event carries an operation type and (depending on the tool) a before image (the row before the change) and an after image (the row after). Understanding this structure is essential for writing correct MERGE logic.
// INSERT event — op = "c" (create)
{
"op": "c",
"before": null,
"after": { "order_id": "O-001", "amount": 99.99, "status": "PLACED" },
"ts_ms": 1700000000000
}
// UPDATE event — op = "u" (update)
{
"op": "u",
"before": { "order_id": "O-001", "amount": 99.99, "status": "PLACED" },
"after": { "order_id": "O-001", "amount": 99.99, "status": "SHIPPED" },
"ts_ms": 1700000060000
}
// DELETE event — op = "d" (delete), tombstone follows
{
"op": "d",
"before": { "order_id": "O-001", "amount": 99.99, "status": "SHIPPED" },
"after": null,
"ts_ms": 1700000120000
}
// SNAPSHOT / READ event — op = "r" (initial full load row)
{
"op": "r",
"before": null,
"after": { "order_id": "O-999", "amount": 150.00, "status": "DELIVERED" },
"ts_ms": 1700000000000
}
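To see how these four operation types drive MERGE-style logic, the sketch below replays the example events above against a plain dictionary keyed by order_id. It is a conceptual illustration only; apply_event is a hypothetical helper, and a real pipeline would do this with MERGE INTO rather than in driver memory.

```python
def apply_event(table, event):
    """Apply one CDC event to an in-memory current-state table keyed by order_id."""
    op = event["op"]
    if op in ("c", "u", "r"):                 # create, update, snapshot read: upsert
        row = event["after"]
        table[row["order_id"]] = row
    elif op == "d":                           # delete: the key comes from the before image
        table.pop(event["before"]["order_id"], None)
    else:
        raise ValueError(f"unknown op: {op}")
    return table


events = [
    {"op": "c", "before": None,
     "after": {"order_id": "O-001", "amount": 99.99, "status": "PLACED"}},
    {"op": "u", "before": {"order_id": "O-001", "amount": 99.99, "status": "PLACED"},
     "after": {"order_id": "O-001", "amount": 99.99, "status": "SHIPPED"}},
    {"op": "r", "before": None,
     "after": {"order_id": "O-999", "amount": 150.00, "status": "DELIVERED"}},
]

table = {}
for e in events:
    apply_event(table, e)
status_after_update = table["O-001"]["status"]

# The delete event carries no after image, only the before image
apply_event(table, {"op": "d", "after": None,
                    "before": {"order_id": "O-001", "amount": 99.99, "status": "SHIPPED"}})
```

Note that the delete branch has to read the key from before, because after is null. Forgetting this is a common source of deletes being silently ignored.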
CDC pipelines face two failure scenarios: (1) the source emits a change event but the consumer crashes before committing — causing a replay (at-least-once). (2) The consumer commits the offset but fails before writing to the sink — causing data loss (at-most-once). True exactly-once requires checkpointed offsets (Kafka + Spark checkpoint) AND an idempotent sink (Delta MERGE or JDBC upsert keyed by primary key + change timestamp).
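The "idempotent sink" half of that statement can be illustrated with an upsert that is keyed by primary key and guarded by the change timestamp. The sketch uses an in-memory dictionary and handles upserts only; upsert_if_newer is a hypothetical helper name.

```python
def upsert_if_newer(table, key, row, ts_ms):
    """Apply a change only if it is newer than what the table already holds."""
    current = table.get(key)
    if current is not None and current["ts_ms"] >= ts_ms:
        return False                          # replayed or stale event: no-op
    table[key] = {"row": row, "ts_ms": ts_ms}
    return True


events = [
    ("O-001", {"status": "PLACED"}, 1700000000000),
    ("O-001", {"status": "SHIPPED"}, 1700000060000),
]

table = {}
first_pass = [upsert_if_newer(table, *e) for e in events]
snapshot = dict(table)

# At-least-once delivery: the consumer crashed and the same events arrive again
replay = [upsert_if_newer(table, *e) for e in events]
```

The replay changes nothing, so at-least-once delivery plus this kind of sink yields the same end state as exactly-once delivery. The same guard also protects against an older event overwriting a newer one when events arrive out of order.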
AWS DMS has three components: a Replication Instance (a managed EC2 instance that runs the migration), Source and Target Endpoints (connections to your database and your destination), and a Replication Task (the actual job that moves data). For CDC, you run a task in Full load + ongoing replication mode: DMS first takes a snapshot of the table, then continuously streams changes from the transaction log.
When DMS writes to S3, it creates Parquet (or CSV) files with a special Op column: I for Insert, U for Update, D for Delete. Your downstream Spark job reads these files and applies the changes to your Delta table using MERGE.
import boto3, json
dms = boto3.client("dms", region_name="us-east-1")
# ── Step 1: Create source endpoint (PostgreSQL) ───────────────────
src = dms.create_endpoint(
EndpointIdentifier="postgres-source",
EndpointType="source",
EngineName="postgres",
ServerName="mydb.cluster-xyz.us-east-1.rds.amazonaws.com",
Port=5432,
DatabaseName="orders_db",
Username="dms_user",
Password="{{resolved from Secrets Manager}}",
PostgreSQLSettings={
"SlotName": "dms_replication_slot", # WAL slot
"CaptureDdls": True,
"HeartbeatEnable": True,
"HeartbeatFrequency": 5,
"DatabaseMode": "default"
}
)
# ── Step 2: Create target endpoint (S3) ──────────────────────────
tgt = dms.create_endpoint(
EndpointIdentifier="s3-target",
EndpointType="target",
EngineName="s3",
S3Settings={
"BucketName": "my-data-lake",
"BucketFolder": "cdc-landing/orders/",
"CompressionType": "gzip", # valid values: "none" | "gzip"
"DataFormat": "parquet",
"ParquetVersion": "parquet-2-0",
"EnableStatistics": True,
"IncludeOpForFullLoad": True, # add Op column even for full load rows
# Leave CdcInsertsOnly and CdcInsertsAndUpdates at their default (False) so that
# inserts, updates, AND deletes are all written with their Op value (I/U/D).
# Setting CdcInsertsAndUpdates=True would drop DELETE events.
"TimestampColumnName": "dms_timestamp", # add arrival timestamp
"ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-role"
}
)
# ── Step 3: Create replication task ──────────────────────────────
task = dms.create_replication_task(
ReplicationTaskIdentifier="orders-cdc-task",
SourceEndpointArn=src["Endpoint"]["EndpointArn"],
TargetEndpointArn=tgt["Endpoint"]["EndpointArn"],
ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:REPLICATION_INSTANCE_ARN",
MigrationType="full-load-and-cdc", # snapshot first, then ongoing
TableMappings=json.dumps({
"rules": [{
"rule-type": "selection",
"rule-id": "1",
"rule-name": "include-orders",
"object-locator": {
"schema-name": "public",
"table-name": "orders"
},
"rule-action": "include"
}]
}),
ReplicationTaskSettings=json.dumps({
"TargetMetadata": {
"TargetSchema": "",
"SupportLobs": True,
"FullLobMode": False
},
"Logging": {
"EnableLogging": True
}
})
)
# ── Step 4: Start the task ────────────────────────────────────────
dms.start_replication_task(
ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
StartReplicationTaskType="start-replication" # or "resume-processing"
)
print("DMS task started — full load + ongoing CDC running")
DMS can also write CDC events directly to MSK. This is the preferred pattern when you want multiple consumers to process the same CDC stream — e.g. one Spark job writes to Delta Bronze, another updates a real-time cache in DynamoDB, and a third triggers alerting logic.
dms.create_endpoint(
EndpointIdentifier="msk-target",
EndpointType="target",
EngineName="kafka",
KafkaSettings={
"Broker": "b-1.mycluster.kafka.us-east-1.amazonaws.com:9094", # 9094 is the MSK TLS listener
"Topic": "orders-cdc",
"MessageFormat": "json", # or "json-unformatted"
"IncludeTransactionDetails": True,
"IncludePartitionValue": True,
"PartitionIncludeSchemaTable": True,
"IncludeTableAlterOperations": True,
"IncludeControlDetails": True,
"IncludeNullAndEmpty": False,
"SecurityProtocol": "ssl-encryption" # TLS in transit; also: plaintext, ssl-authentication, sasl-ssl
}
)
import time
def wait_for_dms_task(task_arn, terminal_states=("stopped", "failed")):
    while True:
        resp = dms.describe_replication_tasks(
            Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
        )
        task = resp["ReplicationTasks"][0]
        state = task["Status"]
        stats = task.get("ReplicationTaskStats", {})
        print(f"State: {state} | "
              f"TablesLoaded: {stats.get('TablesLoaded',0)} | "
              f"TablesLoading: {stats.get('TablesLoading',0)} | "
              f"TablesErrored: {stats.get('TablesErrored',0)}")
        if state in terminal_states:
            break
        if state == "running" and stats.get("TablesLoaded", 0) > 0:
            print("Full load complete, CDC ongoing")
            break  # for full-load-and-cdc, don't wait forever
        time.sleep(15)
wait_for_dms_task(task["ReplicationTask"]["ReplicationTaskArn"])
Common DMS pitfalls: (1) A PostgreSQL source needs wal_level = logical; on RDS this is enabled by setting rds.logical_replication = 1 in the parameter group, which requires a reboot. (2) DMS replication slots consume WAL disk space on the source, so monitor slot lag and set max_slot_wal_keep_size. (3) A DMS task failure mid-load leaves partial data, so always use a staging prefix and swap atomically. (4) LOB columns need SupportLobs=True, which affects performance.
Debezium is an open-source CDC platform built on top of Kafka Connect. It runs as a Kafka Connect connector — you deploy it alongside MSK (or on MSK Connect, AWS's managed Kafka Connect), configure it to point at your source database, and it reads the transaction log and publishes a structured CDC event to a Kafka topic for every change. No polling, no extra load on the source, sub-second latency.
{
"name": "orders-postgres-cdc",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "mydb.cluster-xyz.us-east-1.rds.amazonaws.com",
"database.port": "5432",
"database.user": "debezium",
"database.password": "${file:/opt/kafka/secrets/db.properties:password}",
"database.dbname": "orders_db",
"database.server.name": "myserver",
"table.include.list": "public.orders,public.customers",
"plugin.name": "pgoutput",
"slot.name": "debezium_slot",
"publication.name": "debezium_pub",
"snapshot.mode": "initial",
"decimal.handling.mode": "double",
"time.precision.mode": "adaptive",
"tombstones.on.delete": "true",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false",
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.add.fields": "op,ts_ms,source.ts_ms",
"transforms.unwrap.delete.handling.mode": "rewrite"
}
}
Without the unwrap transform, Debezium events have a complex nested structure: {"before": {...}, "after": {...}, "source": {...}, "op": "u"}. The ExtractNewRecordState SMT (Single Message Transform) flattens this: the value becomes just the after fields, with metadata fields such as __op and __ts_ms appended. This makes downstream Spark parsing much simpler.
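As a rough mental model of what the flattening does, the sketch below mimics it in plain Python. It is an approximation written for illustration, not Debezium's code: the unwrap function is hypothetical, and the handling of deletes (keeping the before image and adding a __deleted marker) reflects my understanding of delete.handling.mode=rewrite, so verify it against the Debezium documentation for your version.

```python
def unwrap(envelope, add_fields=("op", "ts_ms")):
    """Approximate the flattening performed by ExtractNewRecordState."""
    is_delete = envelope["op"] == "d"
    # Deletes have no after image, so the row content comes from the before image
    flat = dict(envelope["before"] if is_delete else envelope["after"])
    for name in add_fields:
        flat[f"__{name}"] = envelope[name]       # metadata gets a double-underscore prefix
    flat["__deleted"] = "true" if is_delete else "false"
    return flat


update_event = {
    "op": "u",
    "before": {"order_id": "O-001", "amount": 99.99, "status": "PLACED"},
    "after": {"order_id": "O-001", "amount": 99.99, "status": "SHIPPED"},
    "ts_ms": 1700000060000,
}
delete_event = {
    "op": "d",
    "before": {"order_id": "O-001", "amount": 99.99, "status": "SHIPPED"},
    "after": None,
    "ts_ms": 1700000120000,
}

flat_update = unwrap(update_event)
flat_delete = unwrap(delete_event)
```

The flattened records are single-level dictionaries, which is why a flat StructType with a few extra metadata columns is enough to parse them in Spark.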
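MySQL connector configuration (database.server.id must be a numeric ID that is unique among the MySQL server's replication clients, because Debezium reads the binlog as a replica):
{
"name": "mysql-cdc",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"database.hostname": "mysql.cluster-xyz.us-east-1.rds.amazonaws.com",
"database.port": "3306",
"database.user": "debezium",
"database.password": "${file:/opt/kafka/secrets/db.properties:password}",
"database.server.id": "12345",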
"database.server.name": "mysqlserver",
"database.include.list": "orders_db",
"table.include.list": "orders_db.orders",
"database.history.kafka.bootstrap.servers": "b-1.mycluster:9092",
"database.history.kafka.topic": "schema-changes.orders_db",
"snapshot.mode": "initial",
"include.schema.changes": "true"
}
}
When a row is deleted, Debezium publishes two Kafka messages to the topic: (1) a DELETE event with "op": "d" and the before-image, and (2) a tombstone — a message with the same key but a null value. The tombstone tells Kafka log compaction to remove that key from the compacted log. Your Spark consumer should filter out tombstones (null values) before processing.
# Kafka value is null for tombstone records — filter them out
non_tombstone = raw_kafka_stream.filter(col("value").isNotNull())
# Now parse the remaining (valid) CDC events
cdc_events = non_tombstone \
.select(from_json(col("value").cast("string"), cdc_schema).alias("d")) \
.select("d.*")
The core pattern for landing CDC into Delta: for each micro-batch, MERGE the incoming events into the target Delta table. Rows with matching primary keys get updated; new keys get inserted; delete events remove rows. This produces a current-state table that always reflects the latest state of the source database.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, max as spark_max
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType
from delta.tables import DeltaTable
# ── Schema for the unwrapped Debezium event ───────────────────────
# (after ExtractNewRecordState SMT, fields are flattened)
cdc_schema = StructType([
StructField("order_id", StringType(), True),
StructField("customer_id", StringType(), True),
StructField("amount", DoubleType(), True),
StructField("status", StringType(), True),
StructField("__op", StringType(), True), # c / u / d / r
StructField("__ts_ms", LongType(), True) # source change timestamp
])
DELTA_PATH = "s3://my-data-lake/silver/orders/"
CKPT_PATH = "s3://my-ckpt/cdc-orders/"
KAFKA_BROKER = "b-1.mycluster.kafka.us-east-1.amazonaws.com:9092"
def apply_cdc_batch(batch_df, batch_id):
"""Apply a micro-batch of CDC events to the Delta target table."""
if batch_df.isEmpty():
return
batch_df.cache()
# ── Deduplicate: keep only the LATEST change per order_id in this batch
# A batch might have: INSERT O-001, then UPDATE O-001
# We only want to apply the latest state
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, desc
window = Window.partitionBy("order_id").orderBy(desc("__ts_ms"))
latest = batch_df \
.withColumn("_rn", row_number().over(window)) \
.filter(col("_rn") == 1) \
.drop("_rn")
# ── Split: upserts (c/u/r) vs deletes (d) ────────────────────────
upserts = latest.filter(col("__op").isin("c", "u", "r")) \
.drop("__op", "__ts_ms")
deletes = latest.filter(col("__op") == "d")
delta_table = DeltaTable.forPath(spark, DELTA_PATH)
# ── Apply upserts via MERGE ───────────────────────────────────────
if upserts.count() > 0:
delta_table.alias("t").merge(
upserts.alias("s"),
"t.order_id = s.order_id"
).whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
# ── Apply deletes ─────────────────────────────────────────────────
if deletes.count() > 0:
# Build a condition like: order_id IN ('O-001', 'O-002', ...)
delete_ids = [row["order_id"] for row in deletes.select("order_id").collect()]
delta_table.delete(col("order_id").isin(delete_ids))
# Log before releasing the cache so the counts do not recompute the whole batch
print(f"Batch {batch_id}: upserts={upserts.count()}, deletes={deletes.count()}")
batch_df.unpersist()
# ── Read CDC stream from MSK ──────────────────────────────────────
raw = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KAFKA_BROKER) \
.option("subscribe", "myserver.public.orders") \
.option("startingOffsets", "earliest") \
.option("failOnDataLoss", "false") \
.load()
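# Drop tombstones (null values), then parse the JSON payload.
# A comment may not follow a line-continuation backslash, so it lives up here.
cdc_stream = raw \
.filter(col("value").isNotNull()) \
.select(
from_json(col("value").cast("string"), cdc_schema).alias("d")
).select("d.*")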
query = cdc_stream.writeStream \
.foreachBatch(apply_cdc_batch) \
.option("checkpointLocation", CKPT_PATH) \
.trigger(processingTime="1 minute") \
.start()
query.awaitTermination()
After each micro-batch, s3://my-data-lake/silver/orders/ contains the current state of the orders table: every insert is reflected, every update is applied, and every delete removes the row. Downstream Athena queries, dbt models, and Redshift Spectrum tables see a consistent table that lags the source by roughly one trigger interval (one minute in this example).
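The per-batch logic (deduplicate, then upsert or delete) can be sketched without Spark. The following is a small in-memory model, not the Delta implementation: `apply_cdc` and the sample rows are hypothetical, and a plain dict keyed by order_id stands in for the target table.

```python
def apply_cdc(table, events):
    """Apply one micro-batch of flattened CDC events to a dict keyed by order_id."""
    # Keep only the latest event per key (highest __ts_ms wins)
    latest = {}
    for ev in events:
        key = ev["order_id"]
        if key not in latest or ev["__ts_ms"] > latest[key]["__ts_ms"]:
            latest[key] = ev
    for key, ev in latest.items():
        if ev["__op"] == "d":
            table.pop(key, None)  # deleting an absent key is a no-op
        else:
            # c, u and r are all treated as upserts; metadata columns are dropped
            table[key] = {k: v for k, v in ev.items() if not k.startswith("__")}
    return table


orders = {"O-001": {"order_id": "O-001", "status": "PLACED"}}
batch = [
    {"order_id": "O-001", "status": "SHIPPED", "__op": "u", "__ts_ms": 2},
    {"order_id": "O-002", "status": "PLACED", "__op": "c", "__ts_ms": 1},
    {"order_id": "O-002", "status": "CANCELLED", "__op": "d", "__ts_ms": 3},
    {"order_id": "O-003", "status": "PLACED", "__op": "c", "__ts_ms": 4},
]
apply_cdc(orders, batch)
print(sorted(orders))
```

O-002 is created and deleted inside the same batch, so only its latest event (the delete) is applied and the row never appears in the target.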
Instead of a current-state table, you can build a full history table (SCD Type 2) where every change creates a new row with effective_start, effective_end, and is_current columns. This lets you query the state of any record at any point in time.
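As a rough illustration of how the three SCD2 columns support point-in-time queries, here is an in-memory sketch. The helpers `scd2_apply` and `as_of` are hypothetical, and the change time is passed in explicitly rather than taken from the clock so the example is deterministic.

```python
from datetime import datetime


def scd2_apply(history, key, values, changed_at):
    """Close the current row for `key` (if any) and append a new current row."""
    for row in history:
        if row["order_id"] == key and row["is_current"]:
            row["is_current"] = False
            row["effective_end"] = changed_at
    history.append({
        "order_id": key,
        **values,
        "effective_start": changed_at,
        "effective_end": None,
        "is_current": True,
    })


def as_of(history, key, ts):
    """Return the version of `key` that was valid at time `ts`, or None."""
    for row in history:
        started = row["effective_start"] <= ts
        not_ended = row["effective_end"] is None or ts < row["effective_end"]
        if row["order_id"] == key and started and not_ended:
            return row
    return None


history = []
scd2_apply(history, "O-001", {"status": "PLACED"}, datetime(2024, 1, 10))
scd2_apply(history, "O-001", {"status": "SHIPPED"}, datetime(2024, 1, 12))

print(as_of(history, "O-001", datetime(2024, 1, 11))["status"])
print(as_of(history, "O-001", datetime(2024, 1, 13))["status"])
```

Each key has exactly one row with is_current set to true, and the half-open interval [effective_start, effective_end) ensures that no timestamp matches two versions.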
from pyspark.sql.functions import col, current_timestamp, lit
from delta.tables import DeltaTable

HISTORY_PATH = "s3://my-data-lake/silver/orders_history/"

def apply_scd2_batch(batch_df, batch_id):
    if batch_df.isEmpty():
        return
    batch_df.cache()
    # NOTE: assumes at most one change per order_id per batch;
    # deduplicate first (as in apply_cdc_batch) if that does not hold.
    upserts = (
        batch_df.filter(col("__op").isin("c", "u", "r"))
        .drop("__op", "__ts_ms")
        .withColumn("effective_start", current_timestamp())
        .withColumn("effective_end", lit(None).cast("timestamp"))
        .withColumn("is_current", lit(True))
    )
    if not upserts.isEmpty():
        delta_table = DeltaTable.forPath(spark, HISTORY_PATH)
        # Step 1: expire the current row for every key that changed
        delta_table.alias("t").merge(
            upserts.alias("s"),
            "t.order_id = s.order_id AND t.is_current = true"
        ).whenMatchedUpdate(set={
            "is_current": "false",
            "effective_end": "current_timestamp()"
        }).execute()
        # Step 2: append the new current row for every key (new and changed).
        # The MERGE must not insert as well, or brand-new keys would land twice.
        upserts.write.format("delta").mode("append").save(HISTORY_PATH)
    batch_df.unpersist()
CDC streaming creates many small Delta files — one or more per micro-batch. Over time this degrades query performance. Run a scheduled OPTIMIZE job (Databricks) or an EMR Spark compaction job every few hours to compact small files.
from delta.tables import DeltaTable
# ── Run after your CDC stream has been running for a while ────────
delta_table = DeltaTable.forPath(spark, "s3://my-data-lake/silver/orders/")
# OPTIMIZE: compact small Parquet files into 1 GB files
# ZORDER by order_id: co-locate related order data for faster point lookups
spark.sql("""
OPTIMIZE delta.`s3://my-data-lake/silver/orders/`
ZORDER BY (order_id, customer_id)
""")
# VACUUM: remove files older than 7 days (default retention)
# Don't go below 7 days if you rely on time travel!
delta_table.vacuum(retentionHours=168) # 168 hours = 7 days
print("Compaction complete")
Iceberg supports row-level DML (MERGE, UPDATE, DELETE) since Iceberg 0.13 / Spark 3.x. The MERGE syntax is very similar to Delta, making it easy to port your CDC logic between the two formats. Iceberg uses copy-on-write (default) or merge-on-read strategies for row-level changes.
def apply_cdc_to_iceberg(batch_df, batch_id):
if batch_df.isEmpty(): return
batch_df.cache()
# ── Deduplicate within batch ──────────────────────────────────────
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, desc
window = Window.partitionBy("order_id").orderBy(desc("__ts_ms"))
latest = batch_df.withColumn("_rn", row_number().over(window)) \
.filter(col("_rn") == 1).drop("_rn")
upserts = latest.filter(col("__op").isin("c", "u", "r"))
deletes = latest.filter(col("__op") == "d")
# ── Register batch as a temp view for SQL MERGE ───────────────────
upserts.drop("__op", "__ts_ms").createOrReplaceTempView("cdc_upserts")
deletes.select("order_id").createOrReplaceTempView("cdc_deletes")
# ── MERGE upserts into Iceberg target table ───────────────────────
# Temp views created from batch_df are registered in the micro-batch's own
# session, so run the SQL through it (DataFrame.sparkSession, PySpark 3.3+).
batch_spark = batch_df.sparkSession
batch_spark.sql("""
MERGE INTO glue_catalog.silver.orders AS t
USING cdc_upserts AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN
UPDATE SET
t.customer_id = s.customer_id,
t.amount = s.amount,
t.status = s.status
WHEN NOT MATCHED THEN
INSERT (order_id, customer_id, amount, status)
VALUES (s.order_id, s.customer_id, s.amount, s.status)
""")
# ── DELETE rows from Iceberg target table ─────────────────────────
if deletes.count() > 0:
batch_spark.sql("""
DELETE FROM glue_catalog.silver.orders
WHERE order_id IN (SELECT order_id FROM cdc_deletes)
""")
batch_df.unpersist()
# ── Wire into the streaming query ────────────────────────────────
query = cdc_stream.writeStream \
.foreachBatch(apply_cdc_to_iceberg) \
.option("checkpointLocation", "s3://my-ckpt/cdc-iceberg/") \
.trigger(processingTime="1 minute") \
.start()
query.awaitTermination()
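Streaming CDC into Iceberg produces many small files, just as it does with Delta. Compact them with the rewrite_data_files procedure in a scheduled Glue job.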
# ── Compact small files after CDC streaming ───────────────────────
spark.sql("""
CALL glue_catalog.system.rewrite_data_files(
table => 'silver.orders',
options => map(
'target-file-size-bytes', '134217728', -- 128 MB target
'min-input-files', '5' -- only compact if 5+ small files
)
)
""")
# ── Remove old snapshots (Iceberg keeps history like Delta time travel)
spark.sql("""
CALL glue_catalog.system.expire_snapshots(
table => 'silver.orders',
older_than => TIMESTAMP '2024-01-01 00:00:00',
retain_last => 5
)
""")
# ── Remove orphan files (files not in any snapshot) ───────────────
spark.sql("""
CALL glue_catalog.system.delete_orphan_files(
table => 'silver.orders'
)
""")
Schedule a compaction job every 4 hours using EventBridge to trigger a Lambda that adds an EMR step running the Iceberg/Delta OPTIMIZE logic.
import os
import boto3
emr = boto3.client("emr", region_name="us-east-1")
def lambda_handler(event, context):
cluster_id = os.environ["EMR_CLUSTER_ID"]
resp = emr.add_job_flow_steps(
JobFlowId=cluster_id,
Steps=[{
"Name": "Delta-Compaction-Orders",
"ActionOnFailure": "CONTINUE",
"HadoopJarStep": {
"Jar": "command-runner.jar",
"Args": [
"spark-submit",
"--conf", "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension",
"--conf", "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog",
"s3://my-code-bucket/scripts/compact_orders.py"
]
}
}]
)
step_id = resp["StepIds"][0]
print(f"Compaction step submitted: {step_id}")
return {"statusCode": 200, "stepId": step_id}
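The Lambda handler still needs an EventBridge rule to invoke it every 4 hours. A minimal sketch of that wiring follows. The rule name, target Id, and ARN are placeholders, and `build_schedule` and `create_schedule` are hypothetical helpers; the boto3 calls used are `put_rule`, `put_targets`, and Lambda's `add_permission`.

```python
def build_schedule(rule_name, lambda_arn, every_hours=4):
    """Return the keyword arguments for EventBridge put_rule and put_targets."""
    unit = "hour" if every_hours == 1 else "hours"
    rule = {
        "Name": rule_name,
        "ScheduleExpression": f"rate({every_hours} {unit})",
        "State": "ENABLED",
    }
    targets = {
        "Rule": rule_name,
        "Targets": [{"Id": "compaction-lambda", "Arn": lambda_arn}],
    }
    return rule, targets


def create_schedule(rule_name, lambda_arn, function_name):
    """Create the rule, let EventBridge invoke the Lambda, and attach the target."""
    import boto3  # imported here so build_schedule works without AWS access

    events = boto3.client("events")
    lam = boto3.client("lambda")
    rule, targets = build_schedule(rule_name, lambda_arn)
    rule_arn = events.put_rule(**rule)["RuleArn"]
    # Without this permission EventBridge cannot invoke the function
    lam.add_permission(
        FunctionName=function_name,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )
    events.put_targets(**targets)


rule, targets = build_schedule(
    "delta-compaction-every-4h",
    "arn:aws:lambda:us-east-1:123456789012:function:compact-orders",  # placeholder ARN
)
print(rule["ScheduleExpression"])
```

Building the arguments separately from the AWS calls keeps the schedule definition easy to unit-test; in practice the same rule is often declared in CloudFormation or Terraform instead.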
Delta Lake & Apache Iceberg
Delta Lake and Apache Iceberg are open table formats that bring ACID transactions, schema evolution, time travel, and upserts to files stored on S3. Without them, S3 is just a collection of Parquet files: you can't update a row, delete a record, or guarantee consistency. With them, tables on S3 behave much like tables in a transactional data warehouse. This section covers what you need for production: ACID guarantees, MERGE INTO, OPTIMIZE/VACUUM, hidden partitioning, and when to choose Delta vs Iceberg.
Before Delta Lake and Iceberg existed, data engineers stored data on S3 as raw Parquet files. This worked for simple append-only workloads but broke down in production because of five fundamental problems.
| Problem | What Goes Wrong | Example |
|---|---|---|
| No ACID Transactions | Two writers writing simultaneously corrupt data — partial writes are visible to readers | Glue job and Spark job both append → duplicate rows |
| No UPDATE / DELETE | You cannot modify or remove a specific row — only append or rewrite the entire partition | GDPR erasure request impossible without full rewrite |
| No Schema Evolution Safety | Adding a column to new files breaks readers that expect the old schema | Athena query fails after producer adds a column |
| No Time Travel | Once data is overwritten, the old version is gone forever | Analyst runs wrong query, overwrites Gold table — unrecoverable |
| Slow Partition Discovery | Listing millions of S3 files to discover partitions is slow and expensive | Glue crawler takes 45 minutes on a large lake |
Both Delta Lake and Iceberg solve these problems by adding a metadata layer on top of the existing Parquet files. The data files themselves are still Parquet on S3 — but the table format adds a transaction log that tracks every change: which files were added, which were removed, what the schema is, and when each operation happened. This transaction log is what enables ACID, time travel, schema evolution, and fast metadata operations.
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Created by | Databricks (open-sourced) | Netflix + Apple | Uber |
| AWS Native | EMR + Glue support | ✅ Athena + Glue native | EMR support |
| MERGE / Upsert | ✅ Strong | ✅ Strong | ✅ (CoW + MoR) |
| Time Travel | ✅ Version/timestamp | ✅ Snapshot-based | ✅ Limited |
| Athena Support | Via manifest file | ✅ Native, first-class | Via manifest file |
| Schema Evolution | ✅ Add/rename/drop | ✅ Full DDL support | ✅ Partial |
| Best with | Databricks, EMR, Glue | Athena, Glue, EMR, Flink | Spark streaming CDC |
| Recommendation on AWS | Best for EMR/Glue Spark | Best for Athena + Glue | Only for specific CDC patterns |
The _delta_log/ directory is the heart of Delta Lake. Every write operation (INSERT, UPDATE, DELETE, MERGE) appends a new JSON commit file to this log. Each commit file records exactly which Parquet files were added and which were removed. Readers always check the log first to determine which files are part of the "current" version of the table — this is how Delta achieves snapshot isolation and ACID compliance.
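The way a reader derives the current file set from the log can be sketched in a few lines. This is a simplified model, assuming each commit is just a list of add/remove actions (real commit files also carry metadata, protocol, and statistics, and Delta periodically writes checkpoints); `active_files` is a hypothetical helper.

```python
def active_files(commits, as_of_version=None):
    """Replay add/remove actions in commit order and return the live data files."""
    files = set()
    for version, actions in enumerate(commits):
        if as_of_version is not None and version > as_of_version:
            break  # stop early to reconstruct an older version
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


commits = [
    # version 0: initial write
    [{"add": {"path": "part-000.parquet"}}, {"add": {"path": "part-001.parquet"}}],
    # version 1: append
    [{"add": {"path": "part-002.parquet"}}],
    # version 2: a rewrite replaces part-000 with part-003
    [{"remove": {"path": "part-000.parquet"}}, {"add": {"path": "part-003.parquet"}}],
]

print(sorted(active_files(commits)))
print(sorted(active_files(commits, as_of_version=0)))
```

Stopping the replay at an earlier version is, conceptually, all that time travel does: the removed file still exists on S3 until VACUUM deletes it.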
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
# ── Build SparkSession with Delta Lake support ────────────────────
builder = SparkSession.builder \
.appName("DeltaLakeDemo") \
.config("spark.sql.extensions",
"io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog",
"org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
# ── Write a DataFrame to Delta format ────────────────────────────
data = [
("O-001", "C-101", 99.99, "PLACED"),
("O-002", "C-102", 149.50, "SHIPPED"),
("O-003", "C-103", 25.00, "PLACED")
]
df = spark.createDataFrame(data, ["order_id", "customer_id", "amount", "status"])
DELTA_PATH = "s3://my-data-lake/silver/orders/"
df.write \
.format("delta") \
.mode("overwrite") \
.partitionBy("status") \
.save(DELTA_PATH)
print("✅ Delta table written — _delta_log/ created on S3")
# ── Read back the Delta table ─────────────────────────────────────
df_read = spark.read.format("delta").load(DELTA_PATH)
df_read.show()
# ── Check the transaction log ─────────────────────────────────────
# s3://my-data-lake/silver/orders/_delta_log/00000000000000000000.json
# Contains: {"add": {"path": "status=PLACED/part-xxx.parquet", ...}}
# ── Register as a table in Glue Catalog ──────────────────────────
spark.sql("""
CREATE TABLE IF NOT EXISTS silver.orders
USING delta
LOCATION 's3://my-data-lake/silver/orders/'
""")
Delta provides all four ACID properties. Understanding what each means practically helps you design pipelines correctly.
Delta tracks the table schema in the transaction log. When you add a column to the source data, you have two options: use mergeSchema=true to automatically add the new column to the Delta schema, or use overwriteSchema=true to replace the schema entirely. Old files that don't have the new column return NULL for that column when read.
from pyspark.sql.functions import lit
# Original table has: order_id, customer_id, amount, status
# New data source now also includes: discount_pct
new_data = [
("O-004", "C-104", 200.00, "PLACED", 0.10),
("O-005", "C-105", 75.00, "PLACED", 0.05)
]
df_new = spark.createDataFrame(
new_data, ["order_id", "customer_id", "amount", "status", "discount_pct"]
)
# ── mergeSchema=true: add new column to existing Delta table schema
# (the option auto-adds the discount_pct column)
df_new.write \
.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.save(DELTA_PATH)
# Old rows (O-001, O-002, O-003) now return NULL for discount_pct
spark.read.format("delta").load(DELTA_PATH).show()
# order_id | customer_id | amount | status | discount_pct
# O-001 | C-101 | 99.99 | PLACED | null ← old row
# O-004 | C-104 | 200.00 | PLACED | 0.10 ← new row
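# ── ALTER TABLE: rename or drop a column ─────────────────────────
# RENAME/DROP COLUMN require column mapping, which upgrades the table protocol
spark.sql("""
ALTER TABLE silver.orders SET TBLPROPERTIES (
'delta.columnMapping.mode' = 'name',
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5'
)
""")
spark.sql("ALTER TABLE silver.orders RENAME COLUMN discount_pct TO discount")
spark.sql("ALTER TABLE silver.orders DROP COLUMN discount")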
# ── overwriteSchema: replace schema entirely (destructive!) ───────
# WARNING: the previous schema definition is replaced
df.write \
.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.save(DELTA_PATH)
Delta Lake (and Iceberg) retain old versions of the table in the transaction log. You can read any previous version — either by version number or by timestamp. This is invaluable for: debugging ("what did this table look like before yesterday's bad pipeline run?"), auditing, reproducing ML training datasets, and recovering from accidental deletes or incorrect MERGE operations.
from delta.tables import DeltaTable
# ── Read a specific version number ───────────────────────────────
# version 0 = initial write
df_v0 = spark.read \
.format("delta") \
.option("versionAsOf", 0) \
.load(DELTA_PATH)
# version 2 = after second commit
df_v2 = spark.read \
.format("delta") \
.option("versionAsOf", 2) \
.load(DELTA_PATH)
# ── Read as of a specific timestamp ──────────────────────────────
df_yesterday = spark.read \
.format("delta") \
.option("timestampAsOf", "2024-01-14T23:59:59.000Z") \
.load(DELTA_PATH)
# ── SQL syntax for time travel ────────────────────────────────────
spark.sql("""
SELECT * FROM silver.orders
VERSION AS OF 2
""").show()
spark.sql("""
SELECT * FROM silver.orders
TIMESTAMP AS OF '2024-01-14 23:59:59'
""").show()
# ── View the full history of a Delta table ────────────────────────
delta_table = DeltaTable.forPath(spark, DELTA_PATH)
history = delta_table.history()
history.select("version", "timestamp", "operation", "operationParameters").show()
# version | timestamp | operation | operationParameters
# 2 | 2024-01-15 08:00:05 | MERGE | {predicate: "t.order_id = s.order_id"}
# 1 | 2024-01-15 07:00:02 | WRITE | {mode: "Append"}
# 0 | 2024-01-14 06:00:00 | WRITE | {mode: "Overwrite"}
# ── RESTORE a table to a previous version ────────────────────────
# USE WITH CARE: the restore is recorded as a new commit, so undoing it means restoring again
delta_table.restoreToVersion(1) # restore to version 1
# OR: delta_table.restoreToTimestamp("2024-01-14T23:59:59.000Z")
If a buggy pipeline run corrupts the table, a single call to delta_table.restoreToVersion(prev_version) puts the table back in its pre-bug state in seconds. Without Delta, you would need to re-run the entire pipeline from the source, which can take hours.
MERGE INTO is the most important Delta Lake operation. It takes a source DataFrame (new/changed data) and merges it into the target Delta table. For each row in the source: if a matching row exists in the target (by primary key), update it; if not, insert it.
from delta.tables import DeltaTable
from pyspark.sql.functions import current_timestamp
# ── Target: existing Delta table ──────────────────────────────────
target = DeltaTable.forPath(spark, "s3://my-data-lake/silver/orders/")
# ── Source: incoming changes (from CDC, streaming, or batch load) ─
updates = spark.createDataFrame([
("O-001", "C-101", 99.99, "SHIPPED"), # existing row — status changed
("O-006", "C-106", 300.00, "PLACED"), # new row
], ["order_id", "customer_id", "amount", "status"])
# ── MERGE: update matching rows, insert new rows ──────────────────
# whenMatchedUpdateAll: UPDATE all columns on match
# whenNotMatchedInsertAll: INSERT on no match
target.alias("t").merge(
updates.alias("s"),
condition="t.order_id = s.order_id" # join condition
).whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
# ── MERGE with selective updates and conditions ───────────────────
# (assumes the target table also has an updated_at column)
target.alias("t").merge(
updates.alias("s"),
"t.order_id = s.order_id"
).whenMatchedUpdate(
condition="s.amount > t.amount", # only update if new amount is larger
set={
"status": "s.status",
"amount": "s.amount",
"updated_at": "current_timestamp()"
}
).whenNotMatchedInsert(
values={
"order_id": "s.order_id",
"customer_id": "s.customer_id",
"amount": "s.amount",
"status": "s.status",
"updated_at": "current_timestamp()"
}
).execute()
Delta's MERGE also supports a whenMatchedDelete() clause — if a matching row exists in the target AND a condition is met, delete the target row. This is the standard pattern for applying hard deletes from CDC events.
from pyspark.sql.functions import col
# Source contains CDC events with __op column: c/u/d
cdc_batch = spark.createDataFrame([
("O-001", "C-101", 99.99, "DELIVERED", "u"), # update
("O-003", "C-103", 25.00, "CANCELLED", "d"), # delete
("O-007", "C-107", 500.00, "PLACED", "c"), # insert
], ["order_id", "customer_id", "amount", "status", "__op"])
target = DeltaTable.forPath(spark, "s3://my-data-lake/silver/orders/")
target.alias("t").merge(
cdc_batch.alias("s"),
"t.order_id = s.order_id"
).whenMatchedDelete(
condition="s.__op = 'd'" # DELETE if it's a delete event
).whenMatchedUpdate(
condition="s.__op = 'u'", # UPDATE if it's an update event
set={
"status": "s.status",
"amount": "s.amount"
}
).whenNotMatchedInsert(
condition="s.__op = 'c'", # INSERT only for create events
values={
"order_id": "s.order_id",
"customer_id": "s.customer_id",
"amount": "s.amount",
"status": "s.status"
}
).execute()
print("✅ CDC batch applied: insert + update + delete")
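Slowly Changing Dimension Type 2 keeps the full history of changes: each update closes the old row (sets effective_end and is_current=false) and inserts a new row with the new values. A Delta MERGE makes the close-out step clean and atomic. The insert of the new versions is a second commit, so a reader can briefly see a closed row before its replacement arrives.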
from pyspark.sql.functions import current_timestamp, lit
from delta.tables import DeltaTable
# SCD2 table structure:
# customer_id | name | country | effective_start | effective_end | is_current
target = DeltaTable.forPath(spark, "s3://my-data-lake/silver/customers_scd2/")
new_data = spark.createDataFrame([
("C-101", "Alice Smith", "US"), # existing customer — country changed
("C-200", "New Person", "IN"), # new customer
], ["customer_id", "name", "country"])
# Step 1: Close the old row for changed customers
target.alias("t").merge(
new_data.alias("s"),
"t.customer_id = s.customer_id AND t.is_current = true"
" AND (t.name != s.name OR t.country != s.country)" # only if data changed
).whenMatchedUpdate(set={
"is_current": "false",
"effective_end": "current_timestamp()"
}).execute()
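# Step 2: Insert new rows (new customers + new versions of changed ones).
# After Step 1 only unchanged customers still have a current row, so an
# anti-join against current rows keeps exactly the rows that need inserting.
SCD2_PATH = "s3://my-data-lake/silver/customers_scd2/"
current_ids = spark.read.format("delta").load(SCD2_PATH) \
.filter("is_current = true") \
.select("customer_id")
new_rows = new_data \
.join(current_ids, "customer_id", "left_anti") \
.withColumn("effective_start", current_timestamp()) \
.withColumn("effective_end", lit(None).cast("timestamp")) \
.withColumn("is_current", lit(True))
new_rows.write.format("delta").mode("append").save(SCD2_PATH)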
print("✅ SCD2 applied")
Every MERGE, every streaming micro-batch, and every incremental append creates new small Parquet files. Over time, a Delta table can accumulate thousands of tiny files. Querying a table with 50,000 x 1 MB files is much slower than querying one with 50 x 1 GB files — more S3 LIST operations, more file opens, more task scheduling overhead.
OPTIMIZE reads all small files in the table, combines them into larger files (~1 GB each), and rewrites them. Old small files are retained for time travel but marked as removed in the transaction log (so queries skip them). Run OPTIMIZE on a schedule — nightly or after a large batch of micro-batch streaming.
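The effect of compaction on file counts can be estimated with a simple greedy grouping. This is only a back-of-the-envelope model, not Delta's actual bin-packing; `plan_compaction` is a hypothetical helper and sizes are in MB.

```python
def plan_compaction(file_sizes_mb, target_mb=1024):
    """Greedily group files into bins of at most target_mb; each bin becomes one output file."""
    bins, current, current_size = [], [], 0
    for size in file_sizes_mb:
        # Start a new output file once the next input would overflow the target
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


small_files = [1] * 50_000  # 50,000 files of 1 MB each
plan = plan_compaction(small_files)
print(len(small_files), "->", len(plan), "files")
```

With a 1 GB target, the 50,000 one-megabyte files collapse into roughly 50 output files, which is the reduction in file opens and task scheduling that makes queries faster.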
# ── OPTIMIZE: compact all small files in the table ───────────────
spark.sql("OPTIMIZE silver.orders")
# ── OPTIMIZE a specific partition only ───────────────────────────
spark.sql("OPTIMIZE silver.orders WHERE status = 'PLACED'")
# ── OPTIMIZE via Delta Table API ─────────────────────────────────
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "s3://my-data-lake/silver/orders/")
delta_table.optimize().executeCompaction()
# ── OPTIMIZE with target file size (default is 1 GB) ─────────────
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 134217728) # 128 MB
spark.sql("OPTIMIZE silver.orders")
ZORDER BY is a multi-dimensional clustering technique that arranges data within files so that rows with similar values of the ZORDER columns are stored together. When a query filters on those columns, Delta can skip entire files (data skipping) rather than scanning every file. ZORDER is most effective on high-cardinality columns that are frequently used in WHERE clauses.
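Data skipping itself is just a min/max range check per file. The sketch below illustrates why clustering matters; the file statistics are made up, and `files_to_scan` is a hypothetical helper rather than anything Delta exposes.

```python
def files_to_scan(file_stats, column, value):
    """Return the files whose [min, max] range for `column` could contain `value`."""
    return [
        f["path"]
        for f in file_stats
        if f["min"][column] <= value <= f["max"][column]
    ]


# Unclustered: every file spans almost the whole key range
unclustered = [
    {"path": "a.parquet", "min": {"order_id": "O-001"}, "max": {"order_id": "O-900"}},
    {"path": "b.parquet", "min": {"order_id": "O-002"}, "max": {"order_id": "O-899"}},
    {"path": "c.parquet", "min": {"order_id": "O-005"}, "max": {"order_id": "O-950"}},
]

# Clustered: each file covers a narrow, disjoint range
clustered = [
    {"path": "a.parquet", "min": {"order_id": "O-001"}, "max": {"order_id": "O-300"}},
    {"path": "b.parquet", "min": {"order_id": "O-301"}, "max": {"order_id": "O-600"}},
    {"path": "c.parquet", "min": {"order_id": "O-601"}, "max": {"order_id": "O-950"}},
]

print(files_to_scan(unclustered, "order_id", "O-450"))
print(files_to_scan(clustered, "order_id", "O-450"))
```

The same lookup touches all three files in the unclustered layout but only one after clustering. ZORDER aims for this kind of narrow range on several columns at once.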
# ── ZORDER by order_id and customer_id ───────────────────────────
# After this: queries like WHERE order_id = 'O-001' skip most files
spark.sql("""
OPTIMIZE silver.orders
ZORDER BY (order_id, customer_id)
""")
# ZORDER via API
delta_table.optimize().executeZOrderBy("order_id", "customer_id")
# ── Check data skipping stats after ZORDER ───────────────────────
# Each commit file in _delta_log records per-file statistics in its "add" actions
log = spark.read.json("s3://my-data-lake/silver/orders/_delta_log/*.json")
log.where("add IS NOT NULL").select("add.path", "add.stats").show(truncate=False)
# stats shows: min/max of order_id and customer_id per file
# → Delta skips files where min/max range doesn't contain the query value
VACUUM permanently deletes Parquet files that are no longer referenced by the Delta transaction log AND are older than the retention threshold (default 7 days). Without regular VACUUM, deleted/overwritten files accumulate indefinitely on S3, and your storage cost grows even though the table data hasn't grown.
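The two-part eligibility rule (unreferenced AND older than the retention window) can be expressed as a small filter. This is a simplified sketch, assuming a single timestamp per file; `vacuum_candidates` and the sample data are hypothetical.

```python
from datetime import datetime, timedelta


def vacuum_candidates(all_files, referenced, now, retention_hours=168):
    """Files that are unreferenced by the table AND older than the retention window."""
    cutoff = now - timedelta(hours=retention_hours)
    return sorted(
        path
        for path, modified in all_files.items()
        if path not in referenced and modified < cutoff
    )


now = datetime(2024, 1, 15)
all_files = {
    "part-000.parquet": datetime(2024, 1, 1),   # replaced long ago
    "part-001.parquet": datetime(2024, 1, 1),   # old, but still part of the table
    "part-002.parquet": datetime(2024, 1, 14),  # replaced yesterday, inside retention
}
referenced = {"part-001.parquet"}

print(vacuum_candidates(all_files, referenced, now))
```

Only part-000.parquet qualifies. part-002.parquet is unreferenced but survives until it ages past the 7-day window, which is what keeps recent time-travel queries working.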
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "s3://my-data-lake/silver/orders/")
# ── VACUUM: delete files older than 7 days (default) ─────────────
delta_table.vacuum()
# ── VACUUM with custom retention (minimum 168 hours = 7 days) ────
delta_table.vacuum(retentionHours=168) # 7 days
# ── SQL: VACUUM with dry run (see what WOULD be deleted) ─────────
spark.sql("VACUUM silver.orders DRY RUN").show() # no actual deletion
spark.sql("VACUUM silver.orders RETAIN 168 HOURS") # actually delete
# ⚠️ NEVER go below 168 hours (7 days) if you use time travel!
# If you VACUUM files needed for a time-travel query, that query will fail.
# ── To allow shorter retention (dev/test only — NOT production) ───
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
delta_table.vacuum(retentionHours=1) # DEV ONLY — removes all old versions
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")
On EMR, you install Delta Lake via the --packages option to spark-submit, or via a bootstrap action that pre-installs the JAR. EMR 6.9+ has native Delta support you can enable via cluster configuration.
# Option 1: --packages (downloads from Maven when the job is submitted)
spark-submit \
--packages io.delta:delta-core_2.12:2.4.0 \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
s3://my-code/scripts/delta_pipeline.py
# Option 2: EMR 6.9+ native Delta support via Configurations
[
{
"Classification": "delta-defaults",
"Properties": {
"delta.enabled": "true"
}
}
]
In AWS Glue 4.0+, Delta Lake is a supported format. You add a Glue job parameter --datalake-formats delta and Glue automatically adds the Delta JAR. No manual dependency management needed.
# In Glue job DefaultArguments:
# "--datalake-formats": "delta"
# A job has only one "--conf" key, so chain further settings inside its value:
# "--conf": "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
# In your Glue PySpark script:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from delta.tables import DeltaTable
sc = SparkContext()
glue = GlueContext(sc)
spark = glue.spark_session
# Now use Delta normally
delta_table = DeltaTable.forPath(spark, "s3://my-data-lake/silver/orders/")
delta_table.history().show()
glue.create_dynamic_frame.from_options(
connection_type="s3",
connection_options={"path": "s3://my-data-lake/silver/orders/"},
format="delta"
)
The approach shown here lets Athena read a Delta table through a manifest file that lists the active Parquet files, rather than through _delta_log. Delta can generate this manifest, and you then create an Athena external table that points to it. The manifest must be regenerated after every write, which adds operational overhead compared with Iceberg's native Athena support. Athena engine version 3 can also query Delta tables registered in the Glue Data Catalog directly (read-only), so the manifest route mainly matters for older engine versions.
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "s3://my-data-lake/silver/orders/")
# ── Generate symlink manifest for Athena ─────────────────────────
delta_table.generate("symlink_format_manifest")
# Creates: s3://my-data-lake/silver/orders/_symlink_format_manifest/
# manifest file listing all active Parquet file paths
# ── Create Athena external table pointing to manifest ─────────────
# The partition column must match the Delta table's partitioning (status, as written earlier)
spark.sql("""
CREATE EXTERNAL TABLE IF NOT EXISTS silver_athena.orders (
order_id STRING,
customer_id STRING,
amount DOUBLE
)
PARTITIONED BY (status STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-data-lake/silver/orders/_symlink_format_manifest/'
""")
# Run MSCK REPAIR TABLE silver_athena.orders to load partitions, then query from Athena
# ⚠️ Must re-run delta_table.generate("symlink_format_manifest")
# after EVERY write to keep the Athena table in sync.
Iceberg's metadata structure is three-tiered: Table Metadata File (the entry point — lists all snapshots), Manifest Lists (one per snapshot — lists all manifest files for that snapshot), and Manifest Files (list the actual Parquet data files with per-file statistics). This hierarchy enables extremely fast metadata operations even on tables with billions of files.
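The three tiers can be modelled with a few dataclasses to show how a scan is planned. This is a heavily simplified sketch: the class and field names are hypothetical, real manifests are Avro files with richer statistics, and real planning also prunes whole manifests using partition summaries.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str
    min_order_id: str  # per-file column statistics (simplified)
    max_order_id: str


@dataclass
class Manifest:
    data_files: list  # list of DataFile


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: list  # list of Manifest; manifests can be shared across snapshots


@dataclass
class TableMetadata:
    current_snapshot_id: int
    snapshots: list  # list of Snapshot


def plan_scan(table, order_id, snapshot_id=None):
    """Walk metadata -> manifest list -> manifests and keep files whose stats match."""
    wanted = snapshot_id if snapshot_id is not None else table.current_snapshot_id
    snapshot = next(s for s in table.snapshots if s.snapshot_id == wanted)
    return [
        f.path
        for manifest in snapshot.manifest_list
        for f in manifest.data_files
        if f.min_order_id <= order_id <= f.max_order_id
    ]


m1 = Manifest([DataFile("data/a.parquet", "O-001", "O-100")])
m2 = Manifest([DataFile("data/b.parquet", "O-101", "O-200")])
table = TableMetadata(
    current_snapshot_id=2,
    snapshots=[Snapshot(1, [m1]), Snapshot(2, [m1, m2])],  # snapshot 2 reuses m1
)

print(plan_scan(table, "O-150"))                 # current snapshot
print(plan_scan(table, "O-150", snapshot_id=1))  # older snapshot
```

Planning never lists S3: it only follows pointers from the metadata file down to the data files. An append can reuse existing manifests and add one new one, which keeps commits cheap.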
On AWS, you use Iceberg with the Glue Data Catalog as the metadata store. The Glue Catalog stores the pointer to Iceberg's latest metadata file. Athena, Glue Spark, and EMR Spark all look up the table from Glue — so one table definition works across all three tools.
from pyspark.sql import SparkSession
# ── SparkSession config for Iceberg + Glue Catalog ────────────────
spark = SparkSession.builder \
.appName("IcebergDemo") \
.config("spark.sql.extensions",
"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
.config("spark.sql.catalog.glue_catalog",
"org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.glue_catalog.catalog-impl",
"org.apache.iceberg.aws.glue.GlueCatalog") \
.config("spark.sql.catalog.glue_catalog.warehouse",
"s3://my-data-lake/") \
.config("spark.sql.catalog.glue_catalog.io-impl",
"org.apache.iceberg.aws.s3.S3FileIO") \
.getOrCreate()
# ── Create an Iceberg table via SQL ───────────────────────────────
spark.sql("""
CREATE TABLE IF NOT EXISTS glue_catalog.silver.orders (
order_id STRING NOT NULL,
customer_id STRING,
amount DOUBLE,
status STRING,
created_at TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(created_at)) -- hidden partitioning!
TBLPROPERTIES (
'write.format.default' = 'parquet',
'write.parquet.compression-codec' = 'snappy',
'format-version' = '2' -- enables row-level deletes
)
""")
print("✅ Iceberg table created in Glue Catalog")
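# ── Write a DataFrame to the Iceberg table ───────────────────────
# The DataFrame must supply every table column, so created_at is added here.
# If the write is rejected because order_id is declared NOT NULL, build the
# DataFrame with an explicit schema that marks order_id as non-nullable.
from pyspark.sql.functions import current_timestamp
data = [
("O-001", "C-101", 99.99, "PLACED"),
("O-002", "C-102", 149.50, "SHIPPED"),
]
df = spark.createDataFrame(data, ["order_id", "customer_id", "amount", "status"]) \
.withColumn("created_at", current_timestamp())
df.writeTo("glue_catalog.silver.orders").append()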
# ── Query the table ───────────────────────────────────────────────
spark.table("glue_catalog.silver.orders").show()
In Hive-style partitioning (and Delta), you must include the partition column value explicitly: partitionBy("year", "month", "day") and then manually add year/month/day columns. Iceberg's hidden partitioning lets you partition on transforms of a column — days(created_at), months(created_at), bucket(16, customer_id) — without adding extra columns to your data. Iceberg computes the partition value automatically from the existing column.
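To show what "computes the partition value automatically" means, here is a sketch of three of the transforms in plain Python. It follows the definitions as we understand them from the Iceberg spec (days and months counted from 1970-01-01, truncate as a string prefix). The bucket transform is omitted because it depends on a specific Murmur3 hash.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def days(ts):
    """Whole days since 1970-01-01 (the value behind days(col))."""
    return (ts - EPOCH).days


def months(ts):
    """Whole months since 1970-01 (the value behind months(col))."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


def truncate(width, value):
    """String prefix of at most `width` characters (the value behind truncate(width, col))."""
    return value[:width]


occurred_at = datetime(2024, 1, 15, 8, 30, tzinfo=timezone.utc)
print(days(occurred_at), months(occurred_at), truncate(3, "SKU-12345"))
```

Because the partition value is a pure function of the source column, Iceberg can apply the same function to a filter on occurred_at and prune partitions without the query ever mentioning a partition column.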
-- Hidden partitioning on a timestamp column
CREATE TABLE glue_catalog.silver.events (
event_id STRING,
user_id STRING,
event_type STRING,
occurred_at TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(occurred_at)); -- automatically partitions by date
-- NO year/month/day columns in the data! Iceberg handles it internally.
-- Query: Iceberg rewrites WHERE occurred_at > '2024-01-01' as a partition filter
SELECT * FROM glue_catalog.silver.events
WHERE occurred_at >= TIMESTAMP '2024-01-15 00:00:00';
-- Iceberg automatically prunes partitions before 2024-01-15. No manual filter!
-- Other transform options:
-- PARTITIONED BY (hours(occurred_at)) -- hourly partitions
-- PARTITIONED BY (months(occurred_at)) -- monthly partitions
-- PARTITIONED BY (years(occurred_at)) -- yearly partitions
-- PARTITIONED BY (bucket(16, customer_id)) -- hash bucket by customer_id
-- PARTITIONED BY (truncate(10, product_code)) -- string prefix partitioning
-- PARTITIONED BY (days(created_at), bucket(8, customer_id)) -- multi-level
Iceberg versions data through snapshots. Each write creates a new snapshot with a unique snapshot ID and timestamp. You can query any snapshot by ID or timestamp — just like Delta time travel.
# ── View all snapshots ────────────────────────────────────────────
spark.sql("""
SELECT snapshot_id, committed_at, operation, summary
FROM glue_catalog.silver.orders.snapshots
""").show()
# ── Time travel by snapshot ID ────────────────────────────────────
spark.sql("""
SELECT * FROM glue_catalog.silver.orders
VERSION AS OF 12345678901234567
""").show()
# ── Time travel by timestamp ──────────────────────────────────────
spark.sql("""
SELECT * FROM glue_catalog.silver.orders
TIMESTAMP AS OF '2024-01-14 23:59:59'
""").show()
# ── Iceberg DataFrame API time travel ────────────────────────────
df_old = spark.read \
.option("snapshot-id", "12345678901234567") \
.format("iceberg") \
.load("glue_catalog.silver.orders")
# ── Roll back to a previous snapshot ─────────────────────────────
spark.sql("""
CALL glue_catalog.system.rollback_to_snapshot(
'silver.orders', 12345678901234567
)
""")
# ── Expire old snapshots (cleanup) ───────────────────────────────
spark.sql("""
CALL glue_catalog.system.expire_snapshots(
table => 'silver.orders',
older_than => TIMESTAMP '2024-01-01 00:00:00',
retain_last => 5
)
""")
Iceberg supports MERGE INTO, UPDATE, and DELETE via standard SQL. This is especially powerful with Athena — you can run UPDATE and DELETE directly from Athena without Spark.
-- MERGE INTO (upsert)
MERGE INTO glue_catalog.silver.orders AS t
USING glue_catalog.staging.orders_updates AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN
UPDATE SET t.status = s.status, t.amount = s.amount
WHEN NOT MATCHED THEN
INSERT (order_id, customer_id, amount, status, created_at)
VALUES (s.order_id, s.customer_id, s.amount, s.status, s.created_at);
-- UPDATE specific rows
UPDATE glue_catalog.silver.orders
SET status = 'DELIVERED'
WHERE order_id = 'O-001';
-- DELETE rows (GDPR erasure)
DELETE FROM glue_catalog.silver.orders
WHERE customer_id = 'C-101';
-- These all work in Athena natively (no Spark required for Iceberg!)
-- This is Iceberg's biggest advantage over Delta on AWS.
Iceberg's compaction is done via a stored procedure: rewrite_data_files. It consolidates small files into larger ones and can also re-sort the data for better data skipping. Run it on a schedule after heavy streaming or incremental loads.
# ── Compact small files → 128 MB target size ─────────────────────
spark.sql("""
CALL glue_catalog.system.rewrite_data_files(
table => 'silver.orders',
options => map(
'target-file-size-bytes', '134217728',
'min-input-files', '5',
'max-concurrent-file-group-rewrites', '10'
)
)
""")
# ── Sort while compacting (like ZORDER in Delta) ──────────────────
spark.sql("""
CALL glue_catalog.system.rewrite_data_files(
table => 'silver.orders',
strategy => 'sort',
sort_order => 'order_id ASC NULLS LAST, customer_id ASC',
options => map('target-file-size-bytes', '134217728')
)
""")
# ── Remove orphan files not referenced by any snapshot ───────────
spark.sql("""
CALL glue_catalog.system.delete_orphan_files(
table => 'silver.orders',
older_than => TIMESTAMP '2024-01-01 00:00:00'
)
""")
# ── Compact position-delete files (maintenance for merge-on-read tables) ──
spark.sql("""
CALL glue_catalog.system.rewrite_position_delete_files(
table => 'silver.orders'
)
""")
Iceberg supports two strategies for row-level updates and deletes. Copy-on-Write (CoW) rewrites the affected data files immediately on every update/delete — reads are fast (no merging needed), but writes are expensive. Merge-on-Read (MoR) writes a small delete/update file rather than rewriting the whole data file — writes are fast, but reads need to merge the data file with the delete file. CoW is better for read-heavy tables; MoR is better for CDC-heavy tables with frequent small updates.
| Strategy | Write Cost | Read Cost | Best For |
|---|---|---|---|
| Copy-on-Write (default) | Higher — rewrites files | Lower — no merge needed | Gold layer, dashboard tables, read-heavy |
| Merge-on-Read | Lower — small delete files | Higher — merge at read time | CDC landing, Silver layer with frequent updates |
-- Default: Copy-on-Write for all operations
CREATE TABLE glue_catalog.gold.orders_summary (...)
USING iceberg
TBLPROPERTIES (
'write.delete.mode' = 'copy-on-write',
'write.update.mode' = 'copy-on-write',
'write.merge.mode' = 'copy-on-write'
);
-- Merge-on-Read for CDC landing tables (fast writes)
CREATE TABLE glue_catalog.silver.orders_cdc_landing (...)
USING iceberg
TBLPROPERTIES (
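'write.delete.mode' = 'merge-on-read',
'write.update.mode' = 'merge-on-read',
'write.merge.mode' = 'merge-on-read', -- otherwise MERGE INTO stays copy-on-write
'format-version' = '2' -- required for MoR (row-level delete files)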
);
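The difference between the two strategies can be sketched in a few lines of plain Python. This is a toy model, not Iceberg code: a "data file" is just a list of rows, and the helper names are hypothetical.

```python
# A "data file" is modeled as a list of rows (illustrative only).
data_file = [
    {"order_id": "O-001", "status": "PLACED"},
    {"order_id": "O-002", "status": "SHIPPED"},
    {"order_id": "O-003", "status": "PLACED"},
]

def delete_copy_on_write(rows, order_id):
    """CoW: rewrite the whole file without the deleted row."""
    return [row for row in rows if row["order_id"] != order_id]

def delete_merge_on_read(rows, order_id):
    """MoR: leave the data file untouched and record the deleted positions."""
    return {pos for pos, row in enumerate(rows) if row["order_id"] == order_id}

def read_merge_on_read(rows, deleted_positions):
    """MoR read path: filter out deleted positions while scanning."""
    return [row for pos, row in enumerate(rows) if pos not in deleted_positions]

cow_file = delete_copy_on_write(data_file, "O-002")          # expensive write, plain read
position_deletes = delete_merge_on_read(data_file, "O-002")  # cheap write
mor_view = read_merge_on_read(data_file, position_deletes)   # extra work on every read

print(cow_file == mor_view)
```

Both paths return the same rows to the reader; what changes is whether the cost is paid once at write time or on every read until the delete files are compacted.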
| Feature | Delta Lake | Apache Iceberg |
|---|---|---|
| ACID Transactions | ✅ Transaction log | ✅ Snapshot-based |
| Time Travel | ✅ version/timestamp | ✅ snapshot-based |
| MERGE INTO | ✅ Full MERGE API | ✅ MERGE/UPDATE/DELETE |
| Athena Write (MERGE) | ❌ Read-only via manifest | ✅ Native MERGE in Athena |
| Partition Evolution | Limited | ✅ Full partition spec evolution |
| Hidden Partitioning | ❌ Manual year/month/day columns | ✅ days()/months()/bucket() |
| Small File Compaction | ✅ OPTIMIZE | ✅ rewrite_data_files |
| Data Skipping | ✅ Column stats + ZORDER | ✅ Column stats per manifest |
| Merge-on-Read | ⚠️ Via deletion vectors (newer Delta releases) | ✅ format-version=2 |
| Glue 4.0 Native | ✅ --datalake-formats delta | ✅ --datalake-formats iceberg |
- Raw Parquet on S3 lacks ACID, UPDATE/DELETE, schema evolution safety, and time travel — open table formats solve all four.
- Delta Lake uses a _delta_log/ transaction log. ACID via atomic commits. Time travel by version or timestamp. MERGE INTO for upserts and CDC. OPTIMIZE + ZORDER for query performance. VACUUM for cost control.
- Apache Iceberg uses a three-tier metadata structure (metadata → manifest list → manifest files). Hidden partitioning eliminates manual year/month/day columns. Native Athena MERGE/UPDATE/DELETE. Copy-on-Write (read-optimized) or Merge-on-Read (write-optimized) strategies.
- On AWS: use Iceberg when Athena is a primary interface (native DML support). Use Delta when Databricks or Glue ETL is the primary compute and Athena is secondary.
- OPTIMIZE/ZORDER (Delta) and rewrite_data_files (Iceberg) must run on a schedule — streaming and CDC create many small files that degrade query performance without compaction.
- VACUUM (Delta) and expire_snapshots (Iceberg) reclaim S3 storage from old versions — run at minimum weekly with 7-day retention to balance time travel and cost.
Data Modeling for Data Engineers
Dimensional modeling, fact and dimension table design, Slowly Changing Dimensions (SCD), Data Vault, and dbt — the techniques used to turn raw lake data into a well-structured, query-friendly Gold layer.
Dimensional modeling is the most widely used technique in data warehousing. It organizes data into two types of tables: fact tables (what happened — metrics, events, measurements) and dimension tables (context — who, what, where, when). It was popularized by Ralph Kimball and is the backbone of most Gold-layer designs in a lakehouse.
Fact tables hold measurable numeric data (revenue, quantity, duration) and foreign keys to dimension tables. The grain of a fact table is the most important design decision — it defines exactly what one row represents.
| Type | What One Row Represents | Example |
|---|---|---|
| Transaction Fact | One discrete event | One order line item on a specific date |
| Periodic Snapshot | State at a fixed interval | Account balance at end of each month |
| Accumulating Snapshot | Progress of a process through stages | Order lifecycle: placed → shipped → delivered |
| Factless Fact | An event with no numeric measure | Student enrolled in a course (just the relationship) |
CREATE TABLE fact_order_lines (
order_line_sk BIGINT, -- surrogate key
order_date_sk INT, -- FK → dim_date
customer_sk INT, -- FK → dim_customer
product_sk INT, -- FK → dim_product
store_sk INT, -- FK → dim_store
quantity INT,
unit_price DECIMAL(10,2),
discount_amount DECIMAL(10,2),
net_revenue DECIMAL(10,2),
load_dts TIMESTAMP -- audit column
);
from pyspark.sql import functions as F
# Read silver layer (cleaned orders)
orders = spark.table("silver.orders")
order_items = spark.table("silver.order_items")
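# Join and compute measures
# Assumes the silver tables already carry customer_sk, product_sk and store_sk
# (resolved against the dimensions upstream); rename order_line_id to order_line_sk if the DDL above is used as-is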
fact_df = order_items.join(orders, "order_id") \
.withColumn("net_revenue", F.col("unit_price") * F.col("quantity") - F.col("discount_amount")) \
.withColumn("order_date_sk", F.date_format("order_date", "yyyyMMdd").cast("int")) \
.withColumn("load_dts", F.current_timestamp()) \
.select(
"order_line_id", "order_date_sk", "customer_sk",
"product_sk", "store_sk", "quantity",
"unit_price", "discount_amount", "net_revenue", "load_dts"
)
# Write to Gold Delta table
fact_df.write.format("delta").mode("append") \
.saveAsTable("gold.fact_order_lines")
Dimension tables provide the descriptive context for facts. A conformed dimension is shared across multiple fact tables (e.g., dim_date used in sales, HR, and finance facts). A degenerate dimension is a dimension with no attributes beyond its key — it lives on the fact table itself (e.g., order number). A junk dimension bundles several low-cardinality flags and indicators into one table to avoid cluttering the fact table (e.g., is_rush_order, is_gift_wrapped, is_loyalty_member combined).
CREATE TABLE dim_date (
date_sk INT PRIMARY KEY, -- 20240115
full_date DATE,
year INT,
quarter INT,
month INT,
month_name STRING,
week_of_year INT,
day_of_week INT,
day_name STRING,
is_weekend BOOLEAN,
is_holiday BOOLEAN,
fiscal_year INT,
fiscal_quarter INT
);
from pyspark.sql import functions as F
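# Generate a date sequence from 2020-01-01 to 2030-12-31
# (is_holiday and the fiscal columns from the DDL need a business calendar and are not derived here)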
date_range = spark.sql("SELECT sequence(to_date('2020-01-01'), to_date('2030-12-31'), interval 1 day) as date_array") \
.select(F.explode("date_array").alias("full_date"))
dim_date = date_range.select(
F.date_format("full_date", "yyyyMMdd").cast("int").alias("date_sk"),
F.col("full_date"),
F.year("full_date").alias("year"),
F.quarter("full_date").alias("quarter"),
F.month("full_date").alias("month"),
F.date_format("full_date", "MMMM").alias("month_name"),
F.weekofyear("full_date").alias("week_of_year"),
F.dayofweek("full_date").alias("day_of_week"),
F.date_format("full_date", "EEEE").alias("day_name"),
(F.dayofweek("full_date").isin([1, 7])).alias("is_weekend")
)
dim_date.write.format("delta").mode("overwrite") \
.saveAsTable("gold.dim_date")
Instead of adding is_rush_order, is_gift_wrapped, is_loyalty_member directly to the fact table, you create dim_order_flags with all combinations (2³ = 8 rows max), then store only the order_flag_sk on the fact. This keeps fact tables lean.
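A junk dimension like this can be generated rather than maintained by hand. The following is a minimal pure-Python sketch, assuming the three boolean flags named above; the surrogate key numbering and the order_flag_sk lookup helper are illustrative choices, not a prescribed design.

```python
from itertools import product

FLAGS = ("is_rush_order", "is_gift_wrapped", "is_loyalty_member")

# One row per combination of flag values: 2 ** 3 = 8 rows.
dim_order_flags = [
    {"order_flag_sk": sk, **dict(zip(FLAGS, combo))}
    for sk, combo in enumerate(product((False, True), repeat=len(FLAGS)), start=1)
]

# Lookup used while loading the fact table: flag values -> surrogate key.
flag_lookup = {
    tuple(row[flag] for flag in FLAGS): row["order_flag_sk"]
    for row in dim_order_flags
}

def order_flag_sk(is_rush_order: bool, is_gift_wrapped: bool, is_loyalty_member: bool) -> int:
    """Return the junk-dimension key for one order's flag values."""
    return flag_lookup[(is_rush_order, is_gift_wrapped, is_loyalty_member)]

print(len(dim_order_flags), order_flag_sk(True, False, True))
```

The fact row then stores a single integer instead of three boolean columns, and adding a fourth flag only doubles the dimension to 16 rows.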
A star schema has the fact table at the center connected directly to denormalized dimension tables — one join to get all context. A snowflake schema normalizes dimensions further into sub-dimensions (e.g., dim_product → dim_category → dim_department). Star is preferred in analytics because fewer joins = faster queries and simpler SQL; snowflake is used when storage is a concern or dimensions are extremely wide.
| Aspect | Star Schema | Snowflake Schema |
|---|---|---|
| Dimension normalization | Denormalized (wide tables) | Normalized (sub-dimensions) |
| Query complexity | Simple — fewer joins | More joins needed |
| Storage | Slightly more (duplicated values) | Less storage |
| Preferred for | OLAP, BI tools, Redshift/Snowflake queries | Normalized source systems |
The grain is the precise definition of what one row in a fact table means. You must declare it before writing a single line of SQL. Get it wrong and your queries will double-count or miss records silently. Common grains: one row per order line, one row per daily account snapshot, one row per user session event.
-- GRAIN: One row per order line item
-- Primary Key: (order_id, product_sk, order_date_sk)
-- Each row = one product sold in one order on one date
-- Do NOT aggregate across product_sk without GROUP BY
CREATE TABLE gold.fact_order_lines (
order_id STRING,
product_sk INT,
order_date_sk INT,
quantity INT,
net_revenue DECIMAL(10,2)
);
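The silent double-counting that a wrong grain causes is easy to reproduce. Below is a small pure-Python sketch with made-up sample data: an order-level measure (order_total) is copied onto every line item, and summing it at the line grain inflates the result.

```python
# One order with two line items (sample data for illustration).
orders = [{"order_id": "O-001", "order_total": 150.00}]
order_lines = [
    {"order_id": "O-001", "product_sk": 1, "net_revenue": 100.00},
    {"order_id": "O-001", "product_sk": 2, "net_revenue": 50.00},
]

totals = {order["order_id"]: order["order_total"] for order in orders}

# Joining an order-grain measure onto line-grain rows repeats it per line.
joined = [{**line, "order_total": totals[line["order_id"]]} for line in order_lines]

double_counted = sum(row["order_total"] for row in joined)  # order-grain measure at line grain
correct = sum(row["net_revenue"] for row in joined)         # line-grain measure at line grain

print(double_counted, correct)
```

No error is raised and the query looks reasonable, which is why the grain has to be declared up front and every measure checked against it.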
Dimension attributes change over time (a customer moves cities, a product changes category). SCD strategies define how to handle those changes in the dimension table.
| Type | Strategy | History Kept? | Use When |
|---|---|---|---|
| SCD Type 1 | Overwrite the old value | No | Corrections, typos, no historical analysis needed |
| SCD Type 2 | Add a new row, keep old row | Yes (full) | Track history — "what was the customer's city when they ordered?" |
| SCD Type 3 | Add a "previous value" column | Partial (1 prior) | Only need to know one previous value (e.g., last city) |
CREATE TABLE dim_customer (
customer_sk BIGINT, -- surrogate key (auto-increment or hash)
customer_id STRING, -- natural / business key
customer_name STRING,
city STRING,
state STRING,
effective_start_date DATE, -- when this version became active
effective_end_date DATE, -- NULL while the row is current (some teams use 9999-12-31 instead)
is_current BOOLEAN, -- TRUE for the active row
load_dts TIMESTAMP
);
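from delta.tables import DeltaTable
from pyspark.sql import functions as F
# New/changed customers arriving from source
updates = (
    spark.table("silver.customer_updates")
    .withColumn("effective_start_date", F.current_date())
    .withColumn("effective_end_date", F.lit(None).cast("date"))
    .withColumn("is_current", F.lit(True))
)
dim_table = DeltaTable.forName(spark, "gold.dim_customer")
# Step 1: expire current rows whose tracked attributes changed
# (<=> is null-safe equality, so NULL city/state values are compared correctly)
dim_table.alias("dim").merge(
    updates.alias("src"),
    "dim.customer_id = src.customer_id AND dim.is_current = true AND "
    "NOT (dim.city <=> src.city AND dim.state <=> src.state)"
).whenMatchedUpdate(set={
    "is_current": "false",
    "effective_end_date": "current_date()"
}).execute()
# Step 2: insert new versions only for customers that no longer have a current row
# (brand-new customers plus those expired in Step 1). Appending every update would
# create a duplicate current row for each unchanged customer.
still_current = spark.table("gold.dim_customer").filter("is_current = true")
new_versions = updates.join(still_current.select("customer_id"), "customer_id", "left_anti")
# customer_sk and load_dts are not derived here; populate them before the append if required
new_versions.write.format("delta").mode("append") \
    .saveAsTable("gold.dim_customer")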
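The expire-then-insert logic is easier to verify on a tiny in-memory example. This pure-Python sketch follows the same two steps under simplifying assumptions: city and state are the only tracked attributes, surrogate keys are sequential integers, and the function name apply_scd2 is hypothetical.

```python
from datetime import date

TRACKED = ("city", "state")  # attributes whose changes create a new version

def apply_scd2(dim_rows, updates, today):
    """Return a new list of dimension rows with SCD Type 2 changes applied."""
    result = [dict(row) for row in dim_rows]  # copy so the input is untouched
    current = {row["customer_id"]: row for row in result if row["is_current"]}
    next_sk = max((row["customer_sk"] for row in result), default=0) + 1
    for upd in updates:
        existing = current.get(upd["customer_id"])
        if existing and all(existing[col] == upd[col] for col in TRACKED):
            continue  # unchanged: keep the current row as-is
        if existing:  # changed: close out the old version
            existing["is_current"] = False
            existing["effective_end_date"] = today
        new_row = {
            "customer_sk": next_sk,
            "customer_id": upd["customer_id"],
            "city": upd["city"],
            "state": upd["state"],
            "effective_start_date": today,
            "effective_end_date": None,
            "is_current": True,
        }
        result.append(new_row)
        current[upd["customer_id"]] = new_row
        next_sk += 1
    return result

dim = [{"customer_sk": 1, "customer_id": "C-101", "city": "Austin", "state": "TX",
        "effective_start_date": date(2023, 1, 1), "effective_end_date": None,
        "is_current": True}]
updates = [
    {"customer_id": "C-101", "city": "Dallas", "state": "TX"},  # moved
    {"customer_id": "C-102", "city": "Boise", "state": "ID"},   # new customer
]
new_dim = apply_scd2(dim, updates, today=date(2024, 1, 15))
for row in new_dim:
    print(row["customer_sk"], row["customer_id"], row["city"], row["is_current"])
```

Running the same updates a second time changes nothing, which is the property to check in any SCD Type 2 load: unchanged customers must not receive a new version.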
from delta.tables import DeltaTable
updates = spark.table("silver.customer_updates")
dim_table = DeltaTable.forName(spark, "gold.dim_customer_scd1")
# Just overwrite the attributes — no history
dim_table.alias("dim").merge(
updates.alias("src"),
"dim.customer_id = src.customer_id"
).whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
CREATE TABLE dim_customer_scd3 (
customer_sk BIGINT,
customer_id STRING,
current_city STRING,
previous_city STRING, -- only one prior value stored
city_changed_date DATE
);
Data Vault is a modeling methodology designed for enterprise raw vaults where auditability, scalability, and parallel loading matter more than query simplicity. It separates data into three entity types.
| Entity | Purpose | Example |
|---|---|---|
| Hub | Stores only the unique business keys | hub_customer — customer_id only |
| Link | Stores relationships between hubs | link_order_customer — order_id + customer_id |
| Satellite | Stores descriptive attributes + history | sat_customer_details — name, city, updated_at |
dbt (data build tool) is the industry-standard tool for writing, testing, and documenting SQL-based transformations in your warehouse or lakehouse. It replaces ad-hoc SQL scripts with version-controlled, tested, documented models. A model is just a .sql file that defines a SELECT — dbt materializes it as a table or view.
-- models/staging/stg_orders.sql
-- Materialization: view (lightweight, always fresh)
{{ config(materialized='view') }}
SELECT
order_id,
customer_id,
order_date::date AS order_date,
total_amount::decimal(10,2) AS total_amount,
status,
CURRENT_TIMESTAMP AS _loaded_at
FROM {{ source('raw', 'orders') }}
WHERE order_id IS NOT NULL
-- models/marts/fct_orders.sql
{{ config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge'
) }}
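SELECT
    o.order_id,
    o.customer_id,
    o.order_date, -- kept so the incremental filter below can read it from the target table
    d.date_sk AS order_date_sk,
    o.total_amount,
    o.status,
    CURRENT_TIMESTAMP AS load_dts
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('dim_date') }} d ON o.order_date = d.full_date
{% if is_incremental() %}
-- Only process rows at or after the max date already in the table.
-- >= re-reads the latest day so late rows are not missed; the merge on order_id prevents duplicates.
WHERE o.order_date >= (SELECT MAX(order_date) FROM {{ this }})
{% endif %}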
version: 2
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customer')
              field: customer_id
      - name: total_amount
        tests:
          - not_null
          # accepted_values checks set membership, not a range;
          # a range check needs the dbt_utils package
          - dbt_utils.accepted_range:
              min_value: 0.01
              max_value: 999999.99
from airflow.operators.bash import BashOperator
# In an Airflow DAG: BashOperator tasks that run dbt.
# The profiles directory is passed with --profiles-dir; no env override is used because
# BashOperator's env argument replaces the inherited environment (including PATH) by default.
run_dbt = BashOperator(
    task_id="run_dbt_gold_models",
    bash_command="""
    cd /opt/dbt/my_project &&
    dbt run --select tag:gold --target prod --profiles-dir /opt/dbt/profiles
    """,
)
# Run tests after models complete
test_dbt = BashOperator(
    task_id="test_dbt_gold_models",
    bash_command="""
    cd /opt/dbt/my_project &&
    dbt test --select tag:gold --target prod --profiles-dir /opt/dbt/profiles
    """,
)
run_dbt >> test_dbt
| Materialization | What it Creates | When to Use |
|---|---|---|
| view | SQL view — no data stored | Staging / lightweight transforms |
| table | Full table, rebuilt every run | Small-medium marts |
| incremental | Table, only new rows merged/appended | Large fact tables (avoid full rebuild) |
| ephemeral | CTE — no physical object created | Intermediate logic reused within a run |
Spark Performance Engineering
Partitioning strategy, join optimization, Adaptive Query Execution (AQE), caching, and small-file compaction — the core levers for tuning Spark pipelines at scale.
Spark processes data in partitions — chunks that run in parallel across executor cores. The number and size of partitions directly determines whether your job is fast, slow, or crashes with OOM errors. Too few partitions = underutilized cluster. Too many = scheduling overhead and tiny tasks.
spark.sql.shuffle.partitions controls how many partitions are created after a shuffle (groupBy, join, orderBy). The default is 200 — correct for large clusters, terrible for small datasets (200 tiny partitions waste scheduling overhead). Rule of thumb: target 128 MB – 200 MB per partition after the shuffle.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("perf-tuning").getOrCreate()
# Default is 200 — too high for small/medium data
print(spark.conf.get("spark.sql.shuffle.partitions")) # 200
# For a 10 GB dataset with 200 MB target size → 50 partitions
spark.conf.set("spark.sql.shuffle.partitions", "50")
# For very large datasets (1 TB+) you might go higher
spark.conf.set("spark.sql.shuffle.partitions", "2000")
# With AQE enabled, Spark adjusts this automatically at runtime
spark.conf.set("spark.sql.adaptive.enabled", "true")
# AQE coalesces small post-shuffle partitions dynamically
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
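The rule of thumb above is simple arithmetic, sketched here as a small helper. The function name and the choice of megabytes as the unit are illustrative assumptions; the estimated shuffle size has to come from your own measurements (for example the Shuffle Write figure in the Spark UI).

```python
import math

def shuffle_partitions(shuffle_size_mb: float,
                       target_partition_mb: float = 200,
                       min_partitions: int = 1) -> int:
    """Partition count that keeps each post-shuffle partition near the target size."""
    return max(min_partitions, math.ceil(shuffle_size_mb / target_partition_mb))

print(shuffle_partitions(10_000))         # ~10 GB shuffle at 200 MB per partition
print(shuffle_partitions(1_000_000, 128)) # ~1 TB shuffle at 128 MB per partition
print(shuffle_partitions(50))             # tiny shuffle: a single partition is enough
```

The result would be passed to spark.conf.set("spark.sql.shuffle.partitions", str(n)). With AQE coalescing enabled, it is common to set a generous upper bound instead and let Spark merge small partitions at runtime.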
repartition(n) performs a full shuffle — data moves across the network to create exactly n evenly distributed partitions. Use it to increase partitions or when you need balanced distribution before a join. coalesce(n) is a narrow transformation — it merges partitions locally without a shuffle. Use it to decrease partitions cheaply (e.g., before writing output to avoid thousands of small files).
df = spark.table("silver.events")
print(f"Original partitions: {df.rdd.getNumPartitions()}") # e.g. 800
# repartition — full shuffle, balanced, use before heavy joins
df_balanced = df.repartition(200)
# Can also repartition by column — co-locates same keys together
df_by_date = df.repartition(200, "event_date")
# coalesce — no shuffle, just merges existing partitions
# Use before writing output to reduce small files
df_small = df.coalesce(10)
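# Write with coalesce to get at most 10 output files
df_small.write.mode("overwrite").parquet("s3://bucket/output/")
# Using repartition before a write just to reduce the file count costs a full shuffle;
# coalesce is usually cheaper. Caveat: because coalesce is narrow, a drastic reduction
# (e.g. 800 → 10) also runs the upstream stage on only 10 tasks. If that stage is heavy,
# repartition(10) can be faster overall.
# df.repartition(10).write... ← wasteful if only reducing count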
| Aspect | repartition(n) | coalesce(n) |
|---|---|---|
| Shuffle | Yes — full network shuffle | No — local merge only |
| Partition balance | Roughly even (round-robin; hash-based when repartitioning by column) | Uneven (merges as-is) |
| Increasing partitions | Yes | No |
| Decreasing partitions | Yes (expensive) | Yes (cheap) |
| Best use | Before join, balancing skewed data | Before write, reducing output files |
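The balance difference in the table can be illustrated with a toy simulation in plain Python. This is a simplified model, not Spark's algorithm: here coalesce merges neighbouring partitions, whereas Spark groups them by locality, and repartition is modeled as a round-robin redistribution of every row.

```python
def coalesce(partitions, n):
    """Merge existing partitions into n groups without moving individual rows."""
    if n >= len(partitions):
        return [list(part) for part in partitions]  # coalesce cannot increase the count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged

def repartition(partitions, n):
    """Redistribute every row round-robin across n new partitions (a full shuffle)."""
    out = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out

# Four input partitions of very different sizes: 100, 5, 5 and 10 rows.
partitions = [list(range(100)), list(range(5)), list(range(5)), list(range(10))]

coalesced_sizes = [len(p) for p in coalesce(partitions, 2)]
repartitioned_sizes = [len(p) for p in repartition(partitions, 2)]
print(coalesced_sizes, repartitioned_sizes)
```

The coalesced output inherits the imbalance of its inputs, while the repartitioned output is even at the price of touching every row.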
When writing to Delta/Parquet, partitionBy() physically organizes files on disk by column values. This enables partition pruning — downstream jobs and queries skip entire folders when filtering on those columns.
# Writing partitioned by date columns — Athena/Spark can skip entire date folders
df.write \
.format("delta") \
.mode("overwrite") \
.partitionBy("year", "month", "day") \
.save("s3://my-lake/gold/events/")
# Reading: Spark pushes the filter down and skips irrelevant partitions,
# so only the year=2024/month=1/ folders are read.
# (A comment may not follow a line-continuation backslash, so it lives up here.)
spark.read.format("delta") \
    .load("s3://my-lake/gold/events/") \
    .filter("year = 2024 AND month = 1") \
    .count()
Never partition by high-cardinality columns like user_id or order_id: this creates millions of tiny files (the small files problem). Good partition columns have low cardinality: date parts, region, status, category.
Joins are the most expensive operation in Spark. Choosing the wrong join strategy on large tables causes massive shuffles, OOM errors, and hour-long jobs. Understanding when Spark picks each join type — and how to force the right one — is a core senior DE skill.
A broadcast join sends a copy of the small table to every executor, so the large table never moves. It completely eliminates the shuffle. Spark automatically broadcasts tables under spark.sql.autoBroadcastJoinThreshold (default 10 MB). You can force it with a hint.
from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast
large_df = spark.table("silver.transactions") # 500 GB
small_df = spark.table("gold.dim_store") # 2 MB
# Auto broadcast (if small_df < autoBroadcastJoinThreshold)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024)) # 50 MB
# Force broadcast with hint — even if Spark wouldn't auto-broadcast
result = large_df.join(broadcast(small_df), "store_id")
# SQL hint equivalent
spark.sql("""
SELECT /*+ BROADCAST(s) */ t.*, s.store_name
FROM silver.transactions t
JOIN gold.dim_store s ON t.store_id = s.store_id
""")
When both tables are large, Spark uses a sort-merge join: shuffle both tables by the join key, sort each partition, then merge matching rows. This is the safe default but requires two shuffles (one per table) and sorting — expensive but correct for any data size.
# Force sort-merge join (useful when you've pre-bucketed both tables)
result = spark.sql("""
SELECT /*+ MERGE(t, o) */ t.customer_id, o.order_total
FROM silver.transactions t
JOIN silver.orders o ON t.order_id = o.order_id
""")
Skew happens when one join key value has disproportionately more rows than others (e.g., 80% of orders belong to one customer_id = 'UNKNOWN'). This causes one executor task to process millions of rows while others sit idle — the slowest task determines the total job time.
Salting fixes skew by adding a random suffix to the skewed key, splitting one huge task into many small balanced ones.
from pyspark.sql import functions as F
SALT_FACTOR = 10 # split into 10 buckets
# Large skewed table — add random salt to the join key
transactions = spark.table("silver.transactions") \
.withColumn("salt", (F.rand() * SALT_FACTOR).cast("int")) \
.withColumn("salted_key", F.concat(F.col("customer_id"), F.lit("_"), F.col("salt")))
# Other side of the join: replicate each row once per salt value
dim_customer = spark.table("gold.dim_customer")
salt_range = spark.range(SALT_FACTOR).withColumnRenamed("id", "salt")
dim_exploded = dim_customer.crossJoin(salt_range) \
    .withColumn("salted_key", F.concat(F.col("customer_id"), F.lit("_"), F.col("salt"))) \
    .drop("customer_id", "salt")  # avoid duplicate column names after the join
# Join on salted key: no single task gets all the UNKNOWN rows
result = transactions.join(dim_exploded, "salted_key") \
    .drop("salt", "salted_key")
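The effect of salting on task sizes can be checked without a cluster. The sketch below is a pure-Python simulation under stated assumptions: 8 shuffle partitions, a CRC32-based stand-in for Spark's hash partitioner, and a synthetic key distribution in which 80% of rows share the key UNKNOWN.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8
SALT_FACTOR = 10

def partition_for(key: str) -> int:
    """Stand-in for hash partitioning: a stable hash modulo the partition count."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

rng = random.Random(42)  # seeded so the simulation is repeatable

# 80% of rows share one hot key; the rest are spread over 2,000 customers.
keys = ["UNKNOWN"] * 8000 + [f"C-{i}" for i in range(2000)]

unsalted = Counter(partition_for(key) for key in keys)
salted = Counter(partition_for(f"{key}_{rng.randrange(SALT_FACTOR)}") for key in keys)

largest_unsalted = max(unsalted.values())
largest_salted = max(salted.values())
print("largest partition without salt:", largest_unsalted)
print("largest partition with salt:   ", largest_salted)
```

Without salt, every UNKNOWN row lands in the same partition, so one task does most of the work. With salt, the hot key becomes ten distinct keys that spread across partitions, at the cost of replicating the other side of the join ten times.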
Spark 3 introduced AQE-based automatic skew handling, so manual salting is usually unnecessary when AQE is on: AQE detects skewed partitions at runtime and splits them automatically. Databricks Runtime additionally offers an explicit skew join hint.
# Enable AQE — handles skew automatically
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# A partition is skewed if it's > this many times the median partition size
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
# And larger than this absolute threshold
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", str(256 * 1024 * 1024))
# AQE will automatically split the skewed partition during the join
result = transactions.join(dim_customer, "customer_id")
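# Databricks Runtime only: explicit skew hint. Open-source Apache Spark does not
# implement the SKEW hint, so rely on AQE or manual salting there.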
result = spark.sql("""
SELECT /*+ SKEW('t', 'customer_id', ('UNKNOWN', 'GUEST')) */ *
FROM silver.transactions t
JOIN gold.dim_customer c ON t.customer_id = c.customer_id
""")
Predicate pushdown means Spark (via Catalyst optimizer) moves filter conditions as close to the data source as possible — ideally into the file reader or even the storage layer — so less data is ever read into memory. It works automatically with Parquet, ORC, Delta, and Iceberg.
Parquet and ORC store min/max statistics per row group (Parquet) or stripe (ORC). Spark reads these statistics first and skips entire row groups where the filter can't match — without reading the actual data.
df = spark.read.parquet("s3://my-lake/silver/events/")
# This filter gets pushed into the Parquet reader
result = df.filter("event_date = '2024-01-15' AND event_type = 'purchase'")
# Check the explain plan — look for "PushedFilters" in the scan
result.explain(True)
# You'll see: PushedFilters: [IsNotNull(event_date), EqualTo(event_date,2024-01-15)]
# This means Spark is filtering at the file level, not after reading all data
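The min/max skipping described above boils down to a range check per row group. Here is a minimal pure-Python sketch with made-up statistics; the dictionary layout and the function name groups_to_scan are hypothetical and do not reflect the Parquet footer format.

```python
# Per-row-group statistics for the event_date column (sample values).
row_groups = [
    {"id": 0, "min_date": "2024-01-01", "max_date": "2024-01-10"},
    {"id": 1, "min_date": "2024-01-11", "max_date": "2024-01-20"},
    {"id": 2, "min_date": "2024-01-21", "max_date": "2024-01-31"},
]

def groups_to_scan(groups, wanted_date):
    """Keep only row groups whose [min, max] range could contain the value."""
    return [g["id"] for g in groups if g["min_date"] <= wanted_date <= g["max_date"]]

# Equivalent of: WHERE event_date = '2024-01-15'
print(groups_to_scan(row_groups, "2024-01-15"))
# A date outside every range means nothing has to be read at all.
print(groups_to_scan(row_groups, "2024-02-01"))
```

Skipping only works well when the data is sorted or clustered on the filter column. If every row group spans the full date range, the check passes for all of them and nothing is skipped.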
Delta and Iceberg add an additional layer: file-level statistics stored in the transaction log (Delta) or manifest files (Iceberg). Spark checks these stats and skips entire files before even opening them.
# Delta stores per-file min/max statistics in _delta_log (by default for the first 32 columns)
# This query skips files where max(amount) <= 1000
result = spark.read.format("delta") \
.load("s3://my-lake/gold/transactions/") \
.filter("amount > 1000 AND event_date = '2024-01-15'")
# ZORDER clusters data by columns → makes file skipping more effective
# Run this as a maintenance job
spark.sql("""
OPTIMIZE gold.transactions
ZORDER BY (event_date, customer_id)
""")
Partition pruning is the coarsest and most impactful pushdown — Spark skips entire directories (not just row groups) when the filter matches a partition column. The key requirement: the column you filter on must be a partition column (used in partitionBy() at write time).
# Table written with partitionBy("year", "month")
# File layout: .../year=2024/month=01/... year=2024/month=02/... year=2023/...
df = spark.read.format("delta").load("s3://my-lake/gold/events/")
# Spark reads ONLY .../year=2024/month=01/ — skips all other folders
result = df.filter("year = 2024 AND month = 1")
# Check explain — look for "PartitionFilters" in the scan node
result.explain()
# Output shows: PartitionFilters: [isnotnull(year#10), (year#10 = 2024), ...]
# Dynamic partition pruning (DPP) — filter from a join key narrows partitions
large = spark.table("gold.fact_sales") # partitioned by date_sk
small = spark.table("gold.dim_date").filter("fiscal_year = 2024")
# Spark computes dim_date result first, then uses those date_sk values
# to prune partitions in fact_sales — even though fact_sales isn't filtered directly
result = large.join(small, "date_sk")
cache() and persist() store a DataFrame in memory (and optionally disk) so downstream actions don't re-read from source or recompute expensive transformations. Use caching when you reuse a DataFrame more than once in the same job — without it, Spark recomputes the entire lineage from source every time.
from pyspark import StorageLevel
# cache() = shorthand for persist(StorageLevel.MEMORY_AND_DISK)
df_cached = df.cache()
# persist() lets you control exactly where data is stored.
# The calls below are alternatives: a DataFrame keeps its first storage level until unpersist().
# MEMORY_ONLY: fastest, but partitions that don't fit in RAM are recomputed on access
df.persist(StorageLevel.MEMORY_ONLY)
# MEMORY_AND_DISK: spills to disk if memory is full (safest default)
df.persist(StorageLevel.MEMORY_AND_DISK)
# DISK_ONLY: always on disk, slow but survives memory pressure
df.persist(StorageLevel.DISK_ONLY)
# Note: MEMORY_ONLY_SER exists only in the Scala/Java API. PySpark's StorageLevel has no
# *_SER constants, so StorageLevel.MEMORY_ONLY_SER raises AttributeError in Python.
# Always trigger an action to materialize the cache
df_cached.count() # materializes the cache
# Unpersist when done — free up memory for other jobs
df_cached.unpersist()
| Situation | Cache? | Reason |
|---|---|---|
| DataFrame used 2+ times in same job | Yes | Avoids re-reading source and recomputing |
| Iterative ML training loops | Yes | Same dataset used in every iteration |
| foreachBatch with multiple writes | Yes | batchDF used for Delta + Kafka simultaneously |
| Single-use DataFrame (one action) | No | Cache overhead without benefit |
| Very large dataset (fills memory + disk) | No | Spill cost may exceed recompute cost |
| Quickly computed transformation | No | Recomputing is faster than cache read |
The Spark UI (port 4040 while a job runs, History Server after) is your primary diagnostic tool. Every performance problem leaves a fingerprint in the UI — you need to know where to look.
The Jobs → Stages → DAG view shows how Spark decomposed your query into stages separated by shuffle boundaries. Every stage boundary in the DAG marks a shuffle (a wide transformation), and each one is a potential performance bottleneck. Identify the heaviest stages and focus optimization there.
Click into a Stage to see per-task metrics. Key columns to check:
| Metric | What It Means | Problem Signal |
|---|---|---|
| Shuffle Read Size | Data read from shuffle files | Very large → shuffle bottleneck |
| Shuffle Write Size | Data written to shuffle files | Large → consider tuning partitions |
| Spill (Memory) | Size of the spilled data as it was held in memory (deserialized) | Any spill → increase executor memory or reduce partition size |
| Spill (Disk) | Size of the same spilled data once serialized to disk | High → serious memory pressure |
| Task Duration (max vs median) | How long tasks take | Max >> median → data skew |
| GC Time | Garbage collection time per task | >10% of task time → GC pressure, increase memory |
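The "max much greater than median" signal from the table can be turned into a quick check on task durations copied from the Spark UI. This is a small sketch: the function names and the threshold of 5 are illustrative assumptions, not Spark settings, and durations are assumed to be positive.

```python
from statistics import median

def skew_ratio(task_durations_s):
    """Ratio of the slowest task to the median task in a stage."""
    return max(task_durations_s) / median(task_durations_s)

def looks_skewed(task_durations_s, threshold=5.0):
    """Flag a stage when its slowest task is far above the median (threshold is a heuristic)."""
    return skew_ratio(task_durations_s) >= threshold

durations = [10, 12, 11, 9, 120]  # seconds per task; one straggler
print(round(skew_ratio(durations), 2), looks_skewed(durations))
```

A flagged stage is a candidate for the AQE skew settings or manual salting described in the join section.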
If GC time is high, executors are spending more time cleaning up objects than doing actual work. Fix: increase spark.executor.memory, switch to Kryo serialization, or reduce data held in memory per task.
from pyspark.sql import SparkSession
# Parentheses instead of backslashes, so inline comments are valid Python
spark = (
    SparkSession.builder
    .appName("perf-tuned-job")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memoryOverhead", "2g")
    .config("spark.memory.fraction", "0.8")         # 80% of (heap minus 300 MB reserved) for execution+storage
    .config("spark.memory.storageFraction", "0.3")  # 30% of the above protected for caching
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.registrationRequired", "false")
    .getOrCreate()
)
Spill means Spark ran out of execution memory and had to write intermediate data (shuffle buffers, sort buffers, aggregation hash maps) to disk. Because disk I/O replaces in-memory work, the affected stages can slow down severely. The fix depends on the cause:
# Cause 1: Too few partitions → each partition too large → spill
# Fix: increase partition count
spark.conf.set("spark.sql.shuffle.partitions", "500") # was 200
# Cause 2: Executor memory too small
# Fix: increase executor memory (if cluster resources allow)
# spark.executor.memory = "16g"
# Cause 3: Data skew — one giant partition spills while others are fine
# Fix: AQE skew handling or manual salting (see join optimization above)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Cause 4: Expensive aggregation with high cardinality
# Fix: use approx functions where exactness not required
from pyspark.sql import functions as F
df.agg(F.approx_count_distinct("user_id", rsd=0.05)) # much less memory than countDistinct
AQE (introduced in Spark 3.0) re-optimizes the query plan at runtime using actual statistics collected during execution — unlike the Catalyst optimizer which plans ahead of time with estimates. It has three major features that often make jobs significantly faster with zero code changes.
# Enable AQE — should be on by default in Spark 3.2+
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Feature 1: Auto-optimize shuffle partitions
# Coalesces many small post-shuffle partitions into fewer larger ones
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", str(128 * 1024 * 1024)) # 128 MB target
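# Feature 2: Convert sort-merge join → broadcast join at runtime
# If Spark overestimated a table's size at planning time but discovers it is small
# after the shuffle, AQE switches to a broadcast join. The local shuffle reader then
# lets the converted join read shuffle files locally instead of over the network
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")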
# Feature 3: Skew join handling (covered above)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Verify AQE is working — look for "AdaptiveSparkPlan" in explain output
df_result = large_df.join(medium_df, "key").groupBy("category").count()
df_result.explain()
# Output: == Physical Plan ==
# AdaptiveSparkPlan isFinalPlan=false
# +- HashAggregate ... ← AQE manages this
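Serialization converts objects to bytes for network transfer (shuffle) and disk storage (spill, cache). Java serialization is the default for RDD data: simple, but slow, and it produces large byte arrays. Kryo serialization is typically several times faster (the Spark tuning guide cites up to 10x) and more compact, so use it for jobs with heavy RDD shuffles or serialized caching. DataFrame and Dataset operations already use Spark's internal Tungsten binary format, so the gain there is smaller.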
spark = SparkSession.builder \
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
.config("spark.kryo.registrationRequired", "false") \
.getOrCreate()
# Optionally register custom classes for maximum Kryo efficiency.
# This is a startup setting: add it to the builder above, before getOrCreate()
# .config("spark.kryo.classesToRegister", "com.mycompany.MyClass")
The small files problem occurs when a pipeline writes many tiny files over time (e.g., streaming writes every minute, or partitioned writes with too many partitions). Reading thousands of small files is slow — each file has metadata overhead (S3 API call, Parquet footer read). The fix is periodic compaction — merging small files into larger ones (target: 128 MB – 1 GB per file).
Common causes: streaming micro-batch writes (one batch = one file per partition), over-partitioned data (too many partition columns = too many folders), incremental appends without compaction.
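# Read a partition that has accumulated many small files
partition_path = "s3://my-lake/silver/events/year=2024/month=01/day=15/"
# Hypothetical staging location; pick any path outside the source partition
staging_path = "s3://my-lake/_compaction_staging/events/year=2024/month=01/day=15/"
df = spark.read.format("parquet").load(partition_path)
# inputFiles() lists the underlying files; getNumPartitions() would count Spark tasks instead
print(f"Files before compaction: {len(df.inputFiles())}")  # e.g. 800 small files
# Coalesce to target file count (10 GB / 128 MB per file ≈ 80 files)
df_compacted = df.coalesce(80)
# Spark refuses to overwrite a path it is still reading from, so write the
# compacted files to staging first, validate row counts, then swap them in
df_compacted.write.mode("overwrite").parquet(staging_path)
print("Compacted files written to staging")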
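To make the compaction targets concrete, here is a minimal sketch of how a compactor might decide which files to merge: only files below a minimum size are candidates, and candidates are packed into groups of roughly the target size. The function name, the 32 MB candidate threshold, and the simple greedy packing are assumptions for illustration; real engines such as Delta and Iceberg use their own planning logic.

```python
from typing import List

MB = 1024 * 1024

def plan_compaction(file_sizes: List[int],
                    target_bytes: int = 128 * MB,
                    min_file_bytes: int = 32 * MB) -> List[List[int]]:
    """Group files smaller than min_file_bytes into bins of at most target_bytes."""
    small = sorted((s for s in file_sizes if s < min_file_bytes), reverse=True)
    groups: List[List[int]] = []
    current: List[int] = []
    current_total = 0
    for size in small:
        # Close the current group once adding this file would exceed the target
        if current and current_total + size > target_bytes:
            groups.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        groups.append(current)
    return groups

# 64 files of 4 MB plus one healthy 200 MB file that should be left alone
sizes = [4 * MB] * 64 + [200 * MB]
groups = plan_compaction(sizes)
print(len(groups), [sum(g) // MB for g in groups])   # 2 [128, 128]
```

Each group corresponds to one output file, so 64 tiny files become two 128 MB files while the 200 MB file is never rewritten.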
# Delta OPTIMIZE compacts small files into 1 GB files automatically
spark.sql("OPTIMIZE silver.events")
# Optimize only a specific partition — much faster for large tables
spark.sql("OPTIMIZE silver.events WHERE year = 2024 AND month = 1")
# ZORDER co-locates related rows in the same files for better file skipping
spark.sql("OPTIMIZE gold.transactions ZORDER BY (customer_id, event_date)")
# Run as a scheduled Airflow task (e.g., daily at midnight)
# After streaming writes accumulate small files, OPTIMIZE consolidates them
# Iceberg compaction via Spark SQL (requires a configured SparkCatalog; no PyIceberg import is needed)
spark.sql("""
CALL my_catalog.system.rewrite_data_files(
table => 'silver.events',
options => map(
'target-file-size-bytes', '134217728', -- 128 MB
'min-file-size-bytes', '33554432', -- only compact files < 32 MB
'max-concurrent-file-group-rewrites', '10'
)
)
""")
Pipeline Observability & Reliability
Metrics, logging, lineage tracking, SLA monitoring, and alerting — the practices that keep production data pipelines reliable and make failures visible before they become incidents.
Observability is the ability to understand what your pipeline is doing, why it failed, and where data came from — from the outside, by examining its outputs. In data engineering, observability rests on three pillars. Without all three, you are flying blind in production.
| Pillar | Answers | Example in Data Engineering |
|---|---|---|
| Metrics | What happened? How much? How long? | rows_processed=1.2M, duration_seconds=142, dq_failures=3 |
| Logs | Why did it happen? What was the error? | ERROR: NullPointerException at silver.orders line 47, batch_id=20240115 |
| Lineage | Where did this data come from? What consumed it? | gold.daily_revenue ← silver.orders ← bronze.raw_orders ← PostgreSQL.orders |
AWS built-in metrics cover infrastructure (CPU, memory, disk). But pipeline-level metrics — rows processed, DQ failures, pipeline duration — must be published manually from your Glue/EMR/Lambda code using put_metric_data(). These custom metrics are then used to build dashboards and alarms.
import boto3
import time
from datetime import datetime, timezone
cw = boto3.client("cloudwatch", region_name="us-east-1")
def publish_pipeline_metrics(pipeline_name, rows_read, rows_written,
rows_rejected, duration_seconds, dq_score):
"""Publish all key pipeline metrics in a single batch call."""
now = datetime.now(timezone.utc)
dimensions = [{"Name": "PipelineName", "Value": pipeline_name}]
cw.put_metric_data(
Namespace="DataEngineering/Pipelines",
MetricData=[
{
"MetricName": "RowsProcessed",
"Dimensions": dimensions,
"Value": rows_written,
"Unit": "Count",
"Timestamp": now
},
{
"MetricName": "RowsRejected",
"Dimensions": dimensions,
"Value": rows_rejected,
"Unit": "Count",
"Timestamp": now
},
{
"MetricName": "PipelineDurationSeconds",
"Dimensions": dimensions,
"Value": duration_seconds,
"Unit": "Seconds",
"Timestamp": now
},
{
"MetricName": "DataQualityScore",
"Dimensions": dimensions,
"Value": dq_score,
"Unit": "Percent",
"Timestamp": now
},
{
"MetricName": "PipelineSuccess",
"Dimensions": dimensions,
"Value": 1, # 1 = success, publish 0 on failure
"Unit": "Count",
"Timestamp": now
}
]
)
print(f"[METRICS] Published metrics for {pipeline_name}")
# Usage at end of Glue ETL job
start_time = time.time()
# ... your ETL logic here ...
rows_read = 1_200_000
rows_written = 1_198_500
rows_rejected = 1_500
dq_score = 99.87
duration = time.time() - start_time
publish_pipeline_metrics(
pipeline_name = "silver-orders-etl",
rows_read = rows_read,
rows_written = rows_written,
rows_rejected = rows_rejected,
duration_seconds= duration,
dq_score = dq_score
)
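The function above always publishes PipelineSuccess = 1, while its comment says to publish 0 on failure. One possible way to cover both paths is to build the MetricData payload in a small pure function and hand it to put_metric_data from a try/except around the ETL logic. The helper name and its parameters below are hypothetical; the metric names and dimension match the ones used above.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional

def build_metric_data(pipeline_name: str, success: bool,
                      duration_seconds: float,
                      rows_written: Optional[int] = None,
                      now: Optional[datetime] = None) -> List[Dict]:
    """Build a MetricData list for either a successful or a failed run."""
    now = now or datetime.now(timezone.utc)
    dims = [{"Name": "PipelineName", "Value": pipeline_name}]

    def metric(name, value, unit):
        return {"MetricName": name, "Dimensions": dims,
                "Value": value, "Unit": unit, "Timestamp": now}

    data = [
        metric("PipelineSuccess", 1 if success else 0, "Count"),
        metric("PipelineDurationSeconds", duration_seconds, "Seconds"),
    ]
    # Row counts are only meaningful when the run actually finished
    if success and rows_written is not None:
        data.append(metric("RowsProcessed", rows_written, "Count"))
    return data

failed = build_metric_data("silver-orders-etl", success=False, duration_seconds=12.5)
print([m["MetricName"] for m in failed], failed[0]["Value"])
# In the job itself the payload would be sent with:
# cw.put_metric_data(Namespace="DataEngineering/Pipelines", MetricData=failed)
```

Keeping payload construction separate from the API call also makes the metric logic testable without AWS access.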
Use a custom namespace such as DataEngineering/Pipelines or Company/DataPlatform. This groups your custom metrics separately from AWS built-in metrics and makes dashboard creation much cleaner. Use Dimensions (PipelineName, Environment, SourceSystem) to slice metrics in CloudWatch.
A well-designed alerting architecture has a clear flow: metric threshold breach → CloudWatch Alarm → SNS Topic → subscribers (email, Slack webhook, PagerDuty). Each pipeline should have at minimum four alarms: failure alarm, duration alarm, DQ failure alarm, and DLQ depth alarm.
import boto3
cw = boto3.client("cloudwatch", region_name="us-east-1")
sns = boto3.client("sns", region_name="us-east-1")
PIPELINE_NAME = "silver-orders-etl"
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:data-pipeline-alerts"
def create_pipeline_alarms(pipeline_name, sns_topic_arn):
dims = [{"Name": "PipelineName", "Value": pipeline_name}]
# Alarm 1 — Pipeline failure (PipelineSuccess drops to 0)
cw.put_metric_alarm(
AlarmName = f"{pipeline_name}-failure",
AlarmDescription = f"{pipeline_name} did not complete successfully",
Namespace = "DataEngineering/Pipelines",
MetricName = "PipelineSuccess",
Dimensions = dims,
Statistic = "Sum",
Period = 3600, # evaluate 1-hour windows; use 86400 for a once-a-day pipeline, otherwise missing data alarms every hour
EvaluationPeriods= 1,
Threshold = 1,
ComparisonOperator = "LessThanThreshold",
TreatMissingData = "breaching", # missing metric = pipeline didn't run = alert
AlarmActions = [sns_topic_arn],
OKActions = [sns_topic_arn]
)
# Alarm 2 — Duration exceeded SLA (e.g., must finish within 30 min = 1800s)
cw.put_metric_alarm(
AlarmName = f"{pipeline_name}-duration-sla-breach",
AlarmDescription = f"{pipeline_name} exceeded 30-minute SLA",
Namespace = "DataEngineering/Pipelines",
MetricName = "PipelineDurationSeconds",
Dimensions = dims,
Statistic = "Maximum",
Period = 3600,
EvaluationPeriods= 1,
Threshold = 1800, # 30 minutes
ComparisonOperator = "GreaterThanThreshold",
AlarmActions = [sns_topic_arn]
)
# Alarm 3 — DQ score dropped below threshold
cw.put_metric_alarm(
AlarmName = f"{pipeline_name}-dq-failure",
AlarmDescription = f"{pipeline_name} data quality score below 99%",
Namespace = "DataEngineering/Pipelines",
MetricName = "DataQualityScore",
Dimensions = dims,
Statistic = "Minimum",
Period = 3600,
EvaluationPeriods= 1,
Threshold = 99.0,
ComparisonOperator = "LessThanThreshold",
AlarmActions = [sns_topic_arn]
)
# Alarm 4 — DLQ depth (failed messages piling up)
cw.put_metric_alarm(
AlarmName = f"{pipeline_name}-dlq-depth",
AlarmDescription = "Dead letter queue has messages — pipeline errors not being handled",
Namespace = "AWS/SQS",
MetricName = "ApproximateNumberOfMessagesVisible",
Dimensions = [{"Name": "QueueName", "Value": f"{pipeline_name}-dlq"}],
Statistic = "Sum",
Period = 300, # check every 5 minutes
EvaluationPeriods= 1,
Threshold = 1,
ComparisonOperator = "GreaterThanOrEqualToThreshold",
AlarmActions = [sns_topic_arn]
)
print(f"[ALARMS] Created 4 alarms for {pipeline_name}")
create_pipeline_alarms(PIPELINE_NAME, SNS_TOPIC_ARN)
import boto3, json
sns = boto3.client("sns", region_name="us-east-1")
# Create the alert topic
response = sns.create_topic(Name="data-pipeline-alerts")
topic_arn = response["TopicArn"]
# Subscribe an email endpoint
sns.subscribe(TopicArn=topic_arn, Protocol="email",
Endpoint="data-team@company.com")
# Subscribe a Lambda function that forwards to Slack
# (Lambda reads SNS message and calls Slack webhook URL)
sns.subscribe(TopicArn=topic_arn, Protocol="lambda",
Endpoint="arn:aws:lambda:us-east-1:123456789012:function:slack-notifier")
# The Lambda (slack_notifier) looks like this:
import json
import urllib.request

def lambda_handler(event, context):
    SLACK_WEBHOOK = "https://hooks.slack.com/services/T.../B.../..."
    for record in event["Records"]:
        raw = record["Sns"]["Message"]
        try:
            message = json.loads(raw)      # CloudWatch alarm notifications arrive as JSON
        except json.JSONDecodeError:
            message = raw                  # plain-text messages are forwarded unchanged
        alarm = record["Sns"].get("Subject") or "(no subject)"
        payload = {
            "text": f":rotating_light: *PIPELINE ALERT*\n*Alarm:* {alarm}\n*Detail:* {message}"
        }
        req = urllib.request.Request(
            SLACK_WEBHOOK,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"}
        )
        urllib.request.urlopen(req, timeout=10)
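The topic can carry two kinds of messages: CloudWatch alarm notifications, whose Message is a JSON document with fields such as AlarmName, NewStateValue, and NewStateReason, and plain-text messages published by your own code. The sketch below shows one way to turn either kind into a short Slack line. The function name is hypothetical and the sample records are trimmed to the keys the function reads.

```python
import json

def format_sns_record(record: dict) -> str:
    """Return a short Slack-ready summary for one SNS record."""
    sns = record["Sns"]
    subject = sns.get("Subject") or "(no subject)"
    raw = sns.get("Message", "")
    try:
        body = json.loads(raw)
    except (TypeError, ValueError):
        body = None
    if isinstance(body, dict) and "AlarmName" in body:
        # CloudWatch alarm notification: pull out the fields people act on
        detail = (f"{body['AlarmName']} is {body.get('NewStateValue', 'UNKNOWN')}: "
                  f"{body.get('NewStateReason', '')}")
    else:
        # Anything else is passed through as-is
        detail = raw
    return f"*{subject}*\n{detail}"

alarm_record = {"Sns": {
    "Subject": "ALARM: silver-orders-etl-failure",
    "Message": json.dumps({
        "AlarmName": "silver-orders-etl-failure",
        "NewStateValue": "ALARM",
        "NewStateReason": "Threshold Crossed",
    }),
}}
text_record = {"Sns": {"Subject": None, "Message": "SLA BREACH REPORT"}}

print(format_sns_record(alarm_record))
print(format_sns_record(text_record))
```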
Every production pipeline has an SLA — a contractual or internal commitment that data will be available by a certain time (e.g., "daily sales report ready by 06:00 UTC"). SLA monitoring checks whether each pipeline completed on time, and fires an alert before business users notice the data is missing.
import boto3
from datetime import datetime, timezone, timedelta
import json
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
sns = boto3.client("sns", region_name="us-east-1")
AUDIT_TABLE = dynamodb.Table("pipeline_audit")
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:data-pipeline-alerts"
# SLA definitions — pipeline_name → expected completion UTC hour
PIPELINE_SLAS = {
"silver-orders-etl": {"expected_by_hour": 4, "expected_by_minute": 30},
"gold-daily-revenue": {"expected_by_hour": 5, "expected_by_minute": 0},
"gold-executive-kpi": {"expected_by_hour": 6, "expected_by_minute": 0},
}
def lambda_handler(event, context):
"""Run at 06:30 UTC daily — check all pipelines completed on time."""
today = datetime.now(timezone.utc).date().isoformat()
breaches = []
for pipeline_name, sla in PIPELINE_SLAS.items():
# Look up today's run in audit table
response = AUDIT_TABLE.get_item(
Key={"pipeline_name": pipeline_name, "run_date": today}
)
item = response.get("Item")
if not item:
breaches.append(f"❌ {pipeline_name}: NO RUN RECORDED for {today}")
continue
if item.get("status") != "SUCCESS":
breaches.append(f"❌ {pipeline_name}: Status={item.get('status')} on {today}")
continue
# Check if it completed before the SLA deadline
end_time = datetime.fromisoformat(item["end_time"])
deadline = datetime(
end_time.year, end_time.month, end_time.day,
sla["expected_by_hour"], sla["expected_by_minute"],
tzinfo=timezone.utc
)
if end_time > deadline:
delay_mins = int((end_time - deadline).total_seconds() / 60)
breaches.append(
f"⚠️ {pipeline_name}: Completed {delay_mins} min LATE "
f"(finished {end_time.strftime('%H:%M')} UTC, SLA was "
f"{sla['expected_by_hour']:02d}:{sla['expected_by_minute']:02d} UTC)"
)
if breaches:
message = "SLA BREACH REPORT — " + today + "\n\n" + "\n".join(breaches)
sns.publish(
TopicArn = SNS_TOPIC_ARN,
Subject = f"🚨 SLA Breach Detected — {len(breaches)} pipeline(s)",
Message = message
)
print(f"[SLA] Published breach alert: {len(breaches)} pipeline(s)")
else:
print(f"[SLA] All pipelines met SLA for {today} ✅")
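The deadline comparison at the heart of the monitor can be isolated into a pure function, which makes it easy to unit test. This sketch anchors the deadline on the run date rather than on the date of end_time, so a run that finishes after midnight is still reported as late. The function name is hypothetical, and treating naive timestamps as UTC is an assumption.

```python
from datetime import date, datetime, time, timezone

def sla_delay_minutes(run_date: date, end_time: datetime,
                      expected_by_hour: int, expected_by_minute: int) -> int:
    """Return how many whole minutes past the SLA deadline a run finished (0 if on time)."""
    deadline = datetime.combine(
        run_date, time(expected_by_hour, expected_by_minute), tzinfo=timezone.utc
    )
    if end_time.tzinfo is None:
        # Assumption: naive timestamps in the audit table are UTC
        end_time = end_time.replace(tzinfo=timezone.utc)
    late_seconds = (end_time - deadline).total_seconds()
    return max(0, int(late_seconds // 60))

run_day = date(2024, 1, 15)
print(sla_delay_minutes(run_day, datetime(2024, 1, 15, 4, 45, tzinfo=timezone.utc), 4, 30))  # 15
print(sla_delay_minutes(run_day, datetime(2024, 1, 15, 4, 10, tzinfo=timezone.utc), 4, 30))  # 0
```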
import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

sns = boto3.client("sns", region_name="us-east-1")

def sla_miss_callback(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Called by Airflow 2.x when tasks in the DAG miss their SLA.

    Unlike on_failure_callback, this callback receives these five
    positional arguments instead of a context dict.
    """
    missed = ", ".join(f"{s.task_id} ({s.execution_date})" for s in slas)
    sns.publish(
        TopicArn = "arn:aws:sns:us-east-1:123456789012:data-pipeline-alerts",
        Subject  = f"⏰ SLA Miss: {dag.dag_id}",
        Message  = (
            f"DAG {dag.dag_id} has tasks that missed their SLA.\n"
            f"Missed: {missed}\n"
            f"Check Airflow UI for details."
        )
    )

with DAG(
    dag_id            = "silver_orders_etl",
    start_date        = datetime(2024, 1, 1),   # placeholder start date
    catchup           = False,
    schedule_interval = "0 2 * * *",            # run at 02:00 UTC
    sla_miss_callback = sla_miss_callback,
    default_args      = {
        "owner": "data-engineering",
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "sla": timedelta(hours=2)   # measured from the scheduled start of the DAG run, not from task start
    }
) as dag:
    pass  # your tasks here
Data lineage is the record of where data came from, what transformations it went through, and where it landed. It answers critical questions: "Which source tables feed this Gold report?", "If this upstream table is wrong, which dashboards are affected?", "Who changed this data and when?".
import boto3
from datetime import datetime, timezone
import uuid
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
lineage_table = dynamodb.Table("pipeline_lineage")
def record_lineage(run_id, pipeline_name, source_tables,
target_table, transformation_summary, rows_out):
"""Write one lineage record per pipeline run."""
lineage_table.put_item(Item={
"lineage_id" : str(uuid.uuid4()),
"run_id" : run_id,
"pipeline_name" : pipeline_name,
"source_tables" : source_tables, # list of strings
"target_table" : target_table,
"transformation_summary" : transformation_summary,
"rows_out" : rows_out,
"recorded_at" : datetime.now(timezone.utc).isoformat()
})
# Example usage after a Glue ETL job
record_lineage(
run_id = "run-20240115-142300",
pipeline_name = "gold-daily-revenue",
source_tables = ["silver.orders", "silver.order_items", "gold.dim_customer"],
target_table = "gold.fact_daily_revenue",
transformation_summary = "Joined orders + items, grouped by date and store, applied SCD2 customer lookup",
rows_out = 45_230
)
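Once lineage records like the one above accumulate, the impact question ("if this upstream table is wrong, what is affected?") becomes a graph traversal. The sketch below works on plain dictionaries shaped like the items written by record_lineage; the function name is hypothetical, and in practice the records would be loaded from the pipeline_lineage table first.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set

def downstream_tables(records: Iterable[dict], start: str) -> List[str]:
    """Return every table that directly or indirectly depends on `start`."""
    children: Dict[str, Set[str]] = defaultdict(set)
    for rec in records:
        for src in rec["source_tables"]:
            children[src].add(rec["target_table"])

    seen: Set[str] = set()
    queue = deque([start])
    while queue:                       # breadth-first walk over source -> target edges
        table = queue.popleft()
        for child in children.get(table, ()):
            if child not in seen and child != start:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

records = [
    {"source_tables": ["bronze.raw_orders"], "target_table": "silver.orders"},
    {"source_tables": ["silver.orders", "silver.order_items", "gold.dim_customer"],
     "target_table": "gold.fact_daily_revenue"},
]
print(downstream_tables(records, "bronze.raw_orders"))
# ['gold.fact_daily_revenue', 'silver.orders']
```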
OpenLineage is an open-source specification for collecting lineage metadata from data pipelines. Instead of building your own lineage tables from scratch, you emit standardized run events (START, COMPLETE, FAIL) with input/output datasets, and any compatible backend (Marquez, DataHub, Atlan) can store and visualize the lineage graph.
# Install: pip install openlineage-python
from openlineage.client import OpenLineageClient
from openlineage.client.run import (
    RunEvent, RunState, Run, Job, InputDataset, OutputDataset
)
from openlineage.client.facet import (
    DataSourceDatasetFacet, SchemaDatasetFacet, SchemaField
)
import uuid
from datetime import datetime, timezone

client = OpenLineageClient.from_environment()  # reads OPENLINEAGE_URL env var
run_id = str(uuid.uuid4())
job = Job(namespace="aws-glue", name="silver-orders-etl")
# RunEvent requires a producer URI that identifies the emitting code (placeholder value)
PRODUCER = "https://example.com/pipelines/silver-orders-etl"

# Emit START event
client.emit(RunEvent(
    eventType = RunState.START,
    eventTime = datetime.now(timezone.utc).isoformat(),
    run       = Run(runId=run_id),
    job       = job,
    producer  = PRODUCER,
    inputs    = [
        InputDataset(
            namespace = "postgresql://prod-db:5432",
            name      = "public.orders",
            facets    = {
                "dataSource": DataSourceDatasetFacet(
                    name="prod-postgres", uri="postgresql://prod-db:5432/salesdb"
                )
            }
        )
    ],
    outputs   = []
))
# ... run ETL ...
# Emit COMPLETE event with output dataset
client.emit(RunEvent(
    eventType = RunState.COMPLETE,
    eventTime = datetime.now(timezone.utc).isoformat(),
    run       = Run(runId=run_id),
    job       = job,
    producer  = PRODUCER,
    inputs    = [],
    outputs   = [
        OutputDataset(
            namespace = "s3://my-lake",
            name      = "silver/orders",
            facets    = {
                "schema": SchemaDatasetFacet(fields=[
                    SchemaField("order_id", "STRING"),
                    SchemaField("customer_id", "STRING"),
                    SchemaField("net_revenue", "DOUBLE")
                ])
            }
        )
    ]
))
Marquez is the open-source reference implementation of an OpenLineage backend. It provides a REST API that receives lineage events and a UI that renders the full lineage graph — showing exactly which datasets feed which jobs and which jobs produce which datasets.
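# Start Marquez (API + PostgreSQL + web UI) with the project's Docker Compose script.
# The API needs a PostgreSQL database, so a bare `docker run` of the API image is not enough
git clone https://github.com/MarquezProject/marquez && cd marquez
./docker/up.sh
# Set env var so OpenLineage client sends events to the Marquez API (port 5000)
export OPENLINEAGE_URL=http://localhost:5000
# Open Marquez UI at http://localhost:3000 (port 5001 is the API's admin port)
# Navigate: Namespaces → Jobs → click any job to see lineage graph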
The audit table is the heartbeat log of your entire data platform. Every pipeline run — success or failure — writes one record. It gives you historical visibility into pipeline health, run durations, row counts, and error messages — all queryable with SQL.
CREATE TABLE pipeline_audit (
audit_id STRING DEFAULT gen_random_uuid(),
run_id STRING NOT NULL,
pipeline_name STRING NOT NULL,
dag_id STRING,
task_id STRING,
source_table STRING,
target_table STRING,
status STRING NOT NULL, -- SUCCESS / FAILED / RUNNING
start_time TIMESTAMP NOT NULL,
end_time TIMESTAMP,
duration_seconds INT,
rows_read BIGINT,
rows_written BIGINT,
rows_rejected BIGINT,
dq_score DECIMAL(5,2),
error_message STRING,
error_type STRING, -- NullConstraint / SchemaError / TimeoutError
batch_id STRING,
watermark_value TIMESTAMP,
environment STRING, -- dev / staging / prod
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (audit_id)
);
import boto3, uuid, time
from datetime import datetime, timezone
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("pipeline_audit")
class PipelineAudit:
    """Context manager that writes audit records automatically."""
    def __init__(self, pipeline_name, source_table, target_table,
                 run_id=None, environment="prod"):
        self.pipeline_name = pipeline_name
        self.source_table  = source_table
        self.target_table  = target_table
        self.run_id        = run_id or str(uuid.uuid4())
        self.audit_id      = str(uuid.uuid4())   # one id per run, reused by both writes
        self.environment   = environment
        self.start_time    = None
        self.rows_read     = 0
        self.rows_written  = 0
        self.rows_rejected = 0
        self.dq_score      = 100.0

    def __enter__(self):
        self.start_time = datetime.now(timezone.utc)
        # Write RUNNING record immediately so we know the pipeline started
        table.put_item(Item={
            "audit_id"      : self.audit_id,
            "run_id"        : self.run_id,
            "pipeline_name" : self.pipeline_name,
            "source_table"  : self.source_table,
            "target_table"  : self.target_table,
            "status"        : "RUNNING",
            "start_time"    : self.start_time.isoformat(),
            "environment"   : self.environment
        })
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        end_time = datetime.now(timezone.utc)
        duration = int((end_time - self.start_time).total_seconds())
        if exc_type is None:
            status        = "SUCCESS"
            error_message = None
            error_type    = None
        else:
            status        = "FAILED"
            error_message = str(exc_val)[:1000]  # keep well under DynamoDB's 400 KB item limit
            error_type    = exc_type.__name__
        # Same audit_id as the RUNNING record, so this put_item replaces it and each
        # run ends up with exactly one record (assumes audit_id is the table's key)
        table.put_item(Item={
            "audit_id"        : self.audit_id,
            "run_id"          : self.run_id,
            "pipeline_name"   : self.pipeline_name,
            "source_table"    : self.source_table,
            "target_table"    : self.target_table,
            "status"          : status,
            "start_time"      : self.start_time.isoformat(),
            "end_time"        : end_time.isoformat(),
            "duration_seconds": duration,
            "rows_read"       : self.rows_read,
            "rows_written"    : self.rows_written,
            "rows_rejected"   : self.rows_rejected,
            "dq_score"        : str(self.dq_score),  # stored as a string: the resource API rejects Python floats
            "error_message"   : error_message,
            "error_type"      : error_type,
            "environment"     : self.environment
        })
        return False  # do not suppress exceptions
# Usage: wraps your entire ETL logic
with PipelineAudit("silver-orders-etl",
                   source_table="bronze.raw_orders",
                   target_table="silver.orders") as audit:
    # your ETL logic here
    df = spark.table("bronze.raw_orders")
    audit.rows_read = df.count()
    df_clean = df.dropna(subset=["order_id", "customer_id"])
    audit.rows_written = df_clean.count()   # count once: every count() triggers a Spark job
    audit.rows_rejected = audit.rows_read - audit.rows_written
    if audit.rows_read:                     # avoid dividing by zero on an empty source
        audit.dq_score = round(audit.rows_written / audit.rows_read * 100, 2)
    df_clean.write.format("delta").mode("overwrite") \
        .saveAsTable("silver.orders")
-- 1. Last 7 days pipeline success rate per pipeline
SELECT
pipeline_name,
COUNT(*) AS total_runs,
SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) AS successes,
ROUND(100.0 * SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) / COUNT(*), 2) AS success_rate_pct
FROM pipeline_audit
WHERE start_time >= CURRENT_TIMESTAMP - INTERVAL '7 days'
GROUP BY pipeline_name
ORDER BY success_rate_pct ASC;
-- 2. Pipelines that took longer than their SLA today
SELECT pipeline_name, start_time, end_time, duration_seconds, rows_written
FROM pipeline_audit
WHERE DATE(start_time) = CURRENT_DATE
AND status = 'SUCCESS'
AND duration_seconds > 1800 -- 30 min SLA
ORDER BY duration_seconds DESC;
-- 3. DQ score trend for a specific pipeline
SELECT DATE(start_time) AS run_date, AVG(dq_score::float) AS avg_dq_score
FROM pipeline_audit
WHERE pipeline_name = 'silver-orders-etl'
AND status = 'SUCCESS'
AND start_time >= CURRENT_TIMESTAMP - INTERVAL '30 days'
GROUP BY run_date
ORDER BY run_date;
-- 4. Most common error types
SELECT error_type, COUNT(*) AS occurrences, MAX(start_time) AS last_seen
FROM pipeline_audit
WHERE status = 'FAILED'
AND start_time >= CURRENT_TIMESTAMP - INTERVAL '30 days'
GROUP BY error_type
ORDER BY occurrences DESC;
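To try the first query without a warehouse, the sketch below runs it against an in-memory SQLite table with a handful of made-up rows. The table is reduced to the three columns the query needs, and the 7-day filter is left out so the result does not depend on today's date; both simplifications are assumptions for the demo, not part of the audit schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pipeline_audit (pipeline_name TEXT, status TEXT, start_time TEXT)"
)
# Made-up audit rows for illustration
rows = [
    ("silver-orders-etl",  "SUCCESS", "2024-01-12T02:00:00"),
    ("silver-orders-etl",  "SUCCESS", "2024-01-13T02:00:00"),
    ("silver-orders-etl",  "FAILED",  "2024-01-14T02:00:00"),
    ("silver-orders-etl",  "SUCCESS", "2024-01-15T02:00:00"),
    ("gold-daily-revenue", "SUCCESS", "2024-01-14T03:00:00"),
    ("gold-daily-revenue", "SUCCESS", "2024-01-15T03:00:00"),
]
conn.executemany("INSERT INTO pipeline_audit VALUES (?, ?, ?)", rows)

query = """
SELECT pipeline_name,
       COUNT(*) AS total_runs,
       SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) AS successes,
       ROUND(100.0 * SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) / COUNT(*), 2)
           AS success_rate_pct
FROM pipeline_audit
GROUP BY pipeline_name
ORDER BY success_rate_pct ASC
"""
results = conn.execute(query).fetchall()
for row in results:
    print(row)
# ('silver-orders-etl', 4, 3, 75.0)
# ('gold-daily-revenue', 2, 2, 100.0)
```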
Putting it all together — here is how all observability components connect in a production AWS data platform.
Boto3 Fundamentals
Sessions, clients, resources, credential resolution, and production-grade client configuration — the foundation for every AWS API call you'll make from Python.
Boto3 has three levels of abstraction for interacting with AWS. Understanding which one to use — and why — is the first thing every data engineer must get right before writing a single API call.
A Session stores configuration — credentials, region, profile — and is the factory from which you create clients and resources. By default, boto3 uses an implicit default session. You create explicit sessions when you need to use different credentials (cross-account, assumed role) or different regions in the same script.
import boto3
# Implicit default session — uses ~/.aws/credentials or instance profile
s3 = boto3.client("s3") # boto3 creates a default session internally
# Explicit session — useful when you need custom region or profile
session = boto3.Session(
region_name="eu-west-1", # override default region
profile_name="data-dev" # use a named profile from ~/.aws/config
)
s3_eu = session.client("s3")
# Cross-account session using temporary credentials from STS assume_role
sts = boto3.client("sts")
creds = sts.assume_role(
RoleArn="arn:aws:iam::999999999999:role/CrossAccountDataRole",
RoleSessionName="etl-cross-account"
)["Credentials"]
cross_session = boto3.Session(
aws_access_key_id = creds["AccessKeyId"],
aws_secret_access_key = creds["SecretAccessKey"],
aws_session_token = creds["SessionToken"],
region_name = "us-east-1"
)
s3_cross = cross_session.client("s3") # now operates in the target account
A Client is the low-level interface — every method maps directly to one AWS API call. It returns raw Python dictionaries. Use Client for all production data engineering code — it covers 100% of the AWS API, has predictable return structures, and aligns with the AWS documentation.
import boto3
# Create a client — specify service name and region
s3 = boto3.client("s3", region_name="us-east-1")
glue = boto3.client("glue", region_name="us-east-1")
emr = boto3.client("emr", region_name="us-east-1")
dynamo = boto3.client("dynamodb", region_name="us-east-1")
# Client methods return raw dicts — you access fields by key
response = s3.get_object(Bucket="my-lake", Key="silver/orders/part-0001.parquet")
body = response["Body"].read() # raw bytes
metadata = response["Metadata"] # dict of custom metadata
size = response["ContentLength"] # integer
# List all S3 buckets — returns dict with "Buckets" key
buckets = s3.list_buckets()
for b in buckets["Buckets"]:
print(b["Name"], b["CreationDate"])
A Resource is the high-level, object-oriented interface. It wraps AWS entities as Python objects with attributes and methods (e.g., bucket.objects.all() instead of s3.list_objects_v2()). Resources exist for only a handful of services (S3, DynamoDB, EC2, IAM, SQS, SNS, CloudFormation, CloudWatch, Glacier, and OpsWorks), and AWS has stopped adding features to the resource interface. They are convenient for simple scripts but lack full API coverage, so they are not recommended for production pipelines.
import boto3
# S3 Resource — object-oriented access
s3_resource = boto3.resource("s3", region_name="us-east-1")
bucket = s3_resource.Bucket("my-lake")
# List all objects in a bucket (resource style)
for obj in bucket.objects.filter(Prefix="silver/orders/"):
print(obj.key, obj.size)
# Upload a file — simpler syntax than client.upload_file
bucket.upload_file("local_file.parquet", "silver/orders/part-0001.parquet")
# DynamoDB Resource — table as an object
dynamo_resource = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamo_resource.Table("pipeline_audit")
# put_item with resource — cleaner syntax than client
table.put_item(Item={"run_id": "abc123", "status": "SUCCESS", "rows": 50000})
# get_item with resource — no need to parse nested response
response = table.get_item(Key={"run_id": "abc123"})
item = response.get("Item") # already deserialized — no type wrappers
| Aspect | Session | Client | Resource |
|---|---|---|---|
| Purpose | Credential/config store | Raw API calls | OO wrapper |
| Return type | N/A | Raw dict | Python objects |
| API coverage | N/A | 100% of AWS API | Subset of services |
| Production DE use | Always (implicit) | Primary choice | Simple scripts only |
| Services supported | All | All | S3, DynamoDB, EC2, IAM, SQS, SNS, and a few others (CloudFormation, CloudWatch, Glacier, OpsWorks) |
Boto3 resolves credentials using a credential chain — it tries each method in order and uses the first one it finds. In production, you never hardcode credentials. In local dev, you use profiles. On AWS compute, the instance/task/pod role is automatically used.
1. Explicit credentials passed to Session() / client() constructor
2. Environment variables:
   AWS_ACCESS_KEY_ID
   AWS_SECRET_ACCESS_KEY
   AWS_SESSION_TOKEN (for temporary creds)
   (AWS_DEFAULT_REGION sets the region, not credentials; Lambda injects its execution role's temporary credentials through these variables)
3. Assume-role settings in ~/.aws/config (role_arn + source_profile, or web identity)
4. AWS IAM Identity Center (SSO) credential cache (if using aws sso login)
5. Shared credentials file ~/.aws/credentials, then ~/.aws/config
   [default] profile or named profiles
6. Container credentials (ECS task role, via the container credentials endpoint)
7. EC2 instance profile (also used by EMR cluster nodes)
   → fetched automatically from EC2 metadata service (169.254.169.254)
   → This is the standard production pattern on AWS compute
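The "first provider that returns credentials wins" behaviour can be pictured with a few lines of plain Python. This is a conceptual sketch only, not botocore's implementation: the provider names, the resolve_credentials function, and the dictionary shape are all made up for illustration.

```python
from typing import Callable, Dict, List, Mapping, Optional, Tuple

Credentials = Dict[str, str]
Provider = Tuple[str, Callable[[], Optional[Credentials]]]

def resolve_credentials(providers: List[Provider]) -> Tuple[str, Credentials]:
    """Try each provider in order and return the first one that yields credentials."""
    for name, load in providers:
        creds = load()
        if creds:
            return name, creds
    raise RuntimeError("Unable to locate credentials")

def from_env(environ: Mapping[str, str]) -> Callable[[], Optional[Credentials]]:
    """Build a provider that reads the standard AWS environment variables."""
    def load() -> Optional[Credentials]:
        key = environ.get("AWS_ACCESS_KEY_ID")
        secret = environ.get("AWS_SECRET_ACCESS_KEY")
        if key and secret:
            return {"access_key": key, "secret_key": secret}
        return None
    return load

def fake_instance_role() -> Optional[Credentials]:
    # Stand-in for the metadata-service lookup; values are dummies
    return {"access_key": "ASIAEXAMPLE", "secret_key": "dummy"}

fake_env = {"AWS_ACCESS_KEY_ID": "AKIAEXAMPLE", "AWS_SECRET_ACCESS_KEY": "dummy"}
chain = [
    ("explicit", lambda: None),          # nothing passed to the constructor
    ("env", from_env(fake_env)),         # environment variables are set
    ("instance-role", fake_instance_role),
]
name, creds = resolve_credentials(chain)
print(name)   # env
```

Because environment variables sit earlier in the chain than the instance role, a stray AWS_ACCESS_KEY_ID left in a container image silently overrides the role. That is a common cause of "works locally, wrong identity in prod" bugs.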
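In production, AWS compute services (Glue jobs, Lambda functions, EMR clusters) run under an IAM role. Boto3 automatically picks up the role's temporary credentials: from the instance metadata service on EC2 and EMR nodes, from the container credentials endpoint on ECS, and from environment variables that the runtime sets on Lambda. You never need to pass credentials explicitly. This is the most secure pattern.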
import boto3
# On Glue / Lambda / EMR, boto3 auto-uses the attached IAM role
# No credentials needed in code: boto3 resolves the role's temporary credentials itself
s3 = boto3.client("s3", region_name="us-east-1")
glue = boto3.client("glue", region_name="us-east-1")
# Verify which identity boto3 resolved — useful for debugging auth issues
sts = boto3.client("sts")
identity = sts.get_caller_identity()
print(f"Account : {identity['Account']}")
print(f"UserId : {identity['UserId']}")
print(f"ARN : {identity['Arn']}")
# Output on a Glue job:
# ARN: arn:aws:sts::123456789012:assumed-role/GlueExecutionRole/session-name
# Used in CI/CD pipelines (GitHub Actions, GitLab CI) and Docker containers
# Store in GitHub Secrets / GitLab CI Variables — never in code
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI..."
export AWS_DEFAULT_REGION="us-east-1"
# boto3 picks these up automatically — no code change needed
python my_etl_script.py
# ~/.aws/credentials
[default]
aws_access_key_id = AKIA...
aws_secret_access_key = wJalrX...
[data-dev]
aws_access_key_id = AKIA...DEV
aws_secret_access_key = wJalrX...DEV
# ~/.aws/config
[default]
region = us-east-1
output = json
[profile data-dev]
region = eu-west-1
role_arn = arn:aws:iam::999999999999:role/DevDataRole
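# assume the role using the default profile's credentials
# (comments must sit on their own line; text after a value is read as part of the value)
source_profile = default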
import boto3
# Use the data-dev profile from ~/.aws/config
session = boto3.Session(profile_name="data-dev")
s3 = session.client("s3")
# Useful for local testing against a dev AWS account
# while prod code uses the instance role (no profile needed)
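When your code runs on EC2 (including EMR cluster nodes), boto3 automatically fetches temporary credentials from the EC2 Instance Metadata Service (IMDS) at 169.254.169.254. Lambda and ECS deliver the same kind of temporary role credentials through environment variables and the container credentials endpoint respectively. In every case the credentials are rotated automatically by AWS every few hours and you never manage them. This is the gold standard for production.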
import boto3
# On any AWS compute — this confirms auto-auth from instance profile
sts = boto3.client("sts")
try:
identity = sts.get_caller_identity()
print(f"✅ Auth via: {identity['Arn']}")
except Exception as e:
print(f"❌ Auth failed: {e} — check IAM role attachment")
Always specify region_name explicitly when creating clients; never rely on the default region being correct in a different environment. For global services such as IAM, us-east-1 is the usual convention. S3 bucket names are globally unique, but each bucket lives in one Region, so create the S3 client in the bucket's Region; STS also offers regional endpoints. For VPC endpoint access (private subnets), you may need to specify a custom endpoint_url.
import boto3
# Always specify region explicitly — avoid silent wrong-region bugs
s3 = boto3.client("s3", region_name="us-east-1")
glue = boto3.client("glue", region_name="eu-west-1") # Glue in Ireland
# Custom endpoint — for VPC Interface Endpoints (private subnet access)
# Instead of going over the public internet, traffic stays inside VPC
s3_private = boto3.client(
"s3",
region_name = "us-east-1",
endpoint_url = "https://bucket.vpce-0a1b2c3d4e-abcdef.s3.us-east-1.vpce.amazonaws.com"
)
# LocalStack — local AWS emulation for integration testing
s3_local = boto3.client(
"s3",
region_name = "us-east-1",
endpoint_url = "http://localhost:4566",
aws_access_key_id = "test",
aws_secret_access_key = "test"
)
The botocore.config.Config object controls retry behavior, connection timeouts, and connection pool size. In production pipelines that make thousands of API calls, tuning these settings prevents silent failures from transient errors and connection exhaustion.
import boto3
from botocore.config import Config
# Production-grade config for high-throughput DE pipelines
prod_config = Config(
    region_name = "us-east-1",
    # Retry configuration
    retries = {
        "max_attempts": 10,   # up to 10 retries after the initial request (11 attempts in total);
                              # use "total_max_attempts" to count the initial request as well
        "mode": "adaptive"    # adaptive: standard backoff + client-side rate limiting
                              # standard: exponential backoff with jitter
                              # legacy: the mode boto3 uses when none is set, with narrower error coverage
    },
    # Timeout configuration (in seconds)
    connect_timeout = 5,      # time to establish TCP connection
    read_timeout = 60,        # time to wait for server response after connected
    # Connection pool: increase for high-concurrency pipelines
    max_pool_connections = 50 # default is 10; increase for parallel uploads/downloads
)
# Apply config to a client
s3 = boto3.client("s3", config=prod_config)
glue = boto3.client("glue", config=prod_config)
# Verify the config took effect
print(s3.meta.config.retries)          # resolved retry settings (mode and attempt limit)
print(s3.meta.config.connect_timeout)  # 5
| Retry Mode | Behaviour | Use When |
|---|---|---|
| legacy | Exponential backoff on a narrow set of errors, default limit of 5 attempts, no rate limiting | It is what you get when no mode is set; prefer standard for new code |
| standard | Exponential backoff (capped at 20s) on a broader set of retryable errors | Most production use cases |
| adaptive | Standard behaviour + token bucket rate limiter | High-throughput pipelines hitting AWS rate limits |
With a ThreadPoolExecutor running 50 threads that all make S3 calls on the same client, the default pool of 10 connections becomes a bottleneck. Set max_pool_connections to at least your thread count so threads do not queue for a free connection.
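As a rough sketch of that sizing rule, the helper below fans work out over a thread pool while sharing one client. The names head_many and MAX_WORKERS are illustrative, not part of boto3, and the client passed in is assumed to have been created with max_pool_connections of at least MAX_WORKERS.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 50  # assumed thread count; keep max_pool_connections >= this value


def head_many(s3_client, bucket, keys, max_workers=MAX_WORKERS):
    """Fetch object sizes in parallel, sharing one client across all threads.

    `s3_client` is assumed to come from something like:
        boto3.client("s3", config=Config(max_pool_connections=MAX_WORKERS))
    """
    def size_of(key):
        # head_object returns metadata only; ContentLength is the size in bytes
        response = s3_client.head_object(Bucket=bucket, Key=key)
        return key, response["ContentLength"]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(size_of, keys))
```

Keeping the worker count and the pool size in one constant makes it harder for the two to drift apart.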
This is the pattern every DE should use as their standard template for initializing boto3 clients in production pipelines — combining explicit region, production config, and environment-aware credential resolution.
import boto3
import os
from botocore.config import Config
REGION = os.environ.get("AWS_DEFAULT_REGION", "us-east-1")
PROD_CONFIG = Config(
    retries = {"total_max_attempts": 10, "mode": "adaptive"},  # 1 initial call + 9 retries
    connect_timeout = 5,
    read_timeout = 60,
    max_pool_connections = 50
)
def get_client(service_name: str, region: str = REGION):
    """
    Factory for boto3 clients.
    - On AWS compute: auto-uses the instance role (no creds needed)
    - In local dev: uses ~/.aws/credentials or env vars
    - In CI/CD: uses env var credentials
    """
    # Reuse the shared config; the explicit region_name argument selects the region
    return boto3.client(service_name, region_name=region, config=PROD_CONFIG)
# Usage throughout the pipeline
s3 = get_client("s3")
glue = get_client("glue")
dynamo = get_client("dynamodb")
sns = get_client("sns")
cw = get_client("cloudwatch")
② Always specify region_name explicitly; never rely on defaults.
③ Always use Config with mode="adaptive" retry and reasonable timeouts.
④ Use sts.get_caller_identity() at startup to verify the correct identity is being used.
⑤ Use explicit Sessions only for cross-account or multi-region scenarios.
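The startup identity check can be sketched as below. The function name verify_identity and the expected_account_id parameter are hypothetical; the only AWS call used is sts.get_caller_identity(), whose response includes the Account and Arn fields.

```python
def verify_identity(sts_client, expected_account_id):
    """Fail fast if the pipeline is running under an unexpected AWS account.

    `sts_client` is assumed to be boto3.client("sts", region_name=...).
    """
    identity = sts_client.get_caller_identity()  # returns UserId, Account, Arn
    if identity["Account"] != expected_account_id:
        raise RuntimeError(
            f"Running as {identity['Arn']} in account {identity['Account']}, "
            f"expected account {expected_account_id}"
        )
    return identity["Arn"]
```

Calling this once before any data is touched turns a wrong-profile mistake into an immediate, readable failure.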
ClientError is raised when AWS returns an HTTP error response — meaning your request reached AWS, but AWS rejected it or something went wrong on its side. It wraps all service-level errors: access denied, resource not found, throttling, concurrent run limits, etc.
import boto3
from botocore.exceptions import ClientError
s3 = boto3.client("s3", region_name="us-east-1")
try:
response = s3.get_object(Bucket="my-bucket", Key="data/file.parquet")
except ClientError as e:
# Always inspect these two fields first
error_code = e.response["Error"]["Code"] # e.g. "NoSuchKey"
error_message = e.response["Error"]["Message"] # human-readable
http_status = e.response["ResponseMetadata"]["HTTPStatusCode"] # e.g. 404
print(f"Code: {error_code} | HTTP: {http_status} | Msg: {error_message}")
ClientError has e.response["Error"]["Code"] and e.response["Error"]["Message"]. Always parse these — never just print the raw exception, as it loses the code you need to branch on.
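One way to make that parsing routine is a small helper that turns e.response into a flat dict for logging. This is a sketch: summarize_error is a made-up name, and it takes the response dict rather than the exception so it can be exercised without AWS.

```python
def summarize_error(response: dict) -> dict:
    """Pull the fields worth logging out of ClientError.response.

    Uses .get() throughout because not every error response carries
    every field.
    """
    error = response.get("Error", {})
    meta = response.get("ResponseMetadata", {})
    return {
        "code": error.get("Code", "Unknown"),
        "message": error.get("Message", ""),
        "http_status": meta.get("HTTPStatusCode"),
        "request_id": meta.get("RequestId"),  # handy when opening a support case
    }
```

Inside an except ClientError as e: block it would be called as summarize_error(e.response), and the code field is what the pipeline branches on.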
BotoCoreError is the base class for errors that happen before a request reaches AWS — network issues, bad parameters, missing credentials, endpoint resolution failures. These are SDK-side, not service-side.
import boto3
from botocore.exceptions import (
    BotoCoreError,
    NoCredentialsError,
    EndpointConnectionError,
    ParamValidationError,
    ClientError
)
try:
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.get_object(Bucket="my-bucket", Key="file.parquet")
except NoCredentialsError:
    # No credentials found at all: misconfigured environment
    print("ERROR: No AWS credentials found. Check IAM role or env vars.")
except ParamValidationError as e:
    # Wrong parameter type or missing required parameter: caught before the network call
    print(f"ERROR: Bad parameters passed to boto3: {e}")
except EndpointConnectionError as e:
    # Could not connect to the endpoint URL: DNS or network issue
    print(f"ERROR: Cannot reach AWS endpoint: {e}")
except ClientError as e:
    # AWS responded with an error (service-side)
    print(f"AWS Error: {e.response['Error']['Code']}: {e.response['Error']['Message']}")
except BotoCoreError as e:
    # Catch-all for any other SDK-level error
    print(f"Boto3 SDK error: {e}")
These are the error codes that appear in real pipelines. When you catch a ClientError, branch on e.response["Error"]["Code"] to take the right action for each case.
import boto3
from botocore.exceptions import ClientError
def safe_get_object(bucket, key):
    s3 = boto3.client("s3", region_name="us-east-1")
    try:
        return s3.get_object(Bucket=bucket, Key=key)
    except ClientError as e:
        code = e.response["Error"]["Code"]
        if code == "NoSuchBucket":
            # The S3 bucket doesn't exist: configuration error
            raise ValueError(f"Bucket '{bucket}' does not exist. Check your config.")
        elif code == "NoSuchKey":
            # The file doesn't exist: could be expected (e.g. first run)
            print(f"File s3://{bucket}/{key} not found, may be first run.")
            return None
        elif code == "AccessDenied":
            # IAM permissions issue: escalate immediately
            raise PermissionError(f"Access denied to s3://{bucket}/{key}. Check IAM role.")
        elif code in ("ThrottlingException", "RequestLimitExceeded", "SlowDown"):
            # AWS is rate-limiting us: back off and retry
            print("Being throttled by S3, apply exponential backoff.")
            raise  # re-raise so retry logic catches it
        elif code in ("ServiceUnavailable", "ServiceUnavailableException"):
            # Transient AWS service issue (S3 reports it as ServiceUnavailable): safe to retry
            raise  # re-raise for retry
        else:
            # Unknown error: log and re-raise
            print(f"Unhandled error code: {code}")
            raise
| Error Code | Service | Meaning | Action |
|---|---|---|---|
| NoSuchBucket | S3 | Bucket doesn't exist | Config error: fail fast |
| NoSuchKey | S3 | Object doesn't exist | Handle gracefully (first run etc.) |
| AccessDenied | S3, IAM (many other services return AccessDeniedException) | IAM policy blocks action | Fail fast, alert ops |
| ThrottlingException | Most services | API rate limit hit | Backoff + retry |
| SlowDown | S3 | S3 request rate limit | Backoff + retry |
| EntityAlreadyExists | IAM | Role/policy already exists | Skip or update |
| ConcurrentRunsExceededException | Glue | Job's concurrent run limit reached | Wait or queue |
| ServiceUnavailableException | Most services (S3 returns ServiceUnavailable) | Transient AWS outage | Retry with backoff |
| ValidationException | Most services | Invalid input | Fix parameters, don't retry |
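The table can also be carried in code as a lookup from error code to action, so every call site branches the same way. This is only a sketch: the action labels and the action_for helper are invented for illustration, and the mapping simply mirrors the rows above.

```python
RETRY = "retry"
FAIL_FAST = "fail_fast"
SKIP = "skip"

# Codes taken from the table above; extend per service as needed.
ACTION_BY_CODE = {
    "NoSuchBucket": FAIL_FAST,
    "NoSuchKey": SKIP,
    "AccessDenied": FAIL_FAST,
    "ThrottlingException": RETRY,
    "SlowDown": RETRY,
    "EntityAlreadyExists": SKIP,
    "ConcurrentRunsExceededException": RETRY,
    "ServiceUnavailableException": RETRY,
    "ValidationException": FAIL_FAST,
}


def action_for(code: str) -> str:
    """Unknown codes fail fast so new error types get noticed, not silently retried."""
    return ACTION_BY_CODE.get(code, FAIL_FAST)


print(action_for("SlowDown"))        # retry
print(action_for("SomethingNew"))    # fail_fast
```

Defaulting unknown codes to fail-fast is a design choice; a pipeline that prefers availability over strictness might default to retry instead.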
When AWS throttles you, retrying immediately makes things worse — all clients pile on at the same moment. Exponential backoff doubles the wait time after each failure. Jitter adds randomness so multiple pipeline instances don't all retry at exactly the same second.
import boto3, time, random
from botocore.exceptions import ClientError
# Error codes that are safe to retry (transient)
RETRYABLE_CODES = {
    "ThrottlingException",
    "RequestLimitExceeded",
    "SlowDown",
    "ServiceUnavailable",  # S3 spelling
    "ServiceUnavailableException",
    "InternalError",
    "RequestTimeout",
    "ProvisionedThroughputExceededException",  # DynamoDB
}
def with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=32.0):
    """
    Call fn() with exponential backoff + full jitter on retryable errors.
    Wait formula: uniform(0, min(max_delay, base_delay * 2^attempt))
    With the defaults the caps are 1s, 2s, 4s, 8s, so waits average
    roughly 0.5s, 1s, 2s, 4s (no wait follows the final attempt).
    """
    for attempt in range(max_attempts):
        try:
            return fn()  # call the boto3 operation
        except ClientError as e:
            code = e.response["Error"]["Code"]
            if code not in RETRYABLE_CODES or attempt == max_attempts - 1:
                # Non-retryable error OR we've exhausted all attempts
                raise
            # Exponential cap with full jitter
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep = random.uniform(0, cap)  # full jitter: random between 0 and cap
            print(f"[Attempt {attempt+1}/{max_attempts}] Got {code}. "
                  f"Retrying in {sleep:.2f}s...")
            time.sleep(sleep)
    raise RuntimeError("Exhausted all retry attempts")  # only reachable if max_attempts <= 0
# ── Usage example ──
s3 = boto3.client("s3", region_name="us-east-1")
response = with_backoff(
lambda: s3.get_object(Bucket="my-bucket", Key="data/events.parquet")
)
body = response["Body"].read()
print(f"Downloaded {len(body)} bytes")
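To make the wait schedule concrete, the following standalone sketch computes the backoff caps for the same defaults as with_backoff (5 attempts, 1s base, 32s ceiling). The helper names backoff_caps and jittered_delays are illustrative only.

```python
import random


def backoff_caps(max_attempts=5, base_delay=1.0, max_delay=32.0):
    """Upper bound of each wait; there is one wait between consecutive attempts."""
    return [min(max_delay, base_delay * 2 ** n) for n in range(max_attempts - 1)]


def jittered_delays(caps, rng=random):
    """Full jitter: each delay is drawn uniformly from [0, cap]."""
    return [rng.uniform(0, cap) for cap in caps]


caps = backoff_caps()
print(caps)       # [1.0, 2.0, 4.0, 8.0]
print(sum(caps))  # 15.0, the worst-case total wait in seconds
```

Because each delay is uniform on [0, cap], the expected total wait is half the worst case, which is useful when sizing timeouts for the step that calls the retried operation.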
tenacity is a Python library that handles retries declaratively with decorators. It's cleaner than writing manual loops for every function. It supports exponential backoff, jitter, stop conditions, and custom retry predicates — all in one decorator.
# pip install tenacity
import boto3
from botocore.exceptions import ClientError
from tenacity import (
retry,
stop_after_attempt,
wait_exponential_jitter,
retry_if_exception,
before_sleep_log,
RetryError
)
import logging
logger = logging.getLogger(__name__)
# ── Define what counts as a retryable error ──
RETRYABLE_CODES = {
"ThrottlingException", "SlowDown", "ServiceUnavailableException",
"RequestLimitExceeded", "InternalError", "ProvisionedThroughputExceededException"
}
def is_retryable(exc):
"""Return True if this exception should trigger a retry."""
if isinstance(exc, ClientError):
return exc.response["Error"]["Code"] in RETRYABLE_CODES
return False
# ── Decorated function — retries automatically ──
@retry(
retry = retry_if_exception(is_retryable), # only retry on throttle/transient
stop = stop_after_attempt(5), # give up after 5 total attempts
wait = wait_exponential_jitter( # exponential backoff + jitter
initial=1, # first wait = 1s
max=30, # cap wait at 30s
jitter=2 # add up to 2s random jitter on top
),
before_sleep = before_sleep_log(logger, logging.WARNING), # log each retry
reraise = True # re-raise the original exception on final failure
)
def download_s3_object(bucket: str, key: str) -> bytes:
s3 = boto3.client("s3", region_name="us-east-1")
response = s3.get_object(Bucket=bucket, Key=key)
return response["Body"].read()
# ── Usage ──
try:
    data = download_s3_object("my-data-lake", "bronze/events/2024/01/01/events.parquet")
    print(f"Downloaded {len(data):,} bytes")
except ClientError as e:
    # With reraise=True the original ClientError surfaces after the last attempt,
    # so there is no tenacity RetryError to catch here.
    print(f"Final failure after retries: {e.response['Error']['Code']}")
The @retry decorator keeps your function body focused on business logic, not retry plumbing.
import boto3
from tenacity import retry, stop_after_attempt, wait_fixed, retry_if_result, RetryError
glue = boto3.client("glue", region_name="us-east-1")
def is_still_running(state: str) -> bool:
    """Return True while the job run has not reached a terminal state."""
    return state in ("STARTING", "WAITING", "RUNNING", "STOPPING")
@retry(
    retry = retry_if_result(is_still_running),  # keep polling while not terminal
    stop = stop_after_attempt(60),              # max 60 polls (~10 minutes)
    wait = wait_fixed(10),                      # poll every 10 seconds
    reraise = True
)
def poll_glue_job(job_name: str, run_id: str) -> str:
    """Poll until Glue job reaches a terminal state. Returns final state string."""
    response = glue.get_job_run(JobName=job_name, RunId=run_id)
    state = response["JobRun"]["JobRunState"]
    print(f"  Glue job state: {state}")
    return state  # tenacity passes this to is_still_running via retry_if_result
# ── Usage ──
run = glue.start_job_run(JobName="my-etl-job", Arguments={"--env": "prod"})
run_id = run["JobRunId"]
try:
    final_state = poll_glue_job("my-etl-job", run_id)
except RetryError:
    # Every poll returned a non-terminal state, so tenacity gave up
    raise RuntimeError("Glue job did not finish within ~10 minutes of polling")
if final_state == "SUCCEEDED":
    print("✅ Glue job completed successfully")
else:
    raise RuntimeError(f"❌ Glue job ended with state: {final_state}")
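boto3 has a built-in retry mechanism configurable via the Config object. With mode="adaptive", boto3 applies exponential backoff automatically for retryable errors, plus a client-side rate limiter that lowers the request rate once throttling responses start coming back. The boto3 documentation has described adaptive mode as experimental, so check its current status before relying on it.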
import boto3
from botocore.config import Config
# ── Configure adaptive retry globally for this client ──
retry_config = Config(
    retries={
        "total_max_attempts": 10,  # 1 initial attempt + 9 retries ("max_attempts" would count retries only)
        "mode": "adaptive"         # adaptive = exponential backoff + token bucket rate limiter
    }
)
s3 = boto3.client("s3", region_name="us-east-1", config=retry_config)
glue = boto3.client("glue", region_name="us-east-1", config=retry_config)
dynamo = boto3.client("dynamodb", region_name="us-east-1", config=retry_config)
# Now all boto3 calls on these clients automatically retry on throttling codes
# (ThrottlingException, SlowDown, ...), transient errors such as RequestTimeout
# and connection failures, and HTTP 500/502/503/504 responses.
# No extra code needed for basic retry cases.
response = s3.get_object(Bucket="my-bucket", Key="data/file.parquet")
print("Success — boto3 retried internally if needed")
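| Mode | Backoff | Rate Limiter | Best For |
|---|---|---|---|
| legacy | Exponential, narrow set of retried errors, default limit of 5 attempts | None | Only what you get when no mode is set; prefer standard |
| standard | Exponential (capped at 20s) | None | Most pipelines |
| adaptive | Exponential | Token bucket | High-throughput pipelines |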
In a real production pipeline, catching the error is just step one. You also need to log it to CloudWatch, write it to an audit table in DynamoDB, publish a failure alert to SNS, and decide whether to retry or fail the pipeline. Here is a production-style template that ties those pieces together:
import boto3, json, time, logging
from botocore.exceptions import ClientError, NoCredentialsError
from datetime import datetime, timezone
# ── Clients ──
s3 = boto3.client("s3", region_name="us-east-1")
dynamo = boto3.client("dynamodb", region_name="us-east-1")
sns = boto3.client("sns", region_name="us-east-1")
cw = boto3.client("cloudwatch", region_name="us-east-1")
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
AUDIT_TABLE = "pipeline_audit"
ALERT_TOPIC = "arn:aws:sns:us-east-1:123456789012:pipeline-alerts"
PIPELINE_NAME = "s3_to_delta_bronze"
RUN_ID = f"run_{int(time.time())}"
# ── Retryable error codes ──
RETRYABLE = {
"ThrottlingException", "SlowDown", "ServiceUnavailableException",
"RequestLimitExceeded", "InternalError"
}
def write_audit(status: str, error_msg: str = "", rows: int = 0):
"""Write pipeline run status to DynamoDB audit table."""
dynamo.put_item(
TableName=AUDIT_TABLE,
Item={
"run_id": {"S": RUN_ID},
"pipeline_name": {"S": PIPELINE_NAME},
"status": {"S": status},
"start_time": {"S": datetime.now(timezone.utc).isoformat()},
"rows_processed":{"N": str(rows)},
"error_message": {"S": error_msg},
}
)
def publish_alert(subject: str, message: str):
"""Push failure notification to SNS → email/Slack."""
sns.publish(
TopicArn=ALERT_TOPIC,
Subject=subject,
Message=json.dumps({
"pipeline": PIPELINE_NAME,
"run_id": RUN_ID,
"message": message,
"time": datetime.now(timezone.utc).isoformat()
}, indent=2)
)
def publish_metric(metric_name: str, value: float, unit: str = "Count"):
"""Publish custom CloudWatch metric for dashboards and alarms."""
cw.put_metric_data(
Namespace="DataPipeline/Bronze",
MetricData=[{
"MetricName": metric_name,
"Value": value,
"Unit": unit,
"Dimensions": [{"Name": "Pipeline", "Value": PIPELINE_NAME}]
}]
)
def run_pipeline():
"""Main pipeline entry point with full error handling."""
rows_processed = 0
try:
logger.info(f"Starting pipeline: {PIPELINE_NAME} | run_id: {RUN_ID}")
write_audit("RUNNING")
# ── Step 1: Download source file from S3 ──
try:
response = s3.get_object(Bucket="raw-data-lake", Key="events/2024/01/01/events.json")
data = response["Body"].read()
logger.info(f"Downloaded {len(data):,} bytes from S3")
except ClientError as e:
code = e.response["Error"]["Code"]
if code == "NoSuchKey":
logger.warning("Source file not found — skipping run (may be first run).")
write_audit("SKIPPED", "Source file not found")
return # graceful exit — not a failure
elif code in RETRYABLE:
logger.error(f"Transient S3 error: {code} — should be handled by boto3 retry config")
raise # re-raise — boto3 adaptive mode should have already retried
else:
raise # unexpected error — propagate to outer handler
# ── Step 2: Transform and write output ──
# (your Spark / Pandas / business logic here)
rows_processed = 1_000_000 # example
# ── Step 3: Success ──
write_audit("SUCCEEDED", rows=rows_processed)
publish_metric("RowsProcessed", rows_processed)
publish_metric("PipelineSuccess", 1)
logger.info(f"✅ Pipeline succeeded. Rows: {rows_processed:,}")
except NoCredentialsError:
msg = "No AWS credentials found. Check IAM role configuration."
logger.critical(msg)
# The audit write, SNS alert and metric all need credentials as well, so calling
# them here would only raise NoCredentialsError again. Log locally and re-raise.
raise
except ClientError as e:
code = e.response["Error"]["Code"]
msg = f"AWS ClientError [{code}]: {e.response['Error']['Message']}"
logger.error(msg)
write_audit("FAILED", msg)
publish_alert(f"FAILED: {PIPELINE_NAME} — {code}", msg)
publish_metric("PipelineFailure", 1)
raise
except Exception as e:
msg = f"Unexpected error: {type(e).__name__}: {str(e)}"
logger.exception(msg)
write_audit("FAILED", msg)
publish_alert(f"FAILED: {PIPELINE_NAME} — Unexpected Error", msg)
publish_metric("PipelineFailure", 1)
raise
if __name__ == "__main__":
run_pipeline()
✅ Failures immediately trigger SNS alerts (ops team gets email/Slack)
✅ CloudWatch metrics power dashboards and alarms (PipelineFailure alarm → PagerDuty)
✅ Errors are classified — graceful skips vs real failures vs config errors
✅ Logs have run_id for correlation across all systems
AWS APIs like list_objects_v2, get_tables, scan return at most a fixed number of items per call (e.g. S3 returns max 1000 objects per call). If there are more, AWS returns a continuation token (NextToken, NextMarker, ContinuationToken) that you must pass in the next call to get the next page.
If you call s3.list_objects_v2(Bucket="my-bucket") and get back 1000 objects, you might think that's all of them. But if the bucket has 50,000 objects, you silently missed 49,000. Paginators prevent this silent truncation bug.
import boto3
s3 = boto3.client("s3", region_name="us-east-1")
# ❌ WRONG — silently misses objects beyond the first 1000
response = s3.list_objects_v2(Bucket="my-data-lake", Prefix="bronze/events/")
objects = response.get("Contents", [])
print(f"Found {len(objects)} objects") # might say 1000, but real count is 50,000
# ✅ CORRECT — paginator walks all pages automatically
paginator = s3.get_paginator("list_objects_v2")
pages = paginator.paginate(Bucket="my-data-lake", Prefix="bronze/events/")
all_objects = []
for page in pages:
all_objects.extend(page.get("Contents", []))
print(f"Found {len(all_objects)} objects") # correct: 50,000
get_paginator("operation_name") returns a Paginator object tied to that API operation. Calling .paginate(**kwargs) on it returns a PageIterator — a lazy iterator that makes one API call per page, automatically passing the continuation token between calls.
import boto3
s3 = boto3.client("s3", region_name="us-east-1")
# Step 1 — get a Paginator for a specific operation
paginator = s3.get_paginator("list_objects_v2")
# paginator knows: which field is the token, what the max page size is, etc.
# Step 2 — call paginate() with the same args you'd pass to the API
page_iterator = paginator.paginate(
Bucket = "my-data-lake",
Prefix = "bronze/events/2024/",
PaginationConfig = {
"MaxItems" : 5000, # total items to return across ALL pages (optional cap)
"PageSize" : 500, # items per individual page / API call
"StartingToken": None # resume from a specific page (advanced use)
}
)
# Step 3 — iterate over pages (each page = one API call)
total = 0
for page in page_iterator:
items = page.get("Contents", [])
total += len(items)
print(f" Page had {len(items)} items")
print(f"Total objects found: {total}")
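For comparison, this is roughly the loop a paginator saves you from writing. It is a sketch rather than botocore's actual implementation: iter_keys_manually is a made-up name, and the client is passed in so the logic can be tried against a stub.

```python
def iter_keys_manually(s3_client, bucket, prefix=""):
    """Yield every key under prefix by following continuation tokens by hand."""
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        for obj in page.get("Contents", []):
            yield obj["Key"]
        if not page.get("IsTruncated"):
            break  # last page reached
        # Feed the token from this response into the next request
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

The token field names differ by service, which is the main reason to prefer get_paginator over hand-written loops like this one.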
Instead of looping over pages and manually extracting fields, you can use .search(jmespath_expression) on a PageIterator. It applies a JMESPath query across all pages and yields matching values lazily — no need to collect all pages into memory first.
Contents[].Key means "from the Contents array in each page, give me the Key field of every item." It's a mini query language for navigating nested JSON structures.
import boto3
s3 = boto3.client("s3", region_name="us-east-1")
paginator = s3.get_paginator("list_objects_v2")
page_iterator = paginator.paginate(Bucket="my-data-lake", Prefix="bronze/")
# ── Extract only the Keys (file paths) from all pages ──
# JMESPath: "Contents[].Key" → from Contents array, get Key field of each item
all_keys = list(page_iterator.search("Contents[].Key"))
print(f"All S3 keys: {all_keys[:5]}") # e.g. ['bronze/events/2024/01/file1.parquet', ...]
# ── Filter: only .parquet files, get their keys and sizes ──
page_iterator2 = paginator.paginate(Bucket="my-data-lake", Prefix="bronze/")
parquet_info = [
    pair for pair in page_iterator2.search(
        "Contents[?ends_with(Key, '.parquet')].[Key, Size]"
    )
    if pair is not None  # a page without Contents yields None instead of a pair
]
# Returns list of [key, size] pairs for .parquet files only
for key, size in parquet_info:
print(f" {key} ({size:,} bytes)")
# ── Get keys larger than 100MB ──
page_iterator3 = paginator.paginate(Bucket="my-data-lake", Prefix="bronze/")
large_files = list(page_iterator3.search("Contents[?Size > `104857600`].Key"))
print(f"Files > 100MB: {large_files}")
Contents[].Key → all Keys from the Contents array
Contents[?Size > `1000`].Key → filter by condition
Contents[?ends_with(Key, '.parquet')] → ends_with filter
Contents[].{k: Key, s: Size} → project to a new shape
length(Contents[]) → count items
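If the query syntax is unfamiliar, it can help to see each expression next to an equivalent plain-Python comprehension. The page dict below is made-up sample data shaped like one list_objects_v2 response page; no JMESPath library is involved.

```python
page = {  # made-up sample shaped like one list_objects_v2 response page
    "Contents": [
        {"Key": "bronze/a.parquet", "Size": 2048},
        {"Key": "bronze/b.csv", "Size": 512},
        {"Key": "bronze/c.parquet", "Size": 100},
    ]
}
contents = page.get("Contents", [])

# Contents[].Key
all_keys = [o["Key"] for o in contents]
# Contents[?Size > `1000`].Key
big_keys = [o["Key"] for o in contents if o["Size"] > 1000]
# Contents[?ends_with(Key, '.parquet')].[Key, Size]
parquet = [[o["Key"], o["Size"]] for o in contents if o["Key"].endswith(".parquet")]
# Contents[].{k: Key, s: Size}
reshaped = [{"k": o["Key"], "s": o["Size"]} for o in contents]
# length(Contents[])
count = len(contents)

print(big_keys)  # ['bronze/a.parquet']
print(parquet)   # [['bronze/a.parquet', 2048], ['bronze/c.parquet', 100]]
print(count)     # 3
```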
The most common paginator in data engineering. Used to list all files in a bucket prefix — essential for building file inventories, finding new files to process, or checking what's landed in a landing zone.
import boto3
from datetime import datetime, timezone, timedelta
s3 = boto3.client("s3", region_name="us-east-1")
paginator = s3.get_paginator("list_objects_v2")
# ── List all objects in a prefix ──
def list_all_s3_objects(bucket: str, prefix: str) -> list[dict]:
"""Return list of all S3 object metadata dicts under prefix."""
pages = paginator.paginate(Bucket=bucket, Prefix=prefix)
objects = []
for page in pages:
objects.extend(page.get("Contents", []))
return objects
# ── List only files modified in the last N hours (incremental check) ──
def list_recent_files(bucket: str, prefix: str, hours: int = 24) -> list[str]:
cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
pages = paginator.paginate(Bucket=bucket, Prefix=prefix)
new_files = []
for page in pages:
for obj in page.get("Contents", []):
if obj["LastModified"] >= cutoff:
new_files.append(obj["Key"])
return new_files
# ── Usage ──
all_files = list_all_s3_objects("my-data-lake", "bronze/events/")
print(f"Total files: {len(all_files)}")
recent = list_recent_files("my-data-lake", "bronze/events/", hours=6)
print(f"Files landed in last 6 hours: {len(recent)}")
# ── List common prefixes (pseudo-folders) — useful for partition discovery ──
folder_paginator = s3.get_paginator("list_objects_v2")
folder_pages = folder_paginator.paginate(
Bucket = "my-data-lake",
Prefix = "bronze/events/",
Delimiter = "/" # treat / as folder separator
)
partitions = []
for page in folder_pages:
for prefix_obj in page.get("CommonPrefixes", []):
partitions.append(prefix_obj["Prefix"])
print(f"Partition folders: {partitions}")
# e.g. ['bronze/events/year=2024/', 'bronze/events/year=2023/']
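Once the partition prefixes are listed, a pipeline usually needs the partition values themselves. A minimal sketch, assuming Hive-style name=value path segments as in the example output above (parse_partition is a hypothetical helper):

```python
def parse_partition(prefix: str) -> dict:
    """Turn 'bronze/events/year=2024/month=01/' into {'year': '2024', 'month': '01'}."""
    parts = {}
    for segment in prefix.strip("/").split("/"):
        if "=" in segment:
            name, _, value = segment.partition("=")
            parts[name] = value
    return parts


print(parse_partition("bronze/events/year=2024/"))  # {'year': '2024'}
```

Values stay as strings here; casting to int or date is left to the caller because partition schemes vary.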
Used to list all tables in a Glue database, or all partitions of a table. Essential for metadata-driven pipelines that discover tables dynamically rather than hardcoding table names.
import boto3
glue = boto3.client("glue", region_name="us-east-1")
# ── List all tables in a Glue database ──
def list_glue_tables(database: str) -> list[str]:
paginator = glue.get_paginator("get_tables")
pages = paginator.paginate(DatabaseName=database)
table_names = []
for page in pages:
for table in page["TableList"]:
table_names.append(table["Name"])
return table_names
tables = list_glue_tables("bronze_db")
print(f"Tables in bronze_db: {tables}")
# ── List all partitions of a table ──
def list_glue_partitions(database: str, table: str) -> list[dict]:
paginator = glue.get_paginator("get_partitions")
pages = paginator.paginate(DatabaseName=database, TableName=table)
partitions = []
for page in pages:
partitions.extend(page["Partitions"])
return partitions
parts = list_glue_partitions("bronze_db", "events")
print(f"Partition count: {len(parts)}")
for p in parts[:3]:
print(f" Values: {p['Values']} | Location: {p['StorageDescriptor']['Location']}")
# ── List all Glue databases ──
db_paginator = glue.get_paginator("get_databases")
all_dbs = list(db_paginator.paginate().search("DatabaseList[].Name"))
print(f"All databases: {all_dbs}")
Athena returns query results in pages of up to 1000 rows. To get all rows from a large result set you must paginate get_query_results. You also need to parse the ResultSet structure into a usable list of dicts.
import boto3, time
athena = boto3.client("athena", region_name="us-east-1")
def run_athena_query(sql: str, database: str, output_s3: str) -> list[dict]:
"""
Run an Athena query, wait for it, paginate all results, return list of dicts.
"""
# Step 1 — start query
start = athena.start_query_execution(
QueryString = sql,
QueryExecutionContext = {"Database": database},
ResultConfiguration = {"OutputLocation": output_s3}
)
query_id = start["QueryExecutionId"]
print(f"Started query: {query_id}")
# Step 2 — poll until done (Athena has no built-in waiter)
while True:
status = athena.get_query_execution(QueryExecutionId=query_id)
state = status["QueryExecution"]["Status"]["State"]
if state == "SUCCEEDED":
break
elif state in ("FAILED", "CANCELLED"):
reason = status["QueryExecution"]["Status"].get("StateChangeReason", "")
raise RuntimeError(f"Athena query {state}: {reason}")
time.sleep(2) # poll every 2 seconds
# Step 3 — paginate results
paginator = athena.get_paginator("get_query_results")
pages = paginator.paginate(QueryExecutionId=query_id)
rows = []
column_names = None
for page in pages:
result_set = page["ResultSet"]
if column_names is None:
# First row of first page = header row with column names
column_names = [
col["VarCharValue"]
for col in result_set["Rows"][0]["Data"]
]
data_rows = result_set["Rows"][1:] # skip header
else:
data_rows = result_set["Rows"]
for row in data_rows:
values = [cell.get("VarCharValue", "") for cell in row["Data"]]
rows.append(dict(zip(column_names, values)))
print(f"Query returned {len(rows)} rows")
return rows
# ── Usage ──
results = run_athena_query(
sql = "SELECT event_type, COUNT(*) as cnt FROM events GROUP BY 1 ORDER BY 2 DESC",
database = "bronze_db",
output_s3 = "s3://my-athena-results/queries/"
)
for row in results[:5]:
print(row) # {'event_type': 'click', 'cnt': '1234567'}
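The polling loop above runs until Athena reports a terminal state. A bounded variant might look like the sketch below; wait_for_query, timeout_s and poll_s are assumed names, and the sleep and clock parameters exist only so the function can be tested without real waiting.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_query(athena_client, query_id, timeout_s=300.0, poll_s=2.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_query_execution until a terminal state or until timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        status = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(
                f"Athena query {query_id} still {state} after {timeout_s}s"
            )
        sleep(poll_s)
```

The caller still has to decide what FAILED or CANCELLED means for the pipeline; this helper only bounds the wait.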
Used to audit running clusters and steps programmatically — useful for monitoring dashboards, cost tracking, and finding stuck or zombie jobs.
import boto3
emr = boto3.client("emr", region_name="us-east-1")
# ── List all RUNNING clusters ──
def list_running_clusters() -> list[dict]:
paginator = emr.get_paginator("list_clusters")
pages = paginator.paginate(ClusterStates=["RUNNING", "WAITING"])
clusters = []
for page in pages:
clusters.extend(page["Clusters"])
return clusters
running = list_running_clusters()
for c in running:
print(f"Cluster: {c['Name']} | ID: {c['Id']} | State: {c['Status']['State']}")
# ── List all steps for a cluster ──
def list_cluster_steps(cluster_id: str) -> list[dict]:
paginator = emr.get_paginator("list_steps")
pages = paginator.paginate(ClusterId=cluster_id)
steps = []
for page in pages:
steps.extend(page["Steps"])
return steps
steps = list_cluster_steps("j-XXXXXXXXXXXXX")
for s in steps:
print(f" Step: {s['Name']} | State: {s['Status']['State']}")
DynamoDB scan and query return up to 1MB of data per call. For large tables or audit tables with many records, you must paginate. Use query over scan whenever possible: scan reads the entire table, while query reads only the items under one partition key, on the table itself or on an index.
import boto3
dynamo = boto3.client("dynamodb", region_name="us-east-1")
# ── Paginated scan — read all items in audit table ──
def scan_all_items(table_name: str) -> list[dict]:
paginator = dynamo.get_paginator("scan")
pages = paginator.paginate(TableName=table_name)
items = []
for page in pages:
items.extend(page["Items"])
return items
all_runs = scan_all_items("pipeline_audit")
print(f"Total audit records: {len(all_runs)}")
# ── Paginated query — get all runs for a specific pipeline ──
def query_pipeline_runs(table_name: str, pipeline_name: str) -> list[dict]:
paginator = dynamo.get_paginator("query")
pages = paginator.paginate(
TableName = table_name,
KeyConditionExpression = "pipeline_name = :pn",
ExpressionAttributeValues = {":pn": {"S": pipeline_name}},
ScanIndexForward = False # newest first
)
items = []
for page in pages:
items.extend(page["Items"])
return items
runs = query_pipeline_runs("pipeline_audit", "s3_to_delta_bronze")
for run in runs[:5]:
print(f" run_id={run['run_id']['S']} | status={run['status']['S']}")
scan reads every item in the table — it costs RCUs proportional to table size. For production audit tables with millions of records, always design a GSI (Global Secondary Index) and use query instead.
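Items returned by the low-level client arrive in DynamoDB's typed format ({"S": ...}, {"N": ...}), as the run['run_id']['S'] lookups above show. A simplified sketch of flattening them into plain Python values follows; unmarshal is a made-up helper that covers only four type tags (boto3 also provides a TypeDeserializer in boto3.dynamodb.types for the full set).

```python
from decimal import Decimal


def unmarshal(item: dict) -> dict:
    """Convert {'run_id': {'S': 'x'}, 'rows': {'N': '5'}} into plain Python values.

    Simplified: handles only S, N, BOOL and NULL; other types are returned as-is.
    """
    def convert(attr):
        (type_tag, value), = attr.items()  # each attribute is a one-entry dict
        if type_tag == "S":
            return value
        if type_tag == "N":
            return Decimal(value)  # numbers travel as strings; Decimal avoids float rounding
        if type_tag == "BOOL":
            return value
        if type_tag == "NULL":
            return None
        return attr  # M, L, SS, B, ... left untouched in this sketch

    return {name: convert(attr) for name, attr in item.items()}
```

Applied to the query results, [unmarshal(r) for r in runs] would give dicts that can be handed straight to reporting code.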
| Service | Operation | Key Result Field | Token Field |
|---|---|---|---|
| S3 | list_objects_v2 | Contents | ContinuationToken |
| S3 | list_buckets | Buckets | ContinuationToken in recent SDK versions (older versions return everything in one call) |
| Glue | get_tables | TableList | NextToken |
| Glue | get_partitions | Partitions | NextToken |
| Glue | get_databases | DatabaseList | NextToken |
| Glue | get_job_runs | JobRuns | NextToken |
| Athena | get_query_results | ResultSet.Rows | NextToken |
| Athena | list_query_executions | QueryExecutionIds | NextToken |
| EMR | list_clusters | Clusters | Marker |
| EMR | list_steps | Steps | Marker |
| DynamoDB | scan | Items | LastEvaluatedKey |
| DynamoDB | query | Items | LastEvaluatedKey |
| SNS | list_topics | Topics | NextToken |
| Lambda | list_functions | Functions | NextMarker |
| IAM | list_roles | Roles | Marker |
A waiter calls a specific AWS API on a fixed interval (e.g. every 15 seconds) and checks a field in the response. When that field matches the desired state (e.g. State = "running"), the waiter returns. If the field matches a failure state, it raises WaiterError. If max attempts are exhausted, it also raises WaiterError.
import boto3
from botocore.exceptions import WaiterError
s3 = boto3.client("s3", region_name="us-east-1")
# ── Get a waiter by name ──
waiter = s3.get_waiter("bucket_exists")
# ── Wait until the S3 bucket exists ──
# Polls s3.head_bucket() every 5s, up to 20 attempts (100s total)
try:
waiter.wait(Bucket="my-new-data-lake-bucket")
print("✅ Bucket is now available")
except WaiterError as e:
print(f"❌ Waiter failed: {e}")
# Either bucket never appeared, or an error occurred during polling
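The mechanics described above (poll, compare against success and failure states, give up after a fixed number of attempts) can be sketched in a few lines. This is an illustration of the idea, not botocore's waiter code; wait_until and WaitFailed are invented names, with WaitFailed standing in for WaiterError.

```python
import time


class WaitFailed(Exception):
    """Raised on a failure state or when attempts run out (stand-in for WaiterError)."""


def wait_until(get_state, success, failure=(), delay=15, max_attempts=40,
               sleep=time.sleep):
    """Call get_state() every `delay` seconds until it returns a success state."""
    for attempt in range(1, max_attempts + 1):
        state = get_state()
        if state in success:
            return state
        if state in failure:
            raise WaitFailed(f"reached failure state {state!r}")
        if attempt < max_attempts:
            sleep(delay)  # no sleep after the final attempt
    raise WaitFailed(f"gave up after {max_attempts} attempts")
```

For services without a built-in waiter, a helper of this shape can wrap any describe-style call, for example a lambda that returns the state field from the response.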
client.waiter_names lists all built-in waiters for that service client. Example:
print(s3.waiter_names)
# → ['bucket_exists', 'bucket_not_exists', 'object_exists', 'object_not_exists']
S3 waiters are used to confirm that a bucket or object exists before proceeding — essential in pipelines where an upstream step creates a file and a downstream step needs to read it.
import boto3
from botocore.exceptions import WaiterError
s3 = boto3.client("s3", region_name="us-east-1")
# WaiterConfig is passed as a plain dict with "Delay" (seconds) and "MaxAttempts" keys
# ── Wait for a bucket to exist ──
bucket_waiter = s3.get_waiter("bucket_exists")
bucket_waiter.wait(
    Bucket = "my-data-lake-bucket",
    WaiterConfig = {"Delay": 5, "MaxAttempts": 12}  # 5s × 12 = 60s max
)
print("Bucket exists, proceeding")
# ── Wait for a specific object (file) to appear ──
object_waiter = s3.get_waiter("object_exists")
try:
    object_waiter.wait(
        Bucket = "my-data-lake-bucket",
        Key = "bronze/events/2024/01/01/events.parquet",
        WaiterConfig = {"Delay": 10, "MaxAttempts": 30}  # 10s × 30 = 5min max
    )
    print("✅ File has landed, safe to read")
except WaiterError:
    raise RuntimeError("File did not appear within 5 minutes; upstream job may have failed")
# ── Wait for object to be DELETED (useful for cleanup verification) ──
delete_waiter = s3.get_waiter("object_not_exists")
delete_waiter.wait(
    Bucket = "my-data-lake-bucket",
    Key = "temp/staging/scratch_file.csv"
)
print("Temp file is gone")
EMR has built-in waiters for cluster and step states. These are the most commonly used waiters in data pipelines that run Spark jobs on EMR — you submit a step and then wait for it to complete before writing the audit record or triggering the next step.
import boto3
from botocore.exceptions import WaiterError
emr = boto3.client("emr", region_name="us-east-1")
# WaiterConfig is passed as a plain dict with "Delay" (seconds) and "MaxAttempts" keys
# ── Step 1: Start a cluster ──
cluster = emr.run_job_flow(
    Name = "bronze-ingestion-cluster",
    ReleaseLabel = "emr-6.15.0",
    Instances = {
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType" : "m5.xlarge",
        "InstanceCount" : 3,
        "KeepJobFlowAliveWhenNoSteps": True  # keep the cluster up until we terminate it
    },
    Applications = [{"Name": "Spark"}],
    JobFlowRole = "EMR_EC2_DefaultRole",
    ServiceRole = "EMR_DefaultRole"
)
cluster_id = cluster["JobFlowId"]
print(f"Cluster launched: {cluster_id}")
# ── Step 2: Wait for cluster to be RUNNING ──
cluster_waiter = emr.get_waiter("cluster_running")
try:
    cluster_waiter.wait(
        ClusterId = cluster_id,
        WaiterConfig = {"Delay": 30, "MaxAttempts": 40}  # 30s × 40 = 20min max
    )
    print("✅ Cluster is running")
except WaiterError as e:
    raise RuntimeError(f"Cluster failed to start: {e}")
# ── Step 3: Submit a Spark step ──
step_response = emr.add_job_flow_steps(
    JobFlowId = cluster_id,
    Steps = [{
        "Name" : "run-bronze-etl",
        "ActionOnFailure" : "CONTINUE",
        "HadoopJarStep" : {
            "Jar" : "command-runner.jar",
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "s3://my-scripts/bronze_etl.py",
                "--date", "2024-01-01"
            ]
        }
    }]
)
step_id = step_response["StepIds"][0]
print(f"Step submitted: {step_id}")
# ── Step 4: Wait for step to COMPLETE ──
step_waiter = emr.get_waiter("step_complete")
try:
    step_waiter.wait(
        ClusterId = cluster_id,
        StepId = step_id,
        WaiterConfig = {"Delay": 30, "MaxAttempts": 120}  # 30s × 120 = 60min max
    )
    print("✅ Spark step completed successfully")
except WaiterError as e:
    # Check if it FAILED or just timed out
    # Note: raising here leaves the cluster running; terminate it from the caller
    step_info = emr.describe_step(ClusterId=cluster_id, StepId=step_id)
    step_state = step_info["Step"]["Status"]["State"]
    raise RuntimeError(f"Step ended in state: {step_state}. Error: {e}")
# ── Step 5: Terminate cluster ──
emr.terminate_job_flows(JobFlowIds=[cluster_id])
print("Cluster termination initiated")
After deploying a new Lambda function or updating its code, it takes a few seconds to become active. These waiters ensure the function is ready before you invoke it — important in CI/CD pipelines that deploy and then immediately test the function.
import boto3
lam = boto3.client("lambda", region_name="us-east-1")
# ── Deploy new function code ──
lam.update_function_code(
    FunctionName = "my-pipeline-trigger",
    S3Bucket = "my-deploy-bucket",
    S3Key = "lambdas/pipeline_trigger_v2.zip"
)
print("Code update submitted")
# ── Wait for update to complete before invoking ──
# WaiterConfig is a plain dict with "Delay" and "MaxAttempts" keys
update_waiter = lam.get_waiter("function_updated")
update_waiter.wait(
    FunctionName = "my-pipeline-trigger",
    WaiterConfig = {"Delay": 5, "MaxAttempts": 20}  # 5s × 20 = 100s max
)
print("✅ Lambda function updated and ready")
# ── Now safe to invoke ──
response = lam.invoke(
    FunctionName = "my-pipeline-trigger",
    InvocationType = "RequestResponse",
    Payload = b'{"env": "prod", "date": "2024-01-01"}'
)
print(f"Lambda response status: {response['StatusCode']}")
Every waiter has a default delay (seconds between polls) and max_attempts (how many times to poll before giving up). The defaults rarely match your operation: EMR's 30-minute default is too short for a long Spark step and needlessly long for a quick S3 check. Override them by passing a WaiterConfig dict with Delay and MaxAttempts keys sized to the operation's expected duration.
import boto3
from botocore.exceptions import WaiterError
emr = boto3.client("emr", region_name="us-east-1")
# ── Default waiter config ──
# cluster_running default: delay=30s, max_attempts=60 → 30 minutes max
# step_complete default: delay=30s, max_attempts=60 → 30 minutes max (too short for long Spark jobs)
# ── Override for a long-running Spark job ──
long_job_config = {
    "Delay" : 60,         # poll every 60 seconds
    "MaxAttempts": 120    # max 120 polls = 120 minutes (2 hours) max wait
}
step_waiter = emr.get_waiter("step_complete")
try:
    step_waiter.wait(
        ClusterId = "j-XXXXXXXXXXXXX",
        StepId = "s-XXXXXXXXXXXXX",
        WaiterConfig = long_job_config
    )
    print("✅ Step complete within 2 hours")
except WaiterError:
    print("❌ Step did not complete within 2 hours — investigate")
# ── Fast config for quick operations like S3 object check ──
fast_config = {"Delay": 3, "MaxAttempts": 20}  # 3s × 20 = 60s max
s3 = boto3.client("s3", region_name="us-east-1")
s3.get_waiter("object_exists").wait(
    Bucket = "my-bucket",
    Key = "trigger/ready.flag",
    WaiterConfig = fast_config
)
| Waiter | Default delay | Default max_attempts | Recommended override |
|---|---|---|---|
| s3 bucket_exists | 5s | 20 | delay=5, max=12 (60s) |
| s3 object_exists | 5s | 20 | delay=10, max=30 (5min) |
| emr cluster_running | 30s | 60 | delay=30, max=40 (20min) |
| emr step_complete | 30s | 60 | delay=60, max=120 (2hr) |
| lambda function_active | 5s | 60 | delay=5, max=20 (100s) |
| lambda function_updated | 5s | 60 | delay=5, max=20 (100s) |
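The "recommended override" column is just arithmetic on two numbers, so it can be derived from a target timeout instead of hand-picked. The sketch below assumes the waiter's WaiterConfig argument is a dict with Delay and MaxAttempts keys; the helper names waiter_config_for and max_wait_seconds are hypothetical and not part of boto3.

```python
import math


def waiter_config_for(timeout_seconds: int, delay_seconds: int) -> dict:
    """Build a WaiterConfig dict whose polls cover roughly timeout_seconds.

    Hypothetical helper: rounds the attempt count up so the total wait
    is never shorter than the requested timeout.
    """
    if timeout_seconds <= 0 or delay_seconds <= 0:
        raise ValueError("timeout_seconds and delay_seconds must be positive")
    attempts = math.ceil(timeout_seconds / delay_seconds)
    return {"Delay": delay_seconds, "MaxAttempts": attempts}


def max_wait_seconds(config: dict) -> int:
    """Approximate upper bound on the wait: Delay x MaxAttempts."""
    return config["Delay"] * config["MaxAttempts"]


# Usage with a real waiter (not executed here):
# s3.get_waiter("object_exists").wait(
#     Bucket="my-bucket", Key="trigger/ready.flag",
#     WaiterConfig=waiter_config_for(timeout_seconds=300, delay_seconds=10),
# )
```

Deriving the attempt count from the timeout keeps the intent (for example "give this five minutes") visible in the pipeline code, while the poll interval stays a separate tuning knob.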
Two of the most commonly used services in data engineering — Glue and Athena — do not have built-in boto3 waiters. You must write your own polling loop. The pattern is always the same: call the status API, check the state field, sleep, repeat until terminal state.
A common mistake is to call glue.start_job_run() and assume the job is done when the call returns. It isn't: start_job_run is asynchronous. You must poll get_job_run() until the run reaches a terminal state such as SUCCEEDED or FAILED.
import boto3, time
from botocore.exceptions import ClientError
glue = boto3.client("glue", region_name="us-east-1")
# Terminal states — stop polling when we hit one of these
GLUE_TERMINAL_STATES = {"SUCCEEDED", "FAILED", "ERROR", "TIMEOUT", "STOPPED"}
def wait_for_glue_job(
job_name : str,
run_id : str,
poll_interval: int = 30, # seconds between polls
max_wait : int = 7200 # maximum total wait time in seconds (2 hours)
) -> str:
"""
Poll Glue job until terminal state. Returns final JobRunState string.
Raises RuntimeError if job fails or times out.
"""
elapsed = 0
attempt = 0
while elapsed < max_wait:
attempt += 1
try:
response = glue.get_job_run(JobName=job_name, RunId=run_id)
state = response["JobRun"]["JobRunState"]
duration = response["JobRun"].get("ExecutionTime", 0)
print(f" [{attempt}] Glue job '{job_name}' state: {state} "
f"(elapsed: {elapsed}s, execution: {duration}s)")
if state in GLUE_TERMINAL_STATES:
if state == "SUCCEEDED":
print(f"✅ Glue job completed successfully in {duration}s")
return state
else:
error_msg = response["JobRun"].get("ErrorMessage", "No error message")
raise RuntimeError(
f"❌ Glue job '{job_name}' ended with state '{state}': {error_msg}"
)
except ClientError as e:
code = e.response["Error"]["Code"]
if code in ("ThrottlingException", "ServiceUnavailableException"):
print(f" Throttled during polling — will retry after sleep")
else:
raise
time.sleep(poll_interval)
elapsed += poll_interval
raise TimeoutError(
f"Glue job '{job_name}' did not complete within {max_wait}s "
f"(last known state: polling timed out)"
)
# ── Full usage pattern ──
job_name = "bronze-events-etl"
# Start the job
run_response = glue.start_job_run(
JobName = job_name,
Arguments = {
"--date" : "2024-01-01",
"--env" : "prod",
"--output_path": "s3://my-data-lake/bronze/events/"
}
)
run_id = run_response["JobRunId"]
print(f"Started Glue job run: {run_id}")
# Wait for it to finish
final_state = wait_for_glue_job(job_name, run_id, poll_interval=30, max_wait=3600)
print(f"Final state: {final_state}")
Athena query execution is also asynchronous — start_query_execution returns immediately with a query ID. You must poll get_query_execution to know when it finishes. Here is the production-grade pattern with proper error handling and backoff.
import boto3, time
from botocore.exceptions import ClientError
athena = boto3.client("athena", region_name="us-east-1")
ATHENA_TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}
def wait_for_athena_query(
query_execution_id: str,
poll_interval : int = 2, # start with 2s polls
max_wait : int = 1800 # 30 minutes max
) -> dict:
"""
Poll Athena query until terminal state.
Returns the full QueryExecution dict on success.
Raises RuntimeError on FAILED/CANCELLED or timeout.
"""
elapsed = 0
attempt = 0
interval = poll_interval # will increase with backoff
while elapsed < max_wait:
attempt += 1
try:
response = athena.get_query_execution(QueryExecutionId=query_execution_id)
execution = response["QueryExecution"]
state = execution["Status"]["State"]
print(f" [{attempt}] Athena query state: {state} (elapsed: {elapsed:.0f}s)")
if state in ATHENA_TERMINAL_STATES:
if state == "SUCCEEDED":
stats = execution.get("Statistics", {})
print(f"✅ Query succeeded | "
f"Scanned: {stats.get('DataScannedInBytes',0)/1e6:.1f} MB | "
f"Runtime: {stats.get('TotalExecutionTimeInMillis',0)/1000:.1f}s")
return execution
else:
reason = execution["Status"].get("StateChangeReason", "No reason given")
raise RuntimeError(
f"❌ Athena query {state}: {reason} "
f"(QueryExecutionId: {query_execution_id})"
)
except ClientError as e:
code = e.response["Error"]["Code"]
if code == "ThrottlingException":
interval = min(interval * 2, 30) # back off on throttle, cap at 30s
print(f" Throttled — backing off to {interval}s")
else:
raise
time.sleep(interval)
elapsed += interval
# Gradually increase poll interval to reduce API calls for long queries
if elapsed > 30 and interval < 10:
interval = 10
elif elapsed > 120 and interval < 20:
interval = 20
raise TimeoutError(
f"Athena query {query_execution_id} did not complete within {max_wait}s"
)
# ── Full Athena workflow ──
start = athena.start_query_execution(
QueryString = """
SELECT date_trunc('hour', event_time) AS hour,
event_type,
COUNT(*) AS cnt
FROM bronze_db.events
WHERE year = '2024' AND month = '01'
GROUP BY 1, 2
ORDER BY 1, 3 DESC
""",
QueryExecutionContext = {"Database": "bronze_db"},
ResultConfiguration = {"OutputLocation": "s3://my-athena-results/queries/"}
)
query_id = start["QueryExecutionId"]
print(f"Athena query started: {query_id}")
execution = wait_for_athena_query(query_id, poll_interval=2, max_wait=900)
print(f"Query complete — now fetching results...")
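Once the query has succeeded, the rows still have to be fetched with get_query_results, which returns pages of cells rather than ready-made records. A minimal sketch of the conversion, assuming a SELECT query (where the first row of the first page holds the column names); rows_from_result_pages is a hypothetical helper name.

```python
def rows_from_result_pages(pages):
    """Turn Athena get_query_results pages into a list of dicts.

    Assumes a SELECT query, where the first row returned is the header.
    NULL cells have no "VarCharValue" key and become None.
    """
    header = None
    rows = []
    for page in pages:
        for row in page["ResultSet"]["Rows"]:
            values = [cell.get("VarCharValue") for cell in row["Data"]]
            if header is None:
                header = values  # first row = column names
                continue
            rows.append(dict(zip(header, values)))
    return rows


# Usage with a real client (not executed here):
# paginator = athena.get_paginator("get_query_results")
# rows = rows_from_result_pages(paginator.paginate(QueryExecutionId=query_id))
```

All values come back as strings, so numeric columns such as cnt need an explicit cast. For large result sets it is usually cheaper to read the CSV that Athena writes to the OutputLocation instead of paging through the API.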
boto3 allows you to define a custom waiter using WaiterModel — a JSON-like config that tells boto3 which API to call, which field to inspect, and which values are success vs failure. This is the proper way to build reusable waiters for services like Glue that don't have built-in ones.
import boto3
from botocore.waiter import WaiterModel, create_waiter_with_client
from botocore.exceptions import WaiterError
glue = boto3.client("glue", region_name="us-east-1")
# ── Define the custom waiter model ──
# This tells boto3: call GetJobRun, look at JobRun.JobRunState,
# succeed on SUCCEEDED, fail on FAILED/ERROR/TIMEOUT/STOPPED
glue_job_waiter_model = WaiterModel({
"version" : 2,
"waiters" : {
"JobRunComplete": {
"delay" : 30, # poll every 30 seconds
"maxAttempts": 120, # up to 120 attempts = 60 minutes
"operation" : "GetJobRun", # which boto3 API to call
"acceptors" : [
{
"matcher" : "path",
"expected" : "SUCCEEDED",
"argument" : "JobRun.JobRunState",
"state" : "success" # waiter returns when this matches
},
{
"matcher" : "path",
"expected" : "FAILED",
"argument" : "JobRun.JobRunState",
"state" : "failure" # waiter raises WaiterError when this matches
},
{
"matcher" : "path",
"expected" : "ERROR",
"argument" : "JobRun.JobRunState",
"state" : "failure"
},
{
"matcher" : "path",
"expected" : "TIMEOUT",
"argument" : "JobRun.JobRunState",
"state" : "failure"
},
{
"matcher" : "path",
"expected" : "STOPPED",
"argument" : "JobRun.JobRunState",
"state" : "failure"
}
]
}
}
})
# ── Create the waiter from the model ──
glue_job_waiter = create_waiter_with_client(
waiter_name = "JobRunComplete",
waiter_model = glue_job_waiter_model,
client = glue
)
# ── Use it exactly like a built-in waiter ──
job_name = "bronze-events-etl"
run = glue.start_job_run(JobName=job_name, Arguments={"--date": "2024-01-01"})
run_id = run["JobRunId"]
print(f"Started Glue job: {run_id}")
try:
glue_job_waiter.wait(JobName=job_name, RunId=run_id)
print("✅ Glue job succeeded")
except WaiterError as e:
raise RuntimeError(f"❌ Glue job failed or timed out: {e}")
Use a manual polling loop when you need dynamic logic — e.g. adaptive poll intervals, logging intermediate states, or writing progress to a DynamoDB audit table during the wait.
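As a sketch of what that dynamic logic can look like, the helper below separates the polling policy (backoff, timeout, progress hook) from the service call. Everything here is an assumption for illustration: poll_until_terminal is a hypothetical name, get_state is any zero-argument callable you supply, and on_poll is where a progress write to an audit table could go.

```python
import time


def poll_until_terminal(get_state, terminal_states, success_states, *,
                        delay=30.0, backoff=1.5, max_delay=120.0,
                        max_wait=3600.0, on_poll=None, sleep=time.sleep):
    """Poll get_state() until it returns a terminal state.

    Returns the state if it is in success_states, raises RuntimeError for
    any other terminal state, and TimeoutError once max_wait seconds of
    sleeping have accumulated. The delay grows by `backoff` per poll.
    """
    waited = 0.0
    attempt = 0
    while True:
        attempt += 1
        state = get_state()
        if on_poll is not None:
            on_poll(attempt, state, waited)  # e.g. log or write audit record
        if state in terminal_states:
            if state in success_states:
                return state
            raise RuntimeError(f"Ended in state {state} after {attempt} polls")
        if waited >= max_wait:
            raise TimeoutError(f"Still {state} after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * backoff, max_delay)


# Usage with Glue (not executed here):
# state = poll_until_terminal(
#     lambda: glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"],
#     terminal_states={"SUCCEEDED", "FAILED", "ERROR", "TIMEOUT", "STOPPED"},
#     success_states={"SUCCEEDED"},
# )
```

Passing sleep as a parameter is a deliberate choice: unit tests can substitute a no-op and run instantly.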
| Service | Waiter Name | Polls API | Waits For |
|---|---|---|---|
| S3 | bucket_exists | head_bucket | Bucket to appear |
| S3 | bucket_not_exists | head_bucket | Bucket to be deleted |
| S3 | object_exists | head_object | Object to appear |
| S3 | object_not_exists | head_object | Object to be deleted |
| EMR | cluster_running | describe_cluster | Cluster in RUNNING state |
| EMR | cluster_terminated | describe_cluster | Cluster terminated |
| EMR | step_complete | describe_step | Step COMPLETED |
| Lambda | function_active | get_function_configuration | Function Active state |
| Lambda | function_updated | get_function_configuration | Update propagated |
| Glue | ❌ None built-in | get_job_run (manual) | SUCCEEDED / FAILED |
| Athena | ❌ None built-in | get_query_execution (manual) | SUCCEEDED / FAILED |
| DynamoDB | table_exists | describe_table | Table ACTIVE |
| DynamoDB | table_not_exists | describe_table | Table deleted |
No built-in waiter (Glue, Athena) → write a manual polling loop with proper terminal state checks and error handling, or build a WaiterModel for team reuse.
Bucket-level operations are used in infrastructure setup, CI/CD pipelines, and environment provisioning. head_bucket is the right way to check if a bucket exists — it's lightweight and doesn't list contents.
import boto3
from botocore.exceptions import ClientError
s3 = boto3.client("s3", region_name="us-east-1")
REGION = "us-east-1"
# ── Create a bucket ──
# Note: us-east-1 does NOT take CreateBucketConfiguration — all other regions do
def create_bucket(bucket_name: str, region: str = "us-east-1"):
if region == "us-east-1":
s3.create_bucket(Bucket=bucket_name)
else:
s3.create_bucket(
Bucket = bucket_name,
CreateBucketConfiguration = {"LocationConstraint": region}
)
print(f"✅ Created bucket: {bucket_name}")
create_bucket("my-data-lake-bronze")
# ── List all buckets in the account ──
response = s3.list_buckets()
for bucket in response["Buckets"]:
print(f" {bucket['Name']} (created: {bucket['CreationDate'].date()})")
# ── Check if bucket exists (lightweight — does NOT list contents) ──
def bucket_exists(bucket_name: str) -> bool:
try:
s3.head_bucket(Bucket=bucket_name)
return True
except ClientError as e:
code = e.response["Error"]["Code"]
if code in ("404", "NoSuchBucket"):
return False
raise # re-raise unexpected errors (e.g. access denied)
print(bucket_exists("my-data-lake-bronze")) # True
print(bucket_exists("nonexistent-bucket")) # False
# ── Delete a bucket (must be empty first) ──
s3.delete_bucket(Bucket="my-temp-bucket")
print("Bucket deleted")
Bucket policies control access at the bucket level. Lifecycle rules automate storage class transitions and object expiry — critical for cost control in data lakes where raw/bronze data accumulates over time.
import boto3, json
s3 = boto3.client("s3", region_name="us-east-1")
# ── Set a bucket policy ──
# This policy allows a specific IAM role to read all objects
bucket_policy = {
"Version" : "2012-10-17",
"Statement": [{
"Sid" : "AllowGlueRead",
"Effect" : "Allow",
"Principal": {"AWS": "arn:aws:iam::123456789012:role/GlueETLRole"},
"Action" : ["s3:GetObject", "s3:ListBucket"],
"Resource" : [
"arn:aws:s3:::my-data-lake-bronze",
"arn:aws:s3:::my-data-lake-bronze/*"
]
}]
}
s3.put_bucket_policy(
Bucket = "my-data-lake-bronze",
Policy = json.dumps(bucket_policy)
)
print("Bucket policy applied")
# ── Read current policy ──
policy = s3.get_bucket_policy(Bucket="my-data-lake-bronze")
print(json.loads(policy["Policy"]))
# ── Set lifecycle rules — transition old data to cheaper storage ──
lifecycle_config = {
"Rules": [
{
"ID" : "bronze-archive-policy",
"Status" : "Enabled",
"Filter" : {"Prefix": "bronze/"}, # apply to bronze/ prefix only
"Transitions": [
{"Days": 30, "StorageClass": "STANDARD_IA"}, # after 30 days → IA
{"Days": 90, "StorageClass": "GLACIER_IR"}, # after 90 days → Glacier IR
{"Days": 365, "StorageClass": "DEEP_ARCHIVE"}, # after 1 year → Deep Archive
],
"Expiration": {"Days": 2555} # delete after 7 years
},
{
"ID" : "delete-temp-files",
"Status": "Enabled",
"Filter": {"Prefix": "temp/"},
"Expiration": {"Days": 3} # temp files auto-deleted after 3 days
}
]
}
s3.put_bucket_lifecycle_configuration(
Bucket = "my-data-lake-bronze",
LifecycleConfiguration = lifecycle_config
)
print("Lifecycle rules applied")
# ── Enable versioning ──
s3.put_bucket_versioning(
Bucket = "my-data-lake-bronze",
VersioningConfiguration = {"Status": "Enabled"}
)
print("Versioning enabled")
Temp/staging data: expire after 3–7 days automatically.
Gold (curated) data: keep in Standard, no transition — it's queried frequently.
Both upload data to S3 but they work differently. upload_file is a high-level method from the S3 Transfer Manager — it automatically handles multipart uploads for large files, retries, and concurrency. put_object is a low-level single HTTP PUT — use it only for small objects or when you need full control over metadata and content type.
upload_file is like a courier service: it figures out the best way to ship your package (breaks it into pieces if it's too big, and re-sends only the piece that was lost if the truck breaks down). put_object is like mailing a letter yourself: simple and direct, but everything travels in one envelope, so it suits small items, and a failure means sending the whole thing again.
import boto3, json
from botocore.exceptions import ClientError
s3 = boto3.client("s3", region_name="us-east-1")
# ── upload_file — USE THIS for most cases (especially large files) ──
# Automatically uses multipart upload for files > 8MB
# Has built-in retry logic and progress callback support
s3.upload_file(
Filename = "/local/path/events_2024_01_01.parquet", # local file path
Bucket = "my-data-lake",
Key = "bronze/events/year=2024/month=01/day=01/events.parquet",
ExtraArgs = {
"ContentType" : "application/octet-stream",
"ServerSideEncryption": "AES256", # SSE-S3 encryption
"Metadata" : {
"pipeline" : "bronze-ingestion",
"source" : "kafka-events",
"row_count" : "1000000"
}
}
)
print("✅ Large file uploaded with multipart automatically")
# ── put_object — USE FOR small objects, in-memory data, config files ──
# Single HTTP PUT: no multipart, so any retry re-sends the whole body
config_data = {"pipeline": "bronze-etl", "version": "2.1.0", "active": True}
s3.put_object(
Bucket = "my-data-lake",
Key = "config/pipeline_config.json",
Body = json.dumps(config_data).encode("utf-8"),
ContentType = "application/json"
)
print("Config file written to S3")
# ── put_object for writing a string directly (e.g. SQL, manifest) ──
sql_query = "SELECT * FROM events WHERE year = '2024'"
s3.put_object(
Bucket = "my-data-lake",
Key = "queries/daily_events.sql",
Body = sql_query.encode("utf-8"),
ContentType = "text/plain"
)
# ── put_object for writing bytes from memory ──
import io, pandas as pd
df = pd.DataFrame({"id": [1,2,3], "value": ["a","b","c"]})
buf = io.BytesIO()
df.to_parquet(buf, index=False)
buf.seek(0)
s3.put_object(
Bucket = "my-data-lake",
Key = "temp/small_df.parquet",
Body = buf.read(),
ContentType = "application/octet-stream"
)
| Method | Source | Multipart | Retry | Best For |
|---|---|---|---|---|
| upload_file() | Local file path | ✅ Auto | ✅ Built-in | Large files (>8MB) |
| upload_fileobj() | File-like object | ✅ Auto | ✅ Built-in | Streaming / in-memory |
| put_object() | Bytes / string | ❌ None | Whole request only (standard client retries) | Small objects <5MB |
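The table lists upload_fileobj for in-memory data, which the examples above do not show. A minimal sketch, assuming an S3 client is passed in; upload_bytes is a hypothetical helper name, and the bucket and key in the usage comment are placeholders.

```python
import io


def upload_bytes(s3_client, data: bytes, bucket: str, key: str,
                 content_type: str = "application/octet-stream") -> int:
    """Upload in-memory bytes through the managed transfer path.

    Unlike put_object, upload_fileobj switches to multipart automatically
    once the payload crosses the configured threshold.
    Returns the number of bytes handed to the transfer.
    """
    buf = io.BytesIO(data)  # any readable binary file-like object works
    s3_client.upload_fileobj(
        Fileobj=buf,
        Bucket=bucket,
        Key=key,
        ExtraArgs={"ContentType": content_type},
    )
    return len(data)


# Usage with a real client (not executed here):
# s3 = boto3.client("s3")
# upload_bytes(s3, parquet_bytes, "my-data-lake", "temp/small_df.parquet")
```

This fits when the data never touches disk, for example a DataFrame serialised into a buffer, but may grow past the size where a single PUT is comfortable.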
Same pattern on the download side. download_file saves directly to disk with multipart download and retry. get_object returns a streaming response body — use it when you want to read content into memory without saving to disk first.
import boto3, json, io
import pandas as pd
s3 = boto3.client("s3", region_name="us-east-1")
# ── download_file — saves directly to disk ──
s3.download_file(
Bucket = "my-data-lake",
Key = "bronze/events/year=2024/month=01/day=01/events.parquet",
Filename = "/tmp/events.parquet"
)
df = pd.read_parquet("/tmp/events.parquet")
print(f"Downloaded and loaded: {len(df):,} rows")
# ── get_object — read into memory (no disk write) ──
response = s3.get_object(
Bucket = "my-data-lake",
Key = "config/pipeline_config.json"
)
# response["Body"] is a StreamingBody — must read() it
config = json.loads(response["Body"].read().decode("utf-8"))
print(f"Config loaded: {config}")
# ── get_object for Parquet directly into pandas ──
response = s3.get_object(
Bucket = "my-data-lake",
Key = "bronze/events/year=2024/month=01/day=01/events.parquet"
)
buf = io.BytesIO(response["Body"].read())
df = pd.read_parquet(buf)
print(f"Loaded parquet from S3 into memory: {df.shape}")
# ── download_fileobj — streaming download into file-like object ──
buf = io.BytesIO()
s3.download_fileobj(
Bucket = "my-data-lake",
Key = "bronze/events/year=2024/month=01/day=01/events.parquet",
Fileobj = buf
)
buf.seek(0)
df = pd.read_parquet(buf)
print(f"Streaming download complete: {df.shape}")
head_object fetches an object's metadata without downloading its content. It is the correct and efficient way to check if a file exists, get its size, last modified time, and custom metadata — all in a single lightweight API call.
import boto3
from botocore.exceptions import ClientError
s3 = boto3.client("s3", region_name="us-east-1")
def get_object_info(bucket: str, key: str) -> dict | None:
    """
    Return object metadata if it exists, None if it doesn't.
    Never downloads the object body.
    """
    try:
        response = s3.head_object(Bucket=bucket, Key=key)
        return {
            "size" : response["ContentLength"],  # bytes
            "last_modified": response["LastModified"],  # datetime
            "content_type" : response["ContentType"],
            # ETag is the MD5 of the body only for single-part, non-KMS uploads
            "etag" : response["ETag"].strip('"'),
            "metadata" : response.get("Metadata", {}),  # custom metadata
            # StorageClass is omitted from the response for STANDARD objects
            "storage_class": response.get("StorageClass", "STANDARD")
        }
    except ClientError as e:
        if e.response["Error"]["Code"] in ("404", "NoSuchKey"):
            return None
        raise
# ── Check if today's file has already landed ──
info = get_object_info(
"my-data-lake",
"bronze/events/year=2024/month=01/day=01/events.parquet"
)
if info:
print(f"File exists — size: {info['size']/1e6:.1f} MB, "
f"modified: {info['last_modified']}")
print(f"Row count from metadata: {info['metadata'].get('row_count', 'unknown')}")
else:
print("File not found — upstream job may not have run yet")
Deleting and copying are common in pipeline cleanup, archiving, and promotion workflows. delete_objects can delete up to 1000 objects in a single API call — always use the batch version when cleaning up many files to avoid thousands of individual API calls.
import boto3
s3 = boto3.client("s3", region_name="us-east-1")
# ── Delete single object ──
s3.delete_object(Bucket="my-data-lake", Key="temp/staging/scratch.csv")
print("Single file deleted")
# ── Batch delete up to 1000 objects per call ──
# First list the objects to delete
paginator = s3.get_paginator("list_objects_v2")
pages = paginator.paginate(Bucket="my-data-lake", Prefix="temp/")
keys_to_delete = [
{"Key": obj["Key"]}
for page in pages
for obj in page.get("Contents", [])
]
# Delete in batches of 1000 (S3 limit per call)
for i in range(0, len(keys_to_delete), 1000):
batch = keys_to_delete[i:i+1000]
response = s3.delete_objects(
Bucket = "my-data-lake",
Delete = {
"Objects": batch,
"Quiet" : True # suppress per-object success responses
}
)
errors = response.get("Errors", [])
if errors:
for err in errors:
print(f" Failed to delete {err['Key']}: {err['Message']}")
else:
print(f" Deleted batch of {len(batch)} objects")
print(f"Total deleted: {len(keys_to_delete)} objects")
# ── Copy object — promote from bronze to silver ──
# Copy does NOT download the file to your machine — it's a server-side copy
s3.copy_object(
CopySource = {
"Bucket": "my-data-lake",
"Key" : "bronze/events/year=2024/month=01/day=01/events.parquet"
},
Bucket = "my-data-lake",
Key = "silver/events/year=2024/month=01/day=01/events.parquet",
MetadataDirective = "COPY" # COPY keeps original metadata; REPLACE overwrites it
)
print("Server-side copy complete — no data downloaded")
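copy_object never moves data through your machine; AWS copies it entirely within S3, which is far faster and cheaper than download + re-upload. A single copy_object call handles objects up to 5 GB; larger objects need a multipart copy (upload_part_copy, or the managed s3.copy() method, which does this automatically). Use it for promotions (bronze → silver), archiving, and cross-bucket copies within the same region.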
A presigned URL grants temporary access to a private S3 object without requiring the recipient to have AWS credentials. The URL is valid for a specified number of seconds. Used for sharing pipeline outputs with external teams, giving applications download links, and enabling secure file uploads from browsers.
import boto3
s3 = boto3.client("s3", region_name="us-east-1")
# ── Presigned GET URL — share a file for download ──
download_url = s3.generate_presigned_url(
ClientMethod = "get_object",
Params = {
"Bucket": "my-data-lake",
"Key" : "gold/reports/daily_summary_2024_01_01.csv"
},
ExpiresIn = 3600 # valid for 1 hour (3600 seconds)
)
print(f"Share this URL (expires in 1 hour):\n{download_url}")
# Anyone with this URL can download the file without AWS credentials
# ── Presigned PUT URL — allow external system to upload directly to S3 ──
upload_url = s3.generate_presigned_url(
ClientMethod = "put_object",
Params = {
"Bucket" : "my-data-lake",
"Key" : "landing/external_feed/partner_data.csv",
"ContentType": "text/csv"
},
ExpiresIn = 900 # valid for 15 minutes
)
print(f"Upload URL for partner:\n{upload_url}")
# Partner can HTTP PUT their file to this URL without any AWS SDK
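On the partner's side, the upload is a plain HTTP PUT. The sketch below uses only the Python standard library and assumes the URL was signed with ContentType "text/csv" as above; in that case the request has to send the same Content-Type header, otherwise S3 rejects the signature. build_presigned_put_request is a hypothetical helper name.

```python
import urllib.request


def build_presigned_put_request(upload_url: str, body: bytes,
                                content_type: str = "text/csv") -> urllib.request.Request:
    """Prepare the HTTP PUT a partner would send to a presigned URL.

    The Content-Type must match the ContentType used when the URL was
    generated, because that header is part of what was signed.
    """
    return urllib.request.Request(
        upload_url,
        data=body,
        method="PUT",
        headers={"Content-Type": content_type},
    )


# Sending it (not executed here; needs a real, unexpired URL):
# req = build_presigned_put_request(upload_url, open("partner_data.csv", "rb").read())
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)  # 200 on success
```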
S3 requires multipart upload for objects larger than 5 GB (the limit for a single PUT), and AWS recommends it once objects reach about 100 MB. You split the file into parts, upload each part separately, then tell S3 to assemble them. If a part fails, you only retry that part. upload_file does this automatically, but understanding the manual API is important for custom streaming scenarios.
import boto3, os
s3 = boto3.client("s3", region_name="us-east-1")
BUCKET = "my-data-lake"
KEY = "bronze/large_dataset/events_full_year.parquet"
LOCAL_FILE = "/data/events_full_year.parquet"
PART_SIZE = 100 * 1024 * 1024 # 100 MB per part (minimum 5MB except last part)
# ── Step 1: Initiate multipart upload ──
mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=KEY)
upload_id = mpu["UploadId"]
print(f"Multipart upload initiated: {upload_id}")
parts = []
part_num = 1
try:
with open(LOCAL_FILE, "rb") as f:
while True:
data = f.read(PART_SIZE)
if not data:
break # end of file
# ── Step 2: Upload each part ──
response = s3.upload_part(
Bucket = BUCKET,
Key = KEY,
UploadId = upload_id,
PartNumber = part_num,
Body = data
)
parts.append({
"PartNumber": part_num,
"ETag" : response["ETag"] # AWS returns ETag per part
})
print(f" Uploaded part {part_num} ({len(data)/1e6:.1f} MB)")
part_num += 1
# ── Step 3: Complete the multipart upload ──
s3.complete_multipart_upload(
Bucket = BUCKET,
Key = KEY,
UploadId = upload_id,
MultipartUpload = {"Parts": parts}
)
print(f"✅ Multipart upload complete — {part_num-1} parts uploaded")
except Exception as e:
# ── CRITICAL: Always abort on failure to avoid storage charges ──
s3.abort_multipart_upload(Bucket=BUCKET, Key=KEY, UploadId=upload_id)
print(f"❌ Upload failed — multipart aborted: {e}")
raise
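Always call abort_multipart_upload in your except block: the parts of an incomplete upload are billed as storage until the upload is completed or aborted. Set a lifecycle rule to auto-abort incomplete multipart uploads after 7 days as a safety net.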
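The safety-net rule uses the AbortIncompleteMultipartUpload action of a lifecycle configuration. A minimal sketch of building it follows; the helper names and the rule ID are hypothetical. Because put_bucket_lifecycle_configuration replaces the bucket's entire rule set, the sketch also shows merging the new rule into whatever rules already exist rather than overwriting them.

```python
def abort_incomplete_mpu_rule(days: int = 7, prefix: str = "") -> dict:
    """Lifecycle rule that aborts multipart uploads left unfinished for `days`.

    An empty prefix applies the rule to the whole bucket.
    """
    return {
        "ID": f"abort-incomplete-mpu-{days}d",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": days},
    }


def merge_rule(existing_rules: list, new_rule: dict) -> list:
    """Return existing_rules with new_rule added, replacing any rule with the same ID."""
    kept = [r for r in existing_rules if r.get("ID") != new_rule["ID"]]
    return kept + [new_rule]


# Applying it (not executed here). get_bucket_lifecycle_configuration raises
# a ClientError if the bucket has no lifecycle configuration yet, so handle
# that case by starting from an empty list.
# current = s3.get_bucket_lifecycle_configuration(Bucket="my-data-lake")["Rules"]
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake",
#     LifecycleConfiguration={"Rules": merge_rule(current, abort_incomplete_mpu_rule())},
# )
```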
The S3 Transfer Manager (used by upload_file and download_file) can be tuned via TransferConfig to maximise throughput. Key parameters: multipart_threshold (file size above which multipart kicks in), max_concurrency (parallel part uploads), and multipart_chunksize (size of each part).
import os
import threading
import boto3
from boto3.s3.transfer import TransferConfig
from concurrent.futures import ThreadPoolExecutor
s3 = boto3.client("s3", region_name="us-east-1")
# ── Tune the Transfer Manager ──
transfer_config = TransferConfig(
    multipart_threshold = 50 * 1024 * 1024,  # use multipart for files > 50 MB
    multipart_chunksize = 50 * 1024 * 1024,  # each part = 50 MB
    max_concurrency = 20,                    # 20 parallel threads per transfer
    use_threads = True
)
# ── Upload with tuned config ──
def upload_with_progress(local_path: str, bucket: str, key: str):
    file_size = os.path.getsize(local_path)
    uploaded_bytes = 0
    lock = threading.Lock()  # the callback is invoked from several transfer threads
    def progress(bytes_transferred):
        nonlocal uploaded_bytes
        with lock:
            uploaded_bytes += bytes_transferred
            pct = uploaded_bytes / file_size * 100
            print(f"\r Progress: {pct:.1f}% ({uploaded_bytes/1e6:.1f}/{file_size/1e6:.1f} MB)", end="")
    s3.upload_file(
        Filename = local_path,
        Bucket = bucket,
        Key = key,
        Config = transfer_config,
        Callback = progress
    )
    print(f"\n✅ Uploaded: {key}")
upload_with_progress(
    "/data/events_2024.parquet",
    "my-data-lake",
    "bronze/events/events_2024.parquet"
)
# ── Parallel upload of multiple files using ThreadPoolExecutor ──
files_to_upload = [
    ("/data/jan.parquet", "bronze/events/year=2024/month=01/data.parquet"),
    ("/data/feb.parquet", "bronze/events/year=2024/month=02/data.parquet"),
    ("/data/mar.parquet", "bronze/events/year=2024/month=03/data.parquet"),
]
def upload_one(args):
    local_path, s3_key = args
    s3.upload_file(local_path, "my-data-lake", s3_key, Config=transfer_config)
    print(f" ✅ Uploaded: {s3_key}")
with ThreadPoolExecutor(max_workers=4) as pool:
    # Consume the iterator so an exception in any upload is raised here, not silently dropped
    list(pool.map(upload_one, files_to_upload))
print("All files uploaded in parallel")
S3 Select lets you run a SQL expression against a single S3 object (CSV, JSON, Parquet) and retrieve only the matching rows — without downloading the entire file. For large files where you only need a subset, this can reduce data transfer by 80–90%.
import boto3
s3 = boto3.client("s3", region_name="us-east-1")
# ── S3 Select on a CSV file — get only rows where status = 'ERROR' ──
response = s3.select_object_content(
Bucket = "my-data-lake",
Key = "bronze/pipeline_logs/logs_2024_01_01.csv",
ExpressionType = "SQL",
Expression = "SELECT s.run_id, s.status, s.error_msg FROM S3Object s WHERE s.status = 'ERROR'",
InputSerialization = {
"CSV": {"FileHeaderInfo": "USE"}, # USE = first row is header
"CompressionType": "NONE"
},
OutputSerialization = {
"CSV": {}
}
)
# ── Read the streaming result ──
# Records events arrive as byte chunks that may end mid-row, so join them before splitting into lines
chunks = []
for event in response["Payload"]:
    if "Records" in event:
        chunks.append(event["Records"]["Payload"])
    elif "Stats" in event:
        stats = event["Stats"]["Details"]
        print(f"Bytes scanned: {stats['BytesScanned']:,} | "
              f"Bytes returned: {stats['BytesReturned']:,}")
result_rows = b"".join(chunks).decode("utf-8").splitlines()
print(f"Error rows found: {len(result_rows)}")
for row in result_rows[:5]:
    print(f" {row}")
S3 event notifications fire when objects are created, deleted, or restored. You configure them to send events to SQS, SNS, or Lambda. The most common data engineering pattern: new file lands in S3 → SQS receives the event → pipeline polls SQS and starts processing.
import boto3
s3 = boto3.client("s3", region_name="us-east-1")
sqs = boto3.client("sqs", region_name="us-east-1")
BUCKET = "my-data-lake"
SQS_ARN = "arn:aws:sqs:us-east-1:123456789012:s3-file-arrival-queue"
LAMBDA_ARN= "arn:aws:lambda:us-east-1:123456789012:function:trigger-pipeline"
# ── Configure S3 to send events to SQS on object creation ──
s3.put_bucket_notification_configuration(
Bucket = BUCKET,
NotificationConfiguration = {
"QueueConfigurations": [
{
"Id" : "new-bronze-file-notification",
"QueueArn": SQS_ARN,
"Events": ["s3:ObjectCreated:*"], # all create events (PUT, POST, COPY)
"Filter": {
"Key": {
"FilterRules": [
{"Name": "prefix", "Value": "landing/"}, # only landing/ prefix
{"Name": "suffix", "Value": ".parquet"} # only .parquet files
]
}
}
}
],
# ── Also trigger a Lambda directly for CSV files ──
"LambdaFunctionConfigurations": [
{
"Id" : "csv-arrival-lambda",
"LambdaFunctionArn" : LAMBDA_ARN,
"Events" : ["s3:ObjectCreated:Put"],
"Filter" : {
"Key": {
"FilterRules": [
{"Name": "prefix", "Value": "landing/csv/"},
{"Name": "suffix", "Value": ".csv"}
]
}
}
}
]
}
)
print("✅ S3 event notifications configured")
print(" New .parquet in landing/ → SQS queue")
print(" New .csv in landing/csv/ → Lambda function")
The consumer loop: receive_message() → extract bucket/key from the message → start the Glue job or Spark step → delete the SQS message on success. This is the foundation of file-arrival-triggered batch pipelines on AWS.
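The "extract bucket/key" step is where consumers most often go wrong, because object keys in S3 event notifications are URL-encoded (a space arrives as +, and = as %3D). A minimal sketch of that step, assuming S3 publishes directly to SQS with no SNS topic in between; extract_s3_objects is a hypothetical helper name.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(message_body: str) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification message body.

    Keys are URL-decoded. Messages without "Records" (such as the
    s3:TestEvent sent when notifications are first configured) yield [].
    """
    event = json.loads(message_body)
    objects = []
    for record in event.get("Records", []):
        s3_info = record.get("s3")
        if not s3_info:
            continue  # not an S3 record
        bucket = s3_info["bucket"]["name"]
        key = unquote_plus(s3_info["object"]["key"])
        objects.append((bucket, key))
    return objects


# Usage inside the consumer loop (not executed here):
# resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20)
# for msg in resp.get("Messages", []):
#     for bucket, key in extract_s3_objects(msg["Body"]):
#         ...  # start the Glue job or EMR step for this file
#     sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```

Deleting the message only after processing succeeds means a failed run leaves it on the queue, where it becomes visible again after the visibility timeout.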
Glue APIs
AWS Glue is your central ETL orchestration and metadata layer. Its boto3 API covers four domains: Glue Data Catalog (databases, tables, partitions), Crawlers (schema discovery), ETL Jobs (start/poll/manage), and Data Quality. Every production pipeline touches at least two of these.
The Glue Data Catalog is a central metadata store — a fully managed Hive Metastore replacement. It stores database and table definitions (schema, location, partition info) that are shared across Glue jobs, EMR Spark, Athena, Redshift Spectrum, and Lake Formation. Think of it as the single source of truth for "what tables exist and where their data lives on S3."
create_database() creates a logical namespace in the Glue Catalog. You provide a name and, optionally, a description and a default S3 location. All tables for a project or domain typically live under one database.
import boto3
from botocore.exceptions import ClientError
glue = boto3.client("glue", region_name="us-east-1")
try:
    glue.create_database(
        DatabaseInput={
            "Name": "orders_db",
            "Description": "Orders domain: Bronze and Silver tables",
            "LocationUri": "s3://my-datalake/orders/"  # optional default S3 location
        }
    )
    print("Database created: orders_db")
except ClientError as e:
    if e.response["Error"]["Code"] == "AlreadyExistsException":
        print("Database already exists, skipping")
    else:
        raise
Call create_database() once during environment provisioning (via Terraform or an init script). Day-to-day pipeline code should not create databases; it only creates tables inside an existing database.
get_database() fetches a single database by name. get_databases() lists all databases — use a paginator since there can be many.
# ── Get a single database ─────────────────────────────────────────
response = glue.get_database(Name="orders_db")
db = response["Database"]
print(db["Name"], db.get("Description"), db.get("LocationUri"))
# ── List ALL databases with paginator ─────────────────────────────
paginator = glue.get_paginator("get_databases")
for page in paginator.paginate():
    for db in page["DatabaseList"]:
        print(db["Name"])
delete_database() deletes the database definition from the catalog. It does NOT delete the underlying S3 data, only the metadata. Be careful in production: deleting a database also removes every table definition under it.
try:
    glue.delete_database(Name="old_staging_db")
    print("Deleted")
except ClientError as e:
    if e.response["Error"]["Code"] == "EntityNotFoundException":
        print("Database not found, nothing to delete")
    else:
        raise
You call create_table() to tell the Glue Catalog that a table exists — pointing it to an S3 path with schema information. This is what Glue Crawlers do under the hood. You can also do it manually for full schema control.
glue.create_table(
    DatabaseName="orders_db",
    TableInput={
        "Name": "orders_bronze",
        "Description": "Raw orders landed from DMS CDC",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "string"},
                {"Name": "customer_id", "Type": "string"},
                {"Name": "amount", "Type": "double"},
                {"Name": "status", "Type": "string"},
                {"Name": "created_at", "Type": "timestamp"},
            ],
            "Location": "s3://my-datalake/bronze/orders/",
            # Parquet input/output formats, matching the Parquet SerDe below
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
                "Parameters": {"serialization.format": "1"}
            },
            "Compressed": False,
            "StoredAsSubDirectories": False
        },
        "PartitionKeys": [
            {"Name": "year", "Type": "string"},
            {"Name": "month", "Type": "string"},
            {"Name": "day", "Type": "string"},
        ],
        "TableType": "EXTERNAL_TABLE",  # data lives on S3, not managed by Glue
        "Parameters": {
            "classification": "parquet",
            "EXTERNAL": "TRUE"
        }
    }
)
create_table() is like registering a filing cabinet in the office directory. You're telling everyone "there's a cabinet called orders_bronze on the 3rd floor shelf (S3 path), and here's what kinds of documents are inside (schema)." The documents themselves aren't moved — only the registration is created.
get_table() retrieves the full table metadata: schema, S3 location, partition keys, and SerDe info. It is useful for validation before running a pipeline: check that the table exists and that the schema matches what you expect.
def table_exists(glue_client, database: str, table: str) -> bool:
    try:
        glue_client.get_table(DatabaseName=database, Name=table)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "EntityNotFoundException":
            return False
        raise
if table_exists(glue, "orders_db", "orders_bronze"):
    resp = glue.get_table(DatabaseName="orders_db", Name="orders_bronze")
    tbl = resp["Table"]
    cols = tbl["StorageDescriptor"]["Columns"]
    loc = tbl["StorageDescriptor"]["Location"]
    print(f"Location: {loc}")
    print(f"Columns: {[c['Name'] for c in cols]}")
When your source schema evolves (a new column arrives), you update the Glue table definition so Athena and Glue jobs see the new column. You must pass the full updated TableInput — not just the delta.
# 1. Fetch existing definition
existing = glue.get_table(DatabaseName="orders_db", Name="orders_bronze")["Table"]
sd = existing["StorageDescriptor"]
# 2. Add the new column to the existing list
sd["Columns"].append({"Name": "discount_pct", "Type": "double"})
# 3. Push the full updated definition back
glue.update_table(
    DatabaseName="orders_db",
    TableInput={
        "Name": existing["Name"],
        "Description": existing.get("Description", ""),  # carry over, or it is dropped
        "StorageDescriptor": sd,
        "PartitionKeys": existing.get("PartitionKeys", []),
        "TableType": existing.get("TableType", "EXTERNAL_TABLE"),
        "Parameters": existing.get("Parameters", {}),
    }
)
print("Schema updated — discount_pct column added")
Calling update_table() with an incomplete TableInput wipes out the fields you did not include (such as PartitionKeys or Description). Never build the TableInput from scratch when updating; start from the definition returned by get_table().
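One way to avoid dropping fields is to derive the TableInput from the get_table() response by keeping only the keys that TableInput accepts. The sketch below is illustrative: to_table_input is a hypothetical helper, and the key list reflects the TableInput shape as commonly documented, so it is worth checking against the boto3 version you run.

```python
# Keys accepted by TableInput. get_table() also returns read-only fields
# (DatabaseName, CreateTime, UpdateTime, CreatedBy, ...) that update_table() rejects.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}

def to_table_input(table: dict) -> dict:
    """Copy a get_table()["Table"] dict, keeping only writable TableInput keys."""
    return {k: v for k, v in table.items() if k in TABLE_INPUT_KEYS}

# Intended usage (needs a real Glue client, so shown as comments):
#   existing = glue.get_table(DatabaseName="orders_db", Name="orders_bronze")["Table"]
#   table_input = to_table_input(existing)
#   table_input["StorageDescriptor"]["Columns"].append({"Name": "discount_pct", "Type": "double"})
#   glue.update_table(DatabaseName="orders_db", TableInput=table_input)
```

Filtering by an allow-list keeps every writable field that was already set, so nothing is lost when only the column list changes.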
get_tables() returns all table definitions in a database. A single database can have hundreds of tables, so always use the paginator.
paginator = glue.get_paginator("get_tables")
all_tables = []
for page in paginator.paginate(DatabaseName="orders_db"):
    all_tables.extend(page["TableList"])
print(f"Found {len(all_tables)} tables")
for t in all_tables:
    # Views and some table types have no storage location, so read it defensively
    print(t["Name"], t.get("StorageDescriptor", {}).get("Location"))
delete_table() removes a single table from the catalog. batch_delete_table() removes up to 100 tables in one API call, which is useful for cleanup scripts.
# ── Single delete ─────────────────────────────────────────────────
glue.delete_table(DatabaseName="orders_db", Name="orders_temp")
# ── Batch delete (up to 100 at once) ──────────────────────────────
tables_to_drop = ["stg_orders_20230101", "stg_orders_20230102", "stg_orders_20230103"]
response = glue.batch_delete_table(
    DatabaseName="orders_db",
    TablesToDelete=tables_to_drop
)
# Per-table failures are returned in "Errors" rather than raised
errors = response.get("Errors", [])
if errors:
    for err in errors:
        print(f"Failed to delete {err['TableName']}: {err['ErrorDetail']['ErrorMessage']}")
When your Spark job writes Parquet to S3 in partitioned directories (year=2024/month=01/day=15/), Athena and Glue don't automatically know about new partitions. You must register new partitions in the Glue Catalog so query engines can find them. Without registration, SELECT * FROM orders_bronze WHERE year='2024' in Athena returns zero rows even though the data is on S3.
You can run MSCK REPAIR TABLE orders_bronze in Athena or Glue SQL to auto-discover all partitions, but this scans the entire S3 path and is slow for large tables. In production pipelines, the programmatic batch_create_partition() approach is usually preferred.
batch_create_partition() is the most important partition API. It registers up to 100 partitions in a single call. Call it at the end of every Spark write job to register the partitions just written. Per-partition failures, such as a partition that already exists, are reported in the response's Errors list rather than raised as exceptions.
from datetime import date, timedelta
def register_daily_partitions(glue_client, database, table, s3_base, year, month, day):
    """Register a single date partition after Spark write."""
    partition_values = [str(year), str(month).zfill(2), str(day).zfill(2)]
    s3_location = f"{s3_base}/year={year}/month={str(month).zfill(2)}/day={str(day).zfill(2)}/"
    # Fetch parent table's StorageDescriptor to clone it for the partition
    parent_sd = glue_client.get_table(
        DatabaseName=database, Name=table
    )["Table"]["StorageDescriptor"]
    # Override location for this specific partition
    partition_sd = {**parent_sd, "Location": s3_location}
    resp = glue_client.batch_create_partition(
        DatabaseName=database,
        TableName=table,
        PartitionInputList=[{
            "Values": partition_values,
            "StorageDescriptor": partition_sd,
            "Parameters": {}
        }]
    )
    # Per-partition failures come back in "Errors" instead of being raised
    for err in resp.get("Errors", []):
        detail = err.get("ErrorDetail", {})
        if detail.get("ErrorCode") == "AlreadyExistsException":
            print(f"Partition {partition_values} already exists, skipping")
            return
        raise RuntimeError(f"Partition {partition_values} failed: {detail.get('ErrorMessage')}")
    print(f"Registered partition: {partition_values}")
# ── Usage: register today's partition after pipeline completes ────
today = date.today()
register_daily_partitions(
glue_client=glue,
database="orders_db",
table="orders_bronze",
s3_base="s3://my-datalake/bronze/orders",
year=today.year, month=today.month, day=today.day
)
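For backfills, the same idea extends to many dates at once. The sketch below builds the PartitionInputList for a date range and splits it into batches of at most 100, the per-call limit mentioned above. The helper names build_partition_inputs and chunked are hypothetical, and the year=/month=/day= layout is assumed to match the table defined earlier.

```python
from datetime import date, timedelta

def build_partition_inputs(parent_sd: dict, s3_base: str, start: date, end: date) -> list:
    """One PartitionInput per day in [start, end], cloned from the table's descriptor."""
    inputs = []
    current = start
    while current <= end:
        y, m, d = f"{current.year}", f"{current.month:02d}", f"{current.day:02d}"
        inputs.append({
            "Values": [y, m, d],
            "StorageDescriptor": {
                **parent_sd,
                "Location": f"{s3_base}/year={y}/month={m}/day={d}/",
            },
            "Parameters": {},
        })
        current += timedelta(days=1)
    return inputs

def chunked(items: list, size: int = 100):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Intended usage (needs a real Glue client, so shown as comments):
#   for batch in chunked(build_partition_inputs(parent_sd, s3_base, start, end)):
#       resp = glue.batch_create_partition(
#           DatabaseName="orders_db", TableName="orders_bronze", PartitionInputList=batch)
#       # inspect resp.get("Errors", []) for per-partition failures
```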
get_partitions() lists all registered partitions for a table. It returns partition values and their S3 locations. You can filter with an Expression (a Hive-style filter such as "year='2024' AND month='01'").
paginator = glue.get_paginator("get_partitions")
all_partitions = []
for page in paginator.paginate(
    DatabaseName="orders_db",
    TableName="orders_bronze",
    Expression="year='2024' AND month='03'"  # optional filter
):
    all_partitions.extend(page["Partitions"])
print(f"Found {len(all_partitions)} partitions for 2024-03")
for p in all_partitions[:3]:
    print(p["Values"], p["StorageDescriptor"]["Location"])
batch_delete_partition() is used for cleanup: it removes old partition registrations (e.g., after a retention policy deletes the S3 data). It accepts up to 25 partitions per call and removes the catalog entries only; it does not delete S3 data.
# Remove two specific date partitions
glue.batch_delete_partition(
DatabaseName="orders_db",
TableName="orders_bronze",
PartitionsToDelete=[
{"Values": ["2022", "01", "01"]},
{"Values": ["2022", "01", "02"]},
]
)
print("Old partitions removed from catalog")
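In a retention job you would normally compute the list of partitions to drop rather than hard-code it. A minimal sketch, assuming partitions carry [year, month, day] string values as in this table; expired_partition_values is a hypothetical helper that takes the Partitions list returned by get_partitions().

```python
from datetime import date

def expired_partition_values(partitions: list, cutoff: date) -> list:
    """Return PartitionsToDelete entries for partitions dated before `cutoff`."""
    expired = []
    for p in partitions:
        year, month, day = (int(v) for v in p["Values"])
        if date(year, month, day) < cutoff:
            expired.append({"Values": p["Values"]})
    return expired

# Intended usage (needs a real Glue client, so shown as comments):
#   to_delete = expired_partition_values(all_partitions, cutoff=date(2023, 1, 1))
#   for i in range(0, len(to_delete), 25):   # batch_delete_partition takes 25 per call
#       glue.batch_delete_partition(
#           DatabaseName="orders_db", TableName="orders_bronze",
#           PartitionsToDelete=to_delete[i:i + 25])
```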
A Glue Crawler is an automated schema discovery tool. You point it at an S3 path (or JDBC database), and it scans the files, infers the schema, detects partitions, and writes or updates table definitions in the Glue Catalog. You use crawlers for initial table registration and to detect schema changes automatically.
create_crawler() creates a crawler that points to an S3 path. You specify which database to write tables into, what IAM role to use, and what S3 paths to crawl.
glue.create_crawler(
    Name="orders-bronze-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="orders_db",
    Description="Crawl S3 bronze orders and update Glue Catalog",
    Targets={
        "S3Targets": [
            {"Path": "s3://my-datalake/bronze/orders/"},
        ]
    },
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",  # add new columns automatically
        "DeleteBehavior": "LOG"                  # log removals, don't auto-delete
    },
    # Full recrawl, so schema updates can be applied. The incremental mode
    # CRAWL_NEW_FOLDERS_ONLY is only accepted when both behaviors above are "LOG".
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVERYTHING"}
)
# ── Start crawler ─────────────────────────────────────────────────
try:
    glue.start_crawler(Name="orders-bronze-crawler")
    print("Crawler started")
except ClientError as e:
    if e.response["Error"]["Code"] == "CrawlerRunningException":
        print("Crawler already running, skipping")
    else:
        raise
# ── Stop crawler ──────────────────────────────────────────────────
glue.stop_crawler(Name="orders-bronze-crawler")
Glue does NOT provide a built-in waiter for crawlers. You must implement polling manually: call get_crawler() in a loop, check the State field, and sleep between checks. This is the standard production pattern.
import time
def wait_for_crawler(glue_client, crawler_name: str, poll_sec: int = 15, timeout_sec: int = 900):
    """Block until the crawler returns to READY, then check the last crawl's outcome."""
    elapsed = 0
    while elapsed < timeout_sec:
        crawler = glue_client.get_crawler(Name=crawler_name)["Crawler"]
        state = crawler["State"]  # READY | RUNNING | STOPPING
        print(f"[{elapsed}s] Crawler state: {state}")
        if state == "READY":
            # STOPPING is a normal step on the way back to READY; the outcome
            # of the run itself is reported in LastCrawl.Status
            last_status = crawler.get("LastCrawl", {}).get("Status")
            if last_status in ("FAILED", "CANCELLED"):
                raise RuntimeError(f"Crawler run ended with status: {last_status}")
            print("✅ Crawler finished")
            return
        time.sleep(poll_sec)
        elapsed += poll_sec
    raise TimeoutError(f"Crawler did not finish within {timeout_sec}s")
# ── Full pattern: start → wait → confirm ──────────────────────────
glue.start_crawler(Name="orders-bronze-crawler")
wait_for_crawler(glue, "orders-bronze-crawler")
print("Catalog updated — Athena can now query new partitions")
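Always bound the polling loop with a timeout instead of running an unbounded while True loop (for example, inside an Airflow PythonOperator). Set timeout_sec to something reasonable for your data volume; crawling 1 TB of Parquet can take 10–20 minutes.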
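The bounded polling loop above can be factored into a small reusable helper. This is a sketch under stated assumptions: poll_until is a hypothetical name, check is any zero-argument callable that fetches the current status, and the sleep function is injectable so the loop can be exercised without waiting.

```python
import time

def poll_until(check, is_done, poll_sec: int = 15, timeout_sec: int = 900, sleep=time.sleep):
    """Call check() until is_done(result) is true; raise TimeoutError otherwise."""
    elapsed = 0
    while elapsed < timeout_sec:
        result = check()
        if is_done(result):
            return result
        sleep(poll_sec)
        elapsed += poll_sec
    raise TimeoutError(f"Condition not met within {timeout_sec}s")

# Intended usage with a crawler (needs a real Glue client, so shown as comments):
#   crawler = poll_until(
#       check=lambda: glue.get_crawler(Name="orders-bronze-crawler")["Crawler"],
#       is_done=lambda c: c["State"] == "READY",
#   )
```

Keeping the timeout inside the helper means every caller gets a bounded loop by default.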
A Glue ETL Job is a managed execution environment for your transformation code. The two most important types: Spark ETL jobs (run your PySpark script on a managed cluster — most common) and Python Shell jobs (run plain Python on a single worker — for lightweight transforms, metadata updates, or API calls).
| Job Type | Worker | Use Case | Billing Unit |
|---|---|---|---|
| Glue ETL (Spark) | G.1X, G.2X, G.4X, G.8X | PySpark transforms, large-scale data processing | DPU-hours |
| Python Shell | 0.0625 DPU | Lightweight Python: API calls, metadata updates, orchestration helpers | DPU-hours (cheap) |
| Glue Streaming | G.1X | Continuous Spark Structured Streaming jobs | DPU-hours (continuous) |
create_job() registers a Glue job definition. The actual script lives on S3. You can pass default arguments that the script reads at runtime (like the target date, S3 path, or environment name).
glue.create_job(
Name="orders-bronze-to-silver",
Description="Transform raw orders (Bronze) into clean orders (Silver)",
Role="arn:aws:iam::123456789012:role/GlueJobRole",
Command={
"Name": "glueetl", # "glueetl" for Spark, "pythonshell" for Python
"ScriptLocation": "s3://my-scripts/glue/bronze_to_silver.py",
"PythonVersion": "3"
},
DefaultArguments={
"--job-language": "python",
"--TempDir": "s3://my-datalake/glue-temp/",
"--enable-metrics": "true",
"--enable-continuous-cloudwatch-log":"true",
"--enable-spark-ui": "true",
"--spark-event-logs-path": "s3://my-datalake/spark-ui-logs/",
"--SOURCE_DB": "orders_db",
"--SOURCE_TABLE":"orders_bronze",
"--TARGET_PATH": "s3://my-datalake/silver/orders/",
},
GlueVersion="4.0", # Spark 3.3 + Python 3.10
WorkerType="G.1X", # 4 vCPU, 16 GB per worker
NumberOfWorkers=5, # 1 driver + 4 executors
Timeout=60, # minutes — job killed if exceeded
MaxRetries=1,
Tags={"Project": "orders-platform", "Env": "prod"}
)
start_job_run() launches a new run of an existing job. You can override default arguments at runtime. This is how metadata-driven pipelines work: one job definition, different arguments per pipeline or date.
from datetime import date
run_date = date.today().isoformat() # "2024-03-15"
response = glue.start_job_run(
JobName="orders-bronze-to-silver",
Arguments={
"--RUN_DATE": run_date, # override for this specific run
"--ENV": "prod",
"--DRY_RUN": "false",
}
)
job_run_id = response["JobRunId"]
print(f"Started job run: {job_run_id}")
# JobRunId looks like: jr_abc123xyz456
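In a metadata-driven pipeline the run arguments usually come from a config record rather than literals. Glue job arguments are a string-to-string map, and custom keys conventionally carry the "--" prefix shown above, so a small normalizer helps. The function to_glue_arguments below is a hypothetical helper, not part of boto3.

```python
def to_glue_arguments(params: dict) -> dict:
    """Normalize a config dict into Glue-style Arguments: '--KEY' -> 'string value'."""
    args = {}
    for name, value in params.items():
        key = name if name.startswith("--") else f"--{name}"
        if isinstance(value, bool):
            # Match the lowercase "true"/"false" convention used in the examples above
            value = "true" if value else "false"
        args[key] = str(value)
    return args

# Intended usage (needs a real Glue client, so shown as comments):
#   glue.start_job_run(
#       JobName="orders-bronze-to-silver",
#       Arguments=to_glue_arguments({"RUN_DATE": run_date, "ENV": "prod", "DRY_RUN": False}),
#   )
```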
Glue has no built-in waiter for jobs. You must poll get_job_run() until the JobRunState reaches a terminal state. Terminal states are: SUCCEEDED, FAILED, ERROR, TIMEOUT, STOPPED.
import time
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "ERROR", "TIMEOUT", "STOPPED"}
def wait_for_glue_job(glue_client, job_name: str, run_id: str,
                      poll_sec: int = 20, timeout_sec: int = 3600) -> dict:
    """Poll until Glue job reaches terminal state. Raises on failure."""
    elapsed = 0
    while elapsed < timeout_sec:
        resp = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        run = resp["JobRun"]
        state = run["JobRunState"]
        print(f"[{elapsed}s] {job_name}: {state}")
        if state in TERMINAL_STATES:
            if state != "SUCCEEDED":
                error_msg = run.get("ErrorMessage", "No error message")
                raise RuntimeError(f"Glue job {job_name} {state}: {error_msg}")
            print(f"✅ Job succeeded in {run.get('ExecutionTime', '?')}s")
            return run
        time.sleep(poll_sec)
        elapsed += poll_sec
    raise TimeoutError(f"Job {run_id} did not complete within {timeout_sec}s")
# ── Full pattern: start → wait → register partitions ──────────────
run_id = glue.start_job_run(
JobName="orders-bronze-to-silver",
Arguments={"--RUN_DATE": "2024-03-15"}
)["JobRunId"]
wait_for_glue_job(glue, "orders-bronze-to-silver", run_id)
print("Pipeline step complete — moving to next stage")
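After wait_for_glue_job() returns, extract run["ExecutionTime"] (seconds), run["DPUSeconds"] (cost metric), and run["CompletedOn"] and write them to your DynamoDB audit table. DPUSeconds is only populated for some runs (for example, those with Auto Scaling enabled), so read it with run.get(). This gives you a history of every run with duration and, where available, cost per pipeline.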
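A sketch of turning the returned run into an audit item follows. The function build_audit_record and the attribute names on the left are hypothetical; the keys read from run are fields of the Glue JobRun structure. The Decimal conversion is there because the boto3 DynamoDB resource API does not accept Python floats.

```python
from datetime import datetime, timezone
from decimal import Decimal

def build_audit_record(run: dict) -> dict:
    """Flatten a Glue JobRun dict into an item suitable for an audit table."""
    completed = run.get("CompletedOn")    # datetime; absent while still running
    dpu_seconds = run.get("DPUSeconds")   # float; not populated for every run
    return {
        "job_name": run["JobName"],
        "run_id": run["Id"],
        "state": run["JobRunState"],
        "execution_sec": run.get("ExecutionTime"),
        "dpu_seconds": None if dpu_seconds is None else Decimal(str(dpu_seconds)),
        "completed_on": completed.isoformat() if completed else None,
    }

# Illustrative input shaped like a JobRun (values are made up)
sample_run = {
    "JobName": "orders-bronze-to-silver",
    "Id": "jr_abc123xyz456",
    "JobRunState": "SUCCEEDED",
    "ExecutionTime": 125,
    "CompletedOn": datetime(2024, 3, 15, 2, 30, tzinfo=timezone.utc),
}
print(build_audit_record(sample_run))
```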
get_job_runs() returns the historical runs for a job. It is useful for dashboards, SLA reporting, and debugging repeated failures.
paginator = glue.get_paginator("get_job_runs")
runs = []
# MaxItems caps the total returned across pages, so no manual break is needed
for page in paginator.paginate(
    JobName="orders-bronze-to-silver",
    PaginationConfig={"MaxItems": 10}
):
    runs.extend(page["JobRuns"])
for r in runs:
    print(r["Id"], r["JobRunState"], r.get("ExecutionTime", "?"), "sec")
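Use batch_stop_job_run() to cancel runs that are taking too long, consuming too many DPUs, or were triggered by mistake. It accepts up to 25 run IDs in one call. The Glue API has no single-run stop call, so pass a one-element list to stop one run.
# Stop a single run
glue.batch_stop_job_run(JobName="orders-bronze-to-silver", JobRunIds=[job_run_id])
# Batch stop (useful in a cleanup handler)
resp = glue.batch_stop_job_run(
    JobName="orders-bronze-to-silver",
    JobRunIds=["jr_run1", "jr_run2"]
)
# Runs that could not be stopped are listed in "Errors"
for err in resp.get("Errors", []):
    print(f"Could not stop {err['JobRunId']}: {err['ErrorDetail']['ErrorMessage']}")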
Glue Data Quality is a built-in rule engine that lets you define quality checks (completeness, uniqueness, freshness, custom SQL expressions) and run them against tables in the Glue Catalog. Results can be used to halt the pipeline on quality violations — preventing bad data from propagating to Silver or Gold layers.
For example: order_id must be 100% complete; order_id must be 100% unique.
A ruleset is a named set of DQ rules written in DQDL (Data Quality Definition Language). Rules are declarative: you describe what "good" looks like, not how to check it.
glue.create_data_quality_ruleset(
Name="orders-bronze-dq-rules",
Description="DQ checks for Bronze orders table",
Ruleset="""
Rules = [
Completeness "order_id" >= 1.0,
Uniqueness "order_id" >= 0.999,
Completeness "customer_id" >= 0.95,
RowCount >= 1000,
ColumnValues "amount" between 0 and 100000,
IsComplete "status"
]
""",
TargetTable={
"TableName": "orders_bronze",
"DatabaseName": "orders_db"
}
)
print("DQ ruleset created: orders-bronze-dq-rules")
start_data_quality_ruleset_evaluation_run() triggers an evaluation of the ruleset against the target table. It returns a RunId that you poll to get results. The evaluation runs as a Glue job under the hood.
import time, json, boto3
from botocore.exceptions import ClientError
glue = boto3.client("glue", region_name="us-east-1")
# ── 1. Start the evaluation ──────────────────────────────────────
resp = glue.start_data_quality_ruleset_evaluation_run(
DataSource={
"GlueTable": {
"DatabaseName": "orders_db",
"TableName": "orders_bronze"
}
},
Role="arn:aws:iam::123456789012:role/GlueJobRole",
RulesetNames=["orders-bronze-dq-rules"],
AdditionalRunOptions={"CloudWatchMetricsEnabled": True}
)
run_id = resp["RunId"]
print(f"DQ evaluation started: {run_id}")
# ── 2. Poll until complete ────────────────────────────────────────
for i in range(60):  # up to 30 minutes
    time.sleep(30)
    run_detail = glue.get_data_quality_ruleset_evaluation_run(RunId=run_id)
    status = run_detail["Status"]
    print(f"DQ run status: {status}")
    if status == "SUCCEEDED":
        break
    if status in ("FAILED", "ERROR", "TIMEOUT", "STOPPED"):
        raise RuntimeError(f"DQ evaluation failed: {status}")
else:
    # Loop ended without a break: the run never reached SUCCEEDED
    raise TimeoutError(f"DQ evaluation {run_id} did not finish within 30 minutes")
# ── 3. Read results and gate the pipeline ─────────────────────────
results = run_detail.get("ResultIds", [])
if not results:
    raise RuntimeError("No DQ results returned")
result_detail = glue.get_data_quality_result(ResultId=results[0])
rule_results = result_detail["RuleResults"]
# Treat both FAIL and ERROR outcomes as gate failures
failed_rules = [r for r in rule_results if r["Result"] != "PASS"]
if failed_rules:
    for r in failed_rules:
        print(f"❌ {r['Result']}: {r['Name']}: {r.get('EvaluationMessage', '')}")
    raise RuntimeError(f"{len(failed_rules)} DQ rules failed, pipeline halted")
print(f"✅ All {len(rule_results)} DQ rules passed, proceeding to Silver")
When a DQ gate fails, publish a metric (for example, dq_failure_count), write the failure to your DynamoDB audit table, trigger an SNS alert, and stop the pipeline. This prevents bad data from contaminating Silver and Gold layers.
Athena APIs
Amazon Athena is a serverless SQL engine that queries data directly on S3 using Trino under the hood. Its boto3 API is asynchronous — you submit a query, get back a QueryExecutionId, poll for completion, then fetch results. Mastering this pattern is essential for every automation script that reads from your data lake.
Athena does not return results immediately like a regular database call. Instead it works like a job system: you submit a query → Athena returns a QueryExecutionId → you poll until the state is SUCCEEDED → then you fetch results from S3 (or via the API). The results are always written to an S3 output location first.
It works like parcel tracking: you get a tracking number (QueryExecutionId) immediately. You check the tracking site periodically (poll) until it says "Delivered" (SUCCEEDED). Only then do you go to the destination to pick up what was delivered (fetch results).
The four most important parameters: QueryString (your SQL), QueryExecutionContext (which database to run against), ResultConfiguration (where to write results on S3), and WorkGroup (which cost/permission boundary to use).
import boto3
athena = boto3.client("athena", region_name="us-east-1")
response = athena.start_query_execution(
QueryString="""
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(amount) AS total_spent,
MAX(created_at) AS last_order_at
FROM orders_db.orders_bronze
WHERE year = '2024' AND month = '03'
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 100
""",
QueryExecutionContext={
"Database": "orders_db", # default database — avoids db prefix in SQL
"Catalog": "AwsDataCatalog" # use Glue Catalog (default)
},
ResultConfiguration={
"OutputLocation": "s3://my-datalake/athena-results/",
"EncryptionConfiguration": {
"EncryptionOption": "SSE_S3" # encrypt results at rest
}
},
WorkGroup="data-engineering-wg" # workgroup controls cost and permissions
)
query_execution_id = response["QueryExecutionId"]
print(f"Query submitted: {query_execution_id}")
Avoid using the primary workgroup for production pipelines; by default it has no guardrails.
An Athena query moves through up to five states. You must keep polling until you hit one of the three terminal states: SUCCEEDED, FAILED, or CANCELLED.
| State | Meaning | Action |
|---|---|---|
| QUEUED | Query is waiting for resources | Keep polling |
| RUNNING | Query is actively executing on Athena/Trino | Keep polling |
| SUCCEEDED | Query finished — results are on S3 | Fetch results ✅ |
| FAILED | Query errored — check StateChangeReason | Raise exception ❌ |
| CANCELLED | Query was stopped by user or timeout | Raise exception ❌ |
Poll with exponential backoff — start with a 2-second sleep, increase gradually. Most short queries finish in under 10 seconds; large scans can take minutes. The response also contains useful metadata: bytes scanned (cost) and execution time.
import time
from botocore.exceptions import ClientError
def wait_for_athena_query(athena_client, query_execution_id: str,
                          max_wait_sec: int = 300) -> dict:
    """Poll Athena until terminal state. Returns execution detail on success."""
    sleep_sec = 2
    elapsed = 0
    while elapsed < max_wait_sec:
        resp = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
        execution = resp["QueryExecution"]
        state = execution["Status"]["State"]
        print(f"[{elapsed}s] Athena query state: {state}")
        if state == "SUCCEEDED":
            stats = execution.get("Statistics", {})
            print(f" ✅ Done in {stats.get('TotalExecutionTimeInMillis', 0) / 1000:.1f}s")
            print(f" 💰 Bytes scanned: {stats.get('DataScannedInBytes', 0):,}")
            return execution
        if state in ("FAILED", "CANCELLED"):
            reason = execution["Status"].get("StateChangeReason", "Unknown")
            raise RuntimeError(f"Athena query {state}: {reason}")
        # Exponential backoff: 2 → 4 → 8 → 16 → 30 → 30 → …
        time.sleep(sleep_sec)
        elapsed += sleep_sec
        sleep_sec = min(sleep_sec * 2, 30)
    raise TimeoutError(f"Athena query did not complete within {max_wait_sec}s")
# ── Usage ─────────────────────────────────────────────────────────
execution_detail = wait_for_athena_query(athena, query_execution_id)
output_location = execution_detail["ResultConfiguration"]["OutputLocation"]
print(f"Results at: {output_location}")
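The capped exponential backoff used in the poller can also be written as a standalone generator, which makes the schedule easy to inspect and reuse. A minimal sketch; backoff_delays is a hypothetical name, and the defaults mirror the 2-second start and 30-second cap used above.

```python
from itertools import islice

def backoff_delays(start: int = 2, factor: int = 2, cap: int = 30):
    """Yield sleep durations that grow geometrically until they hit `cap`."""
    delay = start
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)

# First six delays with the defaults
print(list(islice(backoff_delays(), 6)))  # [2, 4, 8, 16, 30, 30]
```

Capping the delay keeps a long-running query from being checked too rarely, while short queries are still picked up within a couple of seconds.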
Athena returns results as a ResultSet object. The first row is always the column headers, subsequent rows are the data. Each row is a list of {"VarCharValue": "..."} dicts. You need to parse this structure into something usable like a list of dicts or a Pandas DataFrame.
{
"ResultSet": {
"Rows": [
{"Data": [{"VarCharValue": "customer_id"}, {"VarCharValue": "order_count"}, {"VarCharValue": "total_spent"}]},
{"Data": [{"VarCharValue": "CUST-001"}, {"VarCharValue": "12"}, {"VarCharValue": "1450.99"}]},
{"Data": [{"VarCharValue": "CUST-002"}, {"VarCharValue": "5"}, {"VarCharValue": "230.50"}]}
],
"ResultSetMetadata": {
"ColumnInfo": [
{"Name": "customer_id", "Type": "varchar"},
{"Name": "order_count", "Type": "bigint"},
{"Name": "total_spent", "Type": "double"}
]
}
}
}
A single query can return millions of rows. Always use the paginator. The first page's first row is the header — skip it. Subsequent pages have no header row.
import pandas as pd
def athena_results_to_df(athena_client, query_execution_id: str) -> pd.DataFrame:
    """Paginate Athena results and return a Pandas DataFrame."""
    paginator = athena_client.get_paginator("get_query_results")
    pages = paginator.paginate(QueryExecutionId=query_execution_id)
    headers = None
    rows = []
    for page_num, page in enumerate(pages):
        result_rows = page["ResultSet"]["Rows"]
        if page_num == 0:
            # First row of first page = column headers
            headers = [col["VarCharValue"] for col in result_rows[0]["Data"]]
            result_rows = result_rows[1:]  # skip header row
        for row in result_rows:
            # Each cell is {"VarCharValue": "..."}; a NULL cell has no such key
            values = [cell.get("VarCharValue", None) for cell in row["Data"]]
            rows.append(dict(zip(headers, values)))
    return pd.DataFrame(rows)
# ── Full end-to-end pattern ───────────────────────────────────────
# 1. Submit
qid = athena.start_query_execution(
QueryString="SELECT customer_id, SUM(amount) AS total FROM orders_bronze GROUP BY 1",
QueryExecutionContext={"Database": "orders_db"},
ResultConfiguration={"OutputLocation": "s3://my-datalake/athena-results/"}
)["QueryExecutionId"]
# 2. Poll
wait_for_athena_query(athena, qid)
# 3. Fetch and parse
df = athena_results_to_df(athena, qid)
print(df.head())
print(f"Rows returned: {len(df)}")
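Because every cell arrives as a string, numeric columns need casting before use. The sketch below uses the ResultSetMetadata ColumnInfo shown in the sample response to cast values per column. typed_rows and the CASTERS map are hypothetical, the type list is deliberately partial, and the "true"/"false" rendering of booleans is an assumption to verify against your own results.

```python
# Partial map from Athena column types to Python casters; unknown types stay as str
CASTERS = {
    "tinyint": int, "smallint": int, "integer": int, "bigint": int,
    "real": float, "float": float, "double": float,
    "boolean": lambda v: v == "true",  # assumed string rendering of booleans
}

def typed_rows(result_set: dict, skip_header: bool = True) -> list:
    """Convert one ResultSet page into dicts, casting by declared column type."""
    columns = result_set["ResultSetMetadata"]["ColumnInfo"]
    rows = result_set["Rows"][1:] if skip_header else result_set["Rows"]
    records = []
    for row in rows:
        record = {}
        for col, cell in zip(columns, row["Data"]):
            raw = cell.get("VarCharValue")  # a missing key means SQL NULL
            cast = CASTERS.get(col["Type"], str)
            record[col["Name"]] = None if raw is None else cast(raw)
        records.append(record)
    return records

# Sample page shaped like the response shown earlier
sample = {
    "Rows": [
        {"Data": [{"VarCharValue": "customer_id"}, {"VarCharValue": "order_count"}, {"VarCharValue": "total_spent"}]},
        {"Data": [{"VarCharValue": "CUST-001"}, {"VarCharValue": "12"}, {"VarCharValue": "1450.99"}]},
        {"Data": [{"VarCharValue": "CUST-002"}, {}, {"VarCharValue": "230.50"}]},
    ],
    "ResultSetMetadata": {"ColumnInfo": [
        {"Name": "customer_id", "Type": "varchar"},
        {"Name": "order_count", "Type": "bigint"},
        {"Name": "total_spent", "Type": "double"},
    ]},
}
print(typed_rows(sample))
```

Pass skip_header=False for every page after the first, since only the first page carries the header row.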
For large result sets, avoid get_query_results(): it is slow to paginate at scale. Instead, read the result CSV directly from S3 using boto3 s3.get_object() or Pandas read_csv(output_location) (which needs the s3fs package for s3:// paths). For SELECT queries, Athena writes the results to the OutputLocation as a CSV file named {QueryExecutionId}.csv.
import pandas as pd, io
# After SUCCEEDED — results are at OutputLocation/{QueryExecutionId}.csv
s3 = boto3.client("s3")
bucket = "my-datalake"
key = f"athena-results/{qid}.csv"
obj = s3.get_object(Bucket=bucket, Key=key)
df_large = pd.read_csv(io.BytesIO(obj["Body"].read()))
print(f"Loaded {len(df_large):,} rows from S3 directly")
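Rather than hard-coding the bucket and key, you can split the OutputLocation URI that get_query_execution() reports for the query. A small sketch using only the standard library; split_s3_uri is a hypothetical helper name.

```python
from urllib.parse import urlparse

def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")

# Intended usage (needs real clients, so shown as comments):
#   execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
#   bucket, key = split_s3_uri(execution["ResultConfiguration"]["OutputLocation"])
#   obj = s3.get_object(Bucket=bucket, Key=key)
```

Reading the location from the execution record also stays correct when a workgroup overrides the client-side output location.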
stop_query_execution() cancels a QUEUED or RUNNING query. It is useful in timeout handlers or when a runaway query is scanning too much data.
athena.stop_query_execution(QueryExecutionId=query_execution_id)
print(f"Cancelled query: {query_execution_id}")
list_query_executions() returns QueryExecutionIds for recent queries in a workgroup. It is useful for dashboards, cost auditing, and debugging. Note: it returns IDs only; you then call get_query_execution() or batch_get_query_execution() to get the details.
paginator = athena.get_paginator("list_query_executions")
query_ids = []
# MaxItems caps the total number of IDs returned across pages
for page in paginator.paginate(
    WorkGroup="data-engineering-wg",
    PaginationConfig={"MaxItems": 20}
):
    query_ids.extend(page["QueryExecutionIds"])
# Fetch details for each (batch_get_query_execution accepts up to 50 IDs per call)
details = athena.batch_get_query_execution(QueryExecutionIds=query_ids[:20])
for q in details["QueryExecutions"]:
    state = q["Status"]["State"]
    scanned = q.get("Statistics", {}).get("DataScannedInBytes", 0)
    print(f"{q['QueryExecutionId'][:8]}… {state:10s} {scanned/1e6:.1f} MB scanned")
Named queries are saved SQL templates stored in Athena — like stored procedures. They're visible in the Athena console and can be retrieved by name and executed programmatically. Useful for standard reconciliation queries, SLA checks, or data quality assertions that run on a schedule.
# ── Create a named query ──────────────────────────────────────────
resp = athena.create_named_query(
Name="daily-order-count-check",
Description="Row count reconciliation — run daily after Bronze load",
Database="orders_db",
QueryString="""
SELECT
year, month, day,
COUNT(*) AS row_count
FROM orders_bronze
WHERE year = '${year}' AND month = '${month}'
GROUP BY year, month, day
ORDER BY day
""",
WorkGroup="data-engineering-wg"
)
named_query_id = resp["NamedQueryId"]
# ── Retrieve and execute it ───────────────────────────────────────
named = athena.get_named_query(NamedQueryId=named_query_id)
sql = named["NamedQuery"]["QueryString"]
# Replace template variables and submit
sql_resolved = sql.replace("${year}", "2024").replace("${month}", "03")
qid = athena.start_query_execution(
QueryString=sql_resolved,
QueryExecutionContext={"Database": "orders_db"},
ResultConfiguration={"OutputLocation": "s3://my-datalake/athena-results/"}
)["QueryExecutionId"]
wait_for_athena_query(athena, qid)
df = athena_results_to_df(athena, qid)
print(df)
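Chained str.replace() calls silently leave a placeholder in place if a value is forgotten, and they accept any string. A slightly stricter sketch uses string.Template from the standard library, which already understands the ${name} syntax used above, plus a conservative allow-list for values. render_sql and the allowed-character pattern are assumptions for illustration, not an Athena feature.

```python
import re
from string import Template

# Conservative allow-list: letters, digits, underscore, hyphen
_SAFE_VALUE = re.compile(r"^[A-Za-z0-9_-]+$")

def render_sql(template_sql: str, **params: str) -> str:
    """Fill ${name} placeholders, rejecting values that could alter the SQL."""
    for name, value in params.items():
        if not _SAFE_VALUE.match(value):
            raise ValueError(f"Unsafe value for {name}: {value!r}")
    # substitute() raises KeyError if any ${placeholder} has no value
    return Template(template_sql).substitute(**params)

print(render_sql("WHERE year = '${year}' AND month = '${month}'", year="2024", month="03"))
```

For values that come from users, Athena's own parameterized queries (the ExecutionParameters argument of start_query_execution with ? placeholders) are worth considering instead of string templating.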
This is the pattern you will use in almost every pipeline that needs to query the data lake programmatically — reconciliation checks, DQ row counts, audit queries, Silver→Gold transformation triggers via SQL.
import boto3, time, io, pandas as pd
from botocore.exceptions import ClientError
class AthenaRunner:
    """Production Athena query runner: submit, poll, parse."""

    def __init__(self, region: str, output_location: str, workgroup: str = "primary"):
        self.athena = boto3.client("athena", region_name=region)
        self.s3 = boto3.client("s3", region_name=region)
        self.output_location = output_location  # e.g. "s3://bucket/athena-results/"
        self.workgroup = workgroup

    def run(self, sql: str, database: str, max_wait: int = 300) -> pd.DataFrame:
        """Submit SQL, wait for completion, return DataFrame."""
        qid = self._submit(sql, database)
        self._wait(qid, max_wait)
        return self._fetch(qid)

    def _submit(self, sql: str, database: str) -> str:
        resp = self.athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": self.output_location},
            WorkGroup=self.workgroup
        )
        return resp["QueryExecutionId"]

    def _wait(self, qid: str, max_wait: int):
        sleep, elapsed = 2, 0
        while elapsed < max_wait:
            status = self.athena.get_query_execution(QueryExecutionId=qid)
            state = status["QueryExecution"]["Status"]["State"]
            if state == "SUCCEEDED":
                return
            if state in ("FAILED", "CANCELLED"):
                reason = status["QueryExecution"]["Status"].get("StateChangeReason", "?")
                raise RuntimeError(f"Athena {state}: {reason}")
            time.sleep(sleep)
            elapsed += sleep
            sleep = min(sleep * 2, 30)
        raise TimeoutError(f"Athena query {qid} timed out after {max_wait}s")

    def _fetch(self, qid: str) -> pd.DataFrame:
        # Ask Athena where it wrote the CSV instead of rebuilding the key by hand;
        # this also works when output_location has no trailing slash
        execution = self.athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
        location = execution["ResultConfiguration"]["OutputLocation"]  # s3://bucket/prefix/<qid>.csv
        bucket, key = location[len("s3://"):].split("/", 1)
        obj = self.s3.get_object(Bucket=bucket, Key=key)
        return pd.read_csv(io.BytesIO(obj["Body"].read()))
# ── Usage ─────────────────────────────────────────────────────────
runner = AthenaRunner(
region="us-east-1",
output_location="s3://my-datalake/athena-results/",
workgroup="data-engineering-wg"
)
# Daily reconciliation check
df = runner.run(
sql="SELECT COUNT(*) AS cnt, SUM(amount) AS total FROM orders_bronze WHERE year='2024' AND month='03' AND day='15'",
database="orders_db"
)
print(df)
# Gate: if row count is zero, halt the pipeline
if int(df["cnt"].iloc[0]) == 0:
    raise RuntimeError("No rows found for 2024-03-15, pipeline halted")
- Row count reconciliation: compare source DB count vs Athena table count after every load
- DQ assertions: SELECT COUNT(*) FROM orders WHERE order_id IS NULL, assert 0
- SLA freshness check: SELECT MAX(created_at) FROM orders, assert within 24 hours
- Pipeline triggers: run Athena SQL to create Gold summary tables after Silver is ready
- Ad-hoc audit queries: triggered from Lambda or Airflow on a schedule
EMR APIs
Amazon EMR (Elastic MapReduce) is the most powerful way to run large-scale Spark jobs on AWS. boto3 covers three deployment modes, each through its own client: classic EMR clusters where you manage the nodes (emr), EMR Serverless with zero cluster management (emr-serverless), and EMR on EKS (emr-containers). You'll use these APIs to spin up clusters, submit Spark steps, poll for completion, and tear down, all programmatically from Airflow, Lambda, or a control script.
An EMR cluster goes through a predictable lifecycle: STARTING → BOOTSTRAPPING → RUNNING → WAITING → TERMINATING → TERMINATED. You create it with run_job_flow(), submit Spark steps to it, poll its state, and optionally auto-terminate it after all steps complete. If KeepJobFlowAliveWhenNoSteps=False, EMR shuts itself down — great for cost control.
run_job_flow() is you calling the rental company and saying "set up 10 machines with Spark installed." Once they're ready (WAITING), you send your job (add steps). When the job finishes, the machines are returned (TERMINATED). You only pay while the machines are running.
run_job_flow() is the main API to create an EMR cluster. Key parameters: ReleaseLabel (EMR version), Instances (node types and counts), Applications (Spark, Hadoop etc.), JobFlowRole (EC2 instance profile), ServiceRole (EMR service IAM role), and AutoTerminationPolicy / KeepJobFlowAliveWhenNoSteps for cost control.
import boto3
emr = boto3.client("emr", region_name="us-east-1")
response = emr.run_job_flow(
Name="de-spark-pipeline-cluster",
ReleaseLabel="emr-6.15.0", # EMR version — includes Spark 3.4
LogUri="s3://my-datalake/emr-logs/", # all cluster logs go here
Instances={
"MasterInstanceType": "m5.xlarge",
"SlaveInstanceType": "m5.2xlarge",
"InstanceCount": 5, # 1 master + 4 core nodes
"KeepJobFlowAliveWhenNoSteps": False, # auto-terminate when done!
"TerminationProtected": False,
"Ec2SubnetId": "subnet-0abc123def456", # private subnet
"EmrManagedMasterSecurityGroup": "sg-master-xxx",
"EmrManagedSlaveSecurityGroup": "sg-slave-xxx",
},
Applications=[
{"Name": "Spark"},
{"Name": "Hadoop"},
{"Name": "Hive"}, # for Glue metastore access
],
Configurations=[
{
"Classification": "spark-defaults",
"Properties": {
"spark.sql.shuffle.partitions": "200",
"spark.executor.memory": "6g",
"spark.executor.cores": "2",
"spark.dynamicAllocation.enabled":"true",
}
},
{
"Classification": "hive-site",
"Properties": {
"hive.metastore.client.factory.class":
"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
} # use Glue as the Hive metastore
}
],
BootstrapActions=[
{
"Name": "Install Python Libraries",
"ScriptBootstrapAction": {
"Path": "s3://my-datalake/bootstrap/install_deps.sh",
"Args": []
}
}
],
JobFlowRole="EMR_EC2_DefaultRole", # IAM instance profile for EC2 nodes
ServiceRole="EMR_DefaultRole", # IAM role for EMR service itself
Tags=[
{"Key": "Project", "Value": "DataPlatform"},
{"Key": "Environment", "Value": "prod"},
{"Key": "CostCenter", "Value": "DE-Team"},
],
VisibleToAllUsers=True,
)
cluster_id = response["JobFlowId"] # e.g. "j-2AXXXXXXGAPLF"
print(f"Cluster created: {cluster_id}")
Always set KeepJobFlowAliveWhenNoSteps=False for job-scoped clusters. This means the cluster auto-terminates after all steps finish, so you never pay for an idle cluster. If you forget this, an idle cluster can silently cost hundreds of dollars overnight.
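One caveat worth sketching: with KeepJobFlowAliveWhenNoSteps=False, a cluster created without any steps has nothing to run and will typically shut down as soon as it finishes starting. Transient clusters therefore usually receive their work through the Steps parameter of run_job_flow() itself. The helper names below (`spark_step`, `build_transient_cluster_request`) and the instance sizes are assumptions for illustration; the function only builds the request dictionary.

```python
def spark_step(name: str, script_s3_path: str,
               action_on_failure: str = "TERMINATE_CLUSTER") -> dict:
    """Build one spark-submit step definition."""
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_path],
        },
    }


def build_transient_cluster_request(name: str, steps: list) -> dict:
    """Build run_job_flow() kwargs for a cluster that exits after its steps."""
    if not steps:
        raise ValueError("A transient cluster needs at least one step")
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.2xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when steps are done
        },
        "Applications": [{"Name": "Spark"}],
        "Steps": steps,  # submitted together with the cluster
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_transient_cluster_request(
    "nightly-orders",
    [spark_step("Silver Transform", "s3://my-datalake/code/orders_silver_transform.py")],
)
# With a real client: emr.run_job_flow(**request)
```

If you prefer to add steps after the cluster is up, keep KeepJobFlowAliveWhenNoSteps=True and terminate the cluster explicitly.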
After creating a cluster, use describe_cluster() to poll its state. The key field is Cluster['Status']['State']. Valid states: STARTING, BOOTSTRAPPING, RUNNING, WAITING, TERMINATING, TERMINATED, TERMINATED_WITH_ERRORS.
import time
def wait_for_cluster_ready(emr_client, cluster_id, poll_interval=30):
    """Poll until cluster reaches WAITING (ready) or a terminal error state."""
    failed_states = {"TERMINATED", "TERMINATED_WITH_ERRORS"}
    while True:
        response = emr_client.describe_cluster(ClusterId=cluster_id)
        state = response["Cluster"]["Status"]["State"]
        reason = response["Cluster"]["Status"].get("StateChangeReason", {})
        print(f"Cluster {cluster_id} state: {state}")
        if state == "WAITING":
            print("✅ Cluster is ready: WAITING for steps.")
            return True
        if state in failed_states:
            print(f"❌ Cluster failed. Reason: {reason}")
            raise RuntimeError(f"Cluster {cluster_id} terminated unexpectedly: {reason}")
        time.sleep(poll_interval)  # check every 30 seconds by default
# Usage
wait_for_cluster_ready(emr, cluster_id)
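A polling loop with no upper bound can hang an orchestrator if the cluster gets stuck in STARTING. A minimal sketch of the same idea with a deadline follows; the function name `wait_for_cluster`, the 30-minute default, and the injectable `sleep` and `clock` parameters (there so the loop can be tested without waiting) are assumptions, not part of boto3.

```python
import time


def wait_for_cluster(emr_client, cluster_id, timeout_s=1800, poll_interval=30,
                     sleep=time.sleep, clock=time.monotonic):
    """Return 'WAITING' once the cluster is ready, or raise on failure/timeout."""
    deadline = clock() + timeout_s
    while True:
        status = emr_client.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]
        state = status["State"]
        if state == "WAITING":
            return state
        if state in {"TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS"}:
            reason = status.get("StateChangeReason", {})
            raise RuntimeError(f"Cluster {cluster_id} is {state}: {reason}")
        if clock() >= deadline:
            raise TimeoutError(f"Cluster {cluster_id} not ready after {timeout_s}s")
        sleep(poll_interval)

# With a real client: wait_for_cluster(boto3.client("emr"), "j-XXXXXXXX")
```

Treating TERMINATING as a failure here is a design choice: once the cluster starts shutting down it will never reach WAITING.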
Use list_clusters() with a paginator to list all clusters, optionally filtered by state. Useful for finding all RUNNING or WAITING clusters to monitor cost or add steps programmatically.
# List all RUNNING clusters
paginator = emr.get_paginator("list_clusters")
pages = paginator.paginate(ClusterStates=["RUNNING", "WAITING"])
for page in pages:
    for cluster in page["Clusters"]:
        print(f"ID: {cluster['Id']} Name: {cluster['Name']} State: {cluster['Status']['State']}")
If you used KeepJobFlowAliveWhenNoSteps=True (e.g. for a long-running cluster), you must explicitly terminate it when done. terminate_job_flows() accepts a list of cluster IDs — useful for bulk cleanup.
# Terminate a single cluster
emr.terminate_job_flows(JobFlowIds=[cluster_id])
print(f"Termination requested for {cluster_id}")
# Terminate multiple clusters in one call
emr.terminate_job_flows(
JobFlowIds=["j-CLUSTER1", "j-CLUSTER2", "j-CLUSTER3"]
)
If TerminationProtected=True was set on the cluster, terminate_job_flows() will not terminate it: the request is rejected with a validation error and the cluster keeps running. You must first call set_termination_protection(JobFlowIds=[id], TerminationProtected=False) before terminating. Always keep TerminationProtected=False for auto-managed clusters.
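The two-step teardown described above can be wrapped in one helper. This is a sketch under the assumption that the caller is allowed to lift termination protection; the function name `terminate_cluster` is hypothetical, and it takes the EMR client as a parameter so it can be exercised with a stub.

```python
def terminate_cluster(emr_client, cluster_id: str) -> None:
    """Lift termination protection if it is set, then request termination."""
    cluster = emr_client.describe_cluster(ClusterId=cluster_id)["Cluster"]
    if cluster.get("TerminationProtected"):
        emr_client.set_termination_protection(
            JobFlowIds=[cluster_id],
            TerminationProtected=False,
        )
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])

# With a real client: terminate_cluster(boto3.client("emr"), "j-XXXXXXXX")
```

For production clusters you may prefer to fail loudly instead of lifting protection automatically, since protection is usually set on purpose.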
A Step in EMR is a unit of work submitted to the cluster — in practice, this is almost always a spark-submit command. Steps run sequentially by default. Each step has a state: PENDING → RUNNING → COMPLETED / FAILED / CANCELLED. The ActionOnFailure field controls what happens when a step fails — either continue to the next step or terminate the entire cluster.
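Think of steps as stages on a factory line: when one stage fails, you can either keep the line moving (CONTINUE) or shut down the entire factory (TERMINATE_CLUSTER).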
The HadoopJarStep for Spark always uses command-runner.jar as the Jar, and passes the actual spark-submit command as Args. This is the standard pattern for running PySpark scripts on EMR.
response = emr.add_job_flow_steps(
JobFlowId=cluster_id,
Steps=[
{
"Name": "Silver Layer Transform — Orders",
"ActionOnFailure": "CONTINUE", # or "TERMINATE_CLUSTER"
"HadoopJarStep": {
"Jar": "command-runner.jar", # always this value for spark-submit
"Args": [
"spark-submit",
"--deploy-mode", "cluster",
"--master", "yarn",
"--conf", "spark.sql.shuffle.partitions=200",
"--conf", "spark.executor.memory=6g",
"--conf", "spark.executor.cores=2",
"--py-files", "s3://my-datalake/code/utils.zip",
"s3://my-datalake/code/orders_silver_transform.py",
# Any extra args become sys.argv in your script:
"--env", "prod",
"--run-date", "2024-03-15",
"--source-path","s3://my-datalake/bronze/orders/",
"--target-path","s3://my-datalake/silver/orders/",
]
}
}
]
)
step_ids = response["StepIds"]  # already a list of step ID strings
print(f"Steps submitted: {step_ids}") # e.g. ['s-2XXXXXXHXXXXXX']
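On the script side, the extra arguments in the step definition arrive as ordinary command-line arguments. A minimal sketch of how orders_silver_transform.py might read them with argparse follows; the parser layout is an assumption, since the source does not show the script itself.

```python
import argparse


def parse_args(argv=None):
    """Parse the arguments that the EMR step appends after the script path."""
    parser = argparse.ArgumentParser(description="Orders silver transform")
    parser.add_argument("--env", required=True)
    parser.add_argument("--run-date", required=True)      # available as args.run_date
    parser.add_argument("--source-path", required=True)   # available as args.source_path
    parser.add_argument("--target-path", required=True)   # available as args.target_path
    return parser.parse_args(argv)


# Simulate the argument list used in the step definition
args = parse_args([
    "--env", "prod",
    "--run-date", "2024-03-15",
    "--source-path", "s3://my-datalake/bronze/orders/",
    "--target-path", "s3://my-datalake/silver/orders/",
])
print(args.env, args.run_date, args.source_path, args.target_path)
```

Inside the real script you would call parse_args() with no argument so that it reads sys.argv.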
You can pass multiple step definitions in one add_job_flow_steps() call. EMR queues them and runs them sequentially. This is the cleanest pattern for a multi-stage pipeline (Bronze → Silver → Gold).
def make_step(name, script_s3_path, extra_args=None):
    args = [
        "spark-submit", "--deploy-mode", "cluster",
        "--master", "yarn",
        script_s3_path
    ]
    if extra_args:
        args.extend(extra_args)
    return {
        "Name": name,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args}
    }
response = emr.add_job_flow_steps(
JobFlowId=cluster_id,
Steps=[
make_step("Bronze Ingest", "s3://my-bucket/code/bronze_ingest.py"),
make_step("Silver Transform", "s3://my-bucket/code/silver_transform.py"),
make_step("Gold Aggregate", "s3://my-bucket/code/gold_aggregate.py"),
]
)
step_ids = response["StepIds"]  # already a list of step ID strings
print(f"3 steps submitted: {step_ids}")
Use describe_step() to poll a specific step's state. The terminal states are COMPLETED, FAILED, CANCELLED, and INTERRUPTED. Always poll with a sleep to avoid throttling.
import time
def wait_for_step(emr_client, cluster_id, step_id, poll_interval=30):
    """Poll until step reaches a terminal state. Returns True on COMPLETED."""
    failed_states = {"FAILED", "CANCELLED", "INTERRUPTED"}
    while True:
        response = emr_client.describe_step(
            ClusterId=cluster_id,
            StepId=step_id
        )
        step = response["Step"]
        state = step["Status"]["State"]
        name = step["Name"]
        reason = step["Status"].get("FailureDetails", {})
        print(f"Step '{name}' ({step_id}): {state}")
        if state == "COMPLETED":
            print("✅ Step completed successfully.")
            return True
        if state in failed_states:
            print(f"❌ Step did not complete: {reason}")
            raise RuntimeError(f"EMR step {step_id} ended with state {state}: {reason}")
        time.sleep(poll_interval)
# Poll the last submitted step
wait_for_step(emr, cluster_id, step_ids[-1])
Use list_steps() with a paginator to get the history of all steps on a cluster. Filter by StepStates to narrow results.
paginator = emr.get_paginator("list_steps")
pages = paginator.paginate(
ClusterId=cluster_id,
StepStates=["COMPLETED", "FAILED", "RUNNING"]
)
for page in pages:
    for step in page["Steps"]:
        print(
            f"StepId: {step['Id']} "
            f"Name: {step['Name']} "
            f"State: {step['Status']['State']}"
        )
If you need to abort a running or queued step (e.g. runaway job), use cancel_steps(). It accepts a list of step IDs.
emr.cancel_steps(
ClusterId=cluster_id,
StepIds=["s-STEPID1", "s-STEPID2"],
StepCancellationOption="SEND_INTERRUPT" # or "TERMINATE_PROCESS"
)
print("Steps cancellation requested.")
Boto3 has built-in waiters for EMR. step_complete waiter polls describe_step() internally until the step reaches a terminal state. It's simpler than a manual loop but less customizable.
# Built-in waiter: defaults are a 30-second delay and 60 attempts (30 min); WaiterConfig below overrides them
waiter = emr.get_waiter("step_complete")
waiter.wait(
ClusterId=cluster_id,
StepId=step_ids[0],
WaiterConfig={
"Delay": 30, # seconds between polls
"MaxAttempts":120, # 120 × 30s = 60 minutes max
}
)
print("Step completed (waiter returned).")
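cluster_running: waits until the cluster state is RUNNING or WAITING
cluster_terminated: waits for cluster state = TERMINATED
step_complete: waits for step state = COMPLETED
Note: cluster_running returns as soon as the cluster is RUNNING, so it does not guarantee the WAITING state (cluster idle and ready for steps). If you need to wait specifically for WAITING, use a manual polling loop as shown above.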
EMR Serverless removes all cluster management overhead. There are no master nodes, no core nodes, no bootstrap actions — you just submit a Spark job and AWS figures out compute. You pay only for the vCPU-hours and GB-hours your job actually uses. It's ideal for sporadic or unpredictable workloads where you don't want an idle cluster burning money.
| Feature | Classic EMR | EMR Serverless |
|---|---|---|
| Cluster startup time | 5–10 min | 30–60 sec (pre-init) |
| Cluster management | You manage | AWS manages |
| Cost model | Per-hour (instance) | Per-use (vCPU·s) |
| Custom bootstrap | Yes | No (use custom images) |
| Best for | Long-running / predictable | Sporadic / variable |
With EMR Serverless, you first create an Application (a reusable Spark runtime environment), then submit job runs to it. The application can be pre-initialized to reduce cold-start latency.
emr_serverless = boto3.client("emr-serverless", region_name="us-east-1")
# Step 1: Create application (one-time setup, reuse across job runs)
app_response = emr_serverless.create_application(
name="de-spark-app",
releaseLabel="emr-6.15.0",
type="SPARK",
autoStartConfiguration={"enabled": True}, # auto-start when job submitted
autoStopConfiguration={
"enabled": True,
"idleTimeoutMinutes": 15 # stop if idle for 15 min
},
initialCapacity={ # pre-warm workers to reduce latency
"DRIVER": {
"workerCount": 1,
"workerConfiguration": {"cpu":"2vCPU", "memory":"4GB"}
},
"EXECUTOR": {
"workerCount": 5,
"workerConfiguration": {"cpu":"4vCPU", "memory":"16GB"}
}
}
)
application_id = app_response["applicationId"]
print(f"Application created: {application_id}")
# Step 2: Start the application (if not auto-starting)
emr_serverless.start_application(applicationId=application_id)
print("Application started.")
Once the application exists, submit Spark jobs with start_job_run(). The key section is jobDriver.sparkSubmit where you provide the script S3 path, entry point arguments, and Spark configuration overrides.
job_response = emr_serverless.start_job_run(
applicationId=application_id,
executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessExecutionRole",
name="silver-orders-transform-2024-03-15",
jobDriver={
"sparkSubmit": {
"entryPoint": "s3://my-datalake/code/silver_transform.py",
"entryPointArguments": [
"--run-date", "2024-03-15",
"--env", "prod"
],
"sparkSubmitParameters": (
"--conf spark.executor.cores=4 "
"--conf spark.executor.memory=16g "
"--conf spark.sql.shuffle.partitions=200 "
"--py-files s3://my-datalake/code/utils.zip"
)
}
},
configurationOverrides={
"monitoringConfiguration": {
"s3MonitoringConfiguration": {
"logUri": "s3://my-datalake/emr-serverless-logs/"
}
}
},
tags={"Project": "DataPlatform", "Env": "prod"}
)
job_run_id = job_response["jobRunId"]
print(f"Job submitted: {job_run_id}")
Poll get_job_run() to track your serverless job. States: SUBMITTED → PENDING → SCHEDULED → RUNNING → SUCCESS / FAILED / CANCELLING / CANCELLED.
import time
def wait_for_serverless_job(client, app_id, job_run_id, poll_interval=20):
    """Poll until the job run reaches a terminal state. Returns True on SUCCESS."""
    while True:
        resp = client.get_job_run(applicationId=app_id, jobRunId=job_run_id)
        state = resp["jobRun"]["state"]
        print(f"Job {job_run_id}: {state}")
        if state == "SUCCESS":
            print("✅ Serverless job succeeded.")
            return True
        if state in {"FAILED", "CANCELLED"}:
            details = resp["jobRun"].get("stateDetails", "No details")
            raise RuntimeError(f"Serverless job {job_run_id} ended: {state} ({details})")
        time.sleep(poll_interval)
wait_for_serverless_job(emr_serverless, application_id, job_run_id)
Use cancel_job_run() to abort a running serverless job. Use stop_application() to stop the application (releases all pre-initialized capacity). Use delete_application() to permanently remove the application.
# Cancel a running job
emr_serverless.cancel_job_run(
applicationId=application_id,
jobRunId=job_run_id
)
# Stop application (releases pre-init capacity but keeps config)
emr_serverless.stop_application(applicationId=application_id)
# Delete application permanently (only accepted once the application is in the STOPPED or CREATED state)
emr_serverless.delete_application(applicationId=application_id)
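Because stop_application() is asynchronous and deletion is only accepted for an application that is stopped (or was never started), calling the two back to back can fail. A sketch of a safer teardown follows; the function name `stop_and_delete_application`, the polling defaults, and the injectable `sleep` are assumptions.

```python
import time


def stop_and_delete_application(client, application_id, poll_interval=10,
                                max_polls=60, sleep=time.sleep):
    """Stop the application, wait until it is stopped, then delete it."""
    client.stop_application(applicationId=application_id)
    for _ in range(max_polls):
        app = client.get_application(applicationId=application_id)["application"]
        if app["state"] in {"STOPPED", "CREATED"}:
            client.delete_application(applicationId=application_id)
            return
        sleep(poll_interval)
    raise TimeoutError(f"Application {application_id} did not stop in time")

# With a real client:
# stop_and_delete_application(boto3.client("emr-serverless"), application_id)
```

Running jobs should be cancelled or allowed to finish first, otherwise the stop request itself may be refused.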
This is the full pattern you'll use in Airflow, Lambda, or a control script: create cluster → wait for WAITING state → submit steps → poll each step → handle failure → terminate cluster → publish audit metrics.
import boto3, time, json
from botocore.exceptions import ClientError
emr = boto3.client("emr", region_name="us-east-1")
cw = boto3.client("cloudwatch", region_name="us-east-1")
sns = boto3.client("sns", region_name="us-east-1")
ALERT_TOPIC = "arn:aws:sns:us-east-1:123456789012:de-alerts"
def run_emr_pipeline(run_date):
    cluster_id = None
    try:
        # ── 1. Create cluster ─────────────────────
        resp = emr.run_job_flow(
            Name=f"pipeline-{run_date}",
            ReleaseLabel="emr-6.15.0",
            LogUri="s3://my-datalake/emr-logs/",
            Instances={
                "MasterInstanceType": "m5.xlarge",
                "SlaveInstanceType": "m5.2xlarge",
                "InstanceCount": 4,
                "KeepJobFlowAliveWhenNoSteps": True,  # we'll terminate manually
                "Ec2SubnetId": "subnet-0abc123"
            },
            Applications=[{"Name": "Spark"}],
            JobFlowRole="EMR_EC2_DefaultRole",
            ServiceRole="EMR_DefaultRole",
        )
        cluster_id = resp["JobFlowId"]
        print(f"Cluster created: {cluster_id}")
        # ── 2. Wait for cluster to reach WAITING ──
        while True:
            state = emr.describe_cluster(
                ClusterId=cluster_id
            )["Cluster"]["Status"]["State"]
            print(f"  Cluster state: {state}")
            if state == "WAITING":
                break
            if state in {"TERMINATED", "TERMINATED_WITH_ERRORS"}:
                raise RuntimeError(f"Cluster failed during startup: {state}")
            time.sleep(30)
        # ── 3. Submit Spark steps ─────────────────
        step_resp = emr.add_job_flow_steps(
            JobFlowId=cluster_id,
            Steps=[
                {
                    "Name": f"Silver Transform {run_date}",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": [
                            "spark-submit", "--deploy-mode", "cluster",
                            "s3://my-datalake/code/silver_transform.py",
                            "--run-date", run_date
                        ]
                    }
                }
            ]
        )
        step_id = step_resp["StepIds"][0]
        print(f"Step submitted: {step_id}")
        # ── 4. Poll step ──────────────────────────
        while True:
            step_state = emr.describe_step(
                ClusterId=cluster_id, StepId=step_id
            )["Step"]["Status"]["State"]
            print(f"  Step state: {step_state}")
            if step_state == "COMPLETED":
                break
            if step_state in {"FAILED", "CANCELLED", "INTERRUPTED"}:
                raise RuntimeError(f"Step {step_id} ended: {step_state}")
            time.sleep(30)
        print("✅ Pipeline completed successfully.")
    except Exception as e:  # ClientError is a subclass of Exception, so it is covered
        # ── 5. Alert on failure ───────────────────
        sns.publish(
            TopicArn=ALERT_TOPIC,
            Subject=f"[FAILURE] EMR pipeline {run_date}",
            Message=str(e)
        )
        raise
    finally:
        # ── 6. Always terminate cluster ───────────
        if cluster_id:
            emr.terminate_job_flows(JobFlowIds=[cluster_id])
            print(f"Cluster {cluster_id} termination requested.")
run_emr_pipeline("2024-03-15")
Always wrap the pipeline in a try/finally block and terminate the cluster in the finally. If you don't, a failed pipeline leaves an idle cluster running indefinitely, silently burning money. This is one of the most common and expensive production mistakes.
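The last stage of the pattern, publishing audit metrics, can be sketched with CloudWatch's put_metric_data(). The namespace, metric names, and the function name `publish_pipeline_metrics` below are assumptions chosen for illustration; the client is passed in so the helper can be tested with a stub.

```python
def publish_pipeline_metrics(cw_client, pipeline: str, succeeded: bool,
                             duration_s: float,
                             namespace: str = "DataPlatform/Pipelines") -> None:
    """Publish one success flag and one duration data point for a pipeline run."""
    dimensions = [{"Name": "Pipeline", "Value": pipeline}]
    cw_client.put_metric_data(
        Namespace=namespace,  # custom namespace (hypothetical name)
        MetricData=[
            {
                "MetricName": "RunSucceeded",
                "Dimensions": dimensions,
                "Value": 1.0 if succeeded else 0.0,
                "Unit": "Count",
            },
            {
                "MetricName": "RunDurationSeconds",
                "Dimensions": dimensions,
                "Value": float(duration_s),
                "Unit": "Seconds",
            },
        ],
    )

# With a real client, e.g. inside the finally block:
# publish_pipeline_metrics(boto3.client("cloudwatch"), "orders-silver", True, 272.0)
```

Publishing from the finally block means a data point is recorded for failed runs too, which is what an alarm on RunSucceeded needs.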
Task nodes on EMR are stateless (no HDFS data) — they are perfect for Spot Instances. Use On-Demand for master and core nodes (stable), and Spot for task nodes (can be interrupted without data loss). This can reduce cluster cost by 60–80%.
response = emr.run_job_flow(
Name="cost-optimized-cluster",
ReleaseLabel="emr-6.15.0",
LogUri="s3://my-datalake/emr-logs/",
Instances={
"InstanceGroups": [
{ # Master — On-Demand (never interrupt master)
"Name": "Master",
"InstanceRole": "MASTER",
"InstanceType": "m5.xlarge",
"InstanceCount": 1,
"Market": "ON_DEMAND",
},
{ # Core — On-Demand (stores HDFS data, can't be interrupted)
"Name": "Core",
"InstanceRole": "CORE",
"InstanceType": "m5.2xlarge",
"InstanceCount": 2,
"Market": "ON_DEMAND",
},
{ # Task — Spot (stateless, safe to interrupt)
"Name": "Task-Spot",
"InstanceRole": "TASK",
"InstanceType": "m5.2xlarge",
"InstanceCount": 6,
"Market": "SPOT",
"BidPrice": "0.10", # max bid per hour
},
],
"KeepJobFlowAliveWhenNoSteps": False,
"Ec2SubnetId": "subnet-0abc123"
},
Applications=[{"Name": "Spark"}],
JobFlowRole="EMR_EC2_DefaultRole",
ServiceRole="EMR_DefaultRole",
)
Lambda APIs
AWS Lambda lets you run Python code without managing servers. In data engineering, Lambda is the glue between services — it reacts to S3 file arrivals, triggers Glue jobs, invokes EMR steps, writes audit records to DynamoDB, and sends alerts via SNS. The boto3 Lambda API lets you invoke functions, deploy code updates, manage configuration, and set up triggers — all programmatically.
invoke() calls a Lambda function. The key parameter is InvocationType:
• RequestResponse (default) — synchronous. Your code waits until Lambda finishes and returns the response payload. Use this when you need the result (e.g. a validation function that returns pass/fail).
• Event — asynchronous. boto3 returns immediately after queuing the invocation; Lambda runs in the background. Use this when you just want to trigger something and don't need the result (e.g. fire-and-forget notification Lambda).
import boto3
import json
lambda_client = boto3.client("lambda", region_name="us-east-1")
# ── Synchronous invoke — wait for result ──
response = lambda_client.invoke(
FunctionName = "validate-orders-fn", # function name or ARN
InvocationType = "RequestResponse", # sync — wait for response
Payload = json.dumps({ # must be bytes or JSON string
"bucket" : "my-datalake",
"key" : "bronze/orders/2024/03/15/orders.parquet",
"run_date" : "2024-03-15"
}).encode("utf-8")
)
# ── Parse the response ──
status_code = response["StatusCode"] # HTTP 200 = Lambda invoked OK
function_error = response.get("FunctionError") # "Handled" or "Unhandled" if Lambda threw
payload = json.loads(response["Payload"].read()) # read the StreamingBody
print(f"HTTP Status : {status_code}") # 200
print(f"FunctionError: {function_error}") # None if success
print(f"Lambda result: {payload}")
# Example payload: {"status": "passed", "row_count": 48291, "dq_score": 99.2}
if function_error:
    raise RuntimeError(f"Lambda function error: {payload}")
response["StatusCode"] == 200 just means Lambda was invoked successfully — not that your function ran without errors. Always check response.get("FunctionError"). If it's "Handled" or "Unhandled", the function threw an exception. The actual error is in the Payload.
import boto3, json
lambda_client = boto3.client("lambda", region_name="us-east-1")
# ── Async invoke — fire and forget ──
response = lambda_client.invoke(
FunctionName = "send-pipeline-alert-fn",
InvocationType = "Event", # async — returns immediately (202)
Payload = json.dumps({
"pipeline" : "orders-silver-load",
"status" : "SUCCESS",
"rows" : 48291,
"duration" : "4m 32s"
}).encode("utf-8")
)
# StatusCode 202 = accepted for async execution
print(f"Async invoke accepted. Status: {response['StatusCode']}")
# No Payload to read — Lambda runs in the background
The Payload in a Lambda response is a StreamingBody object — you must call .read() on it and then json.loads() to get the actual data. If you forget .read(), you'll get a StreamingBody object instead of your data.
import boto3, json
from botocore.exceptions import ClientError
lambda_client = boto3.client("lambda", region_name="us-east-1")
def invoke_lambda(function_name: str, payload: dict) -> dict:
    """
    Invoke Lambda synchronously and return the parsed response.
    Raises RuntimeError if Lambda itself threw an exception.
    """
    try:
        response = lambda_client.invoke(
            FunctionName = function_name,
            InvocationType = "RequestResponse",
            Payload = json.dumps(payload).encode("utf-8")
        )
    except ClientError as e:
        # boto3-level error (e.g. function not found, no permissions)
        code = e.response["Error"]["Code"]
        if code == "ResourceNotFoundException":
            raise ValueError(f"Lambda function '{function_name}' not found")
        raise
    # Read and parse the payload
    raw = response["Payload"].read()
    result = json.loads(raw) if raw else {}
    # Check for Lambda-level errors (function threw an exception)
    if response.get("FunctionError"):
        error_type = result.get("errorType", "UnknownError")
        error_msg = result.get("errorMessage", str(result))
        raise RuntimeError(f"Lambda {function_name} failed [{error_type}]: {error_msg}")
    return result
# ── Usage ──
result = invoke_lambda(
function_name = "validate-orders-fn",
payload = {"bucket": "my-datalake", "key": "bronze/orders/2024-03-15/"}
)
print(f"DQ Score: {result['dq_score']}")
print(f"Row Count: {result['row_count']}")
list_functions() returns its results in pages. Always use a paginator, because an account can have hundreds of functions. Useful for auditing all Lambda functions, finding functions by naming convention, or building a deployment inventory.
import boto3
lambda_client = boto3.client("lambda", region_name="us-east-1")
# ── List all Lambda functions ──
paginator = lambda_client.get_paginator("list_functions")
pages = paginator.paginate()
de_functions = [] # collect only data engineering functions
for page in pages:
    for fn in page["Functions"]:
        name = fn["FunctionName"]
        runtime = fn.get("Runtime", "container")  # image-based functions have no Runtime key
        memory = fn["MemorySize"]
        timeout = fn["Timeout"]
        role = fn["Role"]
        # Filter to only DE-related Lambdas by naming convention
        if "pipeline" in name or "etl" in name or "glue" in name:
            de_functions.append(name)
            print(f"{name:50s} | {runtime:12s} | {memory}MB | {timeout}s")
print(f"\nTotal DE Lambda functions: {len(de_functions)}")
get_function() returns full metadata about a Lambda function: its runtime, memory, timeout, environment variables, VPC config, layers, and a pre-signed URL to download the current deployment package.
response = lambda_client.get_function(FunctionName="validate-orders-fn")
config = response["Configuration"]
print(f"Runtime : {config['Runtime']}") # python3.11
print(f"Memory : {config['MemorySize']} MB") # 512
print(f"Timeout : {config['Timeout']} seconds") # 300
print(f"Last Modified: {config['LastModified']}")
print(f"Code Size : {config['CodeSize']} bytes")
# Environment variables (values come back in plain text, so print only the keys)
env_vars = config.get("Environment", {}).get("Variables", {})
print(f"Env vars : {list(env_vars.keys())}") # ['ENV', 'BUCKET', 'TABLE']
# VPC config (if function is in a VPC)
vpc = config.get("VpcConfig", {})
print(f"VPC Subnets : {vpc.get('SubnetIds', [])}")
# Pre-signed S3 URL to download current code package
code_url = response["Code"]["Location"]
print(f"Code download URL (expires 10min): {code_url[:60]}...")
You can create Lambda functions programmatically — useful in CI/CD pipelines or when building infrastructure-as-code without Terraform. The code must be uploaded as a ZIP file (either inline as bytes or referenced from an S3 bucket).
import boto3
lambda_client = boto3.client("lambda", region_name="us-east-1")
# ── Deploy Lambda from S3 ZIP ──
response = lambda_client.create_function(
FunctionName = "trigger-glue-on-arrival-fn",
Runtime = "python3.11",
Role = "arn:aws:iam::123456789012:role/LambdaGlueRole",
Handler = "lambda_handler.handler", # filename.function_name
Code = {
"S3Bucket": "my-datalake",
"S3Key" : "lambda-code/trigger-glue-on-arrival.zip"
},
Description = "Trigger Glue ETL job when a new file lands in Bronze S3",
Timeout = 300, # 5 minutes max (Lambda hard limit = 15 min)
MemorySize = 256, # MB — start low, tune after profiling
Environment = {
"Variables": {
"GLUE_JOB_NAME" : "orders-bronze-to-silver",
"ENV" : "prod",
"AUDIT_TABLE" : "pipeline-audit-log"
}
},
VpcConfig = {
"SubnetIds" : ["subnet-private-1a", "subnet-private-1b"],
"SecurityGroupIds" : ["sg-lambda-outbound"]
},
Layers = [
"arn:aws:lambda:us-east-1:123456789012:layer:pandas-layer:3"
],
Tags = {"Project": "DataPlatform", "Environment": "prod"}
)
arn = response["FunctionArn"]
print(f"Created Lambda: {arn}")
Handler = "filename.function_name". If your file is lambda_handler.py and your function is def handler(event, context):, then Handler = "lambda_handler.handler". The filename is the Python module name (no .py).
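For small functions, the inline ZIP option can be built entirely in memory. The sketch below is one way to do it with the standard library; the helper name `build_lambda_zip` is hypothetical. It sets the file mode explicitly because, to my understanding, entries that are not world-readable can fail to import inside Lambda with a permission error.

```python
import io
import zipfile


def build_lambda_zip(module_name: str, source: str) -> bytes:
    """Return ZIP bytes containing <module_name>.py at the archive root."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        info = zipfile.ZipInfo(f"{module_name}.py")
        info.compress_type = zipfile.ZIP_DEFLATED
        info.external_attr = 0o644 << 16  # rw-r--r-- so the runtime user can read it
        zf.writestr(info, source)
    return buffer.getvalue()


source = "def handler(event, context):\n    return {'status': 'ok'}\n"
zip_bytes = build_lambda_zip("lambda_handler", source)
# With a real client:
# lambda_client.create_function(..., Handler="lambda_handler.handler",
#                               Code={"ZipFile": zip_bytes})
```

The module name passed to the helper must match the part before the dot in Handler, and the function name must match the part after it.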
When you update your Lambda code (bug fix, new logic), use update_function_code() to push the new ZIP. This is what CI/CD pipelines do after building and uploading a new deployment package to S3.
import boto3
lambda_client = boto3.client("lambda", region_name="us-east-1")
# ── Update code — typically run from a CI/CD pipeline after new ZIP is uploaded ──
response = lambda_client.update_function_code(
FunctionName = "trigger-glue-on-arrival-fn",
S3Bucket = "my-datalake",
S3Key = "lambda-code/trigger-glue-on-arrival-v2.zip",
Publish = True # True = create a new published version (immutable snapshot)
)
print(f"Updated to version : {response['Version']}")
print(f"Last modified : {response['LastModified']}")
print(f"Code SHA256 : {response['CodeSha256']}")
# ── Alternative: upload ZIP bytes directly (for small functions in CI) ──
with open("my_lambda.zip", "rb") as f:
    zip_bytes = f.read()
lambda_client.update_function_code(
    FunctionName = "trigger-glue-on-arrival-fn",
    ZipFile = zip_bytes
)
Use update_function_configuration() to change runtime settings without redeploying code: update environment variables, increase memory or timeout, change the execution role, or update VPC settings.
import boto3
lambda_client = boto3.client("lambda", region_name="us-east-1")
# ── Update environment variables ──
lambda_client.update_function_configuration(
FunctionName = "trigger-glue-on-arrival-fn",
Environment = {
"Variables": {
"GLUE_JOB_NAME" : "orders-bronze-to-silver-v2", # updated job name
"ENV" : "prod",
"AUDIT_TABLE" : "pipeline-audit-log",
"ALERT_SNS_ARN" : "arn:aws:sns:us-east-1:123:pipeline-alerts" # new var
}
},
Timeout = 600, # increased from 300 to 600 seconds
MemorySize = 512 # increased from 256 to 512 MB
)
print("Function configuration updated")
# ── Important: wait for the update to finish before invoking or updating again ──
# Updates are applied asynchronously; the built-in waiter polls until the last update has succeeded
lambda_client.get_waiter("function_updated").wait(FunctionName="trigger-glue-on-arrival-fn")
After you call update_function_code() or update_function_configuration(), the update takes a few seconds to propagate. If you immediately invoke the function, you might hit the old version. Use the function_updated waiter or a brief sleep in CI/CD pipelines.
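When you want S3, EventBridge, or SNS to invoke your Lambda, you must grant them permission via a resource-based policy using add_permission(). Without this, the trigger source gets a permission error when trying to call Lambda. SQS works differently: Lambda polls the queue through an event source mapping, so access is granted on the function's execution role instead.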
import boto3
lambda_client = boto3.client("lambda", region_name="us-east-1")
# ── Allow S3 to invoke this Lambda (for S3 event notifications) ──
lambda_client.add_permission(
FunctionName = "trigger-glue-on-arrival-fn",
StatementId = "AllowS3Invoke-bronze-bucket", # unique ID for this permission
Action = "lambda:InvokeFunction",
Principal = "s3.amazonaws.com",
SourceArn = "arn:aws:s3:::my-datalake", # only THIS bucket can invoke
SourceAccount = "123456789012" # prevents confused deputy attack
)
print("S3 trigger permission added")
# ── Allow EventBridge to invoke Lambda (for scheduled runs) ──
lambda_client.add_permission(
FunctionName = "trigger-glue-on-arrival-fn",
StatementId = "AllowEventBridgeInvoke",
Action = "lambda:InvokeFunction",
Principal = "events.amazonaws.com",
SourceArn = "arn:aws:events:us-east-1:123456789012:rule/daily-etl-rule"
)
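Granting permission only allows S3 to call the function; the bucket still has to be told which events to send. A sketch of the notification configuration follows. The helper name `build_lambda_notification` and the prefix/suffix values are assumptions for illustration.

```python
def build_lambda_notification(function_arn: str, prefix: str = "",
                              suffix: str = "") -> dict:
    """Build a NotificationConfiguration that sends object-created events to Lambda."""
    rules = []
    if prefix:
        rules.append({"Name": "prefix", "Value": prefix})
    if suffix:
        rules.append({"Name": "suffix", "Value": suffix})
    config = {
        "LambdaFunctionArn": function_arn,
        "Events": ["s3:ObjectCreated:*"],
    }
    if rules:
        config["Filter"] = {"Key": {"FilterRules": rules}}
    return {"LambdaFunctionConfigurations": [config]}


notification = build_lambda_notification(
    "arn:aws:lambda:us-east-1:123456789012:function:trigger-glue-on-arrival-fn",
    prefix="landing/orders/",
    suffix=".parquet",
)
# With a real client:
# s3.put_bucket_notification_configuration(
#     Bucket="my-datalake", NotificationConfiguration=notification)
```

Note that put_bucket_notification_configuration() replaces the bucket's whole notification configuration, so merge with any existing entries before calling it.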
Delete a Lambda function when decommissioning a pipeline. You can delete a specific published version or the entire function (all versions).
import boto3
lambda_client = boto3.client("lambda", region_name="us-east-1")
# Delete entire function (all versions and aliases)
lambda_client.delete_function(FunctionName="old-legacy-pipeline-fn")
print("Function deleted")
# Delete only a specific published version (keeps $LATEST and other versions)
lambda_client.delete_function(
FunctionName = "trigger-glue-on-arrival-fn",
Qualifier = "5" # version number
)
print("Version 5 deleted")
The most common DE Lambda pattern: S3 sends an event notification to Lambda when a file lands. Lambda reads the event, extracts the bucket and key, then starts a Glue job with the file details as arguments. This makes pipelines event-driven — they run automatically the moment data arrives.
import boto3, os, json, logging
from datetime import datetime, timezone
from urllib.parse import unquote_plus
logger = logging.getLogger()
logger.setLevel(logging.INFO)
glue = boto3.client("glue", region_name="us-east-1")
dynamo = boto3.client("dynamodb", region_name="us-east-1")
sns = boto3.client("sns", region_name="us-east-1")
GLUE_JOB_NAME = os.environ["GLUE_JOB_NAME"]  # from Lambda env vars
AUDIT_TABLE = os.environ["AUDIT_TABLE"]
SNS_ARN = os.environ["ALERT_SNS_ARN"]
def handler(event, context):
    """Triggered by S3 event notification when a new file lands."""
    run_ids, skipped = [], []
    for record in event.get("Records", []):
        # ── Extract S3 file details from the event ──
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        size = record["s3"]["object"]["size"]
        logger.info(f"File arrived: s3://{bucket}/{key} size={size} bytes")
        now = datetime.now(timezone.utc)
        # ── Start the Glue ETL job ──
        try:
            response = glue.start_job_run(
                JobName = GLUE_JOB_NAME,
                Arguments = {
                    "--source_bucket" : bucket,
                    "--source_key" : key,
                    "--run_date" : now.strftime("%Y-%m-%d"),
                    "--trigger" : "s3_event"
                }
            )
            run_id = response["JobRunId"]
            logger.info(f"Glue job started: {run_id}")
        except glue.exceptions.ConcurrentRunsExceededException:
            logger.warning("Glue job already at max concurrent runs, skipping this trigger")
            skipped.append(key)
            continue  # keep processing the remaining records
        # ── Write audit record to DynamoDB ──
        dynamo.put_item(
            TableName = AUDIT_TABLE,
            Item = {
                "run_id" : {"S": run_id},
                "job_name" : {"S": GLUE_JOB_NAME},
                "source_key": {"S": key},
                "status" : {"S": "STARTED"},
                "triggered_at": {"S": now.isoformat()},
                "trigger" : {"S": "s3_event"}
            }
        )
        run_ids.append(run_id)
    # Return once, after every record in the event has been handled
    return {"status": "ok", "glue_run_ids": run_ids, "skipped": skipped}
Files land in s3://my-datalake/landing/orders/ every 30 minutes. S3 notifies Lambda, and Lambda triggers the Glue Bronze ingestion job with the exact file path. No polling, no cron: the pipeline runs the moment data arrives.
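One detail worth testing locally before deploying a handler like this: S3 URL-encodes object keys inside event notifications, so a key containing a space or `=` does not arrive as it appears in the bucket. The sketch below is a minimal, standard-library-only helper; the function name `extract_s3_objects` and the sample event are illustrative assumptions, not part of the original handler.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> list[dict]:
    """Return one {bucket, key, size} dict per record in an S3 event."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        objects.append({
            "bucket": s3_info["bucket"]["name"],
            # Keys are URL-encoded in notifications: a space arrives as "+"
            "key": unquote_plus(s3_info["object"]["key"]),
            # Some event types carry no size, so fall back to 0
            "size": s3_info["object"].get("size", 0),
        })
    return objects


# Hypothetical sample event, trimmed to the fields the handler reads
sample_event = {
    "Records": [{
        "s3": {
            "bucket": {"name": "my-datalake"},
            "object": {
                "key": "landing/orders/2024-03-15/orders+batch%3D1.csv",
                "size": 1024,
            },
        }
    }]
}

for obj in extract_s3_objects(sample_event):
    print(f"s3://{obj['bucket']}/{obj['key']} ({obj['size']} bytes)")
```

Passing the decoded key to Glue means the job can use it directly in an S3 path without a second decoding step.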
When your heavy processing runs on EMR, Lambda acts as the orchestrator: it finds the running cluster (or creates one), submits a Spark step, and returns the step ID. A separate polling mechanism (Airflow, another Lambda, or a Step Function) tracks completion.
import boto3, os, json
emr = boto3.client("emr", region_name="us-east-1")
CLUSTER_ID = os.environ["EMR_CLUSTER_ID"]
SCRIPT_PATH = os.environ["SPARK_SCRIPT_S3_PATH"] # s3://my-datalake/code/transform.py
def handler(event, context):
    run_date = event.get("run_date", "2024-03-15")
    # Submit Spark step to existing running cluster
    response = emr.add_job_flow_steps(
        JobFlowId = CLUSTER_ID,
        Steps = [{
            "Name" : f"Silver Transform {run_date}",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep" : {
                "Jar" : "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "--master", "yarn",
                    SCRIPT_PATH,
                    "--run-date", run_date
                ]
            }
        }]
    )
    step_id = response["StepIds"][0]
    print(f"EMR step submitted: {step_id}")
    return {"cluster_id": CLUSTER_ID, "step_id": step_id, "run_date": run_date}
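The separate polling mechanism mentioned above could look roughly like the sketch below. It assumes the caller passes in an EMR client (for example `boto3.client("emr")`) and the IDs returned by the handler; the function name `wait_for_step` and its defaults are illustrative. boto3 also ships a `step_complete` waiter for EMR, which is the simpler choice when the final state does not need to be inspected.

```python
import time

# States after which an EMR step will not change any further
TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED"}


def wait_for_step(emr_client, cluster_id, step_id,
                  delay_seconds=30, max_attempts=60, sleep=time.sleep):
    """Poll describe_step until the step reaches a terminal state.

    emr_client is any object exposing describe_step(ClusterId=..., StepId=...).
    The sleep function is injectable so the loop can be tested without waiting.
    """
    for _ in range(max_attempts):
        response = emr_client.describe_step(ClusterId=cluster_id, StepId=step_id)
        state = response["Step"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        sleep(delay_seconds)
    raise TimeoutError(
        f"Step {step_id} did not finish after {max_attempts} polls"
    )


# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# final_state = wait_for_step(emr, "j-XXXXXXXX", "s-XXXXXXXX")
```

A Lambda function is a poor host for this loop when steps run longer than the 15-minute Lambda limit, which is why the polling usually lives in Airflow or a Step Function.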
Lambda is commonly used as a pipeline audit writer — called at the end of every Glue or EMR job (via SNS or EventBridge) to record the run result in a DynamoDB audit table. This gives you a central job history table queryable from anywhere.
import boto3, os, json
from datetime import datetime
dynamo = boto3.client("dynamodb", region_name="us-east-1")
AUDIT_TABLE = os.environ["AUDIT_TABLE"]
def handler(event, context):
    """
    Called by EventBridge when a Glue job state changes to SUCCEEDED or FAILED.
    EventBridge Glue event detail structure:
    {
        "jobName": "orders-bronze-to-silver",
        "state" : "SUCCEEDED",
        "jobRunId": "jr_abc123",
        "message": ""
    }
    """
    detail = event.get("detail", {})
    job_name = detail.get("jobName", "unknown")
    state = detail.get("state", "unknown")
    run_id = detail.get("jobRunId", "unknown")
    message = detail.get("message", "")
    dynamo.put_item(
        TableName = AUDIT_TABLE,
        Item = {
            "run_id" : {"S": run_id},
            "job_name" : {"S": job_name},
            "status" : {"S": state},
            "message" : {"S": message},
            "recorded_at" : {"S": datetime.utcnow().isoformat()}
        }
    )
    print(f"Audit written: {job_name} → {state}")
    return {"ok": True}
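For the audit Lambda to be called at all, an EventBridge rule has to match Glue job state changes. The sketch below builds such an event pattern as a plain dict; the helper name `glue_state_change_pattern` and the chosen default states are assumptions for illustration. The resulting JSON string is what would be passed as `EventPattern` to `put_rule()` on the `events` client.

```python
import json


def glue_state_change_pattern(job_names,
                              states=("SUCCEEDED", "FAILED", "TIMEOUT", "STOPPED")):
    """Build an EventBridge event pattern matching Glue job state changes."""
    return {
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {
            "jobName": list(job_names),   # only these jobs trigger the rule
            "state": list(states),        # only these final states are matched
        },
    }


pattern = glue_state_change_pattern(["orders-bronze-to-silver"])
print(json.dumps(pattern, indent=2))

# Usage (requires AWS credentials, so it is left commented out):
# import boto3
# events = boto3.client("events", region_name="us-east-1")
# events.put_rule(Name="glue-audit-rule", EventPattern=json.dumps(pattern))
```

Restricting the pattern to specific job names keeps unrelated Glue jobs in the same account from writing to the audit table.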
A dedicated Lambda function subscribed to an SNS failure topic handles all pipeline failure alerts. It formats a rich message and sends it to Slack (via HTTP), email, or PagerDuty. Centralising notifications in one Lambda keeps all other pipeline code clean.
import boto3, os, json
import urllib.request
SLACK_WEBHOOK = os.environ.get("SLACK_WEBHOOK_URL")
def handler(event, context):
    """Triggered by SNS pipeline-failure topic."""
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        pipeline = message.get("pipeline", "unknown")
        status = message.get("status", "FAILED")
        error = message.get("error", "No error details")
        run_id = message.get("run_id", "unknown")
        run_date = message.get("run_date", "unknown")
        # ── Format Slack message ──
        text = (
            f":rotating_light: *Pipeline Alert*\n"
            f"Pipeline : `{pipeline}`\n"
            f"Status : *{status}*\n"
            f"Run ID : `{run_id}`\n"
            f"Run Date : `{run_date}`\n"
            f"Error : ```{error}```"
        )
        if SLACK_WEBHOOK:
            payload = json.dumps({"text": text}).encode("utf-8")
            req = urllib.request.Request(
                SLACK_WEBHOOK,
                data=payload,
                headers={"Content-Type": "application/json"},
                method="POST"
            )
            with urllib.request.urlopen(req) as resp:
                print(f"Slack notification sent: {resp.status}")
    return {"ok": True}
pipeline-alerts topic → SNS fans out to: (1) this Lambda for Slack/PagerDuty, (2) Email subscription, (3) SQS for dead-letter storage. One SNS topic, three notification channels.
Lambda automatically retries async invocations on failure (2 retries by default). This means your handler must be idempotent — safe to run twice with the same event. For Glue triggers, always check if a run is already in progress before starting another.
import boto3, os, json, logging
from botocore.exceptions import ClientError
logger = logging.getLogger()
logger.setLevel(logging.INFO)
glue = boto3.client("glue", region_name="us-east-1")
GLUE_JOB_NAME = os.environ["GLUE_JOB_NAME"]
def is_job_already_running(job_name: str) -> bool:
    """Check if a Glue job run is already in progress."""
    paginator = glue.get_paginator("get_job_runs")
    # PaginationConfig caps the scan at 5 runs. A bare MaxResults only sets the
    # page size, so the paginator would still walk the job's entire run history.
    pages = paginator.paginate(
        JobName=job_name,
        PaginationConfig={"MaxItems": 5, "PageSize": 5}
    )
    for page in pages:
        for run in page["JobRuns"]:
            if run["JobRunState"] in ("STARTING", "RUNNING", "STOPPING"):
                logger.info(f"Job already running: {run['Id']}")
                return True
    return False
def handler(event, context):
    try:
        # Idempotency check — skip if already running
        if is_job_already_running(GLUE_JOB_NAME):
            return {"status": "skipped", "reason": "already_running"}
        response = glue.start_job_run(
            JobName = GLUE_JOB_NAME,
            Arguments = {"--run_date": event.get("run_date", "latest")}
        )
        run_id = response["JobRunId"]
        logger.info(f"Started Glue job: {run_id}")
        return {"status": "started", "run_id": run_id}
    except ClientError as e:
        code = e.response["Error"]["Code"]
        logger.error(f"ClientError [{code}]: {e}")
        # Re-raise so Lambda marks this invocation as failed and retries
        raise
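The running-job check above guards against overlapping runs, but a retry that arrives after the first run has already finished would still start a second one. A common complement is to derive a deterministic key from the triggering event and record it before starting the job (for example with a conditional write to a DynamoDB table). The sketch below covers only the key derivation and uses an in-memory set as a stand-in for that table; the function names and the choice of fields (`eTag`, `sequencer`) are assumptions for illustration.

```python
import hashlib


def idempotency_key(record: dict) -> str:
    """Derive a stable key from one S3 event record.

    The same delivered event always produces the same key, so a retried
    invocation can be recognised as a duplicate.
    """
    s3_info = record["s3"]
    parts = [
        s3_info["bucket"]["name"],
        s3_info["object"]["key"],
        s3_info["object"].get("eTag", ""),
        s3_info["object"].get("sequencer", ""),
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


_seen_keys = set()  # stand-in for a durable store such as a DynamoDB table


def should_process(record: dict) -> bool:
    """Return True the first time a record is seen, False on duplicates."""
    key = idempotency_key(record)
    if key in _seen_keys:
        return False
    _seen_keys.add(key)
    return True


# Hypothetical record, trimmed to the fields used above
record = {"s3": {"bucket": {"name": "my-datalake"},
                 "object": {"key": "landing/orders/a.csv",
                            "eTag": "abc123", "sequencer": "0055AED6DCD90281E5"}}}
print(should_process(record))  # first delivery
print(should_process(record))  # retry of the same event
```

In a real Lambda the set would not survive across execution environments, which is why the durable store matters.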
Two of Lambda's built-in boto3 waiters are especially useful in CI/CD pipelines:
• function_active — waits until a newly created function is fully active and invocable.
• function_updated — waits until an update_function_code() or update_function_configuration() update has fully propagated and the function is ready to invoke again.
Without waiters, invoking immediately after create/update can hit a stale state and fail with ResourceConflictException.
import boto3
lambda_client = boto3.client("lambda", region_name="us-east-1")
FUNCTION_NAME = "trigger-glue-on-arrival-fn"
# ── 1. Deploy new code ──
lambda_client.update_function_code(
FunctionName = FUNCTION_NAME,
S3Bucket = "my-datalake",
S3Key = "lambda-code/trigger-glue-v3.zip"
)
print("Code update submitted. Waiting for propagation...")
# ── 2. Wait until update is fully applied ──
waiter = lambda_client.get_waiter("function_updated")
waiter.wait(
FunctionName = FUNCTION_NAME,
WaiterConfig = {
"Delay" : 5, # poll every 5 seconds
"MaxAttempts": 20 # give up after 100 seconds
}
)
print("Update complete. Function is ready.")
# ── 3. Safe to invoke now ──
response = lambda_client.invoke(
FunctionName = FUNCTION_NAME,
InvocationType = "RequestResponse",
Payload = b'{"test": true}'
)
print(f"Test invoke status: {response['StatusCode']}")
update_function_code() → function_updated waiter → smoke-test invoke → done. Without the waiter, the smoke test might invoke the old version and give a false green.
| API | What It Does | Key Parameter | Returns |
|---|---|---|---|
| invoke() | Call a Lambda function | InvocationType (RequestResponse / Event) | StatusCode, Payload, FunctionError |
| list_functions() | List all functions (paginated) | — | List of function configs |
| get_function() | Get metadata + code URL for a function | FunctionName | Configuration + Code |
| create_function() | Deploy a new Lambda | Runtime, Role, Handler, Code | FunctionArn |
| update_function_code() | Push new deployment package | S3Bucket/S3Key or ZipFile | Version, CodeSha256 |
| update_function_configuration() | Change memory, timeout, env vars | MemorySize, Timeout, Environment | Updated config |
| add_permission() | Allow a service to invoke Lambda | Principal, Action, SourceArn | Statement JSON |
| delete_function() | Delete function or specific version | FunctionName, Qualifier | — |
| Waiter: function_active | Wait until new function is invocable | FunctionName | — |
| Waiter: function_updated | Wait until code/config update is live | FunctionName | — |
AWS Secrets Manager is where production pipelines store database passwords, API tokens, and private keys — never in code or environment variables committed to git. Every Data Engineer must know how to retrieve, create, rotate, and list secrets using boto3. This section covers every API you will actually use, with the full production pattern that appears in almost every real pipeline.
get_secret_value() is the single most important Secrets Manager API. It retrieves the current value of a secret. Secrets are stored either as a plain string (SecretString) or as binary data (SecretBinary). In Data Engineering, almost all secrets are stored as a JSON string inside SecretString — so you always call json.loads(SecretString) after retrieval.
Use response["SecretString"] for text secrets. The value is a raw string: if you stored JSON (username + password + host), you must parse it with json.loads(). This is the universal pattern in every production pipeline.
import boto3
import json
client = boto3.client("secretsmanager", region_name="us-east-1")
# ── Retrieve a secret ──
response = client.get_secret_value(SecretId="prod/orders-db/credentials")
# SecretString contains the raw JSON string stored in Secrets Manager
secret_string = response["SecretString"]
# Parse it into a Python dict
credentials = json.loads(secret_string)
# Now use individual fields
db_host = credentials["host"]
db_port = credentials["port"]
db_user = credentials["username"]
db_password = credentials["password"]
db_name = credentials["dbname"]
print(f"Connecting to {db_host}:{db_port}/{db_name} as {db_user}")
{"host":"orders-db.abc123.us-east-1.rds.amazonaws.com","port":5432,"username":"etl_user","password":"S3cur3P@ss!","dbname":"orders"}
After json.loads(), you get a Python dict with all those fields ready to use in your JDBC connection or psycopg2 call.
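Not every secret is a JSON document: some hold a single plain-text token, and some are stored as binary. The sketch below is one way to handle all three shapes from a `get_secret_value()` response; the function name `parse_secret_response` is an illustrative assumption, and the sample responses are hand-built dicts rather than real API output.

```python
import json


def parse_secret_response(response: dict):
    """Return a dict for JSON secrets, a str for plain text, bytes for binary."""
    raw = response.get("SecretString")
    if raw is not None:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            return raw                      # plain-text secret, e.g. one API token
        # Only treat the value as structured when it is a JSON object
        return parsed if isinstance(parsed, dict) else raw
    # No SecretString means the secret was stored as binary
    return response["SecretBinary"]


# Hand-built examples of the three shapes
json_secret = {"SecretString": '{"username": "etl_user", "port": 5432}'}
text_secret = {"SecretString": "tok_live_abcdef"}
binary_secret = {"SecretBinary": b"\x00\x01\x02"}

print(parse_secret_response(json_secret)["username"])
print(parse_secret_response(text_secret))
print(parse_secret_response(binary_secret))
```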
The SecretId parameter accepts either the secret name (human-readable) or the full ARN. For cross-account access you must use the full ARN, and it is also the safer choice when there could be naming conflicts. For same-account access, the name works fine and is easier to read in code.
import boto3, json
client = boto3.client("secretsmanager", region_name="us-east-1")
# Option 1 — by name (simple, same-account)
resp = client.get_secret_value(SecretId="prod/orders-db/credentials")
# Option 2 — by full ARN (cross-account, unambiguous)
resp = client.get_secret_value(
SecretId="arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/orders-db/credentials-AbCdEf"
)
# Option 3 — retrieve a specific VERSION (for rotation testing)
resp = client.get_secret_value(
SecretId = "prod/orders-db/credentials",
VersionStage= "AWSPREVIOUS" # AWSCURRENT (default) or AWSPREVIOUS
)
credentials = json.loads(resp["SecretString"])
print(credentials)
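This is the pattern every real pipeline uses: retrieve the secret, parse the JSON, then build a JDBC URL or psycopg2 connection. The secret is fetched once per process (in Lambda, once per warm execution environment) and cached to avoid hitting Secrets Manager on every row processed. The trade-off is that a cached value goes stale when the secret rotates, so long-running processes should refresh it periodically or after an authentication failure.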
import boto3, json, psycopg2
from functools import lru_cache
sm = boto3.client("secretsmanager", region_name="us-east-1")
@lru_cache(maxsize=1)
def get_db_credentials(secret_name: str) -> dict:
    """
    Fetch and cache DB credentials from Secrets Manager.
    lru_cache ensures we only call Secrets Manager ONCE per process lifetime.
    In Lambda, this cache persists across warm invocations — saving cost + latency.
    """
    resp = sm.get_secret_value(SecretId=secret_name)
    return json.loads(resp["SecretString"])
def get_db_connection(secret_name: str):
    """Return a live psycopg2 connection using credentials from Secrets Manager."""
    creds = get_db_credentials(secret_name)
    conn = psycopg2.connect(
        host = creds["host"],
        port = creds.get("port", 5432),
        user = creds["username"],
        password = creds["password"],
        dbname = creds["dbname"]
    )
    return conn
# ── Usage in a pipeline ──
if __name__ == "__main__":
    SECRET = "prod/orders-db/credentials"
    conn = get_db_connection(SECRET)
    cursor = conn.cursor()
    cursor.execute("SELECT COUNT(*) FROM orders WHERE status = 'PENDING'")
    count = cursor.fetchone()[0]
    print(f"Pending orders: {count}")
    conn.close()
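A cache that never expires keeps serving the old password after a rotation. One possible alternative, sketched here under stated assumptions, is a small time-to-live cache: the class name `TtlSecretCache`, the 5-minute default, and the injected `fetch` callable are all illustrative choices, not part of the original pipeline code. In real use, `fetch` would wrap `get_secret_value()` and `json.loads()`.

```python
import time


class TtlSecretCache:
    """Cache secret values in memory and refetch them after ttl_seconds."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch          # callable: secret_name -> secret value
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # secret_name -> (expires_at, value)

    def get(self, secret_name):
        now = self._clock()
        entry = self._entries.get(secret_name)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh, no API call
        value = self._fetch(secret_name)
        self._entries[secret_name] = (now + self._ttl, value)
        return value

    def invalidate(self, secret_name):
        """Drop a cached value, e.g. after the database rejects a password."""
        self._entries.pop(secret_name, None)


# Demo with a stand-in fetch function instead of a real Secrets Manager call
def fake_fetch(name):
    print(f"fetching {name}")
    return {"username": "etl_user", "password": "example-only"}


cache = TtlSecretCache(fake_fetch, ttl_seconds=300)
cache.get("prod/orders-db/credentials")   # triggers a fetch
cache.get("prod/orders-db/credentials")   # served from the cache
```

Calling `invalidate()` on an authentication error and retrying once is a simple way to pick up a rotated password without restarting the process.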
Pass the password as an option to spark.read.jdbc(). Never hardcode it in the url string or write it to a config file on disk.
import boto3, json
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("orders-etl").getOrCreate()
sm = boto3.client("secretsmanager", region_name="us-east-1")
# ── Fetch credentials ──
creds = json.loads(
sm.get_secret_value(SecretId="prod/orders-db/credentials")["SecretString"]
)
jdbc_url = (
f"jdbc:postgresql://{creds['host']}:{creds.get('port',5432)}/{creds['dbname']}"
)
# ── Read from PostgreSQL via JDBC ──
df = (
spark.read
.format("jdbc")
.option("url", jdbc_url)
.option("dbtable", "orders")
.option("user", creds["username"])
.option("password", creds["password"]) # ← from Secrets Manager, not hardcoded
.option("driver", "org.postgresql.Driver")
.load()
)
df.show(5)
create_secret() creates a brand-new secret in Secrets Manager. You typically do this through Terraform or a one-time setup script, not in your running pipeline. It sets the name, description, optional KMS key, and the initial secret value.
import boto3, json
sm = boto3.client("secretsmanager", region_name="us-east-1")
# ── Create a new secret ──
response = sm.create_secret(
Name = "prod/orders-db/credentials", # The secret name (path-style is best practice)
Description = "PostgreSQL credentials for the orders pipeline",
KmsKeyId = "alias/my-pipeline-key", # Optional — use CMK instead of default key
SecretString= json.dumps({ # Always store as JSON for structured creds
"host" : "orders-db.abc123.us-east-1.rds.amazonaws.com",
"port" : 5432,
"username": "etl_user",
"password": "InitialP@ssword123!",
"dbname" : "orders"
}),
Tags = [
{"Key": "Project", "Value": "orders-platform"},
{"Key": "Environment", "Value": "prod"},
{"Key": "ManagedBy", "Value": "terraform"}
]
)
print(f"Secret created: {response['ARN']}")
If you call create_secret() on a name that already exists, you get a ResourceExistsException. Always wrap the call in a try/except and fall back to put_secret_value() if the secret already exists, or use Terraform/CDK to manage the lifecycle.
put_secret_value() updates the value of an existing secret. This is used during manual rotation (changing a password), updating an API token that has expired, or programmatic rotation outside of Secrets Manager's built-in rotation. The old value becomes AWSPREVIOUS; the new value becomes AWSCURRENT.
import boto3, json
sm = boto3.client("secretsmanager", region_name="us-east-1")
SECRET_NAME = "prod/orders-db/credentials"
# ── Read current secret ──
current = json.loads(
sm.get_secret_value(SecretId=SECRET_NAME)["SecretString"]
)
# ── Build updated secret (only change what rotated) ──
updated = {**current, "password": "NewRotatedP@ss456!"}
# ── Write new version ──
response = sm.put_secret_value(
SecretId = SECRET_NAME,
SecretString= json.dumps(updated),
# VersionStages is optional — AWS sets AWSCURRENT automatically
)
print(f"Secret updated. New version: {response['VersionId']}")
When a token is refreshed, call put_secret_value() to update Secrets Manager. All pipelines reading that secret get the fresh token on their next uncached read, with no restarts needed.
In automation scripts, you often don't know if the secret exists yet. The pattern is: try create_secret(), and if it fails with ResourceExistsException, call put_secret_value() instead. This is the safe upsert pattern for secrets.
import boto3, json
from botocore.exceptions import ClientError
sm = boto3.client("secretsmanager", region_name="us-east-1")
def upsert_secret(secret_name: str, secret_dict: dict) -> str:
    """Create secret if it doesn't exist, else update it. Returns the secret ARN."""
    secret_string = json.dumps(secret_dict)
    try:
        resp = sm.create_secret(
            Name = secret_name,
            SecretString = secret_string
        )
        print(f"Created new secret: {secret_name}")
        return resp["ARN"]
    except ClientError as e:
        if e.response["Error"]["Code"] == "ResourceExistsException":
            # Secret already exists — just update the value
            resp = sm.put_secret_value(
                SecretId = secret_name,
                SecretString = secret_string
            )
            print(f"Updated existing secret: {secret_name}")
            return resp["ARN"] # same return type as the create branch
        else:
            raise # Re-raise any other error
# ── Usage ──
upsert_secret(
    "prod/snowflake/credentials",
    {"account": "xy12345", "user": "etl_svc", "private_key": "..."}
)
describe_secret() returns metadata about a secret — its ARN, name, description, rotation configuration, version IDs, tags — but NOT the actual secret value. Use this to check if a secret exists, find its rotation status, or audit its tags without exposing the credential.
import boto3
from botocore.exceptions import ClientError
sm = boto3.client("secretsmanager", region_name="us-east-1")
def secret_exists(secret_name: str) -> bool:
    """Check if a secret exists without fetching its value."""
    try:
        meta = sm.describe_secret(SecretId=secret_name)
        print(f"Secret found: {meta['ARN']}")
        print(f"Last rotated: {meta.get('LastRotatedDate', 'Never')}")
        print(f"Rotation enabled: {meta.get('RotationEnabled', False)}")
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ResourceNotFoundException":
            return False
        raise
# ── Check rotation status in a pipeline health check ──
if not secret_exists("prod/orders-db/credentials"):
    raise RuntimeError("Required secret not found — check Secrets Manager setup!")
print("Secret health check passed ✓")
Use describe_secret() when you only need to check existence or metadata (health checks, audits). Use get_secret_value() when you need the actual password or token. Because describe_secret() never returns the credential, a health check or audit job can run without permission to read the secret value.
list_secrets() returns all secrets in the account/region. Since an account can have hundreds of secrets, always use the paginator — never call list_secrets() in a bare loop with NextToken. You can filter by name prefix or tags to narrow results.
import boto3
sm = boto3.client("secretsmanager", region_name="us-east-1")
paginator = sm.get_paginator("list_secrets")
# ── List ALL secrets in the account ──
all_secrets = []
for page in paginator.paginate():
    for secret in page["SecretList"]:
        all_secrets.append({
            "name" : secret["Name"],
            "arn" : secret["ARN"],
            "rotation_enabled": secret.get("RotationEnabled", False),
            "last_rotated" : secret.get("LastRotatedDate", None),
            "tags" : {t["Key"]: t["Value"] for t in secret.get("Tags", [])}
        })
print(f"Total secrets: {len(all_secrets)}")
# ── Filter: find secrets for prod environment ──
prod_secrets = [s for s in all_secrets if s["tags"].get("Environment") == "prod"]
print(f"Prod secrets: {len(prod_secrets)}")
# ── Audit: find secrets where rotation is NOT enabled ──
no_rotation = [s for s in all_secrets if not s["rotation_enabled"]]
print(f"Secrets without rotation: {len(no_rotation)}")
for s in no_rotation:
    print(f" ⚠️ {s['name']}")
The Filters parameter lets you narrow down secrets without fetching all of them. Filter by name, tag-key, tag-value, or description. This is much more efficient when you have hundreds of secrets and only care about a specific project's secrets.
import boto3
sm = boto3.client("secretsmanager", region_name="us-east-1")
paginator = sm.get_paginator("list_secrets")
# ── Find only secrets under the "prod/" prefix ──
for page in paginator.paginate(
    Filters=[
        {"Key": "name", "Values": ["prod/"]} # Prefix match on name
    ]
):
    for secret in page["SecretList"]:
        print(secret["Name"])
Secrets Manager does NOT immediately delete secrets. By default, a secret enters a 30-day recovery window (configurable from 7 to 30 days) during which it can be restored. This protects against accidental deletion: if a pipeline breaks because a secret was deleted, you have time to recover. After the recovery window, the secret is permanently gone.
import boto3
sm = boto3.client("secretsmanager", region_name="us-east-1")
# ── Soft delete with a 7-day recovery window (the default is 30 days if omitted) ──
response = sm.delete_secret(
SecretId = "prod/old-api/token",
RecoveryWindowInDays = 7 # Can be 7 to 30 days
)
print(f"Secret scheduled for deletion on: {response['DeletionDate']}")
# ── Hard delete (immediate, NO recovery possible) — use with extreme caution ──
sm.delete_secret(
SecretId = "dev/temp-secret/test",
ForceDeleteWithoutRecovery = True # Permanent — no going back
)
During the recovery window, call sm.restore_secret(SecretId="...") to cancel the deletion and bring the secret back to active status. After the recovery window expires, this is no longer possible.
Secrets Manager can automatically rotate credentials on a schedule using a Lambda function. Each rotation invokes the function four times, once per step: (1) createSecret generates a new credential and stores it with put_secret_value() under the AWSPENDING label, (2) setSecret applies the new credential to the DB/API, (3) testSecret verifies that the new credential works, (4) finishSecret moves the AWSCURRENT label to the new version. The old value stays as AWSPREVIOUS until the next rotation. Your pipelines calling get_secret_value() always get AWSCURRENT automatically, so no pipeline restarts are needed.
import boto3
sm = boto3.client("secretsmanager", region_name="us-east-1")
# ── Enable automatic rotation with a 30-day schedule ──
# Note: by default rotate_secret() also starts one rotation immediately
sm.rotate_secret(
SecretId = "prod/orders-db/credentials",
RotationLambdaARN = "arn:aws:lambda:us-east-1:123456789012:function:SecretsManagerRotation",
RotationRules = {
"AutomaticallyAfterDays": 30 # Rotate every 30 days
}
)
print("Automatic rotation enabled. Lambda will rotate the secret every 30 days.")
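The rotation Lambda itself is invoked by Secrets Manager once per rotation step, with an event carrying `SecretId`, `ClientRequestToken`, and `Step`. The sketch below shows only the dispatch skeleton; the factory `make_rotation_handler` and the step functions passed into it are hypothetical placeholders, and the real work of each step (generating a password, updating the database, testing the login, moving the version label) is deliberately left out.

```python
# The four values Secrets Manager sends in event["Step"], in order
ROTATION_STEPS = ("createSecret", "setSecret", "testSecret", "finishSecret")


def make_rotation_handler(step_functions: dict):
    """Build a Lambda handler that routes each rotation step to its function.

    step_functions maps a step name to a callable(secret_id, token).
    """
    missing = set(ROTATION_STEPS) - step_functions.keys()
    if missing:
        raise ValueError(f"No function registered for steps: {sorted(missing)}")

    def handler(event, context):
        step = event["Step"]
        if step not in ROTATION_STEPS:
            raise ValueError(f"Unknown rotation step: {step}")
        # The token identifies the secret version being rotated
        return step_functions[step](event["SecretId"], event["ClientRequestToken"])

    return handler


# Placeholder step implementations that only log what they would do
def _log_step(name):
    def run(secret_id, token):
        print(f"{name}: secret={secret_id} version={token}")
        return name
    return run


handler = make_rotation_handler({name: _log_step(name) for name in ROTATION_STEPS})

# Simulate the four invocations Secrets Manager would make for one rotation
for step in ROTATION_STEPS:
    handler({"SecretId": "prod/orders-db/credentials",
             "ClientRequestToken": "example-token", "Step": step}, None)
```

Keeping the dispatch separate from the step logic makes each step easy to unit test before the function is attached to a real secret.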
Secrets Manager and SSM Parameter Store both store secrets, but they serve different use cases. Knowing when to use which is a common interview question and a real decision you make in every project.
| Feature | Secrets Manager | Parameter Store |
|---|---|---|
| Cost | $0.40/secret/month + API calls | Free (Standard tier) |
| Automatic Rotation | ✅ Built-in with Lambda | ❌ Manual only |
| Max Value Size | 64 KB (65,536 bytes) | 4 KB (Standard), 8 KB (Advanced) |
| Cross-account access | ✅ Native resource policy | ⚠️ Requires more setup |
| Versioning | ✅ AWSCURRENT / AWSPREVIOUS | ✅ Numbered parameter history |
| KMS Encryption | ✅ Always encrypted | SecureString only |
| Best for | DB passwords, API tokens, private keys | Config values, feature flags, non-secret params |
Every real project has a secrets utility module that all pipelines import. This module handles: retrieval, JSON parsing, caching, error handling, and logging. Write it once, use it everywhere. Here is the complete production-grade version:
"""
secrets_util.py — Universal Secrets Manager utility for data pipelines.
Usage:
    from secrets_util import get_secret, get_db_creds
"""
import boto3
import json
import logging
from functools import lru_cache
from typing import Any
from botocore.exceptions import ClientError
logger = logging.getLogger(__name__)
# Module-level client (reused across calls in Lambda warm invocations)
_sm_client = None
def _get_client(region: str = "us-east-1"):
    global _sm_client
    if _sm_client is None:
        _sm_client = boto3.client("secretsmanager", region_name=region)
    return _sm_client
@lru_cache(maxsize=32)
def get_secret(secret_name: str, region: str = "us-east-1") -> dict:
    """
    Retrieve and parse a JSON secret from Secrets Manager.
    Result is cached in-process (safe for Lambda warm invocations).
    Args:
        secret_name: Secret name or ARN
        region : AWS region
    Returns:
        dict parsed from the SecretString JSON
    Raises:
        ValueError : if secret is not valid JSON
        RuntimeError: if secret not found or access denied
    """
    client = _get_client(region)
    try:
        logger.info(f"Fetching secret: {secret_name}")
        response = client.get_secret_value(SecretId=secret_name)
        secret_string = response.get("SecretString")
        if not secret_string:
            raise ValueError(f"Secret '{secret_name}' has no SecretString (may be binary)")
        return json.loads(secret_string)
    except ClientError as e:
        code = e.response["Error"]["Code"]
        if code == "ResourceNotFoundException":
            raise RuntimeError(f"Secret not found: {secret_name}") from e
        elif code == "AccessDeniedException":
            raise RuntimeError(f"IAM permission denied for secret: {secret_name}") from e
        elif code in ("DecryptionFailure", "InternalServiceError"):
            raise RuntimeError(f"KMS/internal error retrieving {secret_name}: {code}") from e
        else:
            raise
    except json.JSONDecodeError as e:
        raise ValueError(f"Secret '{secret_name}' is not valid JSON: {e}") from e
def get_db_creds(secret_name: str, region: str = "us-east-1") -> dict:
    """
    Convenience wrapper — returns DB credential keys with validation.
    Expects secret to have: host, port, username, password, dbname
    """
    # Copy the dict so setdefault() below does not mutate the cached value
    creds = dict(get_secret(secret_name, region))
    required = {"host", "username", "password", "dbname"}
    missing = required - creds.keys()
    if missing:
        raise ValueError(f"Secret '{secret_name}' missing keys: {missing}")
    creds.setdefault("port", 5432)
    return creds
def get_jdbc_url(secret_name: str, engine: str = "postgresql") -> tuple[str, str, str]:
    """
    Returns (jdbc_url, user, password) ready for spark.read.jdbc().
    engine: 'postgresql' | 'mysql' | 'redshift'
    """
    creds = get_db_creds(secret_name)
    driver_map = {
        "postgresql": ("postgresql", "org.postgresql.Driver"),
        "mysql" : ("mysql", "com.mysql.cj.jdbc.Driver"),
        "redshift" : ("redshift", "com.amazon.redshift.jdbc42.Driver"),
    }
    scheme, _ = driver_map.get(engine, ("postgresql", "org.postgresql.Driver"))
    url = f"jdbc:{scheme}://{creds['host']}:{creds['port']}/{creds['dbname']}"
    return url, creds["username"], creds["password"]
# ─────────────────────────────────────────────
# Example usage in a Glue / EMR PySpark job
# ─────────────────────────────────────────────
if __name__ == "__main__":
    SECRET = "prod/orders-db/credentials"
    # Pattern 1 — get raw dict
    creds = get_secret(SECRET)
    print(f"Host: {creds['host']}")
    # Pattern 2 — get JDBC tuple for Spark
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("test").getOrCreate()
    jdbc_url, user, pwd = get_jdbc_url(SECRET, engine="postgresql")
    df = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "orders")
        .option("user", user)
        .option("password", pwd)
        .load()
    )
    df.printSchema()
The IAM role that runs the pipeline (for example, a Glue job role) needs the secretsmanager:GetSecretValue permission on the specific secret ARN (not *). In EMR, the EC2 instance profile attached to the cluster must have the same permission. In Lambda, the execution role needs it. This is the IAM setup that makes boto3 auth work without any keys in code.
| API | Purpose | Key Parameters | Returns |
|---|---|---|---|
| get_secret_value() | Retrieve secret value | SecretId, VersionStage | SecretString or SecretBinary |
| create_secret() | Create brand-new secret | Name, SecretString, KmsKeyId | ARN, Name, VersionId |
| put_secret_value() | Update/rotate existing secret | SecretId, SecretString | ARN, VersionId |
| describe_secret() | Get metadata (no value) | SecretId | ARN, rotation info, tags |
| list_secrets() | List all secrets (paginate!) | Filters | SecretList[] |
| delete_secret() | Schedule deletion | SecretId, RecoveryWindowInDays | DeletionDate |
| restore_secret() | Cancel scheduled deletion | SecretId | ARN, Name |
| rotate_secret() | Enable/trigger rotation | SecretId, RotationLambdaARN, RotationRules | ARN |
SQS APIs
Amazon SQS (Simple Queue Service) is a fully managed message queue used to decouple pipeline stages, buffer events, and handle failures gracefully via Dead Letter Queues. Every DE must master the consume-process-delete pattern — the backbone of reliable event-driven pipelines.
Creates a new SQS queue. You specify the queue name and optional attributes like visibility timeout, message retention, and whether it's a FIFO queue.
Calling create_queue() is like registering a new mailbox. Messages go in one end, your consumer picks them up from the other.
import boto3
sqs = boto3.client('sqs', region_name='us-east-1')
# Create a Standard Queue
response = sqs.create_queue(
QueueName='pipeline-events',
Attributes={
'VisibilityTimeout': '60', # seconds — how long msg is hidden after receive
'MessageRetentionPeriod': '86400', # 1 day in seconds
'ReceiveMessageWaitTimeSeconds': '20', # long polling — always set to 20
}
)
queue_url = response['QueueUrl']
print(queue_url)
# https://sqs.us-east-1.amazonaws.com/123456789/pipeline-events
# Create a FIFO Queue (ordered + exactly-once)
fifo_response = sqs.create_queue(
QueueName='ordered-jobs.fifo', # FIFO queues MUST end in .fifo
Attributes={
'FifoQueue': 'true',
'ContentBasedDeduplication': 'true'
}
)
Retrieves the URL of an existing queue by name. You need the URL for every SQS operation — treat it like a queue's address.
# Get the URL of an existing queue
response = sqs.get_queue_url(QueueName='pipeline-events')
queue_url = response['QueueUrl']
print(f"Queue URL: {queue_url}")
# For cross-account queue access, pass the owner account ID
response = sqs.get_queue_url(
QueueName='shared-pipeline-events',
QueueOwnerAWSAccountId='987654321012'
)
Permanently deletes a queue and all its messages. Irreversible — use with caution in production.
sqs.delete_queue(QueueUrl=queue_url)
print("Queue deleted")
# Note: After deletion, the queue name cannot be reused for 60 seconds
Sends a single message to the queue. The message body can be any string — typically a JSON payload. You can optionally set a delay and attach message attributes (metadata).
import json
# Basic send — JSON payload is the standard pattern
event = {
"pipeline_name": "customer_load",
"s3_path": "s3://my-bucket/raw/customers/2024-01-15.csv",
"triggered_by": "s3_event"
}
response = sqs.send_message(
QueueUrl=queue_url,
MessageBody=json.dumps(event),
DelaySeconds=0, # 0 = available immediately (default)
MessageAttributes={
'source': {
'DataType': 'String',
'StringValue': 's3-event-trigger'
},
'priority': {
'DataType': 'Number',
'StringValue': '1'
}
}
)
print(f"Message ID: {response['MessageId']}")
Sends up to 10 messages in one API call. More efficient than calling send_message() in a loop. Each entry needs a unique Id within the batch.
# Send 3 pipeline events in one batch call
files = [
"s3://bucket/raw/orders/part-001.parquet",
"s3://bucket/raw/orders/part-002.parquet",
"s3://bucket/raw/orders/part-003.parquet",
]
entries = [
{
'Id': str(i), # Unique ID within this batch
'MessageBody': json.dumps({'file': f}),
'DelaySeconds': 0
}
for i, f in enumerate(files)
]
response = sqs.send_message_batch(
QueueUrl=queue_url,
Entries=entries
)
# Always check for failures in batch operations
if response.get('Failed'):
    for failure in response['Failed']:
        print(f"Failed: {failure['Id']} — {failure['Message']}")
print(f"Sent: {len(response['Successful'])} messages")
Always check response['Failed'] in batch calls. Unlike single send_message(), batch calls don't raise exceptions for individual failures; they just report them in the response.
receive_message() fetches up to 10 messages at a time from the queue. After you receive a message, it becomes invisible to other consumers for the VisibilityTimeout duration, giving you time to process and delete it.
response = sqs.receive_message(
QueueUrl=queue_url,
MaxNumberOfMessages=10, # max allowed is 10
WaitTimeSeconds=20, # ALWAYS use 20 — enables long polling
VisibilityTimeout=120, # hide msg for 120s while you process
MessageAttributeNames=['All'] # receive custom attributes too
)
messages = response.get('Messages', []) # empty list if queue is empty
print(f"Received {len(messages)} messages")
for msg in messages:
    body = json.loads(msg['Body'])
    receipt_handle = msg['ReceiptHandle'] # needed to delete the message
    message_id = msg['MessageId']
    print(f"Processing: {body}")
    print(f"Receipt: {receipt_handle[:40]}...")
Short polling (WaitTimeSeconds=0) returns immediately even if the queue is empty — wasteful and expensive. Long polling (WaitTimeSeconds=20) waits up to 20 seconds for a message to arrive — saves cost and reduces empty responses.
# ❌ Short polling — costs money for empty polls
response = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=0)
# ✅ Long polling — waits up to 20s, returns as soon as message arrives
response = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=20)
Always set WaitTimeSeconds=20. It's the maximum and almost always the right choice. Long polling reduces the number of empty ReceiveMessage responses and lowers SQS costs significantly.
After a message is received, it's hidden from other consumers for VisibilityTimeout seconds. If you don't delete it within that window, it becomes visible again and another consumer can pick it up. Set the timeout to at least 3x your expected processing time.
# If processing takes longer than expected, extend the timeout
sqs.change_message_visibility(
QueueUrl=queue_url,
ReceiptHandle=receipt_handle,
VisibilityTimeout=300 # new timeout of 300s (5 min), counted from this call, not added to the old one
)
print("Extended visibility timeout")
After successfully processing a message, you must delete it using its ReceiptHandle. If you don't delete it, it reappears in the queue after the visibility timeout — causing duplicate processing.
# Delete a single message after processing
sqs.delete_message(
QueueUrl=queue_url,
ReceiptHandle=receipt_handle # from receive_message() response
)
print("Message deleted — processing confirmed")
delete_message_batch() deletes up to 10 messages in a single API call. Use it when you receive a batch of messages and process them all together.
# Delete all successfully processed messages in one call
entries = [
{'Id': str(i), 'ReceiptHandle': msg['ReceiptHandle']}
for i, msg in enumerate(processed_messages)
]
response = sqs.delete_message_batch(
QueueUrl=queue_url,
Entries=entries
)
if response.get('Failed'):
    print(f"Some deletes failed: {response['Failed']}")
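Because delete_message_batch() accepts at most 10 entries, a consumer that has processed more messages than that needs to chunk its deletes. The helper below is a minimal sketch of that idea; the function name delete_in_batches and the choice to pass the client in as a parameter are assumptions made for illustration, not part of boto3.

```python
from typing import Any, Dict, List


def delete_in_batches(sqs_client: Any, queue_url: str,
                      messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Delete messages 10 at a time and collect the 'Failed' entries of every call."""
    failed: List[Dict[str, Any]] = []
    for start in range(0, len(messages), 10):
        chunk = messages[start:start + 10]
        # Ids only need to be unique within a single batch request
        entries = [
            {'Id': str(i), 'ReceiptHandle': m['ReceiptHandle']}
            for i, m in enumerate(chunk)
        ]
        response = sqs_client.delete_message_batch(QueueUrl=queue_url, Entries=entries)
        failed.extend(response.get('Failed', []))
    return failed
```

A non-empty return value means some messages were not deleted and will reappear after their visibility timeout, so the caller may want to log or retry them.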
get_queue_attributes() returns queue metadata including approximate message counts, configuration, and the DLQ redrive policy. It is vital for monitoring queue depth in pipeline observability.
response = sqs.get_queue_attributes(
QueueUrl=queue_url,
AttributeNames=['All']
)
attrs = response['Attributes']
# Key metrics every DE monitors
visible_msgs = int(attrs['ApproximateNumberOfMessages'])
invisible_msgs = int(attrs['ApproximateNumberOfMessagesNotVisible'])
delayed_msgs = int(attrs['ApproximateNumberOfMessagesDelayed'])
redrive_policy = attrs.get('RedrivePolicy', 'No DLQ configured')
print(f"Visible (waiting): {visible_msgs}")
print(f"Invisible (in-flight): {invisible_msgs}")
print(f"Delayed: {delayed_msgs}")
print(f"DLQ policy: {redrive_policy}")
# Alert if queue is building up
if visible_msgs > 1000:
    print("⚠️ Queue depth alarm: pipeline may be falling behind")
A DLQ is a separate queue where messages are automatically moved after failing processing maxReceiveCount times. It's your safety net — failed messages don't vanish, they accumulate in the DLQ for inspection and replay.
# Step 1 — Create the DLQ first
dlq_response = sqs.create_queue(QueueName='pipeline-events-dlq')
dlq_url = dlq_response['QueueUrl']
# Get DLQ ARN (needed for redrive policy)
dlq_attrs = sqs.get_queue_attributes(
QueueUrl=dlq_url,
AttributeNames=['QueueArn']
)
dlq_arn = dlq_attrs['Attributes']['QueueArn']
# Step 2 — Create main queue with DLQ redrive policy
main_queue = sqs.create_queue(
QueueName='pipeline-events',
Attributes={
'RedrivePolicy': json.dumps({
'deadLetterTargetArn': dlq_arn,
'maxReceiveCount': '3' # after 3 failed attempts → DLQ
}),
'VisibilityTimeout': '60'
}
)
# Step 3 — Monitor DLQ depth with CloudWatch alarm
# DLQ filling up = alert on-call immediately
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_alarm(
AlarmName='pipeline-dlq-depth',
MetricName='ApproximateNumberOfMessagesVisible',
Namespace='AWS/SQS',
Dimensions=[{'Name': 'QueueName', 'Value': 'pipeline-events-dlq'}],
Threshold=1,
ComparisonOperator='GreaterThanOrEqualToThreshold',
EvaluationPeriods=1,
Period=60,
Statistic='Sum',
AlarmActions=['arn:aws:sns:us-east-1:123456789012:on-call-alerts'] # placeholder 12-digit account ID
)
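Once the root cause of the failures is fixed, the messages parked in the DLQ usually need to be replayed. The sketch below moves them back to the main queue by hand using only receive_message(), send_message(), and delete_message(). The function name replay_dlq and its max_messages cap are assumptions for illustration, and the sketch copies only the message body, not custom message attributes. SQS also has a built-in DLQ redrive feature that may be preferable in practice.

```python
from typing import Any


def replay_dlq(sqs_client: Any, dlq_url: str, main_queue_url: str,
               max_messages: int = 100) -> int:
    """Move up to max_messages from the DLQ back to the main queue; return the count moved."""
    moved = 0
    while moved < max_messages:
        response = sqs_client.receive_message(
            QueueUrl=dlq_url,
            MaxNumberOfMessages=min(10, max_messages - moved),
            WaitTimeSeconds=1,  # short wait so the loop ends quickly once the DLQ is empty
        )
        messages = response.get('Messages', [])
        if not messages:
            break
        for msg in messages:
            # Re-send first, delete second: a crash in between duplicates a message
            # instead of losing it
            sqs_client.send_message(QueueUrl=main_queue_url, MessageBody=msg['Body'])
            sqs_client.delete_message(QueueUrl=dlq_url, ReceiptHandle=msg['ReceiptHandle'])
            moved += 1
    return moved
```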
This is the most important SQS pattern for data engineers. The rule: receive → process → delete on success → let visibility timeout expire on failure (message returns to queue → eventually hits DLQ).
import boto3, json, time, logging
from botocore.exceptions import ClientError
logger = logging.getLogger(__name__)
sqs = boto3.client('sqs', region_name='us-east-1')
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-events"
def process_message(body: dict) -> None:
    """Your actual pipeline logic goes here."""
    pipeline_name = body['pipeline_name']
    s3_path = body['s3_path']
    logger.info(f"Starting pipeline: {pipeline_name} for {s3_path}")
    # e.g. trigger Glue job, run Spark transform, etc.
    glue = boto3.client('glue')
    glue.start_job_run(
        JobName=pipeline_name,
        Arguments={'--input_path': s3_path}
    )
def run_consumer_loop():
    print("Consumer loop started...")
    while True:
        try:
            # Step 1: Receive (long polling, up to 10 at once)
            response = sqs.receive_message(
                QueueUrl=QUEUE_URL,
                MaxNumberOfMessages=10,
                WaitTimeSeconds=20, # long poll
                VisibilityTimeout=120 # 2 min to process
            )
            messages = response.get('Messages', [])
            if not messages:
                print("No messages, waiting...")
                continue
            for msg in messages:
                receipt = msg['ReceiptHandle']
                try:
                    # Step 2: Parse and process (a malformed body counts as a failure too)
                    body = json.loads(msg['Body'])
                    process_message(body)
                    # Step 3: Delete ONLY on success
                    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=receipt)
                    logger.info(f"✅ Processed and deleted: {msg['MessageId']}")
                except Exception as e:
                    # Step 4: On failure, do NOT delete
                    # Message becomes visible again after VisibilityTimeout
                    # After maxReceiveCount attempts it goes to the DLQ
                    logger.error(f"❌ Processing failed: {e}; message will retry")
        except ClientError as e:
            logger.error(f"SQS error: {e.response['Error']['Code']}")
            time.sleep(5)
if __name__ == "__main__":
    run_consumer_loop()
If your processing is taking longer than expected, extend the visibility timeout before it expires. Otherwise the message reappears and gets double-processed.
import threading
def keep_alive_heartbeat(queue_url, receipt_handle, stop_event, interval=90):
    """Extend visibility every 90s for long-running jobs until stop_event is set."""
    # wait() returns True as soon as stop_event is set, which ends the loop promptly
    while not stop_event.wait(interval):
        try:
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=120 # reset the clock
            )
            logger.info("Heartbeat: visibility extended")
        except Exception:
            break # message already deleted or expired
# Start heartbeat in background thread during processing
stop_heartbeat = threading.Event()
t = threading.Thread(target=keep_alive_heartbeat, args=(QUEUE_URL, receipt, stop_heartbeat), daemon=True)
t.start()
# ... process the message here, then stop the heartbeat before deleting it
stop_heartbeat.set()
| API Call | What It Does | Key Param |
|---|---|---|
| create_queue() | Create new queue (Standard or FIFO) | QueueName, Attributes |
| get_queue_url() | Get URL of existing queue by name | QueueName |
| delete_queue() | Permanently delete queue + messages | QueueUrl |
| send_message() | Send one message | MessageBody, DelaySeconds |
| send_message_batch() | Send up to 10 messages at once | Entries (list, max 10) |
| receive_message() | Fetch up to 10 messages | WaitTimeSeconds=20, VisibilityTimeout |
| delete_message() | Confirm processing and remove from queue | ReceiptHandle |
| delete_message_batch() | Delete up to 10 messages at once | Entries with ReceiptHandles |
| change_message_visibility() | Extend or reset visibility timeout | VisibilityTimeout |
| get_queue_attributes() | Queue depth, DLQ policy, config | AttributeNames=['All'] |
SNS APIs
Amazon SNS (Simple Notification Service) is a fully managed pub/sub messaging service. One topic can fan-out a single message to many subscribers — email, SQS queues, Lambda functions, or HTTPS endpoints. For data engineers, SNS is the backbone of pipeline alerting (failure notifications, SLA breaches, CloudWatch alarms) and fan-out architectures where one event needs to trigger multiple independent consumers.
create_topic() creates a new SNS topic. A topic is a logical channel: publishers send messages to it, and SNS delivers a copy to every subscriber. The call returns a TopicArn, which you'll use for every other operation.
Think of a topic as a radio station: create_topic() launches the station. Anyone can tune in (subscribe), and when the station broadcasts (publish), every listener hears it at the same time, regardless of how many listeners there are.
import boto3
sns = boto3.client('sns', region_name='us-east-1')
# Create a Standard topic (most common for DE)
response = sns.create_topic(
Name='pipeline-alerts',
Attributes={
'DisplayName': 'Pipeline Alerts' # shown as the sender name in email notifications
}
)
topic_arn = response['TopicArn']
print(topic_arn)
# arn:aws:sns:us-east-1:123456789012:pipeline-alerts
# Create a FIFO topic — ordered, exactly-once delivery to SQS FIFO subscribers
fifo_response = sns.create_topic(
Name='pipeline-events.fifo', # FIFO topics MUST end in .fifo
Attributes={
'FifoTopic': 'true',
'ContentBasedDeduplication': 'true'
}
)
create_topic() is idempotent: calling it again with the same Name and the same attributes just returns the existing topic's ARN instead of erroring. Safe to call on every pipeline startup.
delete_topic() permanently deletes a topic and all its subscriptions. Any messages already delivered to subscribers (e.g. sitting in an SQS queue) are not removed; only the topic and the subscription links are deleted.
sns.delete_topic(TopicArn=topic_arn)
print("Topic deleted")
# Note: subscriptions attached to this topic are also removed,
# but messages already in downstream SQS queues remain there.
list_topics() lists all SNS topics in the account/region. Like most list operations, results are paginated, so always use a paginator instead of manually following NextToken.
paginator = sns.get_paginator('list_topics')
for page in paginator.paginate():
    for topic in page['Topics']:
        print(topic['TopicArn'])
# Filter for pipeline-related topics only
pipeline_topics = [
    t['TopicArn'] for page in paginator.paginate()
    for t in page['Topics']
    if 'pipeline' in t['TopicArn']
]
subscribe() attaches a subscriber to a topic. The Protocol tells SNS how to deliver the message, and Endpoint tells it where. Common protocols for data pipelines: email, sqs, lambda, https, sms.
subscribe() is like giving the radio station your address and saying "deliver every broadcast here, by mail / by SQS queue / by phone call." The station doesn't care how many addresses are on file: it broadcasts once, and the postal system (SNS) delivers to all of them.
# 1. Email subscription, for on-call alerts
email_sub = sns.subscribe(
TopicArn=topic_arn,
Protocol='email',
Endpoint='data-oncall@company.com'
)
# Subscriber must confirm via a link sent to their inbox before
# messages start flowing — SNS marks this as PendingConfirmation
# 2. SQS subscription — for pipeline-to-pipeline fan-out
sqs_sub = sns.subscribe(
TopicArn=topic_arn,
Protocol='sqs',
Endpoint='arn:aws:sqs:us-east-1:123456789012:dq-failure-queue',
Attributes={
'RawMessageDelivery': 'true' # delivers raw JSON, not wrapped in SNS envelope
}
)
# 3. Lambda subscription — for routing to Slack via a Lambda function
lambda_sub = sns.subscribe(
TopicArn=topic_arn,
Protocol='lambda',
Endpoint='arn:aws:lambda:us-east-1:123456789012:function:slack-notifier'
)
# 4. HTTPS subscription — for webhooks (e.g. PagerDuty, custom API)
https_sub = sns.subscribe(
TopicArn=topic_arn,
Protocol='https',
Endpoint='https://events.pagerduty.com/integration/abc123/enqueue',
ReturnSubscriptionArn=True
)
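For the SQS subscription above to receive anything, the queue's access policy has to allow the topic to send messages to it. The sketch below builds such a policy as a JSON string. The function name build_sns_to_sqs_policy and the Sid value are hypothetical; the statement itself follows the common pattern of allowing the sns.amazonaws.com service principal, restricted to one source topic.

```python
import json


def build_sns_to_sqs_policy(queue_arn: str, topic_arn: str) -> str:
    """Return a queue access policy (JSON string) letting one topic send to one queue."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowSnsTopicSendMessage",  # hypothetical statement ID
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            # Only this topic may send; other topics are rejected
            "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
        }],
    }
    return json.dumps(policy)


# Usage (needs AWS credentials, so it is left commented out):
# policy_json = build_sns_to_sqs_policy(queue_arn, topic_arn)
# sqs.set_queue_attributes(QueueUrl=queue_url, Attributes={'Policy': policy_json})
```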
email and https subscriptions sit in PendingConfirmation state until the endpoint confirms (clicking a link, or responding to an SNS handshake POST). sqs and lambda subscriptions where SNS has the right IAM permission are auto-confirmed.
unsubscribe() removes a subscription using its SubscriptionArn (returned by subscribe() or found via list_subscriptions_by_topic()).
sns.unsubscribe(
SubscriptionArn=email_sub['SubscriptionArn']
)
print("Unsubscribed")
# Note: while a subscription awaits confirmation, subscribe() returns the literal string
# 'pending confirmation' (list calls show 'PendingConfirmation') instead of a real ARN,
# and cannot be unsubscribed via API — it must be left to expire (3 days)
list_subscriptions_by_topic() lists every subscriber attached to a specific topic, which is useful for auditing who/what receives pipeline alerts.
paginator = sns.get_paginator('list_subscriptions_by_topic')
for page in paginator.paginate(TopicArn=topic_arn):
    for sub in page['Subscriptions']:
        print(f"{sub['Protocol']:6} -> {sub['Endpoint']} ({sub['SubscriptionArn']})")
# email -> data-oncall@company.com (PendingConfirmation)
# sqs -> arn:aws:sqs:...:dq-failure-queue (arn:aws:sns:...:8f2e...)
# lambda -> arn:aws:lambda:...:slack-notifier (arn:aws:sns:...:a1b9...)
By default, every subscriber gets every message published to a topic. A filter policy lets a subscriber say "only send me messages where MessageAttributes match this pattern" — turning one topic into many logical channels.
import json
# Only deliver to this subscriber when severity = ERROR or CRITICAL
# AND the pipeline is one of the "tier-1" pipelines
filter_policy = {
"severity": ["ERROR", "CRITICAL"],
"pipeline_tier": ["tier1"]
}
sns.set_subscription_attributes(
SubscriptionArn=sqs_sub['SubscriptionArn'],
AttributeName='FilterPolicy',
AttributeValue=json.dumps(filter_policy)
)
# Optionally scope the filter policy to the message body itself
# instead of MessageAttributes (FilterPolicyScope='MessageBody')
sns.set_subscription_attributes(
SubscriptionArn=sqs_sub['SubscriptionArn'],
AttributeName='FilterPolicyScope',
AttributeValue='MessageBody'
)
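To make the matching rule concrete, here is a small local model of how the filter policy above is evaluated: every key in the policy must match (AND), and a key matches when the attribute equals any one of the listed values (OR). This is a deliberately simplified sketch that covers exact string matching only, not the prefix, numeric, or anything-but operators; matches_filter_policy is a hypothetical helper, not an SNS API.

```python
from typing import Dict, List


def matches_filter_policy(policy: Dict[str, List[str]], attributes: Dict[str, str]) -> bool:
    """Simplified exact-match check: all policy keys must match (AND),
    each against any one of its allowed values (OR)."""
    return all(
        key in attributes and attributes[key] in allowed
        for key, allowed in policy.items()
    )


policy = {"severity": ["ERROR", "CRITICAL"], "pipeline_tier": ["tier1"]}
print(matches_filter_policy(policy, {"severity": "ERROR", "pipeline_tier": "tier1"}))  # True
print(matches_filter_policy(policy, {"severity": "WARN", "pipeline_tier": "tier1"}))   # False
print(matches_filter_policy(policy, {"severity": "CRITICAL"}))                         # False
```

The third call returns False because a message that lacks an attribute named in the policy does not match it.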
publish() sends a single message to every subscriber of a topic (subject to filter policies). Subject becomes the email subject line for email subscribers; for other protocols it only appears as a field in the SNS JSON envelope. MessageAttributes is metadata used by filter policies and is delivered separately from the message body.
import json
# Basic alert publish — the classic "pipeline failed" notification
message_body = {
"pipeline_name": "customer_load",
"run_id": "run_20260615_0300",
"status": "FAILED",
"error": "Schema mismatch on column 'email'"
}
response = sns.publish(
TopicArn=topic_arn,
Subject="Pipeline Failed: customer_load", # email subject line; must be ASCII (no emoji) and under 100 characters
Message=json.dumps(message_body),
MessageAttributes={
'severity': {
'DataType': 'String',
'StringValue': 'ERROR'
},
'pipeline_tier': {
'DataType': 'String',
'StringValue': 'tier1'
}
}
)
print(response['MessageId'])
MessageAttributes are not part of Message; they're separate key/value metadata sent alongside it. Filter policies match against MessageAttributes (or the message body, if FilterPolicyScope='MessageBody'), not against the raw text of Message.
publish_batch() publishes up to 10 messages in a single API call, which is much more efficient than calling publish() in a loop when a pipeline run produces multiple events (e.g. one alert per failed table in a multi-table load).
failed_tables = ["orders", "customers", "inventory"]
entries = []
for i, table in enumerate(failed_tables):
    entries.append({
        'Id': str(i), # unique within this batch only
        'Message': json.dumps({
            "table": table,
            "status": "FAILED",
            "run_id": "run_20260615_0300"
        }),
        'Subject': f"Table load failed: {table}",
        'MessageAttributes': {
            'severity': {'DataType': 'String', 'StringValue': 'ERROR'}
        }
    })
response = sns.publish_batch(
    TopicArn=topic_arn,
    PublishBatchRequestEntries=entries
)
# Always check for partial failures
for failed in response.get('Failed', []):
    print(f"Entry {failed['Id']} failed: {failed['Code']} - {failed['Message']}")
publish_batch() can partially succeed, with some entries in Successful and others in Failed. Always check both lists; a call that returns without raising doesn't mean every message was published.
In the fan-out pattern, one published event needs to trigger multiple independent consumers. For example, a "new file arrived" event should (1) trigger an ingestion pipeline, (2) update a data catalog, and (3) log to an audit system. Publish once to SNS; attach an SQS queue per consumer.
# Setup once: one topic, three SQS subscribers
topic_arn = sns.create_topic(Name='s3-file-arrivals')['TopicArn']
for queue_arn in [
    'arn:aws:sqs:us-east-1:123456789012:ingestion-queue',
    'arn:aws:sqs:us-east-1:123456789012:catalog-update-queue',
    'arn:aws:sqs:us-east-1:123456789012:audit-log-queue',
]:
    sns.subscribe(
        TopicArn=topic_arn,
        Protocol='sqs',
        Endpoint=queue_arn,
        Attributes={'RawMessageDelivery': 'true'}
    )
# Each time a file lands, publish ONE event — all 3 queues get a copy
sns.publish(
TopicArn=topic_arn,
Message=json.dumps({
"bucket": "my-data-lake",
"key": "raw/customers/2026-06-15.csv",
"event_time": "2026-06-15T03:00:00Z"
})
)
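The fan-out setup above enables RawMessageDelivery, so each queue receives the published JSON as-is. A consumer that also has to read from queues subscribed without raw delivery sees the payload wrapped in an SNS envelope instead. The helper below is a sketch that accepts both forms; extract_payload is a hypothetical name, and the detection rule (an object with Type equal to "Notification" plus a Message field) is a heuristic that would misfire on a raw payload that happens to have the same shape.

```python
import json
from typing import Any, Dict


def extract_payload(sqs_body: str) -> Dict[str, Any]:
    """Return the publisher's JSON payload from an SQS message body,
    whether it arrived raw or wrapped in the SNS envelope."""
    parsed = json.loads(sqs_body)
    # Envelope case: the original message is a JSON string under "Message"
    if isinstance(parsed, dict) and parsed.get("Type") == "Notification" and "Message" in parsed:
        return json.loads(parsed["Message"])
    # Raw delivery case: the body already is the payload
    return parsed
```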
The most common SNS pattern in data engineering: when a Glue job, EMR step, or Airflow task fails, publish to an alert topic. Email subscribers get a notification directly; a Lambda subscriber can reformat the message and post it to Slack.
# Inside a Lambda subscriber — reformats SNS message for Slack
import json
import urllib.request
SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"
def lambda_handler(event, context):
    for record in event['Records']:
        sns_message = json.loads(record['Sns']['Message'])
        slack_payload = {
            "text": (
                f"🚨 *Pipeline Failure*\n"
                f"Pipeline: `{sns_message['pipeline_name']}`\n"
                f"Run ID: `{sns_message['run_id']}`\n"
                f"Error: `{sns_message['error']}`"
            )
        }
        req = urllib.request.Request(
            SLACK_WEBHOOK,
            data=json.dumps(slack_payload).encode(),
            headers={'Content-Type': 'application/json'}
        )
        # Timeout keeps a slow webhook from holding the Lambda until its own timeout
        urllib.request.urlopen(req, timeout=10)
CloudWatch Alarms publish directly to an SNS topic when a metric breaches a threshold (e.g. EMR step failure count, Glue job duration, DLQ depth). You don't call publish() yourself here — CloudWatch does it for you once the alarm is wired to the topic ARN.
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_alarm(
AlarmName='glue-job-duration-exceeded',
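Namespace='Glue', # Glue job metrics are published under 'Glue', not 'AWS/Glue'
MetricName='glue.driver.aggregate.elapsedTime',
Statistic='Maximum',
Period=300,
EvaluationPeriods=1,
Threshold=3600000, # 1 hour in ms
ComparisonOperator='GreaterThanThreshold',
AlarmActions=[topic_arn], # SNS topic gets notified on breach
Dimensions=[
{'Name': 'JobName', 'Value': 'customer_load_etl'},
{'Name': 'JobRunId', 'Value': 'ALL'}, # aggregate across all runs of the job
{'Name': 'Type', 'Value': 'count'}
]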
)
# Now: alarm fires -> SNS publishes -> every subscriber (email/Slack/PagerDuty) notified
| API Call | What It Does | Key Param |
|---|---|---|
| create_topic() | Create a topic (Standard or FIFO) | Name (idempotent) |
| delete_topic() | Delete topic + all subscriptions | TopicArn |
| list_topics() | List all topics (paginated) | (none) |
| subscribe() | Attach a subscriber to a topic | Protocol, Endpoint |
| unsubscribe() | Remove a subscriber | SubscriptionArn |
| list_subscriptions_by_topic() | List subscribers on a topic (paginated) | TopicArn |
| set_subscription_attributes() | Set filter policy / raw delivery | AttributeName='FilterPolicy' |
| publish() | Send one message to all subscribers | Message, MessageAttributes |
| publish_batch() | Send up to 10 messages at once | PublishBatchRequestEntries |
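RawMessageDelivery matters: with it enabled, SQS subscribers receive the published message body as-is; without it, the body arrives wrapped in an SNS JSON envelope.
DynamoDB APIs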
Amazon DynamoDB is a fully managed, serverless key-value and document database. For Data Engineers, DynamoDB is not a primary data warehouse — it's the metadata backbone of your pipeline: job audit tables, control tables, watermark tracking, run history, and config storage. It's millisecond-latency at any scale, requires zero schema management, and is the go-to choice for storing pipeline state that your pipeline code reads and writes programmatically via boto3.
create_table() creates a DynamoDB table. You must define the KeySchema (partition key + optional sort key) and declare the key attributes' types in AttributeDefinitions. Every other attribute is schema-free: each item can carry whatever attributes you want. The two billing modes are PAY_PER_REQUEST (on-demand, great for pipelines with variable traffic) and PROVISIONED (fixed read/write capacity units).
import boto3
dynamodb = boto3.client('dynamodb', region_name='us-east-1')
# Pipeline audit table — partition key = pipeline_id, sort key = run_id
dynamodb.create_table(
TableName='pipeline_audit',
KeySchema=[
{'AttributeName': 'pipeline_id', 'KeyType': 'HASH'}, # partition key
{'AttributeName': 'run_id', 'KeyType': 'RANGE'} # sort key
],
AttributeDefinitions=[
{'AttributeName': 'pipeline_id', 'AttributeType': 'S'}, # S=String
{'AttributeName': 'run_id', 'AttributeType': 'S'}
],
BillingMode='PAY_PER_REQUEST' # no capacity planning needed
)
# Watermark table — just a partition key, no sort key
dynamodb.create_table(
TableName='pipeline_watermark',
KeySchema=[
{'AttributeName': 'pipeline_id', 'KeyType': 'HASH'}
],
AttributeDefinitions=[
{'AttributeName': 'pipeline_id', 'AttributeType': 'S'}
],
BillingMode='PAY_PER_REQUEST'
)
Only the key attributes (pipeline_id, run_id) appear in AttributeDefinitions. Every other field you'll write (status, row_count, error_message) is not declared anywhere. DynamoDB is schema-less beyond the key.
describe_table() returns the full table description including its status (CREATING, ACTIVE, DELETING), key schema, billing mode, item count, and size. Use this to wait for a table to become ACTIVE after creation before writing to it.
import time
# Wait for table to be ACTIVE before writing
def wait_for_table_active(table_name, max_wait=60):
    for _ in range(max_wait // 2):
        resp = dynamodb.describe_table(TableName=table_name)
        status = resp['Table']['TableStatus']
        if status == 'ACTIVE':
            print(f"{table_name} is ACTIVE")
            return
        print(f"Status: {status}, waiting...")
        time.sleep(2)
    raise TimeoutError("Table did not become ACTIVE in time")
wait_for_table_active('pipeline_audit')
# Also useful for reading current item count (eventually consistent)
info = dynamodb.describe_table(TableName='pipeline_audit')['Table']
print(f"Items: {info['ItemCount']}, Size: {info['TableSizeBytes']} bytes")
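The polling loop above can also be delegated to the client's built-in table_exists waiter, which repeatedly calls describe_table until the table reports ACTIVE. The wrapper below is a minimal sketch: the function name wait_until_active and the way max_wait is converted into WaiterConfig values are assumptions for illustration.

```python
from typing import Any


def wait_until_active(dynamodb_client: Any, table_name: str, max_wait: int = 60) -> None:
    """Block until the table is ACTIVE, using the client's 'table_exists' waiter."""
    delay = 2  # seconds between describe_table polls
    waiter = dynamodb_client.get_waiter('table_exists')
    waiter.wait(
        TableName=table_name,
        WaiterConfig={'Delay': delay, 'MaxAttempts': max(1, max_wait // delay)},
    )


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# wait_until_active(boto3.client('dynamodb', region_name='us-east-1'), 'pipeline_audit')
```

If the table does not become ACTIVE within the configured attempts, the waiter raises an exception rather than returning.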
delete_table() permanently destroys the table and all its items. list_tables() returns all tables in the region — note it paginates via LastEvaluatedTableName, not the standard NextToken, so use the paginator.
# Delete a table
dynamodb.delete_table(TableName='pipeline_watermark')
# List all tables — always use paginator
paginator = dynamodb.get_paginator('list_tables')
for page in paginator.paginate():
    for table_name in page['TableNames']:
        print(table_name)
put_item() writes a complete item. If an item with the same key already exists, it is completely replaced. All attribute values must be typed using DynamoDB's type notation: S (string), N (number — always a string!), BOOL, NULL, L (list), M (map).
put_item() is like placing a whole new document into the filing cabinet. If a document with that label already exists, it gets shredded and replaced — not merged. If you want to update only specific fields, use update_item() instead.
from datetime import datetime, timezone
# Write a job audit record — every pipeline run writes one of these
dynamodb.put_item(
TableName='pipeline_audit',
Item={
'pipeline_id': {'S': 'sales-etl-daily'},
'run_id': {'S': 'run-2024-01-15-001'},
'status': {'S': 'SUCCEEDED'},
'start_time': {'S': '2024-01-15T06:00:00Z'},
'end_time': {'S': '2024-01-15T06:45:23Z'},
'rows_read': {'N': '1482930'}, # N is ALWAYS a string
'rows_written': {'N': '1482930'},
'rows_rejected':{'N': '0'},
'dq_score': {'N': '99.8'},
'error_message':{'NULL': True} # NULL type for no error
}
)
# Prevent overwrite — only write if this run_id does NOT already exist
try:
    dynamodb.put_item(
        TableName='pipeline_audit',
        Item={'pipeline_id': {'S': 'sales-etl-daily'}, 'run_id': {'S': 'run-001'}},
        ConditionExpression='attribute_not_exists(run_id)'
    )
except dynamodb.exceptions.ConditionalCheckFailedException:
    print("Run ID already exists, skipping (idempotent write)")
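Writing the type notation by hand gets tedious, so pipelines often convert plain Python values into it automatically. The sketch below shows the idea for the types listed above (S, N, BOOL, NULL, L, M); to_ddb and to_ddb_item are hypothetical helper names written for illustration. boto3 also ships its own serializer in boto3.dynamodb.types, which is stricter (for example, it expects Decimal rather than float).

```python
from decimal import Decimal
from typing import Any, Dict


def to_ddb(value: Any) -> Dict[str, Any]:
    """Convert one plain Python value into DynamoDB's typed notation."""
    if value is None:
        return {'NULL': True}
    if isinstance(value, bool):  # must come before the int check: bool is a subclass of int
        return {'BOOL': value}
    if isinstance(value, (int, float, Decimal)):
        return {'N': str(value)}  # numbers are always sent as strings
    if isinstance(value, str):
        return {'S': value}
    if isinstance(value, list):
        return {'L': [to_ddb(v) for v in value]}
    if isinstance(value, dict):
        return {'M': {k: to_ddb(v) for k, v in value.items()}}
    raise TypeError(f"Unsupported type for this sketch: {type(value).__name__}")


def to_ddb_item(record: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Convert a flat record dict into an Item argument for put_item()."""
    return {key: to_ddb(value) for key, value in record.items()}


item = to_ddb_item({'pipeline_id': 'sales-etl-daily', 'rows_read': 1482930, 'error_message': None})
print(item)
```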
The N type requires the value to be a string, e.g. {'N': '1482930'} not {'N': 1482930}. This is a very common bug. When reading back, the number is also returned as a string and you must cast it: int(item['rows_read']['N']).
get_item() fetches a single item by its exact partition key (and sort key if the table has one). This is a point read: DynamoDB routes directly to the partition that holds this key and typically returns it in single-digit milliseconds. There is no table scan involved.
# Read a specific pipeline run record
response = dynamodb.get_item(
TableName='pipeline_audit',
Key={
'pipeline_id': {'S': 'sales-etl-daily'},
'run_id': {'S': 'run-2024-01-15-001'}
}
)
# 'Item' key is absent if the item doesn't exist — always check!
if 'Item' in response:
    item = response['Item']
    print(f"Status : {item['status']['S']}")
    print(f"Rows : {int(item['rows_written']['N'])}")
    print(f"DQ : {float(item['dq_score']['N'])}")
else:
    print("Item not found")
# Use ProjectionExpression to fetch only specific attributes (saves RCUs)
response = dynamodb.get_item(
TableName='pipeline_audit',
Key={
'pipeline_id': {'S': 'sales-etl-daily'},
'run_id': {'S': 'run-2024-01-15-001'}
},
ProjectionExpression='#s, rows_written', # only these fields
ExpressionAttributeNames={'#s': 'status'} # 'status' is a reserved word
)
DynamoDB has many reserved words (status, name, size) that conflict with expression syntax. Always use ExpressionAttributeNames with a # placeholder when your attribute name matches a reserved word.
update_item() modifies specific attributes of an existing item without replacing the whole item. This is the right choice when your pipeline wants to update just the status and end_time without losing other fields already written. It uses UpdateExpression with four clauses: SET (add/update), REMOVE (delete attribute), ADD (increment numbers, add to sets), DELETE (remove from sets).
# Pattern: pipeline starts → write RUNNING status
# pipeline ends → update to SUCCEEDED/FAILED + end_time + counts
# Step 1: mark job as RUNNING at startup
dynamodb.update_item(
TableName='pipeline_audit',
Key={
'pipeline_id': {'S': 'sales-etl-daily'},
'run_id': {'S': 'run-2024-01-15-002'}
},
UpdateExpression='SET #s = :s, start_time = :t',
ExpressionAttributeNames={'#s': 'status'},
ExpressionAttributeValues={
':s': {'S': 'RUNNING'},
':t': {'S': datetime.now(timezone.utc).isoformat()}
}
)
# Step 2: mark job as SUCCEEDED at completion
dynamodb.update_item(
TableName='pipeline_audit',
Key={
'pipeline_id': {'S': 'sales-etl-daily'},
'run_id': {'S': 'run-2024-01-15-002'}
},
UpdateExpression='SET #s = :s, end_time = :e, rows_written = :r, dq_score = :d',
ExpressionAttributeNames={'#s': 'status'},
ExpressionAttributeValues={
':s': {'S': 'SUCCEEDED'},
':e': {'S': datetime.now(timezone.utc).isoformat()},
':r': {'N': '2341000'},
':d': {'N': '99.9'}
}
)
# ADD clause — atomically increment a counter (no race condition)
dynamodb.update_item(
TableName='pipeline_audit',
Key={'pipeline_id': {'S': 'sales-etl-daily'}, 'run_id': {'S': 'run-002'}},
UpdateExpression='ADD retry_count :one',
ExpressionAttributeValues={':one': {'N': '1'}}
)
# REMOVE clause — delete an attribute from the item
dynamodb.update_item(
TableName='pipeline_audit',
Key={'pipeline_id': {'S': 'sales-etl-daily'}, 'run_id': {'S': 'run-002'}},
UpdateExpression='REMOVE temp_debug_payload' # delete a field when done debugging
)
# ConditionExpression — optimistic locking: only update if still RUNNING
try:
    dynamodb.update_item(
        TableName='pipeline_audit',
        Key={'pipeline_id': {'S': 'sales-etl-daily'}, 'run_id': {'S': 'run-002'}},
        UpdateExpression='SET #s = :failed',
        ConditionExpression='#s = :running', # only if currently RUNNING
        ExpressionAttributeNames={'#s': 'status'},
        ExpressionAttributeValues={':failed': {'S': 'FAILED'}, ':running': {'S': 'RUNNING'}}
    )
except dynamodb.exceptions.ConditionalCheckFailedException:
    print("Item was already updated by another process")
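The SET examples above all follow the same shape: a placeholder name per attribute, a placeholder value per attribute, and a comma-joined expression. That shape can be generated from a dict, which also sidesteps reserved words because every attribute name goes through a # placeholder. The helper below is a sketch; build_set_update and its placeholder naming scheme (#a0, :v0, ...) are assumptions made for illustration.

```python
from typing import Any, Dict, Tuple


def build_set_update(
    updates: Dict[str, Dict[str, Any]]
) -> Tuple[str, Dict[str, str], Dict[str, Dict[str, Any]]]:
    """Turn {attribute: typed_value} into (UpdateExpression, names, values)."""
    if not updates:
        raise ValueError("updates must contain at least one attribute")
    names: Dict[str, str] = {}
    values: Dict[str, Dict[str, Any]] = {}
    clauses = []
    for i, (attr, typed_value) in enumerate(updates.items()):
        names[f'#a{i}'] = attr          # placeholder avoids reserved-word clashes
        values[f':v{i}'] = typed_value  # value stays in DynamoDB typed notation
        clauses.append(f'#a{i} = :v{i}')
    return 'SET ' + ', '.join(clauses), names, values


expr, names, values = build_set_update({'status': {'S': 'SUCCEEDED'}, 'rows_written': {'N': '2341000'}})
print(expr)  # SET #a0 = :v0, #a1 = :v1
```

The three return values map directly onto the UpdateExpression, ExpressionAttributeNames, and ExpressionAttributeValues arguments of update_item().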
delete_item() deletes a single item by its key. Optionally use ConditionExpression to delete only if a condition is met, for example only delete a watermark record if the pipeline is marked as inactive.
# Simple delete
dynamodb.delete_item(
TableName='pipeline_watermark',
Key={'pipeline_id': {'S': 'deprecated-pipeline'}}
)
# Conditional delete — only if the pipeline is marked inactive
try:
    dynamodb.delete_item(
        TableName='pipeline_watermark',
        Key={'pipeline_id': {'S': 'sales-etl-daily'}},
        ConditionExpression='is_active = :false',
        ExpressionAttributeValues={':false': {'BOOL': False}}
    )
except dynamodb.exceptions.ConditionalCheckFailedException:
    print("Pipeline is still active, not deleting watermark")
batch_write_item() lets you write or delete up to 25 items per call across one or more tables in a single round trip. This is the standard pattern for writing pipeline audit records in bulk — e.g. writing one record per Spark partition processed. Important: batch writes do not support ConditionExpression, and any unprocessed items must be retried manually.
batch_write_item() is like mailing a bundle of 25 letters at once instead of one at a time. The post office (DynamoDB) delivers all 25 in one trip but may say "I couldn't deliver 3 of these today" (UnprocessedItems). You're responsible for resending those 3.
import time
def batch_write_audit_records(records):
    """Write a list of audit record dicts to DynamoDB in batches of 25."""
    chunk_size = 25
    for i in range(0, len(records), chunk_size):
        chunk = records[i:i + chunk_size]
        request_items = {
            'pipeline_audit': [
                {'PutRequest': {'Item': record}}
                for record in chunk
            ]
        }
        # Retry unprocessed items with exponential backoff
        delay = 0.5
        for attempt in range(5):
            response = dynamodb.batch_write_item(RequestItems=request_items)
            unprocessed = response.get('UnprocessedItems', {})
            if not unprocessed:
                break
            print(f"Retrying {len(unprocessed['pipeline_audit'])} unprocessed items...")
            request_items = unprocessed
            time.sleep(delay)
            delay *= 2
        else:
            # All 5 attempts left items unprocessed: fail loudly instead of dropping writes
            raise RuntimeError("batch_write_item: items still unprocessed after 5 attempts")
# Example: write audit records for 3 Spark partitions
audit_records = [
    {
        'pipeline_id': {'S': 'sales-etl-daily'},
        'run_id': {'S': f'run-2024-01-15-part-{p}'},
        'status': {'S': 'SUCCEEDED'},
        'rows_written': {'N': str(p * 50000)}
    }
    for p in range(3)
]
batch_write_audit_records(audit_records)
batch_write_item() can return UnprocessedItems due to throttling or hot partitions, even if your table is in on-demand mode. Always check this field and retry with backoff. Ignoring it silently drops writes.
batch_get_item() fetches up to 100 items by their keys in a single call. It is useful for looking up pipeline configs, watermarks, or status records for a set of pipeline IDs at once. Like batch_write_item(), it can return UnprocessedKeys that must be retried.
# Fetch watermark records for multiple pipelines at once
pipeline_ids = ['sales-etl-daily', 'inventory-etl', 'returns-etl']
response = dynamodb.batch_get_item(
RequestItems={
'pipeline_watermark': {
'Keys': [
{'pipeline_id': {'S': pid}}
for pid in pipeline_ids
],
'ProjectionExpression': 'pipeline_id, last_watermark_value'
}
}
)
for item in response['Responses']['pipeline_watermark']:
    pid = item['pipeline_id']['S']
    wm = item['last_watermark_value']['S']
    print(f"{pid} → last watermark: {wm}")
# Always handle UnprocessedKeys
if response.get('UnprocessedKeys'):
    print("Warning: some keys were not processed, retry!")
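The example above only warns about UnprocessedKeys. A sketch of an actual retry loop follows, chunking keys into groups of 100 and feeding UnprocessedKeys back into the next request with exponential backoff. The function name batch_get_all, its parameters, and the choice to raise after max_attempts are assumptions for illustration.

```python
import time
from typing import Any, Dict, List


def batch_get_all(dynamodb_client: Any, table_name: str, keys: List[Dict[str, Any]],
                  max_attempts: int = 5, base_delay: float = 0.5) -> List[Dict[str, Any]]:
    """Fetch every key, 100 per request, retrying UnprocessedKeys with backoff."""
    items: List[Dict[str, Any]] = []
    for start in range(0, len(keys), 100):
        request = {table_name: {'Keys': keys[start:start + 100]}}
        for attempt in range(max_attempts):
            response = dynamodb_client.batch_get_item(RequestItems=request)
            items.extend(response.get('Responses', {}).get(table_name, []))
            # UnprocessedKeys has the same shape as RequestItems, so it can be resent as-is
            request = response.get('UnprocessedKeys') or {}
            if not request:
                break
            time.sleep(base_delay * (2 ** attempt))
        else:
            raise RuntimeError(f"Keys still unprocessed after {max_attempts} attempts")
    return items
```

Items that simply do not exist are not reported as unprocessed; they are just absent from the result, so the returned list can be shorter than the list of keys.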
query() fetches all items that share the same partition key, with optional filtering on the sort key. This is efficient — DynamoDB goes directly to that partition and returns results. Use this to get all runs for a specific pipeline, filtered by date range on the run_id sort key.
query() is like walking to a specific drawer in the filing cabinet (partition key = sales-etl-daily) and reading all the folders inside it, optionally filtered by date. You go directly to the right drawer — no rummaging through the whole cabinet.
# Low-level client: expressions are plain strings (the Key/Attr helpers belong to the resource API)
# Get all runs for sales-etl-daily in January 2024
paginator = dynamodb.get_paginator('query')
pages = paginator.paginate(
TableName='pipeline_audit',
KeyConditionExpression='pipeline_id = :pid AND begins_with(run_id, :prefix)',
ExpressionAttributeValues={
':pid': {'S': 'sales-etl-daily'},
':prefix': {'S': 'run-2024-01'} # sort key prefix filter
},
ScanIndexForward=False # descending order = newest run first
)
all_runs = []
for page in pages:
    all_runs.extend(page['Items'])
print(f"Found {len(all_runs)} runs in January 2024")
# FilterExpression — post-query filter (applied AFTER fetching from partition)
# Use this for non-key attributes like status, but note it does NOT reduce RCU cost
pages = paginator.paginate(
TableName='pipeline_audit',
KeyConditionExpression='pipeline_id = :pid',
FilterExpression='#s = :failed', # filter to only FAILED runs
ExpressionAttributeNames={'#s': 'status'},
ExpressionAttributeValues={
':pid': {'S': 'sales-etl-daily'},
':failed': {'S': 'FAILED'}
}
)
failed_runs = [item for page in pages for item in page['Items']]
print(f"Failed runs: {len(failed_runs)}")
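A frequent variation on the queries above is "give me the most recent run of this pipeline", which combines ScanIndexForward=False with Limit=1 so that only one item is read. The sketch below assumes run_id values sort chronologically as strings (as the timestamp-style IDs in this section do); latest_run is a hypothetical helper name.

```python
from typing import Any, Dict, Optional


def latest_run(dynamodb_client: Any, pipeline_id: str,
               table_name: str = 'pipeline_audit') -> Optional[Dict[str, Any]]:
    """Return the item with the highest run_id for this pipeline, or None if there is none."""
    response = dynamodb_client.query(
        TableName=table_name,
        KeyConditionExpression='pipeline_id = :pid',
        ExpressionAttributeValues={':pid': {'S': pipeline_id}},
        ScanIndexForward=False,  # descending sort-key order
        Limit=1,                 # stop after the first (newest) item
    )
    items = response.get('Items', [])
    return items[0] if items else None
```

Note that Limit is applied before any FilterExpression, so this shortcut works for "latest run overall" but not reliably for "latest FAILED run".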
KeyConditionExpression operates on keys and is evaluated by DynamoDB before fetching, so it reduces how much data is read. FilterExpression is applied after fetching, so it doesn't reduce read cost (RCUs), only the result set size returned to your code.
scan() reads every item in the table. It is expensive (every item costs a read), slow for large tables, and should be avoided in production hot paths. Acceptable use cases: one-time admin scripts, small config tables (<100 items), or tables where you genuinely need all items.
scan() on a table with 1 million items reads all 1 million items and charges you for every single one. On a pipeline audit table that grows daily, this gets expensive fast. Always design your access patterns around query() using partition + sort key.
# Scan is OK for small config/control tables; always use paginator
paginator = dynamodb.get_paginator('scan')
pages = paginator.paginate(
TableName='pipeline_config', # small table: ~20 pipeline configs
FilterExpression='is_active = :true',
ExpressionAttributeValues={':true': {'BOOL': True}}
)
active_pipelines = [item for page in pages for item in page['Items']]
print(f"Active pipelines: {[p['pipeline_id']['S'] for p in active_pipelines]}")
# Parallel scan: for large tables that absolutely must be scanned
# Split into N segments and scan each in a separate thread
import concurrent.futures
TOTAL_SEGMENTS = 4
def scan_segment(segment):
    """Scan one segment of the table and return all of its items."""
    items = []
    paginator = dynamodb.get_paginator('scan')
    for page in paginator.paginate(
        TableName='pipeline_audit',
        Segment=segment,
        TotalSegments=TOTAL_SEGMENTS
    ):
        items.extend(page['Items'])
    return items
with concurrent.futures.ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as ex:
    results = list(ex.map(scan_segment, range(TOTAL_SEGMENTS)))
# Flatten the per-segment lists into one list of items
all_items = [item for segment in results for item in segment]
The most important DynamoDB use case for Data Engineers is storing pipeline run state: one record per job run with start time, end time, status, row counts, DQ score, and error messages. This is your audit trail, your SLA monitoring source, and your retry/recovery control plane — all in one place.
import boto3, uuid
from datetime import datetime, timezone
class PipelineAudit:
"""Write pipeline run records to DynamoDB. Used at start and end of every job."""
TABLE = 'pipeline_audit'
def __init__(self):
self.ddb = boto3.client('dynamodb', region_name='us-east-1')
def start_run(self, pipeline_id: str) -> str:
"""Call at job start. Returns the run_id to pass to end_run()."""
# Dashed date (run-2024-01-15-...) so begins_with(run_id, 'run-2024-01') prefix queries match
run_id = f"run-{datetime.now(timezone.utc).strftime('%Y-%m-%d-%H%M%S')}-{str(uuid.uuid4())[:8]}"
self.ddb.put_item(
TableName=self.TABLE,
Item={
'pipeline_id': {'S': pipeline_id},
'run_id': {'S': run_id},
'status': {'S': 'RUNNING'},
'start_time': {'S': datetime.now(timezone.utc).isoformat()},
'end_time': {'NULL': True},
'rows_read': {'N': '0'},
'rows_written': {'N': '0'},
'rows_rejected':{'N': '0'},
'dq_score': {'N': '0'},
'error_message':{'NULL': True}
}
)
return run_id
def end_run(self, pipeline_id: str, run_id: str, status: str,
rows_read=0, rows_written=0, rows_rejected=0,
dq_score=100.0, error_message=None):
"""Call at job end with final counts and status."""
item_updates = {
'status': {'S': status},
'end_time': {'S': datetime.now(timezone.utc).isoformat()},
'rows_read': {'N': str(rows_read)},
'rows_written': {'N': str(rows_written)},
'rows_rejected': {'N': str(rows_rejected)},
'dq_score': {'N': str(dq_score)},
'error_message': {'S': error_message} if error_message else {'NULL': True}
}
update_expr = 'SET ' + ', '.join(
f"#attr_{i} = :val_{i}" for i in range(len(item_updates))
)
attr_names = {f"#attr_{i}": k for i, k in enumerate(item_updates)}
attr_values = {f":val_{i}": v for i, v in enumerate(item_updates.values())}
self.ddb.update_item(
TableName=self.TABLE,
Key={'pipeline_id': {'S': pipeline_id}, 'run_id': {'S': run_id}},
UpdateExpression=update_expr,
ExpressionAttributeNames=attr_names,
ExpressionAttributeValues=attr_values
)
def get_last_run(self, pipeline_id: str) -> dict:
"""Get the most recent run record for a pipeline."""
response = self.ddb.query(
TableName=self.TABLE,
KeyConditionExpression='pipeline_id = :pid',
ExpressionAttributeValues={':pid': {'S': pipeline_id}},
ScanIndexForward=False, # newest first
Limit=1
)
items = response.get('Items', [])
return items[0] if items else None
# ── Usage in a Glue / EMR PySpark job ──────────────────────────
audit = PipelineAudit()
run_id = audit.start_run("sales-etl-daily")
try:
# ... your Spark transform logic here ...
rows_read, rows_written = 1482930, 1482930
audit.end_run("sales-etl-daily", run_id, "SUCCEEDED",
rows_read=rows_read, rows_written=rows_written, dq_score=99.8)
except Exception as e:
audit.end_run("sales-etl-daily", run_id, "FAILED", error_message=str(e))
raise
For incremental pipelines, DynamoDB is the perfect place to store the high watermark — the last processed timestamp or ID. Each pipeline run reads the watermark, processes everything since that point, then updates the watermark to the new maximum. This is the classic stateful incremental ETL pattern.
def get_watermark(pipeline_id: str, default: str = "1970-01-01T00:00:00Z") -> str:
"""Read last watermark. Return default if first run."""
response = dynamodb.get_item(
TableName='pipeline_watermark',
Key={'pipeline_id': {'S': pipeline_id}}
)
if 'Item' in response:
return response['Item']['last_watermark_value']['S']
return default # first ever run
def update_watermark(pipeline_id: str, new_watermark: str):
"""Save new watermark after a successful run."""
dynamodb.put_item(
TableName='pipeline_watermark',
Item={
'pipeline_id': {'S': pipeline_id},
'last_watermark_value': {'S': new_watermark},
'updated_timestamp': {'S': datetime.now(timezone.utc).isoformat()}
}
)
# In your Spark job:
watermark = get_watermark("sales-etl-daily")
print(f"Processing records AFTER: {watermark}")
# df = spark.read.jdbc(...).filter(f"updated_at > '{watermark}'")
# new_max_ts = df.agg({"updated_at": "max"}).collect()[0][0]
new_max_ts = "2024-01-15T06:00:00Z" # from your Spark result
# Only update watermark after job succeeds
update_watermark("sales-etl-daily", new_max_ts)
print(f"Watermark updated to: {new_max_ts}")
get_watermark() → gets 2024-01-14T23:59:59Z → reads only the new rows from the source DB → processes 50,000 new records → calls update_watermark() with 2024-01-15T23:59:59Z. Next night, it only processes the delta again. Without this pattern, it would full-scan millions of rows every night.
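One gap in the pattern above is that a late or re-run job could overwrite a newer watermark with an older one. A minimal sketch of a guard, assuming the same `pipeline_watermark` table and that all watermarks use one fixed-width UTC ISO-8601 format (so string order matches time order). The helper name `build_watermark_put` is hypothetical; it only builds the `put_item` arguments, so it runs without AWS access.

```python
from datetime import datetime, timezone

def build_watermark_put(pipeline_id: str, new_watermark: str) -> dict:
    """Build put_item kwargs that only succeed if the watermark moves forward."""
    return {
        "TableName": "pipeline_watermark",
        "Item": {
            "pipeline_id": {"S": pipeline_id},
            "last_watermark_value": {"S": new_watermark},
            "updated_timestamp": {"S": datetime.now(timezone.utc).isoformat()},
        },
        # Accept the write on the first run, or when the stored value is older.
        "ConditionExpression": (
            "attribute_not_exists(pipeline_id) OR last_watermark_value < :new"
        ),
        "ExpressionAttributeValues": {":new": {"S": new_watermark}},
    }

kwargs = build_watermark_put("sales-etl-daily", "2024-01-15T06:00:00Z")
print(kwargs["ConditionExpression"])
# With a real client: dynamodb.put_item(**kwargs)
# A rejected write raises ConditionalCheckFailedException, which the job
# can treat as "a newer run already advanced the watermark".
```

The condition is evaluated by DynamoDB as part of the write, so two overlapping runs cannot both win.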
CloudWatch APIs
Amazon CloudWatch is the observability backbone of every AWS data pipeline. As a Data Engineer you use it to publish custom pipeline metrics, create alarms that page your team on failures, query logs with SQL-like syntax, and build dashboards that show pipeline health at a glance. Mastering these APIs turns your pipelines from black boxes into fully observable systems.
put_metric_data() lets you push custom metrics from your pipeline code into CloudWatch. Built-in AWS services (Glue, EMR, Lambda) publish their own metrics automatically, but things like rows processed, DQ failure count, or pipeline duration are business metrics that only your code knows. You publish these to a custom Namespace you define.
put_metric_data() is that manual log entry.
import boto3
from datetime import datetime, timezone
cw = boto3.client('cloudwatch', region_name='us-east-1')
# ── Publish a single metric: rows processed by this pipeline run ──
cw.put_metric_data(
Namespace='DataPlatform/Pipelines', # your custom namespace — use / to organise
MetricData=[
{
'MetricName': 'RowsProcessed',
'Value' : 5_432_100,
'Unit' : 'Count',
'Timestamp' : datetime.now(timezone.utc),
'Dimensions': [
{'Name': 'PipelineName', 'Value': 'orders-bronze-to-silver'},
{'Name': 'Environment', 'Value': 'prod'}
]
}
]
)
print("Metric published")
import boto3, time
from datetime import datetime, timezone
cw = boto3.client('cloudwatch', region_name='us-east-1')
pipeline_name = 'orders-bronze-to-silver'
start_time = time.time()
# ... your ETL logic runs here ...
rows_read = 5_432_100
rows_written = 5_431_990
rows_rejected = 110
dq_score = 99.8
duration_secs = time.time() - start_time
# ── Publish all pipeline KPIs in one API call ──
ts = datetime.now(timezone.utc)
dims = [{'Name': 'PipelineName', 'Value': pipeline_name},
{'Name': 'Environment', 'Value': 'prod'}]
cw.put_metric_data(
Namespace='DataPlatform/Pipelines',
MetricData=[
{'MetricName': 'RowsRead', 'Value': rows_read, 'Unit': 'Count', 'Timestamp': ts, 'Dimensions': dims},
{'MetricName': 'RowsWritten', 'Value': rows_written, 'Unit': 'Count', 'Timestamp': ts, 'Dimensions': dims},
{'MetricName': 'RowsRejected', 'Value': rows_rejected, 'Unit': 'Count', 'Timestamp': ts, 'Dimensions': dims},
{'MetricName': 'DQScore', 'Value': dq_score, 'Unit': 'Percent', 'Timestamp': ts, 'Dimensions': dims},
{'MetricName': 'DurationSeconds', 'Value': duration_secs, 'Unit': 'Seconds', 'Timestamp': ts, 'Dimensions': dims},
]
)
print("All pipeline KPIs published to CloudWatch")
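Unit. CloudWatch supports units such as Count, Seconds, Milliseconds, Bytes, Megabytes, Percent, and None, among others (for example Microseconds, Kilobytes, and rate units like Count/Second). Using the right unit lets you build meaningful alarms (e.g. alarm if DurationSeconds > 3600).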
put_metric_data() accepts up to 1,000 MetricData entries per API call (the limit was 20 in older documentation and SDK versions), and the request body is capped at 1 MB. If you have more, split them into batches and call the API multiple times.
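The batching advice can be sketched as a small helper. This is an illustration rather than a library function: `publish_in_batches` and `RecordingClient` are hypothetical names, and `batch_size` is passed explicitly so you can set it to whatever per-call cap applies to you. The stand-in client only records calls, so the example runs without AWS credentials.

```python
def publish_in_batches(cw_client, namespace: str, metric_data: list, batch_size: int) -> int:
    """Send metric_data in slices of at most batch_size; return the number of API calls."""
    calls = 0
    for start in range(0, len(metric_data), batch_size):
        cw_client.put_metric_data(
            Namespace=namespace,
            MetricData=metric_data[start:start + batch_size],
        )
        calls += 1
    return calls


class RecordingClient:
    """Stand-in for boto3.client('cloudwatch') that records batch sizes."""
    def __init__(self):
        self.batch_sizes = []

    def put_metric_data(self, Namespace, MetricData):
        self.batch_sizes.append(len(MetricData))


fake = RecordingClient()
metrics = [{"MetricName": f"Table{i}RowCount", "Value": i, "Unit": "Count"} for i in range(45)]
calls = publish_in_batches(fake, "DataPlatform/Pipelines", metrics, batch_size=20)
print(calls, fake.batch_sizes)
```

In a real job you would pass `boto3.client('cloudwatch')` instead of the recording client.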
get_metric_statistics() retrieves aggregated data points for a single metric over a time range. Useful for checking "how many rows did this pipeline process in the last hour?" or "what was the average duration over the last 7 days?". You get back a list of timestamped data points.
import boto3
from datetime import datetime, timedelta, timezone
cw = boto3.client('cloudwatch', region_name='us-east-1')
# ── Get RowsProcessed for the last 24 hours, in 1-hour buckets ──
now = datetime.now(timezone.utc)
one_day_ago = now - timedelta(hours=24)
response = cw.get_metric_statistics(
Namespace ='DataPlatform/Pipelines',
MetricName ='RowsProcessed',
Dimensions =[
{'Name': 'PipelineName', 'Value': 'orders-bronze-to-silver'},
{'Name': 'Environment', 'Value': 'prod'}
],
StartTime = one_day_ago,
EndTime = now,
Period = 3600, # 1 hour buckets in seconds
Statistics = ['Sum', 'Maximum', 'Average'],
Unit = 'Count'
)
# Sort by time and print
datapoints = sorted(response['Datapoints'], key=lambda x: x['Timestamp'])
for dp in datapoints:
print(f"{dp['Timestamp']:%Y-%m-%d %H:%M} sum={dp['Sum']:,.0f} max={dp['Maximum']:,.0f}")
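Period must be a multiple of 60 seconds for standard-resolution metrics (high-resolution metrics also allow 1, 5, 10, or 30). Common values: 60 (1 min), 300 (5 min), 3600 (1 hour), 86400 (1 day). The total time range divided by Period gives the maximum number of data points returned; periods with no data are simply absent, and a single call returns at most 1,440 data points.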
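The arithmetic behind Period can be checked before calling the API. A small sketch, assuming standard-resolution metrics and the documented cap of 1,440 data points per get_metric_statistics() call. The helper name `max_datapoints` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_POINTS_PER_CALL = 1440  # documented cap for get_metric_statistics

def max_datapoints(start: datetime, end: datetime, period_secs: int) -> int:
    """Upper bound on datapoints for a time range; empty periods return nothing."""
    if period_secs % 60 != 0:
        raise ValueError("period_secs must be a multiple of 60 for standard-resolution metrics")
    span = (end - start).total_seconds()
    return int(-(-span // period_secs))  # ceiling division

now = datetime(2024, 1, 15, 6, 0, tzinfo=timezone.utc)
day_hourly = max_datapoints(now - timedelta(hours=24), now, 3600)
week_5min = max_datapoints(now - timedelta(days=7), now, 300)

print(day_hourly, day_hourly <= MAX_POINTS_PER_CALL)  # fits in one call
print(week_5min, week_5min <= MAX_POINTS_PER_CALL)    # too many: widen Period or narrow the range
```

Seven days at 5-minute granularity exceeds the per-call cap, so you would either use a larger Period or split the time range.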
get_metric_data() is the more powerful successor to get_metric_statistics(). It lets you query multiple metrics simultaneously in one API call, supports metric math expressions (e.g. calculate rejection rate = rejected / read * 100), and is paginated for large result sets.
import boto3
from datetime import datetime, timedelta, timezone
cw = boto3.client('cloudwatch', region_name='us-east-1')
now = datetime.now(timezone.utc)
dims = [
{'Name': 'PipelineName', 'Value': 'orders-bronze-to-silver'},
{'Name': 'Environment', 'Value': 'prod'}
]
response = cw.get_metric_data(
MetricDataQueries=[
# ── Raw metric: rows read ──
{
'Id' : 'm1',
'Label' : 'Rows Read',
'MetricStat': {
'Metric': {'Namespace': 'DataPlatform/Pipelines', 'MetricName': 'RowsRead', 'Dimensions': dims},
'Period': 3600,
'Stat' : 'Sum'
}
},
# ── Raw metric: rows rejected ──
{
'Id' : 'm2',
'Label' : 'Rows Rejected',
'MetricStat': {
'Metric': {'Namespace': 'DataPlatform/Pipelines', 'MetricName': 'RowsRejected', 'Dimensions': dims},
'Period': 3600,
'Stat' : 'Sum'
}
},
# ── Metric math: rejection rate % = (rejected / read) * 100 ──
{
'Id' : 'rejection_rate',
'Label' : 'Rejection Rate %',
'Expression' : '(m2 / m1) * 100', # metric math — references m1 and m2 above
'ReturnData' : True
},
],
StartTime = now - timedelta(hours=24),
EndTime = now,
)
for result in response['MetricDataResults']:
print(f"\n{result['Label']} ({result['Id']})")
for ts, val in zip(result['Timestamps'], result['Values']):
print(f" {ts:%H:%M} → {val:,.2f}")
Metric math is written as an Expression referencing other query IDs. Set 'ReturnData': False on raw metrics you only use for math (they won't appear in results).
CloudWatch Alarms watch a metric and trigger an action (SNS notification → email / Slack / PagerDuty) when the metric crosses a threshold. As a DE you create alarms for things like: Glue job duration exceeded, DLQ depth > 0, EMR step failed, or custom pipeline DQ score dropped below 95%.
import boto3
cw = boto3.client('cloudwatch', region_name='us-east-1')
SNS_ALERT_ARN = 'arn:aws:sns:us-east-1:123456789012:data-platform-alerts'
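# ── Alarm 1: Alert if more than 1,000 rows are rejected in an hour ──
# (RowsRejected is a count; alarming on a true percentage would need a
# metric math expression over RowsRejected and RowsRead)
cw.put_metric_alarm(
AlarmName = 'orders-pipeline-high-rejection-rate',
AlarmDescription = 'More than 1,000 rows rejected in the last hour: investigate data quality',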
Namespace = 'DataPlatform/Pipelines',
MetricName = 'RowsRejected',
Dimensions = [
{'Name': 'PipelineName', 'Value': 'orders-bronze-to-silver'},
{'Name': 'Environment', 'Value': 'prod'}
],
Statistic = 'Sum',
Period = 3600, # evaluate over last 1 hour
EvaluationPeriods = 1, # 1 period must breach → alarm
Threshold = 1000, # alert if > 1000 rows rejected in the hour
ComparisonOperator = 'GreaterThanThreshold',
TreatMissingData = 'notBreaching', # no data = pipeline didn't run = OK
AlarmActions = [SNS_ALERT_ARN],
OKActions = [SNS_ALERT_ARN], # notify when alarm recovers too
)
# ── Alarm 2: Alert if pipeline takes longer than 2 hours ──
cw.put_metric_alarm(
AlarmName = 'orders-pipeline-duration-breach',
AlarmDescription = 'Pipeline exceeded 2-hour SLA',
Namespace = 'DataPlatform/Pipelines',
MetricName = 'DurationSeconds',
Dimensions = [
{'Name': 'PipelineName', 'Value': 'orders-bronze-to-silver'},
{'Name': 'Environment', 'Value': 'prod'}
],
Statistic = 'Maximum',
Period = 3600,
EvaluationPeriods = 1,
Threshold = 7200, # 2 hours in seconds
ComparisonOperator = 'GreaterThanThreshold',
TreatMissingData = 'notBreaching',
AlarmActions = [SNS_ALERT_ARN],
)
# ── Alarm 3: DLQ depth > 0 — something failed and landed in DLQ ──
cw.put_metric_alarm(
AlarmName = 'pipeline-dlq-has-messages',
AlarmDescription = 'Messages in DLQ — pipeline messages failed processing',
Namespace = 'AWS/SQS', # built-in AWS namespace for SQS
MetricName = 'ApproximateNumberOfMessagesVisible',
Dimensions = [{'Name': 'QueueName', 'Value': 'pipeline-events-dlq'}],
Statistic = 'Sum',
Period = 300, # check every 5 minutes
EvaluationPeriods = 1,
Threshold = 0,
ComparisonOperator = 'GreaterThanThreshold',
TreatMissingData = 'notBreaching',
AlarmActions = [SNS_ALERT_ARN],
)
print("All alarms created")
notBreaching — no data is treated as OK (most common for batch pipelines that don't run 24/7). breaching — no data triggers the alarm. ignore — keeps previous state. missing — alarm state becomes INSUFFICIENT_DATA.
A composite alarm fires based on the state of other alarms — not directly on a metric. This lets you build logic like: "alert only if BOTH the rejection rate is high AND the pipeline duration is long" (AND), or "alert if ANY of our 5 pipelines fail" (OR). This reduces alert noise.
# ── Fire only when BOTH duration AND rejection alarms are breaching ──
cw.put_composite_alarm(
AlarmName = 'orders-pipeline-critical-composite',
AlarmDescription = 'Critical: both SLA breach AND high rejection rate simultaneously',
AlarmRule = (
'ALARM("orders-pipeline-duration-breach") AND '
'ALARM("orders-pipeline-high-rejection-rate")'
),
AlarmActions = ['arn:aws:sns:us-east-1:123456789012:data-platform-critical'],
)
# ── Fire when ANY of the three child alarms breach (OR logic) ──
cw.put_composite_alarm(
AlarmName = 'any-pipeline-failure',
AlarmRule = (
'ALARM("orders-pipeline-duration-breach") OR '
'ALARM("orders-pipeline-high-rejection-rate") OR '
'ALARM("pipeline-dlq-has-messages")'
),
AlarmActions = ['arn:aws:sns:us-east-1:123456789012:data-platform-alerts'],
)
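When the list of child alarms grows, the AlarmRule string is easier to generate than to hand-write. A minimal sketch, assuming only the `ALARM("name")` syntax with AND/OR shown above. The helper name `build_alarm_rule` is hypothetical.

```python
def build_alarm_rule(alarm_names: list, operator: str = "OR") -> str:
    """Join child alarm names into a composite AlarmRule expression."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not alarm_names:
        raise ValueError("at least one alarm name is required")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)

rule = build_alarm_rule(
    [
        "orders-pipeline-duration-breach",
        "orders-pipeline-high-rejection-rate",
        "pipeline-dlq-has-messages",
    ],
    operator="OR",
)
print(rule)
# Pass the result as AlarmRule=rule to cw.put_composite_alarm(...)
```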
Use describe_alarms() to check whether an alarm is currently in OK, ALARM, or INSUFFICIENT_DATA state. Useful in pipeline code to gate execution: "don't start today's load if yesterday's DQ alarm is still firing".
import boto3
cw = boto3.client('cloudwatch', region_name='us-east-1')
# ── Check specific alarms by name ──
response = cw.describe_alarms(
AlarmNames=[
'orders-pipeline-high-rejection-rate',
'orders-pipeline-duration-breach'
],
AlarmTypes=['MetricAlarm'] # or 'CompositeAlarm'
)
for alarm in response['MetricAlarms']:
name = alarm['AlarmName']
state = alarm['StateValue'] # 'OK', 'ALARM', or 'INSUFFICIENT_DATA'
reason = alarm['StateReason']
print(f"{name:50s} state={state} reason={reason[:60]}")
# ── Gate pattern: stop pipeline if any alarm is FIRING ──
firing = [a['AlarmName'] for a in response['MetricAlarms'] if a['StateValue'] == 'ALARM']
if firing:
raise RuntimeError(f"Blocking pipeline start — alarms firing: {firing}")
print("All alarms OK — safe to proceed")
Remove alarms when decommissioning a pipeline. You can delete up to 100 alarms in one call.
cw.delete_alarms(
AlarmNames=[
'orders-pipeline-high-rejection-rate',
'orders-pipeline-duration-breach',
'pipeline-dlq-has-messages'
]
)
print("Alarms deleted")
CloudWatch Logs are organized in a two-level hierarchy. A Log Group is a container (e.g. one per pipeline or application). A Log Stream is a sequence of events within a group (e.g. one per pipeline run or per host). Think of it as: Log Group = a notebook, Log Stream = one chapter per run.
import boto3
from datetime import datetime, timezone
logs = boto3.client('logs', region_name='us-east-1')
LOG_GROUP = '/data-platform/pipelines/orders-bronze-to-silver'
run_id = 'run-2024-01-15-083000'
LOG_STREAM = f'prod/{run_id}'
# ── Create the log group (the try/except makes this safe to call even if it exists) ──
try:
logs.create_log_group(
logGroupName = LOG_GROUP,
tags = {'Pipeline': 'orders-bronze-to-silver', 'Env': 'prod'}
)
except logs.exceptions.ResourceAlreadyExistsException:
pass # already exists — that's fine
# ── Set retention policy (don't keep logs forever — costs money) ──
logs.put_retention_policy(
logGroupName = LOG_GROUP,
retentionInDays = 90 # keep 90 days, then auto-delete
)
# ── Create a new log stream for this run ──
logs.create_log_stream(
logGroupName = LOG_GROUP,
logStreamName = LOG_STREAM
)
print(f"Log stream created: {LOG_GROUP}/{LOG_STREAM}")
put_log_events() writes one or more log events to a stream. Each event needs a Unix timestamp in milliseconds and a message string. For production pipelines, write structured JSON logs so you can query them later with Log Insights.
import boto3, json, time
from datetime import datetime, timezone
logs = boto3.client('logs', region_name='us-east-1')
LOG_GROUP = '/data-platform/pipelines/orders-bronze-to-silver'
LOG_STREAM = 'prod/run-2024-01-15-083000'
def ts_ms():
"""Current time in milliseconds — required by put_log_events"""
return int(time.time() * 1000)
# ── Build structured log events ──
events = [
{
'timestamp': ts_ms(),
'message' : json.dumps({
'level' : 'INFO',
'pipeline' : 'orders-bronze-to-silver',
'run_id' : 'run-2024-01-15-083000',
'stage' : 'extract',
'rows_read' : 5_432_100,
'source_table' : 'orders_raw',
'message' : 'Extract completed successfully'
})
},
{
'timestamp': ts_ms() + 1, # events in a batch must be in chronological order (equal timestamps are allowed)
'message' : json.dumps({
'level' : 'WARN',
'pipeline' : 'orders-bronze-to-silver',
'run_id' : 'run-2024-01-15-083000',
'stage' : 'validate',
'rows_rejected' : 110,
'rejection_reason': 'null_order_id',
'message' : '110 rows rejected — null order_id'
})
},
]
# ── First call: no sequenceToken needed ──
response = logs.put_log_events(
logGroupName = LOG_GROUP,
logStreamName = LOG_STREAM,
logEvents = events
)
# ── Subsequent calls: append to the same stream (sequenceToken is no longer required) ──
more_events = [
    {'timestamp': ts_ms() + 2, 'message': json.dumps({'level':'INFO','stage':'load','rows_written':5_431_990,'message':'Load complete'})}
]
logs.put_log_events(
    logGroupName = LOG_GROUP,
    logStreamName = LOG_STREAM,
    logEvents = more_events
)
Older examples capture response['nextSequenceToken'] from each put_log_events() call and pass it as sequenceToken on the next call to the same stream. CloudWatch Logs now ignores the sequence token: calls are always accepted and no longer return InvalidSequenceTokenException, so the parameter can be omitted. Passing it is harmless if you maintain older code.
put_log_events() can carry up to 10,000 events or 1 MB total per call. In pipeline code, buffer log events during processing and flush in one batch at the end rather than calling the API for every single log line.
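The buffer-and-flush advice can be sketched as a pure function that groups events into batches respecting both limits. It assumes the documented sizing rule (UTF-8 message bytes plus 26 bytes of overhead per event). The helper name `split_into_batches` is hypothetical, and each returned batch would be sent with one put_log_events() call.

```python
import json

MAX_EVENTS_PER_CALL = 10_000
MAX_BYTES_PER_CALL = 1_048_576  # 1 MB
PER_EVENT_OVERHEAD = 26         # bytes counted per event on top of the message

def split_into_batches(events: list) -> list:
    """Group log events into batches that respect the count and size limits."""
    batches, current, current_bytes = [], [], 0
    # The API expects events in chronological order within a batch.
    for event in sorted(events, key=lambda e: e["timestamp"]):
        size = len(event["message"].encode("utf-8")) + PER_EVENT_OVERHEAD
        too_many = len(current) >= MAX_EVENTS_PER_CALL
        too_big = current_bytes + size > MAX_BYTES_PER_CALL
        if current and (too_many or too_big):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(event)
        current_bytes += size
    if current:
        batches.append(current)
    return batches

# 25,000 small structured events buffered during a run
events = [
    {"timestamp": 1_705_307_400_000 + i, "message": json.dumps({"level": "INFO", "row": i})}
    for i in range(25_000)
]
batches = split_into_batches(events)
print([len(b) for b in batches])
```

A flush at the end of the job then loops over `batches` and calls `logs.put_log_events(logGroupName=..., logStreamName=..., logEvents=batch)` once per batch.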
filter_log_events() searches across all streams in a log group using a filter pattern. Useful for programmatic debugging: "find all ERROR events from the orders pipeline in the last hour". Note: for complex queries, Log Insights (below) is faster.
import boto3
from datetime import datetime, timedelta, timezone
logs = boto3.client('logs', region_name='us-east-1')
now = datetime.now(timezone.utc)
# ── Find all ERROR log events from the last 1 hour ──
paginator = logs.get_paginator('filter_log_events')
pages = paginator.paginate(
logGroupName = '/data-platform/pipelines/orders-bronze-to-silver',
startTime = int((now - timedelta(hours=1)).timestamp() * 1000),
endTime = int(now.timestamp() * 1000),
filterPattern = '"ERROR"', # simple string match
)
for page in pages:
for event in page['events']:
ts = datetime.fromtimestamp(event['timestamp']/1000, tz=timezone.utc)
msg = event['message']
print(f"[{ts:%H:%M:%S}] {msg[:120]}")
CloudWatch Log Insights lets you run SQL-like queries across your log groups without downloading logs. It's purpose-built for querying JSON-structured logs at scale. You use it to answer questions like: "how many rows did each pipeline run process this week?", "which runs had DQ rejections?", "what was the average pipeline duration by day?".
Log Insights is asynchronous like Athena. You call start_query() to submit the query, get back a queryId, then poll get_query_results() until the status is Complete. The query language uses commands like fields, filter, stats, sort, limit.
import boto3, time
from datetime import datetime, timedelta, timezone
logs = boto3.client('logs', region_name='us-east-1')
now = datetime.now(timezone.utc)
# ── 1. Submit the query ──
query_response = logs.start_query(
logGroupName = '/data-platform/pipelines/orders-bronze-to-silver',
startTime = int((now - timedelta(days=7)).timestamp()), # epoch seconds (not ms!)
endTime = int(now.timestamp()),
queryString = """
fields @timestamp, run_id, stage, rows_read, rows_written, rows_rejected, level
| filter level = "INFO" and stage = "load"
| stats sum(rows_written) as total_rows by run_id
| sort total_rows desc
| limit 20
"""
)
query_id = query_response['queryId']
print(f"Query started: {query_id}")
# ── 2. Poll until complete ──
while True:
result = logs.get_query_results(queryId=query_id)
status = result['status'] # 'Running', 'Complete', 'Failed', 'Cancelled', 'Timeout'
print(f"Status: {status}")
if status in ('Complete', 'Failed', 'Cancelled', 'Timeout'):
break
time.sleep(2)
if status != 'Complete':
raise RuntimeError(f"Log Insights query failed: {status}")
# ── 3. Parse results ──
# results is a list of rows, each row is a list of {field, value} dicts
rows = result['results']
print(f"\nTop pipeline runs by rows written:")
for row in rows:
# Convert list of {field, value} to a dict for easy access
record = {item['field']: item['value'] for item in row}
print(f" run_id={record.get('run_id','?'):30s} rows_written={record.get('total_rows','?')}")
fields @timestamp, run_id, level: select fields (use the @ prefix for built-in fields)
filter level = "ERROR": filter rows
stats sum(rows_written) as total by run_id: aggregation
sort total desc: sort results
limit 50: cap result rows (max 10,000)
parse @message '* rows written' as rows: extract from unstructured text
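The submit-then-poll flow can be wrapped in a reusable helper with a timeout, so a stuck query cannot hang the job. This is a sketch, not part of Boto3: `wait_for_query` and `FakeLogsClient` are hypothetical names, and the fake client replays canned responses so the example runs without AWS access.

```python
import time

TERMINAL_STATES = {"Complete", "Failed", "Cancelled", "Timeout"}

def wait_for_query(logs_client, query_id: str,
                   poll_secs: float = 2.0, max_wait_secs: float = 300.0) -> list:
    """Poll get_query_results until a terminal status; return rows as dicts."""
    deadline = time.monotonic() + max_wait_secs
    while True:
        result = logs_client.get_query_results(queryId=query_id)
        status = result["status"]
        if status in TERMINAL_STATES:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"query {query_id} still {status} after {max_wait_secs}s")
        time.sleep(poll_secs)
    if status != "Complete":
        raise RuntimeError(f"Log Insights query ended with status {status}")
    # Each row is a list of {field, value} cells; flatten it into one dict per row.
    return [{cell["field"]: cell["value"] for cell in row} for row in result["results"]]


class FakeLogsClient:
    """Stand-in for boto3.client('logs'): reports Running once, then Complete."""
    def __init__(self):
        self._responses = iter([
            {"status": "Running", "results": []},
            {"status": "Complete", "results": [[
                {"field": "run_id", "value": "run-2024-01-15-083000"},
                {"field": "total_rows", "value": "5431990"},
            ]]},
        ])

    def get_query_results(self, queryId):
        return next(self._responses)


rows = wait_for_query(FakeLogsClient(), "example-query-id", poll_secs=0)
print(rows)
```

Note that Log Insights returns every value as a string, so numeric fields need an explicit `int()` or `float()` before use.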
Combine Log Insights with CloudWatch metrics publishing to build a self-monitoring pipeline: query yesterday's logs every morning, compute SLA metrics (which runs finished late?), publish them as custom metrics, and trigger alarms if SLA breach count exceeds zero.
import boto3, json, time
from datetime import datetime, timedelta, timezone
logs = boto3.client('logs', region_name='us-east-1')
cw = boto3.client('cloudwatch', region_name='us-east-1')
now = datetime.now(timezone.utc)
# ── Query: find any runs that exceeded 2h duration (SLA breach) ──
r = logs.start_query(
logGroupName = '/data-platform/pipelines/orders-bronze-to-silver',
startTime = int((now - timedelta(hours=24)).timestamp()),
endTime = int(now.timestamp()),
queryString = """
fields run_id, duration_seconds
| filter stage = "complete" and duration_seconds > 7200
| stats count() as sla_breaches
"""
)
query_id = r['queryId']
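while True:
    res = logs.get_query_results(queryId=query_id)
    # Stop on any terminal status, not only Complete, to avoid polling forever
    if res['status'] in ('Complete', 'Failed', 'Cancelled', 'Timeout'):
        break
    time.sleep(2)
if res['status'] != 'Complete':
    raise RuntimeError(f"Log Insights query failed: {res['status']}")
breach_count = 0
if res['results']:
    record = {item['field']: item['value'] for item in res['results'][0]}
    breach_count = int(record.get('sla_breaches', 0))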
# ── Publish SLA breach count as a CloudWatch metric ──
cw.put_metric_data(
Namespace='DataPlatform/SLA',
MetricData=[{
'MetricName': 'SLABreaches',
'Value' : breach_count,
'Unit' : 'Count',
'Timestamp' : now,
'Dimensions': [{'Name': 'PipelineName', 'Value': 'orders-bronze-to-silver'}]
}]
)
print(f"SLA breaches in last 24h: {breach_count} — metric published")
Data freshness = how recent the latest data in your target table is. Your SLA might say "gold layer orders table must be updated by 06:00 UTC every day". A freshness check queries the max load timestamp in your table and compares it to the expected freshness window. Publish the result as a metric and alarm on it.
import boto3, time
from datetime import datetime, timezone
cw = boto3.client('cloudwatch', region_name='us-east-1')
athena = boto3.client('athena', region_name='us-east-1')
# ── 1. Query the target table for latest load timestamp ──
r = athena.start_query_execution(
QueryString = "SELECT MAX(load_dts) AS latest_load FROM gold.orders",
QueryExecutionContext = {'Database': 'gold'},
ResultConfiguration = {'OutputLocation': 's3://my-datalake/athena-results/'}
)
qid = r['QueryExecutionId']
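while True:
    state = athena.get_query_execution(QueryExecutionId=qid)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(3)
if state != 'SUCCEEDED':
    raise RuntimeError(f"Athena freshness query ended in state {state}")
result = athena.get_query_results(QueryExecutionId=qid)
# Rows[0] is the header row; Rows[1] holds the single MAX(load_dts) value
latest_str = result['ResultSet']['Rows'][1]['Data'][0]['VarCharValue']
# Assumes load_dts is stored in UTC
latest_load = datetime.fromisoformat(latest_str).replace(tzinfo=timezone.utc)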
# ── 2. Calculate staleness in hours ──
staleness_hours = (datetime.now(timezone.utc) - latest_load).total_seconds() / 3600
print(f"Data staleness: {staleness_hours:.1f} hours")
# ── 3. Publish as CloudWatch metric ──
cw.put_metric_data(
Namespace='DataPlatform/Freshness',
MetricData=[{
'MetricName': 'StalenessHours',
'Value' : staleness_hours,
'Unit' : 'None', # CloudWatch has no hours unit; the metric name carries it
'Dimensions': [{'Name': 'Table', 'Value': 'gold.orders'}]
}]
)
# Alarm on StalenessHours > 6 → pipeline likely missed its SLA window
Publishing daily row counts as a CloudWatch metric lets you use Anomaly Detection — CloudWatch automatically learns the expected range for each time period and alerts you when today's count is statistically unusual. This catches silent data drops without you having to set hard thresholds.
import boto3
cw = boto3.client('cloudwatch', region_name='us-east-1')
# ── First: train the anomaly detection model on your metric ──
cw.put_anomaly_detector(
Namespace = 'DataPlatform/Pipelines',
MetricName = 'RowsWritten',
Dimensions = [
{'Name': 'PipelineName', 'Value': 'orders-bronze-to-silver'},
{'Name': 'Environment', 'Value': 'prod'}
],
Stat = 'Sum',
Configuration = {'ExcludedTimeRanges': []} # optionally exclude maintenance windows
)
# ── Then: create an alarm that fires when RowsWritten is outside the band ──
cw.put_metric_alarm(
AlarmName = 'orders-rows-anomaly',
AlarmDescription = 'Row count is statistically anomalous — check data source',
Metrics=[
{
'Id': 'm1',
'MetricStat': {
'Metric': {'Namespace':'DataPlatform/Pipelines','MetricName':'RowsWritten',
'Dimensions':[{'Name':'PipelineName','Value':'orders-bronze-to-silver'},
{'Name':'Environment','Value':'prod'}]}, # must match the dimensions the metric is published with
'Period':86400, 'Stat':'Sum'
}
},
{
'Id' : 'ad1',
'Expression': 'ANOMALY_DETECTION_BAND(m1, 2)', # 2 = standard deviations
'Label' : 'Expected Band'
}
],
ComparisonOperator = 'LessThanLowerOrGreaterThanUpperThreshold',
ThresholdMetricId = 'ad1',
EvaluationPeriods = 1,
TreatMissingData = 'breaching', # missing data (pipeline didn't run) = alarm
AlarmActions = ['arn:aws:sns:us-east-1:123456789012:data-platform-alerts']
)
print("Anomaly detection alarm created")
| API | What it does | Key Parameters |
|---|---|---|
| put_metric_data() | Publish custom metrics | Namespace, MetricData (list, max 1,000 per call) |
| get_metric_statistics() | Query one metric over time | Namespace, MetricName, StartTime, EndTime, Period, Statistics |
| get_metric_data() | Query multiple metrics + math | MetricDataQueries (list with Id, MetricStat/Expression) |
| put_metric_alarm() | Create/update threshold alarm | AlarmName, Threshold, ComparisonOperator, AlarmActions (SNS ARN) |
| put_composite_alarm() | Alarm from other alarm states | AlarmRule (AND/OR logic on alarm names) |
| put_anomaly_detector() | Enable ML anomaly detection | Namespace, MetricName, Stat |
| describe_alarms() | Check alarm state (OK/ALARM/INSUFFICIENT_DATA) | AlarmNames, AlarmTypes |
| delete_alarms() | Remove alarms | AlarmNames (list, max 100) |
| create_log_group() | Create log container | logGroupName, tags |
| put_retention_policy() | Set log expiry | logGroupName, retentionInDays |
| create_log_stream() | Create log stream (per run) | logGroupName, logStreamName |
| put_log_events() | Write log lines | logGroupName, logStreamName, logEvents (sequenceToken is optional and ignored) |
| filter_log_events() | Search log events | logGroupName, filterPattern, startTime, endTime |
| start_query() | Start Log Insights query | logGroupName, startTime, endTime, queryString |
| get_query_results() | Poll + fetch Insights results | queryId |
STS APIs
AWS Security Token Service (STS) issues temporary credentials that expire automatically. For data engineers, STS is the backbone of cross-account access — you assume a role in another account and get short-lived keys to read their S3, Glue, or Redshift without ever sharing long-term credentials.
get_caller_identity() returns the Account ID, User ID, and ARN of the currently authenticated identity — whether that is an IAM user, an IAM role, or an assumed role session. It requires no parameters and makes no changes. It is the AWS equivalent of whoami on Linux.
get_caller_identity() is exactly that — it tells you which identity boto3 is currently operating as, so you can verify the right role is active before running a destructive operation.
import boto3
sts = boto3.client('sts', region_name='us-east-1')
identity = sts.get_caller_identity()
print("Account :", identity['Account']) # e.g. '123456789012'
print("UserId :", identity['UserId']) # e.g. 'AROAEXAMPLEID:my-session'
print("ARN :", identity['Arn']) # e.g. 'arn:aws:sts::123456789012:assumed-role/GlueExecutionRole/my-session'
At the start of every pipeline run, call get_caller_identity() and log the result to CloudWatch. If a pipeline ever misbehaves in production you can immediately see "was it running as the right role, in the right account?" rather than guessing.
import boto3, sys
EXPECTED_ACCOUNT = '123456789012' # prod account ID
sts = boto3.client('sts')
identity = sts.get_caller_identity()
if identity['Account'] != EXPECTED_ACCOUNT:
print(f"ABORT: running in account {identity['Account']}, expected {EXPECTED_ACCOUNT}")
sys.exit(1)
print("Account verified — proceeding with pipeline")
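The account check can be extended to the role as well, since the ARN returned by get_caller_identity() carries the role name for assumed-role sessions. A minimal sketch, assuming the ARN shape shown in the example output above. The helper name `assumed_role_name` is hypothetical.

```python
def assumed_role_name(arn: str):
    """Return the role name from an assumed-role session ARN, or None for other identities."""
    # ARN layout: arn:partition:service:region:account:resource
    resource = arn.split(":", 5)[5]  # e.g. 'assumed-role/GlueExecutionRole/my-session'
    parts = resource.split("/")
    if parts[0] == "assumed-role" and len(parts) >= 3:
        return parts[1]
    return None

session_arn = "arn:aws:sts::123456789012:assumed-role/GlueExecutionRole/my-session"
user_arn = "arn:aws:iam::123456789012:user/alice"

print(assumed_role_name(session_arn))
print(assumed_role_name(user_arn))
```

In a pipeline you would pass `identity['Arn']` and abort if the result is not the role you expect, in the same way the account ID is checked above.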
assume_role() is how one AWS identity temporarily gains the permissions of another IAM role. You call STS, it verifies your identity has permission to assume the target role, and it returns temporary credentials (Access Key, Secret Key, Session Token) that expire after 15 minutes to 12 hours.
import boto3
sts = boto3.client('sts', region_name='us-east-1')
# ── Step 1: Assume the target role ──
response = sts.assume_role(
RoleArn = 'arn:aws:iam::999888777666:role/DataLakeReadRole',
RoleSessionName = 'orders-pipeline-cross-account-read',
DurationSeconds = 3600 # 1 hour
)
# ── Step 2: Extract the temporary credentials ──
creds = response['Credentials']
# creds contains: AccessKeyId, SecretAccessKey, SessionToken, Expiration
print("Temp creds expire at:", creds['Expiration'])
# ── Step 3: Build a boto3 session using the temp credentials ──
assumed_session = boto3.Session(
aws_access_key_id = creds['AccessKeyId'],
aws_secret_access_key = creds['SecretAccessKey'],
aws_session_token = creds['SessionToken'],
region_name = 'us-east-1'
)
# ── Now use this session to create service clients in the target account ──
s3_target = assumed_session.client('s3')
glue_target = assumed_session.client('glue')
# ── Read data from the target account's S3 bucket ──
obj = s3_target.get_object(
Bucket = 'partner-data-lake-999888777666',
Key = 'bronze/orders/2024-01-15/orders.parquet'
)
print("Read", len(obj['Body'].read()), "bytes from target account S3")
# ── List Glue tables in target account's catalog ──
tables_resp = glue_target.get_tables(DatabaseName='partner_bronze_db')
for t in tables_resp['TableList']:
    print(" table:", t['Name'])
aws_access_key_id, aws_secret_access_key, AND aws_session_token. Omitting the session token causes AuthFailure or InvalidClientTokenId errors, even though the access key and secret look valid.
Temporary credentials expire. If your pipeline runs longer than the session duration the boto3 clients will start throwing ExpiredTokenException. The pattern for long pipelines is to check expiry before each API call and re-assume the role when needed.
import boto3
from datetime import datetime, timezone, timedelta
class AssumedRoleSession:
    """Wrapper that auto-refreshes STS credentials before expiry."""

    def __init__(self, role_arn, session_name, duration=3600, refresh_before_secs=300):
        self.role_arn = role_arn
        self.session_name = session_name
        self.duration = duration
        self.refresh_before_secs = refresh_before_secs # refresh 5 min before expiry
        self.sts = boto3.client('sts')
        self._session = None
        self._expiration = None

    def _refresh(self):
        resp = self.sts.assume_role(
            RoleArn = self.role_arn,
            RoleSessionName = self.session_name,
            DurationSeconds = self.duration
        )
        creds = resp['Credentials']
        self._expiration = creds['Expiration'] # timezone-aware UTC datetime
        self._session = boto3.Session(
            aws_access_key_id = creds['AccessKeyId'],
            aws_secret_access_key = creds['SecretAccessKey'],
            aws_session_token = creds['SessionToken']
        )
        print(f"[STS] Role assumed, expires {self._expiration}")

    def session(self):
        now = datetime.now(tz=timezone.utc)
        if self._session is None or (self._expiration - now) < timedelta(seconds=self.refresh_before_secs):
            self._refresh()
        return self._session
# ── Usage ──
role_session = AssumedRoleSession(
role_arn = 'arn:aws:iam::999888777666:role/DataLakeReadRole',
session_name = 'long-running-pipeline'
)
# Each call to .session() auto-refreshes if expiry is near
s3 = role_session.session().client('s3')
# ... do work ...
s3 = role_session.session().client('s3') # safe to call again hours later
Large organisations split AWS accounts by team, domain, or environment. Your pipeline runs in Account A (the source account) but the data lives in Account B (the target account). STS assume_role() bridges the two accounts — no VPN, no credential sharing, no permanent access.
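1. The role's Trust Policy must name your identity (or your account) as a principal that is allowed to call sts:AssumeRole on it.
2. The role's Permission Policy must grant the actual resource actions (s3:GetObject, glue:GetTable etc.). Both are needed: the trust is the door, the permission is the key.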
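The two policies on the target role can be sketched as plain Python dicts. This is a minimal illustration, not a production policy: the account IDs and bucket name reuse the example values from this section, and the source-account role name `PipelineRole` is a hypothetical placeholder.

```python
import json

SOURCE_ACCOUNT = "123456789012"   # account where the pipeline runs
TARGET_BUCKET = "partner-data-lake-999888777666"

# Hypothetical name of the pipeline's own role in the source account.
SOURCE_ROLE_ARN = f"arn:aws:iam::{SOURCE_ACCOUNT}:role/PipelineRole"

# Trust policy (attached to DataLakeReadRole in the target account):
# answers "who is allowed to assume this role?"
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": SOURCE_ROLE_ARN},
        "Action": "sts:AssumeRole",
    }],
}

# Permission policy (also attached to DataLakeReadRole):
# answers "what can the role do once it has been assumed?"
permission_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{TARGET_BUCKET}/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{TARGET_BUCKET}",
        },
        {
            "Effect": "Allow",
            "Action": ["glue:GetTable", "glue:GetTables"],
            "Resource": "*",  # tighten to specific catalog ARNs in practice
        },
    ],
}

trust_policy_json = json.dumps(trust_policy, indent=2)
permission_policy_json = json.dumps(permission_policy, indent=2)
print(trust_policy_json)
```

Separately, the calling role in the source account still needs its own permission to call sts:AssumeRole on the target role ARN; that statement lives in the source account, not on the target role.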
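A complete pattern: assume a role in the target account → list S3 objects → read Glue table metadata → write a summary of the results back to source account S3. Downloading the Parquet objects themselves works exactly like the get_object() call shown earlier and is omitted here.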
import boto3
import io
from botocore.exceptions import ClientError
# ════════════════════════════════════════════════
# CONFIG
# ════════════════════════════════════════════════
SOURCE_ACCOUNT = '123456789012'
TARGET_ACCOUNT = '999888777666'
TARGET_ROLE_ARN = f'arn:aws:iam::{TARGET_ACCOUNT}:role/DataLakeReadRole'
TARGET_BUCKET = 'partner-data-lake-999888777666'
TARGET_GLUE_DB = 'partner_bronze_db'
OUTPUT_BUCKET = 'my-pipeline-output-123456789012' # source account bucket
OUTPUT_KEY = 'silver/orders/cross_account_result.parquet'
# ════════════════════════════════════════════════
# STEP 1 — Verify current identity (sanity check)
# ════════════════════════════════════════════════
sts = boto3.client('sts')
identity = sts.get_caller_identity()
print(f"Running as: {identity['Arn']} in account {identity['Account']}")
# ════════════════════════════════════════════════
# STEP 2 — Assume role in target account
# ════════════════════════════════════════════════
try:
    response = sts.assume_role(
        RoleArn = TARGET_ROLE_ARN,
        RoleSessionName = 'orders-cross-account-read-2024',
        DurationSeconds = 3600
    )
except ClientError as e:
    print(f"Failed to assume role: {e.response['Error']['Code']}: {e.response['Error']['Message']}")
    raise
creds = response['Credentials']
print(f"Assumed role — session expires: {creds['Expiration']}")
# ════════════════════════════════════════════════
# STEP 3 — Build clients in target account
# ════════════════════════════════════════════════
target_session = boto3.Session(
aws_access_key_id = creds['AccessKeyId'],
aws_secret_access_key = creds['SecretAccessKey'],
aws_session_token = creds['SessionToken'],
region_name = 'us-east-1'
)
s3_target = target_session.client('s3')
glue_target = target_session.client('glue')
# ════════════════════════════════════════════════
# STEP 4 — List objects in target S3 (with paginator)
# ════════════════════════════════════════════════
paginator = s3_target.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=TARGET_BUCKET, Prefix='bronze/orders/2024-01-15/')
keys = []
for page in pages:
    for obj in page.get('Contents', []):
        keys.append(obj['Key'])
print(f"Found {len(keys)} objects in target bucket")
# ════════════════════════════════════════════════
# STEP 5 — Read Glue table metadata from target account
# ════════════════════════════════════════════════
tables_resp = glue_target.get_tables(DatabaseName=TARGET_GLUE_DB)
print("\nGlue tables in target account:")
for t in tables_resp['TableList']:
    print(f" {t['Name']} -> {t.get('StorageDescriptor',{}).get('Location','?')}")
# ════════════════════════════════════════════════
# STEP 6 — Write result back to SOURCE account S3
# (use default boto3, not target_session)
# ════════════════════════════════════════════════
s3_source = boto3.client('s3') # default creds → source account
s3_source.put_object(
Bucket = OUTPUT_BUCKET,
Key = OUTPUT_KEY,
Body = str({'keys_found': len(keys), 'tables': [t['Name'] for t in tables_resp['TableList']]}).encode()
)
print(f"\nResults written to s3://{OUTPUT_BUCKET}/{OUTPUT_KEY}")
When Spark runs on Amazon EKS (Kubernetes), driver and executor pods need AWS credentials to access S3, Glue, etc. The modern approach is IRSA — IAM Roles for Service Accounts. Kubernetes automatically injects a web identity token into each pod, and boto3 exchanges that token for temporary credentials using assume_role_with_web_identity(). You do not call this manually — boto3 does it automatically when IRSA is configured. But you need to understand what is happening under the hood.
# When IRSA is configured on the EKS pod, boto3 automatically does this:
# 1. Reads the web identity token from the file at $AWS_WEB_IDENTITY_TOKEN_FILE
# 2. Reads the role ARN from $AWS_ROLE_ARN environment variable
# 3. Calls assume_role_with_web_identity() transparently
# 4. Returns a boto3 session with the role's permissions
# You just write normal boto3 code — IRSA handles the credential chain:
import boto3
s3 = boto3.client('s3') # automatically uses IRSA creds on EKS
response = s3.list_buckets()
print([b['Name'] for b in response['Buckets']])
# ── If you ever need to call it manually (rare) ──
sts = boto3.client('sts')
with open('/var/run/secrets/eks.amazonaws.com/serviceaccount/token') as f:
    web_identity_token = f.read()
response = sts.assume_role_with_web_identity(
RoleArn = 'arn:aws:iam::123456789012:role/SparkDriverRole',
RoleSessionName = 'spark-driver-pod',
WebIdentityToken = web_identity_token,
DurationSeconds = 3600
)
creds = response['Credentials']
spark_session = boto3.Session(
aws_access_key_id = creds['AccessKeyId'],
aws_secret_access_key = creds['SecretAccessKey'],
aws_session_token = creds['SessionToken']
)
eks.amazonaws.com/role-arn) and EKS automatically injects the token file. This is the approved zero-credential pattern for Spark on EKS.
| API | What it does | Key Parameters | Returns |
|---|---|---|---|
| get_caller_identity() | Return current identity info | None | Account, UserId, Arn |
| assume_role() | Get temp creds for an IAM role | RoleArn, RoleSessionName, DurationSeconds, ExternalId | Credentials (AccessKeyId, SecretAccessKey, SessionToken, Expiration) |
| assume_role_with_web_identity() | Get temp creds using OIDC token (EKS/IRSA) | RoleArn, RoleSessionName, WebIdentityToken | Credentials + AssumedRoleUser |
| assume_role_with_saml() | Get temp creds using SAML SSO assertion | RoleArn, PrincipalArn, SAMLAssertion | Credentials |
| get_session_token() | MFA-based temp creds for IAM user | DurationSeconds, SerialNumber, TokenCode (MFA) | Credentials |
| get_federation_token() | Temp creds with policy scoping (for app users) | Name, Policy, DurationSeconds | Credentials + FederatedUser |
get_caller_identity() for sanity-checking which identity is active, and assume_role() for cross-account access. The others (SAML, web identity, federation) are handled by infrastructure/platform teams.
| Error Code | Cause | Fix |
|---|---|---|
| AccessDenied | Your role does not have sts:AssumeRole permission on the target role ARN | Add sts:AssumeRole to your role's permission policy for the target ARN |
| AccessDenied (trust policy) | Target role's trust policy does not list your identity as a trusted principal | Add your role ARN to the target role's trust policy (done by target account admin) |
| ExpiredTokenException | Using a client built from expired temp credentials | Re-assume the role, rebuild the boto3 Session with fresh creds |
| InvalidClientTokenId | Access key in the credentials is wrong or stale | Ensure you are passing all three fields: AccessKeyId + SecretAccessKey + SessionToken |
| ValidationError: RoleSessionName | Session name has invalid characters (spaces, slashes) | Use only alphanumeric, hyphen, underscore, dot in RoleSessionName |
EventBridge APIs
Amazon EventBridge is the event bus of AWS. As a data engineer you use it to trigger pipelines on a schedule, react to S3/Glue/EMR events, and publish your own custom pipeline events so downstream systems can react. All of this is scriptable via boto3.
EventBridge is a serverless event router. Events flow in → rules filter them → targets (Lambda, SQS, Glue, Step Functions…) execute. Think of it as an intelligent if-this-then-that engine for AWS.
| Term | What it is | Data Engineering Context |
|---|---|---|
| Event Bus | Channel that receives events. Default bus = AWS service events. Custom bus = your own events. | Use default bus for S3/Glue/EMR events. Create custom bus for pipeline events. |
| Event | JSON payload describing something that happened. Max 256 KB. | S3 object created, Glue job state change, your custom pipeline.completed event. |
| Rule | Pattern-match expression that filters events and routes to targets. | "Run Glue job daily at 06:00 UTC" or "Alert SNS when Glue job state = FAILED". |
| Target | AWS resource that EventBridge invokes when a rule matches. Up to 5 targets per rule. | Lambda, SQS, SNS, Step Functions, Glue Job, EMR (via Lambda), Kinesis. |
| Schedule | Cron or rate expression embedded in a rule. No source event needed. | rate(1 day), cron(0 6 * * ? *) — trigger Glue/EMR daily. |
put_events() lets your pipeline announce itself to the rest of the system. Instead of hardcoding "after step A, call step B", you publish an event like pipeline.completed and let downstream rules decide what to do. This decouples producers from consumers.
{"detail-type": "silver.layer.ready", "source": "com.mycompany.pipeline"}. A rule triggers a Lambda that starts the Gold layer aggregation job — without the Silver job knowing anything about Gold.
Sends up to 10 events per API call. Each event has these key fields:
| Field | Required | Description |
|---|---|---|
| Source | Yes | Who sent it. Convention: com.yourcompany.service |
| DetailType | Yes | Human-readable event category. Used in rule pattern matching. |
| Detail | Yes | JSON string with the event payload. Put your pipeline metadata here. |
| EventBusName | No | Omit = default bus. Specify name/ARN for custom bus. |
| Time | No | Event timestamp. Defaults to now. |
| Resources | No | List of ARNs related to this event (S3 bucket, table, etc.) |
import boto3, json
from datetime import datetime, timezone
events_client = boto3.client('events', region_name='us-east-1')
# ── Publish a custom pipeline completion event ──
response = events_client.put_events(
Entries=[
{
'Source': 'com.mycompany.data-pipeline', # your app identifier
'DetailType': 'pipeline.silver.completed', # event category
'Detail': json.dumps({ # payload as JSON string
'pipeline_name': 'customer_silver_etl',
'run_id': 'run_2024_01_15_001',
'status': 'SUCCESS',
'rows_written': 1_450_000,
'output_s3_path': 's3://my-lake/silver/customers/',
'completed_at': datetime.now(timezone.utc).isoformat()
}),
'EventBusName': 'data-platform-bus', # custom bus (omit = default)
'Resources': [
'arn:aws:s3:::my-lake'
]
}
]
)
# ── Check for failures (partial failure is possible) ──
failed = response.get('FailedEntryCount', 0)
if failed > 0:
    for entry in response['Entries']:
        if 'ErrorCode' in entry:
            print(f"Failed: {entry['ErrorCode']}: {entry['ErrorMessage']}")
else:
    event_id = response['Entries'][0]['EventId']
    print(f"Event published: {event_id}")
put_events() never throws an exception for rejected events — it returns them in Entries with an ErrorCode. Always check FailedEntryCount and loop through entries. The position in Entries matches your input order.
Up to 10 events per call, total payload ≤ 256 KB. For high-volume scenarios, chunk your events.
import boto3, json
events_client = boto3.client('events')
def publish_events_batch(events_list, bus_name='default'):
    """Publish events in chunks of 10 (EventBridge limit)."""
    chunk_size = 10
    total_failed = 0
    for i in range(0, len(events_list), chunk_size):
        chunk = events_list[i:i + chunk_size]
        entries = [
            {
                'Source': evt['source'],
                'DetailType': evt['detail_type'],
                'Detail': json.dumps(evt['detail']),
                'EventBusName': bus_name
            }
            for evt in chunk
        ]
        response = events_client.put_events(Entries=entries)
        total_failed += response['FailedEntryCount']
    print(f"Published {len(events_list)} events, {total_failed} failed")
# ── Usage ──
pipeline_events = [
{'source': 'com.co.pipeline', 'detail_type': 'table.loaded', 'detail': {'table': 'orders', 'rows': 50000}},
{'source': 'com.co.pipeline', 'detail_type': 'table.loaded', 'detail': {'table': 'customers', 'rows': 12000}},
]
publish_events_batch(pipeline_events, bus_name='data-platform-bus')
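The batch helper above only counts failures. Because result entries come back in the same order as the request, rejected entries can be picked out by position and resent. The following is a minimal sketch under that assumption; `put_events_with_retry` and its parameters are illustrative names, and the client is passed in so the function can be exercised with a stub.

```python
import time

def put_events_with_retry(events_client, entries, max_attempts=3, base_delay=0.5):
    """Send up to 10 entries and resend only those EventBridge rejected.

    Returns the entries that still failed after all attempts (empty on success).
    """
    pending = list(entries)
    for attempt in range(max_attempts):
        response = events_client.put_events(Entries=pending)
        if response.get("FailedEntryCount", 0) == 0:
            return []
        # Result entries are in request order, so position i in the
        # response describes pending[i].
        pending = [
            entry
            for entry, result in zip(pending, response["Entries"])
            if "ErrorCode" in result
        ]
        if attempt < max_attempts - 1:
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return pending
```

In real use the first argument would be `boto3.client('events')`; anything the function returns is a candidate for a dead-letter queue or an alert.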
| Type | When it fires | DE Use Case | Key Parameter |
|---|---|---|---|
| Schedule Rule | On a cron or rate schedule. No incoming event needed. | Trigger Glue job daily at 06:00 UTC | ScheduleExpression |
| Event Pattern Rule | When an event on the bus matches a JSON pattern. | Alert when Glue job state changes to FAILED | EventPattern |
import boto3
events_client = boto3.client('events')
# ── Rate-based schedule: every 1 day ──
response = events_client.create_rule(
Name='daily-silver-etl-trigger',
ScheduleExpression='rate(1 day)', # runs every 24h
State='ENABLED',
Description='Triggers Silver ETL Glue job every day',
EventBusName='default' # schedule rules always on default bus
)
rule_arn = response['RuleArn']
print(f"Created rule: {rule_arn}")
# ── Cron-based schedule: 06:00 UTC every weekday ──
events_client.create_rule(
Name='weekday-gold-etl-trigger',
ScheduleExpression='cron(0 6 ? * MON-FRI *)', # Mon–Fri 06:00 UTC
State='ENABLED',
Description='Gold layer aggregation, weekdays only'
)
# ── Common cron expressions ──
# cron(Minutes Hours Day-of-month Month Day-of-week Year)
# cron(0 6 * * ? *) → every day at 06:00 UTC
# cron(0 */6 * * ? *) → every 6 hours
# cron(0 8 1 * ? *) → 1st of every month at 08:00 UTC
# cron(0 6 ? * MON-FRI *) → weekdays at 06:00 UTC
?. So cron(0 6 * * MON-FRI *) is INVALID; use cron(0 6 ? * MON-FRI *).
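The `?` rule can be checked locally before calling create_rule(). The sketch below is a deliberately narrow pre-flight check with hypothetical helper names: it only splits the six fields and flags expressions where neither day-of-month nor day-of-week is `?`. It does not validate the rest of the cron syntax.

```python
import re

_CRON_RE = re.compile(r"^cron\((.*)\)$")

def split_cron(expression):
    """Return the six fields of an EventBridge cron(...) expression."""
    match = _CRON_RE.match(expression.strip())
    if not match:
        raise ValueError(f"not a cron(...) expression: {expression!r}")
    fields = match.group(1).split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return fields

def has_required_question_mark(expression):
    """True if day-of-month or day-of-week (fields 3 and 5) is '?'."""
    _minutes, _hours, day_of_month, _month, day_of_week, _year = split_cron(expression)
    return "?" in (day_of_month, day_of_week)

for expr in ("cron(0 6 * * MON-FRI *)", "cron(0 6 ? * MON-FRI *)"):
    print(expr, "->", "ok" if has_required_question_mark(expr) else "missing '?'")
```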
Event pattern rules match JSON structure of incoming events. You specify which fields must have which values. Partial match is sufficient — unspecified fields are ignored.
import boto3, json
events_client = boto3.client('events')
# ── Rule: fire when any Glue job reaches FAILED or TIMEOUT state ──
glue_failure_pattern = {
"source": ["aws.glue"], # only Glue events
"detail-type": ["Glue Job State Change"], # only state change events
"detail": {
"state": ["FAILED", "TIMEOUT", "ERROR"] # only failure states
}
}
events_client.create_rule(
Name='glue-job-failure-alert',
EventPattern=json.dumps(glue_failure_pattern),
State='ENABLED',
Description='Fires when any Glue job fails'
)
# ── Rule: fire only for a specific Glue job name ──
specific_job_pattern = {
"source": ["aws.glue"],
"detail-type": ["Glue Job State Change"],
"detail": {
"jobName": ["customer-silver-etl"], # specific job only
"state": ["SUCCEEDED"] # only on success
}
}
events_client.create_rule(
Name='customer-etl-success-trigger',
EventPattern=json.dumps(specific_job_pattern),
State='ENABLED',
Description='Triggers Gold job after Customer Silver ETL succeeds'
)
# ── Rule: match your own custom events ──
custom_event_pattern = {
"source": ["com.mycompany.data-pipeline"],
"detail-type": ["pipeline.silver.completed"]
}
events_client.create_rule(
Name='silver-complete-to-gold-trigger',
EventPattern=json.dumps(custom_event_pattern),
EventBusName='data-platform-bus', # custom bus
State='ENABLED'
)
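To build intuition for how a pattern is evaluated, here is a simplified local matcher. It is an illustration only, not EventBridge's implementation: it supports exact values and the `{"prefix": ...}` filter, treats each list as "any of these values", and ignores event fields the pattern does not mention.

```python
def matches(pattern, event):
    """Simplified event-pattern check: exact values and {"prefix": ...} only."""
    for field, expected in pattern.items():
        if field not in event:
            return False
        actual = event[field]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        else:
            # A list of candidates: the field must match at least one (OR).
            if not any(_matches_one(candidate, actual) for candidate in expected):
                return False
    return True

def _matches_one(candidate, actual):
    if isinstance(candidate, dict) and "prefix" in candidate:
        return isinstance(actual, str) and actual.startswith(candidate["prefix"])
    return candidate == actual

pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"state": ["FAILED", "TIMEOUT"]},
}
sample_event = {
    "source": "aws.glue",
    "detail-type": "Glue Job State Change",
    "account": "123456789012",  # not in the pattern, so it is ignored
    "detail": {"jobName": "customer-silver-etl", "state": "FAILED"},
}
print(matches(pattern, sample_event))  # True
```

A helper like this is handy in unit tests for catching typos in field names before a rule is deployed.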
# Disable a rule (e.g. maintenance window, non-business days)
events_client.disable_rule(Name='daily-silver-etl-trigger')
# Re-enable it
events_client.enable_rule(Name='daily-silver-etl-trigger')
# Delete rule (must remove targets first)
events_client.remove_targets(Rule='daily-silver-etl-trigger', Ids=['1'])
events_client.delete_rule(Name='daily-silver-etl-trigger')
A target is what EventBridge invokes when a rule fires. Each rule can have up to 5 targets. As a data engineer your most common targets are Lambda (to orchestrate), SQS (to buffer), and SNS (to alert).
import boto3, json
events_client = boto3.client('events')
lambda_client = boto3.client('lambda')
RULE_NAME = 'daily-silver-etl-trigger'
LAMBDA_ARN = 'arn:aws:lambda:us-east-1:123456789012:function:trigger-glue-job'
# ── Add Lambda as target ──
response = events_client.put_targets(
Rule=RULE_NAME,
Targets=[
{
'Id': '1', # unique ID within this rule (string)
'Arn': LAMBDA_ARN, # target ARN
'Input': json.dumps({ # static JSON passed to Lambda as event
'pipeline': 'customer-silver-etl',
'trigger_source': 'eventbridge-schedule'
})
}
]
)
if response['FailedEntryCount'] > 0:
    print("Target registration failed:", response['FailedEntries'])
# ── Grant EventBridge permission to invoke Lambda ──
# (EventBridge needs lambda:InvokeFunction permission on the function)
try:
    lambda_client.add_permission(
        FunctionName=LAMBDA_ARN,
        StatementId='eventbridge-invoke-permission',
        Action='lambda:InvokeFunction',
        Principal='events.amazonaws.com',
        SourceArn=f'arn:aws:events:us-east-1:123456789012:rule/{RULE_NAME}'
    )
except lambda_client.exceptions.ResourceConflictException:
    pass # permission already exists, that's fine
lambda.add_permission() to grant EventBridge the right to invoke it. Otherwise the rule fires silently and nothing happens — a very common mistake.
import boto3, json
events_client = boto3.client('events')
SQS_ARN = 'arn:aws:sqs:us-east-1:123456789012:pipeline-trigger-queue'
SNS_ARN = 'arn:aws:sns:us-east-1:123456789012:pipeline-alerts'
# ── Route Glue failure events to BOTH SNS (alert) and SQS (retry queue) ──
events_client.put_targets(
Rule='glue-job-failure-alert',
Targets=[
{
'Id': 'alert-ops-team',
'Arn': SNS_ARN, # SNS for email/Slack alert
'InputTransformer': { # reshape event before sending
'InputPathsMap': {
'job': '$.detail.jobName',
'state': '$.detail.state'
},
'InputTemplate': '"ALERT: Glue job <job> entered state <state>"'
}
},
{
'Id': 'queue-for-retry',
'Arn': SQS_ARN, # SQS for retry handling
}
]
)
# Note: For SQS standard queue, EventBridge needs sqs:SendMessage permission
# Add this to the SQS queue policy (not via boto3 add_permission)
InputTransformer lets you extract specific fields from the event and shape them before passing to the target. Use $.detail.fieldName (JSONPath) to pick fields. Wrap the template value in escaped quotes for string output.
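What the transformer does to an event can be previewed locally. This sketch mimics only the basic behaviour described above (dot-notation paths and `<name>` placeholders); the function names are hypothetical, and the real service supports more than this.

```python
import re

def get_path(event, json_path):
    """Resolve a simple '$.a.b.c' path (dot notation only) against a dict."""
    value = event
    for part in json_path.lstrip("$").strip(".").split("."):
        value = value[part]
    return value

def apply_input_transformer(event, input_paths_map, input_template):
    """Substitute <name> placeholders with values picked from the event."""
    values = {name: get_path(event, path) for name, path in input_paths_map.items()}
    return re.sub(r"<(\w+)>", lambda m: str(values[m.group(1)]), input_template)

glue_event = {"detail": {"jobName": "customer-silver-etl", "state": "FAILED"}}
message = apply_input_transformer(
    glue_event,
    {"job": "$.detail.jobName", "state": "$.detail.state"},
    '"ALERT: Glue job <job> entered state <state>"',
)
print(message)  # "ALERT: Glue job customer-silver-etl entered state FAILED"
```

The surrounding double quotes are part of the output, which is what makes the result a valid JSON string for the target.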
import boto3
events_client = boto3.client('events')
# ── List all rules (paginated) ──
paginator = events_client.get_paginator('list_rules')
for page in paginator.paginate(EventBusName='default'):
    for rule in page['Rules']:
        print(f"{rule['Name']:40} {rule['State']:10} {rule.get('ScheduleExpression', rule.get('EventPattern', ''))[:60]}")
# ── List all rules with a name prefix ──
for page in paginator.paginate(NamePrefix='daily-', EventBusName='default'):
    for rule in page['Rules']:
        print(rule['Name'], rule['State'])
# ── List targets attached to a specific rule ──
response = events_client.list_targets_by_rule(Rule='daily-silver-etl-trigger')
for target in response['Targets']:
    print(f"Target ID: {target['Id']} ARN: {target['Arn']}")
# ── Get full rule details ──
rule = events_client.describe_rule(Name='daily-silver-etl-trigger')
print("Schedule:", rule.get('ScheduleExpression'))
print("State:", rule['State'])
print("ARN:", rule['Arn'])
You cannot delete a rule that still has targets. Always remove targets first.
import boto3
events_client = boto3.client('events')
def delete_rule_safely(rule_name, bus_name='default'):
    """Remove all targets then delete the rule."""
    # Step 1: get all target IDs
    response = events_client.list_targets_by_rule(Rule=rule_name, EventBusName=bus_name)
    target_ids = [t['Id'] for t in response['Targets']]
    # Step 2: remove targets
    if target_ids:
        events_client.remove_targets(
            Rule=rule_name,
            EventBusName=bus_name,
            Ids=target_ids
        )
        print(f"Removed {len(target_ids)} targets")
    # Step 3: delete rule
    events_client.delete_rule(Name=rule_name, EventBusName=bus_name)
    print(f"Deleted rule: {rule_name}")
delete_rule_safely('daily-silver-etl-trigger')
The most common DE pattern: a cron rule fires a Lambda that starts a Glue job with dynamic arguments.
import boto3, json
from datetime import datetime, timezone
glue_client = boto3.client('glue')
events_client = boto3.client('events')
def lambda_handler(event, context):
    """EventBridge schedule → Lambda → Glue start."""
    pipeline = event.get('pipeline', 'customer-silver-etl')
    run_date = datetime.now(timezone.utc).strftime('%Y-%m-%d')
    try:
        response = glue_client.start_job_run(
            JobName=pipeline,
            Arguments={
                '--run_date': run_date,
                '--trigger_source': 'eventbridge-schedule'
            }
        )
        run_id = response['JobRunId']
        print(f"Started {pipeline} run: {run_id}")
        return {'statusCode': 200, 'jobRunId': run_id}
    except Exception as e:
        # Publish failure event so another rule can alert ops
        events_client.put_events(Entries=[{
            'Source': 'com.co.pipeline-launcher',
            'DetailType': 'pipeline.launch.failed',
            'Detail': json.dumps({'pipeline': pipeline, 'error': str(e)})
        }])
        raise
S3 sends object-created events to EventBridge (if enabled on the bucket). A rule filters for the right prefix and fires a Lambda to process the file.
import boto3, json
events_client = boto3.client('events')
# S3 sends events like: source=aws.s3, detail-type="Object Created"
# detail.bucket.name = your bucket, detail.object.key = the S3 key
s3_event_pattern = {
"source": ["aws.s3"],
"detail-type": ["Object Created"],
"detail": {
"bucket": {
"name": ["my-raw-data-lake"] # specific bucket
},
"object": {
"key": [{"prefix": "raw/customers/"}] # specific prefix
}
}
}
events_client.create_rule(
Name='raw-customer-file-arrived',
EventPattern=json.dumps(s3_event_pattern),
State='ENABLED',
Description='Fires when a new file lands in raw/customers/'
)
# Then put_targets() → Lambda that reads the file and triggers Glue
# Pre-req: enable EventBridge notifications on the S3 bucket
s3_client = boto3.client('s3')
s3_client.put_bucket_notification_configuration(
Bucket='my-raw-data-lake',
NotificationConfiguration={
'EventBridgeConfiguration': {} # empty dict = enable all events to EventBridge
}
)
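On the receiving side, the Lambda target reads the bucket and key from the `detail` section of the event, using the field paths noted in the comments above. A minimal sketch follows; the Parquet-only filter and the returned dict shape are hypothetical choices for illustration, and a real handler would go on to start the Glue job.

```python
def parse_s3_event(event):
    """Pull bucket and key out of an S3 'Object Created' EventBridge event."""
    detail = event["detail"]
    return detail["bucket"]["name"], detail["object"]["key"]

def lambda_handler(event, context):
    bucket, key = parse_s3_event(event)
    # Hypothetical routing rule: only react to Parquet files.
    if not key.endswith(".parquet"):
        return {"processed": False, "reason": "not a parquet file", "key": key}
    return {"processed": True, "s3_path": f"s3://{bucket}/{key}"}

sample = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "my-raw-data-lake"},
        "object": {"key": "raw/customers/2024/01/15/part-0.parquet"},
    },
}
print(lambda_handler(sample, None))
```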
Each stage publishes a completion event → the next stage is triggered by a rule. Fully decoupled, fully event-driven.
| API Call | What it does | Key Parameters | Returns |
|---|---|---|---|
| put_events() | Publish 1–10 custom events to a bus | Entries[]: Source, DetailType, Detail, EventBusName | FailedEntryCount, Entries[EventId or ErrorCode] |
| create_rule() | Create a schedule or event-pattern rule | Name, ScheduleExpression OR EventPattern, State, EventBusName | RuleArn |
| put_targets() | Attach targets (Lambda/SQS/SNS) to a rule | Rule, Targets[]: Id, Arn, Input/InputTransformer | FailedEntryCount |
| list_rules() | List all rules (paginated) | NamePrefix, EventBusName, NextToken | Rules[], NextToken |
| describe_rule() | Get full details of one rule | Name, EventBusName | Rule object with all fields |
| list_targets_by_rule() | Get all targets for a rule | Rule, EventBusName | Targets[] |
| enable_rule() | Enable a disabled rule | Name, EventBusName | None |
| disable_rule() | Pause a rule without deleting it | Name, EventBusName | None |
| remove_targets() | Detach targets from a rule (required before delete) | Rule, Ids[], EventBusName | FailedEntryCount |
| delete_rule() | Delete a rule (must have no targets) | Name, EventBusName | None |
| create_event_bus() | Create a custom event bus | Name | EventBusArn |
| list_event_buses() | List all event buses | NamePrefix | EventBuses[] |
| delete_event_bus() | Delete a custom event bus | Name | None |
RDS / Redshift Data APIs
The RDS Data API and Redshift Data API let you run SQL against Aurora Serverless / Redshift over HTTPS using boto3 — no JDBC driver, no persistent connection, no VPC networking required from your client. This is the standard way Lambda functions and orchestration code run SQL without managing database connections.
Normally, to run SQL from Python you open a TCP connection with a driver like psycopg2 — this needs the database inside your VPC (or a public endpoint), a network path, a connection pool, and credentials passed at connect time. For short-lived compute like Lambda, opening/closing thousands of DB connections causes connection storms and exhausts the database's max-connections limit.
You call execute_statement() over the standard AWS API (HTTPS + IAM auth) — boto3 handles this like any other AWS service call. AWS internally manages a connection pool to the database. No driver, no VPC access needed from your Lambda/script, and authentication is via IAM or Secrets Manager instead of hardcoded DB passwords.
| Aspect | RDS Data API | Redshift Data API |
|---|---|---|
| Applies to | Aurora Serverless (PostgreSQL / MySQL compatible) | Redshift provisioned clusters & Redshift Serverless |
| Auth | Secrets Manager ARN (secretArn) | Secrets Manager ARN OR temporary IAM credentials (DbUser) |
| Transactions | begin_transaction() / commit_transaction() | Not exposed the same way — each execute_statement() is its own unit |
| Async by default? | No — synchronous response | Yes — must poll describe_statement() |
| Typical DE use | Metadata/audit DB writes from Lambda | Running transforms/UNLOADs on the warehouse from orchestration code |
execute_statement() returns immediately with an Id, and you must poll until the status is FINISHED. The RDS Data API is synchronous for simple statements — the result comes back in the same call (though it can also be used asynchronously for long-running statements).
Runs a single SQL statement against an Aurora Serverless database. Requires the cluster ARN, the secret ARN holding credentials, the database name, and the SQL string. Use named parameters (:param) instead of string formatting to avoid SQL injection.
import boto3
rds_data = boto3.client('rds-data', region_name='us-east-1')
CLUSTER_ARN = 'arn:aws:rds:us-east-1:123456789012:cluster:my-aurora-cluster'
SECRET_ARN = 'arn:aws:secretsmanager:us-east-1:123456789012:secret:rds-creds-AbCdEf'
DATABASE = 'pipeline_metadata'
# ── Run a parameterized INSERT (audit log row) ──
response = rds_data.execute_statement(
resourceArn=CLUSTER_ARN,
secretArn=SECRET_ARN,
database=DATABASE,
sql="""
INSERT INTO pipeline_audit (run_id, pipeline_name, status, rows_processed)
VALUES (:run_id, :pipeline_name, :status, :rows_processed)
""",
parameters=[
{'name': 'run_id', 'value': {'stringValue': 'run_2024_01_15_001'}},
{'name': 'pipeline_name', 'value': {'stringValue': 'customer_silver_etl'}},
{'name': 'status', 'value': {'stringValue': 'SUCCESS'}},
{'name': 'rows_processed', 'value': {'longValue': 1450000}},
]
)
print("Rows affected:", response['numberOfRecordsUpdated'])
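Writing the typed `parameters` list by hand gets verbose as the number of columns grows. One option is a small converter from a plain dict. This is a sketch with hypothetical helper names that covers only the scalar types used in this section.

```python
def to_field(value):
    """Map a Python value to the typed dict the RDS Data API expects."""
    if value is None:
        return {"isNull": True}
    if isinstance(value, bool):      # check bool before int: bool subclasses int
        return {"booleanValue": value}
    if isinstance(value, int):
        return {"longValue": value}
    if isinstance(value, float):
        return {"doubleValue": value}
    if isinstance(value, str):
        return {"stringValue": value}
    raise TypeError(f"unsupported parameter type: {type(value).__name__}")

def to_params(values):
    """Convert {'name': value, ...} into the parameters=[...] list."""
    return [{"name": name, "value": to_field(value)} for name, value in values.items()]

params = to_params({
    "run_id": "run_2024_01_15_001",
    "status": "SUCCESS",
    "rows_processed": 1450000,
})
print(params)
```

The result can be passed directly as `parameters=params` to execute_statement().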
For SELECT queries, set includeResultMetadata=True to get column names back, and parse response['records'] — each record is a list of typed value dicts (stringValue, longValue, doubleValue, booleanValue, isNull).
# ── SELECT and parse results into list of dicts ──
response = rds_data.execute_statement(
resourceArn=CLUSTER_ARN,
secretArn=SECRET_ARN,
database=DATABASE,
sql="SELECT run_id, pipeline_name, status, rows_processed FROM pipeline_audit ORDER BY run_id DESC LIMIT 5",
includeResultMetadata=True
)
# Build column name list from metadata
columns = [col['name'] for col in response['columnMetadata']]
# Helper to extract the actual value regardless of type key
def extract_value(field):
    if field.get('isNull'):
        return None
    for key in ('stringValue', 'longValue', 'doubleValue', 'booleanValue'):
        if key in field:
            return field[key]
    return None
rows = []
for record in response['records']:
    row = {columns[i]: extract_value(field) for i, field in enumerate(record)}
    rows.append(row)
print(rows)
# [{'run_id': 'run_2024_01_15_001', 'pipeline_name': 'customer_silver_etl', ...}]
Runs the same SQL statement multiple times with different parameter sets in one call — ideal for bulk inserts (e.g., writing many audit rows or control-table entries at once) without looping individual execute_statement() calls.
# ── Bulk update the status of multiple audit rows in one call ──
param_sets = [
[
{'name': 'run_id', 'value': {'stringValue': 'run_001'}},
{'name': 'status', 'value': {'stringValue': 'SUCCESS'}},
],
[
{'name': 'run_id', 'value': {'stringValue': 'run_002'}},
{'name': 'status', 'value': {'stringValue': 'FAILED'}},
],
]
response = rds_data.batch_execute_statement(
resourceArn=CLUSTER_ARN,
secretArn=SECRET_ARN,
database=DATABASE,
sql="UPDATE pipeline_audit SET status = :status WHERE run_id = :run_id",
parameterSets=param_sets
)
print(len(response['updateResults']), "statements executed")
execute_statement() calls to update the watermark table, one batch_execute_statement() updates all 20 rows in a single round trip.
For multi-statement atomic operations (e.g., update a watermark table AND write an audit row — both must succeed or both must fail), wrap calls in a transaction. Pass the returned transactionId into each execute_statement() call, then commit (or roll back) at the end.
# ── Atomic: update watermark + insert audit row together ──
tx = rds_data.begin_transaction(
resourceArn=CLUSTER_ARN, secretArn=SECRET_ARN, database=DATABASE
)
tx_id = tx['transactionId']
try:
    rds_data.execute_statement(
        resourceArn=CLUSTER_ARN, secretArn=SECRET_ARN, database=DATABASE,
        transactionId=tx_id,
        sql="UPDATE watermark_table SET last_value = :wm WHERE pipeline_id = :pid",
        parameters=[
            {'name': 'wm', 'value': {'stringValue': '2024-01-15T06:00:00Z'}},
            {'name': 'pid', 'value': {'longValue': 42}},
        ]
    )
    rds_data.execute_statement(
        resourceArn=CLUSTER_ARN, secretArn=SECRET_ARN, database=DATABASE,
        transactionId=tx_id,
        sql="INSERT INTO pipeline_audit (run_id, status) VALUES (:rid, 'SUCCESS')",
        parameters=[{'name': 'rid', 'value': {'stringValue': 'run_2024_01_15_001'}}]
    )
    rds_data.commit_transaction(resourceArn=CLUSTER_ARN, secretArn=SECRET_ARN, transactionId=tx_id)
except Exception as e:
    rds_data.rollback_transaction(resourceArn=CLUSTER_ARN, secretArn=SECRET_ARN, transactionId=tx_id)
    raise
Identify the target with ClusterIdentifier (provisioned) or WorkgroupName (Redshift Serverless), the Database name, and either DbUser (temporary IAM credentials) or SecretArn (Secrets Manager). This call returns immediately with a statement Id — it does not wait for the query to finish.
import boto3
redshift_data = boto3.client('redshift-data', region_name='us-east-1')
# ── Submit a SQL statement (returns immediately) ──
response = redshift_data.execute_statement(
ClusterIdentifier='my-redshift-cluster',
Database='analytics',
DbUser='etl_service_user', # IAM-based auth (no password)
Sql="""
SELECT pipeline_name, COUNT(*) AS run_count, SUM(rows_processed) AS total_rows
FROM pipeline_audit
WHERE run_date = CURRENT_DATE
GROUP BY pipeline_name
"""
)
statement_id = response['Id']
print("Submitted, statement Id:", statement_id)
Id) immediately, but the letter (your query) is still being processed. You must check the tracking number later to see if it's "delivered."
Because execution is async, poll describe_statement() with a short delay until Status becomes FINISHED, FAILED, or ABORTED. This is the exact "manual waiter" pattern referenced in 29.30.4 — Redshift has no built-in waiter for this.
import time
# ── Poll describe_statement until terminal state ──
while True:
    desc = redshift_data.describe_statement(Id=statement_id)
    status = desc['Status']  # SUBMITTED | PICKED | STARTED | FINISHED | FAILED | ABORTED
    print("Current status:", status)
    if status == 'FINISHED':
        break
    elif status in ('FAILED', 'ABORTED'):
        raise RuntimeError(f"Query {status}: {desc.get('Error', 'unknown error')}")
    time.sleep(1)  # fixed 1-second delay between polls
Once FINISHED, call get_statement_result() to retrieve Records (paginated via NextToken) and ColumnMetadata (column names/types) — combine them the same way as the RDS Data API result parsing above.
# ── Fetch and parse results into list of dicts ──
import pandas as pd
result = redshift_data.get_statement_result(Id=statement_id)
columns = [col['name'] for col in result['ColumnMetadata']]
def extract_value(field):
    """Unwrap one Data API field such as {'longValue': 5} or {'isNull': True}."""
    if field.get('isNull'):
        return None
    for key in ('stringValue', 'longValue', 'doubleValue', 'booleanValue', 'blobValue'):
        if key in field:
            return field[key]
    return None
rows = []
for record in result['Records']:
    rows.append({columns[i]: extract_value(f) for i, f in enumerate(record)})
# ── Handle pagination for large result sets ──
while result.get('NextToken'):
    result = redshift_data.get_statement_result(Id=statement_id, NextToken=result['NextToken'])
    for record in result['Records']:
        rows.append({columns[i]: extract_value(f) for i, f in enumerate(record)})
df = pd.DataFrame(rows, columns=columns)  # keeps column names even when no rows come back
print(df)
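Runs several different SQL statements in array order as a single transaction: each statement starts only after the previous one completes, and if any statement fails the batch's work is rolled back (e.g., TRUNCATE then COPY then ANALYZE). One caveat for this example: in Redshift, TRUNCATE commits immediately and cannot be rolled back, so use DELETE FROM instead if the reload must be all-or-nothing. Each sub-statement gets its own Id, queryable individually via describe_statement() with Id in the form parentId:index.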
# ── Truncate + COPY + ANALYZE as one batch ──
response = redshift_data.batch_execute_statement(
ClusterIdentifier='my-redshift-cluster',
Database='analytics',
DbUser='etl_service_user',
Sqls=[
"TRUNCATE TABLE staging.customers",
"""
COPY staging.customers
FROM 's3://my-lake/silver/customers/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS PARQUET
""",
"ANALYZE staging.customers",
]
)
batch_id = response['Id']
print("Batch Id:", batch_id)
# Poll describe_statement(Id=batch_id) — when FINISHED, all sub-statements ran
batch_execute_statement() call truncates the staging table, loads it via COPY, and refreshes statistics with ANALYZE — all orchestrated from Airflow with no JDBC connection.
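When a batch ends in FAILED, the parent status alone does not say which statement broke. For a batch, `describe_statement()` also returns a `SubStatements` list with per-statement `Id`, `Status`, `QueryString`, and `Error`. The sketch below (the helper name `failed_substatements` is hypothetical) pulls out the failing entries from that response.

```python
def failed_substatements(desc):
    """Return (sub_id, sql, error) for every failed or aborted sub-statement.

    `desc` is the dict returned by redshift_data.describe_statement(Id=batch_id).
    """
    return [
        (sub["Id"], sub.get("QueryString", "").strip(), sub.get("Error"))
        for sub in desc.get("SubStatements", [])
        if sub.get("Status") in ("FAILED", "ABORTED")
    ]
```

A typical call would be `failed_substatements(redshift_data.describe_statement(Id=batch_id))` once the batch has reached a terminal state.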
Returns past statement executions for auditing/debugging — filter by ClusterIdentifier, Database, Status, or a StatementName you assigned. Useful for finding "what ran in the last hour and did it fail?" without a custom audit table.
# ── Find recent failed statements for troubleshooting ──
response = redshift_data.list_statements(
ClusterIdentifier='my-redshift-cluster',
Status='FAILED',
MaxResults=20
)
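for stmt in response['Statements']:
    # list_statements returns no error text, so look it up per statement
    error = redshift_data.describe_statement(Id=stmt['Id']).get('Error')
    print(stmt['Id'], stmt.get('QueryString', '')[:60], stmt['Status'], error)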
Production code never repeats the submit/poll/fetch boilerplate inline — wrap it in a helper with exponential backoff (per 29.30.2) so any pipeline step can run SQL and get a DataFrame back in one call.
import boto3, time
import pandas as pd
redshift_data = boto3.client('redshift-data', region_name='us-east-1')
def run_redshift_sql(sql, cluster_id, database, db_user, max_wait_seconds=300):
    """Execute SQL on Redshift Data API and return a pandas DataFrame."""
    # 1. Submit
    resp = redshift_data.execute_statement(
        ClusterIdentifier=cluster_id, Database=database, DbUser=db_user, Sql=sql
    )
    stmt_id = resp['Id']
    # 2. Poll with exponential backoff
    waited, delay = 0, 1
    while True:
        desc = redshift_data.describe_statement(Id=stmt_id)
        status = desc['Status']
        if status == 'FINISHED':
            break
        if status in ('FAILED', 'ABORTED'):
            raise RuntimeError(f"Redshift query {status}: {desc.get('Error')}")
        if waited >= max_wait_seconds:
            raise TimeoutError(f"Statement {stmt_id} did not finish in {max_wait_seconds}s")
        time.sleep(delay)
        waited += delay
        delay = min(delay * 2, 10)  # cap backoff at 10s
    # 3. Fetch (handles non-SELECT statements with no result set)
    if not desc.get('HasResultSet', False):
        return pd.DataFrame()  # e.g. COPY / TRUNCATE return nothing
    result = redshift_data.get_statement_result(Id=stmt_id)
    columns = [c['name'] for c in result['ColumnMetadata']]
    def _val(f):
        if f.get('isNull'):
            return None
        for k in ('stringValue', 'longValue', 'doubleValue', 'booleanValue', 'blobValue'):
            if k in f:
                return f[k]
        return None
    rows = [{columns[i]: _val(f) for i, f in enumerate(rec)} for rec in result['Records']]
    while result.get('NextToken'):
        result = redshift_data.get_statement_result(Id=stmt_id, NextToken=result['NextToken'])
        rows += [{columns[i]: _val(f) for i, f in enumerate(rec)} for rec in result['Records']]
    return pd.DataFrame(rows, columns=columns)
# ── Usage ──
df = run_redshift_sql(
    sql="SELECT pipeline_name, run_count FROM v_daily_pipeline_summary",
    cluster_id='my-redshift-cluster', database='analytics', db_user='etl_service_user'
)
print(df)
get_job_run), Athena (get_query_execution), and EMR steps in 29.30.6–29.30.8: boto3's async AWS APIs almost always follow submit → poll → fetch. Once you've internalized this pattern, you can apply it everywhere.
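Because the poll step is the same for every service, it can be factored out once. The sketch below is illustrative rather than part of the original course code: `poll_until_terminal` is a hypothetical helper, and the caller supplies a zero-argument `get_state` function that wraps whichever describe/get call applies. The `sleep` parameter exists only so the loop can be exercised without real waiting.

```python
import time

def poll_until_terminal(get_state, success, failure, max_wait=300.0,
                        first_delay=1.0, max_delay=10.0, sleep=time.sleep):
    """Call get_state() with capped exponential backoff until a terminal state."""
    waited, delay = 0.0, first_delay
    while True:
        state = get_state()
        if state in success:
            return state
        if state in failure:
            raise RuntimeError(f"job ended in state {state}")
        if waited >= max_wait:
            raise TimeoutError(f"still {state} after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # double the delay, up to the cap
```

For the Redshift Data API this might be called as `poll_until_terminal(lambda: redshift_data.describe_statement(Id=stmt_id)['Status'], {'FINISHED'}, {'FAILED', 'ABORTED'})`; for Glue or Athena only the lambda and the state sets change.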
| API Call | Service | What it does | Key Parameters |
|---|---|---|---|
| execute_statement() | rds-data | Run one SQL statement (synchronous; result returned in the response) | resourceArn, secretArn, database, sql, parameters[] |
| batch_execute_statement() | rds-data | Run same SQL with multiple parameter sets | resourceArn, secretArn, database, sql, parameterSets[] |
| begin_transaction() | rds-data | Start a multi-statement transaction | resourceArn, secretArn, database |
| commit_transaction() / rollback_transaction() | rds-data | End the transaction | resourceArn, secretArn, transactionId |
| execute_statement() | redshift-data | Submit SQL asynchronously, returns Id | ClusterIdentifier/WorkgroupName, Database, DbUser/SecretArn, Sql |
| describe_statement() | redshift-data | Poll status of a submitted statement | Id → Status (SUBMITTED/PICKED/STARTED/FINISHED/FAILED/ABORTED) |
| get_statement_result() | redshift-data | Fetch result rows (paginated) | Id, NextToken → Records, ColumnMetadata |
| batch_execute_statement() | redshift-data | Run multiple different SQL statements as one unit | ClusterIdentifier, Database, DbUser, Sqls[] |
| list_statements() | redshift-data | List statement execution history | ClusterIdentifier, Status, StatementName |
| cancel_statement() | redshift-data | Cancel a running statement | Id |
stringValue for an integer column). BadRequestException (Redshift) — invalid ClusterIdentifier/DbUser combination, or the cluster is paused. StatementTimeoutException — query exceeded the Data API's max runtime (currently capped — long-running transforms should go through Spark/Glue, not the Data API). AccessDeniedException — the IAM role lacks redshift-data:* or rds-data:* permissions, or GetSecretValue on the linked secret.
COPY/UNLOAD (Module 29.10) or run the heavy work in Spark/Glue — use the Data API for control-plane SQL: audit writes, watermark updates, small lookups, and triggering COPY/MERGE statements.
Pipeline Patterns P1 – P8
These are the 8 production-grade end-to-end pipeline architectures that every senior Data Engineer must know. Each pattern combines multiple AWS services with boto3, error handling, audit logging, and observability. Study these as complete blueprints.
A file lands in S3. That event triggers a chain reaction: SQS buffers the notification, Lambda validates and kicks off a Glue ETL job, Glue transforms and writes Parquet, a Crawler updates the Glue Catalog, Lambda writes an audit record to DynamoDB, and SNS sends a success notification. This is the most common batch ingestion pattern in AWS data platforms.
# ── Pattern 1: File Arrival Batch Pipeline ──────────────────────────────
# Lambda handler, triggered by SQS which receives S3 event notifications
import boto3, json, time
from datetime import datetime, timezone
from urllib.parse import unquote_plus
from botocore.exceptions import ClientError
s3 = boto3.client('s3')
glue = boto3.client('glue')
dynamo = boto3.resource('dynamodb')
sns = boto3.client('sns')
GLUE_JOB_NAME = 'raw-to-silver-transform'
GLUE_CRAWLER = 'silver-catalog-crawler'
AUDIT_TABLE = 'pipeline-audit'
SNS_TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789:pipeline-alerts'
def lambda_handler(event, context):
    # NOTE: Glue and the crawler are polled inline, so their combined runtime
    # must fit inside the Lambda timeout (15 minutes maximum).
    # ── Step 1: Parse S3 key from SQS message ──────────────────────────
    for record in event['Records']:
        body = json.loads(record['body'])
        if 'Records' not in body:
            continue  # e.g. the s3:TestEvent sent when notifications are configured
        s3_event = body['Records'][0]
        bucket = s3_event['s3']['bucket']['name']
        # Object keys are URL-encoded in S3 events (space becomes '+', '=' becomes '%3D')
        key = unquote_plus(s3_event['s3']['object']['key'])
        run_id = context.aws_request_id
        try:
            # ── Step 2: Validate file exists and is non-zero ────────────────
            head = s3.head_object(Bucket=bucket, Key=key)
            file_size = head['ContentLength']
            if file_size == 0:
                raise ValueError(f"Empty file: s3://{bucket}/{key}")
            print(f"✅ File validated: {key} ({file_size} bytes)")
            # ── Step 3: Start Glue ETL Job ──────────────────────────────────
            glue_response = glue.start_job_run(
                JobName=GLUE_JOB_NAME,
                Arguments={
                    '--source_bucket': bucket,
                    '--source_key': key,
                    '--run_id': run_id
                }
            )
            job_run_id = glue_response['JobRunId']
            print(f"🚀 Glue job started: {job_run_id}")
            # ── Step 4: Poll Glue job until terminal state ──────────────────
            while True:
                run_detail = glue.get_job_run(JobName=GLUE_JOB_NAME, RunId=job_run_id)
                state = run_detail['JobRun']['JobRunState']
                if state in ('SUCCEEDED', 'FAILED', 'STOPPED', 'ERROR', 'TIMEOUT'):
                    break
                time.sleep(15)
            # ── Step 5: Start Glue Crawler to update catalog ───────────────
            if state == 'SUCCEEDED':
                glue.start_crawler(Name=GLUE_CRAWLER)
                # poll crawler until READY
                while True:
                    crawler_state = glue.get_crawler(Name=GLUE_CRAWLER)['Crawler']['State']
                    if crawler_state == 'READY':
                        break
                    time.sleep(10)
            # ── Step 6: Write audit record to DynamoDB ─────────────────────
            table = dynamo.Table(AUDIT_TABLE)
            table.put_item(Item={
                'run_id': run_id,
                'job_name': GLUE_JOB_NAME,
                'source_key': key,
                'status': state,
                'glue_run_id': job_run_id,
                'file_size_bytes': file_size,
                'timestamp': datetime.now(timezone.utc).isoformat()
            })
            # ── Step 7: Publish SNS notification ───────────────────────────
            msg = f"Pipeline {'SUCCESS' if state == 'SUCCEEDED' else 'FAILURE'}\nFile: s3://{bucket}/{key}\nGlue run: {job_run_id}\nStatus: {state}"
            sns.publish(TopicArn=SNS_TOPIC_ARN, Subject=f"Pipeline {state}", Message=msg)
        except ClientError as e:
            print(f"❌ ClientError: {e.response['Error']['Code']}: {e.response['Error']['Message']}")
            sns.publish(TopicArn=SNS_TOPIC_ARN, Subject="Pipeline FAILED", Message=str(e))
            raise  # re-raise so SQS retries / routes to DLQ
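The parsing in Step 1 is the part of this pattern that most often breaks in practice, because one SQS message can carry several S3 records, S3 sends a test message with no `Records` at all, and object keys arrive URL-encoded. A minimal, standalone sketch of a more defensive parser follows; `extract_s3_objects` is a hypothetical helper name, not something defined elsewhere in this module.

```python
import json
from urllib.parse import unquote_plus

def extract_s3_objects(sqs_event):
    """Return every (bucket, key) pair carried by an SQS-wrapped S3 notification."""
    objects = []
    for record in sqs_event.get("Records", []):
        body = json.loads(record["body"])
        # s3:TestEvent messages have no "Records" list, so they yield nothing
        for s3_event in body.get("Records", []):
            bucket = s3_event["s3"]["bucket"]["name"]
            # Keys are URL-encoded in the event: '+' is a space, '%3D' is '='
            key = unquote_plus(s3_event["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects
```

The handler could then loop over `extract_s3_objects(event)` and run the validate / Glue / audit steps once per object.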
A cron-based EventBridge rule fires Lambda every morning. Lambda reads pipeline config from DynamoDB, spins up an EMR cluster, submits a Spark step that reads S3 and writes to Redshift, polls until complete, then terminates the cluster and publishes metrics + alerts. Cost-efficient because the cluster lives only for the job duration.
# ── Pattern 2: Daily Scheduled Batch on EMR ─────────────────────────────
import boto3, time, json
from datetime import datetime, timezone
from botocore.exceptions import ClientError, WaiterError
emr = boto3.client('emr')
dynamo = boto3.resource('dynamodb')
cw = boto3.client('cloudwatch')
sns = boto3.client('sns')
SNS_ARN = 'arn:aws:sns:us-east-1:123456789:emr-alerts'
AUDIT_TBL = 'pipeline-audit'
def lambda_handler(event, context):
    # NOTE: the waiters below block inside the handler, so cluster start-up plus
    # the Spark step must finish within the Lambda timeout (15 minutes maximum).
    run_id = context.aws_request_id
    start_ts = datetime.now(timezone.utc)
    cluster_id = None
    try:
        # ── Step 1: Read pipeline config from DynamoDB ──────────────────
        table = dynamo.Table('pipeline-config')
        config = table.get_item(Key={'pipeline_id': 'daily-s3-to-redshift'})['Item']
        s3_input = config['s3_input_path']
        rs_table = config['redshift_table']
        emr_release = config.get('emr_release', 'emr-7.1.0')
        # ── Step 2: Spin up EMR cluster ─────────────────────────────────
        cluster_response = emr.run_job_flow(
            Name=f"daily-pipeline-{run_id[:8]}",
            ReleaseLabel=emr_release,
            Applications=[{'Name': 'Spark'}, {'Name': 'Hadoop'}],
            Instances={
                'MasterInstanceType': 'm5.xlarge',
                'SlaveInstanceType': 'm5.2xlarge',
                'InstanceCount': 3,
                'Ec2KeyName': 'my-keypair',
                'KeepJobFlowAliveWhenNoSteps': True  # keeps cluster up to add steps
            },
            JobFlowRole='EMR_EC2_DefaultRole',
            ServiceRole='EMR_DefaultRole',
            AutoTerminationPolicy={'IdleTimeout': 3600},  # auto-kill if idle 1 hour
            Configurations=[
                {'Classification': 'spark-defaults',
                 'Properties': {'spark.sql.shuffle.partitions': '200',
                                'spark.executor.memory': '8g'}}
            ],
            LogUri='s3://my-logs/emr/',
            VisibleToAllUsers=True
        )
        cluster_id = cluster_response['JobFlowId']
        print(f"🚀 Cluster started: {cluster_id}")
        # ── Step 3: Wait until the cluster is RUNNING or WAITING ────────
        waiter = emr.get_waiter('cluster_running')
        waiter.wait(ClusterId=cluster_id)
        # ── Step 4: Submit Spark step ────────────────────────────────────
        step_response = emr.add_job_flow_steps(
            JobFlowId=cluster_id,
            Steps=[{
                'Name': 's3-to-redshift-transform',
                'ActionOnFailure': 'CONTINUE',
                'HadoopJarStep': {
                    'Jar': 'command-runner.jar',
                    'Args': [
                        'spark-submit', '--deploy-mode', 'cluster',
                        '--py-files', 's3://my-bucket/deps/utils.zip',
                        's3://my-bucket/jobs/transform.py',
                        '--input', s3_input,
                        '--output-table', rs_table,
                        '--run-id', run_id
                    ]
                }
            }]
        )
        step_id = step_response['StepIds'][0]
        # ── Step 5: Poll step until complete ─────────────────────────────
        # step_complete raises WaiterError if the step ends FAILED or CANCELLED
        step_waiter = emr.get_waiter('step_complete')
        step_waiter.wait(ClusterId=cluster_id, StepId=step_id,
                         WaiterConfig={'Delay': 30, 'MaxAttempts': 120})
        step_detail = emr.describe_step(ClusterId=cluster_id, StepId=step_id)
        final_state = step_detail['Step']['Status']['State']
        print(f"Step final state: {final_state}")
        # ── Step 6: Terminate cluster ───────────────────────────────────
        emr.terminate_job_flows(JobFlowIds=[cluster_id])
        # ── Step 7: Publish CloudWatch metric ────────────────────────────
        duration_s = (datetime.now(timezone.utc) - start_ts).total_seconds()
        cw.put_metric_data(
            Namespace='DataPipelines',
            MetricData=[
                {'MetricName': 'PipelineDurationSeconds', 'Value': duration_s,
                 'Unit': 'Seconds', 'Dimensions': [{'Name': 'Pipeline', 'Value': 'daily-s3-redshift'}]},
                {'MetricName': 'PipelineSuccess',
                 'Value': 1 if final_state == 'COMPLETED' else 0,
                 'Unit': 'Count',
                 'Dimensions': [{'Name': 'Pipeline', 'Value': 'daily-s3-redshift'}]}
            ]
        )
        # ── Step 8: SNS alert ────────────────────────────────────────────
        sns.publish(
            TopicArn=SNS_ARN,
            Subject=f"EMR Pipeline {final_state}",
            Message=f"Cluster: {cluster_id}\nStep: {step_id}\nDuration: {duration_s:.0f}s\nState: {final_state}"
        )
    except (ClientError, WaiterError) as e:
        # WaiterError is not a ClientError; catch both so a failed step still cleans up
        if cluster_id:
            emr.terminate_job_flows(JobFlowIds=[cluster_id])  # always clean up
        sns.publish(TopicArn=SNS_ARN, Subject="EMR Pipeline FAILED", Message=str(e))
        raise
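The cost argument for this pattern only holds if the cluster is terminated on every exit path, including exceptions that are not `ClientError`. One way to make that guarantee structural is a context manager with a `finally` block. This is a sketch under stated assumptions: `ephemeral_cluster` is a hypothetical helper, and `emr` is expected to be a `boto3.client('emr')` instance or a compatible stand-in.

```python
from contextlib import contextmanager

@contextmanager
def ephemeral_cluster(emr, **run_job_flow_kwargs):
    """Create an EMR cluster and always terminate it when the block exits."""
    cluster_id = emr.run_job_flow(**run_job_flow_kwargs)["JobFlowId"]
    try:
        yield cluster_id
    finally:
        # Runs on success, on any exception, and on early return
        emr.terminate_job_flows(JobFlowIds=[cluster_id])
```

Steps 3 through 5 would then sit inside `with ephemeral_cluster(emr, Name=..., ReleaseLabel=..., ...) as cluster_id:`, and the explicit termination calls in Step 6 and the `except` branch become unnecessary.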
Instead of hardcoding source/target paths in each job, you store pipeline configurations in a DynamoDB control table. One orchestrator Lambda reads all active pipelines, loops through them, starts the correct Glue job with dynamic arguments, polls completion, and writes per-pipeline audit records. Adding a new pipeline = adding a row to DynamoDB — no code change needed.
# ── Pattern 3: Metadata-Driven Multi-Pipeline Orchestrator ──────────────
import boto3, time
from boto3.dynamodb.conditions import Attr
from datetime import datetime, timezone
from botocore.exceptions import ClientError
import traceback
glue = boto3.client('glue')
dynamo = boto3.resource('dynamodb')
cw = boto3.client('cloudwatch')
sns = boto3.client('sns')
GLUE_JOB = 'generic-etl-job'  # one reusable Glue job, parameterized
CONFIG_TABLE = 'pipeline-config'
AUDIT_TABLE = 'pipeline-audit'
SNS_ARN = 'arn:aws:sns:us-east-1:123456789:pipeline-dlq'
def get_active_pipelines():
    """Scan DynamoDB config table for all active pipelines."""
    table = dynamo.Table(CONFIG_TABLE)
    results = []
    response = table.scan(FilterExpression=Attr('is_active').eq(True))
    results.extend(response['Items'])
    while 'LastEvaluatedKey' in response:
        response = table.scan(
            FilterExpression=Attr('is_active').eq(True),
            ExclusiveStartKey=response['LastEvaluatedKey']
        )
        results.extend(response['Items'])
    return results
def run_pipeline(pipeline_cfg, run_id):
    """Start and poll one Glue job run. Return (state, error_message)."""
    pid = pipeline_cfg['pipeline_id']
    start_t = datetime.now(timezone.utc)
    response = glue.start_job_run(
        JobName=GLUE_JOB,
        Arguments={
            '--pipeline_id': pid,
            '--source_path': pipeline_cfg['source_path'],
            '--target_table': pipeline_cfg['target_table'],
            '--run_id': run_id
        }
    )
    job_run_id = response['JobRunId']
    print(f"  [{pid}] Glue run started: {job_run_id}")
    # Poll until terminal state
    while True:
        detail = glue.get_job_run(JobName=GLUE_JOB, RunId=job_run_id)
        state = detail['JobRun']['JobRunState']
        if state in ('SUCCEEDED', 'FAILED', 'STOPPED', 'ERROR', 'TIMEOUT'):
            break
        time.sleep(20)
    duration = (datetime.now(timezone.utc) - start_t).total_seconds()
    # ExecutionTime is Glue's runtime in seconds, not a row count;
    # row counts have to come from a custom metric the job publishes
    exec_seconds = int(detail['JobRun'].get('ExecutionTime', 0))
    err_msg = detail['JobRun'].get('ErrorMessage', '')
    # Write per-pipeline audit record
    audit = dynamo.Table(AUDIT_TABLE)
    audit.put_item(Item={
        'run_id': run_id + '#' + pid,
        'pipeline_id': pid,
        'glue_run_id': job_run_id,
        'status': state,
        'duration_s': str(duration),
        'glue_execution_s': exec_seconds,
        'error_message': err_msg,
        'timestamp': datetime.now(timezone.utc).isoformat()
    })
    # Publish pipeline-level CloudWatch metric
    cw.put_metric_data(
        Namespace='DataPipelines',
        MetricData=[{
            'MetricName': 'PipelineSuccess',
            'Value': 1 if state == 'SUCCEEDED' else 0,
            'Unit': 'Count',
            'Dimensions': [{'Name': 'PipelineId', 'Value': pid}]
        }]
    )
    return state, err_msg
def lambda_handler(event, context):
    # NOTE: pipelines run one after another, so their total runtime must fit
    # inside the Lambda timeout (15 minutes maximum).
    run_id = context.aws_request_id
    pipelines = get_active_pipelines()
    print(f"Found {len(pipelines)} active pipelines")
    failures = []
    for cfg in pipelines:
        try:
            state, err = run_pipeline(cfg, run_id)
            if state != 'SUCCEEDED':
                failures.append({'pipeline_id': cfg['pipeline_id'], 'error': err})
        except Exception as e:
            failures.append({'pipeline_id': cfg['pipeline_id'], 'error': str(e)})
    if failures:
        sns.publish(TopicArn=SNS_ARN, Subject="Multi-Pipeline Failures",
                    Message=f"Failed pipelines:\n{failures}")
    print(f"Run complete. Failures: {len(failures)}/{len(pipelines)}")
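The orchestrator above runs pipelines strictly one after another, so total wall-clock time is the sum of all job durations. A common variation is to start every run first and then poll them together. The sketch below illustrates that idea only; `run_all` is a hypothetical function, it omits the audit and metric writes, and it assumes the Glue job's maximum concurrency setting is high enough to allow the runs to overlap.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "ERROR", "TIMEOUT"}

def run_all(glue, job_name, pipelines, poll_seconds=20, sleep=time.sleep):
    """Start one Glue run per pipeline config, then poll them as a group."""
    pending = {}
    for cfg in pipelines:
        run = glue.start_job_run(
            JobName=job_name,
            Arguments={
                "--pipeline_id": cfg["pipeline_id"],
                "--source_path": cfg["source_path"],
                "--target_table": cfg["target_table"],
            },
        )
        pending[cfg["pipeline_id"]] = run["JobRunId"]
    final = {}
    while pending:
        # Check every unfinished run once per polling round
        for pid, job_run_id in list(pending.items()):
            detail = glue.get_job_run(JobName=job_name, RunId=job_run_id)
            state = detail["JobRun"]["JobRunState"]
            if state in TERMINAL_STATES:
                final[pid] = state
                del pending[pid]
        if pending:
            sleep(poll_seconds)
    return final
```

The returned dict maps each `pipeline_id` to its final state, which the handler can turn into the same failure list and SNS message as before.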
CDC (Change Data Capture) captures every INSERT, UPDATE, and DELETE from a source database and streams them as events. Debezium or AWS DMS reads the database transaction log and publishes events to an MSK (Kafka) topic. Spark Structured Streaming consumes those events and uses MERGE INTO on a Delta table to apply inserts, updates, and deletes in near-real-time.
# ── Pattern 4: CDC Streaming Pipeline with Delta MERGE ──────────────────
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, get_json_object
from pyspark.sql.types import StructType, StructField, StringType, LongType
from delta.tables import DeltaTable
spark = SparkSession.builder \
.appName("cdc-streaming") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
KAFKA_BROKERS = "b-1.mycluster.kafka.us-east-1.amazonaws.com:9092"
KAFKA_TOPIC = "db.public.orders"
DELTA_TABLE_PATH = "s3://my-lake/silver/orders/"
CHECKPOINT_PATH = "s3://my-lake/checkpoints/orders-cdc/"
# Schema for the CDC after-image payload
order_schema = StructType([
StructField("order_id", LongType(), False),
StructField("customer_id",LongType(), True),
StructField("amount", StringType(), True),
StructField("status", StringType(), True),
StructField("updated_at", StringType(), True)
])
# Read from MSK (Kafka)
raw_stream = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KAFKA_BROKERS) \
.option("subscribe", KAFKA_TOPIC) \
.option("startingOffsets", "latest") \
.option("maxOffsetsPerTrigger", 10000) \
.load()
# Parse CDC envelope (Debezium JSON format)
parsed = raw_stream.select(
get_json_object(col("value").cast("string"), "$.op").alias("op"),
from_json(
get_json_object(col("value").cast("string"), "$.after"),
order_schema
).alias("after"),
get_json_object(col("value").cast("string"), "$.before.order_id").alias("delete_id")
)
def apply_cdc_batch(batch_df, batch_id):
    """Apply one micro-batch of CDC events to Delta table."""
    # Assumes at most one event per order_id in each micro-batch: MERGE fails when
    # several source rows match the same target row, so deduplicate to the latest
    # event per key first if that does not hold for your source.
    batch_df.cache()
    delta_tbl = DeltaTable.forPath(spark, DELTA_TABLE_PATH)
    # ── Apply INSERTS and UPDATES (op = 'c', 'u', 'r') ─────────────
    upsert_df = batch_df \
        .filter(col("op").isin('c', 'u', 'r')) \
        .select("after.*")
    if upsert_df.count() > 0:
        delta_tbl.alias("tgt").merge(
            upsert_df.alias("src"),
            "tgt.order_id = src.order_id"
        ).whenMatchedUpdateAll() \
         .whenNotMatchedInsertAll() \
         .execute()
    # ── Apply DELETES (op = 'd') ────────────────────────────────────
    delete_ids = batch_df \
        .filter(col("op") == 'd') \
        .select(col("delete_id").cast(LongType()).alias("order_id"))
    if delete_ids.count() > 0:
        delta_tbl.alias("tgt").merge(
            delete_ids.alias("src"),
            "tgt.order_id = src.order_id"
        ).whenMatchedDelete() \
         .execute()
    batch_df.unpersist()
    print(f"✅ Batch {batch_id} applied to Delta")
# Start the streaming query
query = parsed.writeStream \
.foreachBatch(apply_cdc_batch) \
.option("checkpointLocation", CHECKPOINT_PATH) \
.trigger(processingTime="2 minutes") \
.start()
query.awaitTermination()
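One detail the streaming job has to get right: a single micro-batch can contain several changes to the same `order_id` (for example an insert followed by two updates), and a Delta MERGE raises an error when more than one source row matches the same target row. The usual remedy is to reduce each batch to the latest event per key before merging. The plain-Python sketch below shows only that reduction rule; it assumes each event carries an ordering field such as Debezium's `ts_ms`, which the parsed stream above does not currently extract, and `latest_event_per_key` is a hypothetical name. In Spark the same rule would typically be expressed with a window over the key ordered by that field.

```python
def latest_event_per_key(events, key="order_id", order_by="ts_ms"):
    """Keep only the most recent change event for each key (last write wins)."""
    latest = {}
    for event in events:
        current = latest.get(event[key])
        # On equal timestamps the later-arriving event replaces the earlier one
        if current is None or event[order_by] >= current[order_by]:
            latest[event[key]] = event
    return list(latest.values())
```

After this reduction, a key whose surviving event has `op == 'd'` goes to the delete merge and everything else goes to the upsert merge, so the two merges can no longer disagree about the same row.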
Athena has no built-in boto3 waiter. You must manually poll get_query_execution() until the state is SUCCEEDED, FAILED, or CANCELLED, then use a paginator to fetch results. This pattern is used in Lambda, Glue Python Shell jobs, and Airflow DAGs to run SQL on S3 data and convert results to a DataFrame for further processing.
# ── Pattern 5: Athena Query Automation ──────────────────────────────────
import boto3, time
from botocore.exceptions import ClientError
athena = boto3.client('athena')
OUTPUT_LOCATION = 's3://my-athena-results/query-results/'
DATABASE = 'silver_db'
WORKGROUP = 'primary'
def run_athena_query(sql: str, database: str = DATABASE) -> list[dict]:
    """
    Run an Athena query and return results as a list of dicts.
    Raises RuntimeError on query failure.
    """
    # ── Step 1: Start query ──────────────────────────────────────────
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={'OutputLocation': OUTPUT_LOCATION},
        WorkGroup=WORKGROUP
    )
    qid = response['QueryExecutionId']
    print(f"⏳ Athena query started: {qid}")
    # ── Step 2: Poll until terminal state (no built-in waiter!) ─────
    delay = 2
    while True:
        detail = athena.get_query_execution(QueryExecutionId=qid)
        state = detail['QueryExecution']['Status']['State']
        if state == 'SUCCEEDED':
            print(f"✅ Query succeeded: {qid}")
            break
        elif state in ('FAILED', 'CANCELLED'):
            reason = detail['QueryExecution']['Status'].get('StateChangeReason', 'Unknown')
            raise RuntimeError(f"Athena query {state}: {reason}")
        # Exponential backoff capped at 30s
        time.sleep(delay)
        delay = min(delay * 1.5, 30)
    # ── Step 3: Paginate results ────────────────────────────────────
    paginator = athena.get_paginator('get_query_results')
    pages = paginator.paginate(QueryExecutionId=qid)
    rows = []
    headers = None
    for page in pages:
        result_set = page['ResultSet']
        if headers is None:
            # First row of first page = column names
            headers = [c['Label'] for c in result_set['ResultSetMetadata']['ColumnInfo']]
            data_rows = result_set['Rows'][1:]  # skip header row
        else:
            data_rows = result_set['Rows']
        for row in data_rows:
            # Every value arrives as a string; a missing VarCharValue means NULL
            values = [cell.get('VarCharValue', None) for cell in row['Data']]
            rows.append(dict(zip(headers, values)))
    print(f"📊 Fetched {len(rows)} rows")
    return rows
# ── Usage ────────────────────────────────────────────────────────────────
sql = """
SELECT customer_id, SUM(amount) AS total_spent
FROM silver_db.orders
WHERE order_date >= DATE('2024-01-01')
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 1000
"""
results = run_athena_query(sql)
# Convert to pandas (in Lambda / Glue Python Shell)
import pandas as pd
df = pd.DataFrame(results)
print(df.head())
# Or write to S3 as Parquet
df.to_parquet('/tmp/top_customers.parquet')
s3 = boto3.client('s3')
s3.upload_file('/tmp/top_customers.parquet', 'my-bucket', 'gold/top_customers.parquet')
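Note that `get_query_results` hands back every cell as a string in `VarCharValue`, so `total_spent` in the DataFrame above is text until it is cast. Each entry in `ColumnInfo` also carries a `Type`, which can drive the conversion. The sketch below is a hedged illustration: `cast_athena_row` and the `_CASTS` table are hypothetical, and the type names listed are the ones I would expect Athena to report, so verify them against your own result metadata before relying on the mapping.

```python
from decimal import Decimal

# Assumed Athena type names mapped to Python converters; anything else stays a string
_CASTS = {
    "tinyint": int, "smallint": int, "integer": int, "bigint": int,
    "float": float, "double": float,
    "decimal": Decimal,
    "boolean": lambda v: v == "true",
}

def cast_athena_row(row, column_info):
    """Convert one get_query_results row into a dict of typed Python values."""
    typed = {}
    for cell, col in zip(row["Data"], column_info):
        raw = cell.get("VarCharValue")  # key is absent when the value is NULL
        cast = _CASTS.get(col["Type"], str)
        typed[col["Label"]] = None if raw is None else cast(raw)
    return typed
```

Inside `run_athena_query`, this would replace the `dict(zip(headers, values))` line, with `column_info` taken from `result_set['ResultSetMetadata']['ColumnInfo']` on the first page.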
get_query_results includes the column headers as the very first row in the first page. Always skip Rows[0] of the first page, or you'll have the column names mixed in with your data.
Large enterprises split data into multiple AWS accounts: a raw data account, a processing account, a consumers account. Your pipeline running in Account A needs to read from S3 in Account B and write results back. The solution is STS AssumeRole: your code assumes a role in the target account, gets temporary credentials, and uses them to build boto3 clients for that account.
# ── Pattern 6: Cross-Account Data Access via STS AssumeRole ─────────────
import boto3, json
from botocore.exceptions import ClientError
def get_cross_account_session(target_role_arn: str, session_name: str = 'CrossAccountSession'):
    """
    Assume a role in a different AWS account and return a boto3 Session
    with temporary credentials valid for 1 hour.
    """
    sts = boto3.client('sts')
    # Verify who we are (useful for debugging)
    identity = sts.get_caller_identity()
    print(f"Caller identity: {identity['Arn']}")
    try:
        assumed = sts.assume_role(
            RoleArn=target_role_arn,
            RoleSessionName=session_name,
            # 1 hour is the default; a role's MaxSessionDuration can allow up to
            # 12 hours, but role chaining is capped at 1 hour
            DurationSeconds=3600
        )
    except ClientError as e:
        if e.response['Error']['Code'] == 'AccessDenied':
            raise PermissionError(
                f"Cannot assume role {target_role_arn}. Check trust policy."
            ) from e
        raise
    creds = assumed['Credentials']
    session = boto3.Session(
        aws_access_key_id=creds['AccessKeyId'],
        aws_secret_access_key=creds['SecretAccessKey'],
        aws_session_token=creds['SessionToken'],
        region_name='us-east-1'
    )
    print(f"✅ Assumed role in target account. Expires: {creds['Expiration']}")
    return session
# ── Usage: read from Account B, write results to Account A ──────────────
TARGET_ROLE = 'arn:aws:iam::999999999999:role/cross-account-data-reader'
TARGET_BUCKET = 'account-b-raw-data'
SOURCE_PREFIX = 'orders/year=2024/month=01/'
# Get session with target account credentials
target_session = get_cross_account_session(TARGET_ROLE)
# Build clients in target account
s3_target = target_session.client('s3')
glue_target = target_session.client('glue')
# List files in Account B's S3
paginator = s3_target.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=TARGET_BUCKET, Prefix=SOURCE_PREFIX)
file_list = []
for page in pages:
    for obj in page.get('Contents', []):
        file_list.append(f"s3://{TARGET_BUCKET}/{obj['Key']}")
print(f"Found {len(file_list)} files in Account B")
# Read Glue table definition from Account B's catalog
glue_table = glue_target.get_table(DatabaseName='source_db', Name='orders')
print(f"Schema: {glue_table['Table']['StorageDescriptor']['Columns']}")
# Write results back to Account A (default boto3 session = Account A)
s3_source = boto3.client('s3')  # uses default Account A creds
s3_source.put_object(
    Bucket='account-a-processed',
    Key='cross-account-results/manifest.json',
    Body=json.dumps(file_list).encode()  # valid JSON, unlike str(list)
)
print("✅ Results written to Account A")
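Temporary credentials expire, so a long-running job that assumes the role once at start-up will begin failing partway through. A small cache that re-assumes the role shortly before expiry avoids that. The following is a sketch under assumptions: `AssumedRoleCache` is a hypothetical class, `sts` is expected to behave like `boto3.client('sts')`, and the 5-minute refresh margin is an arbitrary illustrative choice. It relies on boto3 returning `Expiration` as a timezone-aware `datetime`.

```python
from datetime import datetime, timedelta, timezone

class AssumedRoleCache:
    """Re-assume a role shortly before its temporary credentials expire."""

    def __init__(self, sts, role_arn, session_name="CrossAccountSession",
                 refresh_margin=timedelta(minutes=5)):
        self._sts = sts
        self._role_arn = role_arn
        self._session_name = session_name
        self._margin = refresh_margin
        self._creds = None

    def credentials(self, now=None):
        """Return valid credentials, calling assume_role only when needed."""
        now = now or datetime.now(timezone.utc)
        expiring = (
            self._creds is None
            or self._creds["Expiration"] - now <= self._margin
        )
        if expiring:
            self._creds = self._sts.assume_role(
                RoleArn=self._role_arn, RoleSessionName=self._session_name
            )["Credentials"]
        return self._creds
```

Callers would build a fresh `boto3.Session` from `cache.credentials()` before each batch of work instead of holding one session for the whole run.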
sts:AssumeRole. Without the trust policy, you get AccessDenied no matter what the permission policy says.
After a Glue ETL job completes, a Data Quality gate runs to validate the output. If the DQ score is below threshold, the pipeline stops, which prevents bad data from reaching downstream consumers. DQ results are stored in DynamoDB for auditing, a metric is published to CloudWatch, and SNS sends an alert on failure. This is a fail-fast, fail-loud design.
# ── Pattern 7: Data Quality Gate ────────────────────────────────────────
import boto3, time, json
from datetime import datetime, timezone
from botocore.exceptions import ClientError
from decimal import Decimal
glue = boto3.client('glue')
dynamo = boto3.resource('dynamodb')
cw = boto3.client('cloudwatch')
sns = boto3.client('sns')
RULESET_NAME = 'orders-silver-ruleset'
GLUE_DATABASE = 'silver_db'
GLUE_TABLE = 'orders'
DQ_THRESHOLD = 0.95 # 95% rules must pass
AUDIT_TABLE = 'pipeline-dq-audit'
SNS_ARN = 'arn:aws:sns:us-east-1:123456789:dq-alerts'
def run_dq_gate(run_id: str, pipeline_name: str) -> bool:
    """
    Run Glue DQ evaluation. Returns True if passed, False if failed.
    Writes results to DynamoDB and CloudWatch.
    """
    # ── Step 1: Start DQ evaluation ─────────────────────────────────
    eval_response = glue.start_data_quality_ruleset_evaluation_run(
        DataSource={
            'GlueTable': {'DatabaseName': GLUE_DATABASE, 'TableName': GLUE_TABLE}
        },
        Role='arn:aws:iam::123456789:role/GlueServiceRole',
        RulesetNames=[RULESET_NAME]
    )
    eval_run_id = eval_response['RunId']
    print(f"⏳ DQ evaluation started: {eval_run_id}")
    # ── Step 2: Poll until complete ──────────────────────────────────
    while True:
        status = glue.get_data_quality_ruleset_evaluation_run(RunId=eval_run_id)
        state = status['Status']
        if state in ('SUCCEEDED', 'FAILED', 'STOPPED', 'ERROR', 'TIMEOUT'):
            break
        time.sleep(15)
    # ── Step 3: Parse DQ results ─────────────────────────────────────
    # If the run itself did not succeed there are no results, so the score is 0
    result_ids = status.get('ResultIds', [])
    passed_rules = 0
    total_rules = 0
    failed_detail = []
    for result_id in result_ids:
        result = glue.get_data_quality_result(ResultId=result_id)
        rule_results = result.get('RuleResults', [])
        for rule in rule_results:
            total_rules += 1
            if rule['Result'] == 'PASS':
                passed_rules += 1
            else:
                failed_detail.append({
                    'rule': rule.get('Name', 'unknown'),
                    'result': rule['Result'],
                    'message': rule.get('EvaluationMessage', '')
                })
    dq_score = (passed_rules / total_rules) if total_rules > 0 else 0.0
    passed = dq_score >= DQ_THRESHOLD
    print(f"DQ Score: {dq_score:.2%} ({passed_rules}/{total_rules} rules passed)")
    # ── Step 4: Write DQ audit record to DynamoDB ───────────────────
    table = dynamo.Table(AUDIT_TABLE)
    table.put_item(Item={
        'run_id': run_id,
        'pipeline_name': pipeline_name,
        'eval_run_id': eval_run_id,
        'dq_score': Decimal(str(round(dq_score, 4))),  # DynamoDB rejects float
        'passed_rules': passed_rules,
        'total_rules': total_rules,
        'passed': passed,
        'failed_rules': json.dumps(failed_detail),
        'timestamp': datetime.now(timezone.utc).isoformat()
    })
    # ── Step 5: Publish DQ metric to CloudWatch ──────────────────────
    cw.put_metric_data(
        Namespace='DataPipelines',
        MetricData=[{
            'MetricName': 'DQScore',
            'Value': dq_score * 100,
            'Unit': 'Percent',
            'Dimensions': [{'Name': 'Pipeline', 'Value': pipeline_name}]
        }]
    )
    # ── Step 6: Alert on failure ──────────────────────────────────────
    if not passed:
        msg = (
            f"❌ DQ GATE FAILED for {pipeline_name}\n"
            f"Score: {dq_score:.2%} (threshold: {DQ_THRESHOLD:.0%})\n"
            f"Failed rules:\n{json.dumps(failed_detail, indent=2)}"
        )
        sns.publish(TopicArn=SNS_ARN, Subject=f"DQ Failure: {pipeline_name}", Message=msg)
    return passed
# ── Usage in pipeline ───────────────────────────────────────────────────
import uuid
run_id = str(uuid.uuid4())
dq_passed = run_dq_gate(run_id, 'orders-silver-pipeline')
if not dq_passed:
    print("🛑 Pipeline halted due to DQ failure. Check DynamoDB audit table.")
    raise SystemExit(1)  # non-zero exit fails the Glue job, so the downstream table is not updated
print("✅ DQ gate passed. Proceeding to Gold layer.")
Any production pipeline will fail. The question is: what happens when it does? This pattern implements a complete error recovery architecture: classify the error, write it to DynamoDB, log it to CloudWatch, alert via SNS, retry with exponential backoff, and after max retries route to a Dead Letter Queue. Operations can inspect the DLQ and trigger manual or automated re-runs.
# ── Pattern 8: Error Recovery Pipeline ──────────────────────────────────
import boto3, time, json, traceback, uuid
from datetime import datetime, timezone
from botocore.exceptions import ClientError
dynamo = boto3.resource('dynamodb')
logs = boto3.client('logs')
cw = boto3.client('cloudwatch')
sns = boto3.client('sns')
sqs = boto3.client('sqs')
AUDIT_TABLE = 'pipeline-errors'
LOG_GROUP = '/data-pipelines/errors'
LOG_STREAM = 'pipeline-error-stream'
SNS_ARN = 'arn:aws:sns:us-east-1:123456789012:pipeline-oncall'  # placeholder 12-digit account ID
DLQ_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/pipeline-dlq'
MAX_RETRIES = 5
# Errors that are recoverable (worth retrying)
RECOVERABLE_ERRORS = {
    'ThrottlingException', 'ServiceUnavailableException',
    'ProvisionedThroughputExceededException', 'RequestExpired',
    'InternalError', 'InternalServiceError'
}
def log_error_to_dynamo(run_id, pipeline_name, error_code, error_msg, attempt):
    """Write structured error record to DynamoDB."""
    table = dynamo.Table(AUDIT_TABLE)
    table.put_item(Item={
        'error_id': str(uuid.uuid4()),
        'run_id': run_id,
        'pipeline_name': pipeline_name,
        'error_code': error_code,
        'error_message': error_msg,
        'attempt_number': attempt,
        'is_recoverable': error_code in RECOVERABLE_ERRORS,
        'timestamp': datetime.now(timezone.utc).isoformat()
    })
def log_error_to_cloudwatch(pipeline_name, error_msg):
    """Push error message to CloudWatch Logs (the log group must already exist)."""
    try:
        # Create the log stream on first use; ignore the error if it already exists
        try:
            logs.create_log_stream(logGroupName=LOG_GROUP, logStreamName=LOG_STREAM)
        except logs.exceptions.ResourceAlreadyExistsException:
            pass
        # PutLogEvents now ignores sequence tokens, so none is looked up or passed
        logs.put_log_events(
            logGroupName=LOG_GROUP,
            logStreamName=LOG_STREAM,
            logEvents=[{
                'timestamp': int(datetime.now(timezone.utc).timestamp() * 1000),
                'message': json.dumps({'pipeline': pipeline_name, 'error': error_msg})
            }]
        )
    except Exception as e:
        print(f"Warning: CloudWatch log write failed: {e}")  # don't crash on logging failure
def send_to_dlq(run_id, pipeline_name, error_msg):
    """Send failed job to Dead Letter Queue for manual inspection / replay."""
    sqs.send_message(
        QueueUrl=DLQ_URL,
        MessageBody=json.dumps({
            'run_id': run_id,
            'pipeline_name': pipeline_name,
            'error': error_msg,
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'action': 'REQUIRES_MANUAL_REVIEW'
        }),
        MessageAttributes={
            'pipeline': {'DataType': 'String', 'StringValue': pipeline_name}
        }
    )
    print(f"📬 Sent to DLQ: {pipeline_name}")
def run_with_recovery(pipeline_fn, run_id: str, pipeline_name: str):
    """
    Execute pipeline_fn with full error recovery:
    retry recoverable errors with exponential backoff,
    route to DLQ after max retries.
    Only botocore ClientError is handled here; any other exception
    propagates to the caller.
    """
    delay = 2
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            print(f"▶ Attempt {attempt}/{MAX_RETRIES}: {pipeline_name}")
            pipeline_fn()
            print(f"✅ Pipeline succeeded on attempt {attempt}")
            return True  # success
        except ClientError as e:
            error_code = e.response['Error']['Code']
            error_msg = e.response['Error']['Message']
            tb = traceback.format_exc()
            print(f"❌ Attempt {attempt} failed: [{error_code}] {error_msg}")
            # 1. Write to DynamoDB audit
            log_error_to_dynamo(run_id, pipeline_name, error_code, error_msg, attempt)
            # 2. Write to CloudWatch Logs
            log_error_to_cloudwatch(pipeline_name, f"[{error_code}] {error_msg}")
            # 3. Increment CloudWatch failure metric
            cw.put_metric_data(
                Namespace='DataPipelines',
                MetricData=[{'MetricName': 'PipelineFailure', 'Value': 1,
                             'Unit': 'Count',
                             'Dimensions': [{'Name': 'Pipeline', 'Value': pipeline_name}]}]
            )
            # 4. Check if recoverable and if retries remain
            if error_code not in RECOVERABLE_ERRORS:
                print(f"🛑 Non-recoverable error: {error_code}. Skipping retries.")
                sns.publish(TopicArn=SNS_ARN, Subject=f"Non-recoverable: {pipeline_name}",
                            Message=f"[{error_code}] {error_msg}\n\n{tb}")
                send_to_dlq(run_id, pipeline_name, error_msg)
                return False
            if attempt == MAX_RETRIES:
                print(f"🛑 Max retries ({MAX_RETRIES}) reached.")
                sns.publish(TopicArn=SNS_ARN, Subject=f"Max retries: {pipeline_name}",
                            Message=f"Gave up after {MAX_RETRIES} attempts.\n[{error_code}] {error_msg}")
                send_to_dlq(run_id, pipeline_name, error_msg)
                return False
            # 5. Wait with exponential backoff before retry
            print(f"⏳ Retrying in {delay}s...")
            time.sleep(delay)
            delay = min(delay * 2, 30)  # cap at 30s
# ── Usage ────────────────────────────────────────────────────────────────
import sys
def my_pipeline():
    # your actual boto3 / Spark code here
    glue = boto3.client('glue')
    # start_job_run only submits the run; failures inside the job surface later via get_job_run
    glue.start_job_run(JobName='my-etl-job')
run_id = str(uuid.uuid4())
success = run_with_recovery(my_pipeline, run_id, 'my-etl-pipeline')
if not success:
    sys.exit(1)  # signal failure to orchestrator (Airflow / EventBridge)
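The retry timing in run_with_recovery is easier to reason about as a pure function. This sketch reproduces the same doubling-with-cap schedule (start at 2s, cap at 30s) and can optionally add random jitter, a common refinement that is not part of the code above. The name backoff_delays and its parameters are hypothetical.

```python
import random

def backoff_delays(max_retries=5, base=2.0, cap=30.0, jitter=False, rng=None):
    """Return the wait in seconds before each retry.

    With max_retries attempts there are max_retries - 1 waits, because the
    final failed attempt is routed to the DLQ instead of sleeping.
    """
    rng = rng or random.Random()
    delays = []
    delay = base
    for _ in range(max_retries - 1):
        # Full jitter picks uniformly in [0, delay]; otherwise use delay as-is
        delays.append(rng.uniform(0, delay) if jitter else delay)
        delay = min(delay * 2, cap)
    return delays

print(backoff_delays())               # [2.0, 4.0, 8.0, 16.0]
print(backoff_delays(max_retries=8))  # [2.0, 4.0, 8.0, 16.0, 30.0, 30.0, 30.0]
```

With MAX_RETRIES = 5 the waits sum to 30 seconds, so the 30s cap only takes effect if MAX_RETRIES is raised above 5. Jitter is worth considering when many pipelines retry against the same throttled service at once.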
The run_with_recovery wrapper is reusable across all pipelines: pass any callable as pipeline_fn, and the error classification, audit, alerting, and DLQ routing all happen automatically. This is the kind of framework that differentiates senior engineers.

| # | Pattern | Trigger | Key Services | Use When |
|---|---|---|---|---|
| P1 | File Arrival Batch | S3 Event → SQS | Lambda, Glue, DynamoDB, SNS | Files land unpredictably |
| P2 | Scheduled EMR Batch | EventBridge cron | EMR, Redshift, CloudWatch, SNS | Daily/hourly large-scale Spark |
| P3 | Metadata-Driven Multi | EventBridge | DynamoDB, Glue, CloudWatch | Many similar pipelines |
| P4 | CDC Streaming | Continuous | MSK, Spark Streaming, Delta | Near-real-time DB sync |
| P5 | Athena Automation | On-demand | Athena, S3, Pandas | SQL on S3 in Lambda/Glue |
| P6 | Cross-Account Access | Any | STS, S3, Glue | Multi-account enterprise setup |
| P7 | DQ Gate | Post-ETL | Glue DQ, DynamoDB, CloudWatch | Prevent bad data in Gold layer |
| P8 | Error Recovery | On failure | DynamoDB, CW Logs, SNS, SQS DLQ | Every production pipeline |
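
For the P8 row, the DLQ only helps if something reads it. A minimal sketch of a replay loop follows, assuming each message body is the JSON document written by send_to_dlq in Pattern 8. The function replay_dlq and the handler callback are hypothetical, and the SQS client is passed in as an argument so the loop can be exercised with a stub.

```python
import json

def replay_dlq(sqs_client, queue_url, handler, max_batches=10):
    """Read messages from the DLQ and pass each decoded body to handler.

    A message is deleted only when handler returns True (replayed successfully).
    Returns the number of messages replayed and deleted.
    """
    replayed = 0
    for _ in range(max_batches):
        resp = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,  # SQS maximum per call
            WaitTimeSeconds=5,       # long polling
        )
        messages = resp.get("Messages", [])
        if not messages:
            break  # nothing visible in the queue right now
        for msg in messages:
            body = json.loads(msg["Body"])
            if handler(body):
                sqs_client.delete_message(
                    QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
                )
                replayed += 1
            # Undeleted messages become visible again after the queue's
            # visibility timeout and can be retried on a later run.
    return replayed
```

Against real AWS, this could be called with boto3.client('sqs') and the DLQ_URL from Pattern 8, with a handler that looks up body['pipeline_name'] and re-runs that pipeline. Deleting only on success keeps failed replays in the queue for another attempt.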
Module 29 Summary
You have now covered the complete AWS + Boto3 toolkit for production Data Engineering. Here is a quick recap of every area covered.