1.1
Why Big Data?
Understanding the fundamental problems that Big Data solves and why traditional systems break down at scale.
Traditional Database Limitations
THE PROBLEM
What are Traditional Databases?
Traditional databases (MySQL, PostgreSQL, Oracle) are relational systems designed in the 1970s–90s when data sizes were in MBs or GBs. They store data in rows and columns on a single machine. They work great for structured, predictable data — like a bank's transactions or a shop's inventory.
🧠 Analogy
A traditional database is like a single filing cabinet. Perfect for a small office. But when your office grows to 100,000 employees, one cabinet won't work.
Key Limitations at Scale
When data grows to TBs or PBs, traditional databases hit these walls:
Storage Ceiling
Single machine disk is limited. You can't store 500TB on one server easily.
Query Speed
Full table scans on billions of rows take hours on a single CPU.
Joins Are Expensive
JOINs between huge tables need large amounts of RAM; once the working set no longer fits in memory, the database spills to disk and queries slow to a crawl.
Single Point of Failure
If the one server goes down, everything stops. No redundancy.
Unstructured Data
Relational tables handle videos, images, raw logs, and deeply nested JSON poorly; BLOB and JSON columns exist but are awkward to query and scale.
Real-Time Streams
Ingesting and processing millions of events per second is beyond what a single-node RDBMS can sustain.
📖 Real Example
Facebook reportedly generates ~4 petabytes of data per day. A single traditional MySQL server could not store, let alone process, one day's data in a reasonable time. This is why they built distributed systems.
The 3 V's of Big Data
| V | Meaning | Example | Challenge |
|---|---|---|---|
| Volume | Size of data | Google: 8.5 billion searches/day | Storage & processing |
| Velocity | Speed of data | Twitter: 500K tweets/minute | Real-time ingestion |
| Variety | Types of data | Logs, images, JSON, CSV, video | Schema flexibility |
Vertical Scaling (Scale Up)
SOLUTION #1
What is Vertical Scaling?
Vertical scaling = making your single machine more powerful. You add more RAM, faster CPUs, bigger SSDs to the same server. This was the original answer to database performance problems.
🧠 Analogy
Vertical scaling = upgrading your car's engine. Instead of adding more cars, you put a bigger engine in the one you have.
concept
# Before vertical scaling
Server: 8 CPU cores, 32GB RAM, 1TB disk
Query time for 100M rows: 45 minutes
# After vertical scaling
Server: 64 CPU cores, 512GB RAM, 10TB SSD
Query time for 100M rows: 3 minutes
# But... you hit the hardware ceiling
# Max RAM per machine: ~12TB (extremely expensive!)
# Still ONE point of failure — server dies = downtime
Cost Explodes
Doubling CPU power often costs 4x more. Non-linear cost curve.
Physical Limit
There's a maximum RAM/CPU that fits on one motherboard.
Downtime Required
Upgrading hardware = server must go offline.
⚠️ Key Insight
Vertical scaling is a band-aid. At petabyte scale, no single machine is powerful enough.
Horizontal Scaling (Scale Out)
THE REAL SOLUTION
What is Horizontal Scaling?
Horizontal scaling = adding more machines and distributing the work across all of them. Instead of one powerful server, you have many commodity servers working together as a cluster.
🧠 Analogy
Instead of one super-chef cooking everything, you hire 100 regular chefs. Each cooks a portion of the meal simultaneously. Total cooking time drops dramatically.
| Aspect | Vertical (Scale Up) | Horizontal (Scale Out) |
|---|---|---|
| Strategy | Bigger machine | More machines |
| Cost | Grows faster than linearly | Roughly linear |
| Limit | Hardware ceiling | Virtually unlimited |
| Downtime | Required for upgrade | Add nodes without stopping |
| Fault Tolerance | Single point of failure | Redundant nodes |
| Complexity | Simple | Requires distributed coordination |
🔑 Key Insight
Spark is built for horizontal scaling. A Spark cluster of 100 commodity machines can work through terabytes of data in minutes by running operations in parallel across all nodes.
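The split, process in parallel, merge pattern can be sketched on one machine. This is a minimal illustration, not Spark itself: `split_into_partitions` and `count_matches` are hypothetical helpers, the dataset is made up, and Python threads stand in for cluster nodes (so it shows the structure, not a real speedup).

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(rows, num_partitions):
    """Deal rows round-robin into num_partitions chunks."""
    return [rows[i::num_partitions] for i in range(num_partitions)]

def count_matches(partition, country):
    """Work done independently by one 'node' on its own partition."""
    return sum(1 for row in partition if row["country"] == country)

# Hypothetical dataset: 1,000 events, every 5th one from India
events = [{"id": i, "country": "IN" if i % 5 == 0 else "US"} for i in range(1000)]
partitions = split_into_partitions(events, num_partitions=4)

# Each worker thread stands in for one machine in the cluster
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(lambda p: count_matches(p, "IN"), partitions))

total = sum(partial_counts)  # merge step: combine the per-node results
print(partial_counts, total)
```

Adding capacity here means raising `num_partitions` and the number of workers; the per-partition function does not change.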
Distributed Systems
CORE CONCEPT
What is a Distributed System?
A distributed system is a collection of independent computers that appear to users as a single system. Machines coordinate via message passing over a network. Spark, Hadoop, Kafka are all distributed systems.
🧠 Analogy
An ant colony is a distributed system. No single ant knows the full plan, but through local rules and communication, thousands of ants build complex structures efficiently.
Concurrency
Many nodes work on different parts of the problem simultaneously.
Fault Tolerance
If 2 nodes fail out of 100, the others continue. Auto-recovery.
Message Passing
Nodes coordinate via network messages (not shared memory).
No Shared Memory
Each node has its own RAM. Data is shared by copying over the network (shuffle).
python — pyspark distributed execution
from pyspark.sql import SparkSession
# Connect to an existing Spark cluster (the master URL is an example)
spark = SparkSession.builder \
    .appName("MyDistributedApp") \
    .master("spark://cluster:7077") \
    .getOrCreate()
# This 10TB dataset is read as many partitions (chunks);
# each executor loads and processes its own partitions
df = spark.read.parquet("s3://data/events/")
# filter() runs in PARALLEL on every partition;
# groupBy() then shuffles rows so each city's rows meet on one node
result = df.filter(df.country == "IN") \
    .groupBy("city") \
    .count()
result.show()  # Action: triggers execution; results are sent to the driver
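What `filter` followed by `groupBy().count()` does across nodes can be sketched in plain Python. This is a simplified model under stated assumptions: the partitions are hypothetical, two reducers are assumed, and a CRC32 hash stands in for Spark's partitioner.

```python
import zlib
from collections import Counter, defaultdict

# Hypothetical partitions, one per node: (country, city) rows
partitions = [
    [("IN", "Pune"), ("US", "Austin"), ("IN", "Delhi")],
    [("IN", "Pune"), ("IN", "Pune"), ("US", "Boston")],
    [("IN", "Delhi"), ("IN", "Chennai")],
]

# Stage 1 (no network): each node filters and pre-aggregates its own partition
partials = [
    Counter(city for country, city in part if country == "IN")
    for part in partitions
]

# Shuffle: route every city's partial count to one reducer, chosen by key hash
NUM_REDUCERS = 2  # assumed for illustration
reducer_inbox = defaultdict(list)
for partial in partials:
    for city, n in partial.items():
        reducer_id = zlib.crc32(city.encode()) % NUM_REDUCERS
        reducer_inbox[reducer_id].append((city, n))

# Stage 2: each reducer sums the partial counts it received
result = {}
for inbox in reducer_inbox.values():
    for city, n in inbox:
        result[city] = result.get(city, 0) + n

print(sorted(result.items()))
```

Only the small partial counts cross the network, not the raw rows, which is why pre-aggregation before the shuffle matters.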
CAP Theorem
THEORY
What is CAP Theorem?
CAP Theorem (2000) states a distributed system can guarantee only 2 out of 3 properties at the same time. Fundamental constraint every data engineer must know.
C
Consistency
Every read gets the most recent write. All nodes see same data at same time.
A
Availability
Every request gets a response (even if stale). System always up.
P
Partition Tolerance
System works even if network packets are lost between nodes.
⚠️ The Catch
In real distributed systems, P is non-negotiable: networks ALWAYS fail. So your real choice is between C and A.
| System | Guarantees | Trade-off |
|---|---|---|
| HDFS | CP | Blocks writes during partition. Prioritizes consistency. |
| Cassandra | AP | Always available, may return slightly stale data. |
| Kafka | AP (by default; configurable) | Accepts messages while replicas lag. Setting acks=all with min.insync.replicas trades availability for consistency. |
| Delta Lake | CP | ACID via transaction log. No dirty reads. |
🧠 Analogy
2 bank branches lose network connection.
CP: Stop all transactions until restored (consistent but unavailable).
AP: Both continue, reconcile conflicts later (available but inconsistent).
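The bank-branch trade-off can be sketched as a toy model. The `Replica` and `Cluster` classes below are hypothetical and heavily simplified (two replicas, one account, no reconciliation step); they only show how the two modes behave during a partition.

```python
class Replica:
    def __init__(self):
        self.balance = 100

class Cluster:
    """Two replicas of one account; mode is 'CP' or 'AP'."""
    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def withdraw(self, replica, amount):
        if self.partitioned:
            if self.mode == "CP":
                return False              # refuse: cannot confirm with the other replica
            replica.balance -= amount     # AP: accept locally, replicas diverge
            return True
        # Healthy network: apply the write on both replicas
        self.a.balance -= amount
        self.b.balance -= amount
        return True

cp = Cluster("CP")
cp.partitioned = True
ap = Cluster("AP")
ap.partitioned = True

cp_ok = cp.withdraw(cp.a, 30)
ap_ok = ap.withdraw(ap.a, 30)
print(cp_ok, cp.a.balance, cp.b.balance)   # CP: rejected, replicas still agree
print(ap_ok, ap.a.balance, ap.b.balance)   # AP: accepted, replicas now differ
```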
Data Locality
PERFORMANCE
What is Data Locality?
Data locality = moving computation to where the data lives, rather than moving data to computation. One of Spark's most critical performance optimizations.
🧠 Analogy
Instead of shipping all ingredients to a central kitchen, build a mini-kitchen at the farm and cook there (data locality). Only the cooked result travels, and it is much smaller!
| Level | Where is data? | Speed |
|---|---|---|
| PROCESS_LOCAL | Same JVM process (in memory) | Fastest — memory speed |
| NODE_LOCAL | Same machine, different process | Fast — local disk |
| RACK_LOCAL | Same network rack | OK — short network hop |
| ANY | Anywhere in cluster | Slow — cross-rack transfer |
python — spark locality config
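# Spark tries locality levels in order (best to worst)
# You can see the locality level in Spark UI → Stages → Tasks
# Configure how long Spark waits for a local slot before falling back.
# These are scheduler settings, so set them when the session is built
# (or via spark-submit --conf), not with spark.conf.set() at runtime.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config("spark.locality.wait", "3s") \
    .config("spark.locality.wait.node", "3s") \
    .config("spark.locality.wait.rack", "3s") \
    .getOrCreate()
# 3s is the default for all three
# Increase if your cluster has network bottlenecks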
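How a scheduler might rank free executors by locality can be sketched as follows. This is an assumption-laden toy, not Spark's scheduler: `locality_level` and `pick_executor` are hypothetical names, the hosts and racks are made up, and the waiting behaviour controlled by `spark.locality.wait` is left out.

```python
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]  # best to worst

def locality_level(task_data, executor):
    """Return the best locality level this executor can offer for the task's data."""
    if task_data.get("cached_in") == executor["id"]:
        return "PROCESS_LOCAL"
    if executor["host"] in task_data["hosts"]:
        return "NODE_LOCAL"
    if executor["rack"] in task_data["racks"]:
        return "RACK_LOCAL"
    return "ANY"

def pick_executor(task_data, free_executors):
    """Choose the free executor with the best (lowest-index) locality level."""
    return min(free_executors, key=lambda e: LEVELS.index(locality_level(task_data, e)))

# Hypothetical block: replicas on node1 and node2, both in rackA, not cached anywhere
block = {"hosts": {"node1", "node2"}, "racks": {"rackA"}, "cached_in": None}
executors = [
    {"id": "e1", "host": "node7", "rack": "rackB"},
    {"id": "e2", "host": "node3", "rack": "rackA"},
    {"id": "e3", "host": "node2", "rack": "rackA"},
]

best = pick_executor(block, executors)
print(best["id"], locality_level(block, best))
```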
🔑 Interview Tip
When asked "why is Spark fast?", data locality is a key answer. The task scheduler assigns tasks to nodes where the data already lives, minimizing expensive network I/O.
🧠 Quick Check: Which CAP property is NON-NEGOTIABLE in real distributed systems?
1.2
Big Data Ecosystem
The major tools and technologies that make up the modern big data stack — and how they work together.
Hadoop
FOUNDATION
What is Hadoop?
Apache Hadoop (2006) was the first widely adopted open-source framework for distributed storage and processing. Spark was later built to overcome Hadoop's limitations, primarily the disk I/O between every step.
HDFS
Distributed file system. Splits files into 128MB blocks across nodes with 3x replication.
MapReduce
Original processing engine. Map = split work, Reduce = combine. Slow due to disk I/O.
YARN
Resource manager. Decides which node gets how much CPU/memory per task.
⚡ Spark vs Hadoop MapReduce
MapReduce writes to disk after every step. Spark keeps data in memory across operations where it fits. Result: Spark can be 10–100x faster for iterative workloads (ML, graph processing).
HDFS — Hadoop Distributed File System
STORAGE
How HDFS Works
HDFS splits files into 128MB blocks across DataNodes. Each block is replicated 3 times by default. The NameNode (master) tracks where every block lives.
🧠 Analogy
HDFS is like a distributed library. The NameNode is the librarian who knows where every book is. DataNodes are shelves across different rooms. Every book has 3 copies: if one room floods, you still find the book.
bash — hdfs commands
# List files in HDFS
hdfs dfs -ls /user/data/
# Upload a file to HDFS
hdfs dfs -put localfile.csv /user/data/
# Check block locations and replication for a file
hdfs fsck /user/data/myfile.parquet -files -blocks
# The same data can then be read from PySpark (Python, not a shell command):
#   df = spark.read.csv("hdfs://namenode:9000/user/data/sales.csv")
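The block and replication numbers above translate into simple arithmetic. A small sketch, assuming the default 128MB block size and 3x replication and a hypothetical 1,000 MB file (`hdfs_footprint` is an illustrative helper, not an HDFS API):

```python
import math

BLOCK_SIZE_MB = 128   # default HDFS block size
REPLICATION = 3       # default replication factor

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, total block replicas, raw storage in MB)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION
    # The last block only occupies its actual size, so raw storage is size x replication
    raw_mb = file_size_mb * REPLICATION
    return blocks, replicas, raw_mb

blocks, replicas, raw_mb = hdfs_footprint(1000)  # hypothetical 1,000 MB file
print(blocks, replicas, raw_mb)
```

The NameNode has to track every one of those block replicas, which is one reason many small files are costly on HDFS.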
YARN — Resource Manager
SCHEDULING
What is YARN?
YARN (Yet Another Resource Negotiator) is the cluster resource manager. When you submit a Spark job, YARN decides how many containers (CPU + memory) to allocate, and on which nodes.
bash — submit spark to yarn
spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 10 \
--executor-memory 4g \
--executor-cores 2 \
my_spark_job.py
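It helps to estimate what the command above asks YARN for. The sketch below is back-of-the-envelope arithmetic, assuming Spark's default executor memory overhead of max(384 MB, 10% of executor memory); it ignores the driver container and YARN's rounding to its allocation increment.

```python
num_executors = 10
executor_memory_mb = 4 * 1024   # --executor-memory 4g
executor_cores = 2              # --executor-cores 2

# Assumption: default overhead = max(384 MB, 10% of executor memory)
overhead_mb = max(384, int(executor_memory_mb * 0.10))
container_mb = executor_memory_mb + overhead_mb   # what each YARN container requests

total_memory_gb = num_executors * container_mb / 1024
total_cores = num_executors * executor_cores

print(overhead_mb, container_mb, round(total_memory_gb, 1), total_cores)
```

So a request that looks like 40 GB is closer to 44 GB from the cluster's point of view.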
Apache Hive
SQL ON BIG DATA
What is Hive?
Apache Hive provides a SQL-like interface (HiveQL) over Hadoop data. It translates SQL into MapReduce/Tez/Spark jobs. Hive also provides the Hive Metastore — a central registry of table schemas used by Spark.
sql — hiveql
-- Create external Hive table over HDFS Parquet data
CREATE EXTERNAL TABLE sales (
order_id STRING,
amount DECIMAL(10,2),
city STRING
)
STORED AS PARQUET
LOCATION '/data/sales/';
-- The same table can then be queried from PySpark (Python, not HiveQL),
-- provided the SparkSession was built with .enableHiveSupport():
--   df = spark.sql("SELECT city, SUM(amount) FROM sales GROUP BY city")
🔑 Why It Matters for Spark
Spark integrates directly with Hive Metastore. spark.sql("SELECT * FROM my_hive_table") uses the metastore to discover table location and schema automatically.
Apache Spark
THE STAR OF THIS COURSE
What Makes Spark Special?
Apache Spark (2009, UC Berkeley) keeps data in memory across operations, which can make it 10–100x faster than Hadoop MapReduce on iterative workloads. It's a unified engine for batch, streaming, SQL, and ML.
In-Memory Processing
Data stays in RAM between operations where it fits, avoiding the disk writes MapReduce makes between steps.
Lazy Evaluation
Spark builds a query plan but executes only when you call an action. Enables optimization.
Unified Engine
Batch + Streaming + ML + SQL — all one framework.
PySpark API
Python wrapper around Spark. Write Python, runs distributed on a cluster.
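Lazy evaluation is easiest to see in a toy model. The `LazyDataset` class below is a hypothetical stand-in, not Spark's API: transformations only record a plan, and the action `collect()` is what actually runs it.

```python
class LazyDataset:
    """Records transformations; runs them only when an action is called."""
    def __init__(self, source, plan=()):
        self.source = source
        self.plan = plan   # tuple of ("filter" | "map", function)

    def filter(self, fn):
        return LazyDataset(self.source, self.plan + (("filter", fn),))

    def map(self, fn):
        return LazyDataset(self.source, self.plan + (("map", fn),))

    def collect(self):
        """Action: execute the whole recorded plan in one pass."""
        rows = iter(self.source)
        for kind, fn in self.plan:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)

calls = []
def expensive(x):
    calls.append(x)   # track which rows the function actually ran on
    return x * 10

ds = LazyDataset(range(5)).filter(lambda x: x % 2 == 0).map(expensive)
calls_before_action = len(calls)   # nothing has executed yet
result = ds.collect()
print(calls_before_action, result, calls)
```

Because the whole plan is known before anything runs, an engine is free to reorder or combine steps; Spark's optimizer exploits the same property.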
Apache Kafka
STREAMING
What is Kafka?
Apache Kafka is a distributed event streaming platform. Producers write events, Consumers read them. Spark Structured Streaming reads from Kafka in near real-time.
🧠 Analogy
Kafka is like a newspaper printing press. Publishers write news. The press stores all editions. Subscribers can read any edition at their own pace, even yesterday's paper.
python — spark reads from kafka
# Read real-time data from Kafka into Spark Structured Streaming
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "kafka-host:9092") \
.option("subscribe", "user-events") \
.load()
# Kafka messages come as bytes — decode them
df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
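The producer/consumer model behind this can be sketched as an append-only log with per-consumer offsets. `TopicLog` is a hypothetical, single-process simplification (no partitions, brokers, or retention), meant only to show why consumers can read at their own pace.

```python
class TopicLog:
    """Append-only log; each consumer tracks its own read offset."""
    def __init__(self):
        self.events = []
        self.offsets = {}   # consumer name -> next offset to read

    def produce(self, event):
        self.events.append(event)
        return len(self.events) - 1   # offset of the new event

    def consume(self, consumer, max_events=10):
        start = self.offsets.get(consumer, 0)
        batch = self.events[start:start + max_events]
        self.offsets[consumer] = start + len(batch)
        return batch   # reading does not remove events from the log

log = TopicLog()
for event in ["login", "click", "purchase"]:
    log.produce(event)

fast = log.consume("dashboard")                  # reads everything available
slow = log.consume("audit", max_events=1)        # reads at its own pace
slow_next = log.consume("audit", max_events=1)   # resumes from its saved offset
print(fast, slow, slow_next)
```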
Airflow, Trino, Flink, HBase
ECOSYSTEM
Other Key Ecosystem Tools
| Tool | Type | Purpose | Spark Connection |
|---|---|---|---|
| Apache Airflow | Orchestrator | Schedules data pipeline DAGs | Triggers Spark jobs via SparkSubmitOperator |
| Trino (Presto) | SQL Engine | Interactive SQL across many data sources | Complementary — Trino for ad-hoc, Spark for ETL |
| Apache Flink | Stream Processor | True real-time event processing | Competitor for streaming; Spark Streaming is micro-batch |
| HBase | NoSQL DB | Random read/write on HDFS at row level | Spark can read/write HBase for random access |
1.3
Types of Data
The three categories of data Big Data systems must handle — and how Spark deals with each.
Structured Data
MOST COMMON IN SPARK
What is Structured Data?
Structured data has a predefined schema — rows and columns with fixed data types. Think of a spreadsheet or SQL table. Easiest type to work with in Spark.
🧠 Analogy
Structured data is like a perfectly organized Excel sheet: every column has a name and type, every row follows the same format. No surprises.
python — structured data in pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
spark = SparkSession.builder.appName("StructuredData").getOrCreate()
data = [
("Alice", 30, "Engineer", 95000.0),
("Bob", 25, "Designer", 72000.0),
("Carol", 35, "Manager", 120000.0)
]
schema = StructType([
StructField("name", StringType(), nullable=False),
StructField("age", IntegerType(), nullable=True),
StructField("role", StringType(), nullable=True),
StructField("salary", DoubleType(), nullable=True)
])
df = spark.createDataFrame(data, schema)
df.show()
# +-----+---+--------+--------+
# | name|age|    role|  salary|
# +-----+---+--------+--------+
# |Alice| 30|Engineer| 95000.0|
# |  Bob| 25|Designer| 72000.0|
# |Carol| 35| Manager|120000.0|
# +-----+---+--------+--------+
# Also from files:
df2 = spark.read.parquet("s3://bucket/transactions/")
df3 = spark.read.jdbc(url="jdbc:postgresql://...", table="orders")
CSV files
Parquet files
SQL Database tables
ORC files
Excel sheets
Semi-Structured Data
MOST COMMON IN REAL WORLD
What is Semi-Structured Data?
Semi-structured data has some organization (keys/tags) but no rigid schema. It's self-describing — structure is embedded in the data. JSON and XML are the most common. Rows can have different fields.
🧠 Analogy
Semi-structured data is like a business card. Everyone has name and contact info, but some have Twitter handles, some have fax numbers. Flexible structure, no fixed schema enforced.
python — semi-structured json in pyspark
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, ArrayType
json_data = [
('{"user":"alice","tags":["python","spark"],"address":{"city":"Bangalore"}}',),
('{"user":"bob","tags":["java"],"phone":"9999999999"}',) # different fields!
]
df = spark.createDataFrame(json_data, ["json_col"])
schema = StructType([
StructField("user", StringType()),
StructField("tags", ArrayType(StringType())),
StructField("address", StructType([StructField("city", StringType())]))
])
parsed = df.withColumn("parsed", from_json(col("json_col"), schema))
parsed.select("parsed.user", "parsed.tags", "parsed.address.city").show()
# +-----+----------------+---------+
# |alice|[python, spark] |Bangalore|
# | bob| [java] | null| ← missing fields become null
JSON files
XML files
AVRO files
Log files
Kafka messages
Unstructured Data
FASTEST GROWING
What is Unstructured Data?
Unstructured data has no predefined format or schema. Requires specialized tools (ML, NLP, Computer Vision) to extract meaning. Commonly cited estimates put roughly 70–80% of the world's data in this category.
🧠 Analogy
Unstructured data is a pile of photographs. Each contains valuable information, but you can't run SQL on it. You need AI/ML to describe what's in each photo before analysis.
python — unstructured text in pyspark
from pyspark.sql.functions import col, split, size, length
from pyspark.ml.feature import Tokenizer
reviews = [
    ("1", "This product is absolutely amazing! Best purchase ever."),
    ("2", "Terrible quality. Broke after 2 days. Very disappointed."),
]
df = spark.createDataFrame(reviews, ["id", "review"])
# Basic processing on raw text
df = df.withColumn("word_count", size(split(col("review"), " "))) \
    .withColumn("char_count", length(col("review")))
# Simple tokenization with Spark MLlib
# (for real NLP: use Spark NLP or a Pandas UDF with transformers)
tokenizer = Tokenizer(inputCol="review", outputCol="words")
words_df = tokenizer.transform(df)
Images (PNG, JPEG)
Videos (MP4)
Audio files
Text documents
Social media posts
PDF files
Complete Comparison
SUMMARY
| Property | Structured | Semi-Structured | Unstructured |
|---|---|---|---|
| Schema | Fixed, predefined | Flexible / self-describing | None |
| Storage | RDBMS, Parquet, ORC | JSON, XML, AVRO | S3/HDFS raw files |
| Query Method | SQL directly | SQL after parsing | Requires ML/NLP |
| Spark Tool | DataFrame API | from_json(), schema inference | MLlib, Python UDFs |
| % of World's Data (rough estimate) | ~20% | ~10% | ~70% |
1.4
File Formats
Deep dive into every file format used in Big Data, plus critical micro-topics: Row vs Columnar, Compression, Predicate Pushdown, Schema Evolution, Serialization, and Delta vs Iceberg vs Hudi.
All File Formats — Overview
OVERVIEW
CSV
Row · Text
- Human readable
- No built-in compression
- No schema
- Slow for big data
JSON
Row · Text
- Nested structures
- Schema embedded
- Verbose/large
- Flexible
XML
Row · Text
- Tags + attributes
- Very verbose
- Legacy systems
- Complex parsing
AVRO
Row · Binary
- Schema evolution
- Kafka standard
- Compact binary
- Streaming-friendly
ORC
Columnar · Binary
- Hive optimized
- Built-in indexes
- Great compression
- ACID support
PARQUET
Columnar · Binary
- Spark's default
- Predicate pushdown
- Column pruning
- Best for analytics
DELTA
Columnar + Log
- ACID transactions
- Time travel
- Schema evolution
- CDC support
ICEBERG
Open Table Format
- Hidden partitioning
- Partition evolution
- Multi-engine
- Snapshot isolation
HUDI
Open Table Format
- Native upserts
- COW + MOR types
- Incremental reads
- Record-level index
Row vs Columnar Storage
MICRO TOPIC — CRITICAL
Row-Oriented Storage (CSV, JSON, AVRO)
Stores data row by row on disk. To read a single column from 1 million rows, the reader must scan every row in full, including the columns it does not need. Best for OLTP (insert/update individual records).
concept — row storage layout
# Disk layout: each row stored together
# [Alice | 30 | 95000] [Bob | 25 | 72000] [Carol | 35 | 120000]
# Query: SELECT SUM(salary) FROM employees
# MUST read: name + age + salary for EVERY row
# Wasted: name + age columns read but never used!
Columnar Storage (Parquet, ORC)
Stores all values for each column together on disk. SELECT SUM(salary) reads only the salary column; all other columns are skipped entirely. This is why Parquet can be 10–100x faster for analytical queries.
concept — columnar storage layout
# Disk layout: each column stored together
# [name col: Alice | Bob | Carol]
# [age col: 30 | 25 | 35]
# [salary: 95000 | 72000 | 120000]
# Query: SELECT SUM(salary) → ONLY reads salary block!
# Savings: name + age columns NEVER loaded from disk
# Also: similar data compresses much better (ints next to ints)
| Use Case | Best Format | Why |
|---|---|---|
| Insert/update single records | Row (CSV, AVRO) | Access full row at once |
| Analytics: SUM, AVG over 1 column | Columnar (Parquet) | Only read needed column |
| Streaming ingestion (one event at a time) | AVRO (row) | Write one event at a time |
| Data warehouse queries | Parquet / ORC | Column pruning = fast scans |
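The two layouts can be modelled with plain Python lists to count how many values a SUM(salary) has to touch. This is only an in-memory analogy of the on-disk layouts described above, using the same three example rows.

```python
rows = [("Alice", 30, 95000), ("Bob", 25, 72000), ("Carol", 35, 120000)]

# Row layout: the values of one record sit together
row_store = [value for row in rows for value in row]

# Columnar layout: the values of one column sit together
column_store = {
    "name": [r[0] for r in rows],
    "age": [r[1] for r in rows],
    "salary": [r[2] for r in rows],
}

# Row layout: a scan passes over every stored value to reach each salary
row_values_scanned = len(row_store)
row_sum = sum(row_store[i] for i in range(2, len(row_store), 3))

# Columnar layout: only the salary column is touched
col_values_scanned = len(column_store["salary"])
col_sum = sum(column_store["salary"])

print(row_values_scanned, col_values_scanned, row_sum, col_sum)
```

Both layouts give the same answer; the difference is the amount of data scanned, which grows with the number of unused columns.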
Compression
MICRO TOPIC
Why Compression Matters
In Big Data, I/O is the bottleneck — smaller files = faster reads and less S3/network cost. Columnar formats compress much better because similar data (all integers, all strings) is adjacent on disk.
python — compression in pyspark
# Snappy: best balance of speed + ratio (default for Parquet)
df.write.parquet("s3://bucket/data/", compression="snappy")
# ZSTD: new standard, best ratio + speed combo
df.write.parquet("s3://bucket/data/", compression="zstd")
# Gzip: best ratio but slow and NOT splittable for CSV
df.write.csv("s3://bucket/data/", compression="gzip")
# Set default globally
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
| Algorithm | Speed | Ratio | Splittable | Best For |
|---|---|---|---|---|
| Snappy | Very fast | ~3:1 | Yes (inside Parquet/ORC) | Default, balanced |
| ZSTD | Fast | ~5:1 | Yes (inside Parquet/ORC) | Best for Parquet (recommended) |
| Gzip | Slow | ~5:1 | No for text files (CSV/JSON); yes inside Parquet | Archive / cold data |
| None | Fastest | None | Yes | Dev / testing only |
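Why adjacent similar values compress better can be shown with run-length encoding, one of the simple encodings columnar formats such as Parquet apply before a general-purpose codec. The sketch below is a toy with made-up data; real codecs like Snappy and ZSTD are far more sophisticated.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

# Hypothetical data: 7 orders, sorted by country
countries = ["IN"] * 4 + ["US"] * 3
order_ids = [1, 2, 3, 4, 5, 6, 7]

# Columnar: the country column is stored contiguously, so repeats form long runs
column_encoded = run_length_encode(countries)

# Row-oriented: country and order_id are interleaved, so no two neighbours repeat
interleaved = [v for pair in zip(countries, order_ids) for v in pair]
row_encoded = run_length_encode(interleaved)

print(column_encoded)
print(len(column_encoded), len(row_encoded))
```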
Predicate Pushdown
MICRO TOPIC — VERY IMPORTANT
What is Predicate Pushdown?
Predicate pushdown = pushing filter conditions down to the file/storage level so only relevant data is loaded into Spark's memory. The file format itself skips irrelevant row groups before Spark even sees the data.
🧠 Analogy
You want India orders from a 1TB file. Without pushdown: load all 1TB into Spark, then filter. With pushdown: Parquet's statistics say "rows 5M–7M are India", so only those 2M rows are read. Like an index in a book.
python — predicate pushdown demo
# Parquet stores min/max statistics per row group (128MB chunks)
# Spark checks: does this row group overlap the filter range?
# If not → entire row group SKIPPED without reading
df = spark.read.parquet("s3://data/orders/")
result = df.filter(df.order_date >= "2024-01-01")
# Verify pushdown is happening:
result.explain(True)
# Look for: PushedFilters: [GreaterThanOrEqual(order_date,2024-01-01)]
# Column pruning (also pushed down)
# Reads only 'city', 'amount', and the filter column 'country' from Parquet; all other columns are skipped
df.filter(df.country == "IN").select("city", "amount").show()
🔑 Interview Point
Predicate pushdown works with Parquet and ORC because they store per-column statistics. It does NOT work with CSV/JSON — they have no embedded statistics.
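Row-group skipping from min/max statistics can be sketched in a few lines. The row groups below are hypothetical and tiny, and `scan` is an illustrative helper, not a Parquet reader; the point is that a group whose maximum is below the filter bound is never read.

```python
# Hypothetical row groups with per-group min/max statistics on order_date
row_groups = [
    {"min": "2023-01-01", "max": "2023-06-30",
     "rows": ["2023-01-01", "2023-03-15", "2023-06-30"]},
    {"min": "2023-07-01", "max": "2023-12-31",
     "rows": ["2023-07-01", "2023-12-31"]},
    {"min": "2023-12-15", "max": "2024-03-31",
     "rows": ["2023-12-15", "2024-01-01", "2024-03-31"]},
]

def scan(row_groups, lower_bound):
    """Return matching rows and how many row groups were actually read."""
    matches, groups_read = [], 0
    for group in row_groups:
        if group["max"] < lower_bound:
            continue   # statistics prove no row can match: skip without reading
        groups_read += 1
        matches.extend(d for d in group["rows"] if d >= lower_bound)
    return matches, groups_read

# Equivalent of: WHERE order_date >= '2024-01-01' (ISO dates compare correctly as strings)
matches, groups_read = scan(row_groups, "2024-01-01")
print(matches, groups_read)
```

The third group still has to be filtered row by row, because its range only overlaps the predicate; statistics can rule groups out but cannot confirm every row.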
Schema Evolution
MICRO TOPIC
What is Schema Evolution?
Schema evolution = ability to change a table's schema (add columns, rename, change types) without breaking existing data or pipelines. Critical in production where schemas change as business evolves.
python — schema evolution in delta lake
# Original table: name, age only
df_v1 = spark.createDataFrame([("Alice", 30)], ["name", "age"])
df_v1.write.format("delta").save("s3://bucket/users")
# Business adds 'email' column later
df_v2 = spark.createDataFrame([("Bob", 25, "bob@example.com")], ["name", "age", "email"])
# mergeSchema: adds new columns automatically, no error!
df_v2.write.format("delta") \
.option("mergeSchema", "true") \
.mode("append") \
.save("s3://bucket/users")
# Old rows get null for new column — clean merge
spark.read.format("delta").load("s3://bucket/users").show()
# |Alice|30| null| ← old row, null for new column
# | Bob|25|bob@example.com|
| Format | Schema Evolution | Method |
|---|---|---|
| CSV / JSON | Manual only | No built-in support |
| Parquet | Add columns only | Schema merge on read |
| AVRO | Full (with defaults) | Schema Registry |
| Delta Lake | Full | mergeSchema option |
| Iceberg | Best-in-class | Schema evolution API |
| Hudi | Full | Schema-on-read |
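The add-column case that most formats support can be sketched with plain dictionaries. `merge_schemas` and `read_with_schema` are hypothetical helpers that mimic the behaviour described above (new columns are added, old rows read as null); they are not the API of any table format.

```python
def merge_schemas(old, new):
    """Union of columns; a type change for an existing column is rejected."""
    merged = dict(old)
    for name, dtype in new.items():
        if name in merged and merged[name] != dtype:
            raise TypeError(f"incompatible type change for column {name!r}")
        merged[name] = dtype
    return merged

def read_with_schema(rows, schema):
    """Project each row onto the schema, filling missing columns with None."""
    return [{name: row.get(name) for name in schema} for row in rows]

v1 = {"name": "string", "age": "int"}
v2 = {"name": "string", "age": "int", "email": "string"}
schema = merge_schemas(v1, v2)

table = read_with_schema(
    [{"name": "Alice", "age": 30},
     {"name": "Bob", "age": 25, "email": "bob@example.com"}],
    schema,
)
print(list(schema))
print(table)
```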
Serialization
MICRO TOPIC
What is Serialization?
Serialization = converting an in-memory object into bytes for storage or network transfer. Spark does this millions of times during shuffles. It's a critical performance factor.
🧠 Analogy
Packing a suitcase: your room (in-memory object) → packed suitcase (bytes on wire) → unpacked at destination (deserialization). Faster and smaller packing = better performance.
python — serialization config
from pyspark.sql import SparkSession
# spark.serializer must be set before the session starts;
# spark.conf.set() at runtime does not change it
# Default: Java serialization (org.apache.spark.serializer.JavaSerializer), slow with large output
# Kryo serialization: often several times faster and smaller (recommended for RDDs)
spark = SparkSession.builder \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryo.classesToRegister", "com.example.MyClass") \
    .getOrCreate()
# spark.kryo.classesToRegister: register custom classes for maximum Kryo performance
# For DataFrames: Tungsten UnsafeRow is used AUTOMATICALLY
# Binary format with no Java object overhead
# This is one reason DataFrames are faster than RDDs
| Serializer | Speed | Size | Best For |
|---|---|---|---|
| Java (default) | Slow | Large | Simple/small use cases |
| Kryo | Often several times faster than Java | Smaller | RDD-heavy workloads |
| Tungsten UnsafeRow | Fastest | Binary minimal | DataFrame API (automatic) |
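The size gap between general object serialization and a fixed binary layout can be seen with Python's standard library. This is an analogy only: `pickle` stands in for a general-purpose serializer and `struct` for a compact binary row format; neither is what Spark uses internally, and the record is made up.

```python
import pickle
import struct

record = {"user_id": 42, "amount": 99.5}   # hypothetical record

# General-purpose object serialization: flexible, but carries field names and type tags
pickled = pickle.dumps(record)

# Fixed binary layout: a 4-byte int followed by an 8-byte float, no per-record metadata
packed = struct.pack("<id", record["user_id"], record["amount"])

# Both round-trip back to the original values
restored = pickle.loads(pickled)
user_id, amount = struct.unpack("<id", packed)

print(len(pickled), len(packed))
```

The fixed layout wins because the schema is known up front and stored once, not with every record; the same idea is what lets a schema-aware DataFrame avoid per-object overhead.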
Delta vs Iceberg vs Hudi — Deep Comparison
MICRO TOPIC — INTERVIEW CRITICAL
Why Open Table Formats Exist
Raw Parquet/ORC files don't support ACID transactions, time travel, or upserts. Open Table Formats add a metadata layer on top of Parquet to provide database-like features for data lakes.
Query Engine (Spark / Trino / Flink / Athena)
↓
Table Format (Delta / Iceberg / Hudi) — ACID, time travel, upserts
↓
Parquet data files on S3 / HDFS / GCS
Side-by-Side Comparison
| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Creator | Databricks | Netflix | Uber |
| ACID Transactions | ✓ Full | ✓ Full | ✓ Full |
| Time Travel | ✓ Version/timestamp | ✓ Snapshot-based | ✓ Limited |
| Schema Evolution | ✓ Full | ✓ Best-in-class | ✓ Full |
| Partition Evolution | ✗ No | ✓ Best feature | Limited |
| Native Upserts | ✓ MERGE INTO | ✓ MERGE INTO | ✓ Native UPSERT |
| Multi-Engine Support | Databricks-first | Best (Spark+Trino+Flink) | Spark-primary |
| Best For | Databricks users | Multi-engine lakehouses | High upsert/CDC workloads |
Internals: Transaction Logs
concept — delta lake transaction log
# Delta Lake file structure
s3://bucket/my_table/
├── _delta_log/
│ ├── 00000000000000000000.json # version 0: CREATE TABLE
│ ├── 00000000000000000001.json # version 1: INSERT
│ ├── 00000000000000000002.json # version 2: UPDATE
│ └── 00000000000000000010.checkpoint.parquet # checkpoint every 10 commits
├── part-00000-abc.snappy.parquet # actual data files (Parquet)
└── part-00001-def.snappy.parquet
# Iceberg file structure
s3://bucket/my_iceberg_table/
├── metadata/
│ ├── v1.metadata.json # Table metadata + snapshot refs
│ ├── snap-123.avro # Manifest list (list of manifests)
│ └── manifest-456.avro # Manifest (list of data files + stats)
└── data/
└── country=IN/
└── part-00000.parquet
# Time travel with Delta
spark.read.format("delta") \
.option("versionAsOf", 5) \
.load("s3://bucket/my_table")
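# Time travel with Iceberg
# (use format/load: read options can be silently ignored with .table() on older Spark versions)
spark.read.format("iceberg") \
    .option("snapshot-id", 123456789) \
    .load("my_iceberg_table")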
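How a transaction log enables time travel can be sketched by replaying commits. The format below is a deliberate simplification with hypothetical file names: real Delta commit files are newline-delimited JSON with much richer add/remove actions, and `files_at_version` is an illustrative helper.

```python
import json

# Hypothetical, simplified commit files keyed by version number
commits = {
    0: '[{"add": "part-0.parquet"}]',
    1: '[{"add": "part-1.parquet"}]',
    2: '[{"remove": "part-0.parquet"}, {"add": "part-2.parquet"}]',
}

def files_at_version(commits, version):
    """Replay commits 0..version to find the data files that make up the table."""
    live = set()
    for v in sorted(commits):
        if v > version:
            break
        for action in json.loads(commits[v]):
            if "add" in action:
                live.add(action["add"])
            if "remove" in action:
                live.discard(action["remove"])
    return live

latest = files_at_version(commits, 2)
as_of_v1 = files_at_version(commits, 1)   # time travel: stop replaying early
print(sorted(latest), sorted(as_of_v1))
```

Old data files are only logically removed, which is why an earlier version can still be read until the files are physically cleaned up.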
🧠 Quick Check:
SELECT SUM(revenue) FROM sales WHERE country = 'IN' runs on a Parquet file. Which optimizations apply automatically?