MODULE 1 Big Data Fundamentals

1.1

Why Big Data?

Understanding the fundamental problems that Big Data solves and why traditional systems break down at scale.

🗄️

Traditional Database Limitations · THE PROBLEM

What are Traditional Databases?

Traditional databases (MySQL, PostgreSQL, Oracle) are relational systems designed in the 1970s–90s when data sizes were in MBs or GBs. They store data in rows and columns on a single machine. They work great for structured, predictable data — like a bank's transactions or a shop's inventory.

🧠 Analogy

A traditional database is like a single filing cabinet. Perfect for a small office. But when your office grows to 100,000 employees, one cabinet won't work.

Key Limitations at Scale

When data grows to TBs or PBs, traditional databases hit these walls:

💾

Storage Ceiling

Single machine disk is limited. You can't store 500TB on one server easily.

⚡

Query Speed

Full table scans on billions of rows take hours on a single CPU.

🔗

Joins Are Expensive

JOINs between huge tables need large in-memory hash tables or disk-based sorts. On a single machine they slow to a crawl once the tables far exceed RAM.

💥

Single Point of Failure

If the one server goes down, everything stops. No redundancy.

🧩

Unstructured Data

Relational tables are a poor fit for videos, images, and raw logs. BLOB and JSON columns exist, but they are awkward to query and costly at scale.

🌊

Real-Time Streams

Ingesting and processing millions of events per second is beyond what a single-node RDBMS is built for.

📖 Real Example

Facebook has been reported to generate roughly 4 petabytes of data per day. A single MySQL server cannot even store that much, let alone process it within a day. This is why they built distributed systems.

The 3 V's of Big Data

V	Meaning	Example	Challenge
Volume	Size of data	Google: 8.5 billion searches/day	Storage & processing
Velocity	Speed of data	Twitter: 500K tweets/minute	Real-time ingestion
Variety	Types of data	Logs, images, JSON, CSV, video	Schema flexibility

📈

Vertical Scaling (Scale Up) · SOLUTION #1

What is Vertical Scaling?

Vertical scaling = making your single machine more powerful. You add more RAM, faster CPUs, bigger SSDs to the same server. This was the original answer to database performance problems.

🧠 Analogy

Vertical scaling = upgrading your car's engine. Instead of adding more cars, you put a bigger engine in the one you have.

concept

# Before vertical scaling
Server: 8 CPU cores, 32GB RAM, 1TB disk
Query time for 100M rows: 45 minutes

# After vertical scaling
Server: 64 CPU cores, 512GB RAM, 10TB SSD
Query time for 100M rows: 3 minutes

# But... you hit the hardware ceiling
# Max RAM per machine: ~12TB (extremely expensive!)
# Still ONE point of failure — server dies = downtime

💸

Cost Explodes

High-end hardware carries a premium: doubling a server's capacity usually costs well over twice as much. Non-linear cost curve.

🧱

Physical Limit

There's a maximum RAM/CPU that fits on one motherboard.

🔴

Downtime Required

Upgrading hardware = server must go offline.

⚠️ Key Insight

Vertical scaling is a band-aid. At petabyte scale, no single machine is powerful enough.

🖧

Horizontal Scaling (Scale Out) · THE REAL SOLUTION

What is Horizontal Scaling?

Horizontal scaling = adding more machines and distributing the work across all of them. Instead of one powerful server, you have many commodity servers working together as a cluster.

🧠 Analogy

Instead of one super-chef cooking everything, you hire 100 regular chefs. Each cooks a portion of the meal simultaneously. Total cooking time drops dramatically.

Aspect	Vertical (Scale Up)	Horizontal (Scale Out)
Strategy	Bigger machine	More machines
Cost	Superlinear (premium hardware)	Roughly linear
Limit	Hardware ceiling	Virtually unlimited
Downtime	Required for upgrade	Add nodes without stopping
Fault Tolerance	Single point of failure	Redundant nodes
Complexity	Simple	Requires distributed coordination

🔑 Key Insight

Spark is built for horizontal scaling. A Spark cluster of 100 commodity machines can scan terabytes of data in minutes by running operations in parallel across all nodes, and it reaches petabyte scale by adding more nodes.
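
The scale-out idea can be sketched in plain Python: split the data into partitions, let each worker process its own partition, then combine the partial results. This is only an illustration. The helper names are hypothetical, and threads stand in for separate machines, so it shows the structure of the computation rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal contiguous chunks."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_sum(partition):
    """Work done independently by one 'node' on its own chunk."""
    return sum(partition)

def distributed_sum(data, num_workers=4):
    partitions = split_into_partitions(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(partial_sum, partitions))
    return sum(partials)  # combine step: merge the per-node results

total = distributed_sum(list(range(1, 101)), num_workers=4)  # 5050
```

Adding workers changes only num_workers; the per-partition function stays the same, which is what makes this pattern scale out.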

🌐

Distributed Systems · CORE CONCEPT

What is a Distributed System?

A distributed system is a collection of independent computers that appear to users as a single system. Machines coordinate via message passing over a network. Spark, Hadoop, Kafka are all distributed systems.

🧠 Analogy

An ant colony is a distributed system. No single ant knows the full plan, but through local rules and communication, thousands of ants build complex structures efficiently.

🔄

Concurrency

Many nodes work on different parts of the problem simultaneously.

🛡️

Fault Tolerance

If 2 nodes fail out of 100, the others continue. Auto-recovery.

📡

Message Passing

Nodes coordinate via network messages (not shared memory).

🔀

No Shared Memory

Each node has its own RAM. Data is shared by copying over the network (shuffle).

python — pyspark distributed execution

from pyspark.sql import SparkSession

# Spark creates a distributed computation cluster
spark = SparkSession.builder \
    .appName("MyDistributedApp") \
    .master("spark://cluster:7077") \
    .getOrCreate()

# This 10TB file is distributed across all cluster nodes
# Each node holds a partition (chunk) of the data
df = spark.read.parquet("s3://data/events/")

# filter() runs in PARALLEL on all nodes simultaneously
# Node 1 filters its partition, Node 2 its partition, etc.
# groupBy() then shuffles rows so each city's rows meet on one node
result = df.filter(df.country == "IN") \
           .groupBy("city") \
           .count()

result.show()  # Action: triggers execution; the first 20 rows return to the driver
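
What happens inside a filter, groupBy, count pipeline can be approximated without a cluster. The sketch below is a simplified model, not Spark's implementation: each partition filters and pre-counts locally, partial counts are routed to a reducer chosen by key (the shuffle), and each reducer merges what it receives. All function names and the sample rows are hypothetical.

```python
from collections import Counter

def local_filter_and_count(partition, country):
    """Runs on one node: filter its own rows, then pre-aggregate per city."""
    return Counter(row["city"] for row in partition if row["country"] == country)

def shuffle(partials, num_reducers):
    """Route every (city, count) pair to one reducer, chosen by the key."""
    buckets = [[] for _ in range(num_reducers)]
    for partial in partials:
        for city, n in partial.items():
            # Deterministic toy hash so the same city always lands together
            buckets[sum(city.encode()) % num_reducers].append((city, n))
    return buckets

def merge(bucket):
    """Runs on one reducer: add up the partial counts it received."""
    totals = Counter()
    for city, n in bucket:
        totals[city] += n
    return totals

partitions = [
    [{"country": "IN", "city": "Pune"}, {"country": "US", "city": "Austin"}],
    [{"country": "IN", "city": "Pune"}, {"country": "IN", "city": "Delhi"}],
]
partials = [local_filter_and_count(p, "IN") for p in partitions]
result = Counter()
for bucket in shuffle(partials, num_reducers=2):
    result.update(merge(bucket))
```

Only the small per-city counts cross the "network" here, not the raw rows, which is why pre-aggregating before a shuffle is cheap.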

🔺

CAP Theorem · THEORY

What is CAP Theorem?

The CAP theorem (proposed by Eric Brewer in 2000, formally proven in 2002) states that a distributed system can guarantee at most 2 of these 3 properties at the same time. It is a fundamental constraint every data engineer must know.

Consistency

Every read gets the most recent write. All nodes see same data at same time.

Availability

Every request gets a response (even if stale). System always up.

Partition Tolerance

System works even if network packets are lost between nodes.

⚠️ The Catch

In real distributed systems, P is non-negotiable: network partitions WILL happen sooner or later. So your real choice, during a partition, is between C and A.

System	Guarantees	Trade-off
HDFS	CP	Blocks writes during partition. Prioritizes consistency.
Cassandra	AP	Always available, may return slightly stale data.
Kafka	AP (by default)	Keeps accepting messages while replication lags. Tunable toward CP with acks=all and min.insync.replicas.
Delta Lake	CP	ACID via transaction log. No dirty reads.

🧠 Analogy

2 bank branches lose network connection.
CP: Stop all transactions until restored (consistent but unavailable).
AP: Both continue, reconcile conflicts later (available but inconsistent).
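
The bank-branch analogy can be turned into a small simulation. This is a toy model under stated assumptions: the Bank and Replica classes are hypothetical, there are exactly two replicas, and "reconciliation" just replays the deposits that were missed.

```python
class Replica:
    def __init__(self, balance):
        self.balance = balance

class Bank:
    """Two branch replicas; mode is 'CP' or 'AP'."""

    def __init__(self, mode, balance=100):
        self.mode = mode
        self.a, self.b = Replica(balance), Replica(balance)
        self.partitioned = False
        self.pending = []  # writes not yet replicated (AP only)

    def deposit(self, replica, amount):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: refuse the write rather than diverge
                raise RuntimeError("unavailable: cannot reach other branch")
            replica.balance += amount          # Availability first
            self.pending.append((other, amount))
            return
        replica.balance += amount
        other.balance += amount

    def heal(self):
        """Network restored: replay the writes the other branch missed."""
        self.partitioned = False
        for target, amount in self.pending:
            target.balance += amount
        self.pending.clear()

ap = Bank("AP")
ap.partitioned = True
ap.deposit(ap.a, 50)
diverged = (ap.a.balance, ap.b.balance)    # (150, 100): stale read possible
ap.heal()
converged = (ap.a.balance, ap.b.balance)   # (150, 150)

cp = Bank("CP")
cp.partitioned = True
try:
    cp.deposit(cp.a, 50)
    cp_rejected = False
except RuntimeError:
    cp_rejected = True
```

The AP bank stays usable but briefly shows different balances; the CP bank never disagrees with itself but turns customers away during the outage.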

📍

Data Locality · PERFORMANCE

What is Data Locality?

Data locality = moving computation to where the data lives, rather than moving data to computation. One of Spark's most critical performance optimizations.

🧠 Analogy

Instead of shipping all ingredients to a central kitchen, build a mini-kitchen at the farm and cook there (data locality). Only the cooked result travels — much smaller!

Level	Where is data?	Speed
PROCESS_LOCAL	Same JVM process (in memory)	Fastest — memory speed
NODE_LOCAL	Same machine, different process	Fast — local disk
RACK_LOCAL	Same network rack	OK — short network hop
ANY	Anywhere in cluster	Slow — cross-rack transfer

python — spark locality config

# Spark tries locality levels in order (best to worst)
# You can see locality level in Spark UI → Stages → Tasks

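# Configure how long Spark waits for a local slot before falling back.
# These are scheduler settings: set them when the session is built
# (or via spark-submit --conf), not with spark.conf.set at runtime.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.locality.wait", "3s") \
    .config("spark.locality.wait.node", "3s") \
    .config("spark.locality.wait.rack", "3s") \
    .getOrCreate()
# 3s is the default. Increase if your cluster has network bottlenecks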

🔑 Interview Tip

When asked "why is Spark fast?", data locality is a key answer. The task scheduler assigns tasks to nodes where the data already lives, minimizing expensive network I/O.
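
A rough sketch of locality-aware task placement follows. It is a simplification of what a scheduler like Spark's does, not its actual code: the dictionary fields (cached_on, hosts, racks) and function names are hypothetical, and the fallback rule models the idea behind spark.locality.wait, where each waiting period without a slot relaxes the requirement by one level.

```python
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]

def locality_level(task, executor):
    """Classify how close an executor is to the task's data."""
    if executor["id"] in task["cached_on"]:
        return "PROCESS_LOCAL"
    if executor["host"] in task["hosts"]:
        return "NODE_LOCAL"
    if executor["rack"] in task["racks"]:
        return "RACK_LOCAL"
    return "ANY"

def worst_allowed_level(waited_s, wait_s=3.0):
    """Every wait_s seconds without a slot, relax to the next level."""
    return LEVELS[min(int(waited_s // wait_s), len(LEVELS) - 1)]

def pick_executor(task, free_executors, waited_s=0.0):
    """Return the closest free executor, or None to keep waiting."""
    if not free_executors:
        return None
    limit = LEVELS.index(worst_allowed_level(waited_s))
    best = min(free_executors,
               key=lambda e: LEVELS.index(locality_level(task, e)))
    if LEVELS.index(locality_level(task, best)) <= limit:
        return best
    return None

task = {"cached_on": {"e1"}, "hosts": {"h1"}, "racks": {"r1"}}
e2 = {"id": "e2", "host": "h1", "rack": "r1"}   # same machine as the data
e3 = {"id": "e3", "host": "h2", "rack": "r1"}   # same rack
e4 = {"id": "e4", "host": "h9", "rack": "r2"}   # elsewhere in the cluster
```

With only e3 and e4 free, pick_executor returns None at first and settles for e3 once the task has waited two periods, trading a short delay for less network transfer.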

🧠 Quick Check: Which CAP property is NON-NEGOTIABLE in real distributed systems?

Consistency (C)

Availability (A)

Partition Tolerance (P)

All three

1.2

Big Data Ecosystem

The major tools and technologies that make up the modern big data stack — and how they work together.

🐘

Hadoop · FOUNDATION

What is Hadoop?

Apache Hadoop (2006) was the first widely adopted open-source framework for distributed storage and processing. Spark was later built to overcome the main limitation of Hadoop MapReduce: disk I/O between every step.

🗂️

HDFS

Distributed file system. Splits files into 128MB blocks across nodes with 3x replication.

🗺️

MapReduce

Original processing engine. Map = split work, Reduce = combine. Slow due to disk I/O.

🧶

YARN

Resource manager. Decides which node gets how much CPU/memory per task.

⚡ Spark vs Hadoop MapReduce

MapReduce writes to disk after every step. Spark keeps data in memory across operations where it can. Result: Spark can be 10–100x faster for iterative workloads (ML, graph processing).

💿

HDFS: Hadoop Distributed File System · STORAGE

How HDFS Works

HDFS splits files into 128MB blocks across DataNodes. Each block is replicated 3 times by default. The NameNode (master) tracks where every block lives.

🧠 Analogy

HDFS is like a distributed library. The NameNode is the librarian who knows where every book is. DataNodes are shelves across different rooms. Every book has 3 copies — if one room floods, you still find the book.
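
The block arithmetic is easy to check by hand. The sketch below assumes the defaults mentioned above (128MB blocks, 3 replicas) and uses a simple round-robin placement, which is an illustrative stand-in: real HDFS placement is rack-aware. The function and DataNode names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128
REPLICATION = 3

def plan_blocks(file_size_mb, datanodes,
                block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return {block_index: [datanodes holding a replica]}.

    Assumes replication <= len(datanodes) so replicas land on distinct nodes.
    """
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

datanodes = ["dn1", "dn2", "dn3", "dn4"]
placement = plan_blocks(1000, datanodes)   # a 1000MB file
num_blocks = len(placement)                # 8 blocks (the last one is partial)
raw_storage_mb = 1000 * REPLICATION        # 3000MB of disk for 1000MB of data
```

The mapping returned here is the kind of information the NameNode keeps: it knows which DataNodes hold each block, while the DataNodes hold the bytes.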

bash — hdfs commands

# List files in HDFS
hdfs dfs -ls /user/data/

# Upload a file to HDFS
hdfs dfs -put localfile.csv /user/data/

# Read the same HDFS file from PySpark (this line is Python, not a shell command):
#   df = spark.read.csv("hdfs://namenode:9000/user/data/sales.csv")

# Check replication factor for a file
hdfs fsck /user/data/myfile.parquet -files -blocks

🧶

YARN: Resource Manager · SCHEDULING

What is YARN?

YARN (Yet Another Resource Negotiator) is the cluster resource manager. When you submit a Spark job, YARN decides how many containers (CPU + memory) to allocate, and on which nodes.

bash — submit spark to yarn

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-memory 4g \
  --executor-cores 2 \
  my_spark_job.py
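
It helps to estimate what a submission like this asks the cluster for. The sketch below assumes Spark's default executor memory overhead on YARN (10% of executor memory, with a 384MB minimum) and ignores the driver container and any rounding YARN applies to container sizes. The function names are hypothetical.

```python
def executor_container_mb(executor_memory_mb,
                          overhead_factor=0.10, overhead_min_mb=384):
    """Memory requested per executor: heap plus off-heap overhead."""
    overhead = max(int(executor_memory_mb * overhead_factor), overhead_min_mb)
    return executor_memory_mb + overhead

def cluster_request(num_executors, executor_memory_mb, executor_cores):
    """Total executor resources the job asks the resource manager for."""
    per_executor = executor_container_mb(executor_memory_mb)
    return {
        "memory_mb": per_executor * num_executors,
        "cores": executor_cores * num_executors,
    }

# Mirrors: --num-executors 10 --executor-memory 4g --executor-cores 2
request = cluster_request(num_executors=10, executor_memory_mb=4096,
                          executor_cores=2)
```

Under these assumptions each executor needs 4505MB rather than 4096MB, so a job that looks like 40GB actually requests about 44GB of cluster memory.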

🐝

Apache Hive · SQL ON BIG DATA

What is Hive?

Apache Hive provides a SQL-like interface (HiveQL) over Hadoop data. It translates SQL into MapReduce/Tez/Spark jobs. Hive also provides the Hive Metastore — a central registry of table schemas used by Spark.

sql — hiveql

-- Create external Hive table over HDFS Parquet data
CREATE EXTERNAL TABLE sales (
  order_id  STRING,
  amount    DECIMAL(10,2),
  city      STRING
)
STORED AS PARQUET
LOCATION '/data/sales/';

-- The same Hive table can be queried from PySpark (this line is Python, not HiveQL):
--   df = spark.sql("SELECT city, SUM(amount) FROM sales GROUP BY city")

🔑 Why It Matters for Spark

Spark integrates directly with Hive Metastore. spark.sql("SELECT * FROM my_hive_table") uses the metastore to discover table location and schema automatically.

⚡

Apache Spark · THE STAR OF THIS COURSE

What Makes Spark Special?

Apache Spark (2009, UC Berkeley) keeps data in memory across operations, which can make it 10–100x faster than Hadoop MapReduce on iterative workloads. It's a unified engine for batch, streaming, SQL, and ML.

🧠

In-Memory Processing

Data stays in RAM between operations where possible. Disk is used mainly for shuffles and when memory runs out.

🔁

Lazy Evaluation

Spark builds a query plan but executes only when you call an action. Enables optimization.

🌊

Unified Engine

Batch + Streaming + ML + SQL — all one framework.

🐍

PySpark API

Python wrapper around Spark. Write Python, runs distributed on a cluster.
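
Lazy evaluation is easier to see in miniature. The class below is a hypothetical toy, not Spark's API: transformations only record a step in a plan, and nothing touches the data until the collect() action runs. A list of calls is kept purely to show when the work actually happens.

```python
class LazyDataset:
    """Records transformations; runs them only when collect() is called."""

    def __init__(self, source, plan=()):
        self._source = source
        self._plan = plan  # tuple of ("map" | "filter", function)

    def map(self, fn):
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._plan + (("filter", fn),))

    def collect(self):
        """Action: execute the whole recorded plan, one item at a time."""
        out = []
        for item in self._source:
            for kind, fn in self._plan:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    break  # item rejected by a filter
            else:
                out.append(item)
        return out

calls = []

def times_ten(x):
    calls.append(x)  # side effect so we can observe when work happens
    return x * 10

ds = LazyDataset(range(5)).map(times_ten).filter(lambda x: x >= 20)
calls_before_action = list(calls)   # still empty: only a plan exists
result = ds.collect()               # [20, 30, 40]
```

Because the full plan is known before anything runs, an engine is free to reorder or fuse steps, which is the optimization opportunity lazy evaluation creates.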

📨

Apache Kafka · STREAMING

What is Kafka?

Apache Kafka is a distributed event streaming platform. Producers write events, Consumers read them. Spark Structured Streaming reads from Kafka in near real-time.

🧠 Analogy

Kafka is like a newspaper printing press. Publishers write news. The press stores all editions. Subscribers can read any edition at their own pace, even yesterday's paper.
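
The printing-press analogy corresponds to an append-only log with per-consumer offsets. The sketch below is an in-memory, single-partition toy with hypothetical names (TopicLog, produce, poll, seek); it illustrates the model only and says nothing about Kafka's real client API.

```python
class TopicLog:
    """Append-only event log; each consumer tracks its own read position."""

    def __init__(self):
        self._events = []
        self._offsets = {}  # consumer name -> next offset to read

    def produce(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset assigned to this event

    def poll(self, consumer, max_events=10):
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer, offset):
        """Rewind (or skip ahead) to re-read events: 'yesterday's paper'."""
        self._offsets[consumer] = offset

log = TopicLog()
for event in ["login", "click", "logout"]:
    log.produce(event)

fast = log.poll("dashboard")               # reads everything available
slow = log.poll("audit", max_events=1)     # reads at its own pace
log.seek("dashboard", 0)
replay = log.poll("dashboard", max_events=2)
```

Reading never removes events from the log, so a slow consumer and a replaying consumer do not interfere with each other.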

python — spark reads from kafka

# Read real-time data from Kafka into Spark Structured Streaming
# (requires the spark-sql-kafka-0-10 connector package on the classpath)
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka-host:9092") \
    .option("subscribe", "user-events") \
    .load()

# Kafka messages come as bytes — decode them
df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

🗓️

Airflow, Trino, Flink, HBase · ECOSYSTEM

Other Key Ecosystem Tools

Tool	Type	Purpose	Spark Connection
Apache Airflow	Orchestrator	Schedules data pipeline DAGs	Triggers Spark jobs via SparkSubmitOperator
Trino (Presto)	SQL Engine	Interactive SQL across many data sources	Complementary — Trino for ad-hoc, Spark for ETL
Apache Flink	Stream Processor	True real-time event processing	Competitor for streaming; Spark Streaming is micro-batch
HBase	NoSQL DB	Random read/write on HDFS at row level	Spark can read/write HBase for random access

1.3

Types of Data

The three categories of data Big Data systems must handle — and how Spark deals with each.

📊

Structured Data · MOST COMMON IN SPARK

What is Structured Data?

Structured data has a predefined schema — rows and columns with fixed data types. Think of a spreadsheet or SQL table. Easiest type to work with in Spark.

🧠 Analogy

Structured data is like a perfectly organized Excel sheet — every column has a name and type, every row follows the same format. No surprises.

python — structured data in pyspark

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

spark = SparkSession.builder.appName("StructuredData").getOrCreate()

data = [
    ("Alice", 30, "Engineer", 95000.0),
    ("Bob",   25, "Designer", 72000.0),
    ("Carol", 35, "Manager",  120000.0)
]

schema = StructType([
    StructField("name",   StringType(),  nullable=False),
    StructField("age",    IntegerType(), nullable=True),
    StructField("role",   StringType(),  nullable=True),
    StructField("salary", DoubleType(),  nullable=True)
])

df = spark.createDataFrame(data, schema)
df.show()
# +-----+---+--------+--------+
# | name|age|    role|  salary|
# +-----+---+--------+--------+
# |Alice| 30|Engineer| 95000.0|
# |  Bob| 25|Designer| 72000.0|
# |Carol| 35| Manager|120000.0|
# +-----+---+--------+--------+

# Also from files:
df2 = spark.read.parquet("s3://bucket/transactions/")
df3 = spark.read.jdbc(url="jdbc:postgresql://...", table="orders")

Common sources: CSV files, Parquet files, SQL database tables, ORC files, Excel sheets

🧩

Semi-Structured Data · MOST COMMON IN REAL WORLD

What is Semi-Structured Data?

Semi-structured data has some organization (keys/tags) but no rigid schema. It's self-describing — structure is embedded in the data. JSON and XML are the most common. Rows can have different fields.

🧠 Analogy

Semi-structured data is like a business card. Everyone has name and contact info, but some have Twitter handles, some have fax numbers. Flexible structure, no fixed schema enforced.

python — semi-structured json in pyspark

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

json_data = [
    ('{"user":"alice","tags":["python","spark"],"address":{"city":"Bangalore"}}',),
    ('{"user":"bob","tags":["java"],"phone":"9999999999"}',)  # different fields!
]
df = spark.createDataFrame(json_data, ["json_col"])

schema = StructType([
    StructField("user", StringType()),
    StructField("tags", ArrayType(StringType())),
    StructField("address", StructType([StructField("city", StringType())]))
])

parsed = df.withColumn("parsed", from_json(col("json_col"), schema))
parsed.select("parsed.user", "parsed.tags", "parsed.address.city").show()
# +-----+----------------+---------+
# |alice|[python, spark] |Bangalore|
# |  bob|        [java]  |     null|  ← missing fields become null

Common sources: JSON files, XML files, AVRO files, log files, Kafka messages

🎲

Unstructured Data · FASTEST GROWING

What is Unstructured Data?

Unstructured data has no predefined format or schema. It requires specialized tools (ML, NLP, Computer Vision) to extract meaning. By common estimates, ~80% of the world's data is unstructured or semi-structured.

🧠 Analogy

Unstructured data is a pile of photographs. Each contains valuable information, but you can't run SQL on it. You need AI/ML to describe what's in each photo before analysis.

python — unstructured text in pyspark

from pyspark.sql.functions import col, split, size, length

reviews = [
    ("1", "This product is absolutely amazing! Best purchase ever."),
    ("2", "Terrible quality. Broke after 2 days. Very disappointed."),
]
df = spark.createDataFrame(reviews, ["id", "review"])

# Basic processing on raw text
df = df.withColumn("word_count", size(split(col("review"), " "))) \
       .withColumn("char_count", length(col("review")))

# Tokenize with Spark MLlib. For real NLP, consider Spark NLP
# or a Pandas UDF that wraps a transformer model.
from pyspark.ml.feature import Tokenizer
tokenizer = Tokenizer(inputCol="review", outputCol="words")
words_df = tokenizer.transform(df)

Common sources: images (PNG, JPEG), videos (MP4), audio files, text documents, social media posts, PDF files

📊

Complete Comparison · SUMMARY

Property	Structured	Semi-Structured	Unstructured
Schema	Fixed, predefined	Flexible / self-describing	None
Storage	RDBMS, Parquet, ORC	JSON, XML, AVRO	S3/HDFS raw files
Query Method	SQL directly	SQL after parsing	Requires ML/NLP
Spark Tool	DataFrame API	from_json(), schema inference	MLlib, Python UDFs
% of World's Data	~20%	~10%	~70%

1.4

File Formats

Deep dive into every file format used in Big Data, plus critical micro-topics: Row vs Columnar, Compression, Predicate Pushdown, Schema Evolution, Serialization, and Delta vs Iceberg vs Hudi.

📁

All File Formats · OVERVIEW

CSV

Row · Text

Human readable
No compression
No schema
Slow for big data

JSON

Row · Text

Nested structures
Schema embedded
Verbose/large
Flexible

XML

Row · Text

Tags + attributes
Very verbose
Legacy systems
Complex parsing

AVRO

Row · Binary

Schema evolution
Kafka standard
Compact binary
Streaming-friendly

ORC

Columnar · Binary

Hive optimized
Built-in indexes
Great compression
ACID support

PARQUET

Columnar · Binary

Spark's default
Predicate pushdown
Column pruning
Best for analytics

DELTA

Columnar + Log

ACID transactions
Time travel
Schema evolution
CDC support

ICEBERG

Open Table Format

Hidden partitioning
Partition evolution
Multi-engine
Snapshot isolation

HUDI

Open Table Format

Native upserts
COW + MOR types
Incremental reads
Record-level index

⚡

Row vs Columnar Storage · MICRO TOPIC, CRITICAL

Row-Oriented Storage (CSV, JSON, AVRO)

Stores data row by row on disk. To read a single column from 1 million rows, the engine must still scan every row in full, including the columns it does not need. Best for OLTP (insert/update individual records).

concept — row storage layout

# Disk layout: each row stored together
# [Alice | 30 | 95000] [Bob | 25 | 72000] [Carol | 35 | 120000]

# Query: SELECT SUM(salary) FROM employees
# MUST read: name + age + salary for EVERY row
# Wasted: name + age columns read but never used!

Columnar Storage (Parquet, ORC)

Stores all values for each column together on disk. SELECT SUM(salary) only reads the salary column; all other columns are skipped entirely. This is why Parquet can be 10–100x faster than row formats for analytical queries on wide tables.

concept — columnar storage layout

# Disk layout: each column stored together
# [name col: Alice | Bob | Carol]
# [age col:  30    | 25  | 35]
# [salary:   95000 | 72000 | 120000]

# Query: SELECT SUM(salary) → ONLY reads salary block!
# Savings: name + age columns NEVER loaded from disk
# Also: similar data compresses much better (ints next to ints)

Use Case	Best Format	Why
Insert/update single records	Row (CSV, AVRO)	Access full row at once
Analytics: SUM, AVG over 1 column	Columnar (Parquet)	Only read needed column
Streaming ingestion (one event at a time)	AVRO (row)	Write one event at a time
Data warehouse queries	Parquet / ORC	Column pruning = fast scans
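
The difference can be made concrete by counting how many values each layout touches to answer SELECT SUM(salary). This is an in-memory illustration with hypothetical helper names; real formats work on disk blocks, but the proportions are the same.

```python
rows = [
    ("Alice", 30, 95000),
    ("Bob",   25, 72000),
    ("Carol", 35, 120000),
]

def to_columns(rows, names):
    """Pivot row tuples into one list per column (columnar layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

def sum_salary_row_store(rows):
    """Row layout: every field of every row is read to reach salary."""
    total, values_read = 0, 0
    for row in rows:
        values_read += len(row)
        total += row[2]
    return total, values_read

def sum_salary_column_store(columns):
    """Columnar layout: only the salary column is read."""
    salary = columns["salary"]
    return sum(salary), len(salary)

columns = to_columns(rows, ["name", "age", "salary"])
row_total, row_values_read = sum_salary_row_store(rows)         # reads 9 values
col_total, col_values_read = sum_salary_column_store(columns)   # reads 3 values
```

Both layouts return the same sum; the columnar one simply reads a third of the values here, and the gap widens as tables gain more columns.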

🗜️

Compression · MICRO TOPIC

Why Compression Matters

In Big Data, I/O is the bottleneck — smaller files = faster reads and less S3/network cost. Columnar formats compress much better because similar data (all integers, all strings) is adjacent on disk.
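
One reason adjacent similar values compress well is run-length encoding, a technique columnar formats such as Parquet and ORC apply alongside dictionary encoding before the general-purpose codec runs. The sketch below is a minimal illustration with hypothetical function names and sample data, not any format's actual encoder.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

def run_length_decode(runs):
    return [v for v, n in runs for _ in range(n)]

names = ["Ann", "Ben", "Cy", "Dee", "Eli", "Fay", "Gus", "Hal"]
countries = ["IN"] * 4 + ["US"] * 3 + ["IN"]

# Columnar layout: the country values sit next to each other
column_runs = run_length_encode(countries)   # 3 runs for 8 values

# Row layout: name and country alternate, so no runs ever form
interleaved = [v for pair in zip(names, countries) for v in pair]
row_runs = run_length_encode(interleaved)    # 16 runs for 16 values
```

The column shrinks from 8 values to 3 runs, while the row-interleaved data gains nothing from the same encoding.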

python — compression in pyspark

# Alternatives for comparison; each write targets its own output path

# Snappy: best balance of speed + ratio (default for Parquet)
df.write.parquet("s3://bucket/data_snappy/", compression="snappy")

# ZSTD: better ratio than Snappy at comparable speed
df.write.parquet("s3://bucket/data_zstd/", compression="zstd")

# Gzip: high ratio but slow, and gzipped CSV is NOT splittable
df.write.csv("s3://bucket/data_gzip_csv/", compression="gzip")

# Set default globally
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

Algorithm	Speed	Ratio	Splittable	Best For
Snappy	Very fast	~3:1	Yes	Default — balanced
ZSTD	Fast	~5:1	Yes	Best for Parquet (recommended)
Gzip	Slow	~5:1	No (CSV)	Archive / cold data
None	Fastest	None	Yes	Dev / testing only

🎯

Predicate Pushdown · MICRO TOPIC, VERY IMPORTANT

What is Predicate Pushdown?

Predicate pushdown = pushing filter conditions down to the file/storage level so only relevant data is loaded into Spark's memory. The file format itself skips irrelevant row groups before Spark even sees the data.

🧠 Analogy

You want India orders from a 1TB file. Without pushdown: load all 1TB into Spark, then filter. With pushdown: Parquet's statistics say "rows 5M–7M are India" — so only read those 2M rows. Like an index in a book.

python — predicate pushdown demo

# Parquet stores min/max statistics per row group (128MB chunks)
# Spark checks: does this row group overlap the filter range?
# If not → entire row group SKIPPED without reading

df = spark.read.parquet("s3://data/orders/")

result = df.filter(df.order_date >= "2024-01-01")

# Verify pushdown is happening:
result.explain(True)
# Look for: PushedFilters: [GreaterThanOrEqual(order_date,2024-01-01)]

# Column pruning (also pushed down)
# Only reads 'city', 'amount', and 'country' (needed by the filter) from Parquet
df.filter(df.country == "IN").select("city", "amount").show()

🔑 Interview Point

Predicate pushdown can skip data in Parquet and ORC because they store per-column statistics. CSV and JSON have no embedded statistics, so Spark must still read and parse every row before filtering.
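
Row-group skipping can be modeled in a few lines. The sketch below is a simplified stand-in for what a Parquet reader does with min/max statistics: groups are plain dictionaries, the group size is tiny for readability, and the function names and dates are hypothetical.

```python
def build_row_groups(values, group_size):
    """Chunk values into row groups and record min/max stats for each."""
    groups = []
    for i in range(0, len(values), group_size):
        chunk = values[i:i + group_size]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups

def scan_greater_equal(groups, threshold):
    """Return matching values and how many row groups were actually read."""
    matched, groups_read = [], 0
    for group in groups:
        if group["max"] < threshold:
            continue  # skipped from statistics alone, rows never touched
        groups_read += 1
        matched.extend(v for v in group["rows"] if v >= threshold)
    return matched, groups_read

order_dates = ["2023-10-05", "2023-11-20", "2023-12-31",
               "2024-01-02", "2024-02-14", "2024-03-01"]
groups = build_row_groups(order_dates, group_size=2)
matched, groups_read = scan_greater_equal(groups, "2024-01-01")
```

Here 2 of the 3 row groups are read and the first is skipped outright. Note that statistics only rule groups out: a group that overlaps the range still has to be read and filtered row by row, and skipping works best when the data is sorted or clustered on the filter column.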

🔄

Schema Evolution · MICRO TOPIC

What is Schema Evolution?

Schema evolution = ability to change a table's schema (add columns, rename, change types) without breaking existing data or pipelines. Critical in production where schemas change as business evolves.

python — schema evolution in delta lake

# Original table: name, age only
df_v1 = spark.createDataFrame([("Alice", 30)], ["name", "age"])
df_v1.write.format("delta").save("s3://bucket/users")

# Business adds 'email' column later
df_v2 = spark.createDataFrame([("Bob", 25, "bob@example.com")], ["name", "age", "email"])

# mergeSchema: adds new columns automatically, no error!
df_v2.write.format("delta") \
     .option("mergeSchema", "true") \
     .mode("append") \
     .save("s3://bucket/users")

# Old rows get null for new column — clean merge
spark.read.format("delta").load("s3://bucket/users").show()
# |Alice|30|           null|  ← old row, null for new column
# |  Bob|25|bob@example.com|

Format	Schema Evolution	Method
CSV / JSON	Manual only	No built-in support
Parquet	Add columns only	Schema merge on read
AVRO	Full (with defaults)	Schema Registry
Delta Lake	Full	mergeSchema option
Iceberg	Best-in-class	Schema evolution API
Hudi	Full	Schema-on-read
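
The core idea behind "add a column without breaking old data" is that the reader's schema supplies a default for any field an old record lacks. The sketch below shows that idea on plain dictionaries; it is loosely modeled on reader-schema resolution with defaults and is not any format's real implementation. The function and schema names are hypothetical.

```python
def read_with_schema(record, schema):
    """Project a stored record onto the reader's schema.

    schema is a list of (field_name, default) pairs. Fields missing from
    the record take the default; fields not in the schema are dropped.
    """
    return {name: record.get(name, default) for name, default in schema}

SCHEMA_V1 = [("name", None), ("age", None)]
SCHEMA_V2 = [("name", None), ("age", None), ("email", None)]  # 'email' added

old_record = {"name": "Alice", "age": 30}                     # written with v1
new_record = {"name": "Bob", "age": 25, "email": "bob@example.com"}

old_as_v2 = read_with_schema(old_record, SCHEMA_V2)  # email filled with None
new_as_v1 = read_with_schema(new_record, SCHEMA_V1)  # email ignored
```

Old data read with the new schema gets a null email, and new data read by an older consumer still works, which is the property that keeps existing pipelines running through a schema change.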

📦

Serialization · MICRO TOPIC

What is Serialization?

Serialization = converting an in-memory object into bytes for storage or network transfer. Spark does this millions of times during shuffles. It's a critical performance factor.

🧠 Analogy

Packing a suitcase: your room (in-memory object) → packed suitcase (bytes on wire) → unpacked at destination (deserialization). Faster and smaller packing = better performance.
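
The "smaller packing" point can be demonstrated with Python's standard library. In this sketch, pickle plays the role of a generic, self-describing serializer and struct plays the role of a compact fixed-layout binary format. Both are stand-ins chosen for illustration; neither is what Spark uses on the JVM.

```python
import pickle
import struct

record = (1, 30, 95000.0)  # (id, age, salary)

# Generic serializer: stores type information alongside the values
generic_bytes = pickle.dumps(record)

# Fixed binary layout: two 4-byte ints and one 8-byte float, no metadata
RECORD_FORMAT = "<iid"
compact_bytes = struct.pack(RECORD_FORMAT, *record)

# Deserialization: both round-trip back to the original record
from_generic = pickle.loads(generic_bytes)
from_compact = struct.unpack(RECORD_FORMAT, compact_bytes)

compact_size = len(compact_bytes)   # 16 bytes
generic_size = len(generic_bytes)   # larger: carries extra framing and type tags
```

The fixed layout is smaller because the schema lives in the code rather than in every record, which is the same trade a schema-aware binary row format makes.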

python — serialization config

from pyspark.sql import SparkSession

# spark.serializer must be set before the session starts (builder or
# spark-submit --conf); it cannot be changed with spark.conf.set at runtime.
# Default: org.apache.spark.serializer.JavaSerializer (slow, large output)

# Kryo serialization: usually faster and more compact (recommended for RDDs)
spark = SparkSession.builder \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryo.classesToRegister", "com.example.MyClass") \
    .getOrCreate()
# Registering custom classes lets Kryo write a short ID instead of the full class name

# For DataFrames: Tungsten UnsafeRow is used AUTOMATICALLY
# Binary format with no Java object overhead
# This is one reason DataFrames are usually faster than RDDs

Serializer	Speed	Size	Best For
Java (default)	Slow	Large	Simple/small use cases
Kryo	Often several times faster than Java	Smaller	RDD-heavy workloads
Tungsten UnsafeRow	Fastest	Binary minimal	DataFrame API (automatic)

🏆

Delta vs Iceberg vs Hudi: Deep Comparison · MICRO TOPIC, INTERVIEW CRITICAL

Why Open Table Formats Exist

Raw Parquet/ORC files don't support ACID transactions, time travel, or upserts. Open Table Formats add a metadata layer on top of Parquet to provide database-like features for data lakes.

Query Engine (Spark / Trino / Flink / Athena)

↓

Table Format (Delta / Iceberg / Hudi) — ACID, time travel, upserts

↓

Parquet data files on S3 / HDFS / GCS

Side-by-Side Comparison

Feature	Delta Lake	Apache Iceberg	Apache Hudi
Creator	Databricks	Netflix	Uber
ACID Transactions	✓ Full	✓ Full	✓ Full
Time Travel	✓ Version/timestamp	✓ Snapshot-based	✓ Limited
Schema Evolution	✓ Full	✓ Best-in-class	✓ Full
Partition Evolution	✗ No	✓ Best feature	Limited
Native Upserts	✓ MERGE INTO	✓ MERGE INTO	✓ Native UPSERT
Multi-Engine Support	Databricks-first	Best (Spark+Trino+Flink)	Spark-primary
Best For	Databricks users	Multi-engine lakehouses	High upsert/CDC workloads

Internals: Transaction Logs

concept — delta lake transaction log

# Delta Lake file structure
s3://bucket/my_table/
  ├── _delta_log/
  │   ├── 00000000000000000000.json   # version 0: CREATE TABLE
  │   ├── 00000000000000000001.json   # version 1: INSERT
  │   ├── 00000000000000000002.json   # version 2: UPDATE
  │   └── 00000000000000000010.checkpoint.parquet  # checkpoint every 10 commits
  ├── part-00000-abc.snappy.parquet   # actual data files (Parquet)
  └── part-00001-def.snappy.parquet

# Iceberg file structure
s3://bucket/my_iceberg_table/
  ├── metadata/
  │   ├── v1.metadata.json     # Table metadata + snapshot refs
  │   ├── snap-123.avro        # Manifest list (list of manifests)
  │   └── manifest-456.avro   # Manifest (list of data files + stats)
  └── data/
      └── country=IN/
          └── part-00000.parquet

# Time travel with Delta
spark.read.format("delta") \
     .option("versionAsOf", 5) \
     .load("s3://bucket/my_table")

# Time travel with Iceberg (use format + load so the option is applied)
spark.read.format("iceberg") \
     .option("snapshot-id", 123456789) \
     .load("my_iceberg_table")
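
How a transaction log enables time travel can be sketched by replaying commits. The model below follows the general shape of a Delta-style log, where each commit lists files added and removed, but it is heavily simplified: the commits live in a dictionary instead of JSON files on storage, and the file names and the live_files helper are hypothetical.

```python
import json

# One entry per commit version; each line is a JSON action
commits = {
    0: ['{"add": {"path": "part-0.parquet"}}'],
    1: ['{"add": {"path": "part-1.parquet"}}'],
    2: ['{"remove": {"path": "part-0.parquet"}}',   # an UPDATE rewrites a file
        '{"add": {"path": "part-2.parquet"}}'],
}

def live_files(commits, version):
    """Replay the log up to `version` to find the files that make up the table."""
    files = set()
    for v in sorted(commits):
        if v > version:
            break
        for line in commits[v]:
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

as_of_v1 = live_files(commits, 1)   # table before the update
latest = live_files(commits, 2)     # table after the update
```

Reading an older version simply means stopping the replay earlier. This works because data files are never modified in place, and old files stay on storage until they are explicitly cleaned up.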

🧠 Quick Check: SELECT SUM(revenue) FROM sales WHERE country = 'IN' runs on a Parquet file. Which optimizations apply automatically?

Full table scan — reads every row and every column

Column pruning + Predicate pushdown — skip unneeded columns and row groups

Only column pruning — reads just country + revenue columns