MODULE 1 Big Data Fundamentals
1.1

Why Big Data?

Understanding the fundamental problems that Big Data solves and why traditional systems break down at scale.

🗄️
Traditional Database Limitations THE PROBLEM
What are Traditional Databases?
Traditional databases (MySQL, PostgreSQL, Oracle) are relational systems designed in the 1970s–90s when data sizes were in MBs or GBs. They store data in rows and columns on a single machine. They work great for structured, predictable data — like a bank's transactions or a shop's inventory.
🧠 Analogy
A traditional database is like a single filing cabinet. Perfect for a small office. But when your office grows to 100,000 employees, one cabinet won't work.
Key Limitations at Scale
When data grows to TBs or PBs, traditional databases hit these walls:
💾
Storage Ceiling
Single machine disk is limited. You can't store 500TB on one server easily.
Query Speed
Full table scans on billions of rows take hours on a single CPU.
🔗
Joins Are Expensive
JOINs between huge tables need far more memory than one machine has, so they spill to disk and slow to a crawl.
💥
Single Point of Failure
If the one server goes down, everything stops. No redundancy.
🧩
Unstructured Data
Videos, images, and free-form logs fit poorly in relational tables; BLOB and JSON columns exist but are awkward to query at scale.
🌊
Real-Time Streams
Ingesting and processing millions of events per second is beyond what a single-node RDBMS can sustain.
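A rough back-of-the-envelope calculation makes the storage and scan limits concrete. The sketch below assumes a sequential read rate of 200 MB/s per machine and perfect parallelism across 100 machines; both numbers are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope scan time: one machine vs. a cluster.
# ASSUMPTIONS (illustrative only): 200 MB/s sequential read per machine,
# perfect parallelism, no network or coordination overhead.

TB = 10**12  # bytes in a terabyte (decimal)
MB = 10**6

def scan_hours(data_bytes: int, machines: int, mb_per_sec: float = 200.0) -> float:
    """Hours needed to read data_bytes once when split evenly across machines."""
    bytes_per_sec = machines * mb_per_sec * MB
    return data_bytes / bytes_per_sec / 3600

single = scan_hours(500 * TB, machines=1)
cluster = scan_hours(500 * TB, machines=100)

print(f"1 machine:    {single:,.0f} hours")
print(f"100 machines: {cluster:,.1f} hours")
```

Under these assumptions, reading 500TB once takes a single machine about 694 hours (roughly 29 days), while 100 machines finish in about 7 hours.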
📖 Real Example
Facebook has been reported to generate roughly 4 petabytes of data per day. No single MySQL server can store or scan that volume in any useful time, which is why they built distributed systems.
The 3 V's of Big Data
V | Meaning | Example | Challenge
Volume | Size of data | Google: 8.5 billion searches/day | Storage & processing
Velocity | Speed of data | Twitter: 500K tweets/minute | Real-time ingestion
Variety | Types of data | Logs, images, JSON, CSV, video | Schema flexibility
📈
Vertical Scaling (Scale Up) SOLUTION #1
What is Vertical Scaling?
Vertical scaling = making your single machine more powerful. You add more RAM, faster CPUs, bigger SSDs to the same server. This was the original answer to database performance problems.
🧠 Analogy
Vertical scaling = upgrading your car's engine. Instead of adding more cars, you put a bigger engine in the one you have.
concept
# Before vertical scaling
Server: 8 CPU cores, 32GB RAM, 1TB disk
Query time for 100M rows: 45 minutes

# After vertical scaling
Server: 64 CPU cores, 512GB RAM, 10TB SSD
Query time for 100M rows: 3 minutes

# But... you hit the hardware ceiling
# Max RAM per machine: ~12TB (extremely expensive!)
# Still ONE point of failure — server dies = downtime
💸
Cost Explodes
Doubling CPU power often costs 4x more. Non-linear cost curve.
🧱
Physical Limit
There's a maximum RAM/CPU that fits on one motherboard.
🔴
Downtime Required
Upgrading hardware = server must go offline.
⚠️ Key Insight
Vertical scaling is a band-aid. At petabyte scale, no single machine is powerful enough.
🖧
Horizontal Scaling (Scale Out) THE REAL SOLUTION
What is Horizontal Scaling?
Horizontal scaling = adding more machines and distributing the work across all of them. Instead of one powerful server, you have many commodity servers working together as a cluster.
🧠 Analogy
Instead of one super-chef cooking everything, you hire 100 regular chefs. Each cooks a portion of the meal simultaneously. Total cooking time drops dramatically.
Aspect | Vertical (Scale Up) | Horizontal (Scale Out)
Strategy | Bigger machine | More machines
Cost | Exponential | Linear
Limit | Hardware ceiling | Virtually unlimited
Downtime | Required for upgrade | Add nodes without stopping
Fault Tolerance | Single point of failure | Redundant nodes
Complexity | Simple | Requires distributed coordination
🔑 Key Insight
Spark is built for horizontal scaling. A cluster of 100 commodity machines can work through datasets far too large for any single server by running operations in parallel across all nodes.
🌐
Distributed Systems CORE CONCEPT
What is a Distributed System?
A distributed system is a collection of independent computers that appear to users as a single system. Machines coordinate via message passing over a network. Spark, Hadoop, Kafka are all distributed systems.
🧠 Analogy
An ant colony is a distributed system. No single ant knows the full plan, but through local rules and communication, thousands of ants build complex structures efficiently.
🔄
Concurrency
Many nodes work on different parts of the problem simultaneously.
🛡️
Fault Tolerance
If 2 nodes fail out of 100, the others continue. Auto-recovery.
📡
Message Passing
Nodes coordinate via network messages (not shared memory).
🔀
No Shared Memory
Each node has its own RAM. Data is shared by copying over the network (shuffle).
python — pyspark distributed execution
from pyspark.sql import SparkSession

# Connect to an existing Spark cluster (the master URL is environment-specific)
spark = SparkSession.builder \
    .appName("MyDistributedApp") \
    .master("spark://cluster:7077") \
    .getOrCreate()

# This 10TB dataset is read as many partitions (chunks) spread across the executors;
# no single node holds all of it
df = spark.read.parquet("s3://data/events/")

# filter() runs in PARALLEL: each executor filters its own partitions.
# groupBy().count() then shuffles rows so all rows for a city meet on one executor.
result = df.filter(df.country == "IN") \
           .groupBy("city") \
           .count()

result.show()  # Action: triggers execution and prints the first 20 rows on the driver
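To make the filter, shuffle, and count steps concrete, here is a minimal pure-Python sketch of the same pipeline. It is a toy model, not how Spark is implemented: the three "nodes" are plain lists, the rows are made up, and `target_node` is a hypothetical stand-in for Spark's hash partitioner.

```python
from collections import Counter, defaultdict

# Three "nodes", each holding one partition of (country, city) rows.
# The data and the node count are made up for illustration.
partitions = [
    [("IN", "Pune"), ("US", "Austin"), ("IN", "Delhi")],
    [("IN", "Pune"), ("IN", "Pune"), ("UK", "Leeds")],
    [("IN", "Delhi"), ("US", "Boston")],
]
num_nodes = len(partitions)

# Step 1: each node filters its own partition (no data leaves the node).
filtered = [[city for country, city in part if country == "IN"] for part in partitions]

# Step 2: shuffle. Each city is routed to one target node so that
# all rows for the same key meet on the same node.
# A stable toy partitioner is used instead of Python's salted hash().
def target_node(key: str) -> int:
    return sum(key.encode()) % num_nodes

inbox = defaultdict(list)
for part in filtered:
    for city in part:
        inbox[target_node(city)].append(city)

# Step 3: each node counts the keys it received; the driver merges the results.
result = Counter()
for node, cities in inbox.items():
    result.update(Counter(cities))

print(dict(result))
```

Only step 2 moves data between nodes, which is why shuffles are the expensive part of a distributed job.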
🔺
CAP Theorem THEORY
What is CAP Theorem?
The CAP theorem (conjectured by Eric Brewer in 2000, proved by Gilbert and Lynch in 2002) states that a distributed system can guarantee at most 2 of 3 properties at the same time. It is a fundamental constraint every data engineer must know.
C
Consistency
Every read gets the most recent write. All nodes see same data at same time.
A
Availability
Every request gets a response (even if stale). System always up.
P
Partition Tolerance
System works even if network packets are lost between nodes.
⚠️ The Catch
In real distributed systems, P is non-negotiable because network partitions will eventually happen. So during a partition your real choice is between C and A.
System | Guarantees | Trade-off
HDFS | CP | Blocks writes during partition. Prioritizes consistency.
Cassandra | AP | Always available, may return slightly stale data.
Kafka | AP | Always accepts messages, replication may lag.
Delta Lake | CP | ACID via transaction log. No dirty reads.
🧠 Analogy
2 bank branches lose network connection.
CP: Stop all transactions until restored (consistent but unavailable).
AP: Both continue, reconcile conflicts later (available but inconsistent).
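The bank-branch scenario can be sketched in a few lines. This is a toy model under simplifying assumptions (two replicas, synchronous replication while the network is healthy); the `Replica` and `Cluster` classes are hypothetical and do not correspond to any real database API.

```python
class Replica:
    """One copy of a key-value store (a toy stand-in for a database node)."""
    def __init__(self):
        self.data = {}

class Cluster:
    def __init__(self, mode: str):
        self.mode = mode            # "CP" or "AP"
        self.a, self.b = Replica(), Replica()
        self.partitioned = False    # True = the two replicas cannot talk

    def write(self, replica: Replica, key: str, value: int) -> bool:
        if self.partitioned and self.mode == "CP":
            return False            # CP: refuse the write rather than diverge
        replica.data[key] = value
        if not self.partitioned:    # healthy network: replicate right away
            other = self.b if replica is self.a else self.a
            other.data[key] = value
        return True                 # AP: accepted, even if the replicas now differ

cp, ap = Cluster("CP"), Cluster("AP")
for c in (cp, ap):
    c.write(c.a, "balance", 100)    # replicated to both sides
    c.partitioned = True            # network link between branches is cut

cp_ok = cp.write(cp.a, "balance", 50)
ap_ok = ap.write(ap.a, "balance", 50)

print("CP accepted write:", cp_ok, "| replicas agree:", cp.a.data == cp.b.data)
print("AP accepted write:", ap_ok, "| replicas agree:", ap.a.data == ap.b.data)
```

The CP cluster stays consistent by rejecting the write; the AP cluster accepts it and leaves the two replicas to be reconciled once the link is restored.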
📍
Data Locality PERFORMANCE
What is Data Locality?
Data locality = moving computation to where the data lives, rather than moving data to computation. One of Spark's most critical performance optimizations.
🧠 Analogy
Instead of shipping all ingredients to a central kitchen, build a mini-kitchen at the farm and cook there (data locality). Only the cooked result travels — much smaller!
Level | Where is data? | Speed
PROCESS_LOCAL | Same JVM process (in memory) | Fastest: memory speed
NODE_LOCAL | Same machine, different process | Fast: local disk
RACK_LOCAL | Same network rack | OK: short network hop
ANY | Anywhere in cluster | Slow: cross-rack transfer
python — spark locality config
from pyspark.sql import SparkSession

# Spark tries locality levels in order (best to worst)
# You can see the locality level in Spark UI → Stages → Tasks

# Scheduler settings must be set before the session starts
# (spark.conf.set() at runtime does not change them)
spark = SparkSession.builder \
    .config("spark.locality.wait", "3s") \
    .config("spark.locality.wait.node", "3s") \
    .config("spark.locality.wait.rack", "3s") \
    .getOrCreate()
# 3s is the default; increase it if your cluster has network bottlenecks
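How a scheduler might rank the four locality levels can be sketched as follows. This is a simplified illustration, not Spark's actual scheduler: the block and executor dictionaries are hypothetical, and the waiting behaviour controlled by spark.locality.wait is not modelled.

```python
# Locality levels from best to worst.
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]

def locality(block: dict, executor: dict) -> str:
    """Classify how close an executor is to a block of data."""
    if block["cached_in"] == executor["id"]:
        return "PROCESS_LOCAL"
    if block["host"] == executor["host"]:
        return "NODE_LOCAL"
    if block["rack"] == executor["rack"]:
        return "RACK_LOCAL"
    return "ANY"

def pick_executor(block: dict, free_executors: list) -> tuple:
    """Choose the free executor with the best locality level for this block."""
    best = min(free_executors, key=lambda e: LEVELS.index(locality(block, e)))
    return best["id"], locality(block, best)

# Made-up cluster layout: the block lives on node-2 in rack-A and is not cached.
block = {"host": "node-2", "rack": "rack-A", "cached_in": None}
executors = [
    {"id": "exec-1", "host": "node-7", "rack": "rack-B"},
    {"id": "exec-2", "host": "node-3", "rack": "rack-A"},
    {"id": "exec-3", "host": "node-2", "rack": "rack-A"},
]
choice = pick_executor(block, executors)
print(choice)
```

Here exec-3 wins because it runs on the machine that already stores the block, so the task reads from local disk instead of the network.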
🔑 Interview Tip
When asked "why is Spark fast?", data locality is a key answer. The task scheduler assigns tasks to nodes where the data already lives, minimizing expensive network I/O.
🧠 Quick Check: Which CAP property is NON-NEGOTIABLE in real distributed systems?
Consistency (C)
Availability (A)
Partition Tolerance (P)
All three
1.2

Big Data Ecosystem

The major tools and technologies that make up the modern big data stack — and how they work together.

🐘
Hadoop FOUNDATION
What is Hadoop?
Apache Hadoop (2006) was the first widely adopted open-source framework for distributed storage and processing. Spark was later built to overcome Hadoop's limitations, primarily the disk I/O between every step.
🗂️
HDFS
Distributed file system. Splits files into 128MB blocks across nodes with 3x replication.
🗺️
MapReduce
Original processing engine. Map = split work, Reduce = combine. Slow due to disk I/O.
🧶
YARN
Resource manager. Decides which node gets how much CPU/memory per task.
⚡ Spark vs Hadoop MapReduce
MapReduce writes intermediate results to disk after every step. Spark can keep data in memory across operations. Result: Spark can be 10–100x faster for iterative workloads (ML, graph processing).
💿
HDFS — Hadoop Distributed File System STORAGE
How HDFS Works
HDFS splits files into 128MB blocks across DataNodes. Each block is replicated 3 times by default. The NameNode (master) tracks where every block lives.
🧠 Analogy
HDFS is like a distributed library. The NameNode is the librarian who knows where every book is. DataNodes are shelves across different rooms. Every book has 3 copies — if one room floods, you still find the book.
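The block-and-replica bookkeeping can be sketched in plain Python. This is an illustration only: the DataNode names are hypothetical and replicas are placed round-robin, whereas real HDFS placement is rack-aware.

```python
import math

BLOCK_SIZE = 128 * 1024**2   # 128 MB default block size
REPLICATION = 3              # default replication factor

def plan_blocks(file_size: int, datanodes: list) -> dict:
    """Return a NameNode-style map: block index -> DataNodes holding a replica.

    Placement here is simple round-robin; real HDFS placement is rack-aware.
    """
    n_blocks = math.ceil(file_size / BLOCK_SIZE)
    block_map = {}
    for b in range(n_blocks):
        block_map[b] = [datanodes[(b + r) % len(datanodes)] for r in range(REPLICATION)]
    return block_map

nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]   # hypothetical DataNode names
block_map = plan_blocks(1024**3, nodes)        # a 1 GB file

print(len(block_map), "blocks")
print(block_map[0])

# Losing one DataNode still leaves every block with at least two replicas.
survivors = {b: [n for n in reps if n != "dn1"] for b, reps in block_map.items()}
worst_case = min(len(reps) for reps in survivors.values())
print(worst_case, "replicas left in the worst case")
```

A 1 GB file becomes 8 blocks, and with 3 replicas each the loss of any single DataNode never makes a block unavailable.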
bash — hdfs commands
# List files in HDFS
hdfs dfs -ls /user/data/

# Upload a file to HDFS
hdfs dfs -put localfile.csv /user/data/

# Read an HDFS file from PySpark (Python, run inside a Spark session, not in the shell):
#   df = spark.read.csv("hdfs://namenode:9000/user/data/sales.csv")

# Check replication factor for a file
hdfs fsck /user/data/myfile.parquet -files -blocks
🧶
YARN — Resource Manager SCHEDULING
What is YARN?
YARN (Yet Another Resource Negotiator) is the cluster resource manager. When you submit a Spark job, YARN decides how many containers (CPU + memory) to allocate, and on which nodes.
bash — submit spark to yarn
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-memory 4g \
  --executor-cores 2 \
  my_spark_job.py
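It helps to work out what those flags actually request from the cluster. The arithmetic below is a rough sketch: it assumes Spark's default executor memory overhead of max(384 MiB, 10% of executor memory) and ignores the driver container and any rounding YARN applies to container sizes.

```python
# What the spark-submit flags above ask YARN for, roughly.
# ASSUMPTION: default memory overhead of max(384 MiB, 10% of executor memory);
# the driver/ApplicationMaster container and YARN rounding are ignored here.

def executor_container_mib(executor_memory_mib: int, overhead_factor: float = 0.10,
                           min_overhead_mib: int = 384) -> int:
    """Memory YARN must reserve for one executor: heap plus off-heap overhead."""
    overhead = max(min_overhead_mib, int(executor_memory_mib * overhead_factor))
    return executor_memory_mib + overhead

num_executors = 10
executor_memory_mib = 4 * 1024   # --executor-memory 4g
executor_cores = 2               # --executor-cores 2

per_container = executor_container_mib(executor_memory_mib)
total_memory_gib = num_executors * per_container / 1024
total_cores = num_executors * executor_cores   # = max tasks running at once

print(f"per executor container: {per_container} MiB")
print(f"total executor memory:  {total_memory_gib:.1f} GiB")
print(f"parallel task slots:    {total_cores}")
```

So this job needs about 44 GiB of cluster memory for its executors and can run at most 20 tasks at the same time.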
🐝
Apache Hive SQL ON BIG DATA
What is Hive?
Apache Hive provides a SQL-like interface (HiveQL) over Hadoop data. It translates SQL into MapReduce/Tez/Spark jobs. Hive also provides the Hive Metastore — a central registry of table schemas used by Spark.
sql — hiveql
-- Create external Hive table over HDFS Parquet data
CREATE EXTERNAL TABLE sales (
  order_id  STRING,
  amount    DECIMAL(10,2),
  city      STRING
)
STORED AS PARQUET
LOCATION '/data/sales/';

-- In PySpark (Python, not HiveQL), query the same Hive table directly:
--   df = spark.sql("SELECT city, SUM(amount) FROM sales GROUP BY city")
🔑 Why It Matters for Spark
Spark integrates directly with the Hive Metastore. With Hive support enabled on the session (SparkSession.builder.enableHiveSupport()), spark.sql("SELECT * FROM my_hive_table") uses the metastore to discover table location and schema automatically.
Apache Spark THE STAR OF THIS COURSE
What Makes Spark Special?
Apache Spark (2009, UC Berkeley) keeps data in memory across operations, which can make it 10–100x faster than Hadoop MapReduce on iterative workloads. It's a unified engine for batch, streaming, SQL, and ML.
🧠
In-Memory Processing
Data can stay in RAM between operations, so most steps avoid disk I/O (shuffles and spills still touch disk).
🔁
Lazy Evaluation
Spark builds a query plan but executes only when you call an action. Enables optimization.
🌊
Unified Engine
Batch + Streaming + ML + SQL — all one framework.
🐍
PySpark API
Python wrapper around Spark. Write Python, runs distributed on a cluster.
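Lazy evaluation is easiest to see in a toy model. The `LazyDataset` class below is a hypothetical illustration, not Spark's API: transformations only record a plan, and the action `collect()` runs the whole plan in one pass.

```python
class LazyDataset:
    """Toy model of lazy evaluation: transformations are recorded, not run."""
    def __init__(self, source, plan=()):
        self.source = source
        self.plan = plan            # tuple of ("map" | "filter", function)

    def map(self, fn):
        return LazyDataset(self.source, self.plan + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self.source, self.plan + (("filter", fn),))

    def collect(self):
        """Action: only now is the recorded plan executed, in a single pass."""
        out = []
        for item in self.source:
            keep = True
            for kind, fn in self.plan:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out

calls = []
def double(x):
    calls.append(x)     # record every time real work happens
    return x * 2

ds = LazyDataset(range(5)).map(double).filter(lambda x: x > 4)
work_before_action = len(calls)   # transformations alone did no work
result = ds.collect()
print(work_before_action, result, len(calls))
```

Because the full plan is known before anything runs, an engine like Spark has the chance to reorder or combine steps before execution.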
📨
Apache Kafka STREAMING
What is Kafka?
Apache Kafka is a distributed event streaming platform. Producers write events, Consumers read them. Spark Structured Streaming reads from Kafka in near real-time.
🧠 Analogy
Kafka is like a newspaper printing press. Publishers write news. The press stores all editions. Subscribers can read any edition at their own pace, even yesterday's paper.
python — spark reads from kafka
# Read real-time data from Kafka into Spark Structured Streaming
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka-host:9092") \
    .option("subscribe", "user-events") \
    .load()

# Kafka messages come as bytes — decode them
df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
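The "read any edition at your own pace" idea comes from Kafka's append-only log with per-consumer offsets. The sketch below is a deliberately simplified model (one topic, one partition, in memory); `TopicLog` is a hypothetical class, not part of any Kafka client library.

```python
class TopicLog:
    """Toy single-partition topic: an append-only list plus per-consumer offsets."""
    def __init__(self):
        self.events = []
        self.offsets = {}           # consumer name -> next offset to read

    def produce(self, event: str) -> int:
        self.events.append(event)   # events are appended, never modified
        return len(self.events) - 1 # offset of the new event

    def consume(self, consumer: str, max_events: int = 10) -> list:
        start = self.offsets.get(consumer, 0)
        batch = self.events[start:start + max_events]
        self.offsets[consumer] = start + len(batch)
        return batch

log = TopicLog()
for e in ["login", "click", "purchase"]:
    log.produce(e)

fast = log.consume("dashboard")                 # reads everything so far
slow = log.consume("billing", max_events=1)     # reads at its own pace
log.produce("logout")
fast_next = log.consume("dashboard")            # only the new event
slow_next = log.consume("billing")              # catches up on older events

print(fast, slow, fast_next, slow_next)
```

Each consumer tracks its own position, so a slow reader never blocks a fast one and nothing is lost when a reader falls behind.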
🗓️
Airflow, Trino, Flink, HBase ECOSYSTEM
Other Key Ecosystem Tools
Tool | Type | Purpose | Spark Connection
Apache Airflow | Orchestrator | Schedules data pipeline DAGs | Triggers Spark jobs via SparkSubmitOperator
Trino (Presto) | SQL Engine | Interactive SQL across many data sources | Complementary: Trino for ad-hoc, Spark for ETL
Apache Flink | Stream Processor | True real-time event processing | Competitor for streaming; Spark Streaming is micro-batch
HBase | NoSQL DB | Random read/write on HDFS at row level | Spark can read/write HBase for random access
1.3

Types of Data

The three categories of data Big Data systems must handle — and how Spark deals with each.

📊
Structured Data MOST COMMON IN SPARK
What is Structured Data?
Structured data has a predefined schema — rows and columns with fixed data types. Think of a spreadsheet or SQL table. Easiest type to work with in Spark.
🧠 Analogy
Structured data is like a perfectly organized Excel sheet — every column has a name and type, every row follows the same format. No surprises.
python — structured data in pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

spark = SparkSession.builder.appName("StructuredData").getOrCreate()

data = [
    ("Alice", 30, "Engineer", 95000.0),
    ("Bob",   25, "Designer", 72000.0),
    ("Carol", 35, "Manager",  120000.0)
]

schema = StructType([
    StructField("name",   StringType(),  nullable=False),
    StructField("age",    IntegerType(), nullable=True),
    StructField("role",   StringType(),  nullable=True),
    StructField("salary", DoubleType(),  nullable=True)
])

df = spark.createDataFrame(data, schema)
df.show()
# +-----+---+--------+--------+
# | name|age|    role|  salary|
# +-----+---+--------+--------+
# |Alice| 30|Engineer| 95000.0|
# |  Bob| 25|Designer| 72000.0|
# |Carol| 35| Manager|120000.0|
# +-----+---+--------+--------+

# Also from files:
df2 = spark.read.parquet("s3://bucket/transactions/")
df3 = spark.read.jdbc(url="jdbc:postgresql://...", table="orders")
Examples: CSV files, Parquet files, SQL database tables, ORC files, Excel sheets
🧩
Semi-Structured Data MOST COMMON IN REAL WORLD
What is Semi-Structured Data?
Semi-structured data has some organization (keys/tags) but no rigid schema. It's self-describing — structure is embedded in the data. JSON and XML are the most common. Rows can have different fields.
🧠 Analogy
Semi-structured data is like a business card. Everyone has name and contact info, but some have Twitter handles, some have fax numbers. Flexible structure, no fixed schema enforced.
python — semi-structured json in pyspark
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

json_data = [
    ('{"user":"alice","tags":["python","spark"],"address":{"city":"Bangalore"}}',),
    ('{"user":"bob","tags":["java"],"phone":"9999999999"}',)  # different fields!
]
df = spark.createDataFrame(json_data, ["json_col"])

schema = StructType([
    StructField("user", StringType()),
    StructField("tags", ArrayType(StringType())),
    StructField("address", StructType([StructField("city", StringType())]))
])

parsed = df.withColumn("parsed", from_json(col("json_col"), schema))
parsed.select("parsed.user", "parsed.tags", "parsed.address.city").show()
# +-----+----------------+---------+
# |alice|[python, spark] |Bangalore|
# |  bob|        [java]  |     null|  ← missing fields become null
Examples: JSON files, XML files, AVRO files, log files, Kafka messages
🎲
Unstructured Data FASTEST GROWING
What is Unstructured Data?
Unstructured data has no predefined format or schema. It requires specialized tools (ML, NLP, Computer Vision) to extract meaning. A commonly cited estimate is that 70–80% of the world's data is unstructured.
🧠 Analogy
Unstructured data is a pile of photographs. Each contains valuable information, but you can't run SQL on it. You need AI/ML to describe what's in each photo before analysis.
python — unstructured text in pyspark
from pyspark.sql.functions import col, split, size, length

reviews = [
    ("1", "This product is absolutely amazing! Best purchase ever."),
    ("2", "Terrible quality. Broke after 2 days. Very disappointed."),
]
df = spark.createDataFrame(reviews, ["id", "review"])

# Basic processing on raw text
df = df.withColumn("word_count", size(split(col("review"), " "))) \
       .withColumn("char_count", length(col("review")))

# For NLP: use Spark NLP or Pandas UDF with transformers
from pyspark.ml.feature import Tokenizer, HashingTF
tokenizer = Tokenizer(inputCol="review", outputCol="words")
words_df = tokenizer.transform(df)
Examples: images (PNG, JPEG), videos (MP4), audio files, text documents, social media posts, PDF files
📊
Complete Comparison SUMMARY
Property | Structured | Semi-Structured | Unstructured
Schema | Fixed, predefined | Flexible / self-describing | None
Storage | RDBMS, Parquet, ORC | JSON, XML, AVRO | S3/HDFS raw files
Query Method | SQL directly | SQL after parsing | Requires ML/NLP
Spark Tool | DataFrame API | from_json(), schema inference | MLlib, Python UDFs
% of World's Data | ~20% | ~10% | ~70%
1.4

File Formats

Deep dive into every file format used in Big Data, plus critical micro-topics: Row vs Columnar, Compression, Predicate Pushdown, Schema Evolution, Serialization, and Delta vs Iceberg vs Hudi.

📁
All File Formats — Overview OVERVIEW
CSV
Row · Text
  • Human readable
  • No compression
  • No schema
  • Slow for big data
JSON
Row · Text
  • Nested structures
  • Schema embedded
  • Verbose/large
  • Flexible
XML
Row · Text
  • Tags + attributes
  • Very verbose
  • Legacy systems
  • Complex parsing
AVRO
Row · Binary
  • Schema evolution
  • Kafka standard
  • Compact binary
  • Streaming-friendly
ORC
Columnar · Binary
  • Hive optimized
  • Built-in indexes
  • Great compression
  • ACID support
PARQUET
Columnar · Binary
  • Spark's default
  • Predicate pushdown
  • Column pruning
  • Best for analytics
DELTA
Columnar + Log
  • ACID transactions
  • Time travel
  • Schema evolution
  • CDC support
ICEBERG
Open Table Format
  • Hidden partitioning
  • Partition evolution
  • Multi-engine
  • Snapshot isolation
HUDI
Open Table Format
  • Native upserts
  • COW + MOR types
  • Incremental reads
  • Record-level index
Row vs Columnar Storage MICRO TOPIC — CRITICAL
Row-Oriented Storage (CSV, JSON, AVRO)
Stores data row by row on disk. To read a single column from 1 million rows, you must read ALL columns of ALL rows from disk. Best for OLTP (insert/update individual records).
concept — row storage layout
# Disk layout: each row stored together
# [Alice | 30 | 95000] [Bob | 25 | 72000] [Carol | 35 | 120000]

# Query: SELECT SUM(salary) FROM employees
# MUST read: name + age + salary for EVERY row
# Wasted: name + age columns read but never used!
Columnar Storage (Parquet, ORC)
Stores all values for each column together on disk. SELECT SUM(salary) only reads the salary column; all other columns are skipped entirely. This is why Parquet can be 10–100x faster for analytical queries.
concept — columnar storage layout
# Disk layout: each column stored together
# [name col: Alice | Bob | Carol]
# [age col:  30    | 25  | 35]
# [salary:   95000 | 72000 | 120000]

# Query: SELECT SUM(salary) → ONLY reads salary block!
# Savings: name + age columns NEVER loaded from disk
# Also: similar data compresses much better (ints next to ints)
Use Case | Best Format | Why
Insert/update single records | Row (CSV, AVRO) | Access full row at once
Analytics: SUM, AVG over 1 column | Columnar (Parquet) | Only read needed column
Streaming ingestion (one event at a time) | AVRO (row) | Write one event at a time
Data warehouse queries | Parquet / ORC | Column pruning = fast scans
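The difference between the two layouts can be counted directly. The sketch below models both layouts as Python lists, using the three employee rows from the layout diagrams, and tallies how many stored values a SUM(salary) has to touch in each; real formats work on bytes and pages, so this is only a simplified picture.

```python
rows = [
    ("Alice", 30, 95000),
    ("Bob",   25, 72000),
    ("Carol", 35, 120000),
]

# Row layout: one record after another, all fields together.
row_store = [field for row in rows for field in row]

# Columnar layout: one contiguous list per column.
column_store = {
    "name":   [r[0] for r in rows],
    "age":    [r[1] for r in rows],
    "salary": [r[2] for r in rows],
}

# SELECT SUM(salary): the row layout walks every field of every record ...
values_read_row = 0
total_row = 0
for i in range(0, len(row_store), 3):
    name, age, salary = row_store[i:i + 3]
    values_read_row += 3
    total_row += salary

# ... while the columnar layout touches only the salary column.
salary_col = column_store["salary"]
values_read_col = len(salary_col)
total_col = sum(salary_col)

print(total_row, total_col)               # same answer
print(values_read_row, values_read_col)   # different amounts of data touched
```

Both layouts return the same sum, but the row layout reads 9 values to get it and the columnar layout reads 3; the gap widens with every extra column in the table.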
🗜️
Compression MICRO TOPIC
Why Compression Matters
In Big Data, I/O is the bottleneck — smaller files = faster reads and less S3/network cost. Columnar formats compress much better because similar data (all integers, all strings) is adjacent on disk.
python — compression in pyspark
# Snappy: best balance of speed + ratio (default for Parquet)
df.write.parquet("s3://bucket/data/", compression="snappy")

# ZSTD: new standard, best ratio + speed combo
df.write.parquet("s3://bucket/data/", compression="zstd")

# Gzip: best ratio but slow and NOT splittable for CSV
df.write.csv("s3://bucket/data/", compression="gzip")

# Set default globally
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
Algorithm | Speed | Ratio | Splittable | Best For
Snappy | Very fast | ~3:1 | Yes | Default: balanced
ZSTD | Fast | ~5:1 | Yes | Best for Parquet (recommended)
Gzip | Slow | ~5:1 | No (CSV) | Archive / cold data
None | Fastest | None | Yes | Dev / testing only
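Why adjacent similar values compress better can be shown with run-length encoding, one of the encodings Parquet applies before a general-purpose codec such as Snappy or ZSTD. The sketch below is a minimal illustration with made-up, highly repetitive data, not a model of Parquet's actual encoder.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# 1,000 made-up rows of (country, status) with highly repetitive values.
rows = [("IN", "ok")] * 1000

# Row layout interleaves the fields: IN, ok, IN, ok, ...
row_layout = [field for row in rows for field in row]
# Columnar layout groups them: IN x1000, then ok x1000
column_layout = [r[0] for r in rows] + [r[1] for r in rows]

row_runs = run_length_encode(row_layout)
col_runs = run_length_encode(column_layout)

print(len(row_runs), "runs in row layout")
print(len(col_runs), "runs in columnar layout:", col_runs)
```

The same 2,000 values need 2,000 runs when interleaved row by row but only 2 runs when stored column by column.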
🎯
Predicate Pushdown MICRO TOPIC — VERY IMPORTANT
What is Predicate Pushdown?
Predicate pushdown = pushing filter conditions down to the file/storage level so only relevant data is loaded into Spark's memory. The file format itself skips irrelevant row groups before Spark even sees the data.
🧠 Analogy
You want India orders from a 1TB file. Without pushdown: load all 1TB into Spark, then filter. With pushdown: Parquet's statistics say "rows 5M–7M are India" — so only read those 2M rows. Like an index in a book.
python — predicate pushdown demo
# Parquet stores min/max statistics per row group (128MB chunks)
# Spark checks: does this row group overlap the filter range?
# If not → entire row group SKIPPED without reading

df = spark.read.parquet("s3://data/orders/")

result = df.filter(df.order_date >= "2024-01-01")

# Verify pushdown is happening:
result.explain(True)
# Look for: PushedFilters: [GreaterThanOrEqual(order_date,2024-01-01)]

# Column pruning (also pushed down)
# Only reads 'country', 'city' and 'amount' from Parquet, skips all other columns
df.filter(df.country == "IN").select("city", "amount").show()
🔑 Interview Point
Predicate pushdown can skip data in Parquet and ORC because they store per-column statistics. With CSV/JSON, Spark must still read every byte, because those formats have no embedded statistics to skip on.
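The row-group skipping logic is simple enough to sketch in plain Python. The row groups, their statistics, and the `scan` function below are made up for illustration; a real Parquet reader applies the same min/max test per column chunk.

```python
# Each "row group" carries min/max statistics for order_date, as Parquet does.
# The data is made up; ISO date strings compare correctly as plain strings.
row_groups = [
    {"min": "2023-01-01", "max": "2023-06-30", "rows": ["2023-01-01", "2023-03-15", "2023-06-30"]},
    {"min": "2023-07-01", "max": "2023-12-31", "rows": ["2023-07-01", "2023-12-31"]},
    {"min": "2023-12-15", "max": "2024-02-10", "rows": ["2023-12-15", "2024-01-20", "2024-02-10"]},
    {"min": "2024-02-11", "max": "2024-05-01", "rows": ["2024-02-11", "2024-05-01"]},
]

def scan(groups, lower_bound):
    """Return rows with order_date >= lower_bound, skipping groups via statistics."""
    matched, groups_read = [], 0
    for g in groups:
        if g["max"] < lower_bound:
            continue                      # no row in this group can match: skip it unread
        groups_read += 1
        matched.extend(d for d in g["rows"] if d >= lower_bound)   # residual row filter
    return matched, groups_read

matched, groups_read = scan(row_groups, "2024-01-01")
print(matched)
print(f"read {groups_read} of {len(row_groups)} row groups")
```

Two of the four row groups are skipped without being read. The third group overlaps the filter range, so it is read in full and filtered row by row, which is why sorting data by the filter column makes pushdown more effective.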
🔄
Schema Evolution MICRO TOPIC
What is Schema Evolution?
Schema evolution = ability to change a table's schema (add columns, rename, change types) without breaking existing data or pipelines. Critical in production where schemas change as business evolves.
python — schema evolution in delta lake
# Original table: name, age only
df_v1 = spark.createDataFrame([("Alice", 30)], ["name", "age"])
df_v1.write.format("delta").save("s3://bucket/users")

# Business adds 'email' column later
df_v2 = spark.createDataFrame([("Bob", 25, "bob@example.com")], ["name", "age", "email"])

# mergeSchema: adds new columns automatically, no error!
df_v2.write.format("delta") \
     .option("mergeSchema", "true") \
     .mode("append") \
     .save("s3://bucket/users")

# Old rows get null for new column — clean merge
spark.read.format("delta").load("s3://bucket/users").show()
# |Alice|30|           null|  ← old row, null for new column
# |  Bob|25|bob@example.com|
Format | Schema Evolution | Method
CSV / JSON | Manual only | No built-in support
Parquet | Add columns only | Schema merge on read
AVRO | Full (with defaults) | Schema Registry
Delta Lake | Full | mergeSchema option
Iceberg | Best-in-class | Schema evolution API
Hudi | Full | Schema-on-read
📦
Serialization MICRO TOPIC
What is Serialization?
Serialization = converting an in-memory object into bytes for storage or network transfer. Spark does this millions of times during shuffles. It's a critical performance factor.
🧠 Analogy
Packing a suitcase: your room (in-memory object) → packed suitcase (bytes on wire) → unpacked at destination (deserialization). Faster and smaller packing = better performance.
python — serialization config
from pyspark.sql import SparkSession

# spark.serializer must be set before the session starts;
# Java serialization is the default (slow, large output)
# Kryo serialization: typically faster and smaller (recommended for RDDs)
spark = SparkSession.builder \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryo.classesToRegister", "com.example.MyClass") \
    .getOrCreate()
# classesToRegister: register custom classes for maximum Kryo performance

# For DataFrames: Tungsten UnsafeRow is used AUTOMATICALLY
# Binary format with no Java object overhead
# This is why DataFrames are faster than RDDs!
Serializer | Speed | Size | Best For
Java (default) | Slow | Large | Simple/small use cases
Kryo | 3x faster than Java | Smaller | RDD-heavy workloads
Tungsten UnsafeRow | Fastest | Binary minimal | DataFrame API (automatic)
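The size gap between generic and schema-aware serialization can be seen with the Python standard library alone. This is an analogy, not Spark's implementation: `pickle` stands in for a generic object serializer, and a fixed `struct` layout stands in for a compact binary row format where the schema is known in advance.

```python
import pickle
import struct

record = {"age": 30, "salary": 95000.0}

# Generic object serialization: self-describing, carries field names and type tags.
generic_bytes = pickle.dumps(record)

# Fixed binary layout: the schema (int32 + float64) is known up front,
# so only the raw values are written. "<id" = little-endian int32 then float64.
LAYOUT = struct.Struct("<id")
compact_bytes = LAYOUT.pack(record["age"], record["salary"])

# Deserialization reverses each step.
restored_generic = pickle.loads(generic_bytes)
age, salary = LAYOUT.unpack(compact_bytes)

print(len(generic_bytes), "bytes with pickle")
print(len(compact_bytes), "bytes with the fixed layout")
```

The fixed layout needs exactly 12 bytes because the field names and types live in the schema rather than in every record, which is the same idea that lets DataFrames avoid per-object overhead.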
🏆
Delta vs Iceberg vs Hudi — Deep Comparison MICRO TOPIC — INTERVIEW CRITICAL
Why Open Table Formats Exist
Raw Parquet/ORC files don't support ACID transactions, time travel, or upserts. Open Table Formats add a metadata layer on top of Parquet to provide database-like features for data lakes.
Query Engine (Spark / Trino / Flink / Athena)
Table Format (Delta / Iceberg / Hudi) — ACID, time travel, upserts
Parquet data files on S3 / HDFS / GCS
Side-by-Side Comparison
Feature | Delta Lake | Apache Iceberg | Apache Hudi
Creator | Databricks | Netflix | Uber
ACID Transactions | ✓ Full | ✓ Full | ✓ Full
Time Travel | ✓ Version/timestamp | ✓ Snapshot-based | ✓ Limited
Schema Evolution | ✓ Full | ✓ Best-in-class | ✓ Full
Partition Evolution | ✗ No | ✓ Best feature | Limited
Native Upserts | ✓ MERGE INTO | ✓ MERGE INTO | ✓ Native UPSERT
Multi-Engine Support | Databricks-first | Best (Spark+Trino+Flink) | Spark-primary
Best For | Databricks users | Multi-engine lakehouses | High upsert/CDC workloads
Internals: Transaction Logs
concept — delta lake transaction log
# Delta Lake file structure
s3://bucket/my_table/
  ├── _delta_log/
  │   ├── 00000000000000000000.json   # version 0: CREATE TABLE
  │   ├── 00000000000000000001.json   # version 1: INSERT
  │   ├── 00000000000000000002.json   # version 2: UPDATE
  │   └── 00000000000000000010.checkpoint.parquet  # checkpoint every 10 commits
  ├── part-00000-abc.snappy.parquet   # actual data files (Parquet)
  └── part-00001-def.snappy.parquet

# Iceberg file structure
s3://bucket/my_iceberg_table/
  ├── metadata/
  │   ├── v1.metadata.json     # Table metadata + snapshot refs
  │   ├── snap-123.avro        # Manifest list (list of manifests)
  │   └── manifest-456.avro   # Manifest (list of data files + stats)
  └── data/
      └── country=IN/
          └── part-00000.parquet

# Time travel with Delta
spark.read.format("delta") \
     .option("versionAsOf", 5) \
     .load("s3://bucket/my_table")

# Time travel with Iceberg
spark.read.format("iceberg") \
     .option("snapshot-id", 123456789) \
     .load("my_iceberg_table")
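Time travel falls out of the log structure: the table at version N is whatever you get by replaying commits 0 through N. The sketch below uses a heavily simplified commit format with made-up file names; real Delta commits are JSON files with more fields, and the `files_at` helper is hypothetical.

```python
# A simplified commit log: each version lists files added and removed.
# Real Delta commits are JSON files with more fields; file names here are made up.
commits = [
    {"version": 0, "add": ["part-000.parquet"], "remove": []},
    {"version": 1, "add": ["part-001.parquet"], "remove": []},
    {"version": 2, "add": ["part-002.parquet"], "remove": ["part-000.parquet"]},  # UPDATE rewrote a file
]

def files_at(commits: list, version: int) -> set:
    """Replay the log up to `version` to find the data files that make up the table."""
    live = set()
    for commit in sorted(commits, key=lambda c: c["version"]):
        if commit["version"] > version:
            break
        live |= set(commit["add"])
        live -= set(commit["remove"])
    return live

latest = files_at(commits, version=2)
as_of_v1 = files_at(commits, version=1)   # "time travel": stop replaying early

print(sorted(latest))
print(sorted(as_of_v1))
```

Old data files are only marked as removed in the log, not deleted right away, which is what makes reading an earlier version possible.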
🧠 Quick Check: SELECT SUM(revenue) FROM sales WHERE country = 'IN' runs on a Parquet file. Which optimizations apply automatically?
Full table scan — reads every row and every column
Column pruning + Predicate pushdown — skip unneeded columns and row groups
Only column pruning — reads just country + revenue columns