2.1
What is Apache Spark?
The story of Spark, what problem it solves, and where it's used in the real world — the foundation before we go into architecture.
History of Apache Spark
BACKGROUND
▾
From MapReduce to Spark
Spark was created in 2009 at UC Berkeley's AMPLab, became an Apache project in 2010, and was open-sourced in 2014 as a top-level Apache project. It was built to fix the biggest pain point of Hadoop MapReduce: every step writes to disk. Spark keeps data in memory (RAM) between steps, making it 10–100x faster for iterative workloads like ML and interactive queries.
🧠 Analogy
MapReduce is like a chef who, after every step (chopping, boiling, frying), puts the food away in the fridge and takes it out again for the next step. Spark is a chef who keeps everything on the counter (RAM) and moves straight from one step to the next — no fridge trips unless the counter gets full.| Aspect | Hadoop MapReduce | Apache Spark |
|---|---|---|
| Processing Model | Disk-based, step-by-step | In-memory (RAM) with disk fallback |
| Speed | Slow for iterative jobs | 10–100x faster |
| API | Java-heavy, verbose | Python, Scala, Java, SQL, R |
| Workloads | Batch only | Batch + Streaming + ML + Graph + SQL |
| Ease of Use | Low-level map/reduce code | High-level DataFrame/SQL APIs |
🔑 Key Insight
Spark didn't replace HDFS or YARN — it replaced the processing engine. You can still store data on HDFS/S3 and use YARN/Kubernetes to manage resources; Spark is the compute layer that runs on top.
Components (Quick Overview)
OVERVIEW
▾
A Unified Engine, Multiple Modules
Spark is one engine (Spark Core) with several libraries built on top, all sharing the same execution engine and the same cluster. We'll cover each in detail in 2.2 — for now, just know they exist and share resources.
Spark Core
RDDs, task scheduling, memory management — the engine everything else runs on.
Spark SQL
DataFrames, SQL queries, Catalyst optimizer.
Spark Streaming
Structured Streaming for real-time data.
MLlib
Machine learning algorithms at scale.
GraphX
Graph processing (social networks, etc.)
Real-World Use Cases
WHY IT MATTERS
▾
Where Spark Shows Up On The Job
As a Data Engineer, you'll use Spark to move and transform data at scale. Here are the most common real-world scenarios:
ETL / ELT Pipelines
Read from databases/files/Kafka, clean & transform, write to Delta/Snowflake/warehouse.
Analytics & BI
Aggregate billions of rows for dashboards (sales, fraud, ops metrics).
CDC Pipelines
Capture database changes (insert/update/delete) and replicate to a lakehouse.
Real-Time Streaming
Process clickstreams, IoT sensor data, fraud detection in near real-time.
Feature Engineering for ML
Prepare large training datasets for machine learning models.
Lakehouse Platforms
Build Bronze/Silver/Gold layers on Delta, Iceberg, or Hudi.
💡 Example
A food delivery company uses Spark to: read order events from Kafka in real-time → join with restaurant and rider data → compute delivery ETAs → write results to Delta Lake → power a live ops dashboard. All of this happens continuously, 24/7, at massive scale.2.2
Spark Components
A deep look at each library that makes up the Spark ecosystem, what it's used for, and how it fits with PySpark.
Spark Core
THE FOUNDATION
▾
What is Spark Core?
Spark Core is the base engine that provides distributed task scheduling, memory management, fault recovery, and the original RDD (Resilient Distributed Dataset) API. Every other component (SQL, Streaming, MLlib, GraphX) is built ON TOP of Spark Core and uses its scheduler and memory manager under the hood.
🧠 Analogy
Spark Core is like the engine and chassis of a car. Spark SQL, Streaming, MLlib, and GraphX are different "bodies" (sedan, truck, sports car) built on the same chassis — they all rely on the same engine to actually move.python — spark core / rdd basics
from pyspark import SparkContext
# SparkContext is the entry point to Spark Core (RDD API)
sc = SparkContext(appName="CoreExample")
# Create an RDD — a low-level distributed collection
rdd = sc.parallelize([1, 2, 3, 4, 5])
# Transformation (lazy) + Action (triggers execution)
squared = rdd.map(lambda x: x * x)
print(squared.collect()) # [1, 4, 9, 16, 25]
🔑 Key Insight
In modern PySpark, you rarely touch RDDs directly — you use DataFrames (Spark SQL). But DataFrames are compiled DOWN to RDD operations internally, so understanding Spark Core helps you understand WHY things behave the way they do (Module 4 covers RDDs in depth).
Spark SQL
MOST USED
▾
What is Spark SQL?
Spark SQL is the module for working with structured data using DataFrames and SQL queries. It includes the Catalyst Optimizer, which converts your DataFrame code or SQL into an optimized execution plan. This is what 90% of data engineers use daily.
🧠 Analogy
Spark SQL is like a smart GPS. You tell it the destination ("give me total sales by region"), and it figures out the best route (execution plan) — whether that means taking the highway (broadcast join) or back roads (shuffle join) — without you manually planning every turn.python — spark sql / dataframe
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SQLExample").getOrCreate()
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
# DataFrame API
df.groupBy("region").sum("amount").show()
# Or pure SQL — same engine, same optimizer
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) FROM sales GROUP BY region").show()
🔑 Key Insight
DataFrame API and SQL produce identical execution plans — Catalyst optimizes both the same way. Choose whichever is more readable for the task; performance is the same.
Spark Streaming
REAL-TIME
▾
What is Spark Streaming?
Spark Streaming (now Structured Streaming) processes continuous, never-ending data — like a Kafka topic emitting events 24/7. Spark breaks this infinite stream into small "micro-batches" and applies the same DataFrame API you'd use for batch data. We cover this fully in Modules 18A–18J.
🧠 Analogy
Batch processing is reading an entire book at once. Streaming is reading the book one page at a time AS IT'S BEING WRITTEN — you process each new page (micro-batch) as soon as it arrives, without waiting for "The End".python — structured streaming preview
# Read a continuous stream from Kafka
stream_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "host:9092") \
.option("subscribe", "orders") \
.load()
# Same DataFrame API as batch!
query = stream_df.groupBy("region").count() \
.writeStream \
.format("console") \
.outputMode("complete") \
.start()
MLlib
MACHINE LEARNING
▾
What is MLlib?
MLlib is Spark's distributed machine learning library — algorithms for classification, regression, clustering, recommendation, plus feature engineering tools (scaling, encoding) and pipelines. It scales ML to datasets too big to fit on one machine.
🧠 Analogy
Scikit-learn trains a model using data that fits in one computer's RAM. MLlib trains the same TYPE of model, but the data and computation are spread across hundreds of machines — like training on a 10TB dataset that no single laptop could hold.python — mllib quick example
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
# Combine feature columns into a single vector column
assembler = VectorAssembler(inputCols=["age", "salary"], outputCol="features")
data = assembler.transform(df)
# Train a distributed linear regression model
lr = LinearRegression(featuresCol="features", labelCol="spend")
model = lr.fit(data)
🔑 Key Insight
As a Data Engineer, you typically won't build models — but you'll often do the feature engineering (joining, aggregating, cleaning data) that feeds into MLlib or external ML platforms.
GraphX
GRAPH PROCESSING
▾
What is GraphX?
GraphX is Spark's API for graphs — data made of vertices (nodes) and edges (connections), like social networks, recommendation graphs, or fraud rings. It supports algorithms like PageRank and connected components. Note: GraphX is Scala/Java only — it does not have a native PySpark API (GraphFrames is a community package that adds graph support to PySpark).
🧠 Analogy
Think of GraphX as the engine behind "people you may know" on a social network — it analyzes how millions of users (vertices) connect to each other (edges) to find patterns, clusters, and influential nodes.Vertices
The "things" in the graph — users, products, accounts.
Edges
Relationships — "follows", "purchased", "transferred money to".
Common Use Cases
Fraud ring detection, recommendation engines, network analysis.
⚠️ Note
For PySpark users, GraphX is rarely used directly. It's listed here for completeness since it's part of the Spark ecosystem — but in practice, graph workloads on PySpark use the GraphFrames package or dedicated graph databases (Neo4j, Neptune).2.3
Cluster Managers
Spark needs a "manager" to give it machines (CPU + RAM) to run on. Here are the four cluster managers Spark supports.
Standalone Cluster Manager
SIMPLEST
▾
What is Standalone Mode?
Spark's built-in cluster manager — no extra software needed. You start a master process and worker processes manually, and Spark manages resources itself. Great for learning, small clusters, or dedicated Spark-only environments.
🧠 Analogy
Standalone mode is like running your own small restaurant where you (the owner) personally assign tables to waiters. No external booking system — just you managing everything directly.bash — starting a standalone cluster
# On the master machine
$ start-master.sh
# Master UI available at http://master-host:8080
# On each worker machine, connect to the master
$ start-worker.sh spark://master-host:7077
# Submit a job to the standalone cluster
$ spark-submit --master spark://master-host:7077 my_job.py
🔑 Key Insight
Standalone mode is rarely used in production at large companies — it's mostly for learning, testing, or small on-prem deployments. Production usually uses YARN, Kubernetes, or managed platforms (Databricks, EMR).
YARN (Yet Another Resource Negotiator)
HADOOP ECOSYSTEM
▾
What is YARN?
YARN is Hadoop's resource manager. It was built for MapReduce but Spark integrates with it fully. YARN has a ResourceManager (decides which apps get resources) and NodeManagers (run on each machine, manage containers). YARN is the most common cluster manager in traditional on-prem Hadoop clusters and EMR.
🧠 Analogy
YARN is like an office building manager. When your team (Spark app) needs meeting rooms (containers with CPU/RAM), you ask the building manager (ResourceManager), who checks availability across all floors (NodeManagers) and assigns you rooms.bash — submitting to yarn
# Cluster mode — driver runs INSIDE the cluster (on a NodeManager)
$ spark-submit --master yarn --deploy-mode cluster my_job.py
# Client mode — driver runs on the machine you submit from
$ spark-submit --master yarn --deploy-mode client my_job.py
| Mode | Driver Location | Best For |
|---|---|---|
| cluster | Inside the cluster (a container) | Production jobs — survives if your terminal closes |
| client | On the machine running spark-submit | Interactive/debugging — see logs locally in real time |
Kubernetes
CLOUD-NATIVE / MODERN
▾
What is Spark on Kubernetes?
Spark can run natively on Kubernetes (K8s) — the driver and each executor run as their own pods. This is the modern, cloud-native way to run Spark, used heavily on EKS, GKE, AKS. We cover this in full depth in Module 28.
🧠 Analogy
Kubernetes is like a smart shipping company. You say "I need 1 big container (driver pod) and 10 medium containers (executor pods)", and Kubernetes finds space on its ships (nodes), schedules them, restarts any that fail, and scales up/down as needed.bash — submitting to kubernetes
$ spark-submit \
--master k8s://https://<k8s-api-server>:443 \
--deploy-mode cluster \
--conf spark.kubernetes.container.image=my-spark-image:latest \
--conf spark.executor.instances=5 \
local:///opt/spark/jobs/my_job.py
🔑 Key Insight
Kubernetes is increasingly preferred over YARN for new platforms because it gives multi-tenancy, container isolation, and works the same across AWS/Azure/GCP — no vendor lock-in to a specific Hadoop distribution.
Apache Mesos
LEGACY / DEPRECATED
▾
What was Mesos?
Mesos was a general-purpose cluster manager (used by companies like Twitter and Airbnb) that could run Spark alongside other frameworks. Spark's support for Mesos was removed in Spark 3.2+ as the industry moved to Kubernetes.
| Cluster Manager | Status | Common Use |
|---|---|---|
| Standalone | Niche | Learning, small dedicated clusters |
| YARN | Widely Used | On-prem Hadoop, EMR, Cloudera, Synapse |
| Kubernetes | Growing Fast | EKS/GKE/AKS, cloud-native platforms |
| Mesos | Removed (3.2+) | Historical only — don't use for new projects |
⚠️ Interview Tip
If asked "which cluster managers does Spark support", mention Standalone, YARN, and Kubernetes as current options. Mesos is good to know historically but is no longer supported.2.4
Spark Execution Architecture
The most important topic in this module — how your PySpark code actually runs across a cluster, from Driver to Executors, and the Application → Job → Stage → Task hierarchy.
The Big Picture: Driver, Cluster Manager, Executors
CORE ARCHITECTURE
▾
The Three Main Players
Every Spark application has 3 key components working together: the Driver (the brain), the Cluster Manager (the resource allocator — YARN/K8s/Standalone from section 2.3), and the Executors (the workers that do the actual data processing).
YOUR CODE
(driver.py)
(driver.py)
→
DRIVER
(SparkSession + DAG Scheduler)
(SparkSession + DAG Scheduler)
→
requests resources from
CLUSTER MANAGER
(YARN / K8s / Standalone)
(YARN / K8s / Standalone)
→
launches
Executor 1
(Tasks + Cache)
(Tasks + Cache)
Executor 2
(Tasks + Cache)
(Tasks + Cache)
Executor N
(Tasks + Cache)
(Tasks + Cache)
🧠 Analogy
Think of building a house. You (Driver) have the blueprint and decide what needs to happen and in what order. The contractor (Cluster Manager) assigns workers and equipment. The construction workers (Executors) actually lay bricks, pour concrete — the real work happens here, in parallel, on different parts of the house.Driver
Runs your main() function, creates the SparkSession, builds the DAG, schedules tasks. Only ONE per application.
Cluster Manager
Allocates CPU/RAM resources for the Driver and Executors (YARN, Kubernetes, Standalone).
Executors
JVM processes on worker nodes that run tasks and store cached data. Many per application.
SparkSession
ENTRY POINT
▾
What is SparkSession?
SparkSession is the single unified entry point to all of Spark's functionality (Spark SQL, DataFrames, Streaming, configurations). Before Spark 2.0, you needed separate SparkContext, SQLContext, HiveContext — SparkSession combines them all into one object. It lives inside the Driver.
🧠 Analogy
SparkSession is like the "ignition key + dashboard" of a car. One key turns everything on, and the dashboard (session object) gives you access to every system — radio, AC, navigation (DataFrame API, SQL, config, streaming) — instead of having separate keys for each.python — creating a sparksession
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("OrderProcessing") \
.master("local[4]") \
.config("spark.executor.memory", "4g") \
.getOrCreate()
# Now you can use spark for everything:
df = spark.read.csv("data.csv") # DataFrame API
spark.sql("SELECT * FROM my_view") # Spark SQL
spark.readStream.format("kafka") # Streaming
spark.conf.get("spark.sql.shuffle.partitions") # Config
DAG Generation
PLANNING
▾
What is a DAG?
A DAG (Directed Acyclic Graph) is a map of all the operations (transformations) your code needs to perform, and how they depend on each other. "Directed" = each step points to the next. "Acyclic" = no loops, it always moves forward. Spark builds this DAG lazily — nothing actually runs until an action (like
.show(), .collect(), .write()) is called.🧠 Analogy
Writing transformations (filter, select, groupBy) is like writing a recipe on paper — nothing is cooked yet. Calling an action (.show(), .write()) is like saying "okay, start cooking now" — only then does Spark look at the full recipe (DAG) and figure out the most efficient way to execute it.python — lazy evaluation and dag
df = spark.read.csv("orders.csv", header=True)
# These are TRANSFORMATIONS — lazy, just added to the DAG, nothing runs yet
filtered = df.filter(df.amount > 100)
grouped = filtered.groupBy("region").sum("amount")
sorted_ = grouped.orderBy("sum(amount)", ascending=False)
# This is an ACTION — NOW Spark builds the DAG and executes it
sorted_.show()
🔑 Key Insight
Lazy evaluation lets Spark see the WHOLE picture before running anything — so it can optimize (combine filters, reorder operations, skip unnecessary work) before a single task is executed. This is the foundation of the Catalyst Optimizer (Module 16).
Cluster & Worker Nodes
INFRASTRUCTURE
▾
What's a Cluster and Worker Node?
A cluster is a group of machines working together. Each physical/virtual machine in the cluster is a worker node. Each worker node can run one or more Executors (JVM processes), and each Executor can run multiple tasks in parallel using multiple cores.
CLUSTER (e.g., 5 worker nodes, 16 cores + 64GB RAM each)
Worker Node 1 → Executor (4 cores, 16GB)
Worker Node 2 → Executor (4 cores, 16GB)
Worker Node 3 → 2x Executors (4 cores, 16GB each)
🧠 Analogy
The cluster is the whole factory building. Each worker node is one floor of the factory. Each Executor is a workstation set up on that floor, and each core is a pair of hands at that workstation that can work on a task simultaneously.
Memory Management (Overview)
OVERVIEW
▾
Driver Memory vs Executor Memory
Memory is split between the Driver (where your code runs and small results get collected) and the Executors (where actual data processing, caching, and shuffling happen). This is a high-level intro — Module 17 covers Spark Memory Internals in full depth (Unified Memory Manager, spills, GC).
| Memory Type | Used For | If too small... |
|---|---|---|
| Driver Memory | Holding the plan, broadcast variables, .collect() results | Driver OOM if you collect too much data |
| Executor Memory | Task execution, caching, shuffle buffers | Spills to disk or Executor OOM |
⚠️ Common Mistake
Calling .collect() on a huge DataFrame pulls ALL data into the Driver's memory — a frequent cause of Driver OOM errors. Use .show(), .take(n), or write to storage instead.
Micro Topics: Application → Job → Stage → Task
MUST KNOW HIERARCHY
▾
The Hierarchy: Application → Job → Stage → Task
This is one of the most commonly asked interview topics. When you submit a PySpark script, it forms ONE Application. Each action (like
.show() or .write()) triggers a Job. Each Job is split into Stages — separated by shuffle boundaries (wide dependencies). Each Stage is split into Tasks — one task per data partition.APPLICATION (your whole spark-submit run)
JOB 1 (triggered by .show())
JOB 2 (triggered by .write())
Stage 1
(before shuffle)
(before shuffle)
→ shuffle →
Stage 2
(after shuffle)
(after shuffle)
Task 1
Task 2
Task 3
... one task per partition
🧠 Analogy
Think of a restaurant kitchen during dinner service. Application = the entire night's service. Job = one full order ticket (e.g., Table 5's order). Stage = a phase of cooking that must finish before the next starts (e.g., "prep all ingredients" must finish before "plate the dish" — there's a hand-off point, like a shuffle). Task = one chef chopping one specific ingredient — many chefs (tasks) work in the same phase (stage) simultaneously.Application
One SparkSession / spark-submit run. Has ONE Driver and a pool of Executors for its entire lifetime.
Job
Created for each ACTION (show, count, collect, write, save). One application can have many jobs.
Stage
A set of tasks that can run WITHOUT a shuffle. A new stage starts whenever data needs to be shuffled (e.g., groupBy, join, repartition).
Task
The smallest unit of work — runs on ONE partition of data on ONE executor core.
python — seeing jobs, stages, tasks in action
df = spark.read.csv("orders.csv", header=True, inferSchema=True) # 8 partitions
# filter() and select() = narrow transformations, no shuffle → STAGE 1 (8 tasks)
filtered = df.filter(df.amount > 100).select("region", "amount")
# groupBy() = wide transformation, causes a SHUFFLE → new STAGE
result = filtered.groupBy("region").sum("amount")
# .show() = ACTION → creates JOB 1, which has 2 stages
# Job 1
# ├── Stage 1: read + filter + select → 8 tasks (one per input partition)
# └── Stage 2: groupBy + sum (after shuffle) → 200 tasks (default shuffle partitions)
result.show()
# .write() = ACTION → creates JOB 2 (separate job!)
result.write.parquet("output/")
🔑 Key Insight
You can SEE this entire hierarchy live in the Spark UI (Jobs tab → Stages tab → Tasks tab), covered in Module 26. Understanding this hierarchy is essential for debugging slow jobs — e.g., "Stage 2 has 200 tasks but only 4 cores → many tasks queue up and wait."Micro Topics: Driver Memory, Executor Memory, Executor Cores
These three settings are the most common
spark-submit configuration parameters and directly control how your Application's resources are shaped.| Config | Flag | What it Controls | Typical Value |
|---|---|---|---|
| Driver Memory | --driver-memory | RAM for the Driver process (plans, broadcast vars, collected results) | 2g – 8g |
| Executor Memory | --executor-memory | RAM per Executor for tasks, caching, shuffle | 4g – 16g |
| Executor Cores | --executor-cores | Number of concurrent tasks each Executor can run | 2 – 5 (5 is a common sweet spot) |
bash — spark-submit with resource config
$ spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 4g \
--executor-memory 8g \
--executor-cores 4 \
--num-executors 10 \
my_etl_job.py
# Total cluster resources used:
# Executors: 10 x (8GB + 4 cores) = 80GB RAM, 40 cores
# Plus driver: 4GB RAM, 1 core
💡 Example
If --executor-cores 4 and a stage has 200 tasks per executor's share, that executor processes 4 tasks at a time — the remaining tasks queue up and run as earlier ones finish. More cores = more parallelism, but too many cores per executor can cause memory contention between tasks sharing that executor's RAM.2.5
Spark Connect (Spark 4.x Era)
A modern, decoupled way of connecting to Spark — separating "where your code runs" from "where Spark actually executes".
Client-Server Architecture
NEW MODEL
▾
What is Spark Connect?
In "classic" Spark, your Driver process must be physically close to (or part of) the cluster. Spark Connect (stable since Spark 3.4, central to the 4.x era) splits Spark into a thin client (your laptop/IDE, just builds DataFrame query plans) and a server (the actual Spark cluster that executes them) — connected over gRPC.
CLIENT
(Your laptop, VS Code, Jupyter)
builds unresolved logical plan
(Your laptop, VS Code, Jupyter)
builds unresolved logical plan
⇄ gRPC ⇄
SPARK CONNECT SERVER
(Real Driver + Executors run here)
(Real Driver + Executors run here)
🧠 Analogy
Classic Spark is like having to physically be inside a factory to operate its machines. Spark Connect is like operating the factory's machines remotely via a phone app — you (client) send commands, the factory (server) does the heavy lifting, and you just see results. You can close your laptop and the factory keeps running.Remote Spark Sessions
With Spark Connect, creating a session looks almost identical — but you connect to a remote endpoint instead of starting a local/cluster driver directly.
python — connecting via spark connect
from pyspark.sql import SparkSession
# Connect to a remote Spark Connect server (cluster)
# instead of starting a local driver
spark = SparkSession.builder \
.remote("sc://my-spark-cluster.company.com:15002") \
.getOrCreate()
# Your code looks EXACTLY the same as classic Spark!
df = spark.read.table("sales")
df.groupBy("region").sum("amount").show()
# All execution happens on the remote server, not your laptop
Databricks Connect & IDE Integration
REAL-WORLD USAGE
▾
Databricks Connect (v2, Built on Spark Connect)
Databricks Connect is the most common real-world use of this architecture. It lets you write and run PySpark code in your local IDE (VS Code, PyCharm) while the actual computation executes on a remote Databricks cluster. This means full IDE features (debugging, autocomplete, linting, git) while using real cluster-scale compute and data.
python — databricks connect example
# Running locally in VS Code, but executing on a Databricks cluster
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.profile("my-profile").getOrCreate()
# df.show() runs on the remote Databricks cluster,
# but you can set breakpoints and debug locally!
df = spark.read.table("main.sales.orders")
df.show()
🔑 Key Insight
Before Spark Connect, "remote development" against Databricks meant writing code in notebooks in the browser. Now you can develop in a proper IDE locally — a major quality-of-life improvement for engineering teams.
Spark Connect Protocol
UNDER THE HOOD
▾
How the Protocol Works
When you call DataFrame methods (
.filter(), .groupBy()), the client doesn't execute anything — it builds an unresolved logical plan as a protobuf message and sends it to the server via gRPC. The server (real Spark Driver) resolves it, runs it through Catalyst (Module 16), executes it on Executors, and streams results back as Arrow batches.df.filter(...).groupBy(...)
→
Build unresolved plan (protobuf)
→
send over gRPC
Server: Analyze → Catalyst Optimize → Execute
→
stream results
Client receives Arrow record batches → .show() prints
Spark Connect vs Classic Mode
COMPARISON
▾
Side-by-Side Comparison
| Aspect | Classic Mode | Spark Connect |
|---|---|---|
| Driver Location | Same JVM as your app, on the cluster | Decoupled — runs on the server; client is a thin shell |
| Client Language Requirements | Needs JVM locally (PySpark wraps JVM) | Pure Python/Go/Rust client possible — no local JVM needed |
| Crash Isolation | A bad client request can crash the whole Driver JVM | Client crash doesn't affect the server session |
| Multiple Sessions | One Driver = one app, harder to share | Many clients can share one server cluster |
| Use Case Fit | Traditional spark-submit batch jobs | Interactive dev, notebooks, IDE debugging, multi-language clients |
Use Cases and Limitations
Spark Connect is great for interactive development and tooling — but has some gaps compared to classic mode.
Great For
IDE development, notebooks, multi-user clusters, language-agnostic clients (e.g., Go, Rust apps calling Spark).
Limitations
Some RDD-level APIs and certain third-party extensions may not yet be fully supported over Spark Connect.
Production Batch Jobs
Classic spark-submit is still the standard for scheduled production ETL jobs (Module 24, Airflow).
⚠️ Interview Tip
Spark Connect is a newer topic — interviewers asking about it usually want to know you're aware of the client-server split and that it powers tools like Databricks Connect. You don't need to memorize protocol internals, just the "why" and "what problem it solves".🧠 Quick Check: A PySpark job calls
df.filter(...), then df.groupBy("region").sum("amount"), then .show(). How many Jobs and how many Stages (minimum) does this create?