MODULE 2 Spark Architecture

1 / 5

What is Apache Spark?

The story of Spark, what problem it solves, and where it's used in the real world — the foundation before we go into architecture.

📜

History of Apache Spark BACKGROUND ▾

From MapReduce to Spark

Spark was created in 2009 at UC Berkeley's AMPLab, became an Apache project in 2010, and was open-sourced in 2014 as a top-level Apache project. It was built to fix the biggest pain point of Hadoop MapReduce: every step writes to disk. Spark keeps data in memory (RAM) between steps, making it 10–100x faster for iterative workloads like ML and interactive queries.

🧠 Analogy

MapReduce is like a chef who, after every step (chopping, boiling, frying), puts the food away in the fridge and takes it out again for the next step. Spark is a chef who keeps everything on the counter (RAM) and moves straight from one step to the next — no fridge trips unless the counter gets full.

Aspect	Hadoop MapReduce	Apache Spark
Processing Model	Disk-based, step-by-step	In-memory (RAM) with disk fallback
Speed	Slow for iterative jobs	10–100x faster
API	Java-heavy, verbose	Python, Scala, Java, SQL, R
Workloads	Batch only	Batch + Streaming + ML + Graph + SQL
Ease of Use	Low-level map/reduce code	High-level DataFrame/SQL APIs

🔑 Key Insight

Spark didn't replace HDFS or YARN — it replaced the processing engine. You can still store data on HDFS/S3 and use YARN/Kubernetes to manage resources; Spark is the compute layer that runs on top.

🧩

Components (Quick Overview) OVERVIEW ▾

A Unified Engine, Multiple Modules

Spark is one engine (Spark Core) with several libraries built on top, all sharing the same execution engine and the same cluster. We'll cover each in detail in 2.2 — for now, just know they exist and share resources.

⚙️

Spark Core

RDDs, task scheduling, memory management — the engine everything else runs on.

📊

Spark SQL

DataFrames, SQL queries, Catalyst optimizer.

🌊

Spark Streaming

Structured Streaming for real-time data.

🤖

MLlib

Machine learning algorithms at scale.

🕸️

GraphX

Graph processing (social networks, etc.)

🏢

Real-World Use Cases WHY IT MATTERS ▾

Where Spark Shows Up On The Job

As a Data Engineer, you'll use Spark to move and transform data at scale. Here are the most common real-world scenarios:

🏗️

ETL / ELT Pipelines

Read from databases/files/Kafka, clean & transform, write to Delta/Snowflake/warehouse.

📈

Analytics & BI

Aggregate billions of rows for dashboards (sales, fraud, ops metrics).

🔄

CDC Pipelines

Capture database changes (insert/update/delete) and replicate to a lakehouse.

⚡

Real-Time Streaming

Process clickstreams, IoT sensor data, fraud detection in near real-time.

🧠

Feature Engineering for ML

Prepare large training datasets for machine learning models.

🏛️

Lakehouse Platforms

Build Bronze/Silver/Gold layers on Delta, Iceberg, or Hudi.

💡 Example

A food delivery company uses Spark to: read order events from Kafka in real-time → join with restaurant and rider data → compute delivery ETAs → write results to Delta Lake → power a live ops dashboard. All of this happens continuously, 24/7, at massive scale.

2.2

Spark Components

A deep look at each library that makes up the Spark ecosystem, what it's used for, and how it fits with PySpark.

⚙️

Spark Core THE FOUNDATION ▾

What is Spark Core?

Spark Core is the base engine that provides distributed task scheduling, memory management, fault recovery, and the original RDD (Resilient Distributed Dataset) API. Every other component (SQL, Streaming, MLlib, GraphX) is built ON TOP of Spark Core and uses its scheduler and memory manager under the hood.

🧠 Analogy

Spark Core is like the engine and chassis of a car. Spark SQL, Streaming, MLlib, and GraphX are different "bodies" (sedan, truck, sports car) built on the same chassis — they all rely on the same engine to actually move.

python — spark core / rdd basics

from pyspark import SparkContext

# SparkContext is the entry point to Spark Core (RDD API)
sc = SparkContext(appName="CoreExample")

# Create an RDD — a low-level distributed collection
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Transformation (lazy) + Action (triggers execution)
squared = rdd.map(lambda x: x * x)
print(squared.collect())  # [1, 4, 9, 16, 25]

🔑 Key Insight

In modern PySpark, you rarely touch RDDs directly — you use DataFrames (Spark SQL). But DataFrames are compiled DOWN to RDD operations internally, so understanding Spark Core helps you understand WHY things behave the way they do (Module 4 covers RDDs in depth).

📊

Spark SQL MOST USED ▾

What is Spark SQL?

Spark SQL is the module for working with structured data using DataFrames and SQL queries. It includes the Catalyst Optimizer, which converts your DataFrame code or SQL into an optimized execution plan. This is what 90% of data engineers use daily.

🧠 Analogy

Spark SQL is like a smart GPS. You tell it the destination ("give me total sales by region"), and it figures out the best route (execution plan) — whether that means taking the highway (broadcast join) or back roads (shuffle join) — without you manually planning every turn.

python — spark sql / dataframe

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQLExample").getOrCreate()

df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# DataFrame API
df.groupBy("region").sum("amount").show()

# Or pure SQL — same engine, same optimizer
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) FROM sales GROUP BY region").show()

🔑 Key Insight

DataFrame API and SQL produce identical execution plans — Catalyst optimizes both the same way. Choose whichever is more readable for the task; performance is the same.

🌊

Spark Streaming REAL-TIME ▾

What is Spark Streaming?

Spark Streaming (now Structured Streaming) processes continuous, never-ending data — like a Kafka topic emitting events 24/7. Spark breaks this infinite stream into small "micro-batches" and applies the same DataFrame API you'd use for batch data. We cover this fully in Modules 18A–18J.

🧠 Analogy

Batch processing is reading an entire book at once. Streaming is reading the book one page at a time AS IT'S BEING WRITTEN — you process each new page (micro-batch) as soon as it arrives, without waiting for "The End".

python — structured streaming preview

# Read a continuous stream from Kafka
stream_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host:9092") \
    .option("subscribe", "orders") \
    .load()

# Same DataFrame API as batch!
query = stream_df.groupBy("region").count() \
    .writeStream \
    .format("console") \
    .outputMode("complete") \
    .start()

🤖

MLlib MACHINE LEARNING ▾

What is MLlib?

MLlib is Spark's distributed machine learning library — algorithms for classification, regression, clustering, recommendation, plus feature engineering tools (scaling, encoding) and pipelines. It scales ML to datasets too big to fit on one machine.

🧠 Analogy

Scikit-learn trains a model using data that fits in one computer's RAM. MLlib trains the same TYPE of model, but the data and computation are spread across hundreds of machines — like training on a 10TB dataset that no single laptop could hold.

python — mllib quick example

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Combine feature columns into a single vector column
assembler = VectorAssembler(inputCols=["age", "salary"], outputCol="features")
data = assembler.transform(df)

# Train a distributed linear regression model
lr = LinearRegression(featuresCol="features", labelCol="spend")
model = lr.fit(data)

🔑 Key Insight

As a Data Engineer, you typically won't build models — but you'll often do the feature engineering (joining, aggregating, cleaning data) that feeds into MLlib or external ML platforms.

🕸️

GraphX GRAPH PROCESSING ▾

What is GraphX?

GraphX is Spark's API for graphs — data made of vertices (nodes) and edges (connections), like social networks, recommendation graphs, or fraud rings. It supports algorithms like PageRank and connected components. Note: GraphX is Scala/Java only — it does not have a native PySpark API (GraphFrames is a community package that adds graph support to PySpark).

🧠 Analogy

Think of GraphX as the engine behind "people you may know" on a social network — it analyzes how millions of users (vertices) connect to each other (edges) to find patterns, clusters, and influential nodes.

👤

Vertices

The "things" in the graph — users, products, accounts.

🔗

Edges

Relationships — "follows", "purchased", "transferred money to".

🎯

Common Use Cases

Fraud ring detection, recommendation engines, network analysis.

⚠️ Note

For PySpark users, GraphX is rarely used directly. It's listed here for completeness since it's part of the Spark ecosystem — but in practice, graph workloads on PySpark use the GraphFrames package or dedicated graph databases (Neo4j, Neptune).

2.3

Cluster Managers

Spark needs a "manager" to give it machines (CPU + RAM) to run on. Here are the four cluster managers Spark supports.

🏠

Standalone Cluster Manager SIMPLEST ▾

What is Standalone Mode?

Spark's built-in cluster manager — no extra software needed. You start a master process and worker processes manually, and Spark manages resources itself. Great for learning, small clusters, or dedicated Spark-only environments.

🧠 Analogy

Standalone mode is like running your own small restaurant where you (the owner) personally assign tables to waiters. No external booking system — just you managing everything directly.

bash — starting a standalone cluster

# On the master machine
$ start-master.sh
# Master UI available at http://master-host:8080

# On each worker machine, connect to the master
$ start-worker.sh spark://master-host:7077

# Submit a job to the standalone cluster
$ spark-submit --master spark://master-host:7077 my_job.py

🔑 Key Insight

Standalone mode is rarely used in production at large companies — it's mostly for learning, testing, or small on-prem deployments. Production usually uses YARN, Kubernetes, or managed platforms (Databricks, EMR).

🐘

YARN (Yet Another Resource Negotiator) HADOOP ECOSYSTEM ▾

What is YARN?

YARN is Hadoop's resource manager. It was built for MapReduce but Spark integrates with it fully. YARN has a ResourceManager (decides which apps get resources) and NodeManagers (run on each machine, manage containers). YARN is the most common cluster manager in traditional on-prem Hadoop clusters and EMR.

🧠 Analogy

YARN is like an office building manager. When your team (Spark app) needs meeting rooms (containers with CPU/RAM), you ask the building manager (ResourceManager), who checks availability across all floors (NodeManagers) and assigns you rooms.

bash — submitting to yarn

# Cluster mode — driver runs INSIDE the cluster (on a NodeManager)
$ spark-submit --master yarn --deploy-mode cluster my_job.py

# Client mode — driver runs on the machine you submit from
$ spark-submit --master yarn --deploy-mode client my_job.py

Mode	Driver Location	Best For
cluster	Inside the cluster (a container)	Production jobs — survives if your terminal closes
client	On the machine running spark-submit	Interactive/debugging — see logs locally in real time

☸️

Kubernetes CLOUD-NATIVE / MODERN ▾

What is Spark on Kubernetes?

Spark can run natively on Kubernetes (K8s) — the driver and each executor run as their own pods. This is the modern, cloud-native way to run Spark, used heavily on EKS, GKE, AKS. We cover this in full depth in Module 28.

🧠 Analogy

Kubernetes is like a smart shipping company. You say "I need 1 big container (driver pod) and 10 medium containers (executor pods)", and Kubernetes finds space on its ships (nodes), schedules them, restarts any that fail, and scales up/down as needed.

bash — submitting to kubernetes

$ spark-submit \
    --master k8s://https://<k8s-api-server>:443 \
    --deploy-mode cluster \
    --conf spark.kubernetes.container.image=my-spark-image:latest \
    --conf spark.executor.instances=5 \
    local:///opt/spark/jobs/my_job.py

🔑 Key Insight

Kubernetes is increasingly preferred over YARN for new platforms because it gives multi-tenancy, container isolation, and works the same across AWS/Azure/GCP — no vendor lock-in to a specific Hadoop distribution.

🦏

Apache Mesos LEGACY / DEPRECATED ▾

What was Mesos?

Mesos was a general-purpose cluster manager (used by companies like Twitter and Airbnb) that could run Spark alongside other frameworks. Spark's support for Mesos was removed in Spark 3.2+ as the industry moved to Kubernetes.

Cluster Manager	Status	Common Use
Standalone	Niche	Learning, small dedicated clusters
YARN	Widely Used	On-prem Hadoop, EMR, Cloudera, Synapse
Kubernetes	Growing Fast	EKS/GKE/AKS, cloud-native platforms
Mesos	Removed (3.2+)	Historical only — don't use for new projects

⚠️ Interview Tip

If asked "which cluster managers does Spark support", mention Standalone, YARN, and Kubernetes as current options. Mesos is good to know historically but is no longer supported.

2.4

Spark Execution Architecture

The most important topic in this module — how your PySpark code actually runs across a cluster, from Driver to Executors, and the Application → Job → Stage → Task hierarchy.

🧠

The Big Picture: Driver, Cluster Manager, Executors CORE ARCHITECTURE ▾

The Three Main Players

Every Spark application has 3 key components working together: the Driver (the brain), the Cluster Manager (the resource allocator — YARN/K8s/Standalone from section 2.3), and the Executors (the workers that do the actual data processing).

YOUR CODE
(driver.py)

→

DRIVER
(SparkSession + DAG Scheduler)

→

requests resources from

CLUSTER MANAGER
(YARN / K8s / Standalone)

→

launches

Executor 1
(Tasks + Cache)

Executor 2
(Tasks + Cache)

Executor N
(Tasks + Cache)

🧠 Analogy

Think of building a house. You (Driver) have the blueprint and decide what needs to happen and in what order. The contractor (Cluster Manager) assigns workers and equipment. The construction workers (Executors) actually lay bricks, pour concrete — the real work happens here, in parallel, on different parts of the house.

🧠

Driver

Runs your main() function, creates the SparkSession, builds the DAG, schedules tasks. Only ONE per application.

🗂️

Cluster Manager

Allocates CPU/RAM resources for the Driver and Executors (YARN, Kubernetes, Standalone).

⚙️

Executors

JVM processes on worker nodes that run tasks and store cached data. Many per application.

🔑

SparkSession ENTRY POINT ▾

What is SparkSession?

SparkSession is the single unified entry point to all of Spark's functionality (Spark SQL, DataFrames, Streaming, configurations). Before Spark 2.0, you needed separate SparkContext, SQLContext, HiveContext — SparkSession combines them all into one object. It lives inside the Driver.

🧠 Analogy

SparkSession is like the "ignition key + dashboard" of a car. One key turns everything on, and the dashboard (session object) gives you access to every system — radio, AC, navigation (DataFrame API, SQL, config, streaming) — instead of having separate keys for each.

python — creating a sparksession

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("OrderProcessing") \
    .master("local[4]") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()

# Now you can use spark for everything:
df = spark.read.csv("data.csv")         # DataFrame API
spark.sql("SELECT * FROM my_view")       # Spark SQL
spark.readStream.format("kafka")         # Streaming
spark.conf.get("spark.sql.shuffle.partitions")  # Config

🗺️

DAG Generation PLANNING ▾

What is a DAG?

A DAG (Directed Acyclic Graph) is a map of all the operations (transformations) your code needs to perform, and how they depend on each other. "Directed" = each step points to the next. "Acyclic" = no loops, it always moves forward. Spark builds this DAG lazily — nothing actually runs until an action (like .show(), .collect(), .write()) is called.

🧠 Analogy

Writing transformations (filter, select, groupBy) is like writing a recipe on paper — nothing is cooked yet. Calling an action (.show(), .write()) is like saying "okay, start cooking now" — only then does Spark look at the full recipe (DAG) and figure out the most efficient way to execute it.

python — lazy evaluation and dag

df = spark.read.csv("orders.csv", header=True)

# These are TRANSFORMATIONS — lazy, just added to the DAG, nothing runs yet
filtered = df.filter(df.amount > 100)
grouped  = filtered.groupBy("region").sum("amount")
sorted_  = grouped.orderBy("sum(amount)", ascending=False)

# This is an ACTION — NOW Spark builds the DAG and executes it
sorted_.show()

🔑 Key Insight

Lazy evaluation lets Spark see the WHOLE picture before running anything — so it can optimize (combine filters, reorder operations, skip unnecessary work) before a single task is executed. This is the foundation of the Catalyst Optimizer (Module 16).

🖥️

Cluster & Worker Nodes INFRASTRUCTURE ▾

What's a Cluster and Worker Node?

A cluster is a group of machines working together. Each physical/virtual machine in the cluster is a worker node. Each worker node can run one or more Executors (JVM processes), and each Executor can run multiple tasks in parallel using multiple cores.

CLUSTER (e.g., 5 worker nodes, 16 cores + 64GB RAM each)

Worker Node 1 → Executor (4 cores, 16GB)

Worker Node 2 → Executor (4 cores, 16GB)

Worker Node 3 → 2x Executors (4 cores, 16GB each)

🧠 Analogy

The cluster is the whole factory building. Each worker node is one floor of the factory. Each Executor is a workstation set up on that floor, and each core is a pair of hands at that workstation that can work on a task simultaneously.

💾

Memory Management (Overview) OVERVIEW ▾

Driver Memory vs Executor Memory

Memory is split between the Driver (where your code runs and small results get collected) and the Executors (where actual data processing, caching, and shuffling happen). This is a high-level intro — Module 17 covers Spark Memory Internals in full depth (Unified Memory Manager, spills, GC).

Memory Type	Used For	If too small...
Driver Memory	Holding the plan, broadcast variables, `.collect()` results	Driver OOM if you collect too much data
Executor Memory	Task execution, caching, shuffle buffers	Spills to disk or Executor OOM

⚠️ Common Mistake

Calling .collect() on a huge DataFrame pulls ALL data into the Driver's memory — a frequent cause of Driver OOM errors. Use .show(), .take(n), or write to storage instead.

🪜

Micro Topics: Application → Job → Stage → Task MUST KNOW HIERARCHY ▾

The Hierarchy: Application → Job → Stage → Task

This is one of the most commonly asked interview topics. When you submit a PySpark script, it forms ONE Application. Each action (like .show() or .write()) triggers a Job. Each Job is split into Stages — separated by shuffle boundaries (wide dependencies). Each Stage is split into Tasks — one task per data partition.

APPLICATION (your whole spark-submit run)

JOB 1 (triggered by .show())

JOB 2 (triggered by .write())

Stage 1
(before shuffle)

→ shuffle →

Stage 2
(after shuffle)

Task 1

Task 2

Task 3

... one task per partition

🧠 Analogy

Think of a restaurant kitchen during dinner service. Application = the entire night's service. Job = one full order ticket (e.g., Table 5's order). Stage = a phase of cooking that must finish before the next starts (e.g., "prep all ingredients" must finish before "plate the dish" — there's a hand-off point, like a shuffle). Task = one chef chopping one specific ingredient — many chefs (tasks) work in the same phase (stage) simultaneously.

📦

Application

One SparkSession / spark-submit run. Has ONE Driver and a pool of Executors for its entire lifetime.

🎯

Job

Created for each ACTION (show, count, collect, write, save). One application can have many jobs.

🔀

Stage

A set of tasks that can run WITHOUT a shuffle. A new stage starts whenever data needs to be shuffled (e.g., groupBy, join, repartition).

🧱

Task

The smallest unit of work — runs on ONE partition of data on ONE executor core.

python — seeing jobs, stages, tasks in action

df = spark.read.csv("orders.csv", header=True, inferSchema=True)  # 8 partitions

# filter() and select() = narrow transformations, no shuffle → STAGE 1 (8 tasks)
filtered = df.filter(df.amount > 100).select("region", "amount")

# groupBy() = wide transformation, causes a SHUFFLE → new STAGE
result = filtered.groupBy("region").sum("amount")

# .show() = ACTION → creates JOB 1, which has 2 stages
#   Job 1
#     ├── Stage 1: read + filter + select  → 8 tasks (one per input partition)
#     └── Stage 2: groupBy + sum (after shuffle) → 200 tasks (default shuffle partitions)
result.show()

# .write() = ACTION → creates JOB 2 (separate job!)
result.write.parquet("output/")

🔑 Key Insight

You can SEE this entire hierarchy live in the Spark UI (Jobs tab → Stages tab → Tasks tab), covered in Module 26. Understanding this hierarchy is essential for debugging slow jobs — e.g., "Stage 2 has 200 tasks but only 4 cores → many tasks queue up and wait."

Micro Topics: Driver Memory, Executor Memory, Executor Cores

These three settings are the most common spark-submit configuration parameters and directly control how your Application's resources are shaped.

Config	Flag	What it Controls	Typical Value
Driver Memory	`--driver-memory`	RAM for the Driver process (plans, broadcast vars, collected results)	2g – 8g
Executor Memory	`--executor-memory`	RAM per Executor for tasks, caching, shuffle	4g – 16g
Executor Cores	`--executor-cores`	Number of concurrent tasks each Executor can run	2 – 5 (5 is a common sweet spot)

bash — spark-submit with resource config

$ spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 8g \
    --executor-cores 4 \
    --num-executors 10 \
    my_etl_job.py

# Total cluster resources used:
# Executors: 10 x (8GB + 4 cores) = 80GB RAM, 40 cores
# Plus driver: 4GB RAM, 1 core

💡 Example

If --executor-cores 4 and a stage has 200 tasks per executor's share, that executor processes 4 tasks at a time — the remaining tasks queue up and run as earlier ones finish. More cores = more parallelism, but too many cores per executor can cause memory contention between tasks sharing that executor's RAM.

2.5

Spark Connect (Spark 4.x Era)

A modern, decoupled way of connecting to Spark — separating "where your code runs" from "where Spark actually executes".

🔌

Client-Server Architecture NEW MODEL ▾

What is Spark Connect?

In "classic" Spark, your Driver process must be physically close to (or part of) the cluster. Spark Connect (stable since Spark 3.4, central to the 4.x era) splits Spark into a thin client (your laptop/IDE, just builds DataFrame query plans) and a server (the actual Spark cluster that executes them) — connected over gRPC.

CLIENT
(Your laptop, VS Code, Jupyter)
builds unresolved logical plan

⇄ gRPC ⇄

SPARK CONNECT SERVER
(Real Driver + Executors run here)

🧠 Analogy

Classic Spark is like having to physically be inside a factory to operate its machines. Spark Connect is like operating the factory's machines remotely via a phone app — you (client) send commands, the factory (server) does the heavy lifting, and you just see results. You can close your laptop and the factory keeps running.

Remote Spark Sessions

With Spark Connect, creating a session looks almost identical — but you connect to a remote endpoint instead of starting a local/cluster driver directly.

python — connecting via spark connect

from pyspark.sql import SparkSession

# Connect to a remote Spark Connect server (cluster)
# instead of starting a local driver
spark = SparkSession.builder \
    .remote("sc://my-spark-cluster.company.com:15002") \
    .getOrCreate()

# Your code looks EXACTLY the same as classic Spark!
df = spark.read.table("sales")
df.groupBy("region").sum("amount").show()
# All execution happens on the remote server, not your laptop

🧩

Databricks Connect & IDE Integration REAL-WORLD USAGE ▾

Databricks Connect (v2, Built on Spark Connect)

Databricks Connect is the most common real-world use of this architecture. It lets you write and run PySpark code in your local IDE (VS Code, PyCharm) while the actual computation executes on a remote Databricks cluster. This means full IDE features (debugging, autocomplete, linting, git) while using real cluster-scale compute and data.

python — databricks connect example

# Running locally in VS Code, but executing on a Databricks cluster
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.profile("my-profile").getOrCreate()

# df.show() runs on the remote Databricks cluster,
# but you can set breakpoints and debug locally!
df = spark.read.table("main.sales.orders")
df.show()

🔑 Key Insight

Before Spark Connect, "remote development" against Databricks meant writing code in notebooks in the browser. Now you can develop in a proper IDE locally — a major quality-of-life improvement for engineering teams.

📡

Spark Connect Protocol UNDER THE HOOD ▾

How the Protocol Works

When you call DataFrame methods (.filter(), .groupBy()), the client doesn't execute anything — it builds an unresolved logical plan as a protobuf message and sends it to the server via gRPC. The server (real Spark Driver) resolves it, runs it through Catalyst (Module 16), executes it on Executors, and streams results back as Arrow batches.

df.filter(...).groupBy(...)

→

Build unresolved plan (protobuf)

→

send over gRPC

Server: Analyze → Catalyst Optimize → Execute

→

stream results

Client receives Arrow record batches → .show() prints

⚖️

Spark Connect vs Classic Mode COMPARISON ▾

Side-by-Side Comparison

Aspect	Classic Mode	Spark Connect
Driver Location	Same JVM as your app, on the cluster	Decoupled — runs on the server; client is a thin shell
Client Language Requirements	Needs JVM locally (PySpark wraps JVM)	Pure Python/Go/Rust client possible — no local JVM needed
Crash Isolation	A bad client request can crash the whole Driver JVM	Client crash doesn't affect the server session
Multiple Sessions	One Driver = one app, harder to share	Many clients can share one server cluster
Use Case Fit	Traditional spark-submit batch jobs	Interactive dev, notebooks, IDE debugging, multi-language clients

Use Cases and Limitations

Spark Connect is great for interactive development and tooling — but has some gaps compared to classic mode.

✅

Great For

IDE development, notebooks, multi-user clusters, language-agnostic clients (e.g., Go, Rust apps calling Spark).

⚠️

Limitations

Some RDD-level APIs and certain third-party extensions may not yet be fully supported over Spark Connect.

🏭

Production Batch Jobs

Classic spark-submit is still the standard for scheduled production ETL jobs (Module 24, Airflow).

⚠️ Interview Tip

Spark Connect is a newer topic — interviewers asking about it usually want to know you're aware of the client-server split and that it powers tools like Databricks Connect. You don't need to memorize protocol internals, just the "why" and "what problem it solves".

🧠 Quick Check: A PySpark job calls df.filter(...), then df.groupBy("region").sum("amount"), then .show(). How many Jobs and how many Stages (minimum) does this create?

1 Job, 1 Stage — everything runs together

1 Job, 2 Stages — groupBy causes a shuffle boundary

2 Jobs, 2 Stages — one job per transformation