3.1
Installation & Local Mode
Getting PySpark running on your machine — installation methods, prerequisites, and running Spark in local mode for development and learning.
Installation
GETTING STARTED
What Do You Need to Install PySpark?
PySpark is the Python API for Spark, which itself runs on the JVM (Java Virtual Machine). To run PySpark you need three things: (1) Java, because Spark's engine runs on the JVM (JDK 8, 11, or 17 for Spark 3.x; Java 17 requires Spark 3.3 or later), (2) Python 3.8 or newer, and (3) the pyspark package itself.
🧠 Analogy
PySpark is like a Python steering wheel attached to a car (Spark) whose engine runs on Java fuel (JVM). You need the engine (Java) installed even though you're only "driving" with Python.
bash — installing pyspark
# 1. Check / install Java (JDK 11 recommended)
$ java -version
# 2. Install PySpark via pip (includes Spark binaries)
$ pip install pyspark
# 3. Verify installation
$ pyspark --version
# 4. (Optional) Install commonly paired packages
# (quote the extras so shells like zsh don't treat [sql] as a glob pattern)
$ pip install "pyspark[sql]" pandas numpy
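The same three prerequisites can also be checked from Python. The following is a minimal sketch using only the standard library; the function name `check_prerequisites` is made up for illustration, and the check only confirms that a `java` executable is on PATH, not that its version is one Spark supports.

```python
import importlib.util
import shutil
import sys


def check_prerequisites(min_python=(3, 8)):
    """Report which PySpark prerequisites look satisfied on this machine."""
    return {
        # The running Python interpreter is new enough
        "python_ok": sys.version_info[:2] >= min_python,
        # A `java` executable is discoverable on PATH (its version is NOT checked)
        "java_on_path": shutil.which("java") is not None,
        # The pyspark package is importable in this environment
        "pyspark_installed": importlib.util.find_spec("pyspark") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'yes' if ok else 'NO'}")
```

If any line prints NO, that is the piece to install before going further.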
🔑 Key Insight
pip install pyspark bundles a complete Spark distribution, so you don't need to separately download Spark from apache.org for local development. That manual download route is mainly used for production cluster setups.
Common Installation Issues
A few environment variables often need to be set correctly, especially on Windows, so Spark can find Java and Python.
| Environment Variable | Purpose | Example Value |
|---|---|---|
| JAVA_HOME | Points Spark to your Java installation | /usr/lib/jvm/java-11-openjdk |
| SPARK_HOME | Location of the Spark installation (a pip install usually locates itself, so no explicit value is needed) | /usr/local/lib/python3.x/site-packages/pyspark |
| PYSPARK_PYTHON | Which Python executable workers should use | /usr/bin/python3 |
| HADOOP_HOME | Needed on Windows for winutils.exe | C:\hadoop |
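When something in this table is misconfigured, it helps to see all four values at once. Here is a small standard-library sketch, assuming nothing beyond the variable names in the table; the helper name `report_spark_env` is hypothetical.

```python
import os
import shutil
from pathlib import Path

# Variable names taken from the table above
SPARK_ENV_VARS = ("JAVA_HOME", "SPARK_HOME", "PYSPARK_PYTHON", "HADOOP_HOME")


def report_spark_env(environ=None):
    """Describe each Spark-related variable: unset, valid, or dangling."""
    environ = os.environ if environ is None else environ
    report = {}
    for name in SPARK_ENV_VARS:
        value = environ.get(name)
        if value is None:
            report[name] = "not set"
        # Accept a real path, or a bare command name such as "python3"
        elif Path(value).exists() or shutil.which(value) is not None:
            report[name] = f"set, found: {value}"
        else:
            report[name] = f"set, but NOT found: {value}"
    return report


if __name__ == "__main__":
    for name, status in report_spark_env().items():
        print(f"{name:15} {status}")
```

A variable reported as "set, but NOT found" is a common cause of startup failures, since Spark will try to use the dangling path.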
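⚠️ Common Mistake
"Java gateway process exited" errors usually mean Java is missing, JAVA_HOME isn't set, or it points to a Java version your Spark release doesn't support. Spark 3.x targets Java 8, 11, or 17 (17 from Spark 3.3 onward). Newer JDKs such as 21 are supported only by some Spark releases, so check the documentation for your version.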
Local Mode
FOR LEARNING & DEV
What is Local Mode?
Local mode runs the entire Spark application — Driver AND "Executors" — as threads within a single JVM process on your laptop. There's no real cluster; Spark simulates one using your machine's CPU cores. This is the default mode when you don't specify a cluster manager.
🧠 Analogy
Local mode is like a chef cooking an entire restaurant's menu alone in their home kitchen, using multiple burners (CPU cores) at once to simulate having a team, but it's really just one person (one machine) doing everything.
python — local mode options
from pyspark.sql import SparkSession
# Pick ONE master value per application: once a session exists, later
# getOrCreate() calls return it and ignore a different .master().
# local → use 1 core only
# spark = SparkSession.builder.master("local").getOrCreate()
# local[4] → use exactly 4 cores
# spark = SparkSession.builder.master("local[4]").getOrCreate()
# local[*] → use ALL available cores on your machine (most common for dev)
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Quick test
df = spark.range(1000000)
print(df.count()) # 1000000
| Master URL | Meaning | When to Use |
|---|---|---|
| local | 1 thread, no parallelism | Simple debugging |
| local[4] | 4 threads (4 "executor cores") | Testing parallelism behavior with a fixed count |
| local[*] | One thread per CPU core available | Default for development, most common |
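To make the three master URLs in the table concrete, here is a small sketch that maps each one to a thread count. It is an illustration, not Spark's own parser: the function name is hypothetical, it handles only the three forms shown in the table (Spark accepts others, such as `local[N,F]`), and it uses `os.cpu_count()` as a stand-in for the core count the JVM would report.

```python
import os
import re


def local_thread_count(master):
    """Return how many worker threads a local master URL asks for."""
    if master == "local":
        return 1
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a local master URL handled by this sketch: {master!r}")
    requested = match.group(1)
    # "*" means one thread per available core; otherwise use the fixed number
    return (os.cpu_count() or 1) if requested == "*" else int(requested)


for url in ("local", "local[4]", "local[*]"):
    print(url, "->", local_thread_count(url))
```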
🔑 Key Insight
Code written and tested in local mode runs unchanged on a real cluster: just change .master("local[*]") to .master("yarn") or submit with --master yarn. This is why local mode is perfect for learning and iterating quickly before deploying to production.
3.2
Standalone Cluster, Docker & Databricks
Beyond local mode — how to set up PySpark in a small standalone cluster, run it inside Docker containers, and the zero-setup experience on Databricks.
Standalone Cluster Setup
MULTI-MACHINE
Setting Up a Real (Small) Cluster
Recall from Module 2 (2.3) that Spark's Standalone cluster manager is built-in. To actually set one up, you download the Spark binary distribution on each machine, start a master process on one machine, and start worker processes on the others, pointing them at the master.
Machine A: start-master.sh → Master UI on :8080
Machine B: start-worker.sh spark://A:7077
Machine C: start-worker.sh spark://A:7077
bash — standalone cluster setup
# Download Spark on every machine (same version on all!)
# (older releases such as 3.5.0 are served from archive.apache.org)
$ wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
$ tar -xvf spark-3.5.0-bin-hadoop3.tgz
$ cd spark-3.5.0-bin-hadoop3
# On the master machine
$ ./sbin/start-master.sh
# Visit http://<master-ip>:8080 to see the cluster UI
# On each worker machine, register with the master
$ ./sbin/start-worker.sh spark://<master-ip>:7077
# Submit a PySpark job from any machine that can reach the master
$ spark-submit --master spark://<master-ip>:7077 my_job.py
🔑 Key Insight
All machines need the same Spark version and ideally the same Python version, since the Driver ships Python code/dependencies that workers must be able to run.
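Submitting a job only works from a machine that can reach the master, so a quick connectivity check can save some confusion. The sketch below uses only the standard library and simply tests whether a TCP connection to the master port (7077 in the commands above) succeeds. The function name is hypothetical, and a successful connection shows that something is listening on the port, not that it is a healthy Spark master.

```python
import socket


def can_reach(host, port=7077, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures
        return False
```

Calling `can_reach("<master-ip>")` from a worker or from your laptop before running start-worker.sh or spark-submit helps separate network or firewall problems from Spark problems.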
Docker
CONTAINERIZED
Running PySpark in Docker
Running PySpark inside Docker gives you a clean, reproducible environment — no fighting with Java/Python version conflicts on your host machine. You either use a pre-built image (like Jupyter's PySpark notebook image) or build your own.
🧠 Analogy
Installing PySpark directly on your laptop is like cooking in someone else's kitchen: you have to deal with whatever pots, stoves, and ingredients are already there. Docker is like bringing your own fully-equipped portable kitchen that works exactly the same everywhere you set it up.
bash — pyspark via docker
# Quick start: Jupyter + PySpark pre-built image
$ docker run -it -p 8888:8888 jupyter/pyspark-notebook
# Or run spark-submit inside a container with your code mounted
$ docker run -it --rm \
    -v "$(pwd)":/app \
    apache/spark:3.5.0 \
    /opt/spark/bin/spark-submit /app/my_job.py
dockerfile — custom pyspark image
FROM apache/spark:3.5.0
USER root
RUN pip install pandas numpy
COPY my_job.py /app/my_job.py
WORKDIR /app
ENTRYPOINT ["/opt/spark/bin/spark-submit", "my_job.py"]
🔑 Key Insight
Docker isn't just for local dev. Container images are also how Spark runs on Kubernetes (Module 28): every Driver and Executor runs as a container inside its own pod. Learning Docker basics here pays off later.
Databricks
MANAGED PLATFORM
PySpark with Zero Setup
Databricks is a managed platform built around Spark. You don't install anything: you open a notebook in a browser, attach it to a cluster (which Databricks provisions for you on AWS/Azure/GCP), and a SparkSession called spark is already created and ready to use. We cover Databricks fully in Module 27.
python — pyspark in a databricks notebook
# No SparkSession.builder needed — 'spark' already exists!
df = spark.read.table("samples.nyctaxi.trips")
df.groupBy("pickup_zip").count().show()
# Databricks-specific helper for files, secrets, widgets
dbutils.fs.ls("/databricks-datasets/")
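Because `dbutils` and the pre-created `spark` exist only on Databricks, code meant to run in both places sometimes needs to know where it is. One common heuristic, sketched here as an assumption rather than an official API, is to look for the `DATABRICKS_RUNTIME_VERSION` environment variable that Databricks clusters set. The helper name is hypothetical.

```python
import os


def running_on_databricks(environ=None):
    """Heuristic: Databricks clusters set DATABRICKS_RUNTIME_VERSION."""
    environ = os.environ if environ is None else environ
    return "DATABRICKS_RUNTIME_VERSION" in environ


if running_on_databricks():
    print("Databricks: use the existing `spark` and `dbutils`")
else:
    print("Not Databricks: build a SparkSession yourself and avoid `dbutils`")
```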
| Setup Method | Effort | Best For |
|---|---|---|
| pip install pyspark (Local mode) | Minimal | Learning, small data, unit testing |
| Standalone Cluster | Medium — manual setup | Small dedicated on-prem clusters |
| Docker | Medium — image management | Reproducible dev environments, CI/CD |
| Databricks | Zero — fully managed | Production, teams, notebooks, enterprise scale |
3.3
SparkSession & SparkSession.builder
A practical, line-by-line breakdown of the code you'll write at the top of every PySpark script — the builder pattern and its key configuration methods.
The Builder Pattern
FOUNDATION
SparkSession.builder — Putting It All Together
SparkSession.builder uses the builder design pattern: you chain method calls (.appName(), .master(), .config()) to configure the session step by step, then call .getOrCreate() at the end to actually create (or reuse) it. Each chained method returns the builder itself, so you can keep chaining.
🧠 Analogy
Think of ordering a custom coffee: "I'll have a latte (.appName), with oat milk (.master), extra hot, two sugars (.config), and... that's my order, make it now (.getOrCreate())." Each step adds a detail to the order before it's finally placed.
python — the full builder chain
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .appName("DailySalesETL")                      # name shown in Spark UI
    .master("local[*]")                            # where to run
    .config("spark.sql.shuffle.partitions", "50")  # tuning
    .config("spark.executor.memory", "4g")         # matters on a cluster; local mode runs everything in the driver JVM
    .getOrCreate()                                 # create or reuse session
)
print(spark.version)
print(spark.sparkContext.appName) # "DailySalesETL"
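The mechanics of chaining are easier to see in a stripped-down model. The class below is a toy stand-in written for this explanation; it is not PySpark code, and `ToyBuilder` and `build()` are made-up names. The only point is that each method records an option and returns `self`, which is what lets the next call attach to it. The real builder's appName() and master() likewise store their value under the `spark.app.name` and `spark.master` keys.

```python
class ToyBuilder:
    """Hypothetical stand-in for SparkSession.builder; not part of PySpark."""

    def __init__(self):
        self._options = {}

    def appName(self, name):
        self._options["spark.app.name"] = name
        return self  # returning self is what makes chaining possible

    def master(self, url):
        self._options["spark.master"] = url
        return self

    def config(self, key, value):
        self._options[key] = value
        return self

    def build(self):
        # A real builder would start a session here; this sketch only
        # returns the options collected along the chain.
        return dict(self._options)


options = (
    ToyBuilder()
    .appName("DailySalesETL")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "50")
    .build()
)
print(options)
```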
Micro Topic: appName
IDENTIFICATION
.appName() — Naming Your Application
.appName("...") sets a human-readable name for your Spark Application. This name appears in the Spark UI, Spark History Server, and cluster manager dashboards (YARN ResourceManager UI, Databricks Jobs UI). It does NOT affect execution; it's purely for identification and monitoring.
python — appName in practice
spark = SparkSession.builder.appName("CustomerChurnETL").getOrCreate()
# In the Spark UI (http://localhost:4040), the title bar shows:
# "CustomerChurnETL - Spark Jobs"
# In YARN ResourceManager UI, the app list shows:
# Application Name: CustomerChurnETL | User: data_eng | Status: RUNNING
💡 Example
If 50 Spark jobs are running on a shared cluster and yours fails, a descriptive appName like "daily_orders_bronze_to_silver" instead of the default "PySparkShell" lets you (and your team) instantly find the right job in monitoring tools (Module 26).
Micro Topic: master
WHERE IT RUNS
.master() — Telling Spark Where to Run
.master("...") tells Spark which cluster manager to connect to (tying directly back to Module 2.3). This determines whether your app runs locally on your machine or on a real distributed cluster.
| master() Value | Meaning |
|---|---|
| local[*] | Local mode, use all CPU cores (dev/learning) |
| spark://host:7077 | Standalone cluster master |
| yarn | YARN cluster manager (Hadoop/EMR) |
| k8s://https://host:port | Kubernetes cluster |
python — master() vs spark-submit --master
# Option A: hardcode in code (good for local dev)
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Option B: leave master() OUT of the code, set it at submit time
spark = SparkSession.builder.appName("MyJob").getOrCreate()
# then: spark-submit --master yarn my_job.py
# or: spark-submit --master local[4] my_job.py
⚠️ Best Practice
For production code, avoid hardcoding .master("local[*]"). Properties set in code take precedence over spark-submit flags, so if you hardcode it and then run spark-submit --master yarn, the hardcoded local[*] wins and the job can end up running on a single machine or failing in confusing ways. Best practice: omit .master() in code and set it via spark-submit --master ... or cluster config, so the same script works in dev (local) and prod (YARN/K8s) without code changes.
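Spark's documented precedence is: values set in code first, then spark-submit flags, then spark-defaults.conf. The sketch below models that order in plain Python to show why a hardcoded master overrides the submit-time one. The function and its parameter names are hypothetical, and it models only the lookup order, not Spark itself.

```python
def resolve_master(in_code=None, submit_flag=None, defaults_file=None):
    """Return (value, source) for the master, using Spark's precedence order."""
    candidates = (
        ("code", in_code),                      # .master(...) or SparkConf in the script
        ("spark-submit --master", submit_flag),  # command-line flag
        ("spark-defaults.conf", defaults_file),  # cluster-wide defaults file
    )
    for source, value in candidates:
        if value is not None:
            return value, source
    return None, None


# Hardcoded local[*] beats the submit-time flag:
print(resolve_master(in_code="local[*]", submit_flag="yarn"))
# With .master() omitted, the flag is what takes effect:
print(resolve_master(submit_flag="yarn"))
```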
Micro Topic: config
TUNING & SETTINGS
.config() — Setting Spark Configurations
.config(key, value) sets any of Spark's hundreds of configuration properties: memory settings, shuffle behavior, Delta/Iceberg integration, AQE, and more. You can chain multiple .config() calls, and many of these properties directly control performance (covered in Module 16).
python — common config() examples
spark = (
    SparkSession.builder
    .appName("ConfigDemo")
    # Performance tuning
    .config("spark.sql.shuffle.partitions", "200")
    .config("spark.sql.adaptive.enabled", "true")
    # Memory
    .config("spark.executor.memory", "4g")
    .config("spark.driver.memory", "2g")
    # Enable extra libraries (e.g., Delta Lake)
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    # Delta's setup also registers its catalog implementation
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
# Read a config value back at runtime
print(spark.conf.get("spark.sql.shuffle.partitions")) # "200"
# Change a config AFTER session creation (not all configs allow this)
spark.conf.set("spark.sql.shuffle.partitions", "100")
🔑 Key Insight
Some configs (like spark.executor.memory) must be set before the session/JVM starts and can't be changed afterward via spark.conf.set(). Others (like spark.sql.shuffle.partitions) are runtime-mutable. If a config isn't taking effect, check whether it's a "static" (startup-only) or "dynamic" property.
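One way to check which kind of property you are dealing with is to ask the running session. PySpark's `spark.conf` object has an `isModifiable(key)` method for this. The helper below is a sketch built on that call; the name `split_by_mutability` is hypothetical, it expects an existing SparkSession to be passed in, and what it reports can vary between Spark versions.

```python
def split_by_mutability(spark, keys):
    """Group config keys by whether the running session reports them as modifiable."""
    modifiable, fixed = [], []
    for key in keys:
        # RuntimeConfig.isModifiable returns True only for properties that
        # can still be changed with spark.conf.set() in this session
        if spark.conf.isModifiable(key):
            modifiable.append(key)
        else:
            fixed.append(key)
    return modifiable, fixed


# Usage with a live session (needs PySpark and Java installed):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
# print(split_by_mutability(
#     spark, ["spark.sql.shuffle.partitions", "spark.executor.memory"]
# ))
```

Keys that land in the second list need to go into the builder chain or the spark-submit command instead of spark.conf.set().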
Micro Topic: getOrCreate
SINGLETON PATTERN
.getOrCreate() — Reuse, Don't Duplicate
.getOrCreate() is the final call that actually builds the session. Spark allows only one active SparkContext per JVM. If a SparkSession already exists (e.g., in a notebook where cells re-run), getOrCreate() returns the existing one instead of creating a brand new (and conflicting) one.
🧠 Analogy
It's like calling a hotel to "get or create" your reservation. If you already have a booking under your name, the front desk hands you the existing room key. If not, they create a new reservation. Either way, you walk away with exactly ONE valid key, never two conflicting bookings.
python — getOrCreate behavior
# First call — creates a NEW session
spark1 = SparkSession.builder.appName("App1").getOrCreate()
# Second call in the same program/notebook — REUSES spark1
# even though appName looks different, the existing session wins
spark2 = SparkSession.builder.appName("App2").getOrCreate()
print(spark1 is spark2) # True — same object!
print(spark2.sparkContext.appName) # "App1" — name from FIRST creation sticks
# To truly start fresh, stop the old session first
spark1.stop()
spark3 = SparkSession.builder.appName("App3").getOrCreate() # now creates new
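The behavior above follows from a simple get-or-create rule: keep one active instance, return it if it exists, and build a new one only when there is none. The toy class below models just that rule in plain Python. `ToySession` and its methods are invented for illustration and leave out everything else a real SparkSession does.

```python
class ToySession:
    """Hypothetical stand-in for SparkSession; not part of PySpark."""

    _active = None  # the single active instance, if any

    def __init__(self, options):
        self.options = dict(options)

    @classmethod
    def get_or_create(cls, **options):
        # Only build a new instance when none is active; otherwise the
        # options passed to this call are ignored.
        if cls._active is None:
            cls._active = cls(options)
        return cls._active

    def stop(self):
        # Clearing the active instance lets the next call create a fresh one
        type(self)._active = None


first = ToySession.get_or_create(app_name="App1")
second = ToySession.get_or_create(app_name="App2")
print(first is second)             # True
print(second.options["app_name"])  # App1

first.stop()
third = ToySession.get_or_create(app_name="App3")
print(third.options["app_name"])   # App3
```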
⚠️ Common Mistake
In notebooks, re-running a cell with new .config() values often appears to have "no effect" because getOrCreate() silently returned the OLD session with OLD configs. Call spark.stop() first, or restart the kernel, to apply new session-level configs.
🧠 Quick Check: You run SparkSession.builder.appName("A").config("spark.executor.memory","8g").getOrCreate(), then later in the SAME notebook session run SparkSession.builder.appName("B").config("spark.executor.memory","16g").getOrCreate(). What executor memory does the active session actually have?