MODULE 3 PySpark Setup

3.1

Installation & Local Mode

Getting PySpark running on your machine — installation methods, prerequisites, and running Spark in local mode for development and learning.

📦

Installation GETTING STARTED

What Do You Need to Install PySpark?

PySpark is the Python API for Spark, which itself runs on the JVM (Java Virtual Machine). So to run PySpark, you need: (1) Java (JDK 8, 11, or 17 for Spark 3.x), because Spark's engine runs on the JVM, (2) Python 3.8+, and (3) the pyspark package itself. These versions apply to Spark 3.x; Spark 4.x requires Java 17 or later and Python 3.9+.

🧠 Analogy

PySpark is like a Python steering wheel attached to a car (Spark) whose engine runs on Java fuel (JVM). You need the engine (Java) installed even though you're only "driving" with Python.
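The three prerequisites can be checked from Python before anything Spark-related runs. The following is a minimal sketch using only the standard library; the helper names (`parse_java_major`, `installed_java_major`, `check_prerequisites`) are hypothetical and not part of PySpark, and the Python floor of 3.8 follows the Spark 3.x requirement stated above.

```python
import importlib.util
import re
import shutil
import subprocess
import sys


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output.

    Handles the legacy scheme ("1.8.0_392" means Java 8) and the modern
    scheme ("11.0.21", "17.0.9", "21"). Returns None if nothing matches.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))  # "1.8" -> 8
    return major


def installed_java_major():
    """Return the major version of the `java` found on PATH, or None."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)


def check_prerequisites():
    """Report on Java, Python, and the pyspark package."""
    return {
        "python_ok": sys.version_info >= (3, 8),
        "java_major": installed_java_major(),
        "pyspark_installed": importlib.util.find_spec("pyspark") is not None,
    }


for name, value in check_prerequisites().items():
    print(f"{name}: {value}")
```

A `java_major` of `None` means no usable `java` was found on PATH; any other value should be compared against the Java versions your Spark release supports.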

bash — installing pyspark

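# 1. Check / install Java (JDK 11 or 17 for Spark 3.x)
$ java -version

# 2. Install PySpark via pip (includes Spark binaries)
#    Pin the version if you need Spark 3.x, e.g. pip install "pyspark==3.5.0"
$ pip install pyspark

# 3. Verify installation
$ pyspark --version

# 4. (Optional) Install commonly paired packages
#    Quote the extras so shells like zsh don't expand the brackets
$ pip install "pyspark[sql]" pandas numpy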

🔑 Key Insight

pip install pyspark bundles a complete Spark distribution — you don't need to separately download Spark from apache.org for local development. That manual download route is mainly used for production cluster setups.

Common Installation Issues

A few environment variables often need to be set correctly, especially on Windows, so Spark can find Java and Python.

Environment Variable	Purpose	Example Value
JAVA_HOME	Points Spark to your Java installation	`/usr/lib/jvm/java-11-openjdk`
SPARK_HOME	Location of the Spark installation (usually unnecessary with pip installs, where PySpark locates its bundled Spark itself)	`/usr/local/lib/python3.x/site-packages/pyspark`
PYSPARK_PYTHON	Which Python executable workers should use	`/usr/bin/python3`
HADOOP_HOME	Needed on Windows for `winutils.exe`	`C:\hadoop`
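These variables can also be set from Python, as long as that happens before the first SparkSession is created. A minimal sketch, assuming you want the workers to use the same interpreter as the driver; `spark_env` is a hypothetical helper, and any JAVA_HOME path passed to it is your own assumption about where Java lives.

```python
import os
import sys


def spark_env(java_home=None):
    """Build environment variables for a local PySpark run.

    PYSPARK_PYTHON defaults to the interpreter running this script, so
    workers use the same Python as the driver. JAVA_HOME is only included
    when the caller passes a path.
    """
    env = {"PYSPARK_PYTHON": sys.executable or "python3"}
    if java_home is not None:
        env["JAVA_HOME"] = java_home
    return env


# Apply the values without overwriting anything already set in the shell.
for name, value in spark_env().items():
    os.environ.setdefault(name, value)

print(os.environ["PYSPARK_PYTHON"])
```

Using `setdefault` keeps values exported in the shell intact, so the script only fills in what is missing.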

⚠️ Common Mistake

"Java gateway process exited" errors usually mean that Java is missing, JAVA_HOME is unset, or JAVA_HOME points to a Java version your Spark release does not support. Spark 3.x targets Java 8, 11, and (from Spark 3.3) 17; newer JDKs such as Java 21 may fail on some Spark 3.x versions.

💻

Local Mode FOR LEARNING & DEV

What is Local Mode?

Local mode runs the entire Spark application — Driver AND "Executors" — as threads within a single JVM process on your laptop. There's no real cluster; Spark simulates one using your machine's CPU cores. This is the default mode when you don't specify a cluster manager.

🧠 Analogy

Local mode is like a chef cooking an entire restaurant's menu alone in their home kitchen — using multiple burners (CPU cores) at once to simulate having a team, but it's really just one person (one machine) doing everything.

python — local mode options

from pyspark.sql import SparkSession

# Pick ONE master URL per session: getOrCreate() reuses an existing session,
# so a later .master(...) call would not change a session that is already running.

# local       → 1 worker thread
# spark = SparkSession.builder.master("local").getOrCreate()

# local[4]    → 4 worker threads
# spark = SparkSession.builder.master("local[4]").getOrCreate()

# local[*]    → one thread per available core (most common for dev)
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Quick test
df = spark.range(1000000)
print(df.count())  # 1000000

Master URL	Meaning	When to Use
`local`	1 thread, no parallelism	Simple debugging
`local[4]`	4 threads (4 "executor cores")	Testing parallelism behavior with a fixed count
`local[*]`	One thread per CPU core available	Default for development — most common
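The table reads as a simple rule: the value in brackets is a thread count. The sketch below models that rule in plain Python. It is a simplified illustration, not Spark's own parser (which accepts further forms), and `local_thread_count` is a hypothetical name. For `local[*]` it uses `os.cpu_count()`, which usually matches what the JVM reports but can differ, for example inside containers.

```python
import os
import re


def local_thread_count(master):
    """Return how many worker threads a local master URL asks for.

    Covers the three forms in the table: "local", "local[N]", "local[*]".
    Returns None for anything else (for example "yarn").
    """
    if master == "local":
        return 1
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        return None
    if match.group(1) == "*":
        return os.cpu_count() or 1  # logical cores visible to Python
    return int(match.group(1))


for url in ("local", "local[4]", "local[*]", "yarn"):
    print(url, "->", local_thread_count(url))
```

In a live session, `spark.sparkContext.defaultParallelism` shows the value Spark actually settled on.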

🔑 Key Insight

Code written and tested in local mode runs on a real cluster with the same transformation logic; only the master setting changes, from .master("local[*]") to .master("yarn") or to spark-submit --master yarn. This is why local mode is perfect for learning and iterating quickly before deploying to production.

3.2

Standalone Cluster, Docker & Databricks

Beyond local mode — how to set up PySpark in a small standalone cluster, run it inside Docker containers, and the zero-setup experience on Databricks.

🏗️

Standalone Cluster Setup MULTI-MACHINE

Setting Up a Real (Small) Cluster

Recall from Module 2 (2.3) that Spark's Standalone cluster manager is built-in. To actually set one up, you download the Spark binary distribution on each machine, start a master process on one machine, and start worker processes on the others, pointing them at the master.

Machine A: start-master.sh → Master UI on :8080

Machine B: start-worker.sh spark://A:7077

Machine C: start-worker.sh spark://A:7077

bash — standalone cluster setup

# Download Spark on every machine (same version on all!)
# Older releases such as 3.5.0 are served from the Apache archive
$ wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
$ tar -xvf spark-3.5.0-bin-hadoop3.tgz
$ cd spark-3.5.0-bin-hadoop3

# On the master machine
$ ./sbin/start-master.sh
# Visit http://<master-ip>:8080 to see the cluster UI

# On each worker machine, register with the master
$ ./sbin/start-worker.sh spark://<master-ip>:7077

# Submit a PySpark job from any machine that has Spark installed and can reach the master
$ ./bin/spark-submit --master spark://<master-ip>:7077 my_job.py

🔑 Key Insight

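All machines need the same Spark version and the same Python minor version (for example, 3.10 everywhere): the Driver ships Python code and dependencies that the workers must run, and PySpark raises an error when the driver and worker Python minor versions differ.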
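To see which hosts actually ran tasks and which Python each one used, a small probe job can help. This is a sketch under stated assumptions: `probe_workers` and `hosts_with_different_python` are hypothetical helpers, `spark` is an already-created SparkSession connected to the cluster, and the host names in the demo are made-up sample data, not real output.

```python
import platform
import socket


def probe_workers(spark, probes=16):
    """Return {hostname: python_version} for hosts that ran a probe task.

    Launches `probes` tiny tasks; with few tasks, some workers may not be
    sampled, so treat the result as a spot check.
    """
    rdd = spark.sparkContext.parallelize(range(probes), probes)
    pairs = (
        rdd.map(lambda _: (socket.gethostname(), platform.python_version()))
        .distinct()
        .collect()
    )
    return dict(pairs)


def hosts_with_different_python(driver_version, host_versions):
    """List hosts whose full Python version string differs from the driver's."""
    return sorted(
        host for host, version in host_versions.items() if version != driver_version
    )


# Demo with sample data (hypothetical hosts) so the helper runs without a cluster.
sample = {"worker-b": "3.10.12", "worker-c": "3.10.9"}
print(hosts_with_different_python("3.10.12", sample))
```

On a real cluster the call would be `hosts_with_different_python(platform.python_version(), probe_workers(spark))`. A mismatch in the minor version stops the job before the probe returns, so this mainly surfaces patch-level drift and confirms which workers joined.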

🐳

Docker CONTAINERIZED

Running PySpark in Docker

Running PySpark inside Docker gives you a clean, reproducible environment — no fighting with Java/Python version conflicts on your host machine. You either use a pre-built image (like Jupyter's PySpark notebook image) or build your own.

🧠 Analogy

Installing PySpark directly on your laptop is like cooking in someone else's kitchen — you have to deal with whatever pots, stoves, and ingredients are already there. Docker is like bringing your own fully-equipped portable kitchen — it works exactly the same everywhere you set it up.

bash — pyspark via docker

# Quick start: Jupyter + PySpark pre-built image
$ docker run -it -p 8888:8888 jupyter/pyspark-notebook

# Or run spark-submit inside a container with your code mounted
$ docker run -it --rm \
    -v "$(pwd)":/app \
    apache/spark:3.5.0 \
    /opt/spark/bin/spark-submit /app/my_job.py

dockerfile — custom pyspark image

FROM apache/spark:3.5.0

USER root
RUN pip install pandas numpy

COPY my_job.py /app/my_job.py
WORKDIR /app

ENTRYPOINT ["/opt/spark/bin/spark-submit", "my_job.py"]

🔑 Key Insight

Docker isn't just for local dev. It's also how Spark runs on Kubernetes (Module 28): every Driver and Executor runs as a container inside its own pod. Learning Docker basics here pays off later.

🧱

Databricks MANAGED PLATFORM

PySpark with Zero Setup

Databricks is a managed platform built around Spark. You don't install anything — you open a notebook in a browser, attach it to a cluster (which Databricks provisions for you on AWS/Azure/GCP), and a SparkSession called spark is already created and ready to use. We cover Databricks fully in Module 27.

python — pyspark in a databricks notebook

# No SparkSession.builder needed — 'spark' already exists!
df = spark.read.table("samples.nyctaxi.trips")
df.groupBy("pickup_zip").count().show()

# Databricks-specific helper for files, secrets, widgets
dbutils.fs.ls("/databricks-datasets/")

Setup Method	Effort	Best For
`pip install pyspark` (Local mode)	Minimal	Learning, small data, unit testing
Standalone Cluster	Medium — manual setup	Small dedicated on-prem clusters
Docker	Medium — image management	Reproducible dev environments, CI/CD
Databricks	Zero — fully managed	Production, teams, notebooks, enterprise scale

3.3

SparkSession & SparkSession.builder

A practical, line-by-line breakdown of the code you'll write at the top of every PySpark script — the builder pattern and its key configuration methods.

🏗️

The Builder Pattern FOUNDATION

SparkSession.builder — Putting It All Together

SparkSession.builder uses the builder design pattern — you chain method calls (.appName(), .master(), .config()) to configure the session step by step, then call .getOrCreate() at the end to actually create (or reuse) it. Each chained method returns the builder itself, so you can keep chaining.
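The chaining mechanism itself is easy to see in a toy class. The sketch below is an illustration of the builder pattern only, not Spark's implementation: `ToyBuilder` and its `build()` method are hypothetical, with `build()` standing in for `.getOrCreate()`. The option keys `spark.app.name` and `spark.master` are the real properties that `.appName()` and `.master()` set.

```python
class ToyBuilder:
    """A toy builder that collects options, to show how chaining works."""

    def __init__(self):
        self._options = {}

    def appName(self, name):
        self._options["spark.app.name"] = name
        return self  # returning self is what makes chaining possible

    def master(self, url):
        self._options["spark.master"] = url
        return self

    def config(self, key, value):
        self._options[key] = value
        return self

    def build(self):
        """Stand-in for getOrCreate(): hand back the collected options."""
        return dict(self._options)


options = (
    ToyBuilder()
    .appName("DailySalesETL")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "50")
    .build()
)
print(options)
```

Nothing is created until the final call, which is why the order of the intermediate calls does not matter.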

🧠 Analogy

Think of ordering a custom coffee: "I'll have a latte (.appName), with oat milk (.master), extra hot, two sugars (.config), and... that's my order, make it now (.getOrCreate())." Each step adds a detail to the order before it's finally placed.

python — the full builder chain

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
        .appName("DailySalesETL")              # name shown in Spark UI
        .master("local[*]")                    # where to run
        .config("spark.sql.shuffle.partitions", "50")  # tuning
        .config("spark.executor.memory", "4g")         # applies on a cluster; local mode has no separate executors
        .getOrCreate()                              # create or reuse session
)

print(spark.version)
print(spark.sparkContext.appName)   # "DailySalesETL"

🏷️

Micro Topic: appName IDENTIFICATION

.appName() — Naming Your Application

.appName("...") sets a human-readable name for your Spark Application. This name appears in the Spark UI, Spark History Server, and cluster manager dashboards (YARN ResourceManager UI, Databricks Jobs UI). It does NOT affect execution — it's purely for identification and monitoring.

python — appName in practice

spark = SparkSession.builder.appName("CustomerChurnETL").getOrCreate()

# In the Spark UI (http://localhost:4040), the title bar shows:
# "CustomerChurnETL - Spark Jobs"

# In YARN ResourceManager UI, the app list shows:
# Application Name: CustomerChurnETL   |  User: data_eng  |  Status: RUNNING

💡 Example

If 50 Spark jobs are running on a shared cluster and yours fails, a descriptive appName like "daily_orders_bronze_to_silver" instead of the default "PySparkShell" lets you (and your team) instantly find the right job in monitoring tools (Module 26).

🗺️

Micro Topic: master WHERE IT RUNS

.master() — Telling Spark Where to Run

.master("...") tells Spark which cluster manager to connect to (tying directly back to Module 2.3). This determines whether your app runs locally on your machine or on a real distributed cluster.

master() Value	Meaning
`local[*]`	Local mode, use all CPU cores (dev/learning)
`spark://host:7077`	Standalone cluster master
`yarn`	YARN cluster manager (Hadoop/EMR)
`k8s://https://host:port`	Kubernetes cluster

python — master() vs spark-submit --master

# Option A: hardcode in code (good for local dev)
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Option B: leave master() OUT of the code, set it at submit time
spark = SparkSession.builder.appName("MyJob").getOrCreate()
# then: spark-submit --master yarn my_job.py
#   or: spark-submit --master local[4] my_job.py

⚠️ Best Practice

For production code, avoid hardcoding .master("local[*]"). Settings made in code take precedence over spark-submit flags, so if you hardcode it and then run spark-submit --master yarn, the hardcoded value wins and the job can end up running in local mode or failing in confusing ways. Best practice: omit .master() in code and set it via spark-submit --master ... or cluster config, so the same script works in dev (local) and prod (YARN/K8s) without code changes.
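The precedence rule behind this advice can be written down in a few lines. This is a plain-Python model for illustration, not Spark code: `effective_master` is a hypothetical function that follows the order Spark documents for configuration sources (values set in code first, then spark-submit flags, then spark-defaults.conf), with `local[*]` as the fallback when nothing is set.

```python
def effective_master(in_code=None, submit_flag=None, defaults_conf=None):
    """Pick the master URL by precedence.

    1. in_code       -> .master(...) or SparkConf set in the script
    2. submit_flag   -> spark-submit --master ...
    3. defaults_conf -> spark.master in spark-defaults.conf
    Falls back to "local[*]" when none of them is set.
    """
    for candidate in (in_code, submit_flag, defaults_conf):
        if candidate is not None:
            return candidate
    return "local[*]"


# Hardcoded value shadows the submit-time flag.
print(effective_master(in_code="local[*]", submit_flag="yarn"))
# With .master() omitted, the submit-time flag takes effect.
print(effective_master(submit_flag="yarn"))
```

The second call is the production-friendly case: the script carries no master at all, so whoever submits the job decides where it runs.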

⚙️

Micro Topic: config TUNING & SETTINGS

.config() — Setting Spark Configurations

.config(key, value) sets any of Spark's hundreds of configuration properties — memory settings, shuffle behavior, Delta/Iceberg integration, AQE, and more. You can chain multiple .config() calls, and many of these (Module 16) directly control performance.

python — common config() examples

spark = (
    SparkSession.builder
        .appName("ConfigDemo")
        # Performance tuning
        .config("spark.sql.shuffle.partitions", "200")
        .config("spark.sql.adaptive.enabled", "true")
        # Memory
        .config("spark.executor.memory", "4g")
        .config("spark.driver.memory", "2g")
        # Enable extra libraries (e.g., Delta Lake)
        .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.0.0")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
)

# Read a config value back at runtime
print(spark.conf.get("spark.sql.shuffle.partitions"))  # "200"

# Change a config AFTER session creation (not all configs allow this)
spark.conf.set("spark.sql.shuffle.partitions", "100")

🔑 Key Insight

Some configs (like spark.executor.memory) must be set before the session/JVM starts and can't be changed afterward via spark.conf.set(). Others (like spark.sql.shuffle.partitions) are runtime-mutable. If a config isn't taking effect, check whether it's a "static" (startup-only) or "dynamic" property.
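One way to tell the two kinds apart on a live session is `spark.conf.isModifiable(key)`, available in PySpark 2.4 and later. The helper below is a minimal sketch built on that call; `apply_runtime_configs` is a hypothetical name, and `spark` is assumed to be an existing SparkSession.

```python
def apply_runtime_configs(spark, settings):
    """Set only the configs that can change on a running session.

    Returns (applied, skipped): two lists of keys. Skipped keys are
    startup-only and need a new session (or JVM) to take effect.
    """
    applied, skipped = [], []
    for key, value in settings.items():
        if spark.conf.isModifiable(key):
            spark.conf.set(key, value)
            applied.append(key)
        else:
            skipped.append(key)
    return applied, skipped
```

Calling it with `{"spark.sql.shuffle.partitions": "100", "spark.executor.memory": "8g"}` should apply the first key and report the second as skipped, which makes the static versus dynamic split visible instead of silent.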

🔁

Micro Topic: getOrCreate SINGLETON PATTERN

.getOrCreate() — Reuse, Don't Duplicate

.getOrCreate() is the final call that actually builds the session. Spark allows only one active SparkContext per JVM. If a SparkSession already exists (e.g., in a notebook where cells re-run), getOrCreate() returns the existing one instead of creating a brand new (and conflicting) one.

🧠 Analogy

It's like calling a hotel to "get or create" your reservation. If you already have a booking under your name, the front desk hands you the existing room key. If not, they create a new reservation. Either way, you walk away with exactly ONE valid key — never two conflicting bookings.

python — getOrCreate behavior

# First call — creates a NEW session
spark1 = SparkSession.builder.appName("App1").getOrCreate()

# Second call in the same program/notebook — REUSES spark1
# even though appName looks different, the existing session wins
spark2 = SparkSession.builder.appName("App2").getOrCreate()

print(spark1 is spark2)               # True — same object!
print(spark2.sparkContext.appName)  # "App1" — name from FIRST creation sticks

# To truly start fresh, stop the old session first
spark1.stop()
spark3 = SparkSession.builder.appName("App3").getOrCreate()  # now creates new

⚠️ Common Mistake

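In notebooks, re-running a cell with new .config() values often appears to have "no effect", because getOrCreate() returned the existing session. Startup-only settings such as spark.executor.memory keep their original values; only runtime-modifiable settings are applied to the running session. Call spark.stop() first, or restart the kernel, to apply new startup-level configs.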
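A quick way to confirm this situation is to compare the values you asked for with the values the running SparkContext was started with. The sketch below assumes `spark` is an existing SparkSession; `stale_settings` is a hypothetical helper, and it compares values as strings because that is how Spark stores them.

```python
def stale_settings(spark, wanted):
    """Return {key: (wanted, actual)} for settings the running session lacks.

    Reads the SparkContext's configuration, which holds the values the
    application was started with. `actual` is None when the key was never set.
    """
    conf = spark.sparkContext.getConf()
    stale = {}
    for key, value in wanted.items():
        actual = conf.get(key, None)
        if actual != value:
            stale[key] = (value, actual)
    return stale
```

An empty result means the session already has what you asked for. Anything else lists the settings that need `spark.stop()` (or a kernel restart) followed by a fresh `getOrCreate()`.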

🧠 Quick Check: You run SparkSession.builder.appName("A").config("spark.executor.memory","8g").getOrCreate(), then later in the SAME notebook session run SparkSession.builder.appName("B").config("spark.executor.memory","16g").getOrCreate(). What executor memory does the active session actually have?

16g — the latest config always wins

8g — the original session is reused, new configs are ignored

12g — Spark averages the two values