Spark Type System — Overview
Every column in a Spark DataFrame has an explicit data type. Understanding Spark's type system is critical — it controls how data is stored, compared, serialized, and optimized. Types fall into two families: Primitive (single values) and Complex (nested structures).
Primitive types hold one value per cell. They map directly to Python or Java native types under the hood.
Complex types let one cell hold a collection or a nested object. This is where Spark really diverges from traditional SQL databases.
Types appear in three main places in your code:
# Import primitive types
from pyspark.sql.types import (
StringType, IntegerType, LongType,
DoubleType, FloatType, BooleanType,
DecimalType, DateType, TimestampType,
BinaryType, ShortType, ByteType
)
# Import complex types
from pyspark.sql.types import (
ArrayType, MapType, StructType, StructField
)
# Import NullType (rarely used but good to know)
from pyspark.sql.types import NullType
StringType
The most commonly used type. Stores any sequence of characters — names, emails, JSON strings, encoded data. Internally stored as Java String (UTF-16).
StringType stores text data of any length. There is no VARCHAR(n) or CHAR(n) like in SQL — Spark's StringType is always variable length. It stores any Unicode character, so names, emojis, and multilingual text all work fine.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder.appName("TypeDemo").getOrCreate()
# Define a schema with StringType columns
schema = StructType([
StructField("name", StringType(), nullable=True),
StructField("email", StringType(), nullable=True),
StructField("country", StringType(), nullable=False)
])
data = [
("Alice", "alice@example.com", "India"),
("Bob", None, "USA"),
("Chandra", "c@data.io", "India")
]
df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# root
# |-- name: string (nullable = true)
# |-- email: string (nullable = true)
# |-- country: string (nullable = false)
df.show()
# +-------+-----------------+-------+
# | name| email|country|
# +-------+-----------------+-------+
# | Alice|alice@example.com| India|
# | Bob| null| USA|
# |Chandra| c@data.io| India|
# +-------+-----------------+-------+
You can also use the string alias instead of importing StringType — both produce the same result:
# Using DDL string — shorter, same result
df2 = spark.createDataFrame(data, schema="name STRING, email STRING, country STRING")
df2.printSchema() # Same output as above
# Casting a column to StringType
from pyspark.sql.functions import col
df_cast = df.withColumn("name_upper", col("name").cast(StringType()))
# (col is already string, but cast is valid)
None in Python becomes null in Spark. Always distinguish between a missing value (null) and a blank value ("").
Use StringType for: names, emails, IDs, codes, free-form text, encoded JSON within a column, phone numbers (to preserve leading zeros), and any data where arithmetic doesn't make sense.
IntegerType & LongType
Whole number types. Choose based on the size of the numbers you need to store. Using the wrong type wastes memory or causes overflow.
Spark has four integer types, each storing a different range of whole numbers. They differ only in byte size and max value:
| Type | Bytes | Min Value | Max Value | Use When |
|---|---|---|---|---|
| ByteType | 1 byte | -128 | 127 | Flags, small enums |
| ShortType | 2 bytes | -32,768 | 32,767 | Age, status codes |
| IntegerType | 4 bytes | -2.1 billion | 2.1 billion | Default for most counts |
| LongType | 8 bytes | -9.2 quintillion | 9.2 quintillion | Timestamps (ms), huge IDs |
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType
schema = StructType([
StructField("product_id", IntegerType(), False),
StructField("product_name", StringType(), False),
StructField("stock_count", IntegerType(), True),
StructField("total_sold", LongType(), True)
])
data = [
(101, "Laptop", 45, 120_000_000),
(102, "Phone", 200, 5_000_000_000),
(103, "Tablet", None, 890_000)
]
df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# root
# |-- product_id: integer (nullable = false)
# |-- product_name: string (nullable = false)
# |-- stock_count: integer (nullable = true)
# |-- total_sold: long (nullable = true)
df.show()
LongType is critical when dealing with Unix timestamps in milliseconds, large transaction IDs, or row counts from massive datasets. IntegerType would overflow at ~2.1 billion, causing silent data corruption.
from pyspark.sql.functions import col
# Unix timestamp in ms is ~1.7 trillion — must use LongType!
# 1,700,000,000,000 > 2,147,483,647 (IntegerType max) → OVERFLOW
schema2 = StructType([
StructField("event_id", LongType(), False),
StructField("event_time_ms", LongType(), False), # Unix ms timestamp
StructField("user_id", LongType(), False)
])
events = [
(1, 1700000000000, 9999999999),
(2, 1700000001000, 1234567890),
]
df_events = spark.createDataFrame(events, schema=schema2)
df_events.show()
# Cast IntegerType to LongType if needed
df_cast = df.withColumn("total_sold", col("total_sold").cast("long"))
df_cast.printSchema()
DoubleType & FloatType
Floating point types for decimal numbers. They use binary approximation, which is fast but not perfectly precise. For financial calculations, use DecimalType instead.
| Type | Bytes | Precision | Python Type | Use When |
|---|---|---|---|---|
| FloatType | 4 bytes | ~7 decimal digits | float32 | ML features, approximations |
| DoubleType | 8 bytes | ~15 decimal digits | float64 | Scientific, pricing (non-financial) |
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, FloatType
schema = StructType([
StructField("sensor_id", StringType(), False),
StructField("temperature", DoubleType(), True),
StructField("confidence", FloatType(), True) # ML score 0.0-1.0
])
data = [
("SENSOR-01", 36.650123456789, 0.9875),
("SENSOR-02", -12.3, 0.5500),
("SENSOR-03", None, None),
]
df = spark.createDataFrame(data, schema=schema)
df.show()
# +---------+------------------+----------+
# |sensor_id| temperature|confidence|
# +---------+------------------+----------+
# |SENSOR-01|36.650123456789 |0.9875 |
# |SENSOR-02| -12.3| 0.55|
# |SENSOR-03| null| null|
# +---------+------------------+----------+
from pyspark.sql.functions import col, lit
# Floating point is NOT precise for all fractions
df_test = spark.range(1).withColumn("val", lit(0.1) + lit(0.2))
df_test.show()
# +-------------------+
# | val|
# +-------------------+
# |0.30000000000000004| ← NOT exactly 0.3!
# +-------------------+
# For money: use DecimalType, not DoubleType
BooleanType
The simplest type — only three possible values: true, false, or null. Used for flags, filter results, and conditional columns.
BooleanType stores exactly true or false. In Python you use True / False. The third possible value is null (unknown). Spark follows SQL three-valued logic: TRUE, FALSE, UNKNOWN (NULL).
from pyspark.sql.types import StructType, StructField, StringType, BooleanType, IntegerType
schema = StructType([
StructField("user_id", IntegerType(), False),
StructField("username", StringType(), False),
StructField("is_active", BooleanType(), True),
StructField("is_premium", BooleanType(), True),
StructField("email_verified", BooleanType(), True)
])
data = [
(1, "alice", True, True, True),
(2, "bob", True, False, None), # email unknown
(3, "charlie", False, False, False)
]
df = spark.createDataFrame(data, schema=schema)
df.show()
# +-------+--------+---------+----------+--------------+
# |user_id|username|is_active|is_premium|email_verified|
# +-------+--------+---------+----------+--------------+
# | 1| alice| true| true| true|
# | 2| bob| true| false| null|
# | 3| charlie| false| false| false|
# +-------+--------+---------+----------+--------------+
# Filter on boolean column — no need for == True
active_premium = df.filter(col("is_active") & col("is_premium"))
active_premium.show()
# Create boolean column from condition
from pyspark.sql.functions import col
df_with_flag = df.withColumn("fully_verified",
col("is_active") & col("email_verified") & col("is_premium")
)
df_with_flag.show()
When you do comparisons in Spark (like col("age") > 18), the result is automatically a BooleanType column.
from pyspark.sql.functions import col, when
# Create boolean flag from numeric condition
df_ages = spark.createDataFrame(
[("Alice", 25), ("Bob", 16), ("Charlie", 30)],
schema="name STRING, age INT"
)
df_ages = df_ages.withColumn("is_adult", col("age") >= 18)
df_ages.printSchema()
# |-- is_adult: boolean (nullable = false)
df_ages.show()
# +-------+---+--------+
# | name|age|is_adult|
# +-------+---+--------+
# | Alice| 25| true|
# | Bob| 16| false|
# |Charlie| 30| true|
# +-------+---+--------+
filter(col("is_active")) — not filter(col("is_active") == True). Both work, but the first is cleaner and standard.
DecimalType
The only exact decimal type in Spark. Used for financial data, prices, taxes, and any calculation where precision matters. DecimalType(precision, scale) — you specify exactly how many digits total and how many after the decimal point.
Precision = total number of digits (before + after decimal). Scale = digits after the decimal point.
from pyspark.sql.types import StructType, StructField, StringType, DecimalType, IntegerType
from decimal import Decimal
# Financial schema: price up to 8 digits total, 2 decimal places
# e.g. max price: 999999.99
schema = StructType([
StructField("order_id", IntegerType(), False),
StructField("product", StringType(), False),
StructField("unit_price", DecimalType(10, 2), True), # 99999999.99 max
StructField("tax_rate", DecimalType(5, 4), True), # 0.1800 = 18%
StructField("quantity", IntegerType(), True)
])
data = [
(1001, "Laptop", Decimal("89999.99"), Decimal("0.1800"), 2),
(1002, "Phone", Decimal("24999.00"), Decimal("0.1800"), 1),
(1003, "Cable", Decimal("199.50"), Decimal("0.0500"), 10),
]
df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# |-- unit_price: decimal(10,2) (nullable = true)
# |-- tax_rate: decimal(5,4) (nullable = true)
# Exact arithmetic — no floating point errors
from pyspark.sql.functions import col, round
df_totals = df.withColumn(
"total_before_tax",
(col("unit_price") * col("quantity")).cast(DecimalType(14, 2))
).withColumn(
"tax_amount",
(col("total_before_tax") * col("tax_rate")).cast(DecimalType(12, 2))
)
df_totals.show(truncate=False)
| DecimalType | Max Value | Use Case |
|---|---|---|
DecimalType(10, 2) | 99,999,999.99 | Retail prices, invoices |
DecimalType(15, 2) | 9.99 trillion | Bank balances, revenue |
DecimalType(5, 4) | 9.9999 | Tax rates, percentages |
DecimalType(38, 18) | Max precision | Cryptocurrency amounts |
DecimalType(10,2) * DecimalType(5,4) → result is DecimalType(15,6).
DateType & TimestampType
Two essential types for working with time data. DateType stores calendar dates (no time). TimestampType stores date + time + timezone. Choosing the wrong one causes bugs that are hard to debug in production.
DateType stores a calendar date: year, month, day. No time component. Internally stored as the number of days since epoch (1970-01-01). Format is always yyyy-MM-dd.
from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType
from datetime import date
from pyspark.sql.functions import col, to_date, year, month, datediff, current_date
schema = StructType([
StructField("order_id", IntegerType(), False),
StructField("order_date", DateType(), True),
StructField("delivery_date", DateType(), True)
])
data = [
(1, date(2024, 1, 15), date(2024, 1, 20)),
(2, date(2024, 3, 5), date(2024, 3, 8)),
(3, date(2024, 6, 20), None)
]
df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# |-- order_date: date (nullable = true)
# Date arithmetic — how many days between order and delivery?
df_diff = df.withColumn("days_to_deliver",
datediff(col("delivery_date"), col("order_date"))
)
# Extract year, month from date
df_parts = df.withColumn("order_year", year(col("order_date")))\
.withColumn("order_month", month(col("order_date")))
# Parse string to DateType
df_str = spark.createDataFrame([("2024-07-15",)], ["date_str"])
df_date = df_str.withColumn("parsed_date", to_date(col("date_str"), "yyyy-MM-dd"))
df_date.printSchema() # parsed_date: date
TimestampType stores a precise point in time: date + time down to microseconds + timezone awareness. Internally stored as microseconds since epoch. Spark interprets timestamps in the session timezone (default: UTC).
from pyspark.sql.types import TimestampType
from datetime import datetime
from pyspark.sql.functions import to_timestamp, current_timestamp, hour, minute
schema_ts = StructType([
StructField("event_id", IntegerType(), False),
StructField("event_name", StringType(), False),
StructField("event_time", TimestampType(), True)
])
data_ts = [
(1, "login", datetime(2024, 6, 15, 9, 30, 0)),
(2, "purchase", datetime(2024, 6, 15, 14, 22, 45)),
(3, "logout", datetime(2024, 6, 15, 18, 0, 0))
]
df_ts = spark.createDataFrame(data_ts, schema=schema_ts)
df_ts.show(truncate=False)
# +--------+----------+-------------------+
# |event_id|event_name| event_time|
# +--------+----------+-------------------+
# | 1| login|2024-06-15 09:30:00|
# | 2| purchase|2024-06-15 14:22:45|
# | 3| logout|2024-06-15 18:00:00|
# +--------+----------+-------------------+
# Parse string to timestamp
df_str2 = spark.createDataFrame([("2024-06-15 09:30:00",)], ["ts_str"])
df_ts2 = df_str2.withColumn("ts", to_timestamp(col("ts_str"), "yyyy-MM-dd HH:mm:ss"))
# Extract parts
df_ts.withColumn("event_hour", hour(col("event_time"))).show()
| Question | Use |
|---|---|
| Only care about the day (birthday, invoice date, batch date)? | DateType |
| Need exact time of an event (click, transaction, log)? | TimestampType |
| Doing date arithmetic (days between dates)? | DateType |
| Ordering events by time within a day? | TimestampType |
| Streaming event time for watermarks? | TimestampType (required) |
spark.conf.set("spark.sql.session.timeZone", "UTC"). In Spark 3.4+, TimestampNTZType (No Time Zone) is also available if you want timezone-naive timestamps.
ArrayType
ArrayType lets a single cell hold a list of values — all of the same type. This is one of Spark's most powerful features for semi-structured and nested data.
ArrayType(elementType, containsNull) defines a list of elements where every element must be the same type. elementType is the type of each item in the array. containsNull (default True) allows null values inside the array.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType
schema = StructType([
StructField("user_id", IntegerType(), False),
StructField("username", StringType(), False),
StructField("tags", ArrayType(StringType()), True), # list of strings
StructField("scores", ArrayType(IntegerType()), True), # list of ints
])
data = [
(1, "alice", ["python", "spark", "sql"], [95, 87, 92]),
(2, "bob", ["java", "scala"], [78, 84]),
(3, "charlie", [], None), # empty array
]
df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# root
# |-- user_id: integer (nullable = false)
# |-- username: string (nullable = false)
# |-- tags: array (nullable = true)
# | |-- element: string (containsNull = true)
# |-- scores: array (nullable = true)
# | |-- element: integer (containsNull = true)
df.show(truncate=False)
# +-------+--------+-------------------+-----------+
# |user_id|username| tags| scores|
# +-------+--------+-------------------+-----------+
# | 1| alice|[python, spark, sql]|[95, 87, 92]|
# | 2| bob| [java, scala]| [78, 84]|
# | 3| charlie| []| null|
# +-------+--------+-------------------+-----------+
from pyspark.sql.functions import (
col, explode, array_contains, size, array_sort,
element_at, array_distinct, flatten
)
# 1. Check if array contains a value
df.withColumn("has_spark", array_contains(col("tags"), "spark")).show()
# 2. Get size of array
df.withColumn("num_tags", size(col("tags"))).show()
# 3. Get element by position (1-indexed!)
df.withColumn("first_tag", element_at(col("tags"), 1)).show()
# 4. Sort the array
df.withColumn("sorted_tags", array_sort(col("tags"))).show(truncate=False)
# 5. Explode — turn each array element into a separate row
df.select("user_id", explode(col("tags")).alias("tag")).show()
# +-------+------+
# |user_id| tag|
# +-------+------+
# | 1|python|
# | 1| spark| ← one row per element!
# | 1| sql|
# | 2| java|
# | 2| scala|
# +-------+------+
# Note: row 3 (charlie, empty array) disappears with explode — use explode_outer to keep it
explode() drops rows where the array is null or empty. explode_outer() keeps those rows and returns null for the exploded value. Always use explode_outer when you want to preserve all parent rows.
MapType
MapType stores key-value pairs inside a single cell — like a Python dictionary per row. All keys must be the same type, all values must be the same type.
MapType is defined by three parameters: the type of the keys, the type of the values, and whether values can be null. Keys cannot be null. Common use case: storing metadata, properties, or dynamic attributes per row.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, MapType, DoubleType
schema = StructType([
StructField("product_id", IntegerType(), False),
StructField("name", StringType(), False),
# Map: attribute_name → attribute_value (string → string)
StructField("attributes", MapType(StringType(), StringType()), True),
# Map: region → revenue (string → double)
StructField("revenue_by_region", MapType(StringType(), DoubleType()), True)
])
data = [
(101, "Laptop",
{"color": "silver", "weight": "1.5kg", "brand": "Dell"},
{"India": 1200000.0, "USA": 850000.0}),
(102, "Phone",
{"color": "black", "storage": "256GB"},
{"India": 3200000.0, "UAE": 400000.0}),
]
df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# |-- attributes: map (nullable = true)
# | |-- key: string
# | |-- value: string (valueContainsNull = true)
# |-- revenue_by_region: map (nullable = true)
# | |-- key: string
# | |-- value: double (valueContainsNull = true)
df.show(truncate=False)
from pyspark.sql.functions import col, map_keys, map_values, element_at, explode
# 1. Get value by key — two ways
df.withColumn("color", col("attributes")["color"]).show()
df.withColumn("color", element_at(col("attributes"), "color")).show()
# 2. Get all keys
df.withColumn("attr_keys", map_keys(col("attributes"))).show(truncate=False)
# attr_keys: [color, weight, brand]
# 3. Get all values
df.withColumn("attr_vals", map_values(col("attributes"))).show(truncate=False)
# 4. Explode map — one row per key-value pair
df.select("product_id", explode(col("attributes")).alias("attr_key", "attr_val")).show()
# +----------+--------+---------+
# |product_id|attr_key| attr_val|
# +----------+--------+---------+
# | 101| color| silver|
# | 101| weight| 1.5kg|
# | 101| brand| Dell|
# | 102| color| black|
# | 102| storage| 256GB|
# +----------+--------+---------+
StructType as a Complex Column Type
StructType is used both as the top-level schema AND as a column type for nested objects. A struct column holds multiple named fields — like a row within a row.
When a column's type is StructType, that column holds a structured sub-object with named fields. You see this when reading JSON or when designing schemas for APIs. Access nested fields using dot notation.
from pyspark.sql.types import (
StructType, StructField, StringType, IntegerType, DoubleType
)
# Inner struct: address fields
address_schema = StructType([
StructField("street", StringType(), True),
StructField("city", StringType(), True),
StructField("pincode",IntegerType(),True),
StructField("country",StringType(), True)
])
# Inner struct: contact info
contact_schema = StructType([
StructField("phone", StringType(), True),
StructField("email", StringType(), True)
])
# Outer schema — uses structs as column types
customer_schema = StructType([
StructField("customer_id", IntegerType(), False),
StructField("name", StringType(), False),
StructField("address", address_schema, True), # ← Struct column!
StructField("contact", contact_schema, True) # ← Struct column!
])
data = [
(1, "Alice",
("MG Road", "Bengaluru", 560001, "India"),
("+91-9876543210", "alice@example.com")),
(2, "Bob",
("5th Ave", "New York", 10001, "USA"),
("+1-2125550100", "bob@example.com"))
]
df = spark.createDataFrame(data, schema=customer_schema)
df.printSchema()
# root
# |-- customer_id: integer (nullable = false)
# |-- name: string (nullable = false)
# |-- address: struct (nullable = true)
# | |-- street: string (nullable = true)
# | |-- city: string (nullable = true)
# | |-- pincode: integer (nullable = true)
# | |-- country: string (nullable = true)
# |-- contact: struct (nullable = true)
# | |-- phone: string (nullable = true)
# | |-- email: string (nullable = true)
# Access nested fields with dot notation
df.select(
"customer_id",
"name",
col("address.city"),
col("address.country"),
col("contact.email")
).show()
from pyspark.sql.functions import struct, col
# Bundle flat columns into a struct
df_flat = spark.createDataFrame([
(1, "Alice", "MG Road", "Bengaluru"),
(2, "Bob", "5th Ave", "New York")
], ["id", "name", "street", "city"])
# Group street + city into a struct called "address"
df_nested = df_flat.withColumn("address",
struct(
col("street").alias("street"),
col("city").alias("city")
)
).drop("street", "city")
df_nested.printSchema()
# root
# |-- id: long
# |-- name: string
# |-- address: struct (nullable = false)
# | |-- street: string
# | |-- city: string
col("address.city") or in SQL SELECT address.city FROM .... For deeply nested structs, chain the dots: col("order.customer.address.city").
Nested Structs
Structs can contain other structs, creating deeply nested hierarchies. This is the most common pattern when ingesting JSON from APIs, MongoDB, or event systems.
Real-world JSON APIs frequently have 3-4 levels of nesting. Spark handles this naturally — you just nest StructType inside StructType.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
# Level 3 — geo coordinates
geo_schema = StructType([
StructField("lat", DoubleType(), True),
StructField("lon", DoubleType(), True)
])
# Level 2 — address includes geo
address_schema = StructType([
StructField("street", StringType(), True),
StructField("city", StringType(), True),
StructField("geo", geo_schema, True) # ← nested struct
])
# Level 1 — customer includes address
customer_schema = StructType([
StructField("name", StringType(), True),
StructField("address", address_schema, True) # ← nested struct
])
# Root schema
root_schema = StructType([
StructField("order_id", IntegerType(), False),
StructField("customer", customer_schema, True) # ← nested struct
])
data = [
(1001, (("Alice", (("MG Road", "Bengaluru", (12.97, 77.59)))))),
(1002, (("Bob", (("5th Ave", "New York", (40.71, -74.0)))))),
]
df = spark.createDataFrame(data, schema=root_schema)
df.printSchema()
# Access 3 levels deep — chain dot notation
df.select(
"order_id",
col("customer.name"),
col("customer.address.city"),
col("customer.address.geo.lat").alias("latitude"),
col("customer.address.geo.lon").alias("longitude")
).show()
# +--------+-----+---------+---------+----------+
# |order_id| name| city| latitude| longitude|
# +--------+-----+---------+---------+----------+
# | 1001|Alice|Bengaluru| 12.97| 77.59|
# | 1002| Bob| New York| 40.71| -74.0 |
# +--------+-----+---------+---------+----------+
col("struct.*") to expand all fields, or select each field individually. This is called struct flattening and is covered in depth in Module 11.
Nested Arrays
Arrays can contain structs (array of objects), and structs can contain arrays (object with list fields). These combinations are extremely common in real-world JSON data.
An ArrayType(StructType(...)) is an array where each element is a structured object. This is how order line items, tags with metadata, and product variants are typically stored.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, DoubleType
from pyspark.sql.functions import col, explode
# Each order has multiple items — items is Array of Structs
item_schema = StructType([
StructField("item_name", StringType(), True),
StructField("quantity", IntegerType(), True),
StructField("price", DoubleType(), True)
])
order_schema = StructType([
StructField("order_id", IntegerType(), False),
StructField("customer", StringType(), False),
StructField("items", ArrayType(item_schema), True) # ← Array of Structs
])
data = [
(101, "Alice", [
("Laptop", 1, 89999.0),
("Mouse", 2, 499.0),
("Bag", 1, 1299.0)
]),
(102, "Bob", [
("Phone", 1, 24999.0)
])
]
df = spark.createDataFrame(data, schema=order_schema)
df.printSchema()
# |-- items: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- item_name: string (nullable = true)
# | | |-- quantity: integer (nullable = true)
# | | |-- price: double (nullable = true)
# Explode array of structs — each item becomes its own row
df_exploded = df.select(
"order_id",
"customer",
explode(col("items")).alias("item")
)
# Access struct fields on the exploded column
df_flat = df_exploded.select(
"order_id",
"customer",
col("item.item_name"),
col("item.quantity"),
col("item.price")
)
df_flat.show()
# +--------+--------+---------+--------+-------+
# |order_id|customer|item_name|quantity| price|
# +--------+--------+---------+--------+-------+
# | 101| Alice| Laptop| 1|89999.0|
# | 101| Alice| Mouse| 2| 499.0|
# | 101| Alice| Bag| 1| 1299.0|
# | 102| Bob| Phone| 1|24999.0|
# +--------+--------+---------+--------+-------+
from pyspark.sql.functions import flatten
# Matrix-like structure: array of arrays of integers
matrix_schema = StructType([
StructField("id", IntegerType(), False),
StructField("matrix", ArrayType(ArrayType(IntegerType())), True)
])
data = [
(1, [[1,2,3], [4,5,6], [7,8,9]]),
(2, [[10,20], [30,40]])
]
df_matrix = spark.createDataFrame(data, schema=matrix_schema)
df_matrix.show(truncate=False)
# Flatten array of arrays into a single array
df_flat2 = df_matrix.withColumn("flat", flatten(col("matrix")))
df_flat2.show(truncate=False)
# +---+-------------------+--------------------+
# | id| matrix| flat|
# +---+-------------------+--------------------+
# | 1|[[1,2,3],[4,5,6],..| [1, 2, 3, 4, 5, ..|
# | 2| [[10,20],[30,..| [10, 20, 30, 40]|
# +---+-------------------+--------------------+
Nested Maps
Maps can nest inside structs, and maps can contain arrays or other maps as values. These patterns appear in flexible, schema-less event data.
A map can have values that are arrays — useful for storing multiple values per key, like a user's favorite items per category.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, MapType, ArrayType
from pyspark.sql.functions import col, map_keys, map_values, explode
# Map: category → list of product names
schema = StructType([
StructField("user_id", IntegerType(), False),
StructField("preferences", MapType(StringType(), ArrayType(StringType())), True)
])
data = [
(1, {
"electronics": ["Laptop", "Phone", "Tablet"],
"books": ["Clean Code", "DDIA"],
"sports": ["Cricket Bat"]
}),
(2, {
"electronics": ["Headphones"],
"fashion": ["Sneakers", "Jacket"]
})
]
df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# |-- preferences: map (nullable = true)
# | |-- key: string
# | |-- value: array (valueContainsNull = true)
# | | |-- element: string (containsNull = true)
df.show(truncate=False)
# Get electronics preferences for each user
df.withColumn("elec_prefs", col("preferences")["electronics"]).show(truncate=False)
# Explode the map first, then explode the array
df_step1 = df.select("user_id", explode(col("preferences")).alias("category", "products"))
df_step2 = df_step1.select("user_id", "category", explode(col("products")).alias("product"))
df_step2.show()
# +-------+------------+------------+
# |user_id| category| product|
# +-------+------------+------------+
# | 1|electronics | Laptop|
# | 1|electronics | Phone|
# | 1| books| Clean Code|
# ... |
# Map: region_code → {sales, units} struct
metric_schema = StructType([
StructField("sales", DoubleType(), True),
StructField("units", IntegerType(), True)
])
schema2 = StructType([
StructField("product", StringType(), False),
StructField("by_region", MapType(StringType(), metric_schema), True)
])
data2 = [
("Laptop", {
"IN": (1200000.0, 300),
"US": (850000.0, 80)
}),
]
df2 = spark.createDataFrame(data2, schema=schema2)
df2.printSchema()
# |-- by_region: map (nullable = true)
# | |-- key: string
# | |-- value: struct (valueContainsNull = true)
# | | |-- sales: double (nullable = true)
# | | |-- units: integer (nullable = true)
# Access nested map → struct field
df2.withColumn("india_sales", col("by_region")["IN"]["sales"]).show()
Type Casting & Conversions
Changing a column from one type to another. This happens constantly in real data work — string IDs that should be integers, timestamps stored as strings, prices stored as text.
Use .cast() on a column to convert it to a new type. You can pass either a type class instance or a DDL string (shorter).
from pyspark.sql.functions import col, to_date, to_timestamp
from pyspark.sql.types import IntegerType, DoubleType, LongType, BooleanType, StringType
df_raw = spark.createDataFrame([
("101", "45.99", "1", "2024-01-15", "2024-01-15 10:30:00"),
("102", "12.50", "0", "2024-02-20", "2024-02-20 14:22:00"),
], ["id_str", "price_str", "flag_str", "date_str", "ts_str"])
# Method 1: Pass type class instance
df_cast = df_raw.withColumn("id_int", col("id_str").cast(IntegerType()))
df_cast = df_cast.withColumn("price_dbl", col("price_str").cast(DoubleType()))
# Method 2: DDL string shorthand (cleaner, same result)
df_cast = df_cast.withColumn("flag_bool", col("flag_str").cast("boolean"))
df_cast = df_cast.withColumn("id_long", col("id_str").cast("long"))
# Method 3: Dates and timestamps (use functions, not cast)
df_cast = df_cast.withColumn("order_date", to_date(col("date_str"), "yyyy-MM-dd"))
df_cast = df_cast.withColumn("order_ts", to_timestamp(col("ts_str"), "yyyy-MM-dd HH:mm:ss"))
df_cast.printSchema()
df_cast.show()
| DDL String | Type Class | Notes |
|---|---|---|
"string" | StringType() | Most common |
"int" or "integer" | IntegerType() | 4-byte int |
"long" or "bigint" | LongType() | 8-byte int |
"double" | DoubleType() | 8-byte float |
"float" | FloatType() | 4-byte float |
"boolean" | BooleanType() | "1"/"true"/"yes" → true |
"decimal(p,s)" | DecimalType(p,s) | Exact decimal |
"date" | DateType() | Use to_date() instead |
"timestamp" | TimestampType() | Use to_timestamp() instead |
# What happens when a string can't be cast to int?
df_bad = spark.createDataFrame([
("123",),
("abc",), # ← can't cast to int
("45.6",), # ← can't cast to int (has decimal)
(None,)
], ["val"])
df_result = df_bad.withColumn("val_int", col("val").cast("int"))
df_result.show()
# +-----+-------+
# | val|val_int|
# +-----+-------+
# | 123| 123| ← OK
# | abc| null| ← cast failed → NULL (no error!)
# | 45.6| null| ← cast failed → NULL
# | null| null| ← null stays null
# +-----+-------+
# IMPORTANT: Spark does NOT throw an error on failed casts!
# It silently returns null. Always validate after casting.
cast() never throws an error on bad values — it returns null instead. This means if you cast a column with bad data, you'll lose values silently. Always count nulls before and after casting to detect data quality issues.
Quick Reference & Quiz
Complete type reference cheatsheet and 5 interview-style questions to test your understanding of Spark data types.
| Type | DDL | Python | Key Use Case |
|---|---|---|---|
StringType() | STRING | str | Names, emails, codes, text |
IntegerType() | INT | int (≤2.1B) | Counts, IDs, ages |
LongType() | BIGINT | int (large) | Unix timestamps, huge IDs |
FloatType() | FLOAT | float32 | ML scores, approximations |
DoubleType() | DOUBLE | float64 | Scientific values, rates |
DecimalType(p,s) | DECIMAL(p,s) | Decimal | Money, prices, taxes |
BooleanType() | BOOLEAN | bool | Flags, conditions |
DateType() | DATE | date | Calendar dates (no time) |
TimestampType() | TIMESTAMP | datetime | Event times, log times |
ArrayType(T) | ARRAY<T> | list | Tags, items, scores |
MapType(K,V) | MAP<K,V> | dict | Properties, attributes |
StructType() | STRUCT<...> | tuple/Row | Nested objects, sub-rows |
You are building a payment system and need to store transaction amounts like ₹89,999.99. Which type should you use?
You cast a column with values ["123", "abc", "456"] to IntegerType. What does Spark return for "abc"?
A user has multiple phone numbers: ["98765", "91234"]. Another column stores their language preferences with proficiency: {"english": "native", "hindi": "fluent"}. What types would you use?
A column stores Unix timestamps in milliseconds (e.g. 1700000000000 = Nov 2023). Why can't you use IntegerType?
A DataFrame has a struct column order containing a nested struct customer with a field name. How do you select that field?
col("order.customer.name"). You can chain as many levels as needed. The bracket notation is for maps and arrays, not structs.