Spark Type System — Overview

Every column in a Spark DataFrame has an explicit data type. Understanding Spark's type system is critical — it controls how data is stored, compared, serialized, and optimized. Types fall into two families: Primitive (single values) and Complex (nested structures).

🗂️

The Two Families of Spark Types FOUNDATION ▾

Primitive Types — Single Values

Primitive types hold one value per cell. Each maps to a simple JVM value inside Spark and to a plain Python value (str, int, float, bool, and so on) when rows reach your Python code.

🔤

Text

StringType — any text, characters, unicode

🔢

Integer Numbers

ByteType, ShortType, IntegerType, LongType

💧

Decimal Numbers

FloatType, DoubleType, DecimalType

✅

Boolean

BooleanType — true / false only

📅

Date & Time

DateType, TimestampType, TimestampNTZType

🧮

Binary

BinaryType — raw byte arrays

Complex Types — Nested Structures

Complex types let one cell hold a collection or a nested object. This is where Spark really diverges from traditional SQL databases.

📋

ArrayType

A list of values, all the same type. Like a Python list.

🗺️

MapType

Key-value pairs inside one cell. Like a Python dict.

🏗️

StructType

A nested row (sub-table). Like a row within a row.

How Types Are Used in PySpark

Types appear in three main places in your code:

Schema Definition → StructField("name", StringType(), nullable=True)

Type Casting → col("age").cast(IntegerType())

Function Return Types → UDF return type = ArrayType(StringType())

python — all import paths

# Import primitive types
from pyspark.sql.types import (
    StringType, IntegerType, LongType,
    DoubleType, FloatType, BooleanType,
    DecimalType, DateType, TimestampType,
    BinaryType, ShortType, ByteType
)

# Import complex types
from pyspark.sql.types import (
    ArrayType, MapType, StructType, StructField
)

# Import NullType (rarely used but good to know)
from pyspark.sql.types import NullType

Key Rule

Every column in a Spark DataFrame must have exactly ONE type. Unlike Python where a list can hold mixed types, Spark columns are strictly typed. This enables massive performance optimizations.
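
To make the one-type-per-column rule concrete, here is a plain-Python sketch (no Spark session needed) of a check that rejects mixed-type columns. The helper name `column_type` is hypothetical and only mirrors the idea; Spark's own schema inference and verification are more involved.

```python
def column_type(values):
    """Return the single Python type shared by all non-null values.

    Hypothetical helper for illustration; None plays the role of Spark's null.
    """
    types = {type(v) for v in values if v is not None}
    if len(types) > 1:
        names = sorted(t.__name__ for t in types)
        raise TypeError(f"mixed types in one column: {names}")
    return types.pop() if types else type(None)


ages = [25, None, 41]            # one type plus a null: accepted
mixed = [25, "twenty-five", 41]  # int and str in one column: rejected

print(column_type(ages).__name__)
try:
    column_type(mixed)
except TypeError as err:
    print(err)
```

Nulls do not count as a second type, which matches how a nullable column still has exactly one declared type.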

6.2

StringType

The most commonly used type. Stores any sequence of characters: names, emails, JSON strings, encoded data. Internally Spark holds strings as UTF-8 encoded bytes (its UTF8String class), not as Java's UTF-16 String objects.

🔤

StringType — Everything You Need to Know PRIMITIVE ▾

What is StringType?

StringType stores text data of any length. Unlike VARCHAR(n) or CHAR(n) in SQL databases, a StringType column has no declared maximum length; it is always variable length. It stores any Unicode character, so names, emojis, and multilingual text all work fine.

Analogy

Think of StringType like a sticky note that can hold any amount of text — it doesn't matter if it's 1 character or 10,000 characters, it's the same type.

Defining a StringType Column

python

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("TypeDemo").getOrCreate()

# Define a schema with StringType columns
schema = StructType([
    StructField("name",    StringType(), nullable=True),
    StructField("email",   StringType(), nullable=True),
    StructField("country", StringType(), nullable=False)
])

data = [
    ("Alice",   "alice@example.com",  "India"),
    ("Bob",     None,                  "USA"),
    ("Chandra", "c@data.io",          "India")
]

df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- email: string (nullable = true)
#  |-- country: string (nullable = false)

df.show()
# +-------+-----------------+-------+
# |   name|            email|country|
# +-------+-----------------+-------+
# |  Alice|alice@example.com|  India|
# |    Bob|             null|    USA|
# |Chandra|        c@data.io|  India|
# +-------+-----------------+-------+

String Type Shorthand

You can also pass a DDL string instead of importing StringType. The column types come out the same; the one difference is that columns declared this way are nullable by default:

python — string alias

# Using DDL string: shorter, same column types
df2 = spark.createDataFrame(data, schema="name STRING, email STRING, country STRING")
df2.printSchema()  # Same types as above, but country is now nullable = true

# Casting a column to StringType
from pyspark.sql.functions import col

df_cast = df.withColumn("name_str", col("name").cast(StringType()))
# (name is already a string, so this cast is a no-op, but it is valid)

Watch Out

NULL is not the same as empty string "" in Spark. None in Python becomes null in Spark. Always distinguish between a missing value (null) and a blank value ("").

When to Use StringType

Use StringType for: names, emails, IDs, codes, free-form text, encoded JSON within a column, phone numbers (to preserve leading zeros), and any data where arithmetic doesn't make sense.

Real World

A customer_id like "CUST-00123" must be StringType because it contains letters and a hyphen. If you cast it to IntegerType, Spark cannot parse it: with ANSI mode off you get null, and with ANSI mode on the query fails.
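
The two ID problems mentioned above (lost leading zeros and unparseable codes) can be reproduced in plain Python. This is only a sketch: `try_parse_int` is a hypothetical helper that returns None on failure, roughly mirroring a cast that yields null instead of raising.

```python
def try_parse_int(text):
    """Return int(text), or None when text is not a valid integer."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


print(try_parse_int("00123"))       # parses, but the leading zeros are gone
print(try_parse_int("CUST-00123"))  # not a number at all
```

Keeping such values as strings avoids both outcomes.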

6.3

IntegerType & LongType

Whole number types. Choose based on the size of the numbers you need to store. Using the wrong type wastes memory or causes overflow.

🔢

Integer Family — ByteType, ShortType, IntegerType, LongType PRIMITIVE ▾

All Integer Types — Range Comparison

Spark has four integer types, each storing a different range of whole numbers. They differ only in byte size and max value:

Type	Bytes	Min Value	Max Value	Use When
ByteType	1 byte	-128	127	Flags, small enums
ShortType	2 bytes	-32,768	32,767	Age, status codes
IntegerType	4 bytes	-2.1 billion	2.1 billion	Default for most counts
LongType	8 bytes	-9.2 quintillion	9.2 quintillion	Timestamps (ms), huge IDs

Analogy

Think of these as boxes of different sizes. ByteType is a tiny box that holds only small numbers. LongType is a warehouse — you can store astronomical numbers, but it uses 8x the space of ByteType.
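
The ranges in the table follow directly from the byte sizes (signed two's-complement integers). The sketch below derives them in plain Python; `smallest_int_type` is a hypothetical helper for illustration, not a Spark function.

```python
# Byte sizes taken from the table above.
INT_TYPES = [("ByteType", 1), ("ShortType", 2), ("IntegerType", 4), ("LongType", 8)]


def int_range(num_bytes):
    """Min and max of a signed integer stored in num_bytes bytes."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def smallest_int_type(value):
    """Name of the narrowest type in INT_TYPES that can hold value."""
    for name, num_bytes in INT_TYPES:
        low, high = int_range(num_bytes)
        if low <= value <= high:
            return name
    raise OverflowError(f"{value} does not fit in LongType")


for name, num_bytes in INT_TYPES:
    print(name, int_range(num_bytes))
print(smallest_int_type(1_700_000_000_000))  # a Unix timestamp in milliseconds
```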

IntegerType — Most Common Whole Number Type

python

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType

schema = StructType([
    StructField("product_id",  IntegerType(), False),
    StructField("product_name", StringType(),  False),
    StructField("stock_count",  IntegerType(), True),
    StructField("total_sold",   LongType(),    True)
])

data = [
    (101, "Laptop",  45,    120_000_000),
    (102, "Phone",   200,   5_000_000_000),
    (103, "Tablet",  None,  890_000)
]

df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# root
#  |-- product_id: integer (nullable = false)
#  |-- product_name: string (nullable = false)
#  |-- stock_count: integer (nullable = true)
#  |-- total_sold: long (nullable = true)

df.show()

LongType — For Large Numbers

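LongType is critical when dealing with Unix timestamps in milliseconds, large transaction IDs, or row counts from massive datasets. IntegerType tops out at 2,147,483,647 (~2.1 billion); beyond that, depending on the code path and settings, values either fail loudly or are silently wrapped or nulled, which corrupts data.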

python — overflow danger

from pyspark.sql.functions import col

# Unix timestamp in ms is ~1.7 trillion — must use LongType!
# 1,700,000,000,000 > 2,147,483,647 (IntegerType max) → OVERFLOW

schema2 = StructType([
    StructField("event_id",       LongType(), False),
    StructField("event_time_ms",   LongType(), False),  # Unix ms timestamp
    StructField("user_id",         LongType(), False)
])

events = [
    (1, 1700000000000, 9999999999),
    (2, 1700000001000, 1234567890),
]
df_events = spark.createDataFrame(events, schema=schema2)
df_events.show()

# Widen an IntegerType column to LongType if needed
df_cast = df.withColumn("stock_count", col("stock_count").cast("long"))
df_cast.printSchema()

Interview Trap

If you read a CSV with a column containing values like 5000000000 and let Spark infer the schema, it will correctly pick LongType. But if you force IntegerType, the default PERMISSIVE read mode silently returns null for values that do not fit. Always verify your schema!
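
A plain-Python sketch of the two silent outcomes of squeezing a large number into 32 bits: wrapping (keeping only the low 32 bits, as a Java-style narrowing conversion does) and nulling. Which one Spark produces depends on the code path and on settings such as ANSI mode, so treat the function names here as illustrative only.

```python
INT_MIN, INT_MAX = -(2 ** 31), 2 ** 31 - 1


def wrap_to_int32(value):
    """Keep only the low 32 bits, reinterpreted as a signed integer."""
    return (value - INT_MIN) % (2 ** 32) + INT_MIN


def checked_int32(value):
    """Return value if it fits in 32 bits, else None (a stand-in for null)."""
    return value if INT_MIN <= value <= INT_MAX else None


big = 5_000_000_000
print(wrap_to_int32(big))   # a plausible-looking but wrong number
print(checked_int32(big))   # None
```

Neither result raises an error, which is why an explicit schema check is worth the effort.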

6.4

DoubleType & FloatType

Floating point types for decimal numbers. They use binary approximation, which is fast but not perfectly precise. For financial calculations, use DecimalType instead.

💧

FloatType vs DoubleType — Precision & Performance PRIMITIVE ▾

Key Differences

Type	Bytes	Precision	Python Type	Use When
FloatType	4 bytes	~7 decimal digits	float (narrowed to 32 bits in Spark)	ML features, approximations
DoubleType	8 bytes	~15 decimal digits	float (64-bit)	Scientific data, measurements

Analogy

Float is like a ruler with millimeter marks — accurate enough for most everyday tasks. Double is like a vernier caliper — far more precise. Neither is exact for all fractions (like 1/3), so never use them for money.

DoubleType — Default for Decimal Numbers

python

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, FloatType

schema = StructType([
    StructField("sensor_id",   StringType(), False),
    StructField("temperature",  DoubleType(), True),
    StructField("confidence",   FloatType(),  True)  # ML score 0.0-1.0
])

data = [
    ("SENSOR-01", 36.650123456789, 0.9875),
    ("SENSOR-02", -12.3,           0.5500),
    ("SENSOR-03",  None,            None),
]

df = spark.createDataFrame(data, schema=schema)
df.show()
# +---------+---------------+----------+
# |sensor_id|    temperature|confidence|
# +---------+---------------+----------+
# |SENSOR-01|36.650123456789|    0.9875|
# |SENSOR-02|          -12.3|      0.55|
# |SENSOR-03|           null|      null|
# +---------+---------------+----------+

Floating Point Precision Trap

python — precision issue demo

from pyspark.sql.functions import col, lit

# Floating point is NOT precise for all fractions
df_test = spark.range(1).select((lit(0.1) + lit(0.2)).alias("val"))
df_test.show()
# +-------------------+
# |                val|
# +-------------------+
# |0.30000000000000004|   ← NOT exactly 0.3!
# +-------------------+

# For money: use DecimalType, not DoubleType

Critical Warning

Never use DoubleType or FloatType for financial calculations (prices, balances, tax). The binary representation cannot store all decimals exactly. Use DecimalType(precision, scale) for money.
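
The same precision effects can be seen without Spark. The sketch below uses the standard library: `struct` to round-trip a value through a 32-bit float (the width of FloatType) and `decimal.Decimal` for exact decimal arithmetic. The helper name `as_float32` is made up for this example.

```python
import struct
from decimal import Decimal


def as_float32(x):
    """Round-trip a Python float (64-bit) through a 32-bit IEEE 754 float."""
    return struct.unpack("f", struct.pack("f", x))[0]


reading = 36.650123456789
print(as_float32(reading))   # only about 7 significant digits survive
print(0.1 + 0.2 == 0.3)      # binary doubles: False
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # exact decimals: True
```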

6.5

BooleanType

The simplest type — only three possible values: true, false, or null. Used for flags, filter results, and conditional columns.

✅

BooleanType — True / False / Null PRIMITIVE ▾

BooleanType Basics

BooleanType stores exactly true or false. In Python you use True / False. The third possible value is null (unknown). Spark follows SQL three-valued logic: TRUE, FALSE, UNKNOWN (NULL).

python

from pyspark.sql.types import StructType, StructField, StringType, BooleanType, IntegerType

schema = StructType([
    StructField("user_id",     IntegerType(), False),
    StructField("username",     StringType(),  False),
    StructField("is_active",    BooleanType(), True),
    StructField("is_premium",   BooleanType(), True),
    StructField("email_verified", BooleanType(), True)
])

data = [
    (1, "alice",   True,  True,  True),
    (2, "bob",     True,  False, None),   # email unknown
    (3, "charlie", False, False, False)
]

df = spark.createDataFrame(data, schema=schema)
df.show()
# +-------+--------+---------+----------+--------------+
# |user_id|username|is_active|is_premium|email_verified|
# +-------+--------+---------+----------+--------------+
# |      1|   alice|     true|      true|          true|
# |      2|     bob|     true|     false|          null|
# |      3| charlie|    false|     false|         false|
# +-------+--------+---------+----------+--------------+

from pyspark.sql.functions import col

# Filter on boolean column: no need for == True
active_premium = df.filter(col("is_active") & col("is_premium"))
active_premium.show()

# Create boolean column from condition
df_with_flag = df.withColumn("fully_verified",
    col("is_active") & col("email_verified") & col("is_premium")
)
df_with_flag.show()
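
Because of three-valued logic, a null in one operand does not always make the whole expression null. The plain-Python sketch below models SQL-style AND and OR with None standing in for null; `and3` and `or3` are illustrative names, not Spark functions.

```python
def and3(a, b):
    """SQL-style AND over True / False / None (None means unknown)."""
    if a is False or b is False:
        return False          # one definite False decides the result
    if a is None or b is None:
        return None           # otherwise any unknown keeps it unknown
    return True


def or3(a, b):
    """SQL-style OR over True / False / None."""
    if a is True or b is True:
        return True           # one definite True decides the result
    if a is None or b is None:
        return None
    return False


# bob's row: is_active=True, email_verified=None, is_premium=False
print(and3(and3(True, None), False))  # False, not None
print(and3(True, None))               # None
```

A filter keeps only rows where the condition is true, so rows that evaluate to null are dropped along with the false ones.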

Boolean from Comparisons

When you do comparisons in Spark (like col("age") > 18), the result is automatically a BooleanType column.

python — boolean from comparison

from pyspark.sql.functions import col, when

# Create boolean flag from numeric condition
df_ages = spark.createDataFrame(
    [("Alice", 25), ("Bob", 16), ("Charlie", 30)],
    schema="name STRING, age INT"
)

df_ages = df_ages.withColumn("is_adult", col("age") >= 18)
df_ages.printSchema()
# |-- is_adult: boolean (nullable = true)

df_ages.show()
# +-------+---+--------+
# |   name|age|is_adult|
# +-------+---+--------+
# |  Alice| 25|    true|
# |    Bob| 16|   false|
# |Charlie| 30|    true|
# +-------+---+--------+

Best Practice

When filtering on a boolean column, write filter(col("is_active")) — not filter(col("is_active") == True). Both work, but the first is cleaner and standard.

6.6

DecimalType

The only exact decimal type in Spark. Used for financial data, prices, taxes, and any calculation where precision matters. DecimalType(precision, scale) — you specify exactly how many digits total and how many after the decimal point.

🏦

DecimalType(precision, scale) — Exact Arithmetic PRIMITIVE ▾

Understanding Precision and Scale

Precision = total number of digits (before + after decimal). Scale = digits after the decimal point.

DecimalType(10, 2)
↑ precision=10 total digits, scale=2 after decimal
Example value: 12345678.99
              ←— 8 digits before ——→ ←2→
Total = 8 + 2 = 10 digits ✓ Fits in DecimalType(10,2)
Value 99999999999.99 → 13 total digits → OVERFLOW (needs precision ≥ 13)
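
The digit counting above can be sketched with Python's `decimal` module. `fits_decimal` is a hypothetical helper, and rounding extra fractional digits with ROUND_HALF_UP is an assumption made for this illustration.

```python
from decimal import Decimal, ROUND_HALF_UP


def fits_decimal(value, precision, scale):
    """Check whether value fits in DecimalType(precision, scale)."""
    quantum = Decimal(1).scaleb(-scale)              # e.g. 0.01 for scale=2
    rounded = value.quantize(quantum, rounding=ROUND_HALF_UP)
    return len(rounded.as_tuple().digits) <= precision


print(fits_decimal(Decimal("12345678.99"), 10, 2))     # 10 digits: fits
print(fits_decimal(Decimal("99999999999.99"), 10, 2))  # 13 digits: too wide
```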

DecimalType in Practice

python

from pyspark.sql.types import StructType, StructField, StringType, DecimalType, IntegerType
from decimal import Decimal

# Financial schema: price with up to 10 digits total, 2 of them after the decimal point
# e.g. max price: 99999999.99
schema = StructType([
    StructField("order_id",   IntegerType(),         False),
    StructField("product",    StringType(),          False),
    StructField("unit_price", DecimalType(10, 2),   True),  # 99999999.99 max
    StructField("tax_rate",   DecimalType(5, 4),    True),  # 0.1800 = 18%
    StructField("quantity",   IntegerType(),         True)
])

data = [
    (1001, "Laptop",  Decimal("89999.99"), Decimal("0.1800"), 2),
    (1002, "Phone",   Decimal("24999.00"), Decimal("0.1800"), 1),
    (1003, "Cable",   Decimal("199.50"),   Decimal("0.0500"), 10),
]

df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# |-- unit_price: decimal(10,2) (nullable = true)
# |-- tax_rate: decimal(5,4) (nullable = true)

# Exact arithmetic — no floating point errors
from pyspark.sql.functions import col

df_totals = df.withColumn(
    "total_before_tax",
    (col("unit_price") * col("quantity")).cast(DecimalType(14, 2))
).withColumn(
    "tax_amount",
    (col("total_before_tax") * col("tax_rate")).cast(DecimalType(12, 2))
)
df_totals.show(truncate=False)

Common DecimalType Configurations

DecimalType	Max Value	Use Case
`DecimalType(10, 2)`	99,999,999.99	Retail prices, invoices
`DecimalType(15, 2)`	9,999,999,999,999.99 (just under 10 trillion)	Bank balances, revenue
`DecimalType(5, 4)`	9.9999	Tax rates, percentages
`DecimalType(38, 18)`	Max precision	Cryptocurrency amounts

Watch Out

When you multiply two DecimalType columns, the result precision expands to p1 + p2 + 1 and the scale to s1 + s2. Cast the result explicitly to avoid precision overflow or unexpected results. E.g. DecimalType(10,2) * DecimalType(5,4) → result is DecimalType(16,6).
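
Spark's multiplication rule for decimals is precision p1 + p2 + 1 and scale s1 + s2. The sketch below only does that arithmetic in plain Python; the function names are made up, and what Spark does once the result exceeds the 38-digit maximum is deliberately left out.

```python
MAX_PRECISION = 38  # upper bound for DecimalType precision


def multiply_type(p1, s1, p2, s2):
    """Result (precision, scale) before any adjustment for the 38-digit cap."""
    return p1 + p2 + 1, s1 + s2


def describe(precision, scale):
    note = "" if precision <= MAX_PRECISION else " (exceeds 38; Spark must adjust)"
    return f"decimal({precision},{scale}){note}"


print(describe(*multiply_type(10, 2, 5, 4)))      # a price times a tax rate
print(describe(*multiply_type(38, 18, 38, 18)))   # two maximum-precision columns
```

Running the numbers before writing a cast shows how quickly chained multiplications approach the cap.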

6.7

DateType & TimestampType

Two essential types for working with time data. DateType stores calendar dates (no time). TimestampType stores date + time as an absolute instant, which Spark displays in the session time zone. Choosing the wrong one causes bugs that are hard to debug in production.

📅

DateType vs TimestampType — When to Use Each PRIMITIVE ▾

DateType — Date Only (No Time)

DateType stores a calendar date: year, month, day. No time component. Internally stored as the number of days since epoch (1970-01-01). Displayed as yyyy-MM-dd by default.

python — DateType

from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType
from datetime import date
from pyspark.sql.functions import col, to_date, year, month, datediff, current_date

schema = StructType([
    StructField("order_id",    IntegerType(), False),
    StructField("order_date",   DateType(),    True),
    StructField("delivery_date", DateType(),   True)
])

data = [
    (1, date(2024, 1, 15), date(2024, 1, 20)),
    (2, date(2024, 3, 5),  date(2024, 3, 8)),
    (3, date(2024, 6, 20), None)
]

df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# |-- order_date: date (nullable = true)

# Date arithmetic — how many days between order and delivery?
df_diff = df.withColumn("days_to_deliver",
    datediff(col("delivery_date"), col("order_date"))
)

# Extract year, month from date
df_parts = df.withColumn("order_year", year(col("order_date")))\
             .withColumn("order_month", month(col("order_date")))

# Parse string to DateType
df_str = spark.createDataFrame([("2024-07-15",)], ["date_str"])
df_date = df_str.withColumn("parsed_date", to_date(col("date_str"), "yyyy-MM-dd"))
df_date.printSchema()  # parsed_date: date
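
The days-since-epoch representation is easy to reproduce with Python's `datetime.date`. This is a sketch of the idea only; `to_epoch_days` is a hypothetical helper, not part of PySpark.

```python
from datetime import date

EPOCH = date(1970, 1, 1)


def to_epoch_days(d):
    """Days since 1970-01-01 for a calendar date."""
    return (d - EPOCH).days


order, delivery = date(2024, 1, 15), date(2024, 1, 20)
print(to_epoch_days(order))
print(to_epoch_days(delivery) - to_epoch_days(order))  # same idea as datediff
```

Stored this way, the difference between two dates is a single integer subtraction.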

TimestampType — Date + Time + Timezone

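TimestampType stores a precise point in time with microsecond resolution. Internally it is the number of microseconds since the epoch in UTC; no time zone is stored per value. Spark converts to and from the session time zone (spark.sql.session.timeZone, which defaults to the JVM's local time zone) when parsing and displaying.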

python — TimestampType

from pyspark.sql.types import TimestampType
from datetime import datetime
from pyspark.sql.functions import to_timestamp, current_timestamp, hour, minute

schema_ts = StructType([
    StructField("event_id",    IntegerType(),   False),
    StructField("event_name",  StringType(),    False),
    StructField("event_time",  TimestampType(), True)
])

data_ts = [
    (1, "login",    datetime(2024, 6, 15, 9, 30, 0)),
    (2, "purchase", datetime(2024, 6, 15, 14, 22, 45)),
    (3, "logout",   datetime(2024, 6, 15, 18, 0, 0))
]

df_ts = spark.createDataFrame(data_ts, schema=schema_ts)
df_ts.show(truncate=False)
# +--------+----------+-------------------+
# |event_id|event_name|event_time         |
# +--------+----------+-------------------+
# |1       |login     |2024-06-15 09:30:00|
# |2       |purchase  |2024-06-15 14:22:45|
# |3       |logout    |2024-06-15 18:00:00|
# +--------+----------+-------------------+

# Parse string to timestamp
df_str2 = spark.createDataFrame([("2024-06-15 09:30:00",)], ["ts_str"])
df_ts2 = df_str2.withColumn("ts", to_timestamp(col("ts_str"), "yyyy-MM-dd HH:mm:ss"))

# Extract parts
df_ts.withColumn("event_hour", hour(col("event_time"))).show()

DateType vs TimestampType — Decision Guide

Question	Use
Only care about the day (birthday, invoice date, batch date)?	DateType
Need exact time of an event (click, transaction, log)?	TimestampType
Doing date arithmetic (days between dates)?	DateType
Ordering events by time within a day?	TimestampType
Streaming event time for watermarks?	TimestampType (required)

Timezone Pitfall

TimestampType values are rendered in your Spark session time zone, which defaults to the JVM's local time zone and can therefore differ between machines. Set it explicitly: spark.conf.set("spark.sql.session.timeZone", "UTC"). In Spark 3.4+, TimestampNTZType (No Time Zone) is also available if you want timezone-naive timestamps.
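
A plain-Python sketch of why the session time zone matters: one stored instant (microseconds since the epoch) renders as different wall-clock times in different zones. The fixed +05:30 offset named `IST` and the helper `to_epoch_micros` are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
IST = timezone(timedelta(hours=5, minutes=30))   # fixed offset, for illustration


def to_epoch_micros(instant):
    """Microseconds since the Unix epoch for a timezone-aware datetime."""
    return (instant - EPOCH) // timedelta(microseconds=1)


login = datetime(2024, 6, 15, 9, 30, tzinfo=timezone.utc)
micros = to_epoch_micros(login)

# One stored number, two renderings depending on the chosen time zone.
print(micros)
print(login.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"))
print(login.astimezone(IST).strftime("%Y-%m-%d %H:%M:%S"))
```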

6.8

ArrayType

ArrayType lets a single cell hold a list of values — all of the same type. This is one of Spark's most powerful features for semi-structured and nested data.

📋

ArrayType — Lists Inside Cells COMPLEX ▾

What is ArrayType?

ArrayType(elementType, containsNull) defines a list of elements where every element must be the same type. elementType is the type of each item in the array. containsNull (default True) allows null values inside the array.

Analogy

Imagine a spreadsheet cell that doesn't hold one value but a whole grocery list. That list is an ArrayType — it can have 0, 1, or 100 items, and every item must be the same kind of thing (all strings, all integers, etc.).

Creating DataFrames with ArrayType

python

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

schema = StructType([
    StructField("user_id",    IntegerType(),             False),
    StructField("username",   StringType(),              False),
    StructField("tags",       ArrayType(StringType()),    True),   # list of strings
    StructField("scores",     ArrayType(IntegerType()),   True),   # list of ints
])

data = [
    (1, "alice",   ["python", "spark", "sql"],  [95, 87, 92]),
    (2, "bob",     ["java", "scala"],           [78, 84]),
    (3, "charlie", [],                          None),    # empty array
]

df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# root
#  |-- user_id: integer (nullable = false)
#  |-- username: string (nullable = false)
#  |-- tags: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- scores: array (nullable = true)
#  |    |-- element: integer (containsNull = true)

df.show(truncate=False)
# +-------+--------+--------------------+------------+
# |user_id|username|tags                |scores      |
# +-------+--------+--------------------+------------+
# |1      |alice   |[python, spark, sql]|[95, 87, 92]|
# |2      |bob     |[java, scala]       |[78, 84]    |
# |3      |charlie |[]                  |null        |
# +-------+--------+--------------------+------------+

Working with ArrayType — Key Functions

python — array functions

from pyspark.sql.functions import (
    col, explode, array_contains, size, array_sort,
    element_at, array_distinct, flatten
)

# 1. Check if array contains a value
df.withColumn("has_spark", array_contains(col("tags"), "spark")).show()

# 2. Get size of array
df.withColumn("num_tags", size(col("tags"))).show()

# 3. Get element by position (1-indexed! out-of-range gives null, or an error in ANSI mode)
df.withColumn("first_tag", element_at(col("tags"), 1)).show()

# 4. Sort the array
df.withColumn("sorted_tags", array_sort(col("tags"))).show(truncate=False)

# 5. Explode — turn each array element into a separate row
df.select("user_id", explode(col("tags")).alias("tag")).show()
# +-------+------+
# |user_id|   tag|
# +-------+------+
# |      1|python|
# |      1| spark|  ← one row per element!
# |      1|   sql|
# |      2|  java|
# |      2| scala|
# +-------+------+
# Note: row 3 (charlie, empty array) disappears with explode — use explode_outer to keep it

explode vs explode_outer

explode() drops rows where the array is null or empty. explode_outer() keeps those rows and returns null for the exploded value. Always use explode_outer when you want to preserve all parent rows.
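
The difference is easiest to see on a tiny dataset. The sketch below re-implements both behaviours over plain Python dicts; these functions only imitate the semantics described above and are not the PySpark functions of the same name.

```python
def explode(rows, key):
    """One output row per element; rows with a null or empty list vanish."""
    return [(r["user_id"], item) for r in rows for item in (r[key] or [])]


def explode_outer(rows, key):
    """Like explode, but null or empty lists still yield one row with None."""
    out = []
    for r in rows:
        for item in (r[key] or [None]):
            out.append((r["user_id"], item))
    return out


rows = [
    {"user_id": 1, "tags": ["python", "spark", "sql"]},
    {"user_id": 2, "tags": ["java", "scala"]},
    {"user_id": 3, "tags": []},
]
print(len(explode(rows, "tags")), len(explode_outer(rows, "tags")))
print(explode_outer(rows, "tags")[-1])   # user 3 survives with a None tag
```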

6.9

MapType

MapType stores key-value pairs inside a single cell — like a Python dictionary per row. All keys must be the same type, all values must be the same type.

🗺️

MapType — Dictionaries Inside Cells COMPLEX ▾

MapType(keyType, valueType, valueContainsNull)

MapType is defined by three parameters: the type of the keys, the type of the values, and whether values can be null. Keys cannot be null. Common use case: storing metadata, properties, or dynamic attributes per row.

python

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, MapType, DoubleType

schema = StructType([
    StructField("product_id", IntegerType(),                     False),
    StructField("name",        StringType(),                      False),
    # Map: attribute_name → attribute_value (string → string)
    StructField("attributes",  MapType(StringType(), StringType()), True),
    # Map: region → revenue (string → double)
    StructField("revenue_by_region", MapType(StringType(), DoubleType()), True)
])

data = [
    (101, "Laptop",
     {"color": "silver", "weight": "1.5kg", "brand": "Dell"},
     {"India": 1200000.0, "USA": 850000.0}),
    (102, "Phone",
     {"color": "black", "storage": "256GB"},
     {"India": 3200000.0, "UAE": 400000.0}),
]

df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# |-- attributes: map (nullable = true)
# |    |-- key: string
# |    |-- value: string (valueContainsNull = true)
# |-- revenue_by_region: map (nullable = true)
# |    |-- key: string
# |    |-- value: double (valueContainsNull = true)

df.show(truncate=False)

Accessing Map Values

python — map access functions

from pyspark.sql.functions import col, map_keys, map_values, element_at, explode

# 1. Get value by key — two ways
df.withColumn("color", col("attributes")["color"]).show()
df.withColumn("color", element_at(col("attributes"), "color")).show()

# 2. Get all keys
df.withColumn("attr_keys", map_keys(col("attributes"))).show(truncate=False)
# attr_keys: [color, weight, brand]

# 3. Get all values
df.withColumn("attr_vals", map_values(col("attributes"))).show(truncate=False)

# 4. Explode map — one row per key-value pair
df.select("product_id", explode(col("attributes")).alias("attr_key", "attr_val")).show()
# +----------+--------+--------+
# |product_id|attr_key|attr_val|
# +----------+--------+--------+
# |       101|   color|  silver|
# |       101|  weight|   1.5kg|
# |       101|   brand|    Dell|
# |       102|   color|   black|
# |       102| storage|   256GB|
# +----------+--------+--------+

Keys Must Be Unique

Within a single map value, each key must be unique. Since Spark 3.0, built-in map-building functions raise an error on duplicate keys by default; setting spark.sql.mapKeyDedupPolicy to LAST_WIN keeps the last value instead.
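
Two ways of handling duplicate keys can be sketched in plain Python. The policy names EXCEPTION and LAST_WIN are the values of Spark's `spark.sql.mapKeyDedupPolicy` setting (Spark 3.0+); the `build_map` function itself is a hypothetical stand-in, not Spark code.

```python
def build_map(pairs, policy="EXCEPTION"):
    """Build a dict from (key, value) pairs under a duplicate-key policy."""
    result = {}
    for key, value in pairs:
        if key is None:
            raise ValueError("map keys cannot be null")
        if key in result and policy == "EXCEPTION":
            raise ValueError(f"duplicate map key: {key!r}")
        result[key] = value   # under LAST_WIN the later value replaces the earlier one
    return result


pairs = [("color", "silver"), ("color", "black")]
print(build_map(pairs, policy="LAST_WIN"))
try:
    build_map(pairs)
except ValueError as err:
    print(err)
```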

6.10

StructType as a Complex Column Type

StructType is used both as the top-level schema AND as a column type for nested objects. A struct column holds multiple named fields — like a row within a row.

🏗️

StructType Column — Nested Object in a Cell COMPLEX ▾

StructType as a Column Type

When a column's type is StructType, that column holds a structured sub-object with named fields. You see this when reading JSON or when designing schemas for APIs. Access nested fields using dot notation.

python

from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, DoubleType
)
from pyspark.sql.functions import col

# Inner struct: address fields
address_schema = StructType([
    StructField("street", StringType(), True),
    StructField("city",   StringType(), True),
    StructField("pincode",IntegerType(),True),
    StructField("country",StringType(), True)
])

# Inner struct: contact info
contact_schema = StructType([
    StructField("phone", StringType(), True),
    StructField("email", StringType(), True)
])

# Outer schema — uses structs as column types
customer_schema = StructType([
    StructField("customer_id", IntegerType(),    False),
    StructField("name",         StringType(),    False),
    StructField("address",      address_schema,  True),   # ← Struct column!
    StructField("contact",      contact_schema,  True)    # ← Struct column!
])

data = [
    (1, "Alice",
     ("MG Road", "Bengaluru", 560001, "India"),
     ("+91-9876543210", "alice@example.com")),
    (2, "Bob",
     ("5th Ave", "New York", 10001, "USA"),
     ("+1-2125550100", "bob@example.com"))
]

df = spark.createDataFrame(data, schema=customer_schema)
df.printSchema()
# root
#  |-- customer_id: integer (nullable = false)
#  |-- name: string (nullable = false)
#  |-- address: struct (nullable = true)
#  |    |-- street: string (nullable = true)
#  |    |-- city: string (nullable = true)
#  |    |-- pincode: integer (nullable = true)
#  |    |-- country: string (nullable = true)
#  |-- contact: struct (nullable = true)
#  |    |-- phone: string (nullable = true)
#  |    |-- email: string (nullable = true)

# Access nested fields with dot notation
df.select(
    "customer_id",
    "name",
    col("address.city"),
    col("address.country"),
    col("contact.email")
).show()

Creating Struct Columns Dynamically

python — struct() function

from pyspark.sql.functions import struct, col

# Bundle flat columns into a struct
df_flat = spark.createDataFrame([
    (1, "Alice", "MG Road", "Bengaluru"),
    (2, "Bob",   "5th Ave", "New York")
], ["id", "name", "street", "city"])

# Group street + city into a struct called "address"
df_nested = df_flat.withColumn("address",
    struct(
        col("street").alias("street"),
        col("city").alias("city")
    )
).drop("street", "city")

df_nested.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- name: string (nullable = true)
#  |-- address: struct (nullable = false)
#  |    |-- street: string (nullable = true)
#  |    |-- city: string (nullable = true)

Dot Notation for Access

Access nested struct fields with dot notation in both the Python API and Spark SQL: col("address.city") or in SQL SELECT address.city FROM .... For deeply nested structs, chain the dots: col("order.customer.address.city").
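
Conceptually, a dotted path is resolved one field at a time. The plain-Python sketch below walks nested dicts the same way; `get_path` is a hypothetical helper. It returns None when a struct along the way is null and raises KeyError for an unknown field, loosely mirroring how Spark yields null for fields of a null struct but rejects field names that are not in the schema.

```python
def get_path(row, path):
    """Follow a dotted path through nested dicts."""
    current = row
    for part in path.split("."):
        if current is None:
            return None          # a null struct yields null for every nested field
        current = current[part]  # KeyError here means the field does not exist
    return current


order = {"customer": {"name": "Alice", "address": {"city": "Bengaluru"}}}
print(get_path(order, "customer.address.city"))
print(get_path({"customer": None}, "customer.address.city"))
```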

6.11

Nested Structs

Structs can contain other structs, creating deeply nested hierarchies. This is the most common pattern when ingesting JSON from APIs, MongoDB, or event systems.

🪆

Multi-Level Nested Structs MICRO TOPIC ▾

Deeply Nested Schema Example

Real-world JSON APIs frequently have 3-4 levels of nesting. Spark handles this naturally — you just nest StructType inside StructType.

order (StructType)
├── order_id : IntegerType
├── customer : StructType
│   ├── name : StringType
│   └── address : StructType
│       ├── street : StringType
│       ├── city : StringType
│       └── geo : StructType
│           ├── lat : DoubleType
│           └── lon : DoubleType
└── items : ArrayType(StructType)

Code: 3-Level Nested Struct

python

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.sql.functions import col

# Level 3 — geo coordinates
geo_schema = StructType([
    StructField("lat", DoubleType(), True),
    StructField("lon", DoubleType(), True)
])

# Level 2 — address includes geo
address_schema = StructType([
    StructField("street", StringType(), True),
    StructField("city",   StringType(), True),
    StructField("geo",    geo_schema,   True)  # ← nested struct
])

# Level 1 — customer includes address
customer_schema = StructType([
    StructField("name",    StringType(),    True),
    StructField("address", address_schema,  True)  # ← nested struct
])

# Root schema
root_schema = StructType([
    StructField("order_id",  IntegerType(),   False),
    StructField("customer",  customer_schema, True)  # ← nested struct
])

data = [
    # Each nested struct is one tuple: (order_id, (name, (street, city, (lat, lon))))
    (1001, ("Alice", ("MG Road", "Bengaluru", (12.97, 77.59)))),
    (1002, ("Bob",   ("5th Ave", "New York",  (40.71, -74.0)))),
]

df = spark.createDataFrame(data, schema=root_schema)
df.printSchema()

# Access 3 levels deep — chain dot notation
df.select(
    "order_id",
    col("customer.name"),
    col("customer.address.city"),
    col("customer.address.geo.lat").alias("latitude"),
    col("customer.address.geo.lon").alias("longitude")
).show()
# +--------+-----+---------+--------+---------+
# |order_id| name|     city|latitude|longitude|
# +--------+-----+---------+--------+---------+
# |    1001|Alice|Bengaluru|   12.97|    77.59|
# |    1002|  Bob| New York|   40.71|    -74.0|
# +--------+-----+---------+--------+---------+

Flattening Nested Structs

To flatten a nested struct into individual columns, use col("struct.*") to expand all fields, or select each field individually. This is called struct flattening and is covered in depth in Module 11.
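
As a rough illustration of what flattening produces, here is a plain-Python sketch that turns nested dicts into one flat dict. The underscore-joined column names are an assumption made for this example, and `flatten_struct` is a hypothetical helper rather than a Spark API.

```python
def flatten_struct(row, prefix="", sep="_"):
    """Turn nested dicts into one flat dict with joined field names."""
    flat = {}
    for name, value in row.items():
        full = f"{prefix}{sep}{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten_struct(value, full, sep))  # recurse into the struct
        else:
            flat[full] = value
    return flat


order = {"order_id": 1001,
         "customer": {"name": "Alice",
                      "address": {"city": "Bengaluru",
                                  "geo": {"lat": 12.97, "lon": 77.59}}}}
print(flatten_struct(order))
```

Prefixing with the parent name keeps flattened columns unique when two structs share a field name such as city.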

6.12

Nested Arrays

Arrays can contain structs (array of objects), and structs can contain arrays (object with list fields). These combinations are extremely common in real-world JSON data.

📦

Arrays of Structs & Structs with Arrays MICRO TOPIC ▾

Pattern 1 — Array of Structs

An ArrayType(StructType(...)) is an array where each element is a structured object. This is how order line items, tags with metadata, and product variants are typically stored.

python — ArrayType of StructType

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, DoubleType
from pyspark.sql.functions import col, explode

# Each order has multiple items — items is Array of Structs
item_schema = StructType([
    StructField("item_name", StringType(),  True),
    StructField("quantity",  IntegerType(), True),
    StructField("price",     DoubleType(),  True)
])

order_schema = StructType([
    StructField("order_id", IntegerType(),              False),
    StructField("customer", StringType(),               False),
    StructField("items",    ArrayType(item_schema),      True)  # ← Array of Structs
])

data = [
    (101, "Alice", [
        ("Laptop", 1, 89999.0),
        ("Mouse",  2, 499.0),
        ("Bag",    1, 1299.0)
    ]),
    (102, "Bob", [
        ("Phone", 1, 24999.0)
    ])
]

df = spark.createDataFrame(data, schema=order_schema)
df.printSchema()
# |-- items: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- item_name: string (nullable = true)
# |    |    |-- quantity: integer (nullable = true)
# |    |    |-- price: double (nullable = true)

# Explode array of structs — each item becomes its own row
df_exploded = df.select(
    "order_id",
    "customer",
    explode(col("items")).alias("item")
)

# Access struct fields on the exploded column
df_flat = df_exploded.select(
    "order_id",
    "customer",
    col("item.item_name"),
    col("item.quantity"),
    col("item.price")
)
df_flat.show()
# +--------+--------+---------+--------+-------+
# |order_id|customer|item_name|quantity|  price|
# +--------+--------+---------+--------+-------+
# |     101|   Alice|   Laptop|       1|89999.0|
# |     101|   Alice|    Mouse|       2|  499.0|
# |     101|   Alice|      Bag|       1| 1299.0|
# |     102|     Bob|    Phone|       1|24999.0|
# +--------+--------+---------+--------+-------+

Pattern 2 — Array of Arrays

python — ArrayType of ArrayType

from pyspark.sql.functions import flatten

# Matrix-like structure: array of arrays of integers
matrix_schema = StructType([
    StructField("id",     IntegerType(),                          False),
    StructField("matrix", ArrayType(ArrayType(IntegerType())),      True)
])

data = [
    (1, [[1,2,3], [4,5,6], [7,8,9]]),
    (2, [[10,20], [30,40]])
]

df_matrix = spark.createDataFrame(data, schema=matrix_schema)
df_matrix.show(truncate=False)

# Flatten array of arrays into a single array
df_flat2 = df_matrix.withColumn("flat", flatten(col("matrix")))
df_flat2.show(truncate=False)
# +---+---------------------------------+---------------------------+
# |id |matrix                           |flat                       |
# +---+---------------------------------+---------------------------+
# |1  |[[1, 2, 3], [4, 5, 6], [7, 8, 9]]|[1, 2, 3, 4, 5, 6, 7, 8, 9]|
# |2  |[[10, 20], [30, 40]]             |[10, 20, 30, 40]           |
# +---+---------------------------------+---------------------------+

6.13

Nested Maps

Maps can nest inside structs, and maps can contain arrays or other maps as values. These patterns appear in flexible, schema-less event data.

🗺️

Maps with Complex Values — Maps inside Structs MICRO TOPIC ▾

Map with Array Values

A map can have values that are arrays — useful for storing multiple values per key, like a user's favorite items per category.

python — MapType with ArrayType values

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, MapType, ArrayType
from pyspark.sql.functions import col, map_keys, map_values, explode

# Map: category → list of product names
schema = StructType([
    StructField("user_id",    IntegerType(), False),
    StructField("preferences", MapType(StringType(), ArrayType(StringType())), True)
])

data = [
    (1, {
        "electronics": ["Laptop", "Phone", "Tablet"],
        "books":       ["Clean Code", "DDIA"],
        "sports":      ["Cricket Bat"]
    }),
    (2, {
        "electronics": ["Headphones"],
        "fashion":     ["Sneakers", "Jacket"]
    })
]

df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# |-- preferences: map (nullable = true)
# |    |-- key: string
# |    |-- value: array (valueContainsNull = true)
# |    |    |-- element: string (containsNull = true)

df.show(truncate=False)

# Get electronics preferences for each user
df.withColumn("elec_prefs", col("preferences")["electronics"]).show(truncate=False)

# Explode the map first, then explode the array
df_step1 = df.select("user_id", explode(col("preferences")).alias("category", "products"))
df_step2 = df_step1.select("user_id", "category", explode(col("products")).alias("product"))
df_step2.show()
# +-------+-----------+-----------+
# |user_id|   category|    product|
# +-------+-----------+-----------+
# |      1|electronics|     Laptop|
# |      1|electronics|      Phone|
# |      1|electronics|     Tablet|
# |      1|      books| Clean Code|
# ... (9 rows in total; row order may differ)

Map with Struct Values

python — MapType with StructType values

from pyspark.sql.types import DoubleType  # the other types are imported in the previous block

# Map: region_code → {sales, units} struct
metric_schema = StructType([
    StructField("sales", DoubleType(),  True),
    StructField("units", IntegerType(), True)
])

schema2 = StructType([
    StructField("product",  StringType(), False),
    StructField("by_region", MapType(StringType(), metric_schema), True)
])

data2 = [
    ("Laptop", {
        "IN": (1200000.0, 300),
        "US": (850000.0,  80)
    }),
]
df2 = spark.createDataFrame(data2, schema=schema2)
df2.printSchema()
# |-- by_region: map (nullable = true)
# |    |-- key: string
# |    |-- value: struct (valueContainsNull = true)
# |    |    |-- sales: double (nullable = true)
# |    |    |-- units: integer (nullable = true)

# Access nested map → struct field
df2.withColumn("india_sales", col("by_region")["IN"]["sales"]).show()

Real World Pattern

Event tracking systems often use MapType(StringType(), StringType()) for flexible event properties, where different events carry different keys. A new property then needs no schema change, so the schema does not grow one column per property, and the values remain accessible by key.

6.14

Type Casting & Conversions

Casting changes a column from one type to another. It comes up constantly in real data work: string IDs that should be integers, timestamps stored as strings, prices stored as text.

🔄

Casting Types — .cast() and DDL String Shortcuts CONVERSION ▾

The .cast() Method

Use .cast() on a column to convert it to a new type. You can pass either a type class instance or a DDL string (shorter).

python — all casting methods

from pyspark.sql.functions import col, to_date, to_timestamp
from pyspark.sql.types import IntegerType, DoubleType, LongType, BooleanType, StringType

df_raw = spark.createDataFrame([
    ("101", "45.99", "1",  "2024-01-15", "2024-01-15 10:30:00"),
    ("102", "12.50", "0",  "2024-02-20", "2024-02-20 14:22:00"),
], ["id_str", "price_str", "flag_str", "date_str", "ts_str"])

# Method 1: Pass type class instance
df_cast = df_raw.withColumn("id_int",   col("id_str").cast(IntegerType()))
df_cast = df_cast.withColumn("price_dbl", col("price_str").cast(DoubleType()))

# Method 2: DDL string shorthand (cleaner, same result)
df_cast = df_cast.withColumn("flag_bool", col("flag_str").cast("boolean"))
df_cast = df_cast.withColumn("id_long",   col("id_str").cast("long"))

# Method 3: Dates and timestamps (prefer these functions when you need an explicit format)
df_cast = df_cast.withColumn("order_date", to_date(col("date_str"), "yyyy-MM-dd"))
df_cast = df_cast.withColumn("order_ts",   to_timestamp(col("ts_str"), "yyyy-MM-dd HH:mm:ss"))

df_cast.printSchema()
df_cast.show()

DDL String Type Aliases — Quick Reference

DDL String	Type Class	Notes
`"string"`	StringType()	Most common
`"int"` or `"integer"`	IntegerType()	4-byte int
`"long"` or `"bigint"`	LongType()	8-byte int
`"double"`	DoubleType()	8-byte float
`"float"`	FloatType()	4-byte float
`"boolean"`	BooleanType()	"1"/"true"/"yes" → true; "0"/"false"/"no" → false
`"decimal(p,s)"`	DecimalType(p,s)	Exact decimal
`"date"`	DateType()	Works for ISO yyyy-MM-dd strings; use to_date() for other formats
`"timestamp"`	TimestampType()	Works for ISO-style strings; use to_timestamp() for other formats

What Happens When Casting Fails?

python — cast failure behavior

# What happens when a string can't be cast to int?
# (Behavior shown is for ANSI mode off: spark.sql.ansi.enabled=false)
df_bad = spark.createDataFrame([
    ("123",),
    ("abc",),    # ← can't cast to int
    ("45.6",),   # ← fractional part is dropped
    (None,)
], ["val"])

df_result = df_bad.withColumn("val_int", col("val").cast("int"))
df_result.show()
# +----+-------+
# | val|val_int|
# +----+-------+
# | 123|    123|  ← OK
# | abc|   null|  ← cast failed → NULL (no error!)
# |45.6|     45|  ← fraction truncated, no warning
# |null|   null|  ← null stays null
# +----+-------+

# IMPORTANT: with ANSI mode off, Spark does NOT throw an error on failed casts!
# It silently returns null. Always validate after casting.

Silent Failure — Critical to Know

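With ANSI mode off (spark.sql.ansi.enabled=false, the default before Spark 4.0), cast() does not throw an error on bad values; it returns null instead. If you cast a column that contains bad data, you lose those values silently. With ANSI mode on, the same cast raises an error at runtime. Always count nulls before and after casting to detect data quality issues.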

6.15

Quick Reference & Quiz

Complete type reference cheatsheet and 5 interview-style questions to test your understanding of Spark data types.

📌

Module 6 — Complete Cheatsheet REFERENCE ▾

All Spark Types at a Glance

Type	DDL	Python	Key Use Case
`StringType()`	STRING	str	Names, emails, codes, text
`IntegerType()`	INT	int (≤2.1B)	Counts, IDs, ages
`LongType()`	BIGINT	int (large)	Unix timestamps, huge IDs
`FloatType()`	FLOAT	float (stored as 32-bit)	ML scores, approximations
`DoubleType()`	DOUBLE	float (64-bit)	Scientific values, rates
`DecimalType(p,s)`	DECIMAL(p,s)	Decimal	Money, prices, taxes
`BooleanType()`	BOOLEAN	bool	Flags, conditions
`DateType()`	DATE	date	Calendar dates (no time)
`TimestampType()`	TIMESTAMP	datetime	Event times, log times
`ArrayType(T)`	ARRAY<T>	list	Tags, items, scores
`MapType(K,V)`	MAP<K,V>	dict	Properties, attributes
`StructType()`	STRUCT<...>	tuple/Row	Nested objects, sub-rows

Type Selection Decision Tree

Is it text?
→ Yes → StringType()
Is it a whole number?
→ Fits in 2.1 billion? → IntegerType()
→ Larger? → LongType()
Is it a decimal?
→ Money/Finance? → DecimalType(p,s)
→ Approx ok? → DoubleType()
Is it a date?
→ Date only? → DateType()
→ Date + time? → TimestampType()
Is it a list? → ArrayType(elementType)
Is it key-value pairs? → MapType(keyType, valueType)
Is it a nested object? → StructType([StructField...])

🧠

Module 6 — Quiz (5 Questions) QUIZ ▾

Q1: Type for Financial Values

You are building a payment system and need to store transaction amounts like ₹89,999.99. Which type should you use?

A) DoubleType() — it handles decimals

B) FloatType() — smaller memory footprint

C) DecimalType(10, 2) — exact precision for money

D) StringType() — to preserve formatting

✅ Answer: C. DecimalType(10, 2) gives exact decimal arithmetic with no binary approximation errors. DoubleType and FloatType use binary floating point, which cannot represent all decimals exactly (e.g. 0.1 + 0.2 ≠ 0.3).
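The 0.1 + 0.2 claim can be checked in plain Python, whose decimal.Decimal is the Python-side counterpart of DecimalType listed in the cheatsheet. A minimal sketch using only the standard library; it illustrates the arithmetic, not Spark itself, and the variable names are illustrative.

```python
from decimal import Decimal

# Binary floating point: 0.1 and 0.2 have no exact binary representation
float_sum = 0.1 + 0.2
float_is_exact = (float_sum == 0.3)

# Decimal arithmetic: values are stored digit by digit, so the sum is exact
decimal_sum = Decimal("0.10") + Decimal("0.20")
decimal_is_exact = (decimal_sum == Decimal("0.30"))

print("float  :", float_sum, float_is_exact)
print("decimal:", decimal_sum, decimal_is_exact)
```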

Q2: Cast Failure Behavior

With ANSI mode off, you cast a column with values ["123", "abc", "456"] to IntegerType. What does Spark return for "abc"?

A) Throws AnalysisException

B) Throws RuntimeException

C) Returns null silently

D) Drops the row

✅ Answer: C. With ANSI mode off, Spark returns null for a failed cast instead of throwing an error; with ANSI mode on, the cast raises an error at runtime. Silent nulls are a common source of data loss bugs. Always validate by counting nulls after a cast.

Q3: Array vs Map

A user has multiple phone numbers: ["98765", "91234"]. Another column stores their language preferences with proficiency: {"english": "native", "hindi": "fluent"}. What types would you use?

A) ArrayType(StringType()) and MapType(StringType(), StringType())

B) MapType for both

C) ArrayType for both

D) StructType for both

✅ Answer: A. A list of phone numbers with no key → ArrayType. A mapping of language → proficiency → MapType. An array is an ordered list of values; a map holds key-value pairs.

Q4: LongType Necessity

A column stores Unix timestamps in milliseconds (e.g. 1700000000000 = Nov 2023). Why can't you use IntegerType?

A) IntegerType doesn't support negative numbers

B) 1,700,000,000,000 exceeds IntegerType's max of ~2.1 billion

C) Timestamps must always use TimestampType

D) IntegerType can't store millisecond precision

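✅ Answer: B. IntegerType's maximum is 2,147,483,647. A Unix millisecond timestamp for 2023 is about 1,700,000,000,000, roughly 800x larger, so it cannot fit. Depending on how the value arrives, it is rejected at load time, wraps around to a wrong number, or becomes null. Use LongType instead.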
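The size comparison is simple arithmetic and can be verified in plain Python. A minimal sketch using only built-in integers; the constant and variable names are illustrative, and the bounds are those of 32-bit and 64-bit signed integers.

```python
# Bounds of a 32-bit signed integer (IntegerType) and a 64-bit one (LongType)
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

ts_ms = 1_700_000_000_000  # Unix timestamp in milliseconds (Nov 2023)

fits_int = INT32_MIN <= ts_ms <= INT32_MAX
fits_long = INT64_MIN <= ts_ms <= INT64_MAX
ratio = ts_ms / INT32_MAX  # how many times larger than the int maximum

print(f"fits IntegerType: {fits_int}, fits LongType: {fits_long}, ratio: {ratio:.0f}x")
```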

Q5: Accessing Nested Struct

A DataFrame has a struct column order containing a nested struct customer with a field name. How do you select that field?

A) col("order['customer']['name']")

B) col("order").customer.name

C) col("order.customer.name")

D) getField("order").getField("customer").getField("name")

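✅ Answer: C. Dot notation is the standard way to access nested struct fields: col("order.customer.name"). You can chain as many levels as needed. Bracket access on the Column object, col("order")["customer"]["name"], also works, but brackets placed inside the name string (option A) are not resolved as field access.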

Module 6 Complete ✓

You've covered the core Spark data types: 9 primitive types (String, Integer, Long, Float, Double, Decimal, Boolean, Date, Timestamp) and 3 complex types (Array, Map, Struct), plus their nested combinations. Next up: Module 7, DataFrame Transformations, the most important module in the series.