MODULE 6 Spark Data Types
1 / 15
6.1

Spark Type System — Overview

Every column in a Spark DataFrame has an explicit data type. Understanding Spark's type system is critical — it controls how data is stored, compared, serialized, and optimized. Types fall into two families: Primitive (single values) and Complex (nested structures).

🗂️
The Two Families of Spark Types FOUNDATION
Primitive Types — Single Values

Primitive types hold one value per cell. They map directly to Python or Java native types under the hood.

🔤
Text
StringType — any text, characters, unicode
🔢
Integer Numbers
ByteType, ShortType, IntegerType, LongType
💧
Decimal Numbers
FloatType, DoubleType, DecimalType
Boolean
BooleanType — true / false only
📅
Date & Time
DateType, TimestampType, TimestampNTZType
🧮
Binary
BinaryType — raw byte arrays
Complex Types — Nested Structures

Complex types let one cell hold a collection or a nested object. This is where Spark really diverges from traditional SQL databases.

📋
ArrayType
A list of values, all the same type. Like Python list.
🗺️
MapType
Key-value pairs inside one cell. Like Python dict.
🏗️
StructType
A nested row (sub-table). Like a row within a row.
How Types Are Used in PySpark

Types appear in three main places in your code:

Schema Definition
StructField("name", StringType(), nullable)
Type Casting
df.col("age").cast(IntegerType())
Function Return Types
UDF return type = ArrayType(StringType())
python — all import paths
# Import primitive types
from pyspark.sql.types import (
    StringType, IntegerType, LongType,
    DoubleType, FloatType, BooleanType,
    DecimalType, DateType, TimestampType,
    BinaryType, ShortType, ByteType
)

# Import complex types
from pyspark.sql.types import (
    ArrayType, MapType, StructType, StructField
)

# Import NullType (rarely used but good to know)
from pyspark.sql.types import NullType
Key Rule
Every column in a Spark DataFrame must have exactly ONE type. Unlike Python where a list can hold mixed types, Spark columns are strictly typed. This enables massive performance optimizations.
6.2

StringType

The most commonly used type. Stores any sequence of characters — names, emails, JSON strings, encoded data. Internally stored as Java String (UTF-16).

🔤
StringType — Everything You Need to Know PRIMITIVE
What is StringType?

StringType stores text data of any length. There is no VARCHAR(n) or CHAR(n) like in SQL — Spark's StringType is always variable length. It stores any Unicode character, so names, emojis, and multilingual text all work fine.

Analogy
Think of StringType like a sticky note that can hold any amount of text — it doesn't matter if it's 1 character or 10,000 characters, it's the same type.
Defining a StringType Column
python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("TypeDemo").getOrCreate()

# Define a schema with StringType columns
schema = StructType([
    StructField("name",    StringType(), nullable=True),
    StructField("email",   StringType(), nullable=True),
    StructField("country", StringType(), nullable=False)
])

data = [
    ("Alice",   "alice@example.com",  "India"),
    ("Bob",     None,                  "USA"),
    ("Chandra", "c@data.io",          "India")
]

df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- email: string (nullable = true)
#  |-- country: string (nullable = false)

df.show()
# +-------+-----------------+-------+
# |   name|            email|country|
# +-------+-----------------+-------+
# |  Alice|alice@example.com|  India|
# |    Bob|             null|    USA|
# |Chandra|        c@data.io|  India|
# +-------+-----------------+-------+
String Type Shorthand

You can also use the string alias instead of importing StringType — both produce the same result:

python — string alias
# Using DDL string — shorter, same result
df2 = spark.createDataFrame(data, schema="name STRING, email STRING, country STRING")
df2.printSchema()  # Same output as above

# Casting a column to StringType
from pyspark.sql.functions import col

df_cast = df.withColumn("name_upper", col("name").cast(StringType()))
# (col is already string, but cast is valid)
Watch Out
NULL is not the same as empty string "" in Spark. None in Python becomes null in Spark. Always distinguish between a missing value (null) and a blank value ("").
When to Use StringType

Use StringType for: names, emails, IDs, codes, free-form text, encoded JSON within a column, phone numbers (to preserve leading zeros), and any data where arithmetic doesn't make sense.

Real World
A customer_id like "CUST-00123" must be StringType — it starts with letters and hyphens. If you use IntegerType, Spark will fail to parse it.
6.3

IntegerType & LongType

Whole number types. Choose based on the size of the numbers you need to store. Using the wrong type wastes memory or causes overflow.

🔢
Integer Family — ByteType, ShortType, IntegerType, LongType PRIMITIVE
All Integer Types — Range Comparison

Spark has four integer types, each storing a different range of whole numbers. They differ only in byte size and max value:

TypeBytesMin ValueMax ValueUse When
ByteType1 byte-128127Flags, small enums
ShortType2 bytes-32,76832,767Age, status codes
IntegerType4 bytes-2.1 billion2.1 billionDefault for most counts
LongType8 bytes-9.2 quintillion9.2 quintillionTimestamps (ms), huge IDs
Analogy
Think of these as boxes of different sizes. ByteType is a tiny box that holds only small numbers. LongType is a warehouse — you can store astronomical numbers, but it uses 8x the space of ByteType.
IntegerType — Most Common Whole Number Type
python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType

schema = StructType([
    StructField("product_id",  IntegerType(), False),
    StructField("product_name", StringType(),  False),
    StructField("stock_count",  IntegerType(), True),
    StructField("total_sold",   LongType(),    True)
])

data = [
    (101, "Laptop",  45,    120_000_000),
    (102, "Phone",   200,   5_000_000_000),
    (103, "Tablet",  None,  890_000)
]

df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# root
#  |-- product_id: integer (nullable = false)
#  |-- product_name: string (nullable = false)
#  |-- stock_count: integer (nullable = true)
#  |-- total_sold: long (nullable = true)

df.show()
LongType — For Large Numbers

LongType is critical when dealing with Unix timestamps in milliseconds, large transaction IDs, or row counts from massive datasets. IntegerType would overflow at ~2.1 billion, causing silent data corruption.

python — overflow danger
from pyspark.sql.functions import col

# Unix timestamp in ms is ~1.7 trillion — must use LongType!
# 1,700,000,000,000 > 2,147,483,647 (IntegerType max) → OVERFLOW

schema2 = StructType([
    StructField("event_id",       LongType(), False),
    StructField("event_time_ms",   LongType(), False),  # Unix ms timestamp
    StructField("user_id",         LongType(), False)
])

events = [
    (1, 1700000000000, 9999999999),
    (2, 1700000001000, 1234567890),
]
df_events = spark.createDataFrame(events, schema=schema2)
df_events.show()

# Cast IntegerType to LongType if needed
df_cast = df.withColumn("total_sold", col("total_sold").cast("long"))
df_cast.printSchema()
Interview Trap
If you read a CSV with a column containing values like 5,000,000,000 and Spark infers the schema, it will correctly use LongType. But if you force IntegerType, Spark will silently return null for values that overflow. Always verify your schema!
6.4

DoubleType & FloatType

Floating point types for decimal numbers. They use binary approximation, which is fast but not perfectly precise. For financial calculations, use DecimalType instead.

💧
FloatType vs DoubleType — Precision & Performance PRIMITIVE
Key Differences
TypeBytesPrecisionPython TypeUse When
FloatType 4 bytes ~7 decimal digits float32 ML features, approximations
DoubleType 8 bytes ~15 decimal digits float64 Scientific, pricing (non-financial)
Analogy
Float is like a ruler with millimeter marks — accurate enough for most everyday tasks. Double is like a vernier caliper — far more precise. Neither is exact for all fractions (like 1/3), so never use them for money.
DoubleType — Default for Decimal Numbers
python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, FloatType

schema = StructType([
    StructField("sensor_id",   StringType(), False),
    StructField("temperature",  DoubleType(), True),
    StructField("confidence",   FloatType(),  True)  # ML score 0.0-1.0
])

data = [
    ("SENSOR-01", 36.650123456789, 0.9875),
    ("SENSOR-02", -12.3,           0.5500),
    ("SENSOR-03",  None,            None),
]

df = spark.createDataFrame(data, schema=schema)
df.show()
# +---------+------------------+----------+
# |sensor_id|       temperature|confidence|
# +---------+------------------+----------+
# |SENSOR-01|36.650123456789   |0.9875    |
# |SENSOR-02|             -12.3|      0.55|
# |SENSOR-03|              null|      null|
# +---------+------------------+----------+
Floating Point Precision Trap
python — precision issue demo
from pyspark.sql.functions import col, lit

# Floating point is NOT precise for all fractions
df_test = spark.range(1).withColumn("val", lit(0.1) + lit(0.2))
df_test.show()
# +-------------------+
# |                val|
# +-------------------+
# |0.30000000000000004|   ← NOT exactly 0.3!
# +-------------------+

# For money: use DecimalType, not DoubleType
Critical Warning
Never use DoubleType or FloatType for financial calculations (prices, balances, tax). The binary representation cannot store all decimals exactly. Use DecimalType(precision, scale) for money.
6.5

BooleanType

The simplest type — only three possible values: true, false, or null. Used for flags, filter results, and conditional columns.

BooleanType — True / False / Null PRIMITIVE
BooleanType Basics

BooleanType stores exactly true or false. In Python you use True / False. The third possible value is null (unknown). Spark follows SQL three-valued logic: TRUE, FALSE, UNKNOWN (NULL).

python
from pyspark.sql.types import StructType, StructField, StringType, BooleanType, IntegerType

schema = StructType([
    StructField("user_id",     IntegerType(), False),
    StructField("username",     StringType(),  False),
    StructField("is_active",    BooleanType(), True),
    StructField("is_premium",   BooleanType(), True),
    StructField("email_verified", BooleanType(), True)
])

data = [
    (1, "alice",   True,  True,  True),
    (2, "bob",     True,  False, None),   # email unknown
    (3, "charlie", False, False, False)
]

df = spark.createDataFrame(data, schema=schema)
df.show()
# +-------+--------+---------+----------+--------------+
# |user_id|username|is_active|is_premium|email_verified|
# +-------+--------+---------+----------+--------------+
# |      1|   alice|     true|      true|          true|
# |      2|     bob|     true|     false|          null|
# |      3| charlie|    false|     false|         false|
# +-------+--------+---------+----------+--------------+

# Filter on boolean column — no need for == True
active_premium = df.filter(col("is_active") & col("is_premium"))
active_premium.show()

# Create boolean column from condition
from pyspark.sql.functions import col
df_with_flag = df.withColumn("fully_verified",
    col("is_active") & col("email_verified") & col("is_premium")
)
df_with_flag.show()
Boolean from Comparisons

When you do comparisons in Spark (like col("age") > 18), the result is automatically a BooleanType column.

python — boolean from comparison
from pyspark.sql.functions import col, when

# Create boolean flag from numeric condition
df_ages = spark.createDataFrame(
    [("Alice", 25), ("Bob", 16), ("Charlie", 30)],
    schema="name STRING, age INT"
)

df_ages = df_ages.withColumn("is_adult", col("age") >= 18)
df_ages.printSchema()
# |-- is_adult: boolean (nullable = false)

df_ages.show()
# +-------+---+--------+
# |   name|age|is_adult|
# +-------+---+--------+
# |  Alice| 25|    true|
# |    Bob| 16|   false|
# |Charlie| 30|    true|
# +-------+---+--------+
Best Practice
When filtering on a boolean column, write filter(col("is_active")) — not filter(col("is_active") == True). Both work, but the first is cleaner and standard.
6.6

DecimalType

The only exact decimal type in Spark. Used for financial data, prices, taxes, and any calculation where precision matters. DecimalType(precision, scale) — you specify exactly how many digits total and how many after the decimal point.

🏦
DecimalType(precision, scale) — Exact Arithmetic PRIMITIVE
Understanding Precision and Scale

Precision = total number of digits (before + after decimal). Scale = digits after the decimal point.

DecimalType(10, 2)
↑ precision=10 total digits, scale=2 after decimal
Example value: 12345678.99
←— 8 digits before ——→ ←2→
Total = 8 + 2 = 10 digits ✓ Fits in DecimalType(10,2)
Value 99999999999.99 → 13 total digits → OVERFLOW (needs precision ≥ 13)
DecimalType in Practice
python
from pyspark.sql.types import StructType, StructField, StringType, DecimalType, IntegerType
from decimal import Decimal

# Financial schema: price up to 8 digits total, 2 decimal places
# e.g. max price: 999999.99
schema = StructType([
    StructField("order_id",   IntegerType(),         False),
    StructField("product",    StringType(),          False),
    StructField("unit_price", DecimalType(10, 2),   True),  # 99999999.99 max
    StructField("tax_rate",   DecimalType(5, 4),    True),  # 0.1800 = 18%
    StructField("quantity",   IntegerType(),         True)
])

data = [
    (1001, "Laptop",  Decimal("89999.99"), Decimal("0.1800"), 2),
    (1002, "Phone",   Decimal("24999.00"), Decimal("0.1800"), 1),
    (1003, "Cable",   Decimal("199.50"),   Decimal("0.0500"), 10),
]

df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# |-- unit_price: decimal(10,2) (nullable = true)
# |-- tax_rate: decimal(5,4) (nullable = true)

# Exact arithmetic — no floating point errors
from pyspark.sql.functions import col, round

df_totals = df.withColumn(
    "total_before_tax",
    (col("unit_price") * col("quantity")).cast(DecimalType(14, 2))
).withColumn(
    "tax_amount",
    (col("total_before_tax") * col("tax_rate")).cast(DecimalType(12, 2))
)
df_totals.show(truncate=False)
Common DecimalType Configurations
DecimalTypeMax ValueUse Case
DecimalType(10, 2)99,999,999.99Retail prices, invoices
DecimalType(15, 2)9.99 trillionBank balances, revenue
DecimalType(5, 4)9.9999Tax rates, percentages
DecimalType(38, 18)Max precisionCryptocurrency amounts
Watch Out
When you multiply two DecimalType columns, the result precision expands. Cast the result explicitly to avoid precision overflow or unexpected results. E.g. DecimalType(10,2) * DecimalType(5,4) → result is DecimalType(15,6).
6.7

DateType & TimestampType

Two essential types for working with time data. DateType stores calendar dates (no time). TimestampType stores date + time + timezone. Choosing the wrong one causes bugs that are hard to debug in production.

📅
DateType vs TimestampType — When to Use Each PRIMITIVE
DateType — Date Only (No Time)

DateType stores a calendar date: year, month, day. No time component. Internally stored as the number of days since epoch (1970-01-01). Format is always yyyy-MM-dd.

python — DateType
from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType
from datetime import date
from pyspark.sql.functions import col, to_date, year, month, datediff, current_date

schema = StructType([
    StructField("order_id",    IntegerType(), False),
    StructField("order_date",   DateType(),    True),
    StructField("delivery_date", DateType(),   True)
])

data = [
    (1, date(2024, 1, 15), date(2024, 1, 20)),
    (2, date(2024, 3, 5),  date(2024, 3, 8)),
    (3, date(2024, 6, 20), None)
]

df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# |-- order_date: date (nullable = true)

# Date arithmetic — how many days between order and delivery?
df_diff = df.withColumn("days_to_deliver",
    datediff(col("delivery_date"), col("order_date"))
)

# Extract year, month from date
df_parts = df.withColumn("order_year", year(col("order_date")))\
             .withColumn("order_month", month(col("order_date")))

# Parse string to DateType
df_str = spark.createDataFrame([("2024-07-15",)], ["date_str"])
df_date = df_str.withColumn("parsed_date", to_date(col("date_str"), "yyyy-MM-dd"))
df_date.printSchema()  # parsed_date: date
TimestampType — Date + Time + Timezone

TimestampType stores a precise point in time: date + time down to microseconds + timezone awareness. Internally stored as microseconds since epoch. Spark interprets timestamps in the session timezone (default: UTC).

python — TimestampType
from pyspark.sql.types import TimestampType
from datetime import datetime
from pyspark.sql.functions import to_timestamp, current_timestamp, hour, minute

schema_ts = StructType([
    StructField("event_id",    IntegerType(),   False),
    StructField("event_name",  StringType(),    False),
    StructField("event_time",  TimestampType(), True)
])

data_ts = [
    (1, "login",    datetime(2024, 6, 15, 9, 30, 0)),
    (2, "purchase", datetime(2024, 6, 15, 14, 22, 45)),
    (3, "logout",   datetime(2024, 6, 15, 18, 0, 0))
]

df_ts = spark.createDataFrame(data_ts, schema=schema_ts)
df_ts.show(truncate=False)
# +--------+----------+-------------------+
# |event_id|event_name|         event_time|
# +--------+----------+-------------------+
# |       1|     login|2024-06-15 09:30:00|
# |       2|  purchase|2024-06-15 14:22:45|
# |       3|    logout|2024-06-15 18:00:00|
# +--------+----------+-------------------+

# Parse string to timestamp
df_str2 = spark.createDataFrame([("2024-06-15 09:30:00",)], ["ts_str"])
df_ts2 = df_str2.withColumn("ts", to_timestamp(col("ts_str"), "yyyy-MM-dd HH:mm:ss"))

# Extract parts
df_ts.withColumn("event_hour", hour(col("event_time"))).show()
DateType vs TimestampType — Decision Guide
QuestionUse
Only care about the day (birthday, invoice date, batch date)?DateType
Need exact time of an event (click, transaction, log)?TimestampType
Doing date arithmetic (days between dates)?DateType
Ordering events by time within a day?TimestampType
Streaming event time for watermarks?TimestampType (required)
Timezone Pitfall
TimestampType is timezone-aware and uses your Spark session timezone. Set it explicitly: spark.conf.set("spark.sql.session.timeZone", "UTC"). In Spark 3.4+, TimestampNTZType (No Time Zone) is also available if you want timezone-naive timestamps.
6.8

ArrayType

ArrayType lets a single cell hold a list of values — all of the same type. This is one of Spark's most powerful features for semi-structured and nested data.

📋
ArrayType — Lists Inside Cells COMPLEX
What is ArrayType?

ArrayType(elementType, containsNull) defines a list of elements where every element must be the same type. elementType is the type of each item in the array. containsNull (default True) allows null values inside the array.

Analogy
Imagine a spreadsheet cell that doesn't hold one value but a whole grocery list. That list is an ArrayType — it can have 0, 1, or 100 items, and every item must be the same kind of thing (all strings, all integers, etc.).
Creating DataFrames with ArrayType
python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

schema = StructType([
    StructField("user_id",    IntegerType(),             False),
    StructField("username",   StringType(),              False),
    StructField("tags",       ArrayType(StringType()),    True),   # list of strings
    StructField("scores",     ArrayType(IntegerType()),   True),   # list of ints
])

data = [
    (1, "alice",   ["python", "spark", "sql"],  [95, 87, 92]),
    (2, "bob",     ["java", "scala"],           [78, 84]),
    (3, "charlie", [],                          None),    # empty array
]

df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# root
#  |-- user_id: integer (nullable = false)
#  |-- username: string (nullable = false)
#  |-- tags: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- scores: array (nullable = true)
#  |    |-- element: integer (containsNull = true)

df.show(truncate=False)
# +-------+--------+-------------------+-----------+
# |user_id|username|               tags|     scores|
# +-------+--------+-------------------+-----------+
# |      1|   alice|[python, spark, sql]|[95, 87, 92]|
# |      2|     bob|      [java, scala]|   [78, 84]|
# |      3| charlie|                 []|       null|
# +-------+--------+-------------------+-----------+
Working with ArrayType — Key Functions
python — array functions
from pyspark.sql.functions import (
    col, explode, array_contains, size, array_sort,
    element_at, array_distinct, flatten
)

# 1. Check if array contains a value
df.withColumn("has_spark", array_contains(col("tags"), "spark")).show()

# 2. Get size of array
df.withColumn("num_tags", size(col("tags"))).show()

# 3. Get element by position (1-indexed!)
df.withColumn("first_tag", element_at(col("tags"), 1)).show()

# 4. Sort the array
df.withColumn("sorted_tags", array_sort(col("tags"))).show(truncate=False)

# 5. Explode — turn each array element into a separate row
df.select("user_id", explode(col("tags")).alias("tag")).show()
# +-------+------+
# |user_id|   tag|
# +-------+------+
# |      1|python|
# |      1| spark|  ← one row per element!
# |      1|   sql|
# |      2|  java|
# |      2| scala|
# +-------+------+
# Note: row 3 (charlie, empty array) disappears with explode — use explode_outer to keep it
explode vs explode_outer
explode() drops rows where the array is null or empty. explode_outer() keeps those rows and returns null for the exploded value. Always use explode_outer when you want to preserve all parent rows.
6.9

MapType

MapType stores key-value pairs inside a single cell — like a Python dictionary per row. All keys must be the same type, all values must be the same type.

🗺️
MapType — Dictionaries Inside Cells COMPLEX
MapType(keyType, valueType, valueContainsNull)

MapType is defined by three parameters: the type of the keys, the type of the values, and whether values can be null. Keys cannot be null. Common use case: storing metadata, properties, or dynamic attributes per row.

python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, MapType, DoubleType

schema = StructType([
    StructField("product_id", IntegerType(),                     False),
    StructField("name",        StringType(),                      False),
    # Map: attribute_name → attribute_value (string → string)
    StructField("attributes",  MapType(StringType(), StringType()), True),
    # Map: region → revenue (string → double)
    StructField("revenue_by_region", MapType(StringType(), DoubleType()), True)
])

data = [
    (101, "Laptop",
     {"color": "silver", "weight": "1.5kg", "brand": "Dell"},
     {"India": 1200000.0, "USA": 850000.0}),
    (102, "Phone",
     {"color": "black", "storage": "256GB"},
     {"India": 3200000.0, "UAE": 400000.0}),
]

df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# |-- attributes: map (nullable = true)
# |    |-- key: string
# |    |-- value: string (valueContainsNull = true)
# |-- revenue_by_region: map (nullable = true)
# |    |-- key: string
# |    |-- value: double (valueContainsNull = true)

df.show(truncate=False)
Accessing Map Values
python — map access functions
from pyspark.sql.functions import col, map_keys, map_values, element_at, explode

# 1. Get value by key — two ways
df.withColumn("color", col("attributes")["color"]).show()
df.withColumn("color", element_at(col("attributes"), "color")).show()

# 2. Get all keys
df.withColumn("attr_keys", map_keys(col("attributes"))).show(truncate=False)
# attr_keys: [color, weight, brand]

# 3. Get all values
df.withColumn("attr_vals", map_values(col("attributes"))).show(truncate=False)

# 4. Explode map — one row per key-value pair
df.select("product_id", explode(col("attributes")).alias("attr_key", "attr_val")).show()
# +----------+--------+---------+
# |product_id|attr_key| attr_val|
# +----------+--------+---------+
# |       101|   color|   silver|
# |       101|  weight|    1.5kg|
# |       101|   brand|     Dell|
# |       102|   color|    black|
# |       102| storage|    256GB|
# +----------+--------+---------+
Keys Must Be Unique
Within a single map value, each key must be unique. Duplicate keys in a MapType column cause undefined behavior — Spark may keep one or throw an error.
6.10

StructType as a Complex Column Type

StructType is used both as the top-level schema AND as a column type for nested objects. A struct column holds multiple named fields — like a row within a row.

🏗️
StructType Column — Nested Object in a Cell COMPLEX
StructType as a Column Type

When a column's type is StructType, that column holds a structured sub-object with named fields. You see this when reading JSON or when designing schemas for APIs. Access nested fields using dot notation.

python
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, DoubleType
)

# Inner struct: address fields
address_schema = StructType([
    StructField("street", StringType(), True),
    StructField("city",   StringType(), True),
    StructField("pincode",IntegerType(),True),
    StructField("country",StringType(), True)
])

# Inner struct: contact info
contact_schema = StructType([
    StructField("phone", StringType(), True),
    StructField("email", StringType(), True)
])

# Outer schema — uses structs as column types
customer_schema = StructType([
    StructField("customer_id", IntegerType(),    False),
    StructField("name",         StringType(),    False),
    StructField("address",      address_schema,  True),   # ← Struct column!
    StructField("contact",      contact_schema,  True)    # ← Struct column!
])

data = [
    (1, "Alice",
     ("MG Road", "Bengaluru", 560001, "India"),
     ("+91-9876543210", "alice@example.com")),
    (2, "Bob",
     ("5th Ave", "New York", 10001, "USA"),
     ("+1-2125550100", "bob@example.com"))
]

df = spark.createDataFrame(data, schema=customer_schema)
df.printSchema()
# root
#  |-- customer_id: integer (nullable = false)
#  |-- name: string (nullable = false)
#  |-- address: struct (nullable = true)
#  |    |-- street: string (nullable = true)
#  |    |-- city: string (nullable = true)
#  |    |-- pincode: integer (nullable = true)
#  |    |-- country: string (nullable = true)
#  |-- contact: struct (nullable = true)
#  |    |-- phone: string (nullable = true)
#  |    |-- email: string (nullable = true)

# Access nested fields with dot notation
df.select(
    "customer_id",
    "name",
    col("address.city"),
    col("address.country"),
    col("contact.email")
).show()
Creating Struct Columns Dynamically
python — struct() function
from pyspark.sql.functions import struct, col

# Bundle flat columns into a struct
df_flat = spark.createDataFrame([
    (1, "Alice", "MG Road", "Bengaluru"),
    (2, "Bob",   "5th Ave", "New York")
], ["id", "name", "street", "city"])

# Group street + city into a struct called "address"
df_nested = df_flat.withColumn("address",
    struct(
        col("street").alias("street"),
        col("city").alias("city")
    )
).drop("street", "city")

df_nested.printSchema()
# root
#  |-- id: long
#  |-- name: string
#  |-- address: struct (nullable = false)
#  |    |-- street: string
#  |    |-- city: string
Dot Notation for Access
Access nested struct fields with dot notation in both the Python API and Spark SQL: col("address.city") or in SQL SELECT address.city FROM .... For deeply nested structs, chain the dots: col("order.customer.address.city").
6.11

Nested Structs

Structs can contain other structs, creating deeply nested hierarchies. This is the most common pattern when ingesting JSON from APIs, MongoDB, or event systems.

🪆
Multi-Level Nested Structs MICRO TOPIC
Deeply Nested Schema Example

Real-world JSON APIs frequently have 3-4 levels of nesting. Spark handles this naturally — you just nest StructType inside StructType.

order (StructType)
├── order_id : IntegerType
├── customer : StructType
│ ├── name : StringType
│ └── address : StructType
│ ├── street : StringType
│ ├── city : StringType
│ └── geo : StructType
│ ├── lat : DoubleType
│ └── lon : DoubleType
└── items : ArrayType(StructType)
Code: 3-Level Nested Struct
python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Level 3 — geo coordinates
geo_schema = StructType([
    StructField("lat", DoubleType(), True),
    StructField("lon", DoubleType(), True)
])

# Level 2 — address includes geo
address_schema = StructType([
    StructField("street", StringType(), True),
    StructField("city",   StringType(), True),
    StructField("geo",    geo_schema,   True)  # ← nested struct
])

# Level 1 — customer includes address
customer_schema = StructType([
    StructField("name",    StringType(),    True),
    StructField("address", address_schema,  True)  # ← nested struct
])

# Root schema
root_schema = StructType([
    StructField("order_id",  IntegerType(),   False),
    StructField("customer",  customer_schema, True)  # ← nested struct
])

data = [
    (1001, (("Alice", (("MG Road", "Bengaluru", (12.97, 77.59)))))),
    (1002, (("Bob",   (("5th Ave", "New York",  (40.71, -74.0)))))),
]

df = spark.createDataFrame(data, schema=root_schema)
df.printSchema()

# Access 3 levels deep — chain dot notation
df.select(
    "order_id",
    col("customer.name"),
    col("customer.address.city"),
    col("customer.address.geo.lat").alias("latitude"),
    col("customer.address.geo.lon").alias("longitude")
).show()
# +--------+-----+---------+---------+----------+
# |order_id| name|     city| latitude| longitude|
# +--------+-----+---------+---------+----------+
# |    1001|Alice|Bengaluru|    12.97|     77.59|
# |    1002|  Bob| New York|    40.71|    -74.0 |
# +--------+-----+---------+---------+----------+
Flattening Nested Structs
To flatten a nested struct into individual columns, use col("struct.*") to expand all fields, or select each field individually. This is called struct flattening and is covered in depth in Module 11.
6.12

Nested Arrays

Arrays can contain structs (array of objects), and structs can contain arrays (object with list fields). These combinations are extremely common in real-world JSON data.

📦
Arrays of Structs & Structs with Arrays MICRO TOPIC
Pattern 1 — Array of Structs

An ArrayType(StructType(...)) is an array where each element is a structured object. This is how order line items, tags with metadata, and product variants are typically stored.

python — ArrayType of StructType
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, DoubleType
from pyspark.sql.functions import col, explode

# Each order has multiple items — items is Array of Structs
item_schema = StructType([
    StructField("item_name", StringType(),  True),
    StructField("quantity",  IntegerType(), True),
    StructField("price",     DoubleType(),  True)
])

order_schema = StructType([
    StructField("order_id", IntegerType(),              False),
    StructField("customer", StringType(),               False),
    StructField("items",    ArrayType(item_schema),      True)  # ← Array of Structs
])

data = [
    (101, "Alice", [
        ("Laptop", 1, 89999.0),
        ("Mouse",  2, 499.0),
        ("Bag",    1, 1299.0)
    ]),
    (102, "Bob", [
        ("Phone", 1, 24999.0)
    ])
]

df = spark.createDataFrame(data, schema=order_schema)
df.printSchema()
# |-- items: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- item_name: string (nullable = true)
# |    |    |-- quantity: integer (nullable = true)
# |    |    |-- price: double (nullable = true)

# Explode array of structs — each item becomes its own row
df_exploded = df.select(
    "order_id",
    "customer",
    explode(col("items")).alias("item")
)

# Access struct fields on the exploded column
df_flat = df_exploded.select(
    "order_id",
    "customer",
    col("item.item_name"),
    col("item.quantity"),
    col("item.price")
)
df_flat.show()
# +--------+--------+---------+--------+-------+
# |order_id|customer|item_name|quantity|  price|
# +--------+--------+---------+--------+-------+
# |     101|   Alice|   Laptop|       1|89999.0|
# |     101|   Alice|    Mouse|       2|  499.0|
# |     101|   Alice|      Bag|       1| 1299.0|
# |     102|     Bob|    Phone|       1|24999.0|
# +--------+--------+---------+--------+-------+
Pattern 2 — Array of Arrays
python — ArrayType of ArrayType
from pyspark.sql.functions import flatten

# Matrix-like structure: array of arrays of integers
matrix_schema = StructType([
    StructField("id",     IntegerType(),                          False),
    StructField("matrix", ArrayType(ArrayType(IntegerType())),      True)
])

data = [
    (1, [[1,2,3], [4,5,6], [7,8,9]]),
    (2, [[10,20], [30,40]])
]

df_matrix = spark.createDataFrame(data, schema=matrix_schema)
df_matrix.show(truncate=False)

# Flatten array of arrays into a single array
df_flat2 = df_matrix.withColumn("flat", flatten(col("matrix")))
df_flat2.show(truncate=False)
# +---+-------------------+--------------------+
# | id|             matrix|                flat|
# +---+-------------------+--------------------+
# |  1|[[1,2,3],[4,5,6],..|  [1, 2, 3, 4, 5, ..|
# |  2|     [[10,20],[30,..|     [10, 20, 30, 40]|
# +---+-------------------+--------------------+
6.13

Nested Maps

Maps can nest inside structs, and maps can contain arrays or other maps as values. These patterns appear in flexible, schema-less event data.

🗺️
Maps with Complex Values — Maps inside Structs MICRO TOPIC
Map with Array Values

A map can have values that are arrays — useful for storing multiple values per key, like a user's favorite items per category.

python — MapType with ArrayType values
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, MapType, ArrayType
from pyspark.sql.functions import col, map_keys, map_values, explode

# Map: category → list of product names
schema = StructType([
    StructField("user_id",    IntegerType(), False),
    StructField("preferences", MapType(StringType(), ArrayType(StringType())), True)
])

data = [
    (1, {
        "electronics": ["Laptop", "Phone", "Tablet"],
        "books":       ["Clean Code", "DDIA"],
        "sports":      ["Cricket Bat"]
    }),
    (2, {
        "electronics": ["Headphones"],
        "fashion":     ["Sneakers", "Jacket"]
    })
]

df = spark.createDataFrame(data, schema=schema)
df.printSchema()
# |-- preferences: map (nullable = true)
# |    |-- key: string
# |    |-- value: array (valueContainsNull = true)
# |    |    |-- element: string (containsNull = true)

df.show(truncate=False)

# Get electronics preferences for each user
df.withColumn("elec_prefs", col("preferences")["electronics"]).show(truncate=False)

# Explode the map first, then explode the array
df_step1 = df.select("user_id", explode(col("preferences")).alias("category", "products"))
df_step2 = df_step1.select("user_id", "category", explode(col("products")).alias("product"))
df_step2.show()
# +-------+------------+------------+
# |user_id|    category|     product|
# +-------+------------+------------+
# |      1|electronics |      Laptop|
# |      1|electronics |       Phone|
# |      1|       books|  Clean Code|
# ...                              |
Map with Struct Values
python — MapType with StructType values
# Map: region_code → {sales, units} struct
metric_schema = StructType([
    StructField("sales", DoubleType(),  True),
    StructField("units", IntegerType(), True)
])

schema2 = StructType([
    StructField("product",  StringType(), False),
    StructField("by_region", MapType(StringType(), metric_schema), True)
])

data2 = [
    ("Laptop", {
        "IN": (1200000.0, 300),
        "US": (850000.0,  80)
    }),
]
df2 = spark.createDataFrame(data2, schema=schema2)
df2.printSchema()
# |-- by_region: map (nullable = true)
# |    |-- key: string
# |    |-- value: struct (valueContainsNull = true)
# |    |    |-- sales: double (nullable = true)
# |    |    |-- units: integer (nullable = true)

# Access nested map → struct field
df2.withColumn("india_sales", col("by_region")["IN"]["sales"]).show()
Real World Pattern
Event tracking systems often use MapType(StringType(), StringType()) for flexible event properties — where different events have different keys. This avoids schema explosion while maintaining the structure.
6.14

Type Casting & Conversions

Changing a column from one type to another. This happens constantly in real data work — string IDs that should be integers, timestamps stored as strings, prices stored as text.

🔄
Casting Types — .cast() and DDL String Shortcuts CONVERSION
The .cast() Method

Use .cast() on a column to convert it to a new type. You can pass either a type class instance or a DDL string (shorter).

python — all casting methods
from pyspark.sql.functions import col, to_date, to_timestamp
from pyspark.sql.types import IntegerType, DoubleType, LongType, BooleanType, StringType

df_raw = spark.createDataFrame([
    ("101", "45.99", "1",  "2024-01-15", "2024-01-15 10:30:00"),
    ("102", "12.50", "0",  "2024-02-20", "2024-02-20 14:22:00"),
], ["id_str", "price_str", "flag_str", "date_str", "ts_str"])

# Method 1: Pass type class instance
df_cast = df_raw.withColumn("id_int",   col("id_str").cast(IntegerType()))
df_cast = df_cast.withColumn("price_dbl", col("price_str").cast(DoubleType()))

# Method 2: DDL string shorthand (cleaner, same result)
df_cast = df_cast.withColumn("flag_bool", col("flag_str").cast("boolean"))
df_cast = df_cast.withColumn("id_long",   col("id_str").cast("long"))

# Method 3: Dates and timestamps (use functions, not cast)
df_cast = df_cast.withColumn("order_date", to_date(col("date_str"), "yyyy-MM-dd"))
df_cast = df_cast.withColumn("order_ts",   to_timestamp(col("ts_str"), "yyyy-MM-dd HH:mm:ss"))

df_cast.printSchema()
df_cast.show()
DDL String Type Aliases — Quick Reference
DDL StringType ClassNotes
"string"StringType()Most common
"int" or "integer"IntegerType()4-byte int
"long" or "bigint"LongType()8-byte int
"double"DoubleType()8-byte float
"float"FloatType()4-byte float
"boolean"BooleanType()"1"/"true"/"yes" → true
"decimal(p,s)"DecimalType(p,s)Exact decimal
"date"DateType()Use to_date() instead
"timestamp"TimestampType()Use to_timestamp() instead
What Happens When Casting Fails?
python — cast failure behavior
# What happens when a string can't be cast to int?
df_bad = spark.createDataFrame([
    ("123",),
    ("abc",),    # ← can't cast to int
    ("45.6",),  # ← can't cast to int (has decimal)
    (None,)
], ["val"])

df_result = df_bad.withColumn("val_int", col("val").cast("int"))
df_result.show()
# +-----+-------+
# |  val|val_int|
# +-----+-------+
# |  123|    123|  ← OK
# |  abc|   null|  ← cast failed → NULL (no error!)
# | 45.6|   null|  ← cast failed → NULL
# | null|   null|  ← null stays null
# +-----+-------+

# IMPORTANT: Spark does NOT throw an error on failed casts!
# It silently returns null. Always validate after casting.
Silent Failure — Critical to Know
Spark's cast() never throws an error on bad values — it returns null instead. This means if you cast a column with bad data, you'll lose values silently. Always count nulls before and after casting to detect data quality issues.
6.15

Quick Reference & Quiz

Complete type reference cheatsheet and 5 interview-style questions to test your understanding of Spark data types.

📌
Module 6 — Complete Cheatsheet REFERENCE
All Spark Types at a Glance
TypeDDLPythonKey Use Case
StringType()STRINGstrNames, emails, codes, text
IntegerType()INTint (≤2.1B)Counts, IDs, ages
LongType()BIGINTint (large)Unix timestamps, huge IDs
FloatType()FLOATfloat32ML scores, approximations
DoubleType()DOUBLEfloat64Scientific values, rates
DecimalType(p,s)DECIMAL(p,s)DecimalMoney, prices, taxes
BooleanType()BOOLEANboolFlags, conditions
DateType()DATEdateCalendar dates (no time)
TimestampType()TIMESTAMPdatetimeEvent times, log times
ArrayType(T)ARRAY<T>listTags, items, scores
MapType(K,V)MAP<K,V>dictProperties, attributes
StructType()STRUCT<...>tuple/RowNested objects, sub-rows
Type Selection Decision Tree
Is it text?
→ Yes → StringType()
Is it a whole number?
→ Fits in 2.1 billion? → IntegerType()
→ Larger? → LongType()
Is it a decimal?
→ Money/Finance? → DecimalType(p,s)
→ Approx ok? → DoubleType()
Is it a date?
→ Date only? → DateType()
→ Date + time? → TimestampType()
Is it a list? → ArrayType(elementType)
Is it key-value pairs? → MapType(keyType, valueType)
Is it a nested object? → StructType([StructField...])
🧠
Module 6 — Quiz (5 Questions) QUIZ
Q1: Type for Financial Values

You are building a payment system and need to store transaction amounts like ₹89,999.99. Which type should you use?

A) DoubleType() — it handles decimals
B) FloatType() — smaller memory footprint
C) DecimalType(10, 2) — exact precision for money
D) StringType() — to preserve formatting
Correct! DecimalType(10,2) gives exact decimal arithmetic — no binary approximation errors. DoubleType and FloatType use binary floating point which cannot represent all decimals exactly (e.g. 0.1 + 0.2 ≠ 0.3).
Q2: Cast Failure Behavior

You cast a column with values ["123", "abc", "456"] to IntegerType. What does Spark return for "abc"?

A) Throws AnalysisException
B) Throws RuntimeException
C) Returns null silently
D) Drops the row
Correct! Spark returns null for failed casts — it never throws an error. This is a common source of silent data loss bugs. Always validate by counting nulls after a cast.
Q3: Array vs Map

A user has multiple phone numbers: ["98765", "91234"]. Another column stores their language preferences with proficiency: {"english": "native", "hindi": "fluent"}. What types would you use?

A) ArrayType(StringType()) and MapType(StringType(), StringType())
B) MapType for both
C) ArrayType for both
D) StructType for both
Correct! A list of phone numbers with no key → ArrayType. A mapping of language → proficiency → MapType. Array is ordered, Map has key-value pairs.
Q4: LongType Necessity

A column stores Unix timestamps in milliseconds (e.g. 1700000000000 = Nov 2023). Why can't you use IntegerType?

A) IntegerType doesn't support negative numbers
B) 1,700,000,000,000 exceeds IntegerType's max of ~2.1 billion
C) Timestamps must always use TimestampType
D) IntegerType can't store millisecond precision
Correct! IntegerType max is 2,147,483,647. A Unix ms timestamp for 2023 is ~1,700,000,000,000 — about 800x larger. This would overflow IntegerType and return null, causing data loss.
Q5: Accessing Nested Struct

A DataFrame has a struct column order containing a nested struct customer with a field name. How do you select that field?

A) col("order['customer']['name']")
B) col("order").customer.name
C) col("order.customer.name")
D) getField("order").getField("customer").getField("name")
Correct! Dot notation is the standard way to access nested struct fields: col("order.customer.name"). You can chain as many levels as needed. The bracket notation is for maps and arrays, not structs.
Module 6 Complete ✓
You've covered all Spark data types: 9 primitive types (String, Integer, Long, Float, Double, Decimal, Boolean, Date, Timestamp) and 3 complex types (Array, Map, Struct) plus their nested combinations. Next up: Module 7 — DataFrame Transformations — the most important module in the series.