What is a DataFrame?
DataFrames are the primary way you work with data in PySpark. A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database, a spreadsheet, or a pandas DataFrame, but it is partitioned across the machines of a cluster and can process billions of rows.
Every DataFrame has a schema (the column names and their data types) and consists of rows. Conceptually, you can think of a DataFrame as an RDD of Row objects with schema information attached; internally, Spark represents it as a query plan that the Catalyst optimizer compiles down to RDD operations.
| Feature | RDD | DataFrame |
|---|---|---|
| Schema | No schema (untyped) | Named columns + types |
| Optimization | No optimizer | Catalyst + Tungsten |
| Performance | Slower | Much faster |
| API | Functional (map, filter) | SQL-like (select, filter, groupBy) |
| Use case | Low-level control, custom logic | Most real-world ETL work |
| Data model | Arbitrary Python objects | Row objects / SQL expressions |
Creating DataFrames from a Python List
The simplest way to create a DataFrame is to pass a Python list of tuples or Row objects directly to spark.createDataFrame(). This is ideal for learning and quick testing.
Pass a list of tuples and a list of column names. Spark infers the types automatically.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Module5").getOrCreate()
# List of tuples — each tuple is one row
data = [
("Alice", 30, "Mumbai"),
("Bob", 25, "Delhi"),
("Carol", 35, "Pune"),
]
# Column names provided as second argument
df = spark.createDataFrame(data, ["name", "age", "city"])
df.show()
# +-----+---+------+
# | name|age|  city|
# +-----+---+------+
# |Alice| 30|Mumbai|
# |  Bob| 25| Delhi|
# |Carol| 35|  Pune|
# +-----+---+------+
df.printSchema()
# root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
# |-- city: string (nullable = true)
Note: Spark infers a Python int as LongType (64-bit), not IntegerType (32-bit). Python str → StringType, float → DoubleType, bool → BooleanType.
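The distinction matters because the two types cover different ranges. As a minimal pure-Python sketch (the helper name `fits_in_integer_type` is hypothetical, not part of PySpark), you can check whether your values would fit in a 32-bit IntegerType before declaring one in an explicit schema:

```python
# Range of a signed 32-bit integer, which is what IntegerType can hold.
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def fits_in_integer_type(values):
    """Return True if every non-None value fits in a signed 32-bit integer."""
    return all(INT32_MIN <= v <= INT32_MAX for v in values if v is not None)


ages = [30, 25, 35]
view_counts = [1_200, 3_000_000_000]  # the second value exceeds the 32-bit range

print(fits_in_integer_type(ages))         # True
print(fits_in_integer_type(view_counts))  # False
```

If the check fails, LongType is the safer choice for that column.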
Using Row objects gives you named fields directly. More readable for complex data.
from pyspark.sql import Row
# Row objects with named fields
data = [
Row(name="Alice", age=30, salary=75000.0),
Row(name="Bob", age=25, salary=60000.0),
Row(name="Carol", age=35, salary=90000.0),
]
df = spark.createDataFrame(data)
df.show()
# +-----+---+-------+
# | name|age| salary|
# +-----+---+-------+
# |Alice| 30|75000.0|
# |  Bob| 25|60000.0|
# |Carol| 35|90000.0|
# +-----+---+-------+
Python dicts can also be passed. Each dict key becomes a column name.
# List of dicts
data = [
{"name": "Alice", "dept": "Engineering", "active": True},
{"name": "Bob", "dept": "Marketing", "active": False},
]
df = spark.createDataFrame(data)
df.show()
# +------+-----------+-----+
# |active|       dept| name|
# +------+-----------+-----+
# |  true|Engineering|Alice|
# | false|  Marketing|  Bob|
# +------+-----------+-----+
# Note: columns are sorted alphabetically when using dicts!
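If the column order matters, one option is to convert the dicts to tuples yourself and pass the column names explicitly. Here is a minimal pure-Python sketch; the `columns` list and the `dicts_to_rows` helper are illustrative names, not part of PySpark:

```python
def dicts_to_rows(records, columns):
    """Turn a list of dicts into tuples ordered by `columns`.

    Keys missing from a dict become None (which Spark would load as NULL).
    """
    return [tuple(record.get(c) for c in columns) for record in records]


data = [
    {"name": "Alice", "dept": "Engineering", "active": True},
    {"name": "Bob", "dept": "Marketing", "active": False},
]
columns = ["name", "dept", "active"]  # the order you want, not alphabetical

rows = dicts_to_rows(data, columns)
print(rows)
# [('Alice', 'Engineering', True), ('Bob', 'Marketing', False)]

# With a SparkSession available, the result can then be passed as:
# df = spark.createDataFrame(rows, columns)
```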
# Python None becomes SQL NULL in Spark
data = [
("Alice", 30, "Mumbai"),
("Bob", None, None), # age and city are NULL
]
df = spark.createDataFrame(data, ["name", "age", "city"])
df.show()
# +-----+----+------+
# | name| age|  city|
# +-----+----+------+
# |Alice|  30|Mumbai|
# |  Bob|null|  null|
# +-----+----+------+
Creating DataFrames from an RDD
You can convert an existing RDD into a DataFrame by attaching a schema. This is the bridge between the low-level RDD world and the high-level DataFrame world.
The quickest way. Call .toDF() on an RDD of tuples and pass column names.
# Start with a plain RDD
rdd = spark.sparkContext.parallelize([
("Alice", 30, "Mumbai"),
("Bob", 25, "Delhi"),
])
# Convert to DataFrame with column names
df = rdd.toDF(["name", "age", "city"])
df.show()
df.printSchema()
# root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
# |-- city: string (nullable = true)
More explicit — you can pass a full StructType schema so Spark doesn't need to infer types.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
rdd = spark.sparkContext.parallelize([
("Alice", 30),
("Bob", 25),
])
schema = StructType([
StructField("name", StringType(), nullable=True),
StructField("age", IntegerType(), nullable=False),
])
df = spark.createDataFrame(rdd, schema)
df.printSchema()
# root
# |-- name: string (nullable = true)
# |-- age: integer (nullable = false)
With toDF(), Spark infers types from Python types. With an explicit StructType schema, you control the exact types: IntegerType instead of LongType, nullable=False constraints, and so on.
from pyspark.sql import Row
# RDD of Row objects — schema inferred from field names
rdd = spark.sparkContext.parallelize([
Row(name="Alice", score=95.5),
Row(name="Bob", score=87.0),
])
df = rdd.toDF() # No column names needed — Row already has them
df.show()
# +-----+-----+
# | name|score|
# +-----+-----+
# |Alice| 95.5|
# |  Bob| 87.0|
# +-----+-----+
Creating DataFrames from CSV
CSV (Comma-Separated Values) is one of the most common file formats. Spark can read a single CSV file, a folder of CSVs, or even compressed CSVs — all with one read call.
# Simplest form — all options are defaults
df = spark.read.csv("path/to/employees.csv")
# With options using option() chaining
df = (spark.read
.option("header", "true") # first row = column names
.option("inferSchema", "true") # auto-detect types
.csv("path/to/employees.csv"))
df.show(5)
df.printSchema()
| Option | Default | What It Does | Example |
|---|---|---|---|
| header | false | Use first row as column names | "true" |
| inferSchema | false | Auto-detect column data types | "true" |
| sep / delimiter | , | Field separator character | "\|", "\t" |
| quote | " | Quote character for fields | "'" |
| escape | \ | Escape character inside quoted fields | "\\" |
| nullValue | "" | String to treat as null | "NA", "NULL" |
| dateFormat | yyyy-MM-dd | Date parsing format | "dd/MM/yyyy" |
| multiLine | false | Allow newlines inside quoted fields | "true" |
| encoding | UTF-8 | File character encoding | "ISO-8859-1" |
| mode | PERMISSIVE | How to handle corrupt records | "DROPMALFORMED" |
# Pipe-delimited file with custom null value
df = (spark.read
.option("header", True)
.option("sep", "|")
.option("nullValue", "NA")
.option("inferSchema", True)
.csv("sales_data.csv"))
# Tab-separated (TSV) file
df = (spark.read
.option("header", True)
.option("sep", "\t")
.csv("data.tsv"))
inferSchema=true scans the entire file to detect types — slow on large files. In production, always provide an explicit schema. This is faster and ensures correct types.
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import StringType, IntegerType, DoubleType, DateType
schema = StructType([
StructField("emp_id", IntegerType(), nullable=False),
StructField("name", StringType(), nullable=True),
StructField("salary", DoubleType(), nullable=True),
StructField("join_date", DateType(), nullable=True),
])
df = (spark.read
.schema(schema)
.option("header", True)
.option("dateFormat", "yyyy-MM-dd")
.csv("employees.csv"))
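df.printSchema()
# root
# |-- emp_id: integer (nullable = true)
# |-- name: string (nullable = true)
# |-- salary: double (nullable = true)
# |-- join_date: date (nullable = true)
# Note: file-based sources report every column as nullable = true,
# even when the supplied schema declares nullable=False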
# Read an entire folder of CSVs as one DataFrame
df = spark.read.option("header", True).csv("s3://bucket/data/2024/")
# Read specific files
df = spark.read.csv(["jan.csv", "feb.csv", "mar.csv"])
# Wildcard pattern
df = spark.read.csv("data/sales_202*.csv")
For example, if daily files are named sales_20240101.csv, sales_20240102.csv, etc., you can read them all in one shot with a wildcard: spark.read.csv("s3://bucket/sales/sales_2024*.csv")
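To preview which file names such a pattern would pick up, Python's standard fnmatch module is a convenient approximation. This is only a sketch: Spark resolves wildcards through Hadoop's glob matching on the file system, which agrees with fnmatch for a simple `*` inside a file name but is not identical in general. The file names below are made up for illustration.

```python
from fnmatch import fnmatch

# Hypothetical directory listing
file_names = [
    "sales_20231231.csv",
    "sales_20240101.csv",
    "sales_20240102.csv",
    "returns_20240101.csv",
]

pattern = "sales_2024*.csv"
matched = [name for name in file_names if fnmatch(name, pattern)]
print(matched)
# ['sales_20240101.csv', 'sales_20240102.csv']
```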
Creating DataFrames from JSON
JSON is the most common format for APIs and semi-structured data. Spark handles flat JSON, nested JSON, and multi-line JSON files seamlessly.
By default, Spark expects one JSON object per line (newline-delimited JSON / NDJSON). This is the most efficient format for large datasets.
# employees.json — one JSON object per line:
# {"emp_id": 1, "name": "Alice", "dept": "Eng"}
# {"emp_id": 2, "name": "Bob", "dept": "HR"}
# {"emp_id": 3, "name": "Carol", "dept": "Eng"}
df = spark.read.json("employees.json")
df.show()
# +----+------+-----+
# |dept|emp_id| name|
# +----+------+-----+
# | Eng|     1|Alice|
# |  HR|     2|  Bob|
# | Eng|     3|Carol|
# +----+------+-----+
df.printSchema()
# root
# |-- dept: string (nullable = true)
# |-- emp_id: long (nullable = true)
# |-- name: string (nullable = true)
When your JSON file is a single large JSON object or array (not one-object-per-line), use multiLine=true.
# employees_array.json (a JSON array spanning multiple lines):
# [
# {"emp_id": 1, "name": "Alice"},
# {"emp_id": 2, "name": "Bob"}
# ]
df = (spark.read
.option("multiLine", True)
.json("employees_array.json"))
A file read with multiLine=true cannot be split for parallel reading: Spark must read each whole file in a single task. Avoid this for very large files. Prefer newline-delimited JSON for big data.
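If you control the file, converting a JSON array to newline-delimited JSON sidesteps the problem. Here is a minimal sketch using only Python's standard library. The function name is hypothetical, and it loads the whole array into memory, so it only suits files that fit on one machine:

```python
import json
import os
import tempfile


def json_array_to_ndjson(src_path, dst_path):
    """Rewrite a file holding one JSON array as one JSON object per line."""
    with open(src_path, encoding="utf-8") as src:
        records = json.load(src)  # reads the entire array into memory
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            dst.write(json.dumps(record) + "\n")
    return len(records)


# Demo in a temporary directory
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "employees_array.json")
dst = os.path.join(workdir, "employees.json")

with open(src, "w", encoding="utf-8") as f:
    f.write('[\n  {"emp_id": 1, "name": "Alice"},\n  {"emp_id": 2, "name": "Bob"}\n]')

count = json_array_to_ndjson(src, dst)
with open(dst, encoding="utf-8") as f:
    lines = f.read().splitlines()

print(count)     # 2
print(lines[0])  # {"emp_id": 1, "name": "Alice"}
```

The converted file can then be read with a plain spark.read.json(...) call, without the multiLine option.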
Spark automatically converts nested JSON objects into StructType columns. You access nested fields using dot notation.
# orders.json — nested structure:
# {"order_id": 1, "customer": {"name": "Alice", "city": "Mumbai"}, "amount": 500}
# {"order_id": 2, "customer": {"name": "Bob", "city": "Delhi"}, "amount": 300}
df = spark.read.json("orders.json")
df.printSchema()
# root
# |-- amount: long (nullable = true)
# |-- customer: struct (nullable = true)
# |    |-- city: string (nullable = true)
# |    |-- name: string (nullable = true)
# |-- order_id: long (nullable = true)
# Access nested field using dot notation
df.select("order_id", "customer.name", "customer.city").show()
# +--------+-----+------+
# |order_id| name|  city|
# +--------+-----+------+
# |       1|Alice|Mumbai|
# |       2|  Bob| Delhi|
# +--------+-----+------+
from pyspark.sql.types import StructType, StructField, StringType, LongType
# Explicit schema for JSON (avoids the schema-inference scan)
# The schema must mirror the JSON nesting: name lives inside customer
schema = StructType([
    StructField("order_id", LongType(), True),
    StructField("customer", StructType([
        StructField("name", StringType(), True),
    ]), True),
    StructField("amount", LongType(), True),
])
df = spark.read.schema(schema).json("orders.json")
# JSON fields not in the schema (here customer.city) are silently dropped
# Schema fields missing from a record become null
Creating DataFrames from Parquet & Delta
Parquet and Delta are the de facto standard formats for big data storage. Both are columnar, compressed, and schema-aware, which makes reads much faster than CSV or JSON, especially when a query needs only some of the columns.
Parquet files store their schema inside the file itself — so you never need inferSchema or a header. Spark reads the schema from file metadata instantly.
# Read a single Parquet file
df = spark.read.parquet("employees.parquet")
# Read a partitioned Parquet directory (most common)
# e.g., data/year=2024/month=01/part-00000.parquet
df = spark.read.parquet("s3://bucket/data/employees/")
# Spark uses partition pruning automatically
# e.g., if you filter year=2024, Spark only reads that folder
df.filter("year = 2024").show()
df.printSchema() # Schema comes from Parquet metadata — no scan needed!
# Write DataFrame as Parquet
df.write.parquet("output/employees/")
# Write with partitioning (creates folder per value)
df.write.partitionBy("dept").parquet("output/employees_by_dept/")
# Creates: output/employees_by_dept/dept=Eng/
# output/employees_by_dept/dept=HR/
# Overwrite existing data
df.write.mode("overwrite").parquet("output/employees/")
Delta Lake stores data as Parquet files and adds a transaction log on top. Reading works much like Parquet, but you also get time travel, schema evolution, and ACID guarantees. Module 20 covers it in depth; here are the basics.
# Read a Delta table by path
df = spark.read.format("delta").load("s3://bucket/delta/employees/")
# Read a Delta table registered in the catalog
df = spark.read.table("employees")
# Time travel — read data as it was at a specific version
df = (spark.read
.format("delta")
.option("versionAsOf", 3)
.load("s3://bucket/delta/employees/"))
# Time travel — read as of a specific timestamp
df = (spark.read
.format("delta")
.option("timestampAsOf", "2024-01-01")
.load("s3://bucket/delta/employees/"))
| Feature | CSV | JSON | Parquet | Delta |
|---|---|---|---|---|
| Schema in file | No | No | Yes | Yes |
| Columnar storage | No | No | Yes | Yes |
| Compression | Basic | Basic | Excellent | Excellent |
| ACID transactions | No | No | No | Yes |
| Time travel | No | No | No | Yes |
| Read speed | Slow | Medium | Fast | Fast |
| Human readable | Yes | Yes | No (binary) | No (binary) |
Schema — StructType & StructField
The schema defines the structure of your DataFrame: column names, types, and nullability. Mastering StructType and StructField is essential for production PySpark code.
StructType is the schema of a DataFrame (or a nested column). It is a container that holds a list of StructField objects — one per column.
from pyspark.sql.types import (
StructType, StructField,
StringType, IntegerType, DoubleType,
BooleanType, DateType, TimestampType,
LongType, ArrayType, MapType
)
# A StructType holds a list of StructFields
schema = StructType([
StructField("emp_id", IntegerType(), nullable=False),
StructField("name", StringType(), nullable=True),
StructField("salary", DoubleType(), nullable=True),
StructField("is_active", BooleanType(), nullable=True),
StructField("join_date", DateType(), nullable=True),
])
# Use this schema when reading data
df = spark.read.schema(schema).csv("employees.csv", header=True)
StructField(name, dataType, nullable, metadata) defines one column:
— name: column name (string)
— dataType: type class like StringType(), IntegerType()
— nullable: can this column contain null? (True/False)
— metadata: optional dict of extra info (rarely used)
# Full StructField signature
StructField(
name="salary",
dataType=DoubleType(),
nullable=True,
metadata={"description": "Annual salary in USD", "unit": "USD"}
)
# Access StructField properties
field = schema[0] # First field
print(field.name) # "emp_id"
print(field.dataType) # IntegerType()
print(field.nullable) # False
print(field.metadata) # {}
For JSON-like nested data, a column's dataType can itself be another StructType.
# Nested schema: address is a struct inside employee
address_schema = StructType([
StructField("street", StringType(), True),
StructField("city", StringType(), True),
StructField("pincode",StringType(), True),
])
employee_schema = StructType([
StructField("emp_id", IntegerType(), False),
StructField("name", StringType(), True),
StructField("address", address_schema, True), # Nested struct!
])
# Sample data matching this schema
data = [(1, "Alice", ("MG Road", "Mumbai", "400001"))]
df = spark.createDataFrame(data, employee_schema)
df.show(truncate=False)
# +------+-----+-------------------------+
# |emp_id|name |address                  |
# +------+-----+-------------------------+
# |1     |Alice|{MG Road, Mumbai, 400001}|
# +------+-----+-------------------------+
# Access nested field
df.select("name", "address.city").show()
# Schema with ArrayType and MapType columns
schema = StructType([
StructField("name", StringType(), True),
StructField("skills", ArrayType(StringType()), True),
StructField("scores", MapType(StringType(), IntegerType()), True),
])
data = [
("Alice", ["Python", "Spark"], {"math": 95, "english": 88}),
("Bob", ["Java", "SQL"], {"math": 70, "english": 80}),
]
df = spark.createDataFrame(data, schema)
df.printSchema()
# root
# |-- name: string (nullable = true)
# |-- skills: array (nullable = true)
# |    |-- element: string (containsNull = true)
# |-- scores: map (nullable = true)
# |    |-- key: string
# |    |-- value: integer (valueContainsNull = true)
# Get schema object
s = df.schema
# Iterate over fields
for field in s.fields:
    print(f"{field.name}: {field.dataType} (nullable={field.nullable})")
# Get list of column names
print(df.columns) # ['emp_id', 'name', 'salary', ...]
# Get list of (name, type) tuples
print(df.dtypes) # [('emp_id', 'int'), ('name', 'string'), ...]
# Schema as JSON string (useful for storing/versioning)
print(df.schema.json())
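# Reconstruct schema from JSON string (parse with json.loads, not eval:
# the text is JSON, so literals like true/false are not valid Python)
import json
schema_json = df.schema.json()
restored = StructType.fromJson(json.loads(schema_json))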
Infer Schema vs Explicit Schema
One of the most important decisions in reading data: should Spark figure out the schema automatically, or should you define it yourself? Both approaches have clear use cases.
When you set inferSchema=true, Spark makes an extra pass over the data to detect column types. By default it examines every row (the samplingRatio option can reduce this), so the file is effectively read twice: once to infer types, once to actually load data.
# inferSchema=True — Spark reads file TWICE
df = (spark.read
.option("header", True)
.option("inferSchema", True)
.csv("employees.csv"))
df.printSchema()
# root
# |-- emp_id: integer (nullable = true) ← Inferred from values
# |-- name: string (nullable = true)
# |-- salary: double (nullable = true)
# |-- join_date: string (nullable = true) ← Dates often inferred as string!
from pyspark.sql.types import (
    StructType, StructField, IntegerType, StringType,
    DecimalType, DateType, BooleanType,
)
# Define the schema yourself for complete control
schema = StructType([
    StructField("emp_id", IntegerType(), False),        # NOT nullable
    StructField("name", StringType(), True),
    StructField("salary", DecimalType(10, 2), True),    # Exact decimal
    StructField("join_date", DateType(), True),         # Proper date
    StructField("is_active", BooleanType(), True),
])
df = (spark.read
.schema(schema)
.option("header", True)
.option("dateFormat", "yyyy-MM-dd")
.csv("employees.csv"))
# Reads file ONCE — faster and more predictable
| Aspect | inferSchema=true | Explicit Schema |
|---|---|---|
| File reads | 2 passes (slow) | 1 pass (fast) |
| Date/Time handling | Often inferred as String | Exact control |
| Column nullability | Always nullable=true | Declared by you (file reads still report nullable=true) |
| Large files (>1GB) | Significant overhead | No overhead |
| Dev convenience | Quick to write | More code needed |
| Production use | Not recommended | Standard approach |
| Schema stability | Can change between runs | Always consistent |
You can also define schemas as a SQL DDL string — much shorter to write:
# DDL string schema — concise alternative to StructType
schema_ddl = "emp_id INT NOT NULL, name STRING, salary DOUBLE, join_date DATE"
df = (spark.read
.schema(schema_ddl)
.option("header", True)
.csv("employees.csv"))
# Equivalent to a four-field StructType: IntegerType (not nullable),
# StringType, DoubleType, DateType
# Great for quick prototyping, still explicit
nullable, metadata & data types
Three micro-topics that are part of every StructField — understanding these makes you write precise, production-quality schemas.
nullable=True means the column can contain NULL values.
nullable=False means the column must not have NULLs.
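Important: Spark treats nullable mainly as an optimization hint and as documentation. Unlike a database constraint, it is not consistently enforced. For example, file-based sources such as CSV, JSON, and Parquet report every column as nullable after a read, whatever the supplied schema says. However, some operations behave differently based on nullable.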
# nullable=False signals "this column is a key / must have a value"
schema = StructType([
StructField("emp_id", IntegerType(), nullable=False), # Primary key
StructField("name", StringType(), nullable=True), # Can be missing
])
# How to check nullability in your DataFrame
for field in df.schema.fields:
    print(f"{field.name}: nullable={field.nullable}")
# A cast changes the type but does not make a column non-nullable
from pyspark.sql.functions import col
df2 = df.withColumn("emp_id", col("emp_id").cast(IntegerType()))
# To get nullable=False, re-create the DataFrame with a schema that declares it
# (assumes df has exactly the columns listed in schema)
df3 = spark.createDataFrame(df.rdd, schema)
metadata is an optional Python dict you can attach to a StructField. It carries extra information about a column — descriptions, units, source systems, etc. It's stored in the schema and travels with the DataFrame.
# Attaching metadata to StructField
schema = StructType([
StructField("emp_id", IntegerType(), False,
metadata={"description": "Unique employee identifier",
"source": "HR_SYSTEM"}),
StructField("salary", DoubleType(), True,
metadata={"description": "Annual gross salary",
"unit": "USD",
"pii": False}),
StructField("ssn", StringType(), True,
metadata={"description": "Social Security Number",
"pii": True, # PII flag!
"encrypted": True}),
])
# Access metadata programmatically — useful for data governance tools
for field in schema.fields:
    if field.metadata.get("pii"):
        print(f"PII column found: {field.name}")
# PII column found: ssn
| Type Class | SQL Name | Size | Range / Notes |
|---|---|---|---|
| ByteType() | TINYINT | 1 byte | -128 to 127 |
| ShortType() | SMALLINT | 2 bytes | -32,768 to 32,767 |
| IntegerType() | INT | 4 bytes | -2,147,483,648 to 2,147,483,647 |
| LongType() | BIGINT | 8 bytes | About ±9.2 × 10^18. Python int → LongType |
| FloatType() | FLOAT | 4 bytes | Single-precision floating point |
| DoubleType() | DOUBLE | 8 bytes | Double-precision floating point. Python float → DoubleType |
| DecimalType(p,s) | DECIMAL(p,s) | Variable | Exact decimal. p=precision (total digits), s=scale (digits after the decimal point). Use for money. |
# DecimalType example — for financial data
# DecimalType(10, 2) = up to 10 digits, 2 decimal places
# e.g., 99999999.99 — suitable for currency
StructField("price", DecimalType(10, 2), True)
StructField("quantity", IntegerType(), True)
StructField("total", DecimalType(12, 2), True)
# Never use DoubleType for money — floating point precision issues!
# 0.1 + 0.2 = 0.30000000000000004 in floating point
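The same effect is easy to reproduce in plain Python, where float is a double-precision value and decimal.Decimal is exact for decimal fractions. (Spark also returns DecimalType values to Python as decimal.Decimal.) A minimal sketch:

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly
float_total = 0.1 + 0.2
print(float_total)         # 0.30000000000000004
print(float_total == 0.3)  # False

# Decimal arithmetic on string literals stays exact
decimal_total = Decimal("0.10") + Decimal("0.20")
print(decimal_total)                     # 0.30
print(decimal_total == Decimal("0.30"))  # True
```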
| Type Class | SQL Name | Notes |
|---|---|---|
| StringType() | STRING | UTF-8 text. Python str → StringType |
| BooleanType() | BOOLEAN | true/false. Python bool → BooleanType |
| BinaryType() | BINARY | Raw bytes (images, encrypted data) |
| Type Class | SQL Name | Notes |
|---|---|---|
| DateType() | DATE | Year-month-day only (no time). e.g., 2024-01-15 |
| TimestampType() | TIMESTAMP | Date + time with microsecond precision; stored as an instant and displayed in the session time zone |
| TimestampNTZType() | TIMESTAMP_NTZ | Timestamp without timezone (Spark 3.4+) |
from pyspark.sql.functions import col, to_date
# Reading dates from string CSV
df = (spark.read
    .option("header", True)
    .option("dateFormat", "dd/MM/yyyy")
    .schema(StructType([
        StructField("name", StringType(), True),
        StructField("join_date", DateType(), True),
    ]))
    .csv("employees.csv"))
# If the column was read as a string instead, convert it manually
df = df.withColumn("join_date", to_date(col("join_date"), "dd/MM/yyyy"))
| Type Class | Description | Example |
|---|---|---|
| ArrayType(elementType) | Ordered list of elements of the same type | ["Python", "Spark", "SQL"] |
| MapType(keyType, valueType) | Key-value pairs (like a Python dict) | {"math": 95, "eng": 88} |
| StructType([fields]) | Nested row / object (like a dict with fixed schema) | {"city": "Mumbai", "pin": "400001"} |
| Python Type | Spark Type Inferred | Recommended Explicit Type |
|---|---|---|
| int | LongType | IntegerType or LongType depending on range |
| float | DoubleType | DoubleType or DecimalType for money |
| str | StringType | StringType |
| bool | BooleanType | BooleanType |
| None | NullType (then inferred) | Provide schema explicitly |
| list | ArrayType | ArrayType(elementType) |
| dict | MapType or StructType | StructType for fixed keys |
| datetime.date | DateType | DateType |
| datetime.datetime | TimestampType | TimestampType |
Exploring DataFrames
Before transforming data, you need to explore it. PySpark provides a rich set of methods to understand the shape, schema, and content of your DataFrames. These are the first commands you run on any new dataset.
show(n, truncate, vertical) — displays the first n rows as a table. This is an action — it triggers computation.
# Default: show 20 rows, truncate long strings at 20 chars
df.show()
# Show 5 rows
df.show(5)
# Show 10 rows, don't truncate long strings
df.show(10, truncate=False)
# Truncate at 50 characters
df.show(5, truncate=50)
# Vertical mode — one column per line (great for wide DataFrames)
df.show(3, vertical=True)
# -RECORD 0------
# emp_id | 1
# name | Alice
# salary | 75000.0
# ...
# Prints schema in a tree format
df.printSchema()
# root
# |-- emp_id: integer (nullable = false)
# |-- name: string (nullable = true)
# |-- address: struct (nullable = true)
# | |-- city: string (nullable = true)
# | |-- pincode: string (nullable = true)
# Get schema as object
s = df.schema # StructType
# Get column names only
cols = df.columns # ['emp_id', 'name', 'address']
# Get (name, type string) tuples
types = df.dtypes # [('emp_id', 'int'), ('name', 'string')]
# Count total rows — triggers a full scan (expensive on large data)
n = df.count()
print(f"Total rows: {n}")
# Check if DataFrame has zero rows
if df.isEmpty():  # Spark 3.3+
    print("No data!")
# describe() — count, mean, stddev, min, max for numeric/string cols
df.describe().show()
# +-------+------+-----+-------+
# |summary|emp_id| name| salary|
# +-------+------+-----+-------+
# |  count|     3|    3|      3|
# |   mean|   2.0| null|75000.0|
# | stddev|   1.0| null|15000.0|
# |    min|     1|Alice|60000.0|
# |    max|     3|Carol|90000.0|
# +-------+------+-----+-------+
# summary() — adds 25%, 50%, 75% percentiles
df.summary().show()
# describe specific columns only
df.describe("salary", "age").show()
# first() — returns first Row as a Python Row object
row = df.first()
print(row) # Row(emp_id=1, name='Alice', salary=75000.0)
print(row.name) # 'Alice'
print(row["name"]) # 'Alice'
print(row[1]) # 'Alice' (by index)
# head(n) — same as first() but returns list of n Rows
rows = df.head(3) # List of 3 Row objects
# take(n) — returns list of n Row objects
rows = df.take(5)
# collect() — returns ALL rows as a Python list (DANGEROUS on big data!)
all_rows = df.collect() # Brings all data to driver — OOM risk!
collect() pulls ALL data from all executors to the driver node. On a DataFrame with billions of rows this will crash your driver with an OutOfMemoryError. Only use collect() on small result DataFrames.
# Standard first-look workflow for any new DataFrame
# 1. Schema — what columns and types?
df.printSchema()
# 2. Row count — how big is this?
print(f"Rows: {df.count()}")
# 3. Column count
print(f"Columns: {len(df.columns)}")
# 4. Sample data
df.show(5, truncate=False)
# 5. Basic statistics
df.describe().show()
# 6. Check for nulls in each column
from pyspark.sql.functions import col, sum as spark_sum, when
null_counts = df.select([
    spark_sum(when(col(c).isNull(), 1).otherwise(0)).alias(c)
    for c in df.columns
])
null_counts.show()
# 7. Check number of partitions
print(f"Partitions: {df.rdd.getNumPartitions()}")
collect() pulls all data to the driver. On 1 billion rows, this causes an OOM crash. Use show(n) or take(n) instead.