PySpark Data Filtration¶
Today's post covers the following:
- Filtering by column value (one or multiple conditions)
- String-based filtering using like / contains
- Missing-data (null) filtering
- List-based filtering using isin
- General data cleaning operations
Basic Filtering¶
You can refer to columns using any of these notations: df.age, df['age'], or col('age').
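All of the snippets below assume a DataFrame with name, age, department, email, and country columns. A minimal sketch to make them runnable (the rows here are hypothetical):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filtering-demo").getOrCreate()
# Hypothetical sample rows matching the columns used throughout this post
data = [
    ("Alice", 34, "Engineering", "alice@example.com", "US"),
    ("Bob", 28, "Finance", None, None),
    ("Eve", 41, "Marketing", "eve@example.com", "UK"),
]
df = spark.createDataFrame(data, ["name", "age", "department", "email", "country"])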
# Filter on >, <, >=, <=, == condition
df_filtered = df.filter(df.age > 30)
df_filtered = df.filter(df['age'] > 30)
# Using col() function
from pyspark.sql.functions import col
df_filtered = df.filter(col("age") > 30)
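Note that where() is an alias for filter(), so the following is equivalent:
# where() is just an alias of filter()
df_filtered = df.where(col("age") > 30)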
Filter with Multiple Conditions¶
Each condition must be wrapped in parentheses, because & and | bind more tightly than comparison operators like > and == in Python.
# AND condition ( & )
df_filtered = df.filter((df.age > 25) & (df.department == "Engineering"))
# OR condition ( | )
df_filtered = df.filter((df.age < 30) | (df.department == "Finance"))
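Conditions can also be negated with ~, and all three operators combine freely; a small sketch:
# NOT condition ( ~ ) combined with AND
df_filtered = df.filter((df.age > 25) & ~(df.department == "Finance"))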
String Filters¶
# Filter rows where department equals 'Marketing'
df_filtered = df.filter(df.department == "Marketing")
# SQL LIKE pattern match (case-sensitive; % matches any sequence of characters)
df_filtered = df.filter(col("department").like("Market%"))
# Contains a substring
df_filtered = df.filter(col("department").contains("Engineer"))
# Filter rows where the name starts with 'A'
df.filter(col("name").startswith("A")).show()
# Filter rows where the name ends with 'e'
df.filter(col("name").endswith("e")).show()
# Filter rows where the name matches a regex
df.filter(col("name").rlike("^A.*")).show()
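Note that like, contains, and rlike are all case-sensitive. A common workaround, sketched below, is to lowercase the column before matching:
# Case-insensitive substring match by lowercasing first
from pyspark.sql.functions import lower
df.filter(lower(col("department")).contains("engineer")).show()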
Null Filters¶
# Filter rows where a column is null
df_filtered = df.filter(df.department.isNull())
# Filter rows where a column is not null
df_filtered = df.filter(df.department.isNotNull())
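These checks compose with the boolean operators from earlier; for example, keeping only rows with complete contact details (a sketch):
# Keep rows where both email and country are present
df_filtered = df.filter(col("email").isNotNull() & col("country").isNotNull())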
Filter from a List¶
# Filter rows where department is in a list
departments = ["Engineering", "Finance"]
df_filtered = df.filter(col("department").isin(departments))
# Negate the filter (not in list)
df_filtered = df.filter(~col("department").isin(departments))
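isin also accepts the values directly as arguments, so building a list first is optional:
# Same filter, passing values directly
df_filtered = df.filter(col("department").isin("Engineering", "Finance"))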
Data Cleaning¶
# 1. Drop all fully duplicate rows
# Removes rows where all columns match exactly
df = df.dropDuplicates()
# 2. Drop duplicates based on specific columns
# Keeps one row per unique email (which row survives is not guaranteed)
df = df.dropDuplicates(["email"])
# 3. Get only distinct rows (same as SELECT DISTINCT)
# Removes duplicates across all columns
df = df.distinct()
# 4. Drop rows with any null values
# Removes rows with even a single null field
df = df.dropna()
# 5. Drop rows with nulls in specific columns
# Only keeps rows where 'email' and 'age' are not null
df = df.dropna(subset=["email", "age"])
# 6. Fill missing values for all columns
# Replaces nulls with "N/A" in string columns only (the fill value's type must match the column type)
df = df.fillna("N/A")
# 7. Fill missing values for specific columns
# Sets default age as 0 and country as "Unknown" if missing
df = df.fillna({"age": 0, "country": "Unknown"})
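In practice these cleaning steps are usually chained. A sketch of a small pipeline over the hypothetical df defined at the top of the post:
# Deduplicate on email, require a non-null email, then backfill defaults
df_clean = (
    df.dropDuplicates(["email"])
      .dropna(subset=["email"])
      .fillna({"age": 0, "country": "Unknown"})
)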