PySpark Data Filtration¶
Today's post covers the following:
- Filtering by column value (one or multiple conditions)
- String-based filtering using like / contains
- Missing-data (null) filtering
- List-based filtering using isin
- General data cleaning operations
Basic Filtering¶
You can refer to columns using any of these notations: df.age, df['age'], or col('age').
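All of the snippets below assume a DataFrame with name, age, department, email, and country columns. A minimal sketch to make them runnable (the rows here are hypothetical):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filtering-demo").getOrCreate()
# Hypothetical sample rows matching the columns used throughout this post
data = [
    ("Alice", 34, "Engineering", "alice@example.com", "US"),
    ("Bob", 28, "Finance", None, None),
    ("Eve", 41, "Marketing", "eve@example.com", "UK"),
]
df = spark.createDataFrame(data, ["name", "age", "department", "email", "country"])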
# Filter on >, <, >=, <=, == condition
df_filtered = df.filter(df.age > 30)
df_filtered = df.filter(df['age'] > 30)
# Using col() function
from pyspark.sql.functions import col
df_filtered = df.filter(col("age") > 30)
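Note that where() is an alias for filter(), so the following is equivalent:
# where() is just an alias of filter()
df_filtered = df.where(col("age") > 30)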
Filter with Multiple Conditions¶
Each condition must be wrapped in parentheses, because & and | bind more tightly than comparison operators like > and == in Python.
# AND condition ( & )
df_filtered = df.filter((df.age > 25) & (df.department == "Engineering"))
# OR condition ( | )
df_filtered = df.filter((df.age < 30) | (df.department == "Finance"))
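Conditions can also be negated with ~, and all three operators combine freely; a small sketch:
# NOT condition ( ~ ) combined with AND
df_filtered = df.filter((df.age > 25) & ~(df.department == "Finance"))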
String Filters¶
# Filter rows where department equals 'Marketing'
df_filtered = df.filter(df.department == "Marketing")
# SQL LIKE pattern match (case-sensitive; % matches any sequence of characters)
df_filtered = df.filter(col("department").like("Market%"))
# Contains a substring
df_filtered = df.filter(col("department").contains("Engineer"))
# Filter rows where the name starts with 'A'
df.filter(col("name").startswith("A")).show()
# Filter rows where the name ends with 'e'
df.filter(col("name").endswith("e")).show()
# Filter rows where the name matches a regex
df.filter(col("name").rlike("^A.*")).show()
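Note that like, contains, and rlike are all case-sensitive. A common workaround, sketched below, is to lowercase the column before matching:
# Case-insensitive substring match by lowercasing first
from pyspark.sql.functions import lower
df.filter(lower(col("department")).contains("engineer")).show()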
Null Filters¶
# Filter rows where a column is null
df_filtered = df.filter(df.department.isNull())
# Filter rows where a column is not null
df_filtered = df.filter(df.department.isNotNull())
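These checks compose with the boolean operators from earlier; for example, keeping only rows with complete contact details (a sketch):
# Keep rows where both email and country are present
df_filtered = df.filter(col("email").isNotNull() & col("country").isNotNull())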
Filter from a List¶
# Filter rows where department is in a list
departments = ["Engineering", "Finance"]
df_filtered = df.filter(col("department").isin(departments))
# Negate the filter (not in list)
df_filtered = df.filter(~col("department").isin(departments))
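isin also accepts the values directly as arguments, so building a list first is optional:
# Same filter, passing values directly
df_filtered = df.filter(col("department").isin("Engineering", "Finance"))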
Data Cleaning¶
# 1. Drop all fully duplicate rows
# Removes rows where all columns match exactly
df = df.dropDuplicates()
# 2. Drop duplicates based on specific columns
# Keeps one row per unique email (which row survives is not guaranteed)
df = df.dropDuplicates(["email"])
# 3. Get only distinct rows (same as SELECT DISTINCT)
# Removes duplicates across all columns
df = df.distinct()
# 4. Drop rows with any null values
# Removes rows with even a single null field
df = df.dropna()
# 5. Drop rows with nulls in specific columns
# Only keeps rows where 'email' and 'age' are not null
df = df.dropna(subset=["email", "age"])
# 6. Fill missing values for all columns
# Replaces nulls with "N/A" in string columns only (the fill value's type must match the column type)
df = df.fillna("N/A")
# 7. Fill missing values for specific columns
# Sets default age as 0 and country as "Unknown" if missing
df = df.fillna({"age": 0, "country": "Unknown"})
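In practice these cleaning steps are usually chained. A sketch of a small pipeline over the hypothetical df defined at the top of the post:
# Deduplicate on email, require a non-null email, then backfill defaults
df_clean = (
    df.dropDuplicates(["email"])
      .dropna(subset=["email"])
      .fillna({"age": 0, "country": "Unknown"})
)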