
PySpark Aggregations

Today's post covers the following:

  • Aggregations without grouping
  • Aggregations with grouping
  • Filtering after grouping

Aggregations without grouping

You can refer to columns using any of these notations: df.age, df['age'], or col('age').
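
Every snippet below assumes a running SparkSession, the aggregate functions imported from pyspark.sql.functions, and a DataFrame with Department, Employee, and Salary columns. Here is a minimal setup sketch; the sample data is made up purely for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, count, countDistinct, sum, min, max, avg, collect_list, collect_set
)

spark = SparkSession.builder.getOrCreate()

#Made-up sample data, purely for illustration
#Note: importing sum/min/max from pyspark.sql.functions shadows Python's built-ins
data = [
    ("IT", "Alice", 5000),
    ("IT", "Bob", 4500),
    ("HR", "Carol", 4000),
]
df = spark.createDataFrame(data, ["Department", "Employee", "Salary"])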

#Count rows
df.count()

#Count distinct values in a column
df.select(countDistinct("Department")).show()
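
If you want the number itself rather than a one-row DataFrame, an equivalent approach is to de-duplicate the column first and count the result:

#Equivalent: de-duplicate the column, then count
df.select("Department").distinct().count()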

#Sum
df.select(sum("Salary")).show()

#Multiple aggregations
df.select(min("Salary"), max("Salary")).show()

Aggregations with grouping

#Group by a single column
df.groupBy("Department").sum("Salary").show()

#Group by multiple columns
df.groupBy("Department", "Employee").sum("Salary").show()

#Group by with multiple aggregations
df.groupBy("Department").agg(
                              count("Employee").alias("Employee_Count"),
                              avg("Salary").alias("Average_Salary"),
                              max("Salary").alias("Max_Salary")
)
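
agg() also accepts a dictionary mapping column names to aggregate function names. This form is handy for quick one-off aggregations, though you cannot alias the result columns inline:

#Dictionary form of agg()
df.groupBy("Department").agg({"Salary": "avg", "Employee": "count"}).show()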

Filtering after grouping

This is the equivalent of SQL's HAVING clause: the data is filtered after it has been grouped and aggregated.

#Filter after aggregation
df.groupBy("Department").agg(sum("Salary").alias("Total_Salary")).filter("Total_Salary > 8000").show()
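
The condition can also be written as a column expression instead of a SQL string; both forms are equivalent:

#Same filter, written with col() instead of a SQL string
df.groupBy("Department") \
    .agg(sum("Salary").alias("Total_Salary")) \
    .filter(col("Total_Salary") > 8000) \
    .show()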


Commonly used aggregation functions

Function          Description                           Example
count()           Counts rows in a group.               df.groupBy("Department").count()
sum()             Sums values in a group.               df.groupBy("Department").sum("Salary")
avg() / mean()    Calculates the average value.         df.groupBy("Department").avg("Salary")
min()             Finds the minimum value.              df.groupBy("Department").min("Salary")
max()             Finds the maximum value.              df.groupBy("Department").max("Salary")
countDistinct()   Counts distinct values in a group.    df.select(countDistinct("Employee"))
collect_list()    Collects all values into a list.      df.groupBy("Department").agg(collect_list("Employee"))
collect_set()     Collects unique values into a set.    df.groupBy("Department").agg(collect_set("Employee"))
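
To close, here is a quick sketch of the last two functions from the table, since they behave differently from the numeric aggregates: collect_list keeps duplicate values, while collect_set drops them. This assumes the sample df from the top of the post:

#collect_list keeps duplicates, collect_set removes them
df.groupBy("Department").agg(
    collect_list("Employee").alias("All_Employees"),
    collect_set("Employee").alias("Unique_Employees")
).show()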