I was looking for a filter function, i.e. one that takes a list and a predicate (a function from the list's element type to a Boolean) and returns the elements of the original list for which the predicate returns true.
When I try to apply filter, I get an error. Are there any ways to apply filter to a RelationalGroupedDataset? (I wasn't able to find any in the attached docs: https://spark.apache.org/docs/2.4.4/api/java/org/apache/spark/sql/RelationalGroupedDataset.html)
Also, what is the proper notation for accessing a specific column value of a RelationalGroupedDataset?
Thanks!
CodePudding user response:
Quoting from your link, RelationalGroupedDataset is "A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup (and also pivot)".
In other words, it provides a way to apply an aggregation function (sum, min, max, avg, etc.), not a filter, to a set of records grouped by certain key(s).
Depending on what you need to filter (the original, ungrouped records before aggregation, or the result of an aggregation function), a where() can be applied either before groupBy() or after agg(). In the latter case, it has the semantics of SQL's ... GROUP BY ... HAVING ... query.
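To make the two placements concrete, here is a minimal sketch, assuming a local SparkSession and a small made-up DataFrame (the object name, the sample rows, and the thresholds are hypothetical and only there for illustration). The first pipeline filters rows before grouping, the second filters the aggregated result.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

object FilterGroupedExample {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just for trying this out.
    val spark = SparkSession.builder()
      .appName("filter-grouped-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: (department, salary, bonus).
    val df = Seq(
      ("sales", 3000, 20000),
      ("sales", 4000, 40000),
      ("hr", 3500, 10000)
    ).toDF("department", "salary", "bonus")

    // Filter BEFORE grouping: rows are dropped first, then aggregated
    // (comparable to a SQL WHERE clause).
    val filteredBefore = df
      .where(col("salary") >= 3500)
      .groupBy("department")
      .agg(sum("bonus").as("sum_bonus"))

    // Filter AFTER aggregation: groups are dropped based on the aggregate
    // (comparable to a SQL HAVING clause).
    val filteredAfter = df
      .groupBy("department")
      .agg(sum("bonus").as("sum_bonus"))
      .where(col("sum_bonus") >= 50000)

    filteredBefore.show(false)
    filteredAfter.show(false)

    spark.stop()
  }
}
```

Note that in both pipelines where() is called on a DataFrame, never on the RelationalGroupedDataset returned by groupBy().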
CodePudding user response:
Here is an example:
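import org.apache.spark.sql.functions.{avg, col, max, sum}

df.groupBy("department")
  .agg(
    sum("salary").as("sum_salary"),
    avg("salary").as("avg_salary"),
    sum("bonus").as("sum_bonus"),
    max("bonus").as("max_bonus"))
  // agg() returns a DataFrame, so where() works here (like SQL HAVING)
  .where(col("sum_bonus") >= 50000)
  .show(false)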
It should give you guidance.
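For comparison, the same aggregation can probably be written as a plain SQL query with HAVING. This is only a sketch: the view name employees, the object name, and the sample rows are assumptions made up for the example, not something from the question.

```scala
import org.apache.spark.sql.SparkSession

object HavingSqlExample {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just for trying this out.
    val spark = SparkSession.builder()
      .appName("having-sql-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: (department, salary, bonus).
    val df = Seq(
      ("sales", 3000, 20000),
      ("sales", 4000, 40000),
      ("hr", 3500, 10000)
    ).toDF("department", "salary", "bonus")

    // Register the DataFrame under a (hypothetical) view name so SQL can see it.
    df.createOrReplaceTempView("employees")

    // GROUP BY ... HAVING filters on the aggregated value,
    // mirroring .groupBy(...).agg(...).where(...).
    spark.sql(
      """SELECT department,
        |       SUM(salary) AS sum_salary,
        |       AVG(salary) AS avg_salary,
        |       SUM(bonus)  AS sum_bonus,
        |       MAX(bonus)  AS max_bonus
        |FROM employees
        |GROUP BY department
        |HAVING SUM(bonus) >= 50000""".stripMargin
    ).show(false)

    spark.stop()
  }
}
```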