How can you apply a filter to a RelationalGroupedDataset from org.apache.spark.sql using Scala?

I was trying to find a filter function: one that takes a list and a predicate (a function that accepts an element of the list and returns a Boolean) and returns the elements of the original list for which the predicate returns true.

When I try to apply filter, I get an error. Are there any ways to apply filter to a RelationalGroupedDataset? (I wasn't able to find any in the attached docs: https://spark.apache.org/docs/2.4.4/api/java/org/apache/spark/sql/RelationalGroupedDataset.html)

Also, is there a proper notation for accessing a specific column value of a RelationalGroupedDataset?

Thanks!

Original Call

Error Message

CodePudding user response:

Quoting from your link, RelationalGroupedDataset is "A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup (and also pivot)".

In other words, it provides a way to apply an aggregation function (sum, min, max, avg, etc.), not a filter, to a set of records grouped by certain key(s).
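To make the type distinction concrete, here is a minimal sketch. It assumes a local SparkSession and a hypothetical DataFrame with department and salary columns; the object and method names are made up for illustration and none of this comes from the question. groupBy returns a RelationalGroupedDataset, which has no filter; an aggregation such as agg(...) turns it back into a DataFrame, whose columns can be referenced with col(...) and filtered.

```scala
import org.apache.spark.sql.{DataFrame, RelationalGroupedDataset, SparkSession}
import org.apache.spark.sql.functions.{avg, col}

// Hypothetical helper object, for illustration only.
object GroupedTypes {
  // groupBy only describes the grouping: the result offers aggregation
  // methods (count, sum, agg, ...) but no filter/where.
  def groupByDepartment(df: DataFrame): RelationalGroupedDataset =
    df.groupBy("department")

  // Aggregating yields an ordinary DataFrame again, so the aggregated
  // column can be referenced by name and filtered.
  def departmentsWithAvgSalaryAbove(df: DataFrame, threshold: Double): DataFrame =
    groupByDepartment(df)
      .agg(avg("salary").as("avg_salary"))
      .filter(col("avg_salary") > threshold)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("grouped-types")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample data, not taken from the question.
    val df = Seq(("sales", 3000), ("sales", 4000), ("eng", 6000))
      .toDF("department", "salary")

    departmentsWithAvgSalaryAbove(df, 4000.0).show(false)
    spark.stop()
  }
}
```

With the sample rows above, only the eng department has an average salary above 4000, so it is the only row that remains.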

Depending on what you need to filter (the original, ungrouped records before aggregation, or the result of an aggregation function), a where() can be applied either before groupBy() or after agg(). In the latter case, it has the semantics of SQL's ... GROUP BY ... HAVING ... query.
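The two placements of where() can be sketched side by side. This is only an illustration under assumptions: the department, salary, and bonus columns, the sample rows, and the object and method names are hypothetical and not from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

// Hypothetical helper object, for illustration only.
object WhereAroundGroupBy {
  // where() before groupBy(): filters the raw rows (like SQL WHERE),
  // so only rows with a positive bonus contribute to the sum.
  def sumSalaryOfBonusEarners(df: DataFrame): DataFrame =
    df.where(col("bonus") > 0)
      .groupBy("department")
      .agg(sum("salary").as("sum_salary"))

  // where() after agg(): filters the aggregated rows (like SQL HAVING),
  // keeping only departments whose total bonus reaches minTotal.
  def departmentsWithBonusAtLeast(df: DataFrame, minTotal: Long): DataFrame =
    df.groupBy("department")
      .agg(sum("bonus").as("sum_bonus"))
      .where(col("sum_bonus") >= minTotal)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("where-around-groupBy")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample data, not taken from the question.
    val df = Seq(
      ("sales", 3000, 100),
      ("sales", 4000, 0),
      ("eng", 5000, 700),
      ("eng", 6000, 400)
    ).toDF("department", "salary", "bonus")

    sumSalaryOfBonusEarners(df).show(false)
    departmentsWithBonusAtLeast(df, 1000L).show(false)
    spark.stop()
  }
}
```

The first method changes which rows are aggregated; the second leaves the aggregation untouched and only drops groups from its result.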

CodePudding user response:

Here is an example:

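import org.apache.spark.sql.functions.{avg, col, max, sum}

// df is assumed to be an existing DataFrame with department, salary, and bonus columns.
// Aggregate per department, then keep only the departments whose total bonus
// is at least 50000 (the equivalent of SQL's HAVING).
df.groupBy("department")
  .agg(
    sum("salary").as("sum_salary"),
    avg("salary").as("avg_salary"),
    sum("bonus").as("sum_bonus"),
    max("bonus").as("max_bonus"))
  .where(col("sum_bonus") >= 50000)
  .show(false)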

That should point you in the right direction.