How can you apply filter for a RelationalGroupedDataset class from apache.spark.sql using Scala?

Time:06-15

I was trying to find a filter function: one that takes a List and a predicate (a function from the list's element type to a Boolean) and returns only the elements for which the predicate returns true.

When I try to apply filter, I get an error. Are there any ways to apply filter to a RelationalGroupedDataset? (I wasn't able to find any in the attached docs: https://spark.apache.org/docs/2.4.4/api/java/org/apache/spark/sql/RelationalGroupedDataset.html)

Also, is there proper notation for how I should be accessing a specific column value for a RelationalGroupedDataset?

Thanks!

Original Call

Error Message

CodePudding user response:

Quoting from your link, RelationalGroupedDataset is "A set of methods for aggregations on a DataFrame, created by groupBy, cube or rollup (and also pivot)".

In other words, it provides a way to apply an aggregation function (sum, min, max, avg, etc.), not a filter, to a set of records grouped by certain key(s).

Depending on what you need to filter (the original, ungrouped records before aggregation, or the result of an aggregation function), a where() can be applied either before groupBy() or after agg(). In the latter case, it has the semantics of SQL's ... GROUP BY ... HAVING ... query.
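To make the two placements concrete, here is a minimal sketch. The column names (department, salary, bonus), the sample rows, and the helper names filterBeforeGrouping and filterAfterAggregation are all hypothetical, chosen only for illustration; it assumes Spark SQL is on the classpath and runs in local mode.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

object FilterAroundGroupBy {

  // Filter the raw rows first (like SQL WHERE), then group and aggregate.
  def filterBeforeGrouping(df: DataFrame): DataFrame =
    df.where(col("bonus") > 0)
      .groupBy("department")
      .agg(sum("salary").as("sum_salary"))

  // Group and aggregate first, then filter the aggregated rows (like SQL HAVING).
  def filterAfterAggregation(df: DataFrame, minSumBonus: Long): DataFrame =
    df.groupBy("department")
      .agg(sum("bonus").as("sum_bonus"))
      .where(col("sum_bonus") >= minSumBonus)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("filter-around-groupBy")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data, not taken from the question.
    val df = Seq(
      ("sales", 3000, 500),
      ("sales", 4000, 0),
      ("engineering", 5000, 1000),
      ("engineering", 6000, 2000)
    ).toDF("department", "salary", "bonus")

    filterBeforeGrouping(df).show(false)
    filterAfterAggregation(df, 1000L).show(false)

    spark.stop()
  }
}
```

In both helpers the where() is called on a DataFrame, never on the RelationalGroupedDataset returned by groupBy(), which is why the call in the question fails.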

CodePudding user response:

Here is an example:

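import org.apache.spark.sql.functions.{avg, col, max, sum}

// Aggregate per department, then keep only departments whose total bonus is at least 50000.
df.groupBy("department")
  .agg(
    sum("salary").as("sum_salary"),
    avg("salary").as("avg_salary"),
    sum("bonus").as("sum_bonus"),
    max("bonus").as("max_bonus"))
  .where(col("sum_bonus") >= 50000) // applied after agg(), like SQL HAVING
  .show(false) // print the result without truncating column values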

It should give you guidance.
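On the column-notation part of the question: a RelationalGroupedDataset does not expose column values itself, so columns are referenced on the DataFrame that agg() returns, using the alias given to each aggregate. The sketch below shows four notations that should be interchangeable here; the sample data and the object name ColumnNotation are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

object ColumnNotation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("column-notation")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables toDF on Seq and the $"..." column syntax

    // Hypothetical sample data, not taken from the question.
    val df = Seq(
      ("sales", 20000L),
      ("sales", 40000L),
      ("engineering", 10000L)
    ).toDF("department", "bonus")

    // agg() turns the RelationalGroupedDataset back into a DataFrame.
    val aggregated = df.groupBy("department").agg(sum("bonus").as("sum_bonus"))

    // Four ways to refer to the aggregated column by its alias.
    val viaCol       = aggregated.where(col("sum_bonus") >= 50000)
    val viaDollar    = aggregated.where($"sum_bonus" >= 50000)
    val viaApply     = aggregated.where(aggregated("sum_bonus") >= 50000)
    val viaSqlString = aggregated.where("sum_bonus >= 50000")

    viaCol.show(false)
    viaDollar.show(false)
    viaApply.show(false)
    viaSqlString.show(false)

    spark.stop()
  }
}
```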
