I have a dataset with customers' account information, as below:
| customerID | accountID | balance |
|---|---|---|
| ID001 | ACC001 | 20 |
| ID002 | ACC002 | 400 |
| ID003 | ACC003 | 500 |
| ID002 | ACC004 | 30 |
I want to group by and aggregate the above data to get the output below, without using spark.sql functions; only the Dataset API is allowed.
| accounts | number of accounts | totalBalance | averageBalance |
|---|---|---|---|
| [ID001,ACC001,20] | 1 | 20 | 20 |
| [[ID002,ACC002,400], [ID002,ACC004,30]] | 2 | 430 | 215 |
| [ID003,ACC003,500] | 1 | 500 | 500 |
I tried using ds.groupBy("accountID").agg(Map("balance" -> "avg")), but with the Map form I am only able to get the average. I need help performing multiple aggregations without using spark sql functions.
Any help achieving the above is appreciated. Thanks.
CodePudding user response:
Here is a solution:
import org.apache.spark.sql.functions.{avg, count, sum}
import spark.implicits._ // for .toDF and the $"col" syntax (spark is the SparkSession)

// Sample data
val cust_data = Seq[(String, String, Int)](
  ("ID001", "ACC001", 20),
  ("ID002", "ACC002", 400),
  ("ID003", "ACC003", 500),
  ("ID002", "ACC004", 30)).toDF("customerID", "accountID", "balance")

// Group by customer and compute all three aggregates in a single agg(...) call
val out_df = cust_data.groupBy("customerID").agg(
  count($"accountID").alias("number_of_accounts"),
  sum($"balance").alias("totalBalance"),
  avg($"balance").alias("averageBalance"))

out_df.show()
+----------+------------------+------------+--------------+
|customerID|number_of_accounts|totalBalance|averageBalance|
+----------+------------------+------------+--------------+
|     ID001|                 1|          20|          20.0|
|     ID002|                 2|         430|         215.0|
|     ID003|                 1|         500|         500.0|
+----------+------------------+------------+--------------+
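
Since the question asks to stick to the Dataset API (and the desired output also contains the grouped accounts themselves), here is a minimal sketch of the same aggregation with the typed API using groupByKey/mapGroups. It assumes a SparkSession named spark is in scope and reuses the cust_data DataFrame from above; the Account and CustomerSummary case classes are illustrative names introduced here, not anything from the original post.

import spark.implicits._

// Illustrative case classes (assumed names, only for this sketch)
case class Account(customerID: String, accountID: String, balance: Int)
case class CustomerSummary(accounts: Seq[Account],
                           numberOfAccounts: Long,
                           totalBalance: Long,
                           averageBalance: Double)

// Work with a typed Dataset instead of an untyped DataFrame
val accountsDs = cust_data.as[Account]

val summary = accountsDs
  .groupByKey(_.customerID)          // typed grouping, no spark.sql functions
  .mapGroups { (_, rows) =>
    val accounts = rows.toSeq        // all accounts of this customer
    val total    = accounts.map(_.balance.toLong).sum
    CustomerSummary(accounts, accounts.size, total, total.toDouble / accounts.size)
  }

summary.show(truncate = false)

Note that mapGroups materializes each group in memory (rows.toSeq), which is fine for small groups like these but worth keeping in mind for very large groups.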