I have a dataset with customers' account information, as below:
| customerID | accountID | balance |
|---|---|---|
| ID001 | ACC001 | 20 |
| ID002 | ACC002 | 400 |
| ID003 | ACC003 | 500 |
| ID002 | ACC004 | 30 |
I want to group by and aggregate the above data to get the output below, without using spark.sql functions; only the Dataset API is allowed.
| accounts | number of accounts | totalBalance | averageBalance |
|---|---|---|---|
| [ID001,ACC001,20] | 1 | 20 | 20 |
| [[ID002,ACC002,400], [ID002,ACC004,30]] | 2 | 430 | 215 |
| [ID003,ACC003,500] | 1 | 500 | 500 |
I tried using ds.groupBy("accountID").agg(Map("balance" -> "avg")), but with the Map form I am only able to get the average. I need help performing multiple aggregations without using spark sql functions.
Any help achieving the above is appreciated. Thanks.
CodePudding user response:
Here is a solution:
import org.apache.spark.sql.functions.{avg, count, sum}
import spark.implicits._ // for .toDF and the $"col" syntax (spark is the SparkSession)

// Sample data
val cust_data = Seq[(String, String, Int)](
  ("ID001", "ACC001", 20),
  ("ID002", "ACC002", 400),
  ("ID003", "ACC003", 500),
  ("ID002", "ACC004", 30)).toDF("customerID", "accountID", "balance")

// Group by customer and compute all three aggregates in a single agg(...) call
val out_df = cust_data.groupBy("customerID").agg(
  count($"accountID").alias("number_of_accounts"),
  sum($"balance").alias("totalBalance"),
  avg($"balance").alias("averageBalance"))

out_df.show()
+----------+------------------+------------+--------------+
|customerID|number_of_accounts|totalBalance|averageBalance|
+----------+------------------+------------+--------------+
|     ID001|                 1|          20|          20.0|
|     ID002|                 2|         430|         215.0|
|     ID003|                 1|         500|         500.0|
+----------+------------------+------------+--------------+
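
Since the question asks to stick to the Dataset API (and the desired output also contains the grouped accounts themselves), here is a minimal sketch of the same aggregation with the typed API using groupByKey/mapGroups. It assumes a SparkSession named spark is in scope and reuses the cust_data DataFrame from above; the Account and CustomerSummary case classes are illustrative names introduced here, not anything from the original post.

import spark.implicits._

// Illustrative case classes (assumed names, only for this sketch)
case class Account(customerID: String, accountID: String, balance: Int)
case class CustomerSummary(accounts: Seq[Account],
                           numberOfAccounts: Long,
                           totalBalance: Long,
                           averageBalance: Double)

// Work with a typed Dataset instead of an untyped DataFrame
val accountsDs = cust_data.as[Account]

val summary = accountsDs
  .groupByKey(_.customerID)          // typed grouping, no spark.sql functions
  .mapGroups { (_, rows) =>
    val accounts = rows.toSeq        // all accounts of this customer
    val total    = accounts.map(_.balance.toLong).sum
    CustomerSummary(accounts, accounts.size, total, total.toDouble / accounts.size)
  }

summary.show(truncate = false)

Note that mapGroups materializes each group in memory (rows.toSeq), which is fine for small groups like these but worth keeping in mind for very large groups.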