Group and aggregate dataset in spark scala without using spark.sql()


I have a dataset with account information of customers as below

customerID  accountID  balance
ID001       ACC001     20
ID002       ACC002     400
ID003       ACC003     500
ID002       ACC004     30

I want to group by and aggregate the above data to get the output below. I am not allowed to use spark.sql() queries; only the Dataset API is allowed.

accounts                                  number of accounts  totalBalance  averageBalance
[ID001,ACC001,20]                         1                   20            20
[[ID002,ACC002,400], [ID002,ACC004,30]]   2                   430           215
[ID003,ACC003,500]                        1                   500           500

I tried ds.groupBy("accountID").agg(Map("balance" -> "avg")), but with the Map form I can only apply a single aggregation to get the average. I need help performing multiple aggregations without using spark.sql().

Appreciate any help achieving the above. Thanks.

CodePudding user response:

Here is a solution:


import org.apache.spark.sql.functions.{avg, count, sum}
import spark.implicits._   // needed for .toDF and the $ column syntax

// Sample data from the question
val cust_data = Seq(
  ("ID001", "ACC001", 20),
  ("ID002", "ACC002", 400),
  ("ID003", "ACC003", 500),
  ("ID002", "ACC004", 30)
).toDF("customerID", "accountID", "balance")

// Group by customer and apply several aggregations in a single agg() call
val out_df = cust_data.groupBy("customerID").agg(
  count($"accountID").alias("number_of_accounts"),
  sum($"balance").alias("totalBalance"),
  avg($"balance").alias("averageBalance"))

out_df.show()

+----------+------------------+------------+--------------+
|customerID|number_of_accounts|totalBalance|averageBalance|
+----------+------------------+------------+--------------+
|     ID001|                 1|          20|          20.0|
|     ID002|                 2|         430|         215.0|
|     ID003|                 1|         500|         500.0|
+----------+------------------+------------+--------------+
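
If you also need the nested accounts column shown in your expected output (the list of account rows per customer), one option is to add collect_list over a struct of the columns. This is a minimal, untested sketch reusing the cust_data DataFrame from above; collect_list and struct come from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.{avg, collect_list, count, struct, sum}

// Collect each customer's account rows into an array of structs,
// alongside the same count/sum/avg aggregations
val out_df2 = cust_data.groupBy("customerID").agg(
  collect_list(struct($"customerID", $"accountID", $"balance")).alias("accounts"),
  count($"accountID").alias("number_of_accounts"),
  sum($"balance").alias("totalBalance"),
  avg($"balance").alias("averageBalance"))

out_df2.show(false)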