Home > Blockchain >  Aggregate dataset by key and list the rest of values
Aggregate dataset by key and list the rest of values

Time:09-24

I am trying to achieve aggregation by key and getting the list as aggregation.Not sure if i am able to explain.I am new to these spark features.

case class myScheme(classId: Int,
                        StudentId: Long,
                        joiningDate: java.sql.Timestamp,
                        SchoolName: String
                       )

println( myDS)

 classId | studentId| joiningDate |SchoolName
    1|100|2010-01-01|"School1"
    2|110|2011-01-01|"School1"
    2|200|2010-01-01|"School1"
    3|300|2020-02-02|"School2"

    

I want to group all student Ids by schoolName and classId.

The final result from the above dataset:

"school1" ,1, [100 ]
"school2" ,2, [110,200] 
"school3" ,3, [300]

CodePudding user response:

Your current input and expected result do not make much sense, but it seems that you need a simple groupBy with collect_list:

myDS.groupBy("SchoolName", "classId").agg(collect_list("studentId")).show 

 ---------- ------- ----------------------- 
|SchoolName|classId|collect_list(studentId)|
 ---------- ------- ----------------------- 
|   School1|      1|                  [100]|
|   School2|      3|                  [300]|
|   School1|      2|             [110, 200]|
 ---------- ------- ----------------------- 

CodePudding user response:

If the data is already available in form of a (typed) dataset, you can use groupByKey to stay within the typed API:

//input data
val ds: Dataset[myScheme] = ...

//case class for the result
case class Group(SchoolName: String,
                 classId: Int,
                 data: List[myScheme])

//the grouping operation
val dsGrouped: Dataset[Group] =
  ds.groupByKey(r=> (r.SchoolName, r.classId))
    .mapGroups((key, values) => Group(key._1, key._2, values.toList))

dsGrouped.show(false)

Output:

 ---------- ------- -------------------------------------------------------------------------------- 
|SchoolName|classId|data                                                                            |
 ---------- ------- -------------------------------------------------------------------------------- 
|School1   |1      |[{1, 100, 2010-01-01 00:00:00, School1}]                                        |
|School2   |3      |[{3, 300, 2020-02-02 00:00:00, School2}]                                        |
|School1   |2      |[{2, 110, 2011-01-01 00:00:00, School1}, {2, 200, 2001-01-01 00:00:00, School1}]|
 ---------- ------- -------------------------------------------------------------------------------- 
  • Related