I am trying to achieve aggregation by key and getting the list as aggregation.Not sure if i am able to explain.I am new to these spark features.
case class myScheme(classId: Int,
StudentId: Long,
joiningDate: java.sql.Timestamp,
SchoolName: String
)
println( myDS)
classId | studentId| joiningDate |SchoolName
1|100|2010-01-01|"School1"
2|110|2011-01-01|"School1"
2|200|2010-01-01|"School1"
3|300|2020-02-02|"School2"
I want to group all student Ids by schoolName and classId.
The final result from the above dataset:
"school1" ,1, [100 ]
"school2" ,2, [110,200]
"school3" ,3, [300]
CodePudding user response:
Your current input and expected result do not make much sense, but it seems that you need a simple groupBy
with collect_list
:
myDS.groupBy("SchoolName", "classId").agg(collect_list("studentId")).show
---------- ------- -----------------------
|SchoolName|classId|collect_list(studentId)|
---------- ------- -----------------------
| School1| 1| [100]|
| School2| 3| [300]|
| School1| 2| [110, 200]|
---------- ------- -----------------------
CodePudding user response:
If the data is already available in form of a (typed) dataset, you can use groupByKey
to stay within the typed API:
//input data
val ds: Dataset[myScheme] = ...
//case class for the result
case class Group(SchoolName: String,
classId: Int,
data: List[myScheme])
//the grouping operation
val dsGrouped: Dataset[Group] =
ds.groupByKey(r=> (r.SchoolName, r.classId))
.mapGroups((key, values) => Group(key._1, key._2, values.toList))
dsGrouped.show(false)
Output:
---------- ------- --------------------------------------------------------------------------------
|SchoolName|classId|data |
---------- ------- --------------------------------------------------------------------------------
|School1 |1 |[{1, 100, 2010-01-01 00:00:00, School1}] |
|School2 |3 |[{3, 300, 2020-02-02 00:00:00, School2}] |
|School1 |2 |[{2, 110, 2011-01-01 00:00:00, School1}, {2, 200, 2001-01-01 00:00:00, School1}]|
---------- ------- --------------------------------------------------------------------------------