How to group by key in Apache Spark

I have a case class like this:

   case class Employee(name: String, id: Int)

and I want a Dataset that looks like a Map, i.e. a key with a list of values:

   Dataset[(String, List[Employee])]

My use case is to group the IDs of employees that have the same name. Is there an operator in Spark to do that?

Answer:

With Datasets this is simple, using groupByKey followed by mapGroups:

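import org.apache.spark.sql.Dataset
// Assumes a SparkSession named `spark` is in scope, as in spark-shell.
import spark.implicits._ // provides the encoders and .toDS()

val data = Seq(
  Employee("john doe", 1),
  Employee("john doe", 2),
  Employee("john doe2", 2)
)

val ds = spark.sparkContext.parallelize(data).toDS()

// Group by name, then materialize each group's iterator as a List.
val resultDS: Dataset[(String, List[Employee])] =
  ds.groupByKey(_.name).mapGroups { case (k, iter) => (k, iter.toList) }

resultDS.show(false)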

It gives:

+---------+------------------------------+
|_1       |_2                            |
+---------+------------------------------+
|john doe2|[{john doe2, 2}]              |
|john doe |[{john doe, 1}, {john doe, 2}]|
+---------+------------------------------+
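
Since the question asks for the IDs per name rather than the full Employee records, the same groupByKey/mapGroups pattern can map each group down to its ids. The following is a minimal sketch, not a definitive implementation. It assumes a local SparkSession created just for illustration (in spark-shell, `spark` already exists), and the name `idsByName` is hypothetical.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Employee(name: String, id: Int)

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("group-ids-by-name")
  .getOrCreate()
import spark.implicits._ // provides the encoders and .toDS()

val ds: Dataset[Employee] = Seq(
  Employee("john doe", 1),
  Employee("john doe", 2),
  Employee("john doe2", 2)
).toDS()

// Group by name and keep only the id of each employee in the group.
val idsByName: Dataset[(String, List[Int])] =
  ds.groupByKey(_.name).mapGroups { case (name, employees) =>
    (name, employees.map(_.id).toList)
  }

idsByName.show(false)
```

Neither the order of the rows nor the order of the ids inside each list is guaranteed, so sort them if the order matters. For plain aggregations over large groups, the Spark API documentation for mapGroups suggests preferring reduce or an Aggregator, since mapGroups shuffles all the data for each key.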