I have a case class like this:
case class Employee(name: String, id: Int)
and I want a dataset that looks like a Map, with a key and a list of values, i.e.
Dataset[(String, List[Employee])]
My use case is to group the IDs of employees with the same name. Is there an operator in Spark to do that?
CodePudding user response:
With Datasets it is simple:
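import org.apache.spark.sql.Dataset
import spark.implicits._ // provides toDS() and the encoders; assumes a SparkSession named spark

val data = Seq(
  Employee("john doe", 1),
  Employee("john doe", 2),
  Employee("john doe2", 2)
)
val ds = spark.sparkContext.parallelize(data).toDS()
// Group by name, then materialize each group's iterator as a List.
val resultDS: Dataset[(String, List[Employee])] =
  ds.groupByKey(_.name).mapGroups { case (k, iter) => (k, iter.toList) }
resultDS.show(false)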
It gives:
+---------+------------------------------+
|_1       |_2                            |
+---------+------------------------------+
|john doe2|[{john doe2, 2}]              |
|john doe |[{john doe, 1}, {john doe, 2}]|
+---------+------------------------------+
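
Since the question asks for the IDs rather than the whole Employee objects, here is a minimal sketch of that variant. It assumes a local SparkSession built just for illustration (in spark-shell, `spark` already exists), and the names `idsByName` and `idsByNameDF` are made up for this example. It shows two options: the typed `mapGroups` route and the untyped `groupBy` plus `collect_list` route.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.collect_list

case class Employee(name: String, id: Int)

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("group-ids").getOrCreate()
import spark.implicits._

val ds: Dataset[Employee] = Seq(
  Employee("john doe", 1),
  Employee("john doe", 2),
  Employee("john doe2", 2)
).toDS()

// Typed API: keep only the id of each employee in a group.
val idsByName: Dataset[(String, List[Int])] =
  ds.groupByKey(_.name).mapGroups { (name, employees) =>
    (name, employees.map(_.id).toList)
  }

// Untyped alternative: aggregate the id column per name into an array column.
val idsByNameDF = ds.groupBy($"name").agg(collect_list($"id").as("ids"))

idsByName.show(false)
idsByNameDF.show(false)
```

As far as I know, neither variant guarantees the order of the IDs inside each list, so sort them if the order matters.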