I'm trying to transform a Spark DataFrame into a Scala Map, and additionally a list of values.
It is best illustrated as follows:
val df = sqlContext.read.json("examples/src/main/resources/people.json")
df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
|  21|Michael|
+----+-------+
Into a Scala collection (a Map of Lists of Maps), represented like this:
Map(
(0 -> List(Map("age" -> null, "name" -> "Michael"), Map("age" -> 21, "name" -> "Michael"))),
(1 -> Map("age" -> 30, "name" -> "Andy")),
(2 -> Map("age" -> 19, "name" -> "Justin"))
)
As I don't know much about Scala, I wonder whether this is possible. It doesn't strictly have to be a List.
CodePudding user response:
The data structure you want is actually useless. Let me explain what I mean by asking 2 questions:
- What is the purpose of the integers in the outer map? Are they indices? What is the logic behind them? If they are indices, why not just use an Array?
- Why use Map[String, Any] and do unsafe element access, when you can model the rows as case classes? (See the short sketch below.)
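To illustrate the second point, here is a small sketch (with hypothetical values, not from the original question) of why element access through Map[String, Any] is unsafe compared to a case class field:

// Hypothetical row modelled as Map[String, Any]: every read needs a cast
val row: Map[String, Any] = Map("age" -> 30, "name" -> "Andy")
val age = row("age").asInstanceOf[Int]    // works, but only because the stored type happens to match
val boom = row("name").asInstanceOf[Int]  // compiles fine, throws ClassCastException at runtime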
So I think the best thing you can do would be this:
import sqlContext.implicits._   // provides the encoder needed by .as[Person]
case class Person(name: String, age: Option[Int])

val persons: Array[Person] = df.as[Person].collect()
val personsByName: Map[String, Array[Person]] = persons.groupBy(_.name)
The result would be:
Map(
Michael -> Array(Person(Michael, None), Person(Michael, Some(21))),
Andy -> Array(Person(Andy, Some(30))),
Justin -> Array(Person(Justin, Some(19)))
)
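As a small usage sketch (assuming the personsByName value above, not part of the original answer), you can then look up all rows for a given name directly:

// Fetch every Person named "Michael"; getOrElse avoids an exception for unknown names
val michaels: Array[Person] = personsByName.getOrElse("Michael", Array.empty[Person])
michaels.foreach(println)   // prints Person(Michael,None) and Person(Michael,Some(21))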
Still, if you insist on that data structure, this is the code you would need:
val result: Map[Int, List[Map[String, Any]]] =
  persons
    .groupBy(_.name)   // grouping persons by name
    .zipWithIndex      // coupling an index with each group
    .map { case ((_, group), index) =>
      // put the index as key and map each person to the desired map;
      // getOrElse unwraps the Option so a missing age becomes null
      index -> group.map(p => Map("age" -> p.age.getOrElse(null), "name" -> p.name)).toList
    }
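If you want to verify the transformation without a Spark session, the same groupBy / zipWithIndex logic runs on a plain Scala collection. This is a standalone sketch with the sample rows hard-coded, not part of the original answer:

case class Person(name: String, age: Option[Int])

val persons = Array(
  Person("Michael", None),
  Person("Andy", Some(30)),
  Person("Justin", Some(19)),
  Person("Michael", Some(21))
)

val result: Map[Int, List[Map[String, Any]]] =
  persons
    .groupBy(_.name)    // Map[String, Array[Person]]
    .zipWithIndex       // pairs each (name, group) with an index
    .map { case ((_, group), index) =>
      index -> group.map(p => Map("age" -> p.age.getOrElse(null), "name" -> p.name)).toList
    }

result.foreach(println)
// e.g. 0 -> List(Map(age -> null, name -> Michael), Map(age -> 21, name -> Michael))
// (which index each name gets depends on the Map's iteration order)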