Home > database >  Converting a Spark Dataframe to a Scala Map collection list
Converting a Spark Dataframe to a Scala Map collection list

Time:07-11

I'm trying to transform a Spark dataframe into a Scalar map and additionally a list of values.

It is best illustrated as follows:

val df = sqlContext.read.json("examples/src/main/resources/people.json")
df.show()
 ---- ------- 
| age|   name|
 ---- ------- 
|null|Michael|
|  30|   Andy|
|  19| Justin|
|  21|Michael|
 ---- ------- 

To a Scala collection (Map of Maps(List(values))) represented like this:

Map(
  (0 -> List(Map("age" -> null, "name" -> "Michael"), Map("age" -> 21, "name" -> "Michael"))),
  (1 -> Map("age" -> 30, "name" -> "Andy")),
  (2 -> Map("age" -> 19, "name" -> "Justin"))
)

As I don't know much about Scala, I wonder if this method is possible. It doesn't matter if it's not necessarily a List.

CodePudding user response:

The data structure you want is actually useless. Let me explain what I mean by asking 2 questions:

    1. What is the purpose of the integers of the outside map? are those indices? What is the logic of those indices? If those are indices, why not just use Array?
    1. Why to use Map[String, Any] and do unsafe element accessing, while you can model into case classes?

So I think the best thing you can do would be this:

case class Person(name: String, age: Option[Int])
val persons = df.as[Person].collect
val personsByName: Map[String, Array[Person]] = persons.groupBy(_.name)

Result would be:

Map(
  Michael -> Array(Person(Michael, None), Person(Michael, Some(21)),
  Andy -> Array(Person(Andy, Some(30))),
  Justin -> Array(Person(Justin, Some(19)))
)

But still, if you insist on the data structure, this is the code you need to use:

val result: Map[Int, List[Map[String, Any]]] =
  persons.groupBy(_.name)       // grouping persons by name
  .zipWithIndex                 // coupling index with values of array
  .map { 
    case ((name, persons), index) => 
      // put index as key, map each person to the desired map
      index -> persons.map(p => Map("age" -> p.age, "name" -> p.name)).toList 
    }
  • Related