I have a DataFrame that looks like this:
| key | words |
|---|---|
| 1 | ['a', 'test'] |
| 2 | ['hi', 'there'] |
And I would like to create the following hashmap:
Map(1 -> ['a', 'test'], 2 -> ['hi', 'there'])
But I cannot figure out how to do this. Can anyone help me?
Thanks!
CodePudding user response:
There must be dozens of ways of doing this. One would be:
df.collect().map(row => row.getAs[Int](0) -> row.getAs[Seq[String]](1)).toMap
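Note that collect() pulls the entire DataFrame to the driver, so this only suits data small enough to fit in driver memory. A minimal alternative sketch, assuming the columns are named key and words as in the question, goes through the RDD API instead:
// Equivalent via the RDD API; collectAsMap also materializes
// everything on the driver, so the same memory caveat applies.
val wordMap: Map[Int, Seq[String]] =
  df.rdd
    .map(row => row.getAs[Int]("key") -> row.getAs[Seq[String]]("words"))
    .collectAsMap()
    .toMap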
CodePudding user response:
This is very similar to the solution in this question. The following should give you the output you want: it gathers all the per-row maps into a collection, then uses a UDF to merge them into a single map. This comes with the usual caveat that UDFs can perform poorly, since Spark cannot optimize through them.
import org.apache.spark.sql.functions.{col, collect_list, lit, map, udf}
import spark.implicits._ // for toDF; assumes a SparkSession in scope named `spark`

// Merge a collection of single-entry maps into one map
val joinMap = udf { values: Seq[Map[Int, Seq[String]]] =>
  values.flatten.toMap
}

val df = Seq((1, Seq("a", "test")), (2, Seq("hi", "there"))).toDF("key", "words")

val rDf = df
  .select(lit(1) as "id", map(col("key"), col("words")) as "kwMap") // one single-entry map per row
  .groupBy("id")                                                    // constant id => a single group
  .agg(collect_list(col("kwMap")) as "kwMaps")                      // gather the row maps into an array
  .select(joinMap(col("kwMaps")) as "map")                          // merge them with the UDF

rDf.show(false)
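If the end goal is an actual Scala Map on the driver rather than a one-row DataFrame, one way to finish (a sketch; getMap(0) reads the first and only column of the single row produced above) is:
// Pull the merged map back to the driver.
// Should yield: Map(1 -> List(a, test), 2 -> List(hi, there))
val result: Map[Int, Seq[String]] = rDf.head.getMap[Int, Seq[String]](0).toMap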