Spark: How to convert array of objects with fields key-value into columns with keys as names-CodePudding

I have a column that contains array of objects as a value.

Objects have the following structure:

[
  {
    "key": "param1",
    "val": "value1"
  },
  {
    "key": "param2",
    "val": "value2"
  },
  {
    "key": "param3",
    "val": "value3"
  }
]

someColumn	colName
text	`[{key: "param1", val: "value1"}, {key: "param2", val: "value2"}, {key: "param3", val: "value3"}]`

When I do:

df.withColumn("exploded", explode(col("colName")))

I get

someColumn	exploded
text	`{key: "param1", val: "value1"}`
text	`{key: "param2", val: "value2"}`
text	`{key: "param3", val: "value3"}`

Then I do next:

df.select("*", "exploded.*").drop("exploded")

I get this:

someColumn	key	value
text	param1	value1
text	param2	value2
text	param3	value3

I understand why I get such result but I need to get other structure.

I want to get next result:

someColumn	param1	param2	param3
text	value1	value2	value3

Maybe do I have to transform array of Object[key, value] to Map and then to transform Map to Columns? What is the sequence of transformations I have to do?

CodePudding user response：

Once you explode your dataset, you can:

df = df.groupBy("someColumn").pivot("exploded.key").agg(first("exploded.val"))

This is created from the above statement:

 ---------- ------ ------ ------ 
|someColumn|param1|param2|param3|
 ---------- ------ ------ ------ 
|text      |value1|value2|value3|
 ---------- ------ ------ ------

which is what you like!

CodePudding user response：

I found one more solution:

val mappedDF = df
      .select(
        $"*",
        col("ColName").getField("key").as("keys"),
        col("ColName").getField("val").as("values")
      )
      .drop("ColName")
      .select(
        $"*",
        map_from_arrays($"keys", $"values").as("ColName")
      )

    val keysDF = mappedDF.select(explode(map_keys($"CalName"))).distinct()
    val keys = keysDF.collect().map(f=>f.get(0))
    val keyCols = keys.map(f=> col("CalName").getItem(f).as(f.toString))
    mappedDF.select(col("*")  : keyCols:_*).drop("CalName")

This solution work faster than pivot. But I'm not sure that it's the best solution.

BTW If we know list of keys and this list is fixed, this solution becomes more faster because we don't have to get list of keys from DF.

I wrote more universal code when we need to group by a few cols. In my post I simplified example for understanding purposes.