Spark: How to convert array of objects with fields key-value into columns with keys as names


I have a column that contains an array of objects as its value.

The objects have the following structure:

[
  {
    "key": "param1",
    "val": "value1"
  },
  {
    "key": "param2",
    "val": "value2"
  },
  {
    "key": "param3",
    "val": "value3"
  }
]
+----------+------------------------------------------------------------------------------------------------+
|someColumn|colName                                                                                         |
+----------+------------------------------------------------------------------------------------------------+
|text      |[{key: "param1", val: "value1"}, {key: "param2", val: "value2"}, {key: "param3", val: "value3"}]|
+----------+------------------------------------------------------------------------------------------------+
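
For reference, a minimal Scala sketch that reproduces this input DataFrame (the construction itself is an assumption for illustration; only the resulting schema matters):

import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical reproduction of the input: one row with a plain string column
// and an array-of-structs column of {key, val} pairs.
val df = Seq(
  ("text", Seq(("param1", "value1"), ("param2", "value2"), ("param3", "value3")))
).toDF("someColumn", "colName")
  // Rename the tuple fields _1/_2 to key/val to match the schema above.
  .withColumn("colName", expr("transform(colName, x -> named_struct('key', x._1, 'val', x._2))"))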

When I do:

df.withColumn("exploded", explode(col("colName")))

I get

+----------+------------------------------+
|someColumn|exploded                      |
+----------+------------------------------+
|text      |{key: "param1", val: "value1"}|
|text      |{key: "param2", val: "value2"}|
|text      |{key: "param3", val: "value3"}|
+----------+------------------------------+

Then I do the following:

df.select("*", "exploded.*").drop("exploded")

I get this:

+----------+------+------+
|someColumn|key   |val   |
+----------+------+------+
|text      |param1|value1|
|text      |param2|value2|
|text      |param3|value3|
+----------+------+------+

I understand why I get this result, but I need a different structure.

I want to get the following result:

+----------+------+------+------+
|someColumn|param1|param2|param3|
+----------+------+------+------+
|text      |value1|value2|value3|
+----------+------+------+------+

Maybe I have to transform the array of {key, val} objects into a map and then transform the map into columns? What sequence of transformations do I need?

CodePudding user response:

Once you explode your dataset, you can:

df = df.groupBy("someColumn").pivot("exploded.key").agg(first("exploded.val"))

The statement above produces:

+----------+------+------+------+
|someColumn|param1|param2|param3|
+----------+------+------+------+
|text      |value1|value2|value3|
+----------+------+------+------+

which is what you want!
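
Putting it together, a minimal self-contained sketch of this explode-plus-pivot flow in Scala (assuming df has the someColumn and colName columns from the question):

import org.apache.spark.sql.functions._

// Explode the array so each {key, val} struct gets its own row,
// then pivot the keys into columns, keeping the first value per key.
val pivoted = df
  .withColumn("exploded", explode(col("colName")))
  .groupBy("someColumn")
  .pivot("exploded.key")
  .agg(first("exploded.val"))

pivoted.show(false)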

CodePudding user response:

I found one more solution:

import org.apache.spark.sql.functions._
import spark.implicits._

// Build a map column from the parallel arrays of keys and values.
val mappedDF = df
  .select(
    $"*",
    col("colName").getField("key").as("keys"),
    col("colName").getField("val").as("values")
  )
  .drop("colName")
  .select(
    $"*",
    map_from_arrays($"keys", $"values").as("colName")
  )
  .drop("keys", "values")

// Collect the distinct keys, then turn each key into its own column.
val keysDF = mappedDF.select(explode(map_keys($"colName"))).distinct()
val keys = keysDF.collect().map(f => f.get(0))
val keyCols = keys.map(f => col("colName").getItem(f).as(f.toString))
mappedDF.select(col("*") +: keyCols: _*).drop("colName")

This solution works faster than the pivot approach, but I'm not sure it's the best solution.

BTW, if we know the list of keys and the list is fixed, this solution becomes even faster, because we don't have to collect the list of keys from the DataFrame.
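
For instance, a sketch of that fixed-key variant, reusing mappedDF from above (the hardcoded key list is assumed from the question's sample data):

import org.apache.spark.sql.functions._

// With a known, fixed key set there is no extra Spark job to collect
// the distinct keys before building the columns.
val keys = Seq("param1", "param2", "param3")
val keyCols = keys.map(k => col("colName").getItem(k).as(k))
val result = mappedDF.select(col("*") +: keyCols: _*).drop("colName")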

I wrote more general code for the case where we need to group by several columns; in my post I simplified the example for clarity.
