I have a column that contains array of objects as a value.
Objects have the following structure:
[
{
"key": "param1",
"val": "value1"
},
{
"key": "param2",
"val": "value2"
},
{
"key": "param3",
"val": "value3"
}
]
someColumn | colName |
---|---|
text | [{key: "param1", val: "value1"}, {key: "param2", val: "value2"}, {key: "param3", val: "value3"}] |
When I do:
df.withColumn("exploded", explode(col("colName")))
I get
someColumn | exploded |
---|---|
text | {key: "param1", val: "value1"} |
text | {key: "param2", val: "value2"} |
text | {key: "param3", val: "value3"} |
Then I do next:
df.select("*", "exploded.*").drop("exploded")
I get this:
someColumn | key | value |
---|---|---|
text | param1 | value1 |
text | param2 | value2 |
text | param3 | value3 |
I understand why I get such result but I need to get other structure.
I want to get next result:
someColumn | param1 | param2 | param3 |
---|---|---|---|
text | value1 | value2 | value3 |
Maybe do I have to transform array of Object[key, value] to Map and then to transform Map to Columns? What is the sequence of transformations I have to do?
CodePudding user response:
Once you explode your dataset, you can:
df = df.groupBy("someColumn").pivot("exploded.key").agg(first("exploded.val"))
This is created from the above statement:
---------- ------ ------ ------
|someColumn|param1|param2|param3|
---------- ------ ------ ------
|text |value1|value2|value3|
---------- ------ ------ ------
which is what you like!
CodePudding user response:
I found one more solution:
val mappedDF = df
.select(
$"*",
col("ColName").getField("key").as("keys"),
col("ColName").getField("val").as("values")
)
.drop("ColName")
.select(
$"*",
map_from_arrays($"keys", $"values").as("ColName")
)
val keysDF = mappedDF.select(explode(map_keys($"CalName"))).distinct()
val keys = keysDF.collect().map(f=>f.get(0))
val keyCols = keys.map(f=> col("CalName").getItem(f).as(f.toString))
mappedDF.select(col("*") : keyCols:_*).drop("CalName")
This solution work faster than pivot
. But I'm not sure that it's the best solution.
BTW If we know list of keys and this list is fixed, this solution becomes more faster because we don't have to get list of keys from DF.
I wrote more universal code when we need to group by a few cols. In my post I simplified example for understanding purposes.