I have a schema that looks following:
|-- contributors: map (nullable = true)
| | |-- key: string
| | |-- value: array (valueContainsNull = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- type: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- id: string (nullable = true)
I would like to have a dataframe that have the columns key
, name
and id
I have used the following code to get name
and id
but how do I get the column key
?
df.select(explode(col("contributors")))
.select(explode(col("value")))
.select(col("col.*"))
CodePudding user response:
You can add key
column in your second and third select. select
method of dataframe accepts several columns as argument.
You should modify your code as follows:
import org.apache.spark.sql.functions.{col, explode}
df.select(explode(col("contributors")))
.select(col("key"), explode(col("value")))
.select(col("key"), col("col.*"))
With the following contributors
input column:
--------------------------------------------------------------------------------------------
|contributors |
--------------------------------------------------------------------------------------------
|{key1 -> [{type11, name11, id11}, {type12, name12, id12}], key2 -> [{type21, name21, id21}]}|
|{key3 -> [{type31, name31, id31}, {type32, name32, id32}], key4 -> []} |
--------------------------------------------------------------------------------------------
You get the following ouput:
---- ------ ------ ----
|key |type |name |id |
---- ------ ------ ----
|key1|type11|name11|id11|
|key1|type12|name12|id12|
|key2|type21|name21|id21|
|key3|type31|name31|id31|
|key3|type32|name32|id32|
---- ------ ------ ----
if you want to keep only name
and id
columns from value, you should also modify the last select
to select only col.id
and col.name
columns:
import org.apache.spark.sql.functions.{col, explode}
df.select(explode(col("contributors")))
.select(col("key"), explode(col("value")))
.select(col("key"), col("col.name"), col("col.id"))
With the same contributors
column input, you get your expected ouput:
---- ------ ----
|key |name |id |
---- ------ ----
|key1|name11|id11|
|key1|name12|id12|
|key2|name21|id21|
|key3|name31|id31|
|key3|name32|id32|
---- ------ ----