Scala spark: extract columns from a schema


I have a schema that looks like the following:

 |-- contributors: map (nullable = true)
 |    |-- key: string
 |    |-- value: array (valueContainsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- id: string (nullable = true)
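
For reference, a minimal sketch of that schema expressed as a Spark StructType, reconstructed from the printed tree above:

import org.apache.spark.sql.types._

// Struct held inside the array values of the map.
val contributorStruct = StructType(Seq(
  StructField("type", StringType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("id", StringType, nullable = true)
))

val schema = StructType(Seq(
  StructField(
    "contributors",
    MapType(StringType, ArrayType(contributorStruct, containsNull = true), valueContainsNull = true),
    nullable = true
  )
))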

I would like to have a dataframe that has the columns key, name and id.

I have used the following code to get name and id, but how do I get the key column?

df.select(explode(col("contributors")))
  .select(explode(col("value")))
  .select(col("col.*"))

CodePudding user response:

You can add the key column in your second and third select: when explode is applied to a map column, it produces two columns named key and value, and the select method of a dataframe accepts several columns as arguments.

You should modify your code as follows:

import org.apache.spark.sql.functions.{col, explode}

df.select(explode(col("contributors")))        // map -> one row per entry, with columns key and value
  .select(col("key"), explode(col("value")))   // carry key along, explode the array into a column named col
  .select(col("key"), col("col.*"))            // carry key along, flatten the struct fields

With the following contributors input column:

+--------------------------------------------------------------------------------------------+
|contributors                                                                                |
+--------------------------------------------------------------------------------------------+
|{key1 -> [{type11, name11, id11}, {type12, name12, id12}], key2 -> [{type21, name21, id21}]}|
|{key3 -> [{type31, name31, id31}, {type32, name32, id32}], key4 -> []}                      |
+--------------------------------------------------------------------------------------------+
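
For reproducibility, here is one way such a dataframe could be built; the Contributor case class is an assumption mirroring the struct in the schema above (type needs backticks because it is a reserved word in Scala):

import org.apache.spark.sql.SparkSession

case class Contributor(`type`: String, name: String, id: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(
  Map(
    "key1" -> Seq(Contributor("type11", "name11", "id11"),
                  Contributor("type12", "name12", "id12")),
    "key2" -> Seq(Contributor("type21", "name21", "id21"))
  ),
  Map(
    "key3" -> Seq(Contributor("type31", "name31", "id31"),
                  Contributor("type32", "name32", "id32")),
    "key4" -> Seq.empty[Contributor]
  )
).toDF("contributors")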

You get the following output:

+----+------+------+----+
|key |type  |name  |id  |
+----+------+------+----+
|key1|type11|name11|id11|
|key1|type12|name12|id12|
|key2|type21|name21|id21|
|key3|type31|name31|id31|
|key3|type32|name32|id32|
+----+------+------+----+
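
Note that key4 does not appear in the output: the second explode drops map entries whose value array is empty. If you need to keep those keys as rows with null fields, a minimal sketch using explode_outer:

import org.apache.spark.sql.functions.{col, explode_outer}

// explode_outer keeps rows whose map or array is empty/null,
// producing nulls in the generated columns instead of dropping the row.
df.select(explode_outer(col("contributors")))
  .select(col("key"), explode_outer(col("value")))
  .select(col("key"), col("col.*"))

With the same input, key4 then appears as a row with null type, name and id.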

If you want to keep only the name and id columns from value, you should also modify the last select to pick only col.name and col.id:

import org.apache.spark.sql.functions.{col, explode}

df.select(explode(col("contributors")))
  .select(col("key"), explode(col("value")))
  .select(col("key"), col("col.name"), col("col.id"))

With the same contributors input, you get your expected output:

+----+------+----+
|key |name  |id  |
+----+------+----+
|key1|name11|id11|
|key1|name12|id12|
|key2|name21|id21|
|key3|name31|id31|
|key3|name32|id32|
+----+------+----+