Home > database >  Splitting Dataframe column containing structs into new columns
Splitting Dataframe column containing structs into new columns

Time:07-29

I have a Dataframe named df with following structure:

root
|-- country: string (nullable = true)
|-- competition: string (nullable = true)
|-- competitor: array (nullable = true)
    |-- element: struct (containsNull = true)
    |   |-- name: string (nullable = true)
        |-- time: string (nullable = true)

Which looks like this:

|country|competiton|competitor      |
|___________________________________| 
|USA    |WN        |[{Adam, 9.43}]  |
|China  |FN        |[{John, 9.56}]  |
|China  |FN        |[{Adam, 9.48}]  |
|USA    |MNU       |[{Phil, 10.02}] |
|...    |...       |...             |

I want to pivot (or something similar) the competitor Column into new columns depending on the values in each struct so it looks like this:

|country|competition|Adam|John|Phil  |
|____________________________________| 
|USA    |WN         |9.43|... |...   |
|China  |FN         |9.48|9.56|...   |
|USA    |MNU        |... |... |10.02 |              

The names are unique so if column already exists i dont want to create a new but fill the value in the already created one. There are alot of names so it needs to be done dynamically.

I have a pretty big dataset so i cant use Pandas.

CodePudding user response:

We can extract the struct from the array and then create new columns from that struct. The name column can be pivoted with the time as values.

data_sdf. \
    withColumn('name_time_struct', func.col('competitor')[0]). \
    select('country', 'competition', func.col('name_time_struct.*')). \
    groupBy('country', 'competition'). \
    pivot('name'). \
    agg(func.first('time')). \
    show()

 ------- ----------- ---- ---- ----- 
|country|competition|Adam|John| Phil|
 ------- ----------- ---- ---- ----- 
|    USA|        MNU|null|null|10.02|
|    USA|         WN|9.43|null| null|
|  China|         FN|9.48|9.56| null|
 ------- ----------- ---- ---- ----- 

P.S. this assumes there's just 1 struct in the array.

  • Related