Home > other >  How to split a list of objects to separate columns in pyspark dataframe
How to split a list of objects to separate columns in pyspark dataframe

Time:12-12

I have a column in dataframe as list of objects(Array of structs) like

column: [{key1:value1}, {key2:value2}, {key3:value3}]

I want to split this column into separate columns with key name as column name and value as column value in same row.
Final result like

key1:value1, key2:value2, key3:value3

How to achieve this in pyspark ?

E.g.

Sample data to create dataframe:

my_new_schema = StructType([
    StructField('id', LongType()),
    StructField('countries', ArrayType(StructType([
        StructField('name', StringType()),
        StructField('capital', StringType())
    ])))
])
l = [(1, [
        {'name': 'Italy', 'capital': 'Rome'},
        {'name': 'Spain', 'capital': 'Madrid'}
    ])
]
    
dz = spark.createDataFrame(l, schema=my_new_schema)
# we have array of structs:
dz.show(truncate=False)
 --- -------------------------------- 
|id |countries                       |
 --- -------------------------------- 
|1  |[{Italy, Rome}, {Spain, Madrid}]|
 --- -------------------------------- 

Expected output:

 --- -------- --------- 
|id |Italy   |  Spain  |
 --- ------------------ 
|1  |Rome    | Madrid  |
 --- -------- --------- 

CodePudding user response:

inline the countries array then pivot the country name column:

import pyspark.sql.functions as F

dz1 = dz.selectExpr(
    "id", 
    "inline(countries)"
).groupBy("id").pivot("name").agg(
    F.first("capital")
)

dz1.show()
# --- ----- ------ 
#|id |Italy|Spain |
# --- ----- ------ 
#|1  |Rome |Madrid|
# --- ----- ------ 
  • Related