what's the easiest way to explode/flatten deeply nested struct using pyspark?


I have an example dataset:

+---+------------------------------+
|id |example_field                 |
+---+------------------------------+
|1  |{[{[{111, AAA}, {222, BBB}]}]}|
+---+------------------------------+

The data types of the two fields are:

[('id', 'int'),
 ('example_field',
  'struct<xxx:array<struct<nested_field:array<struct<field_1:int,field_2:string>>>>>')]

My question is whether there's a way/function to flatten the field example_field using PySpark.

My expected output is something like this:

id  field_1 field_2
1   111     AAA
1   222     BBB

CodePudding user response:

The following code should do the trick:

from pyspark.sql import functions as F

(
    df
    # one row per element of the example_field.xxx array
    .withColumn('_temp_ef', F.explode('example_field.xxx'))
    # one row per element of the inner nested_field array
    .withColumn('_temp_nf', F.explode('_temp_ef.nested_field'))
    .select(
        'id',
        F.col('_temp_nf.*')  # promote field_1 and field_2 to top-level columns
    )
)

The function explode creates a row for each element in an array, while selecting '_temp_nf.*' turns the fields of the nested_field struct into columns.

The result is:

+---+-------+-------+
|id |field_1|field_2|
+---+-------+-------+
|1  |111    |AAA    |
|1  |222    |BBB    |
+---+-------+-------+

Note: I assumed that your DataFrame is something like this:

root
 |-- id: integer (nullable = true)
 |-- example_field: struct (nullable = true)
 |    |-- xxx: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- nested_field: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- field_1: integer (nullable = true)
 |    |    |    |    |    |-- field_2: string (nullable = true)
