What's the easiest way to explode/flatten a deeply nested struct using PySpark?


I have an example dataset:

+---+------------------------------+
|id |example_field                 |
+---+------------------------------+
|1  |{[{[{111, AAA}, {222, BBB}]}]}|
+---+------------------------------+

The data types of the two fields are:

[('id', 'int'),

My question is whether there is a way or function to flatten the field example_field using PySpark.

My expected output is something like this:

id  field_1 field_2
1   111     AAA
1   222     BBB

CodePudding user response:

The following code should do the trick:

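from pyspark.sql import functions as F

df = (
    df
    .withColumn('_temp_ef', F.explode('example_field.xxx'))      # one row per element of xxx
    .withColumn('_temp_nf', F.explode('_temp_ef.nested_field'))  # one row per element of nested_field
    .select('id', '_temp_nf.*')                                  # expand the struct into columns
)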

The function explode creates one row for each element of an array, while select with '_temp_nf.*' turns the fields of the nested_field struct into top-level columns.
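
To make the row-level effect concrete, here is a rough plain-Python model of what the two explode calls and the final select do to the sample row. It is only an illustration of the semantics, not how Spark executes anything; the helper name flatten_row is made up, and the keys xxx and nested_field are the placeholder names from the schema assumed in this answer.

```python
# Sample row modelled as nested dicts/lists (structs -> dicts, arrays -> lists).
# 'xxx' and 'nested_field' are the assumed placeholder field names.
row = {
    "id": 1,
    "example_field": {
        "xxx": [
            {"nested_field": [
                {"field_1": 111, "field_2": "AAA"},
                {"field_1": 222, "field_2": "BBB"},
            ]}
        ]
    },
}


def flatten_row(row):
    """Hypothetical helper mimicking explode -> explode -> select('id', 'struct.*')."""
    out = []
    for ef in row["example_field"]["xxx"]:       # first explode: one row per element of xxx
        for nf in ef["nested_field"]:            # second explode: one row per inner element
            out.append({"id": row["id"], **nf})  # select: keep id, expand struct fields
    return out


flat_rows = flatten_row(row)
for r in flat_rows:
    print(r)
```

Each level of array nesting corresponds to one loop here and to one explode call in the PySpark version.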

The result is:

+---+-------+-------+
|id |field_1|field_2|
+---+-------+-------+
|1  |111    |AAA    |
|1  |222    |BBB    |
+---+-------+-------+

Note: I assumed that your DataFrame is something like this:

 |-- id: integer (nullable = true)
 |-- example_field: struct (nullable = true)
 |    |-- xxx: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- nested_field: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- field_1: integer (nullable = true)
 |    |    |    |    |    |-- field_2: string (nullable = true)
