I have an example dataset:
+---+------------------------------+
|id |example_field                 |
+---+------------------------------+
|1  |{[{[{111, AAA}, {222, BBB}]}]}|
+---+------------------------------+
The data types of the two fields are:
[('id', 'int'),
('example_field',
'struct<xxx:array<struct<nested_field:array<struct<field_1:int,field_2:string>>>>>')]
My question: is there a way/function to flatten the field example_field using PySpark?
My expected output is something like this:
id  field_1  field_2
1   111      AAA
1   222      BBB
CodePudding user response:
The following code should do the trick:
from pyspark.sql import functions as F

(
    df
    # Explode the outer array example_field.xxx: one row per element.
    .withColumn('_temp_ef', F.explode('example_field.xxx'))
    # Explode the inner array nested_field: one row per {field_1, field_2} struct.
    .withColumn('_temp_nf', F.explode('_temp_ef.nested_field'))
    # Expand the struct's fields into top-level columns.
    .select(
        'id',
        F.col('_temp_nf.*')
    )
)
The function explode creates a row for each element in an array, while selecting _temp_nf.* turns the fields of the nested_field struct into columns.
The result is:
+---+-------+-------+
|id |field_1|field_2|
+---+-------+-------+
|1  |111    |AAA    |
|1  |222    |BBB    |
+---+-------+-------+
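As a side note, the two explodes can also be shortened with Spark's SQL generator function inline, which explodes an array of structs and turns the struct fields into columns in one go. A minimal sketch of that alternative (inline is called through F.expr here; available as a SQL function since Spark 2.0):

from pyspark.sql import functions as F

(
    df
    # One row per element of the outer array example_field.xxx.
    .select('id', F.explode('example_field.xxx').alias('_temp_ef'))
    # inline() explodes the inner array of structs and flattens
    # field_1/field_2 into top-level columns in a single step.
    .select('id', F.expr('inline(_temp_ef.nested_field)'))
)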
Note: I assumed that your DataFrame is something like this:
root
 |-- id: integer (nullable = true)
 |-- example_field: struct (nullable = true)
 |    |-- xxx: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- nested_field: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- field_1: integer (nullable = true)
 |    |    |    |    |    |-- field_2: string (nullable = true)
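For completeness, a minimal sketch that builds a DataFrame with this exact schema, in case you want to reproduce the example (assumes an active SparkSession; the DDL schema string is taken from the dtypes output above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Nested tuples mirror the nesting: struct -> array -> struct -> array -> struct.
data = [(1, ([([(111, 'AAA'), (222, 'BBB')],)],))]

df = spark.createDataFrame(
    data,
    'id int, example_field struct<xxx:array<struct<'
    'nested_field:array<struct<field_1:int,field_2:string>>>>>'
)

df.printSchema()  # prints the tree shown above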