How can I change a DataFrame schema in PySpark from this
|-- id: string (nullable = true)
|-- device: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- device_vendor: string (nullable = true)
| | |-- device_name: string (nullable = true)
| | |-- device_manufacturer: string (nullable = true)
to this
|-- id: string (nullable = true)
|-- device_vendor: string (nullable = true)
|-- device_name: string (nullable = true)
|-- device_manufacturer: string (nullable = true)
CodePudding user response:
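First, take the array's first element using element_at (its index is 1-based), then expand all fields of the struct into columns using the * selector. Note that this keeps only the first device of each array.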
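import pyspark.sql.functions as F
df = df.withColumn('d', F.element_at('device', 1))  # element_at uses 1-based indexing
df = df.select('id', 'd.*')  # expand the struct's fields into top-level columns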
CodePudding user response:
Use a combination of explode and the * selector:
import pyspark.sql.functions as F
df_flat = (
    df.withColumn('device_exploded', F.explode('device'))
      .select('id', 'device_exploded.*')
)
df_flat.printSchema()
# root
# |-- id: string (nullable = true)
# |-- device_vendor: string (nullable = true)
# |-- device_name: string (nullable = true)
# |-- device_manufacturer: string (nullable = true)
explode creates a separate record for each element of the array-valued column, repeating the value(s) of the other column(s). The column.* selector turns all fields of the struct-valued column into separate columns.