Home > Software design >  Explode struct column which isn't array in Pyspark
Explode struct column which isn't array in Pyspark

Time:12-24

I got a struct column (from json) with this schema:

struct_col_name
 |    |-- 2053771759: struct (nullable = true)
 |    |    |-- col1: long (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col3: long (nullable = true)
 |    |    |-- col4: string (nullable = true)
 |    |-- 2053771760: struct (nullable = true)
 |    |    |-- col1: long (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col3: long (nullable = true)
 |    |    |-- col4: string (nullable = true)
 |    |-- 2053771761: struct (nullable = true)
 |    |    |-- col1: long (nullable = true)
 |    |    |-- col2: string (nullable = true)
 |    |    |-- col3: long (nullable = true)
 |    |    |-- col4: string (nullable = true)

Since all of the inner strcuts have the same fields, I want to convert the schema to something like this, and add the id (2053771759 for example) as a field for each element.

By doing it I'll be able to explode the column into rows.

struct_col_name
 |    |-- element: struct: struct (nullable = true)
 |    |    |    |-- col1: long (nullable = true)
 |    |    |    |-- col2: string (nullable = true)
 |    |    |    |-- col3: long (nullable = true)
 |    |    |    |-- col4: string (nullable = true)
 |    |    |    |-- id: long (nullable = true)

Any idea how can I do it? Or explode the column by any other way?

CodePudding user response:

First, you can get the list of IDs by using the schema of df.select("struct_col_name.*") dataframe. Then, you can loop over that list to update each struct by adding the id field to the existing fields and create an array of struct column. Finally, explode the array column to get one struct by row.

Something like this:

from pyspark.sql import functions as F

inner_fields = ["col1", "col2", "col3", "col4"]
ids = df.select("struct_col_name.*").columns

df = df.select(
    F.explode(F.array(*[
        F.struct(*[
            F.col(f"struct_col_name.{i}.{c}") for c in inner_fields
        ], F.lit(i).alias(i))
        for i in ids
    ])).alias("struct_col_name")
)
  • Related