Home > Net >  Pyspark - Expand column with struct of arrays into new columns
Pyspark - Expand column with struct of arrays into new columns

Time:01-27

I have a DataFrame with a single column which is a struct type and contains an array.

users_tp_df.printSchema()
root
 |-- x: struct (nullable = true)
 |    |-- ActiveDirectoryName: string (nullable = true)
 |    |-- AvailableFrom: string (nullable = true)
 |    |-- AvailableFutureAllocation: long (nullable = true)
 |    |-- AvailableFutureHours: double (nullable = true)
 |    |-- CreateDate: string (nullable = true)
 |    |-- CurrentAllocation: long (nullable = true)
 |    |-- CurrentAvailableHours: double (nullable = true)
 |    |-- CustomFields: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Name: string (nullable = true)
 |    |    |    |-- Type: string (nullable = true)
 |    |    |    |-- Value: string (nullable = true)

I'm trying to convert the CustomFields array column in 3 three columns:

  1. Country;
  2. isExternal;
  3. Service.

So for example, I've these values:

enter image description here

and the final dataframe output excepted for that row will be:

enter image description here

Can anyone please help me in achieving this?

Thank you!

CodePudding user response:

This would work:

initial_expansion= df.withColumn("id", F.monotonically_increasing_id()).select("id","x.*");

final_df = initial_expansion\
     .join(initial_expansion.withColumn("CustomFields", F.explode("CustomFields"))\
           .select("*", "CustomFields.*")\
           .groupBy("id").pivot("Name").agg(F.first("Value")), \
        "id").drop("CustomFields")

Sample Input: Json - {'x': {'CurrentAvailableHours': 2, 'CustomFields': [{'Name': 'Country', 'Value': 'Italy'}, {'Name': 'Service', 'Value':'Dev'}]}}

Input

Input Structure:

root
 |-- x: struct (nullable = true)
 |    |-- CurrentAvailableHours: integer (nullable = true)
 |    |-- CustomFields: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Name: string (nullable = true)
 |    |    |    |-- Value: string (nullable = true)

Output: Output

Output Structure (Id can be dropped):

root
 |-- id: long (nullable = false)
 |-- CurrentAvailableHours: integer (nullable = true)
 |-- Country: string (nullable = true)
 |-- Service: string (nullable = true)

CodePudding user response:

Considering the mockup structure below, similar with the one from your example, you can do it the sql way by using the inline function:

with alpha as (
select named_struct("alpha", "abc", "beta", 2.5, "gamma", 3, "delta"
                    , array(  named_struct("a", "x", "b", "y", "c", "z")
                            , named_struct("a", "xx", "b", "yy", "c","zz"))
                   ) root
)
select root.alpha, root.beta, root.gamma, inline(root.delta) as (a, b, c) 
from alpha

The result: enter image description here

Mockup structure:

mockup structure

  • Related