PySpark: Convert two array type columns to a single column of type ArrayType(Struct())


Here is sample data:

id sub_id       score      
A  [1, 4]   [0.9, 0.2]
B  [5, 7]   [0.1, 0.5]

I'd like my resulting column to look like this:

id sub_id       score        result
A  [1, 4]   [0.9, 0.2]   [Struct{id = 1, score = 0.9}, Struct{id = 4 , score = 0.2}]
B  [5, 7]   [0.1, 0.5]   [Struct{id = 5, score = 0.1}, Struct{id = 7 , score = 0.5}]

The only way I know how to do this is to:

  1. Explode both columns.
  2. Create a struct of both exploded columns.
  3. Group by id to create the result column.

I'm wondering if there is a more efficient way to do this.

CodePudding user response:

The arrays_zip function zips two (or more) array columns element by element and produces a single array of structs.
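Semantically, the pairing works much like Python's built-in zip applied per row; a plain-Python sketch of the idea (the helper name is made up for illustration, this is not Spark API):

```python
# Hypothetical helper mimicking arrays_zip's element-wise pairing for one row.
def zip_to_structs(sub_id, score):
    # Pair the i-th element of each array, as arrays_zip does per row.
    return [{"id": i, "score": s} for i, s in zip(sub_id, score)]

zip_to_structs([1, 4], [0.9, 0.2])
# → [{'id': 1, 'score': 0.9}, {'id': 4, 'score': 0.2}]
```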


from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [("A", [1, 4], [0.9, 0.2]),
        ("B", [5, 7], [0.1, 0.5])]

df = spark.createDataFrame(data, ("id", "sub_id", "score"))

# Aliasing sub_id as "id" sets that field's name inside the zipped structs.
result = df.withColumn("result", F.arrays_zip(F.col("sub_id").alias("id"), F.col("score")))

result.printSchema()
result.show()

Output

root
 |-- id: string (nullable = true)
 |-- sub_id: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- score: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- result: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- id: long (nullable = true)
 |    |    |-- score: double (nullable = true)



+---+------+----------+--------------------+
| id|sub_id|     score|              result|
+---+------+----------+--------------------+
|  A|[1, 4]|[0.9, 0.2]|[{1, 0.9}, {4, 0.2}]|
|  B|[5, 7]|[0.1, 0.5]|[{5, 0.1}, {7, 0.5}]|
+---+------+----------+--------------------+