Here is sample data:
id   sub_id   score
A    [1, 4]   [0.9, 0.2]
B    [5, 7]   [0.1, 0.5]
I'd like my resulting column to look like this:
id   sub_id   score        result
A    [1, 4]   [0.9, 0.2]   [Struct{id = 1, score = 0.9}, Struct{id = 4, score = 0.2}]
B    [5, 7]   [0.1, 0.5]   [Struct{id = 5, score = 0.1}, Struct{id = 7, score = 0.5}]
The only way I know how to do this is to:
- Explode both columns
- Create a struct of both exploded columns
- Group by id to create the result column (sketched below)
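A minimal sketch of that explode-based approach, assuming a DataFrame df with the columns shown above (the names pos, sub, s, and entry are just illustrative):

from pyspark.sql import functions as F

# posexplode keeps each element's position, so the matching score can be looked up by index.
exploded = (
    df.select("id", "score", F.posexplode("sub_id").alias("pos", "sub"))
      .withColumn("s", F.expr("score[pos]"))
)

# Build one struct per exploded row, then regroup into an array per id.
# Caveat: collect_list does not guarantee the original element order.
result = (
    exploded
    .withColumn("entry", F.struct(F.col("sub").alias("id"), F.col("s").alias("score")))
    .groupBy("id")
    .agg(F.collect_list("entry").alias("result"))
)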
I'm wondering if there is a more efficient way to do this.
CodePudding user response:
The arrays_zip function zips two (or more) array columns, element by element, into a single array of structs. The struct field names are taken from the input column names or aliases, which is why sub_id is aliased to id below.
from pyspark.sql import functions as F

data = [("A", [1, 4], [0.9, 0.2]),
        ("B", [5, 7], [0.1, 0.5])]
df = spark.createDataFrame(data, ("id", "sub_id", "score"))

# arrays_zip pairs elements by position; the alias sets the struct field name.
result = df.withColumn("result", F.arrays_zip(F.col("sub_id").alias("id"), F.col("score")))
result.printSchema()
result.show()
Output
root
|-- id: string (nullable = true)
|-- sub_id: array (nullable = true)
| |-- element: long (containsNull = true)
|-- score: array (nullable = true)
| |-- element: double (containsNull = true)
|-- result: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- id: long (nullable = true)
| | |-- score: double (nullable = true)
+---+------+----------+--------------------+
| id|sub_id|     score|              result|
+---+------+----------+--------------------+
|  A|[1, 4]|[0.9, 0.2]|[{1, 0.9}, {4, 0.2}]|
|  B|[5, 7]|[0.1, 0.5]|[{5, 0.1}, {7, 0.5}]|
+---+------+----------+--------------------+
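If you need different struct field names, or prefer not to rely on the alias behaviour, one alternative is a transform over the zipped array. A sketch, assuming Spark 3.1+ for the Python lambda form of F.transform (result2 is just an illustrative name):

# Rename the struct fields explicitly instead of via aliases on the inputs.
result2 = df.withColumn(
    "result",
    F.transform(
        F.arrays_zip("sub_id", "score"),
        lambda x: F.struct(x["sub_id"].alias("id"), x["score"].alias("score")),
    ),
)

This produces the same array of structs, but the field names are set explicitly inside the lambda rather than inferred from the input columns.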