I have a dataframe that looks like this:
+---+------+------+---+--------+
| Id|fomrid|values|occ|comments|
+---+------+------+---+--------+
|  1|    x1|  22.0|  1|   text1|
|  1|    x1|  test|  2|   text2|
|  1|    x1|    11|  3|   text3|
|  1|    x2|    21|  0|   text4|
|  2|    p1|     1|  1|   text5|
+---+------+------+---+--------+
How can I transform it into the dataframe below? Essentially, I want to collect values and occ into lists for each Id and fomrid.
+---+------+---------------+--------+--------+
| Id|fomrid|    List_values|List_occ|comments|
+---+------+---------------+--------+--------+
|  1|    x1|[22.0, test,11]| [1,2,3]|   text1|
|  1|    x2|           [21]|     [0]|   text4|
|  2|    p1|            [1]|     [1]|   text5|
+---+------+---------------+--------+--------+
CodePudding user response:
You may use collect_list to achieve this.
Using Spark SQL
Create a temporary view and run the SQL below on your Spark session:
input_df.createOrReplaceTempView("my_temp_table_or_view")
output_df = sparkSession.sql("<insert sql below here>")
SELECT
    Id,
    fomrid,
    collect_list(values) as List_values,
    collect_list(occ) as List_occ,
    MIN(comments) as comments
FROM
    my_temp_table_or_view
GROUP BY
    Id, fomrid
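Since comments is not part of the GROUP BY key, it has to be aggregated as well; MIN(comments) keeps text1 for the (1, x1) group, which matches your expected output. Here is a minimal end-to-end sketch of this approach, assuming a SparkSession obtained via getOrCreate() and sample data rebuilt from the question (the variable names are only illustrative):

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.getOrCreate()

# Rebuild the sample input from the question. `values` stays a string column
# because it mixes numbers (22.0, 11) with text ("test").
input_df = sparkSession.createDataFrame(
    [
        (1, "x1", "22.0", 1, "text1"),
        (1, "x1", "test", 2, "text2"),
        (1, "x1", "11", 3, "text3"),
        (1, "x2", "21", 0, "text4"),
        (2, "p1", "1", 1, "text5"),
    ],
    ["Id", "fomrid", "values", "occ", "comments"],
)

# Register the view and run the aggregation query shown above.
input_df.createOrReplaceTempView("my_temp_table_or_view")
output_df = sparkSession.sql("""
    SELECT
        Id,
        fomrid,
        collect_list(values) as List_values,
        collect_list(occ) as List_occ,
        MIN(comments) as comments
    FROM my_temp_table_or_view
    GROUP BY Id, fomrid
""")

output_df.show(truncate=False)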
Using the PySpark API
from pyspark.sql import functions as F
output_df = (
input_df.groupBy(["Id","fomrid"])
.agg(
F.collect_list("values").alias("List_values"),
F.collect_list("occ").alias("List_occ"),
F.min("comments").alias("comments")
)
)
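As a quick sanity check against the sample input_df built in the SQL sketch above, the result can be inspected as follows; with that data, List_values is an array of strings and List_occ an array of longs. Note that collect_list does not guarantee element order after a shuffle, so on larger data the lists are not necessarily sorted by occ.

# Inspect the aggregated result (reuses the output_df produced just above).
output_df.printSchema()
output_df.show(truncate=False)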
Using Scala
import org.apache.spark.sql.functions.{collect_list, min}

val output_df = input_df.groupBy("Id", "fomrid")
  .agg(
    collect_list("values").alias("List_values"),
    collect_list("occ").alias("List_occ"),
    min("comments").alias("comments")
  )
Let me know if this works for you.