I have a dataset like below
------- -------- ---------- ---- ----------
|cytid |itemcode|percentile|mean|DTXN |
--------------- ----------- --- ----------
|800 | 7 | x4| y# |2/1/2017 |
|700 | 13 | y#| oz |3/1/2017 |
|800 | 24 | oz| oz |8/1/2017 |
|900 | 747 | oz| IH |27/9/2017 |
|700 | 798 | IH| yc |19/10/2017|
|800 | 841 | yc| g5 |8/11/2017 |
|800 | 867 | g5| IH |19/11/2017|
|900 | 911 | IH| x4 |2/12/2017 |
|800 | 945 | x4| ko |13/12/2017|
|800 | 984 | ko| cd |27/12/2017|
------- ------- ---------- ---- ----------
I need to produce an output like below
------- --------------------------------------
|cytid | ipm |
----------------------------------------------
|800 | x4,y#,oz,oz,yc,g5,g5,IH,x4,ko,ko,cd|
|700 | y#,oz,IH,yc |
|900 | oz,IH,IH,x4 |
------- --------------------------------------
To be precise i need to merge the columns itemcode, percentile, mean. So how to do this in scala
CodePudding user response:
Given your input called data
as below:
----- -------- ---------- ----
|cytid|itemcode|percentile|mean|
----- -------- ---------- ----
|800 |7 |x4 |y# |
|700 |13 |y# |oz |
|800 |24 |oz |oz |
|900 |747 |oz |IH |
|700 |798 |IH |yc |
|800 |841 |yc |g5 |
|800 |867 |g5 |IH |
|900 |911 |IH |x4 |
|800 |945 |x4 |ko |
|800 |984 |ko |cd |
----- -------- ---------- ----
You can achieve this output (a bit different from yours, because you did not take into account item_code
) through:
data = data.groupBy("cytid")
.agg(collect_set(array("itemcode", "percentile", "mean")).as("ipm")
)
.withColumn("ipm", concat_ws(",", flatten(col("ipm"))))
Good luck!