spark/scala: How to merge values of a single key into a column-CodePudding

I have a dataset like below

 ------- -------- ---------- ---- ----------  
|cytid  |itemcode|percentile|mean|DTXN      |
 --------------- ----------- --- ---------- 
|800    |   7   |        x4| y# |2/1/2017  |
|700    |   13  |        y#| oz |3/1/2017  |
|800    |   24  |        oz| oz |8/1/2017  |
|900    |   747 |        oz| IH |27/9/2017 |
|700    |   798 |        IH| yc |19/10/2017| 
|800    |   841 |        yc| g5 |8/11/2017 |
|800    |   867 |        g5| IH |19/11/2017| 
|900    |   911 |        IH| x4 |2/12/2017 |
|800    |   945 |        x4| ko |13/12/2017| 
|800    |   984 |        ko| cd |27/12/2017| 
 ------- ------- ---------- ---- ----------

I need to produce an output like below

 ------- -------------------------------------- 
|cytid  |             ipm                      |
 ---------------------------------------------- 
|800    |   x4,y#,oz,oz,yc,g5,g5,IH,x4,ko,ko,cd|
|700    |   y#,oz,IH,yc                        |
|900    |   oz,IH,IH,x4                        |
 ------- --------------------------------------

To be precise i need to merge the columns itemcode, percentile, mean. So how to do this in scala

CodePudding user response：

Given your input called data as below:

 ----- -------- ---------- ---- 
|cytid|itemcode|percentile|mean|
 ----- -------- ---------- ---- 
|800  |7       |x4        |y#  |
|700  |13      |y#        |oz  |
|800  |24      |oz        |oz  |
|900  |747     |oz        |IH  |
|700  |798     |IH        |yc  |
|800  |841     |yc        |g5  |
|800  |867     |g5        |IH  |
|900  |911     |IH        |x4  |
|800  |945     |x4        |ko  |
|800  |984     |ko        |cd  |
 ----- -------- ---------- ----

You can achieve this output (a bit different from yours, because you did not take into account item_code) through:

data = data.groupBy("cytid")
  .agg(collect_set(array("itemcode", "percentile", "mean")).as("ipm")
  )
  .withColumn("ipm", concat_ws(",", flatten(col("ipm"))))

Good luck!