How can I pair the elements of consecutive rows within each group?
id  title  comp
1   'A'    45
1   'B'    32
1   'C'    1
2   'D'    5
2   'F'    6
I want to pair rows that share the same 'id'.
output:
id  title    comp
1   'A','B'  45,32
1   'B','C'  32,1
2   'D','F'  5,6
CodePudding user response:
Use a window function: for each row, collect a list containing that row and its immediate successor in the target columns. Convert the resulting arrays to strings with array_join to remove the array brackets. The last row of each group collects only itself, so filter out rows whose list has fewer than two elements.
from pyspark.sql import functions as F
from pyspark.sql import Window

# Pair each row with the next row in the same id group, ordered by title
win = Window.partitionBy('id').orderBy('title').rowsBetween(0, 1)
df.select(
    "id",
    *[F.array_join(F.collect_list(c).over(win), ',').alias(c) for c in df.drop('id').columns]
).filter(F.col('comp').contains(',')).show()  # keep only rows that collected a pair
+---+-----+-----+
| id|title| comp|
+---+-----+-----+
|  1|  A,B|45,32|
|  1|  B,C| 32,1|
|  2|  D,F|  5,6|
+---+-----+-----+
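For reference, the same "pair each row with the next row inside its id group" logic can be sketched in plain Python without Spark. This is only an illustration of the pairing idea on the sample data; the function name `pair_within_id` is made up for this sketch:

```python
from itertools import groupby

rows = [
    (1, 'A', 45), (1, 'B', 32), (1, 'C', 1),
    (2, 'D', 5), (2, 'F', 6),
]

def pair_within_id(rows):
    """Pair each row with the immediately following row that shares its id.

    Assumes rows are already sorted by (id, title), as in the sample data.
    """
    out = []
    for id_, grp in groupby(rows, key=lambda r: r[0]):
        grp = list(grp)
        # zip the group with itself shifted by one: the last row has no
        # successor, so it drops out automatically (no filter step needed)
        for cur, nxt in zip(grp, grp[1:]):
            out.append((id_, f"{cur[1]},{nxt[1]}", f"{cur[2]},{nxt[2]}"))
    return out

print(pair_within_id(rows))
# → [(1, 'A,B', '45,32'), (1, 'B,C', '32,1'), (2, 'D,F', '5,6')]
```

Note that here the lone last row of each group simply produces no pair, which is what the `rowsBetween(0, 1)` window plus the filter achieves in the Spark version.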