Home > Back-end >  How to merge multiple rows removing duplicates and concatenate other column values
How to merge multiple rows removing duplicates and concatenate other column values

Time:05-25

I have the following dataframe containing 4 columns (A,B,C,D):

df =

A B C D
123 001 ABC DEF
123 001 ABC DEG
256 002 XXY DSA
256 002 XXX DSA

From the above dataframe I would like to group by column A & B and aggregate and concatenate by semicolon ';' by column C & D while making sure that repetitive values are not being concatenated. Below is the expected result.

df_agg =

A B C D
123 001 ABC DEF;DEG;
256 002 XXY;XXX; DSA

I've currently have the groupby function below, but it does not seem to remove the repetitive strings and rather concatenates the entire string.

df_agg = df.groupby(groupby_list).agg( *[array_join(collect_list(column), ";").alias(column) for column in df.columns if column not in groupby_list] )

In addition, how do I make sure to add ';' after the end of the loop?

CodePudding user response:

You need to aggregate using collect_set - it removes duplicates. array_join adds delimiter between elements. concat concatenates the result with either ; or empty string at the end.

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(123, '001', 'ABC', 'DEF'),
     (123, '001', 'ABC', 'DEG'),
     (256, '002', 'XXY', 'DSA'),
     (256, '002', 'XXX', 'DSA')],
    ['A', 'B', 'C', 'D']
)

df_agg = df.groupBy('A', 'B').agg(
    *[F.concat(F.array_join(F.collect_set(c), ';'),
             F.when(F.size(F.collect_set(c)) > 1, ';').otherwise('')).alias(c)
     for c in {'C', 'D'}]
)
df_agg.show()
#  --- --- -------- -------- 
# |  A|  B|       C|       D|
#  --- --- -------- -------- 
# |123|001|     ABC|DEF;DEG;|
# |256|002|XXX;XXY;|     DSA|
#  --- --- -------- -------- 

CodePudding user response:

groupby and collect_set, set only takes unique values. Use array_join function to concatenate the array elements with a desired seperator. Code below

df.groupby('A','B').agg(*[array_join(collect_set(x),';').alias(x) for x in df.drop('A','B').columns]).show(truncate=0)

 --- --- ------- ------- 
|A  |B  |C      |D      |
 --- --- ------- ------- 
|123|001|ABC    |DEF;DEG|
|256|002|XXX;XXY|DSA    |
 --- --- ------- ------- 
  • Related