I have the following dataframe containing 4 columns (A,B,C,D):
df =
A | B | C | D |
---|---|---|---|
123 | 001 | ABC | DEF |
123 | 001 | ABC | DEG |
256 | 002 | XXY | DSA |
256 | 002 | XXX | DSA |
From the above dataframe I would like to group by column A & B and aggregate and concatenate by semicolon ';' by column C & D while making sure that repetitive values are not being concatenated. Below is the expected result.
df_agg =
A | B | C | D |
---|---|---|---|
123 | 001 | ABC | DEF;DEG; |
256 | 002 | XXY;XXX; | DSA |
I've currently have the groupby function below, but it does not seem to remove the repetitive strings and rather concatenates the entire string.
df_agg = df.groupby(groupby_list).agg( *[array_join(collect_list(column), ";").alias(column) for column in df.columns if column not in groupby_list] )
In addition, how do I make sure to add ';' after the end of the loop?
CodePudding user response:
You need to aggregate using collect_set
- it removes duplicates. array_join
adds delimiter between elements. concat
concatenates the result with either ;
or empty string at the end.
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(123, '001', 'ABC', 'DEF'),
(123, '001', 'ABC', 'DEG'),
(256, '002', 'XXY', 'DSA'),
(256, '002', 'XXX', 'DSA')],
['A', 'B', 'C', 'D']
)
df_agg = df.groupBy('A', 'B').agg(
*[F.concat(F.array_join(F.collect_set(c), ';'),
F.when(F.size(F.collect_set(c)) > 1, ';').otherwise('')).alias(c)
for c in {'C', 'D'}]
)
df_agg.show()
# --- --- -------- --------
# | A| B| C| D|
# --- --- -------- --------
# |123|001| ABC|DEF;DEG;|
# |256|002|XXX;XXY;| DSA|
# --- --- -------- --------
CodePudding user response:
groupby
and collect_set
, set only takes unique values. Use array_join
function to concatenate the array elements with a desired seperator. Code below
df.groupby('A','B').agg(*[array_join(collect_set(x),';').alias(x) for x in df.drop('A','B').columns]).show(truncate=0)
--- --- ------- -------
|A |B |C |D |
--- --- ------- -------
|123|001|ABC |DEF;DEG|
|256|002|XXX;XXY|DSA |
--- --- ------- -------