I'm trying to chain a join
and groupby
operation together. The inputs and operations I want to do look like below. I want to groupby
all the columns except the one used in agg
. Is there a way of doing this without listing out all the column names like groupby("colA","colB")
? I tried groupby(df1.*)
but that didn't work. In this case I know that I'd like to group by all the columns in df1. Many thanks.
Input1:
colA | ColB
--------------
A | 100
B | 200
Input2:
colAA | ColBB
--------------
A | Group1
B | Group2
A | Group2
df1.join(df2, df1colA==df2.colAA,"left").drop("colAA").groupby("colA","colB"),agg(collect_set("colBB"))
#Is there a way that I do not need to list ("colA","colB") in groupby? there will be many cloumns.
Output:
colA | ColB | collect_set
--------------
A | 100 | (Group1,Group2)
B | 200 | (Group2)
CodePudding user response:
Just simple:
.groupby(df1.columns)
CodePudding user response:
Based on your clarifying comments, use df1.columns
df1.join(df2, df1.colA==df2.colAA,"left").drop("colAA").groupby(df1.columns).agg(collect_set("colBB").alias('new')).show()
---- ---- ----------------
|colA|ColB| new|
---- ---- ----------------
| A| 100|[Group2, Group1]|
| B| 200| [Group2]|
---- ---- ----------------