PySpark - how to select all columns to be used in groupby


I'm trying to chain a join and a groupby operation together. The inputs and the operations I want to perform are shown below. I want to group by all the columns except the one used in agg. Is there a way to do this without listing out all the column names, like groupby("colA", "colB")? I tried groupby(df1.*), but that didn't work. In this case I know that I'd like to group by all the columns in df1. Many thanks.

Input1:
colA | ColB
-----------
 A   | 100
 B   | 200

Input2:
colAA | ColBB
-------------
  A   | Group1
  B   | Group2
  A   | Group2

df1.join(df2, df1.colA == df2.colAA, "left").drop("colAA").groupby("colA", "colB").agg(collect_set("colBB"))
# Is there a way that I do not need to list ("colA", "colB") in groupby? There will be many columns.

Output:
 colA | ColB | collect_set
---------------------------
  A   | 100  | (Group1, Group2)
  B   | 200  | (Group2)

CodePudding user response:

It's as simple as:

.groupby(df1.columns)
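
For completeness, here is a minimal self-contained sketch of this approach, assuming a local SparkSession and the sample data from the question (the variable names are only illustrative):

 from pyspark.sql import SparkSession
 from pyspark.sql.functions import collect_set

 spark = SparkSession.builder.master("local[*]").getOrCreate()

 df1 = spark.createDataFrame([("A", 100), ("B", 200)], ["colA", "colB"])
 df2 = spark.createDataFrame(
     [("A", "Group1"), ("B", "Group2"), ("A", "Group2")], ["colAA", "colBB"]
 )

 # df1.columns is just a Python list of column-name strings (["colA", "colB"]),
 # and groupby() accepts such a list, so the names never have to be typed out.
 (df1.join(df2, df1.colA == df2.colAA, "left")
     .drop("colAA")
     .groupby(df1.columns)
     .agg(collect_set("colBB").alias("collect_set"))
     .show())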

CodePudding user response:

Based on your clarifying comments, use df1.columns:

 df1.join(df2, df1.colA==df2.colAA,"left").drop("colAA").groupby(df1.columns).agg(collect_set("colBB").alias('new')).show()
+----+----+----------------+
|colA|ColB|             new|
+----+----+----------------+
|   A| 100|[Group2, Group1]|
|   B| 200|        [Group2]|
+----+----+----------------+
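
If you ever need to exclude additional columns from the grouping (not just the one being aggregated), one common pattern is a list comprehension over the joined DataFrame's columns; the column names below are only illustrative:

 joined = df1.join(df2, df1.colA == df2.colAA, "left").drop("colAA")
 # Group by every remaining column except the one being aggregated.
 group_cols = [c for c in joined.columns if c != "colBB"]
 joined.groupby(group_cols).agg(collect_set("colBB").alias("new")).show()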