I have the following PySpark DataFrame:
col_a | col_b | col_c | col_d | col_f
-------------------------------------
val_1 | val_2 | val_3 | val_4 | integer_1
val_1 | val_2 | val_3 | val_4 | integer_2
val_5 | val_6 | val_7 | val_8 | integer_3
val_1 | val_2 | val_3 | val_4 | integer_4
The dataframe I'm trying to generate is:
col_a | col_b | col_c | col_d | col_f
-------------------------------------
val_1 | val_2 | val_3 | val_4 | integer_1 + integer_2 + integer_4
val_5 | val_6 | val_7 | val_8 | integer_3
The goal is to sum all col_f values for rows where col_a, col_b, col_c, and col_d are equal, while keeping rows whose key combination is unique unchanged.
How can this be achieved? Thank you!
CodePudding user response:
Using this sample DF:
>>> df.show()
+-----+-----+-----+-----+-----+
|col_a|col_b|col_c|col_d|col_f|
+-----+-----+-----+-----+-----+
|    a|    b|    c|    d|    1|
|    a|    b|    c|    d|    2|
|    j|    h|    k|    l|    3|
|    a|    b|    c|    d|    4|
+-----+-----+-----+-----+-----+
You can use groupBy.agg, in a very similar fashion to what you would do in pandas:
g = ['col_a', 'col_b', 'col_c', 'col_d']  # grouping key columns

# Sum col_f per group, then rename the generated sum(col_f) column
df.groupBy(g).agg({'col_f': 'sum'}).withColumnRenamed('sum(col_f)', 'total_col_f').show()
prints:
+-----+-----+-----+-----+-----------+
|col_a|col_b|col_c|col_d|total_col_f|
+-----+-----+-----+-----+-----------+
|    a|    b|    c|    d|          7|
|    j|    h|    k|    l|          3|
+-----+-----+-----+-----+-----------+
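If you prefer to avoid the withColumnRenamed step, an equivalent sketch using pyspark.sql.functions.sum with an alias (assuming the same df and grouping columns as above) would be:

from pyspark.sql import functions as F

g = ['col_a', 'col_b', 'col_c', 'col_d']

# Same aggregation, naming the summed column directly via alias()
df.groupBy(g).agg(F.sum('col_f').alias('total_col_f')).show()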
CodePudding user response:
You should check out DataFrame.groupby in pandas.
In your case it would look something like (pandas):
df.groupby(['col_a', 'col_b', 'col_c', 'col_d']).sum()
Or with PySpark:
df.groupBy('col_a', 'col_b', 'col_c', 'col_d').sum('col_f')
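Put together as a minimal, self-contained sketch (assuming a local SparkSession and sample values mirroring the question's example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data with the same shape as the question's DataFrame
df = spark.createDataFrame(
    [('a', 'b', 'c', 'd', 1),
     ('a', 'b', 'c', 'd', 2),
     ('j', 'h', 'k', 'l', 3),
     ('a', 'b', 'c', 'd', 4)],
    ['col_a', 'col_b', 'col_c', 'col_d', 'col_f'],
)

# Rows with identical key columns collapse into one row whose col_f is summed;
# rows with a unique key pass through unchanged.
df.groupBy('col_a', 'col_b', 'col_c', 'col_d').sum('col_f').show()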