Sum column values if the rows are identical, keep unique rows (Pyspark)-CodePudding

I have the following Pyspark DF

col_a  col_b  col_c  col_d   col_f
-------------------------------------
val_1 |val_2 |val_3 |val_4 | integer_1
-------------------------------------
val_1 |val_2 |val_3 |val_4 | integer_2
-------------------------------------
val_5 |val_6 |val_7 |val_8 | integer_3
-------------------------------------
val_1 |val_2 |val_3 |val_4 | integer_4
-------------------------------------

The dataframe I'm trying to generate is:

col_a  col_b  col_c  col_d   col_f
-------------------------------------
val_1 |val_2 |val_3 |val_4 | integer_1   integer_2   integer_4
-------------------------------------
val_5 |val_6 |val_7 |val_8 | integer_3
-------------------------------------

The goal is to sum up all col_f values if the col_a, col_b, col_c, and col_c are equal, but also to keep other rows that are unique.

How can this be achieved? Thank you!

CodePudding user response：

Using this sample DF:

>>> df
 ----- ----- ----- ----- ----- 
|col_a|col_b|col_c|col_d|col_f|
 ----- ----- ----- ----- ----- 
|    a|    b|    c|    d|    1|
|    a|    b|    c|    d|    2|
|    j|    h|    k|    l|    3|
|    a|    b|    c|    d|    4|
 ----- ----- ----- ----- -----

You can groupBy.agg, in a very similar fashion as you would do in pandas:

g = ['col_a','col_b','col_c','col_d']
df.groupBy(g).agg({'col_f':'sum'}).withColumnRenamed('sum(col_f)','total_col_f').show()

prints:

 ----- ----- ----- ----- ----------- 
|col_a|col_b|col_c|col_d|total_col_f|
 ----- ----- ----- ----- ----------- 
|    a|    b|    c|    d|          7|
|    j|    h|    k|    l|          3|
 ----- ----- ----- ----- -----------

CodePudding user response：

You should check out pandas.groupby.

In your case it would look something like (pandas):

df.groupby([col_a,col_b,_col_c,col_d]).sum()

Or with PySpark:

df.groupBy(col_a,col_b,_col_c,col_d).sum(col_f)