Group By Sum Multiple Columns in Pandas (Ignoring duplicates)

Time:10-08

I have the following code, where my dataframe contains three columns whose values I want to count (plus one unrelated column):

  toBeSummed toBeSummed2 toBesummed3  someColumn
0          X           X           Y         NaN
1          X           Y           Z         NaN
2          Y           Y           Z         NaN
3          Z           Z           Z         NaN

import pandas as pd

oneframe = pd.concat([df['toBeSummed'], df['toBeSummed2'], df['toBesummed3']], axis=1).reset_index()

# Count how often each value occurs in each column
temp = oneframe.groupby(['toBeSummed']).size().reset_index()
temp2 = oneframe.groupby(['toBeSummed2']).size().reset_index()
temp3 = oneframe.groupby(['toBesummed3']).size().reset_index()  # name must match the df column 'toBesummed3'

# Give all three frames the same column names so they can be combined
temp.columns = ["SameName", "Sum"]
temp2.columns = ["SameName", "Sum"]
temp3.columns = ["SameName", "Sum"]

# Add up the per-column counts for each value
final = pd.concat([temp, temp2, temp3]).groupby(['SameName']).sum().reset_index()
final.columns = ["Letter", "Sum"]

The problem is that this code counts every instance of each value, so calling final gives:

  Letter  Sum
0      X    3
1      Y    4
2      Z    5

However, I don't want a value counted more than once when it appears several times in the same row (e.g. the first row contains two X's, so X should be counted only once for that row). The desired output is:

  Letter  Sum
0      X    2
1      Y    3
2      Z    3

I can update or add more comments if this is confusing.

CodePudding user response:

Given df:

  toBeSummed toBeSummed2 toBesummed3  someColumn
0          X           X           Y         NaN
1          X           Y           Z         NaN
2          Y           Y           Z         NaN
3          Z           Z           Z         NaN

Doing:

sum_cols = ['toBeSummed', 'toBeSummed2', 'toBesummed3']

# axis=1 applies unique() to each row, so a value repeated within a row is counted once
out = df[sum_cols].apply(lambda x: x.unique(), axis=1).explode().value_counts()
print(out.to_frame('Sum'))

Output:

   Sum
Y    3
Z    3
X    2
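
If a row-wise apply is too slow on a larger frame, one possible alternative is to reshape to long format and drop duplicate (row, value) pairs before counting. This is only a sketch under assumptions: the frame is rebuilt by hand from the question's example, and the names "row", "long_df" and "per_row" are made up for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    "toBeSummed":  ["X", "X", "Y", "Z"],
    "toBeSummed2": ["X", "Y", "Y", "Z"],
    "toBesummed3": ["Y", "Z", "Z", "Z"],
    "someColumn":  [np.nan] * 4,
})
sum_cols = ["toBeSummed", "toBeSummed2", "toBesummed3"]

# Long format: one line per (row, column) cell, keeping the original row label
long_df = (
    df[sum_cols]
    .rename_axis("row")      # hypothetical name for the index level
    .reset_index()
    .melt(id_vars="row", value_name="Letter")
)

# Keep each value at most once per original row
per_row = long_df.drop_duplicates(subset=["row", "Letter"])

# Count the rows in which each value appears (value_counts skips NaN by default)
out = (
    per_row["Letter"]
    .value_counts()
    .rename_axis("Letter")
    .reset_index(name="Sum")
    .sort_values("Letter", ignore_index=True)
)
print(out)
```

On the example data this should give X 2, Y 3, Z 3, which matches the desired output in the question, and it has the Letter and Sum columns the question asks for.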