Home > Mobile >  Pandas Groupby Based on Values in Multiple Columns
Pandas Groupby Based on Values in Multiple Columns

Time:09-28

I have a dataframe that I am trying to use pandas.groupby on to get the cumulative sum. The values that I am grouping by show up in two different columns, and I am having trouble getting the groupby to work correctly. My starting dataframe is:

df = pd.DataFrame({'col_A': ['red', 'red', 'blue', 'red'], 'col_B': ['blue', 'red', 'blue', 'red'], 'col_A_qty': [1, 1, 1, 1], 'col_B_qty': [1, 1, 1, 1]})

col_A   col_B   col_A_qty   col_B_qty
red      blue      1           1
red      red       1           1
blue    blue       1           1
red      red       1           1

The result I am trying to get is:

col_A   col_B   col_A_qty   col_B_qty   red_cumsum  blue_cumsum
red     blue       1            1           1           1
red     red        1            1           3           1
blue    blue       1            1           3           3
red     red        1            1           5           3

I've tried:

df.groupby(['col_A', 'col_B'])['col_A_qty'].cumsum()

but this groups on the combination of col_A and col_B. How can I use pandas.groupby to calculate the cumulative sum of red and blue, regardless of if it's in col_A or col_B?

CodePudding user response:

Try two pivot

out = pd.pivot(df,columns='col_A',values='col_A_qty').fillna(0).cumsum().add(pd.pivot(df,columns='col_B',values='col_B_qty').fillna(0).cumsum(),fill_value=0)
Out[404]: 
col_A  blue  red
0       1.0  1.0
1       1.0  3.0
2       3.0  3.0
3       3.0  5.0
df = df.join(out)

CodePudding user response:

A simple method is to define each cumsum column by two Series.cumsum, as follows:

df['red_cumsum'] = df['col_A'].eq('red').cumsum()   df['col_B'].eq('red').cumsum()
df['blue_cumsum'] = df['col_A'].eq('blue').cumsum()   df['col_B'].eq('blue').cumsum()

In each column col_A and col_B, check for values equal 'red' / 'blue' (results are boolean series). Then, we use Series.cumsum on these resultant boolean series to get the cumulative counts. You don't really need to use pandas.groupby in this use case.

If you have multiple items in col_A and col_B, you can also iterate through the unique item list, as follows:

for item in pd.unique(df['col_A'].tolist()   df['col_B'].tolist()):
    df[f'{item}_cumsum'] = df['col_A'].eq(item).cumsum()   df['col_B'].eq(item).cumsum()

Result:

print(df)

  col_A col_B  col_A_qty  col_B_qty  red_cumsum  blue_cumsum
0   red  blue          1          1           1            1
1   red   red          1          1           3            1
2  blue  blue          1          1           3            3
3   red   red          1          1           5            3
  • Related