How to perform conditional aggregation AFTER groupby in python?


My first column is 'year', my second column is 'bank', and my last column is a value ('amt'). I want to get the SUM of that value for year=2003 for each 'bank' and display it in a new column, i.e. if my starting dataframe can be represented by the following code:

import pandas as pd

df = pd.DataFrame({'year' : [2001, 2002, 2003, 2001, 2002, 2003, 2001, 2002, 2003, 2001, 2002, 2003],
                   'bank' : ['sbi', 'sbi', 'sbi', 'sbi', 'sbi', 'sbi', 'icici', 'icici', 'icici', 'icici', 'icici', 'icici'],
                   'amt' : [1000, 2000, 3000, 4000, 5000, 6000, 1, 2, 3, 4, 5, 6]
                  })

Then the final output can be represented by the following code:

df = pd.DataFrame({'year' : [2001, 2002, 2003, 2001, 2002, 2003, 2001, 2002, 2003, 2001, 2002, 2003],
                   'bank' : ['sbi', 'sbi', 'sbi', 'sbi', 'sbi', 'sbi', 'icici', 'icici', 'icici', 'icici', 'icici', 'icici'],
                   'amt' : [1000, 2000, 3000, 4000, 5000, 6000, 1, 2, 3, 4, 5, 6],
                   'amt_2003': [9000, 9000, 9000, 9000, 9000, 9000, 9, 9, 9, 9, 9, 9]
                  })

I.e. for 'sbi', the total 'amt' in 'year'=2003 is 3000 + 6000 = 9000, which is displayed against all rows for 'sbi'. Similarly, I get 9 against all rows of 'icici'.

I am unable to use the conditional sum after using a groupby('bank') statement.
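For reference, the per-bank 2003 totals themselves are easy to compute; what I cannot figure out is how to broadcast them back onto every row. A quick sketch using the dataframe above:

# total 'amt' per bank, restricted to year 2003
print(df[df['year'] == 2003].groupby('bank')['amt'].sum())
# bank
# icici       9
# sbi      9000
# Name: amt, dtype: int64
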

CodePudding user response:

You might want to perform a groupby.sum, then unstack and merge:

# sum 'amt' per (year, bank), then pivot the years into columns
df2 = df.groupby(['year', 'bank']).sum().unstack('year')
# flatten the resulting MultiIndex columns to 'amt_<year>'
df2.columns = df2.columns.map(lambda x: f'{x[0]}_{x[1]}')

out = df.merge(df2.reset_index())

Output:

    year   bank   amt  amt_2001  amt_2002  amt_2003
0   2001    sbi  1000      5000      7000      9000
1   2002    sbi  2000      5000      7000      9000
2   2003    sbi  3000      5000      7000      9000
3   2001    sbi  4000      5000      7000      9000
4   2002    sbi  5000      5000      7000      9000
5   2003    sbi  6000      5000      7000      9000
6   2001  icici     1         5         7         9
7   2002  icici     2         5         7         9
8   2003  icici     3         5         7         9
9   2001  icici     4         5         7         9
10  2002  icici     5         5         7         9
11  2003  icici     6         5         7         9

NB: this works for more than just "amt" if you have several input columns (see the sketch after the intermediate output below).

Intermediate df2:

       amt_2001  amt_2002  amt_2003
bank                               
icici         5         7         9
sbi        5000      7000      9000
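As noted above, the same pipeline handles several value columns at once. A minimal sketch, using a hypothetical extra numeric column 'cnt' added purely for illustration:

# hypothetical second numeric column to demonstrate multiple value columns
df_multi = df.assign(cnt=1)

df2 = df_multi.groupby(['year', 'bank']).sum().unstack('year')
df2.columns = df2.columns.map(lambda x: f'{x[0]}_{x[1]}')   # amt_2001..amt_2003, cnt_2001..cnt_2003

out = df_multi.merge(df2.reset_index())
# every row of `out` now carries both the per-bank amt_<year> and cnt_<year> totals
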

To limit the years to keep, slice with loc:

keep = [2003]

df2 = df.groupby(['year', 'bank']).sum().loc[keep].unstack('year')
df2.columns = df2.columns.map(lambda x: f'{x[0]}_{x[1]}')

out = df.merge(df2.reset_index())

Output:

    year   bank   amt  amt_2003
0   2001    sbi  1000      9000
1   2002    sbi  2000      9000
2   2003    sbi  3000      9000
3   2001    sbi  4000      9000
4   2002    sbi  5000      9000
5   2003    sbi  6000      9000
6   2001  icici     1         9
7   2002  icici     2         9
8   2003  icici     3         9
9   2001  icici     4         9
10  2002  icici     5         9
11  2003  icici     6         9

CodePudding user response:

Use Series.map with the per-bank sum of 'amt' computed from the rows filtered for year 2003:

df['amt_2003'] = df['bank'].map(df[df['year'].eq(2003)].groupby('bank')['amt'].sum())

Or use Series.where to replace non-2003 values with 0, then build the new column with GroupBy.transform:

df['amt_2003'] = df['amt'].where(df['year'].eq(2003),0).groupby(df['bank']).transform('sum')
print (df)
    year   bank   amt  amt_2003
0   2001    sbi  1000      9000
1   2002    sbi  2000      9000
2   2003    sbi  3000      9000
3   2001    sbi  4000      9000
4   2002    sbi  5000      9000
5   2003    sbi  6000      9000
6   2001  icici     1         9
7   2002  icici     2         9
8   2003  icici     3         9
9   2001  icici     4         9
10  2002  icici     5         9
11  2003  icici     6         9
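For intuition, the intermediate Series produced by the Series.where step simply zeroes out the non-2003 rows (a quick illustration):

print(df['amt'].where(df['year'].eq(2003), 0).tolist())
# [0, 0, 3000, 0, 0, 6000, 0, 0, 3, 0, 0, 6]

Grouping this masked Series by df['bank'] and transforming with 'sum' then broadcasts 9000 (or 9) onto every row of the corresponding bank.
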

For all columns, use DataFrame.pivot_table with DataFrame.add_prefix and DataFrame.join:

df1 = (df.join(df.pivot_table(index='bank', columns='year',values='amt', aggfunc='sum')
                 .add_prefix('amt_'), on='bank'))
print (df1)
    year   bank   amt  amt_2001  amt_2002  amt_2003
0   2001    sbi  1000      5000      7000      9000
1   2002    sbi  2000      5000      7000      9000
2   2003    sbi  3000      5000      7000      9000
3   2001    sbi  4000      5000      7000      9000
4   2002    sbi  5000      5000      7000      9000
5   2003    sbi  6000      5000      7000      9000
6   2001  icici     1         5         7         9
7   2002  icici     2         5         7         9
8   2003  icici     3         5         7         9
9   2001  icici     4         5         7         9
10  2002  icici     5         5         7         9
11  2003  icici     6         5         7         9