Home > database >  Python Pandas calculate standard deviation excluding current group, with vectorization solution
Python Pandas calculate standard deviation excluding current group, with vectorization solution

Time:11-21

So i want to calculate standard deviation excluding current group using groupby. Here an example of the data:

import pandas as pd
df = pd.DataFrame ({
                      'group' : ['A','A','A','A','A','A','B','B','B','B','B','B'],
                      'team'  : ['1','1','2','2','3','3','1','1','2','2','3','3',]
                      'value' : [1,2,5,7,2,3,7,8,8,9,6,4]
                  })

For example, for group A team 1, i want to calculate the std dev of team 2 and 3, for group A team 2, i want to calculate the std dev of group 1 and 3, and so on.

I managed to do it using groupby and apply but when using it on real data with literally milion of rows, it takes too long. So i am looking for a solution with vectorization.

def std(row, data):
   data = data.loc[data['group']==row['group]]
   return data.groupby(['team']).filter(lambda x:(x['tool]!=row['team']).all())['value'].std()
df['std_exclude'] = df.apply(lambda x: std(data=df),axis=1)

CodePudding user response:

You can use transform after combining group and team as a list:

df['std'] = (df.assign(new=df[['group', 'team']].values.tolist())['new'].transform(
             lambda x: df[df['group'].eq(x[0]) & df['team'].ne(x[1])]['value'].std())) 

Output:

group team value    std
0   A   1   1   2.217356
1   A   1   2   2.217356
2   A   2   5   0.816497
3   A   2   7   0.816497
4   A   3   2   2.753785
5   A   3   3   2.753785
6   B   1   7   2.217356
7   B   1   8   2.217356
8   B   2   8   1.707825
9   B   2   9   1.707825
10  B   3   6   0.816497
11  B   3   4   0.816497

There are some equal std values across different groups but you can verify that their std values are indeed equal.

  • Related