So i want to calculate standard deviation excluding current group using groupby. Here an example of the data:
import pandas as pd
df = pd.DataFrame ({
'group' : ['A','A','A','A','A','A','B','B','B','B','B','B'],
'team' : ['1','1','2','2','3','3','1','1','2','2','3','3',]
'value' : [1,2,5,7,2,3,7,8,8,9,6,4]
})
For example, for group A team 1, i want to calculate the std dev of team 2 and 3, for group A team 2, i want to calculate the std dev of group 1 and 3, and so on.
I managed to do it using groupby and apply but when using it on real data with literally milion of rows, it takes too long. So i am looking for a solution with vectorization.
def std(row, data):
data = data.loc[data['group']==row['group]]
return data.groupby(['team']).filter(lambda x:(x['tool]!=row['team']).all())['value'].std()
df['std_exclude'] = df.apply(lambda x: std(data=df),axis=1)
CodePudding user response:
You can use transform
after combining group
and team
as a list:
df['std'] = (df.assign(new=df[['group', 'team']].values.tolist())['new'].transform(
lambda x: df[df['group'].eq(x[0]) & df['team'].ne(x[1])]['value'].std()))
Output:
group team value std
0 A 1 1 2.217356
1 A 1 2 2.217356
2 A 2 5 0.816497
3 A 2 7 0.816497
4 A 3 2 2.753785
5 A 3 3 2.753785
6 B 1 7 2.217356
7 B 1 8 2.217356
8 B 2 8 1.707825
9 B 2 9 1.707825
10 B 3 6 0.816497
11 B 3 4 0.816497
There are some equal std
values across different groups
but you can verify that their std
values are indeed equal.