Here's my dataframe:
sample_df = pd.DataFrame({'id': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                          'values_1': [2, 4, 6, 8, 12, 13, 13, 17],
                          'values_2': [3, 6, 7, 9, 3, 2, 2, 5]})
I would like to add a new column with a summary statistic, such as the correlation coefficient between values_1 and values_2, computed within groups.
For the mean this is clear; either of these works:
sample_df['overall_mean'] = sample_df.groupby('id')['values_1'].transform('mean')
sample_df['overall_mean'] = sample_df.groupby('id')['values_1'].mean()
From that I would assume that, to get the correlations, something like this would work:
sample_df['overall_cor'] = sample_df.groupby('id')[['values_1','values_2']].transform('corr')
sample_df['overall_cor'] = sample_df.groupby('id')[['values_1','values_2']].corr()
But neither works. Does anyone have a solution, and can anyone explain why the approaches that return the mean don't work for the correlation coefficient?
CodePudding user response:
The transform() method returns a result aligned with the original rows, but it operates on one column at a time, so it can't compute a statistic that involves two columns, such as the correlation coefficient. Instead, use apply() on the grouped frame and map the per-group result back onto the rows:
per_group_corr = sample_df.groupby('id')[['values_1', 'values_2']].apply(lambda x: x['values_1'].corr(x['values_2']))
sample_df['overall_cor'] = sample_df['id'].map(per_group_corr)
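Applied to the sample data in the question, this should give every row its group's correlation coefficient (values rounded):
print (sample_df)
  id  values_1  values_2  overall_cor
0  A         2         3     0.981156
1  A         4         6     0.981156
2  A         6         7     0.981156
3  A         8         9     0.981156
4  B        12         3     0.850390
5  B        13         2     0.850390
6  B        13         2     0.850390
7  B        17         5     0.850390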
CodePudding user response:
If you need to append new columns to the existing DataFrame, start from the per-group correlation matrix:
print (sample_df.groupby('id')[['values_1','values_2']].corr())
             values_1  values_2
id
A  values_1  1.000000  0.981156
   values_2  0.981156  1.000000
B  values_1  1.000000  0.850390
   values_2  0.850390  1.000000
# reshape each group's 2x2 correlation matrix into a single row per group
df = sample_df.groupby('id')[['values_1','values_2']].corr().unstack()
# drop the diagonal (self-correlation) columns, keeping only the cross terms
df = df.loc[:, df.columns.get_level_values(0) != df.columns.get_level_values(1)]
# flatten the MultiIndex column labels, e.g. ('values_1', 'values_2') -> 'values_1_values_2'
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
    values_1_values_2  values_2_values_1
id
A            0.981156           0.981156
B            0.850390           0.850390
# join on the 'id' values (not the row index) so each row picks up its group's correlations
out = sample_df.join(df, on='id')
print (out)
  id  values_1  values_2  values_1_values_2  values_2_values_1
0  A         2         3           0.981156           0.981156
1  A         4         6           0.981156           0.981156
2  A         6         7           0.981156           0.981156
3  A         8         9           0.981156           0.981156
4  B        12         3           0.850390           0.850390
5  B        13         2           0.850390           0.850390
6  B        13         2           0.850390           0.850390
7  B        17         5           0.850390           0.850390
Because both new columns are identical, you can simplify the solution:
sample_df['corr'] = (sample_df['id'].map(sample_df.groupby('id')[['values_1','values_2']]
                                                  .corr()
                                                  # every second row is a values_1 row; its last
                                                  # column holds the correlation with values_2
                                                  .iloc[::2, -1]
                                                  # drop the column level so the Series is indexed by id only
                                                  .droplevel(1)))
print (sample_df)
  id  values_1  values_2      corr
0  A         2         3  0.981156
1  A         4         6  0.981156
2  A         6         7  0.981156
3  A         8         9  0.981156
4  B        12         3  0.850390
5  B        13         2  0.850390
6  B        13         2  0.850390
7  B        17         5  0.850390
CodePudding user response:
You can't use transform() when you want to apply an operation across several columns. Instead, compute the correlation per group and map() the result back onto the original rows:
# per-group correlation: a Series indexed by the group keys ('A', 'B')
corr = sample_df.groupby('id').apply(lambda d: d['values_1'].corr(d['values_2']))
# broadcast one value per row by looking up each row's id
sample_df['corr'] = sample_df['id'].map(corr)
Output:
  id  values_1  values_2      corr
0  A         2         3  0.981156
1  A         4         6  0.981156
2  A         6         7  0.981156
3  A         8         9  0.981156
4  B        12         3  0.850390
5  B        13         2  0.850390
6  B        13         2  0.850390
7  B        17         5  0.850390
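A minimal sketch, assuming the sample_df from the question, of why the mean and the correlation behave differently: transform('mean') returns one value per original row, aligned with sample_df's index, so it can be assigned directly to a new column, while corr() on two grouped columns returns a 2x2 correlation matrix per group that has to be reduced (e.g. with apply() and map()) before it fits into a single column.
# one aligned value per row of sample_df -> can be assigned as a column directly
print (sample_df.groupby('id')['values_1'].transform('mean').shape)       # (8,)
# a MultiIndexed frame with a 2x2 correlation matrix per group
# -> must be reduced with apply()/map() before it becomes one column
print (sample_df.groupby('id')[['values_1', 'values_2']].corr().shape)    # (4, 2)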