Pandas new column with groupby correlation results


Here's my dataframe:

import pandas as pd

sample_df = pd.DataFrame({'id': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                          'values_1': [2, 4, 6, 8, 12, 13, 13, 17],
                          'values_2': [3, 6, 7, 9, 3, 2, 2, 5]})

I would like to add a new column with some summary statistics such as the correlation coefficient between values_1 and values_2, within groups.

For the mean this is clear. transform broadcasts the per-group mean back onto the original rows, while the plain mean returns one value per group:

sample_df['overall_mean'] = sample_df.groupby('id')['values_1'].transform('mean')
group_means = sample_df.groupby('id')['values_1'].mean()
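
For reference, here is what group_means looks like with the sample data:

print(group_means)
id
A     5.00
B    13.75
Name: values_1, dtype: float64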

And so I would assume from that that to get the correlations, something like this would work:

sample_df['overall_cor'] = sample_df.groupby('id')[['values_1','values_2']].transform('corr')
sample_df['overall_cor'] = sample_df.groupby('id')[['values_1','values_2']].corr()

But neither works. Does anyone have a solution to this, and can anyone explain why the same approaches that return the mean don't work for the correlation coefficient?

CodePudding user response:

The transform() method broadcasts a per-group result back onto the rows of the original frame, and the function it applies must produce either one value per row or one value per group; 'corr' produces a whole correlation matrix, so pandas rejects it as a transform. Use apply() instead, which receives each group as a DataFrame and can therefore work with both columns at once:

corr_by_id = sample_df.groupby('id')[['values_1', 'values_2']].apply(lambda x: x['values_1'].corr(x['values_2']))
sample_df['overall_cor'] = sample_df['id'].map(corr_by_id)
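
Note that the apply() result is a Series keyed by group, not by the original row index, which is why it is mapped back through the id column rather than assigned directly. With the sample data it looks like this:

print(corr_by_id)
id
A    0.981156
B    0.850390
dtype: float64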

CodePudding user response:

If you need to append the new columns to the existing DataFrame, use:

print (sample_df.groupby('id')[['values_1','values_2']].corr())
             values_1  values_2
id                             
A  values_1  1.000000  0.981156
   values_2  0.981156  1.000000
B  values_1  1.000000  0.850390
   values_2  0.850390  1.000000

# per-group correlation matrix, reshaped to one row per group
df = sample_df.groupby('id')[['values_1','values_2']].corr().unstack()
# drop the diagonal (self-correlation) columns
df = df.loc[:, df.columns.get_level_values(0) != df.columns.get_level_values(1)]
# flatten the MultiIndex columns into single strings
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
    values_1_values_2  values_2_values_1
id                                      
A            0.981156           0.981156
B            0.850390           0.850390

out = sample_df.join(df, on='id')
print (out)
  id  values_1  values_2  values_1_values_2  values_2_values_1
0  A         2         3           0.981156           0.981156
1  A         4         6           0.981156           0.981156
2  A         6         7           0.981156           0.981156
3  A         8         9           0.981156           0.981156
4  B        12         3           0.850390           0.850390
5  B        13         2           0.850390           0.850390
6  B        13         2           0.850390           0.850390
7  B        17         5           0.850390           0.850390

Both new columns are the same, so you can simplify the solution:

sample_df['corr'] = (sample_df['id'].map(sample_df.groupby('id')[['values_1','values_2']]
                                                  .corr()          # per-group 2x2 correlation matrix
                                                  .iloc[::2, -1]   # every other row, last column -> off-diagonal value
                                                  .droplevel(1)))  # keep only the id level so map can align

print (sample_df)
  id  values_1  values_2      corr
0  A         2         3  0.981156
1  A         4         6  0.981156
2  A         6         7  0.981156
3  A         8         9  0.981156
4  B        12         3  0.850390
5  B        13         2  0.850390
6  B        13         2  0.850390
7  B        17         5  0.850390
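
To see what is being mapped, the intermediate Series after droplevel(1) is keyed by id (values taken from the matrix printed above):

print (sample_df.groupby('id')[['values_1','values_2']].corr().iloc[::2, -1].droplevel(1))
id
A    0.981156
B    0.850390
Name: values_2, dtype: float64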

CodePudding user response:

You can't use transform when you want to apply an operation on several columns.

Instead, compute the correlation per group and map the result back onto the rows:

corr = sample_df.groupby('id').apply(lambda d: d['values_1'].corr(d['values_2']))
sample_df['corr'] = sample_df['id'].map(corr)

Output:

  id  values_1  values_2      corr
0  A         2         3  0.981156
1  A         4         6  0.981156
2  A         6         7  0.981156
3  A         8         9  0.981156
4  B        12         3  0.850390
5  B        13         2  0.850390
6  B        13         2  0.850390
7  B        17         5  0.850390
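
The same pattern extends to other two-column statistics; for example, Series.corr also accepts method='spearman' or method='kendall' if a rank correlation is wanted:

# same groupby-then-map pattern, swapping in a rank correlation
spearman = sample_df.groupby('id').apply(lambda d: d['values_1'].corr(d['values_2'], method='spearman'))
sample_df['spearman_corr'] = sample_df['id'].map(spearman)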