How to calculate pearsonr (and correlation significance) with pandas groupby?-CodePudding

I would like to do a groupby correlation using pandas and pearsonr.

Currently I have:

df = pd.DataFrame(np.random.randint(0,10,size=(1000, 4)), columns=list('ABCD'))

df.groupby(['A','B'])[['C','D']].corr().unstack().iloc[:,1]

However I would like to calculate the correlation significance using pearsonr (scipy package) like this:

from scipy.stats import pearsonr
corr,pval= pearsonr(df['C'],df['D'])

How do I combine the groupby with the pearsonr, something like this:

corr,val=df.groupby(['A','B']).agg(pearsonr(['C','D']))

CodePudding user response：

If I understand, you need to perform the Pearson's test between C and D for any combination of A and B.

To carry out this task you need to groupby(['A','B']) as you already done. Now your grouped dataframe is a "set" of dataframes (one dataframe for each A,B combination), so you can apply the stats.pearsonr to any of these dataframes through the apply method. In order to have two distinct columns for the test-statistic (r, correlation index) and for the p-value, you can also include the output from pearsonr in a pd.Series.

from scipy import stats

df.groupby(['A','B']).apply(lambda d:pd.Series(stats.pearsonr(d.C, d.D), index=["corr", "pval"]))

The output is:

         corr      pval
A B                    
0 0 -0.318048  0.404239
  1  0.750380  0.007804
  2 -0.536679  0.109723
  3 -0.160420  0.567917
  4 -0.479591  0.229140
..        ...       ...
9 5  0.218743  0.602752
  6 -0.114155  0.662654
  7  0.053370  0.883586
  8 -0.436360  0.091069
  9 -0.047767  0.882804

[100 rows x 2 columns]

In jupyter:

Another advice I can give you is to adjust the p-values to avoid false-positives, since you are replicating the experiment several times:

corr_df["qval"] = p_adjust_bh(corr_df.pval)

I used the p_adjust_bh function from here (answer from @Eric Talevich)