Test for significance on whole dataframe

I calculated correlations of the following example dataframe df1:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 5, 3], 'C': [5, 2, 1]})
df1.head()

    A   B   C
0   1   2   5
1   2   5   2
2   3   3   1

The correlation dataframe looks as follows:

df2=df1.corr()
df2

        A           B           C
A   1.000000    0.327327    -0.960769
B   0.327327    1.000000    -0.576557
C  -0.960769   -0.576557     1.000000

How can I test whether a value differs significantly from the other values in my dataframe/column? For example: I want to know whether the high correlation between A and C is significant, or whether high correlations are normal in my data.

Edit: desired output:

        A               B              C
A   p-value(A/A)    p-value(A/B)    p-value(A/C)
B   p-value(B/A)    p-value(B/B)    p-value(B/C)
C   p-value(C/A)    p-value(C/B)    p-value(C/C)

I know pearsonr() returns p-values, but it does not take the other values of df2 into account: I want to compare each correlation (for example, the correlation between A and C) against every other correlation in df2.

CodePudding user response:

Try the scipy package. The correlation calculated above is Pearson's correlation. scipy.stats.pearsonr computes the same coefficient together with its p-value:

Ref : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html

from scipy.stats import pearsonr
pearson_corr_BC = pearsonr(df1['B'],df1['C'])
print("pearson correlation:",pearson_corr_BC[0])
print("p-value:",pearson_corr_BC[1])

To calculate the full p-value matrix, you can try the following code:

import numpy as np

row_index = df2.index
col_index = df2.columns
# One cell per pair of columns; pearsonr returns (correlation, p-value)
p_value_array = np.zeros(shape=(len(row_index), len(col_index)))
for i, a in enumerate(row_index):
    for j, b in enumerate(col_index):
        # Index 1 is the p-value of the correlation between columns a and b
        p_value_array[i, j] = pearsonr(df1[a], df1[b])[1]
pvalue_df = pd.DataFrame(p_value_array, index=row_index, columns=col_index)
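
The same matrix can probably be built without the explicit loops, since DataFrame.corr also accepts a callable that takes two 1-D arrays and returns a float. The following is only a sketch under that assumption. Note that pandas fills the diagonal with 1 no matter what the callable returns, so the diagonal is masked to NaN here (my choice, since a column's p-value against itself carries no information):

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 5, 3], 'C': [5, 2, 1]})

# corr() calls the function once per pair of columns and stores the returned
# float; here that float is the p-value instead of the coefficient.
pvalue_df = df1.corr(method=lambda x, y: pearsonr(x, y)[1])

# pandas puts 1.0 on the diagonal for any callable method, so blank it out
# rather than report a meaningless self-comparison.
pvalue_df = pvalue_df.mask(np.eye(len(pvalue_df), dtype=bool))

print(pvalue_df)
```

With only three rows per column, the p-values are of limited use; the sketch is about the mechanics, not the statistics.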

To compare the correlation of one pair of features against the other correlation values, try, for example:

df2[abs(df2.loc['B','C'])>df2.abs()]

This returns df2 with only the entries whose absolute value is smaller than the absolute correlation between 'B' and 'C'; every other entry (including the diagonal and the B/C cells themselves) becomes NaN. The p-values in pvalue_df can be compared in the same way.
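
To put a single number on "how unusual is the A/C correlation compared with the rest", one option is to list each pair once and compute the share of the other pairs that are weaker. This is a minimal sketch: the names pairs, target and share_weaker are made up for the illustration, and the result is a descriptive ranking, not a formal significance test.

```python
from itertools import combinations

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 5, 3], 'C': [5, 2, 1]})
df2 = df1.corr()

# Each unordered pair of columns once, keyed like "A/C", with |correlation|.
# The diagonal and the mirrored lower triangle are skipped.
pairs = pd.Series({
    f"{a}/{b}": abs(df2.loc[a, b])
    for a, b in combinations(df2.columns, 2)
})

target = pairs["A/C"]
others = pairs.drop("A/C")

# Fraction of the remaining pairs whose absolute correlation is smaller
share_weaker = (others < target).mean()

print(pairs.sort_values(ascending=False))
print("share of other pairs weaker than A/C:", share_weaker)
```

With three columns there are only two other pairs to compare against, so the share is only illustrative here; on a wider dataframe it gives a rough percentile of the chosen correlation among all pairwise correlations.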