Test for significance on whole dataframe

Time:03-24

I calculated correlations of the following example dataframe df1:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 5, 3], 'C': [5, 2, 1]})
df1.head()
    A   B   C
0   1   2   5
1   2   5   2
2   3   3   1

The correlation dataframe looks as follows:

df2=df1.corr()
df2
        A           B           C
A   1.000000    0.327327    -0.960769
B   0.327327    1.000000    -0.576557
C  -0.960769   -0.576557     1.000000

How can I test whether a value differs significantly from the other values in my dataframe/column? For example, I want to know whether the high correlation between A and C is significant, or whether high correlations are normal in my data.

Edit: desired output:

        A               B              C
A   p-value(A/A)    p-value(A/B)    p-value(A/C)
B   p-value(B/A)    p-value(B/B)    p-value(B/C)
C   p-value(C/A)    p-value(C/B)    p-value(C/C)

I know pearsonr() returns p-values, but it does not take the other values of df2 into account. I want to compare each correlation (for example, the correlation between A and C) against every other correlation in df2.

Answer:

Try the scipy package. The correlation computed above is Pearson's correlation. scipy.stats.pearsonr computes the same coefficient and also returns a p-value:

Ref : https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html

from scipy.stats import pearsonr
pearson_corr_BC = pearsonr(df1['B'],df1['C'])
print("pearson correlation:",pearson_corr_BC[0])
print("p-value:",pearson_corr_BC[1])

To calculate the p-value matrix, you can try the following code:

import numpy as np

row_index = df2.index
col_index = df2.columns
# One p-value per pair of columns, laid out like the correlation matrix
p_value_array = np.zeros(shape=(len(df1.columns), len(df1.columns)))
for i, a in enumerate(row_index):
    for j, b in enumerate(col_index):
        # pearsonr returns (correlation, p-value); keep the p-value
        p_value_array[i, j] = pearsonr(df1[a], df1[b])[1]
pvalue_df = pd.DataFrame(p_value_array, index=row_index, columns=col_index)
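As a possible shortcut, the same matrix can be sketched without the explicit loops by passing a function to DataFrame.corr(). The name pvalue_df_alt is only illustrative. One caveat: with a callable method, pandas always writes 1 on the diagonal, so this sketch subtracts the identity matrix to put 0 there instead (an assumption about what you want to see for a column compared with itself).

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 5, 3], 'C': [5, 2, 1]})

# corr() calls the function once per pair of columns and collects the results.
# For a callable method pandas fills the diagonal with 1.0, so the identity
# matrix is subtracted to report 0.0 there instead.
pvalue_df_alt = (
    df1.corr(method=lambda x, y: pearsonr(x, y)[1])
    - np.eye(len(df1.columns))
)
print(pvalue_df_alt)
```

The off-diagonal entries should match the loop version, since both call pearsonr on the same column pairs.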

To compare the correlation of one pair of features against the other correlation values, try, for example:

df2[abs(df2.loc['B','C'])>df2.abs()]

This returns a matrix that keeps only the entries whose absolute correlation is smaller than the absolute correlation between 'B' and 'C'; all other entries become NaN. The p-values in pvalue_df can be compared in the same way.
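To put a number on "how does A/C compare with every other correlation", one rough option is to list each pair once and check what share of the other pairs has a smaller absolute correlation. This is only a descriptive ranking, not a formal significance test, and with three pairs it says very little. The names pairs, target, others and share_smaller are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 5, 3], 'C': [5, 2, 1]})
df2 = df1.corr()

# Take each pair once: positions strictly above the diagonal.
rows, cols = np.triu_indices(len(df2), k=1)
pairs = pd.Series(
    np.abs(df2.to_numpy()[rows, cols]),
    index=[f"{df2.index[i]}/{df2.columns[j]}" for i, j in zip(rows, cols)],
)

# Share of the other pairs whose |r| is smaller than that of A/C.
target = pairs['A/C']
others = pairs.drop('A/C')
share_smaller = (others < target).mean()

print(pairs.sort_values(ascending=False))
print(share_smaller)
```

On a dataframe with many columns, the same pairs series could be plotted or summarised with pairs.describe() to see whether high correlations are common in the data.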
