I calculated correlations of the following example dataframe df1:
import pandas as pd
df1 = pd.DataFrame({'A':[1,2,3], 'B':[2,5,3], 'C':[5,2,1] })
df1.head()
A B C
0 1 2 5
1 2 5 2
2 3 3 1
The correlation dataframe looks as follows:
df2=df1.corr()
df2
A B C
A 1.000000 0.327327 -0.960769
B 0.327327 1.000000 -0.576557
C -0.960769 -0.576557 1.000000
How can I test whether a value differs significantly from the other values in my dataframe/column? For example, I want to know whether the high correlation between A and C is significant, or whether high correlations are normal in my data.
Edit: desired output:
A B C
A p-value(A/A) p-value(A/B) p-value(A/C)
B p-value(B/A) p-value(B/B) p-value(B/C)
C p-value(C/A) p-value(C/B) p-value(C/C)
I know pearsonr() returns p-values, but it does not take the other values of df2 into account. I want to compare each correlation (for example, the correlation between A and C) against every other correlation in df2.
Answer:
Please try the scipy package. The correlation calculated above is Pearson's correlation. With scipy, the same coefficient can be calculated together with its p-value:
Ref: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html
from scipy.stats import pearsonr
pearson_corr_BC = pearsonr(df1['B'],df1['C'])
print("pearson correlation:",pearson_corr_BC[0])
print("p-value:",pearson_corr_BC[1])
To calculate the p-value matrix for every pair of columns, you can try the following code:
import numpy as np
row_index = df2.index
col_index = df2.columns
p_value_array = np.zeros(shape=(len(df1.columns),len(df1.columns)))
# run pearsonr on every pair of columns and keep only the p-value
for i,a in enumerate(row_index):
    for j,b in enumerate(col_index):
        p_value_array[i,j] = pearsonr(df1[a],df1[b])[1]
pvalue_df = pd.DataFrame(p_value_array,index=row_index,columns=col_index)
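As an alternative to the explicit loop, here is a minimal sketch that passes a callable to DataFrame.corr(method=...). The helper name pearson_pvalue is made up for this example. Note that pandas always writes 1.0 on the diagonal when a callable is used, so the sketch replaces the diagonal with NaN instead of treating it as a p-value.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 5, 3], 'C': [5, 2, 1]})

def pearson_pvalue(x, y):
    # hypothetical helper: pearsonr returns (correlation, p-value), keep the p-value
    return pearsonr(x, y)[1]

# corr() calls the helper once per pair of columns and mirrors the result
pvalue_df = df1.corr(method=pearson_pvalue)

# the diagonal is forced to 1.0 by pandas for callables, so blank it out
pvalue_arr = pvalue_df.to_numpy(copy=True)
np.fill_diagonal(pvalue_arr, np.nan)
pvalue_df = pd.DataFrame(pvalue_arr, index=pvalue_df.index, columns=pvalue_df.columns)

print(pvalue_df)
```

With only three rows per column, as in the example data, these p-values carry very little information, so this is best read as a demonstration of the mechanics.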
Now, to compare the correlation of two features with the other correlation values, try this, for example:
df2[abs(df2.loc['B','C'])>df2.abs()]
This returns a matrix that keeps only the entries whose absolute correlation is smaller than the absolute correlation between 'B' and 'C'; all other entries become NaN. The p-values in pvalue_df can be compared in the same way.
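To see how one correlation sits relative to all the others, a rough sketch is to list each pair once and compute the share of the other pairs that are at least as strong. This is a descriptive summary rather than a formal significance test, and the helper name share_at_least_as_strong and the "A/C" style labels are assumptions made for this example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 5, 3], 'C': [5, 2, 1]})
df2 = df1.corr()

# take each pair once: upper triangle, diagonal excluded
rows, cols = np.triu_indices(len(df2), k=1)
pairs = pd.Series(
    df2.to_numpy()[rows, cols],
    index=[f"{df2.index[i]}/{df2.columns[j]}" for i, j in zip(rows, cols)],
).abs()

def share_at_least_as_strong(pair):
    # hypothetical helper: fraction of the other pairs with |r| >= this pair's |r|
    others = pairs.drop(pair)
    return float((others >= pairs[pair]).mean())

print(pairs)
print(share_at_least_as_strong("A/C"))
```

A small share means the pair is unusually strong compared with the rest of the matrix. With only three pairs, as here, the number is not meaningful; it becomes more useful on a dataframe with many columns.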