I have a DataFrame where some columns are correlated and some are not. I want to display only the uncorrelated columns as output. Can anyone help me solve this? I don't want a plot, just the names of the uncorrelated columns.
CodePudding user response:
You can first compute the correlation matrix with df.corr(), then look up the column names whose absolute correlation is at or below a threshold.
Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.RandomState(0).rand(10, 10))
corr = df.corr()
# 0 1 2 3 4 5 6 7 8 9
#0 1.000000 0.347533 0.398948 0.455743 0.072914 -0.233402 -0.731222 0.477978 -0.442621 0.015185
#1 0.347533 1.000000 -0.284056 0.571003 -0.285483 0.382480 -0.362842 0.642578 0.252556 0.190047
#2 0.398948 -0.284056 1.000000 -0.523649 0.152937 -0.139176 -0.092895 0.016266 -0.434016 -0.383585
#3 0.455743 0.571003 -0.523649 1.000000 -0.225343 -0.227577 -0.481548 0.473286 0.279258 0.446650
#4 0.072914 -0.285483 0.152937 -0.225343 1.000000 -0.104438 -0.147477 -0.523283 -0.614603 -0.189916
#5 -0.233402 0.382480 -0.139176 -0.227577 -0.104438 1.000000 -0.030252 0.417640 0.205851 0.095084
#6 -0.731222 -0.362842 -0.092895 -0.481548 -0.147477 -0.030252 1.000000 -0.494440 0.381407 -0.353652
#7 0.477978 0.642578 0.016266 0.473286 -0.523283 0.417640 -0.494440 1.000000 0.375873 0.417863
#8 -0.442621 0.252556 -0.434016 0.279258 -0.614603 0.205851 0.381407 0.375873 1.000000 0.150421
#9 0.015185 0.190047 -0.383585 0.446650 -0.189916 0.095084 -0.353652 0.417863 0.150421 1.000000
threshold = 0.2
# For each column, list the columns whose absolute correlation is at or below the threshold
uncorr = (corr.abs() <= threshold).apply(lambda row: row[row].index.tolist(), axis=1)
uncorr_df = uncorr.to_frame('col_name_uncorrelated')
# 0 with 4,9 uncorrelated
# 1 with 9 uncorrelated
...
# 9 with 0, 1, 4, 5, 8 uncorrelated
Output:
>>> uncorr_df
col_name_uncorrelated
0 [4, 9]
1 [9]
2 [4, 5, 6, 7]
3 []
4 [0, 2, 5, 6, 9]
5 [2, 4, 6, 9]
6 [2, 4, 5]
7 [2]
8 [9]
9 [0, 1, 4, 5, 8]
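If you need a single list of columns that are not correlated with any other column (rather than one list per column), a minimal sketch could look like this. It assumes "uncorrelated" means the absolute correlation with every other column is at or below the threshold. The function name uncorrelated_columns and the small demo frame are made up for illustration.

```python
import numpy as np
import pandas as pd


def uncorrelated_columns(df, threshold=0.2):
    """Return columns whose |correlation| with every other column is <= threshold.

    Hypothetical helper, not part of pandas.
    """
    corr_abs = df.corr().abs()
    # Hide the diagonal: every column correlates perfectly with itself.
    off_diag = corr_abs.mask(np.eye(len(corr_abs), dtype=bool))
    # Strongest remaining correlation per column (NaN entries are skipped).
    max_corr = off_diag.max()
    return max_corr[max_corr <= threshold].index.tolist()


# Demo data (an assumption): b is an exact multiple of a, c is unrelated to both.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [1, -1, -1, 1],
})
print(uncorrelated_columns(demo))  # ['c']
```

With the random 10x10 frame above and threshold = 0.2 this returns an empty list, because every column there has at least one correlation above 0.2 in absolute value.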
CodePudding user response:
First, calculate the correlation matrix:
import numpy as np
import pandas as pd
myDataFrame = pd.DataFrame(data)  # data is your own input
correl = myDataFrame.corr()
Define what you mean by "uncorrelated". Here I treat an absolute correlation below 0.5 as uncorrelated:
uncor_level=0.5
The following code gives you the names of the column pairs that are uncorrelated:
pairs = np.full([len(correl)**2, 2], None)  # preallocate an object array to store the results
z = 0
for x in range(len(correl)):  # loop over each row (index)
    for y in range(len(correl)):  # loop over each column
        if abs(correl.iloc[x, y]) < uncor_level:
            pairs[z] = [correl.index[x], correl.columns[y]]
            z += 1
pairs = pairs[:z]  # drop the unused rows; each pair appears twice, as (a, b) and (b, a)
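The same idea can be written without the explicit double loop. The sketch below is one possible variant, not the only way to do it: it looks only at the upper triangle of the correlation matrix, so each pair is reported once and a column is never paired with itself. The function name uncorrelated_pairs and the demo frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def uncorrelated_pairs(df, uncor_level=0.5):
    """Return (col_a, col_b) tuples with |correlation| < uncor_level.

    Hypothetical helper, not part of pandas.
    """
    correl = df.corr()
    # Strict upper triangle: skips the diagonal and the mirrored lower half.
    upper = np.triu(np.ones(correl.shape, dtype=bool), k=1)
    below = (correl.abs() < uncor_level).to_numpy() & upper
    rows, cols = np.nonzero(below)
    return [(correl.index[r], correl.columns[c]) for r, c in zip(rows, cols)]


# Demo data (an assumption): b is an exact multiple of a, c is unrelated to both.
demo = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [1, -1, -1, 1],
})
print(uncorrelated_pairs(demo))  # [('a', 'c'), ('b', 'c')]
```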