Keep only one feature, among those ones which are correlated between each other, based on higher cor

I have a dataset like this:

Target  Politics  Medicine  School  Hospital  Domestic
-1      0         0         0       0         0
-1      0         1         0       0         1
-1      0         0         0       1         0
-1      0         0         0       0         0
 1      0         0         0       0         1
 1      0         0         0       0         0

Target is my target variable and it can take values -1 or 1; the other columns are boolean, and they can take values 0/1. I am calculating correlation among these variables as follows:

import matplotlib.pyplot as plt
import seaborn as sn

# df is the dataset shown above
corr = df.corr(method='pearson')
sn.heatmap(corr, annot=True)
plt.show()

This generates a correlation matrix which includes both Target and the boolean variables. Some of my variables are correlated with each other, e.g., Medicine and School (imagine the output below was generated by the code above, using the full dataset):

index      Target   Politics   Medicine   School   Hospital   Domestic
Target     1        0.02       -0.08      0.04     0          0.001
Politics   0.02     1          0          0.002    0          0
Medicine   -0.08    0          1          0.76     0          0
School     0.04     0.002      0.76       1        0.24       0
Hospital   0        0          0          0        1          0
Domestic   0.001    0          0          0.24     0          1

(The table above is just an example of correlated variables.) I set a threshold of 0.5, so School and Medicine (0.76) count as correlated. I'd like to keep only one of them, namely the one with the stronger correlation with the Target variable in absolute value (Medicine has -0.08, whereas School has 0.04). I would like to generalize this approach as follows:

If two variables are correlated with each other, each of them is then compared against the Target variable, and only the one with the higher absolute correlation with the Target is kept.

I've tried as follows:

threshold=0.5
a=abs(corr)
result=a[a>threshold]
result=pd.DataFrame(data=a).reset_index()

but result does not show my expected output (School and Medicine only) that I would need to compare with the corresponding correlation values with Target.

My expected output would be a final dataset, like the first one above, with the Target, Politics, Medicine, Hospital and Domestic columns, i.e., School would be excluded because it is correlated with Medicine and its absolute correlation with the target is lower than Medicine's.

I'd like to create a function, if it does not exist yet, that does this check automatically.

CodePudding user response：

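Assuming df is the original dataframe and corr_df is the correlation dataframe, you can use np.argwhere to get the coordinates of the cells at or above the threshold th, discard the diagonal (each column paired with itself), and then, for each remaining pair, find the column that is less correlated with the target and drop it: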

import numpy as np

th = 0.5
target_col = 'Target'
corr_df = corr_df.abs()

# (row, col) positions of off-diagonal cells >= th; sorting each pair removes mirrored duplicates
high_cor_xy = {tuple(sorted(x)) for x in np.argwhere(corr_df.values >= th) if x[0] != x[1]}
high_cor_col_idx_raw = [(corr_df.columns[i], corr_df.columns[j]) for i, j in high_cor_xy]
# ignore pairs that involve the target itself
high_cor_col_idx = [t for t in high_cor_col_idx_raw if target_col not in t]
target_corr_dict = corr_df[target_col].to_dict()
# from each pair, pick the column with the LOWER absolute correlation with the target
cols_to_drop = list({min(t, key=target_corr_dict.get) for t in high_cor_col_idx})

df = df.drop(cols_to_drop, axis=1)
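
Since the question asks for a reusable function, the same idea could be wrapped up roughly as below. This is only a sketch: the function name drop_correlated_features, its parameters, and the toy data are made up for illustration, and on a tie it arbitrarily drops the second column of the pair.

```python
import pandas as pd


def drop_correlated_features(df, target_col="Target", threshold=0.5, method="pearson"):
    """Return a copy of df without the weaker feature of each highly correlated pair.

    Hypothetical helper: for every pair of features whose absolute correlation
    is >= threshold, the one less correlated (in absolute value) with
    target_col is dropped.
    """
    corr = df.corr(method=method).abs()
    # Absolute correlation of every feature with the target (target itself excluded).
    target_corr = corr[target_col].drop(target_col)
    features = list(target_corr.index)

    to_drop = set()
    # Visit each unordered pair of features exactly once.
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if corr.loc[a, b] >= threshold:
                # Keep the feature that is more correlated with the target;
                # on a tie, b is the one dropped.
                to_drop.add(a if target_corr[a] < target_corr[b] else b)
    return df.drop(columns=sorted(to_drop))


# Toy data (invented for this example, not the asker's dataset).
toy = pd.DataFrame({
    "Target":   [-1, -1, -1, -1, 1, 1, 1, 1],
    "Medicine": [0, 0, 0, 0, 1, 1, 1, 1],
    "School":   [0, 0, 0, 1, 1, 1, 1, 1],
    "Politics": [0, 1, 0, 1, 0, 1, 0, 1],
})

reduced = drop_correlated_features(toy, target_col="Target", threshold=0.5)
print(list(reduced.columns))  # ['Target', 'Medicine', 'Politics']
```

In the toy data, Medicine and School are correlated above the threshold, and School is the one removed because its absolute correlation with Target is lower. Note that each pair is judged independently, so in a chain of correlated features (A with B, B with C) more than one column may end up being dropped.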