I'm trying to improve the printed view of a correlation matrix:
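import pandas as pd

def view_corr(df):
    # Flatten the square matrix into a Series indexed by (column, row) pairs
    df = df.unstack()
    corr_f = df.sort_values(kind="quicksort", ascending=False)
    corr_f = corr_f.dropna()
    # Drop the diagonal (a variable correlated with itself)
    corr_f = corr_f[corr_f < 1]
    print(corr_f[corr_f > 0.10])

data = {'A': [2, 3, 4, 3, 2],
        'B': [2, 1, 6, 4, 2],
        'C': [2, 9, 1, 2, 6],
        'D': [8, 2, 3, 7, 2],
        'E': [3, 0, 9, 1, 4]}
data = pd.DataFrame(data)
# corr is a method, so it must be called before chaining abs()
df_corr_matrix = data.corr().abs()
view_corr(df_corr_matrix)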
I would like to remove the redundant and empty rows (marked '<-'), but I don't know how to deal with this when the result has a MultiIndex:
B C 0.774070
C B 0.774070 <-
E B 0.748474
B E 0.748474 <-
A 0.747018 <-
A B 0.747018
D C 0.639723
C D 0.639723 <-
E C 0.567548
C E 0.567548 <-
E A 0.460079
A E 0.460079 <-
D A 0.269665
A D 0.269665 <-
C A 0.264340
A C 0.264340 <-
D E 0.217736
E D 0.217736 <-
D B 0.086776
B D 0.086776 <-
CodePudding user response:
IIUC, you want to use drop_duplicates, so that the function looks like:
def view_corr(df):
    df = df.unstack()
    corr_f = df.sort_values(kind="quicksort", ascending=False)
    corr_f = corr_f.dropna().drop_duplicates()  # <<<---- here
    corr_f = corr_f[corr_f < 1]
    print(corr_f[corr_f > 0.10])
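One caveat worth keeping in mind: drop_duplicates on a Series compares the values only and ignores the index, so it removes the mirrored pair because both entries hold the same number, not because it recognises (B, C) and (C, B) as the same pair. A small sketch with made-up values (the 0.5 entries and the pair labels are illustrative, not taken from the data above) shows the side effect:

```python
import pandas as pd

# Hypothetical values: a mirrored pair plus an unrelated pair that
# happens to share the same correlation.
idx = pd.MultiIndex.from_tuples([("B", "C"), ("C", "B"), ("A", "D")])
s = pd.Series([0.5, 0.5, 0.5], index=idx)

# drop_duplicates keeps the first occurrence of each value,
# regardless of which index labels it belongs to.
deduped = s.drop_duplicates()
print(deduped)
```

Here the unrelated (A, D) pair is dropped along with the mirrored (C, B) one. With real-valued correlations such exact ties are probably rare, but they can occur.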
FYI, you can write the above function as a single chained expression. For example, instead of filtering twice, you can filter once using the & operator:
corr_f = (pd.DataFrame(data).corr().abs().stack()
          .sort_values(kind="quicksort", ascending=False)
          .drop_duplicates()
          .where(lambda x: (x < 1) & (x > 0.1)).dropna())
Output:
B C 0.774070
E B 0.748474
B A 0.747018
D C 0.639723
E C 0.567548
A 0.460079
D A 0.269665
C A 0.264340
D E 0.217736
dtype: float64
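If relying on equal values feels fragile, a different approach (not the one used in the answer above) is to keep only the entries strictly below the diagonal of the matrix before flattening it. That removes the diagonal and the mirrored half by position rather than by value. A minimal sketch, assuming NumPy is available; the helper name unique_pairs and its low parameter are hypothetical:

```python
import numpy as np
import pandas as pd

def unique_pairs(corr, low=0.10):
    """Return each variable pair once, sorted by descending correlation."""
    # True strictly below the diagonal, False on and above it.
    mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    # where() turns the masked-out cells into NaN; dropna() removes them
    # explicitly so the result does not depend on stack()'s NaN handling.
    pairs = corr.where(mask).stack().dropna()
    pairs = pairs[pairs > low]
    return pairs.sort_values(ascending=False)

data = pd.DataFrame({'A': [2, 3, 4, 3, 2],
                     'B': [2, 1, 6, 4, 2],
                     'C': [2, 9, 1, 2, 6],
                     'D': [8, 2, 3, 7, 2],
                     'E': [3, 0, 9, 1, 4]})
corr = data.corr().abs()
pairs = unique_pairs(corr)
print(pairs)
```

Because the mask works on positions, two different pairs with the same correlation are both kept, and no x < 1 filter is needed to drop the diagonal.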