Correlation matrix improving print view removing duplicates

Time:03-24

I am trying to improve the printed view of a correlation matrix with the following function:

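import pandas as pd

def view_corr(df):
    # Flatten the square matrix into a Series indexed by label pairs
    df = df.unstack()
    corr_f = df.sort_values(kind="quicksort", ascending=False)
    corr_f = corr_f.dropna()
    # Drop entries equal to 1 (the diagonal), then keep values above 0.10
    corr_f = corr_f[corr_f<1]
    print(corr_f[corr_f>0.10])


data = {'A': [2,3,4,3,2],
        'B': [2,1,6,4,2],
        'C': [2,9,1,2,6],
        'D': [8,2,3,7,2],
        'E': [3,0,9,1,4]}

data = pd.DataFrame(data)
df_corr_matrix = data.corr().abs()  # corr is a method and must be called
view_corr(df_corr_matrix)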

I would like to remove the redundant and empty rows (marked '<-'), but I don't know how to deal with this in a MultiIndex:

B  C    0.774070
C  B    0.774070 <-
E  B    0.748474
B  E    0.748474 <-
   A    0.747018 <-
A  B    0.747018
D  C    0.639723
C  D    0.639723 <-
E  C    0.567548
C  E    0.567548 <-
E  A    0.460079
A  E    0.460079 <-
D  A    0.269665 
A  D    0.269665 <-
C  A    0.264340
A  C    0.264340 <-
D  E    0.217736
E  D    0.217736 <-
D  B    0.086776
B  D    0.086776 <-

CodePudding user response:

IIUC, you want to use drop_duplicates: each pair appears twice with the same value, so dropping duplicate values keeps one row per pair. The function then looks like:

def view_corr(df):
    df = df.unstack()
    corr_f = df.sort_values(kind="quicksort", ascending=False)
    corr_f = corr_f.dropna().drop_duplicates() # <<<---- here
    corr_f = corr_f[corr_f<1]
    print(corr_f[corr_f>0.10])

FYI, you can write the above function as a single chained expression. For example, instead of filtering twice, you can filter once using the & operator:

corr_f = (pd.DataFrame(data).corr().abs().stack()
          .sort_values(kind="quicksort", ascending=False)
          .drop_duplicates()
          .where(lambda x: (x<1) & (x>0.1)).dropna())

Output:

B  C    0.774070
E  B    0.748474
B  A    0.747018
D  C    0.639723
E  C    0.567548
   A    0.460079
D  A    0.269665
C  A    0.264340
D  E    0.217736
dtype: float64
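
One caveat with drop_duplicates is that it compares values only, so two different pairs that happen to share exactly the same correlation would be collapsed into one row. A minimal alternative sketch, assuming NumPy is available: mask everything except the upper triangle of the matrix before flattening, so each pair is kept once by position rather than by value. The helper name unique_corr_pairs and its lower parameter are made up for this illustration.

```python
import numpy as np
import pandas as pd


def unique_corr_pairs(data, lower=0.10):
    """Return each column pair once, sorted by absolute correlation."""
    corr = data.corr().abs()
    # True strictly above the diagonal: every unordered pair appears once
    # and the self-correlations on the diagonal are skipped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # Masked cells become NaN; dropna() removes them after flattening.
    pairs = corr.where(mask).stack().dropna()
    return pairs[pairs > lower].sort_values(ascending=False)


data = pd.DataFrame({'A': [2, 3, 4, 3, 2],
                     'B': [2, 1, 6, 4, 2],
                     'C': [2, 9, 1, 2, 6],
                     'D': [8, 2, 3, 7, 2],
                     'E': [3, 0, 9, 1, 4]})
pairs = unique_corr_pairs(data)
print(pairs)
```

With the sample data this keeps the same nine correlations as the output above; only the order of the two labels within a row may differ.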