Home > database >  How to Store correlation matrix's values in dataframe
How to Store correlation matrix's values in dataframe

Time:10-20

I have a df. I want to apply a correlation matrix between the columns and store the result values in another dataframe, however, I do not want to keep the upper part of the matrix, which is useless given that it's duplicate from the other part, I also don't want the diagonal (1,1,1,1). So i applied a mask, and then append the result to a dataframe, It

my basic df :

A B C name
0 0.5 0 8 test1
1 5 1 5 test2
2 3 2 6 test3
3 6 3 10 test4
4 18 14 12 test5

And I would like to do the correlation between those parameters and return the result in another dataframe, but only keeping the upper part to not have the duplicate values. The code is :

dft=pd.DataFrame()
for i in df['name']:
    corrmat = df[df['name']==i].corr()
    corr_triu = corrmat.where(~np.tril(np.ones(corrmat.shape)).astype(np.bool))
    corr_triu = corr_triu.stack()
    corr_triu["name","name"]=i
    dft=dft.append(corr_triu,ignore_index=True)

and that's the result of my code :

(A,B) (B,C) (C,A) (name,name)
0 0.5 0.2 0.2 test1
1 0.6 0.8 0.3 test2
2 0.7 0.23 0.25 test3
3 0.5 0.9 0.65 test1
4 0.6 0.4 0.12 test2

then I am trying to use .melt in order to restructure the dataframe but I can't do it :

dft.melt(id_vars=['(name, name)'], 
        var_name="Var1_Var2", 
        value_name="Value")

the error is :

KeyError: "The following 'id_vars' are not present in the DataFrame: ['(name, name)']"

but when I looked for the column name got :

Index([('A', 'B'),
       ('A', 'C'),
       ('B', 'C'),
       ('name', 'name')], dtype='object')

It seems weird that the name is not found, I do not get it.

Here is what I would like to get as a result :

Var1_Var2 Value name
0 (A,B) 0.2 test1
1 (A,B) 0.8 test2
2 (A,B) 0.23 test3
3 (B,C) 0.9 test1
4 (B,C) 0.4 test2
5 (B,C) 0.4 test3
6 (C,A) 0.4 test1
7 (C,A) 0.4 test2
8 (C,A) 0.4 test3

There is probably an easier way leading to the result but I can't find it

CodePudding user response:

Tuples in columns are problematic, here is solution for change tuple for scalar name:

dft.columns = ['name' if x == ('name', 'name') else x for x in dft.columns]

df = dft.melt(id_vars='name', 
          var_name="Var1_Var2", 
          value_name="Value")
print (df)
     name Var1_Var2  Value
0   test1    (A, B)   0.50
1   test2    (A, B)   0.60
2   test3    (A, B)   0.70
3   test1    (A, B)   0.50
4   test2    (A, B)   0.60
5   test1    (A, C)   0.20
6   test2    (A, C)   0.80
7   test3    (A, C)   0.23
8   test1    (A, C)   0.90
9   test2    (A, C)   0.40
10  test1    (B, C)   0.20
11  test2    (B, C)   0.30
12  test3    (B, C)   0.25
13  test1    (B, C)   0.65
14  test2    (B, C)   0.12
  • Related