I have a df. I want to apply a correlation matrix between the columns and store the result values in another dataframe, however, I do not want to keep the upper part of the matrix, which is useless given that it's duplicate from the other part, I also don't want the diagonal (1,1,1,1). So i applied a mask, and then append the result to a dataframe, It
my basic df :
A | B | C | name | |
---|---|---|---|---|
0 | 0.5 | 0 | 8 | test1 |
1 | 5 | 1 | 5 | test2 |
2 | 3 | 2 | 6 | test3 |
3 | 6 | 3 | 10 | test4 |
4 | 18 | 14 | 12 | test5 |
And I would like to do the correlation between those parameters and return the result in another dataframe, but only keeping the upper part to not have the duplicate values. The code is :
dft=pd.DataFrame()
for i in df['name']:
corrmat = df[df['name']==i].corr()
corr_triu = corrmat.where(~np.tril(np.ones(corrmat.shape)).astype(np.bool))
corr_triu = corr_triu.stack()
corr_triu["name","name"]=i
dft=dft.append(corr_triu,ignore_index=True)
and that's the result of my code :
(A,B) | (B,C) | (C,A) | (name,name) | |
---|---|---|---|---|
0 | 0.5 | 0.2 | 0.2 | test1 |
1 | 0.6 | 0.8 | 0.3 | test2 |
2 | 0.7 | 0.23 | 0.25 | test3 |
3 | 0.5 | 0.9 | 0.65 | test1 |
4 | 0.6 | 0.4 | 0.12 | test2 |
then I am trying to use .melt in order to restructure the dataframe but I can't do it :
dft.melt(id_vars=['(name, name)'],
var_name="Var1_Var2",
value_name="Value")
the error is :
KeyError: "The following 'id_vars' are not present in the DataFrame: ['(name, name)']"
but when I looked for the column name got :
Index([('A', 'B'), ('A', 'C'), ('B', 'C'), ('name', 'name')], dtype='object')
It seems weird that the name is not found, I do not get it.
Here is what I would like to get as a result :
Var1_Var2 | Value | name | |
---|---|---|---|
0 | (A,B) | 0.2 | test1 |
1 | (A,B) | 0.8 | test2 |
2 | (A,B) | 0.23 | test3 |
3 | (B,C) | 0.9 | test1 |
4 | (B,C) | 0.4 | test2 |
5 | (B,C) | 0.4 | test3 |
6 | (C,A) | 0.4 | test1 |
7 | (C,A) | 0.4 | test2 |
8 | (C,A) | 0.4 | test3 |
There is probably an easier way leading to the result but I can't find it
CodePudding user response:
Tuples in columns are problematic, here is solution for change tuple for scalar name
:
dft.columns = ['name' if x == ('name', 'name') else x for x in dft.columns]
df = dft.melt(id_vars='name',
var_name="Var1_Var2",
value_name="Value")
print (df)
name Var1_Var2 Value
0 test1 (A, B) 0.50
1 test2 (A, B) 0.60
2 test3 (A, B) 0.70
3 test1 (A, B) 0.50
4 test2 (A, B) 0.60
5 test1 (A, C) 0.20
6 test2 (A, C) 0.80
7 test3 (A, C) 0.23
8 test1 (A, C) 0.90
9 test2 (A, C) 0.40
10 test1 (B, C) 0.20
11 test2 (B, C) 0.30
12 test3 (B, C) 0.25
13 test1 (B, C) 0.65
14 test2 (B, C) 0.12