Here is my data. I want to transform this multi-dimensional symmetric dataframe to a 2-column dataframe without repeating.
#coding=utf-8
import pandas as pd
import numpy as np
#########
cor = [('E1',0,0,0.8,-0.8),
('E2',0,0,1.0,1.0),
('E3',0.8,1.0,0,1.0),
('E4',-0.8,1.0,1.0,0)]
label = ['','E1','E2', 'E3', 'E4']
R = pd.DataFrame.from_records(cor, columns=label)
print(R)
I want to transform it to a 2-column matrix without repeating. The main result I want is like:
CodePudding user response:
Use DataFrame.stack
for reshape (first convert empty string column to index
), set new index names by DataFrame.rename_axis
and remove duplicates in MultiIndex
by mapping frozenset
s:
R = R.set_index('').stack().rename_axis(['x','y'])
R = R[~R.index.map(frozenset).duplicated()].reset_index(name='cor')
print(R)
x y cor
0 E1 E1 0.0
1 E1 E2 0.0
2 E1 E3 0.8
3 E1 E4 -0.8
4 E2 E2 0.0
5 E2 E3 1.0
6 E2 E4 1.0
7 E3 E3 0.0
8 E3 E4 1.0
9 E4 E4 0.0
Or convert lower triangle values to missing values in DataFrame.where
with numpy.triu
, so stack
remove missing values:
R = R.set_index('')
print(R.where(np.triu(np.ones(R.shape, dtype=bool))))
E1 E2 E3 E4
E1 0.0 0.0 0.8 -0.8
E2 NaN 0.0 1.0 1.0
E3 NaN NaN 0.0 1.0
E4 NaN NaN NaN 0.0
R = (R.where(np.triu(np.ones(R.shape, dtype=bool)))
.stack()
.rename_axis(['x','y'])
.reset_index(name='cor'))
print(R)
x y cor
0 E1 E1 0.0
1 E1 E2 0.0
2 E1 E3 0.8
3 E1 E4 -0.8
4 E2 E2 0.0
5 E2 E3 1.0
6 E2 E4 1.0
7 E3 E3 0.0
8 E3 E4 1.0
9 E4 E4 0.0