Efficient correlation dataframe sorting

Time:07-29

Let's say I have a correlation matrix df.corr():

          A         B         C
A  1.000000  0.500670  0.429114
B  0.500670  1.000000  0.392397
C  0.429114  0.392397  1.000000

I would like to sort the correlations in descending order, so that the output looks something like:

  1. A/B -> 0.5
  2. A/C -> 0.43
  3. B/C -> 0.39

The thing is, I want to avoid hardcoding it with a for loop and do it efficiently instead (I'm dealing with a lot of data in my project). Should I do it with some pandas function, or is there a better approach? Would you mind sharing some code?

CodePudding user response:

A numpy approach:

import numpy as np
import pandas as pd

# correlation matrix (assumes df holds the original data, as in the question)
corr = df.corr()

# convert to numpy array
corr_np = corr.to_numpy()

# indices of the upper triangle, excluding the diagonal
rows, cols = np.triu_indices_from(corr_np, k=1)

# pull out those values as a flat array
flat = corr_np[rows, cols]

# build the matching "row/column" labels
labels = corr.columns[rows] + "/" + corr.columns[cols]

# argsort is ascending, so reverse it for descending order
indices = np.argsort(flat)[::-1]

# create Series for result
res = pd.Series(data=flat[indices], index=labels[indices])
print(res)

Output

A/B    0.500670
A/C    0.429114
B/C    0.392397
dtype: float64
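
Since the question also asks about a pandas function, here is a rough pandas-only sketch of the same idea. It is not part of the answer above, just one possible variant: the matrix is rebuilt by hand from the numbers in the question, and the names corr, upper and pairs are made up for illustration. It assumes the column labels are strings.

```python
import numpy as np
import pandas as pd

# Correlation matrix from the question, rebuilt by hand for illustration.
corr = pd.DataFrame(
    [[1.000000, 0.500670, 0.429114],
     [0.500670, 1.000000, 0.392397],
     [0.429114, 0.392397, 1.000000]],
    index=list("ABC"),
    columns=list("ABC"),
)

# Boolean mask that is True strictly above the diagonal.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# where() sets everything outside the mask to NaN, stack() flattens the frame
# into a Series indexed by (row, column), and dropna() removes the masked
# cells regardless of how the installed pandas version handles NaN in stack().
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)

# Join each (row, column) pair into a single "row/column" label.
pairs.index = pairs.index.map("/".join)

print(pairs)
```

The numpy version above avoids building the intermediate NaN-filled frame, so it is probably the better choice for a very large matrix. The pandas chain is mostly a matter of taste.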