python: groupby from three columns and generate symmetric matrix-CodePudding

There is a data frame:

data = pd.DataFrame({'o':[1,1,2,1,2],
                 'd':[1,2,1,3,3],
                 'distance':[3,5,10,6,8]})

I want to average the third column when the indexes of the first two columns are the same or reversed. For example, if o = 1, d = 2, distance = 5 or o = 2, d = 1, distance = 10, there is o = 1, d = 2, distance = 7.5 or o = 2, d = 1, distance = 7.5.
Next, according to this three columns dataframe, I want to generate a symmetric matrix, where o and d is index and column, and the value is distance. In addition, the diagonal of the matrix is set to 0.

The ideal result is:

a = np.array([[0,7.5,6],
        [7.5,0,8],
        [6,8,0]])

CodePudding user response：

import pandas as pd
import numpy as np

data = pd.DataFrame({'o':[1,1,2,1,2],
                 'd':[1,2,1,3,3],
                 'distance':[3,5,10,6,8]})

data['query'] = data.apply(lambda row: tuple(sorted([row.o, row.d])), axis=1)  # ignore order

avg = data.groupby('query').distance.mean()

N = max(data.o.max(), data.d.max())  # square matrix
result = np.zeros((N, N), dtype=np.float32)

for (row, col), v in avg.items():
    row, col = row - 1, col -1  # index start from 0
    if row == col: continue  # diagonal set to 0
    result[row][col] = result[col][row] = v  # symmetric

print(result)

CodePudding user response：

Use:

#assigned sorted values to columns back
data[['o','d']] = np.sort(data[['o','d']], axis=1)

#pivoting with aggregate mean
df = data.pivot_table(index='o',columns='d', values='distance', aggfunc='mean')

#create matrix
df = df.combine_first(df.T)

#set 0 to diagonal
np.fill_diagonal(df.to_numpy(), 0)

print (df)
     1    2    3
1  0.0  7.5  6.0
2  7.5  0.0  8.0
3  6.0  8.0  0.0