Faster way to perform a function on each row with every other row in a DataFrame?

I want to apply an operation to each row paired with every other row in a DataFrame. The obvious way is to use nested for loops, which is, as expected, very slow.

Is there a faster and better way to achieve the same thing?

This is the DataFrame: each row is a user vector, and the index holds the usernames. In practice there can be hundreds of usernames.

import pandas as pd
df1 = pd.DataFrame({"A":[11,2,3], "B":[4,5,6], "C":[7,8,9]}, index=["U1","U2", "U3"])

Nested Loop Method

import numpy as np
def some_func(u1_vec,u2_vec):
    # this could be any function using above 2 user vectors
    return np.minimum(u1_vec, u2_vec).sum()/np.maximum(u1_vec, u2_vec).sum()


index_list = list(df1.index) # contains usernames
vector_cols = list(df1.columns) # contains colnames

min_max_all = {} # will be used to store the vector interaction 
for index_u1 in index_list:
    u1_vec = df1.loc[index_u1, vector_cols]
    min_max_all[index_u1] = {}
    for index_u2 in index_list:
        u2_vec = df1.loc[index_u2, vector_cols]
        min_max_all[index_u1][index_u2] = some_func(u1_vec, u2_vec)

Result - min_max_all

{
'U1': {'U1': 1.0, 'U2': 0.5416666666666666, 'U3': 0.5384615384615384},
'U2': {'U1': 0.5416666666666666, 'U2': 1.0, 'U3': 0.8333333333333334},
'U3': {'U1': 0.5384615384615384, 'U2': 0.8333333333333334, 'U3': 1.0}
}

CodePudding user response：

I think the best way is to use NumPy and write vectorized code for this specific function, rather than a generic loop over rows.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({"A":[11,2,3], "B":[4,5,6], "C":[7,8,9]}, index=["U1","U2", "U3"])
df1_np = df1.to_numpy()

x = np.minimum(df1_np[:, np.newaxis], df1_np).sum(axis=2)
y = np.maximum(df1_np[:, np.newaxis], df1_np).sum(axis=2)

print(x/y)
[[1.         0.54166667 0.53846154]
 [0.54166667 1.         0.83333333]
 [0.53846154 0.83333333 1.        ]]
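Why this works comes down to broadcasting: adding an axis turns the (3, 3) array into shape (3, 1, 3), and combining that with the original (3, 3) array yields a (3, 3, 3) array whose entry [i, j] is the element-wise result for row i and row j. A minimal sketch of the shapes involved, using the question's data (the names `a`, `expanded`, `pair_min` and `ratio` are only for illustration):

```python
import numpy as np

# Rows U1, U2, U3 from the question's DataFrame
a = np.array([[11, 4, 7],
              [2, 5, 8],
              [3, 6, 9]])

expanded = a[:, np.newaxis]           # shape (3, 1, 3)
pair_min = np.minimum(expanded, a)    # broadcasts to shape (3, 3, 3)
pair_max = np.maximum(expanded, a)    # same shape

# pair_min[i, j] is the element-wise minimum of row i and row j
same_as_loop = (pair_min[0, 1] == np.minimum(a[0], a[1])).all()

# Summing over the last axis collapses each pair to a single number
ratio = pair_min.sum(axis=2) / pair_max.sum(axis=2)
print(expanded.shape, pair_min.shape, ratio.shape)
```

Note that the intermediate array has n * n * d elements for n users and d columns, so memory use grows quadratically with the number of users.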

To build a dictionary like the one in your question, key it by the DataFrame index (the usernames), not by the columns:

z = x/y
{ui: {uj: z[i][j] for j, uj in enumerate(df1.index)}
    for i, ui in enumerate(df1.index)}
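Another option, sketched here on the assumption that a labelled table is acceptable as an intermediate, is to wrap the ratio matrix in a DataFrame indexed by username on both axes and let pandas build the nested dictionary. The names `arr`, `result_df` and `result_dict` are illustrative only:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [11, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
                   index=["U1", "U2", "U3"])
arr = df1.to_numpy()

# Pairwise sum-of-minimums divided by sum-of-maximums, as in the answer
z = (np.minimum(arr[:, np.newaxis], arr).sum(axis=2)
     / np.maximum(arr[:, np.newaxis], arr).sum(axis=2))

# Label both axes with the usernames
result_df = pd.DataFrame(z, index=df1.index, columns=df1.index)

# orient="index" gives {row_label: {column_label: value}}
result_dict = result_df.to_dict(orient="index")
print(result_dict["U1"]["U2"])
```

Keeping the result as a DataFrame also allows lookups such as `result_df.loc["U1", "U2"]` without converting to a dictionary at all.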