I want to apply an operation to each row paired with every other row in a DataFrame. The obvious way is nested for loops, which is, as expected, very slow.
Is there a faster and better way to achieve the same thing?
Below is a DataFrame where each row is a user vector and the index holds the usernames. In practice there can be hundreds of usernames.
import pandas as pd
df1 = pd.DataFrame({"A":[11,2,3], "B":[4,5,6], "C":[7,8,9]}, index=["U1","U2", "U3"])
Nested Loop Method
import numpy as np
def some_func(u1_vec, u2_vec):
    # this could be any function using above 2 user vectors
    return np.minimum(u1_vec, u2_vec).sum() / np.maximum(u1_vec, u2_vec).sum()
index_list = list(df1.index) # contains usernames
vector_cols = list(df1.columns) # contains colnames
min_max_all = {} # will be used to store the vector interaction
for index_u1 in index_list:
    u1_vec = df1.loc[index_u1, vector_cols]
    min_max_all[index_u1] = {}
    for index_u2 in index_list:
        u2_vec = df1.loc[index_u2, vector_cols]
        min_max_all[index_u1][index_u2] = some_func(u1_vec, u2_vec)
Result - min_max_all
{
'U1': {'U1': 1.0, 'U2': 0.5416666666666666, 'U3': 0.5384615384615384},
'U2': {'U1': 0.5416666666666666, 'U2': 1.0, 'U3': 0.8333333333333334},
'U3': {'U1': 0.5384615384615384, 'U2': 0.8333333333333334, 'U3': 1.0}
}
CodePudding user response:
I think the best way is to use NumPy broadcasting and write vectorized code specific to the function you need, rather than a generic pairwise loop.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({"A":[11,2,3], "B":[4,5,6], "C":[7,8,9]}, index=["U1","U2", "U3"])
df1_np = df1.to_numpy()
x = np.minimum(df1_np[:, np.newaxis], df1_np).sum(axis=2)
y = np.maximum(df1_np[:, np.newaxis], df1_np).sum(axis=2)
print(x/y)
array([[1.        , 0.54166667, 0.53846154],
       [0.54166667, 1.        , 0.83333333],
       [0.53846154, 0.83333333, 1.        ]])
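To see why this works, it may help to follow the array shapes through the broadcast. The sketch below reuses the same data; the names a, expanded, pairwise_min, pairwise_max and ratio are only illustrative. Note that the intermediate array has shape (n_users, n_users, n_features), so memory use grows with the square of the number of users.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [11, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
                   index=["U1", "U2", "U3"])

a = df1.to_numpy()              # shape (3, 3): users x features
expanded = a[:, np.newaxis]     # shape (3, 1, 3): one extra axis per user

# Broadcasting (3, 1, 3) against (3, 3) yields (3, 3, 3):
# entry [i, j, k] compares feature k of user i with feature k of user j.
pairwise_min = np.minimum(expanded, a)
pairwise_max = np.maximum(expanded, a)

# Summing over the feature axis leaves one value per (user i, user j) pair.
ratio = pairwise_min.sum(axis=2) / pairwise_max.sum(axis=2)

print(expanded.shape, pairwise_min.shape, ratio.shape)
```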
To build a dictionary like the one in your question, key it by the row labels (df1.index) rather than the column names:
z = x/y
{ui: {uj: z[i][j] for j, uj in enumerate(df1.index)}
 for i, ui in enumerate(df1.index)}
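As an alternative to a nested comprehension, the result matrix can be wrapped in a DataFrame labeled with the usernames on both axes and then converted with to_dict. This is a minimal sketch on the same data; the names result_df and result_dict are mine, not from the question.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [11, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
                   index=["U1", "U2", "U3"])

a = df1.to_numpy()
z = (np.minimum(a[:, np.newaxis], a).sum(axis=2)
     / np.maximum(a[:, np.newaxis], a).sum(axis=2))

# Label rows and columns with the usernames so lookups read naturally.
result_df = pd.DataFrame(z, index=df1.index, columns=df1.index)

# orient="index" gives {row_label: {column_label: value}}.
result_dict = result_df.to_dict(orient="index")

print(result_df.loc["U1", "U2"])
print(result_dict["U2"]["U3"])
```

Keeping the result as a DataFrame is often more convenient than a dictionary, since result_df.loc["U1", "U2"] looks up a pair directly and the whole table can be sorted or filtered.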