I am trying to build a similarity matrix using a custom similarity function. The problem is that the code runs very slowly.
I have a dataframe which looks like this:
col1 col2 col3
'car' 'A' 'cat'
'car' 'C' 'dog'
'bike' 'A' 'cat'
...
and I have a list of weights, one per column, which set the importance of each column: [0.1, 0.5, 0.4]
I want to compute the similarity between rows in a custom similarity matrix, where two rows are similar if they have the same values, with the weights making some columns more important than others.
My current similarity function takes two arrays as input and checks how many elements are identical between them, weighted by weights (an array with the same length as x and y):
def custom_similarity(x, y, weights):
    similarity = np.dot((x == y).values * 1, weights)
    return similarity
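To make the expected behaviour concrete, here is a minimal sketch that applies the function to the first two rows of the example above. The names row_a and row_b are made up for illustration. Only the first column matches, so the result should equal the first weight.

```python
import numpy as np
import pandas as pd

def custom_similarity(x, y, weights):
    # Weighted count of positions where the two rows hold the same value.
    return np.dot((x == y).values * 1, weights)

weights = np.array([0.1, 0.5, 0.4])

# Hypothetical names for the first two rows of the example dataframe.
row_a = pd.Series(["car", "A", "cat"], index=["col1", "col2", "col3"])
row_b = pd.Series(["car", "C", "dog"], index=["col1", "col2", "col3"])

score = custom_similarity(row_a, row_b, weights)
print(score)  # only col1 matches, so this is the weight of col1
```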
Given a dataframe where each row represents one of the arrays to compare, I would like to generate a similarity matrix of the dataframe using this function.
At the moment I am doing something like this (filling an empty matrix), which works but is super slow:
sim_matrix = np.zeros((len(df), len(df)))
for i in tqdm(range(len(df))):
    obs_i = df.iloc[i, :]
    for j in range(i, len(df)):
        obs_j = df.iloc[j, :]
        sim_matrix[i, j] = sim_matrix[j, i] = custom_similarity(obs_i, obs_j, weights)
How can I make this more efficient and speed it up?
CodePudding user response:
Try this:
from functools import partial
custom_similarity = partial(custom_similarity, weights=weights)
sim_matrix = df.T.corr(custom_similarity)
(Depending on your dataframe, you might need to modify custom_similarity by removing .values, since corr passes NumPy arrays to the function. This also assumes that the weights are normalized.)
Explanation:
partial is a function that pre-fills some arguments of another function and returns a new callable. Here it binds weights, because corr calls the function with only two arguments.
Next we have to transpose the dataframe, because corr calculates coefficients between columns, and lastly we pass custom_similarity as the "rule" to correlate with.
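A minimal sketch of this approach on the three example rows from the question. Two things here are my assumptions and not part of the answer above. First, as far as I know corr converts the data to floats before calling the function, so the string columns are encoded as per-column integer category codes. Second, .values is dropped because the callable receives plain NumPy arrays.

```python
from functools import partial

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["car", "car", "bike"],
    "col2": ["A", "C", "A"],
    "col3": ["cat", "dog", "cat"],
})
weights = np.array([0.1, 0.5, 0.4])  # normalized: sums to 1

def custom_similarity(x, y, weights):
    # x and y arrive as 1-D NumPy arrays, so no .values is needed.
    return np.dot((x == y) * 1, weights)

# Assumption: encode each column as integer codes so the data is numeric.
# Codes are only compared within the same column, so equality is preserved.
codes = df.apply(lambda col: col.astype("category").cat.codes)

sim_matrix = codes.T.corr(method=partial(custom_similarity, weights=weights))
print(sim_matrix)
```

As far as I can tell, corr fills the diagonal with 1.0 itself, which matches the self-similarity only when the weights sum to 1. The function is still called once per pair of rows, so I would not expect a large speed-up over the explicit loop.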
CodePudding user response:
One way is to use scipy.spatial to create the distance matrix for you. That is already a little more efficient than what you have rolled yourself. In particular, you could do the following, using pdist and a custom metric function:
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
def sim_mat(df, weights):
    mat = squareform(pdist(df.values,
                           metric=lambda x, y: (x == y) @ weights))
    np.fill_diagonal(mat, sum(weights))
    return mat
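If pdist is still too slow, another option is to drop the per-pair Python call entirely. This is a different technique from the pdist approach above, sketched under the assumption that an n-by-n matrix fits in memory: loop over the columns instead of the row pairs, and compare each column against itself with NumPy broadcasting. The function name sim_mat_broadcast is made up for this example.

```python
import numpy as np
import pandas as pd

def sim_mat_broadcast(df, weights):
    # Hypothetical helper: accumulate one weighted equality matrix per column.
    n = len(df)
    mat = np.zeros((n, n))
    for col, w in zip(df.columns, weights):
        values = df[col].to_numpy()
        # values[:, None] == values[None, :] is an n-by-n boolean matrix that
        # is True wherever two rows share the same value in this column.
        mat += w * (values[:, None] == values[None, :])
    return mat

df = pd.DataFrame({
    "col1": ["car", "car", "bike"],
    "col2": ["A", "C", "A"],
    "col3": ["cat", "dog", "cat"],
})
weights = [0.1, 0.5, 0.4]

mat = sim_mat_broadcast(df, weights)
print(mat)
```

The diagonal comes out as sum(weights) automatically, since every row matches itself in every column, so no fill_diagonal step is needed here.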