Dataframe faster cosine similarity


I have a dataframe of individual tweets (id, text, author_id, nn_list), where nn_list is a list of other tweet indices previously identified as potential nearest neighbours. For each tweet, I now have to calculate the cosine similarity between its tf-idf vector and the vector of every entry in this list, looking both up by index in the tf-idf matrix, but my current approach is rather slow. The code looks something like this:

from sklearn.metrics.pairwise import pairwise_distances

for index, row in data_df.iterrows():
    nn_distance = float("inf")  # reset the best distance for each tweet
    for candidate in row["nn_list"]:
        # cosine distance between the candidate's and this tweet's tf-idf vectors
        candidate_cos = float("%.2f" % pairwise_distances(
            tfidf_matrix[candidate], tfidf_matrix[index], metric='cosine'))

        if candidate_cos < nn_distance:
            current_nn_candidate = candidate
            nn_distance = candidate_cos

Is there a significantly faster way to calculate this?

CodePudding user response:

The following code should work, assuming the range of IDs is not too large:

import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({"nn_list": [[1, 2], [1, 2, 3], [1, 2, 3, 7], [11, 12, 13], [2, 1]]})

# Build the (data, (row_ind, col_ind)) triplets expected by csr_matrix, as in
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html
df["data"] = df["nn_list"].apply(lambda x: np.repeat(1, len(x)))  # one 1 per neighbour
df["row"] = df.index
df["row_ind"] = df[["row", "nn_list"]].apply(
    lambda x: np.repeat(x["row"], len(x["nn_list"])), axis=1)  # repeat the row id once per neighbour
df["col_ind"] = df["nn_list"].apply(lambda x: np.array(x))  # neighbour ids become column indices

# Sparse 0/1 matrix: row i has a 1 in every column listed in nn_list[i]
m = csr_matrix(
    (np.concatenate(df["data"]),
     (np.concatenate(df["row_ind"]), np.concatenate(df["col_ind"]))))

cosine_similarity(m)

This will return:

array([[1.        , 0.81649658, 0.70710678, 0.        , 1.        ],
       [0.81649658, 1.        , 0.8660254 , 0.        , 0.81649658],
       [0.70710678, 0.8660254 , 1.        , 0.        , 0.70710678],
       [0.        , 0.        , 0.        , 1.        , 0.        ],
       [1.        , 0.81649658, 0.70710678, 0.        , 1.        ]])
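
To get back to the original problem of picking, for each tweet, the closest candidate from its nn_list, the same one-shot idea can be applied to the tf-idf matrix itself and the result indexed per row. A minimal sketch, assuming data_df and tfidf_matrix from the question, that the dataframe index lines up with the matrix rows, and that a dense n x n array fits in memory (nn_candidate is just an illustrative column name):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# One pass over the whole tf-idf matrix instead of one pairwise_distances call per pair
sim = cosine_similarity(tfidf_matrix)

nearest = []
for i, candidates in enumerate(data_df["nn_list"]):
    cands = np.asarray(candidates)
    # the highest similarity is the lowest cosine distance, as in the original loop
    nearest.append(cands[np.argmax(sim[i, cands])])
data_df["nn_candidate"] = nearest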

If you have a larger range of IDs, I recommend using Spark or having a look at "cosine similarity on large sparse matrix with numpy".
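
For reference, the core trick behind that sparse approach is to L2-normalise the rows and take a sparse dot product, so the dense n x n array is never materialised. A minimal sketch, assuming the product itself is sparse enough to fit in memory:

from sklearn.preprocessing import normalize

# Rows scaled to unit L2 norm: their dot products are exactly the cosine similarities
m_norm = normalize(m, norm="l2", axis=1)

# Sparse similarity matrix; entry (i, j) is the cosine similarity of rows i and j
sim_sparse = m_norm @ m_norm.T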
