Cosine Similarity scores for each array combination in a list of arrays Python-CodePudding

I have a list of arrays and I want to calculate the cosine similarity for each combination of arrays in that list.

My full list comprises 20 arrays, each 3 x 25000. A small selection is below:

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity,cosine_distances


C = np.array([[-127, -108, -290],
       [-123,  -83, -333],
       [-126,  -69, -354],
       [-146, -211, -241],
       [-151, -209, -253],
       [-157, -200, -254]])



D = np.array([[-129, -146, -231],
       [-127, -148, -238],
       [-132, -157, -231],
       [ -93, -355, -112],
       [ -95, -325, -137],
       [ -99, -282, -163]])



E = np.array([[-141, -133, -200],
       [-132, -123, -202],
       [-119, -117, -204],
       [-107, -210, -228],
       [-101, -194, -243],
       [-105, -175, -244]])


ArrayList = (C,D,E)

My first problem is that I get a pairwise result for each row of each array, whereas what I am trying to achieve is a single result that compares the arrays as a whole.

For example, I try:

scores = cosine_similarity(C,D)
scores
array([[0.98078461, 0.98258287, 0.97458466, 0.643815  , 0.71118811,
        0.7929595 ],
       [0.95226207, 0.95528395, 0.9428837 , 0.55905221, 0.63291722,
        0.7240552 ],
       [0.9363733 , 0.93972303, 0.9255921 , 0.51752531, 0.59402196,
        0.68918496],
       [0.98998438, 0.98903931, 0.99377116, 0.85494921, 0.8979725 ,
        0.9449272 ],
       [0.99335622, 0.99255262, 0.99635952, 0.84106771, 0.88619755,
        0.93616556],
       [0.9955969 , 0.99463213, 0.99794805, 0.82706302, 0.8738389 ,
        0.92640196]])

What I am expecting is a single value such as 0.989... (this is a made-up number). The next challenge is how to iterate over each array in my list of arrays to get a pairwise result per array, something like this:

      C     D     E
C   1.0   0.97  0.95
D   0.97  1.0   0.96
E   0.95  0.95  1.0

As a beginner in Python, I am not sure how to proceed. Any help is appreciated.

CodePudding user response：

If I understand correctly, you want the cosine similarity between whole matrices, treating each matrix as a single 1 x n vector. The easiest way, in my opinion, is to implement the cosine similarity in vectorized form with NumPy functions. As a reminder, given two 1D vectors x and y, the cosine similarity is given by:

cosine_similarity = x.dot(y) / (np.linalg.norm(x, 2) * np.linalg.norm(y, 2))
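
For two arrays taken as a whole, this amounts to flattening each one before applying the formula. A minimal sketch follows; the helper name whole_array_cosine and the small arrays A, B and O are made up for illustration and are not the data from the question:

```python
import numpy as np

def whole_array_cosine(a, b):
    """Cosine similarity of two equally sized arrays, each treated as one flat vector."""
    # Hypothetical helper: flatten both inputs to 1D before applying the formula.
    x = np.asarray(a, dtype=float).ravel()
    y = np.asarray(b, dtype=float).ravel()
    return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

A = np.array([[1, 0], [0, 1]])
B = np.array([[2, 0], [0, 2]])   # same direction as A, only scaled
O = np.array([[0, 1], [-1, 0]])  # orthogonal to A once flattened

print(whole_array_cosine(A, B))  # very close to 1
print(whole_array_cosine(A, O))  # 0, since the flattened dot product is 0
```

Each call returns one number per pair of arrays, rather than the row-by-row matrix that cosine_similarity(C, D) produces.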

To do this with the three matrices, we first flatten each of them into a 1D representation and stack them together:

matrices_1d = np.vstack((C.reshape(1, -1), D.reshape(1, -1), E.reshape(1, -1)))

Now that we have the vector representation of each matrix, we can compute the L2 norm of each row using np.linalg.norm (see the NumPy documentation for this function) as follows:

norm_vec = np.linalg.norm(matrices_1d, ord=2, axis=1)

And finally, we can compute the cosine similarities as follows:

cos_sim = matrices_1d.dot(matrices_1d.T) / np.outer(norm_vec, norm_vec)
# array([[1.        , 0.9126993 , 0.9699609 ],
#        [0.9126993 , 1.        , 0.93485159],
#        [0.9699609 , 0.93485159, 1.        ]])

Note that, as a sanity check, the diagonal values are 1, since the cosine similarity of a vector with itself is 1.
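
To handle a list of any length (such as the 20 arrays mentioned in the question) and get a labelled table, the same steps can be wrapped in a function. This is only a sketch: the function name pairwise_array_similarity, the labels, and the small arrays P, Q and R are assumptions made for illustration, and it assumes all arrays have the same number of elements.

```python
import numpy as np
import pandas as pd

def pairwise_array_similarity(arrays, labels):
    """Return a labelled table of cosine similarities between whole arrays."""
    # Flatten every array into one row; all arrays must have the same size.
    flat = np.stack([np.asarray(a, dtype=float).ravel() for a in arrays])
    norms = np.linalg.norm(flat, axis=1)
    sim = flat.dot(flat.T) / np.outer(norms, norms)
    # sklearn.metrics.pairwise.cosine_similarity(flat) should give the same matrix.
    return pd.DataFrame(sim, index=labels, columns=labels)

# Made-up example data, not the arrays from the question.
P = np.array([[1, 2], [3, 4]])
Q = np.array([[2, 4], [6, 8]])    # P scaled by 2
R = np.array([[4, -3], [2, -1]])  # orthogonal to P once flattened

table = pairwise_array_similarity([P, Q, R], ["P", "Q", "R"])
print(table.round(3))
```

With the data from the question, the call would be pairwise_array_similarity(ArrayList, ["C", "D", "E"]).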

The cosine distance is defined as 1 - cos_sim and is easy to compute once you have the similarity.
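
As a small illustration, the conversion can be written as below. The matrix is copied from the similarity values printed earlier in this answer; the clipping step is an extra precaution I am adding here, not something the definition requires.

```python
import numpy as np

# Similarity values copied from the result printed earlier in this answer.
cos_sim = np.array([[1.0, 0.9126993, 0.9699609],
                    [0.9126993, 1.0, 0.93485159],
                    [0.9699609, 0.93485159, 1.0]])

# Cosine distance is one minus cosine similarity.
cos_dist = 1.0 - cos_sim

# Optional precaution: floating-point rounding can push values slightly
# outside the valid range [0, 2], so clip them back.
cos_dist = np.clip(cos_dist, 0.0, 2.0)

print(cos_dist.round(4))
```

The diagonal of the distance matrix is 0, mirroring the diagonal of 1 in the similarity matrix.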