Computing pairwise accuracy/comparison between many arrays-CodePudding

Let's say I have several arrays, where each array is the same length. I am working with binary-valued (values are 0 or 1) arrays which might simplify the problem, so it's okay if the proposed solution makes use of this property.

I want to compute pairwise accuracies between each pair of arrays, where accuracy can be thought of as the proportion of times the elements in two arrays are equal. So here is a simple example where I am using a list of lists format. Let's say A = [[1,1,1], [0,1,0], [1,1,0]]. We would want to output:

1. , 1/3, 2/3
1/3,  1., 2/3
2/3, 2/3,   1.

I can compute this using multiple loops (iterating over each pair of arrays, and over each index). However, is there are built-in functionalities or library (e.g numpy) that can help do this more cleanly and efficiently?

CodePudding user response：

You can use broadcasting:

import numpy as np

A = np.array([[1,1,1], [0,1,0], [1,1,0]])

output = A[:,None,:] == A[None,:,:]
output = output.sum(axis=2) / 3

print(output)
# [[1.         0.33333333 0.66666667]
#  [0.33333333 1.         0.66666667]
#  [0.66666667 0.66666667 1.        ]]

CodePudding user response：

I'd suggest

A = np.array(A)
-1 * np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1, ord=1)/len(A)   1

that leverages NumPy's linalg.norm.

Since pairwise accuracies seemingly refers to the relative number of coinciding elements in between two vectors. In this case, you compute

1 - HammingDistance(v1, v2) / len(v2)

where the Hamming distance counts the (absolute) number of indices of non-equal values. This is emulated by using the 1-norm through ord=1.

However, if you'd prefer to leverage the binary structure of your vectors without invoking the linear algebra in NumPy but merely is broadcasting capability,

A = np.array(A)
-1 * (A[:, None, :] != A).sum(2)/len(A)   1

will equally do.

Naturally, both code snippets require the lists (i.e. vectors) in your code to have the same length. However, it is non-trivial in a mathematically rigorous way to measure distance (and in turn, similarity) anyway when this is not the case.