Given a nxm matrix (n > m) of integers, I'd like to identify rows that are a multiple of a single other row, so not a linear combination of multiple other rows. I could scale all rows to their length and find unique rows, but that is prone to numerical issues on floating points and would also not detect vectors being opposite (pointing in the other directon) of each other. Any ideas?
Example
A = array([[-1, -1, 0, 0],
[-1, -1, 0, 1],
[-1, 0, -1, 0],
[-1, 0, 0, 0],
[-1, 0, 0, 1],
[-1, 0, 1, 1],
[-1, 1, -1, 0],
[-1, 1, 0, 0],
[-1, 1, 1, 0],
[ 0, -1, 0, 0],
[ 0, -1, 0, 1],
[ 0, -1, 1, 0],
[ 0, -1, 1, 1],
[ 0, 0, -1, 0],
[ 0, 0, 0, 1],
[ 0, 0, 1, 0],
[ 0, 1, -1, 0],
[ 0, 1, 0, 0],
[ 0, 1, 0, 1],
[ 0, 1, 1, 0],
[ 0, 1, 1, 1],
[ 1, -1, 0, 0],
[ 1, -1, 1, 0],
[ 1, 0, 0, 0],
[ 1, 0, 0, 1],
[ 1, 0, 1, 0],
[ 1, 0, 1, 1],
[ 1, 1, 0, 0],
[ 1, 1, 0, 1],
[ 1, 1, 1, 0]])
For example Rows 0 and -3 just point in the opposite direction (multiply one by -1 to make them equal).
CodePudding user response:
You can take advantage of the fact that inner product of two normalized linearly dependent vectors gives 1 or -1, so the code could look like this:
>>> A_normalized = (A.T/np.linalg.norm(A, axis=-1)).T
>>> M = np.absolute(np.einsum('ix,jx->ij', A_normalized, A_normalized))
>>> i, j = np.where(np.isclose(M, 1))
>>> i, j = i[i < j], j[i < j] # Remove repetitions
>>> print(i, j)
output: [ 0 2 3 6 7 9 11 13] [27 25 23 22 21 17 16 15]
CodePudding user response:
You can normalize each row dividing it by its GCD:
import numpy as np
def normalize(a):
return a // np.gcd.reduce(a, axis=1, keepdims=True)
And you can define a distance that considers opposite vectors as equal:
def distance(a, b):
equal = np.all(a == b) or np.all(a == -b)
return 0 if equal else 1
Then you can use standard clustering methods:
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster
def cluster(a):
norm_a = normalize(a)
distances = pdist(norm_a, metric=distance)
return fcluster(linkage(distances), t=0.5)
For example:
>>> A = np.array([( 1, 2, 3, 4),
... ( 0, 2, 4, 8),
... (-1, -2, -3, -4),
... ( 0, 1, 2, 4),
... (-1, 2, -3, 4),
... ( 2, -4, 6, -8)])
>>> cluster(A)
array([2, 3, 2, 3, 1, 1], dtype=int32)
Interpretation: cluster 1 is formed by rows 4 and 5, cluster 2 by rows 0 and 2, and cluster 3 by rows 1 and 3.