Home > Back-end >  Find the closest group of vectors, one vector from each set?
Find the closest group of vectors, one vector from each set?

Time:08-10

I have k sets of vectors. The vectors are all the same length m. The sets are not all the same length, but let's say they have an average length of n vectors in each. I need to find the group of vectors, one from each set, that has the minimum distance (L2 norm) to each other. This is similar to the "closest pair" problem, but that's for just 2 sets, whereas I have k sets.

The naïve way is to cross join all the values and search through all O(n^k) distances. Is there a better way/algorithm?

Example  
Set A [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  
Set B [[0.5, 0.9], [0.1, 0.3], [0.9, 0.1]]  
Set C [[0.2, 0.2], [0.8, 0.4], [0.5, 0.1]]  
Result - A [0.1, 0.2], B [0.1, 0.3], C [0.2, 0.2] with L2 distance 0.14  

CodePudding user response:

The Curse of Dimensionality is not your friend. If m is reasonably large, odds are good that all pairs of vectors are at similar distances from each other. And therefore there is no clump of good vectors to find. And therefore your search - even through reasonable candidates - is going to have to look at a lot of options.

That said, I will describe a more efficient algorithm. But note that it also consumes a lot of memory. So it will scale better than your current algorithm at first, but it will also top out at memory limits. Hopefully it is good enough for your needs.

First I'm going to describe a hypothetical directed graph. (We are going to try to avoid actually constructing most of it, but conceptually it exists.) It will be in layers. the l'th layer will be the set of all collections of l points, one from each of the first l matrices. We connect the l'th layer to the l 1th if they agree on the first l points, and the weight of the edge connecting them is the sum of the distances from the last to all other. The 0th node is the empty set. And we join the kth layer to our target with edges of weight 1.

Verify the following. A path from the empty set to the target has as its total weight the total L2 norm of its k'th level. Therefore the lowest weight path gives you the answer you are looking for.

Therefore this is a good candidate for A* search. The heuristic to use for the rest of the path is:

c = sum(points) / len(points)
s = sum(distance(p, c) for p in points)
heuristic s * (k - len(points))

In other words the remaining points can do no better than to be at the center of mass, and the heuristic is what the sum of the remaining distances would be if this was true.

If there is a natural clump to find, this can find the answer with as little as O(m * n * (n k) * log(n*k))) work. If there are no natural clumps, there are likely examples that come arbitrarily close to the worst case bound, which is something like O(n^k * m * k^2 * log(n)). And, of course, it can require O(n^k) memory, at which point you simply crash.

So this might or might not work, but it is worth trying.

CodePudding user response:

so you have k sets each with ~n vectors and each vector has m dimensions.

For tasks like this I usually use maps of all grid aligned "directions" and compute only on them instead of using the direction present in input data this usually significantly reduce the problem combinations count and also can be done recursively with increasing precision (near last found solution with coarser grid). I would attack your case like this:

  1. create map of major vector "directions" O(m*(s^m))

    in case you have coordinate range <0,1> then create a map of all vector combinations with some s steps for example 11 steps will in 2D lead to

    (0.0,0),(0.0,0.1)...(1.0,0.9),(1.0,1.0)
    

    hold them in m dimensional array of vectors for easy access again 2D example:

     vec2 dir[s][s] = 
        { 
        { (0.0,0.0),(0.0,0.1)...(0.0,1.0) },
        { (0.1,0.0),(0.1,0.1)...(0.1,1.0) },
        ...
        { (1.0,0.0),(1.0,0.1)...(1.0,1.0) }
        };
    
  2. for each vector of dir compute max distance to each set O(n*k*(s^m))

    so simply loop through all vectors of dir and for each compute smallest distance each set and remember only the biggest one. Hold the result in m dimensional array dis so:

    dis[i1][i2]...[im] = max( |dir[i1][i2]...[im],set[1]| ,
                              |dir[i1][i1]...[im],set[2]| ,
                              ... ,
                              |dir[i1][i2]...[im],set[k]| )
    
  3. find min distance in dis O(s^m)

  4. find closest element in each set to found vector in dir corresponding to min distance in dis O(m*n*k*m)

    this is your wanted output. this might be speed up by binary search if n is really big to O(m*log(n)*k*m).

Beware the s should be chosen big enough so a valid solution is not set. The resulting complexity is around O(m*n*k*(s^m)) assuming m is not too big a number. The this will be faster only if:

O(m*(n^k)) > O(n*k*(s^m))

See these QAs using very similar technique (map of "directions"):

  • Related