Home > Back-end >  Finding nearest values in two vectors
Finding nearest values in two vectors

Time:07-20

Let's say I've got two sorted vectors of floating point numbers - let's call them A and B. Is there a clever way of selecting an element of A and an element of B such that those elements are nearest in value of all the pairs? Right now I'm brute forcing it with two loops, but it seems like there ought to be a better way.

CodePudding user response:

This is another version, which has shorter/cleaner code and is also (on my benchmarks) slightly quicker:

function safe_min_dist(L1, L2)
    @assert !isempty(L1) && !isempty(L2)
    @assert issorted(L1) && issorted(L2)
    return unsafe_min_dist(L1, L2)
end

function unsafe_min_dist(L1, L2)
    swapped = L1[1] > L2[1]
    LL, HL = swapped ? (L2, L1) : (L1, L2)
    LLlen, HLlen = length(LL), length(HL)

    li, hi = 1, 1
    curmin = HL[1]-LL[1]
    curli, curhi, curswapped = li, hi, swapped

    @inbounds while true
        li  = 1
        if li > LLlen || LL[li] > HL[hi]
            if HL[hi]-LL[li-1] < curmin
                curmin = HL[hi] - LL[li-1]
                curli, curhi, curswapped = li-1, hi, swapped
                curmin == 0 && break
            end
            li > LLlen && break 
            LL, HL, LLlen, HLlen = HL, LL, HLlen, LLlen
            li, hi, swapped = hi, li, !swapped
        end
    end
    return curswapped ? (curmin, curhi, curli) : (curmin, curli, curhi)
end

After defining these functions and find_min_dist from Picaud's answer and its unsafe_ version. We can run a benchmark as follows:

using BenchmarkTools
using Random
Random.seed!(0x1234)

X = sort(rand(100))
Y = sort(rand(200))
julia> @btime find_min_dist($X, $Y)
  945.846 ns (0 allocations: 0 bytes)
(1.0869188864059964e-5, 37, 78)

julia> @btime unsafe_find_min_dist($X, $Y)
  590.511 ns (0 allocations: 0 bytes)
(1.0869188864059964e-5, 37, 78)

julia> @btime safe_min_dist($X, $Y)
  842.310 ns (0 allocations: 0 bytes)
(1.0869188864059964e-5, 37, 78)

julia> @btime unsafe_min_dist($X, $Y)
  458.737 ns (0 allocations: 0 bytes)
(1.0869188864059964e-5, 37, 78)

CodePudding user response:

Here is the demo :

julia> X=sort(rand(6)) # data example X
6-element Vector{Float64}:
 0.03229732486724901
 0.14661947289585864
 0.28060083090585386
 0.35640311556807047
 0.8995421870143288
 0.9063824527540892

julia> Y=sort(rand(5))  # data example Y (note: X and Y sizes can be different)
5-element Vector{Float64}:
 0.40423308422286974
 0.7126138483715454
 0.8721509236032997
 0.9193793976271042
 0.9827581490910492

julia> minimum( (abs(xi-yi) for xi in X, yi in Y) ) # a short, but O(N^2) approach
0.012996944873014948

julia> (d,i,j)=find_min_dist(X,Y) # <- the proposed solution (see code below)
(0.012996944873014948, 6, 4)

julia> abs(X[i]-Y[j]) # <- check that the 2 methods give the same result
0.012996944873014948

Here is the complete code : this is not that short, sorry. However observe that there is only one pass, the complexity is O(N)

function find_min_dist(X::AbstractVector{T}, Y::AbstractVector{T}) where {T<:AbstractFloat}
    @assert !isempty(X) && !isempty(Y)

    if Y[1]<X[1]
        (d,j,i) = find_min_dist(Y,X)
        return (d,i,j)
    end
    
    @assert issorted(X) && issorted(Y)
    
    X_n = length(X)
    Y_n = length(Y)

    min_dist = typemax(T)
    min_dist_X_i = 0
    min_dist_Y_j = 0
    
    i = 1
    j = 1

    @inbounds while i ≤ X_n
        # Find first j, such that Y[j] ≥ X[i]
        # and check dist(X[i],Y[j-1]) && dist(X[i],Y[j])
        #
        if (j ≤ Y_n) && (Y[j]<X[i])
            j =1
        else
            if j > 1, 
               min_dist_candidate = abs(X[i]-Y[j-1])
               if min_dist > min_dist_candidate 
                   min_dist = min_dist_candidate
                   min_dist_X_i = i
                   min_dist_Y_j = j-1
               end
            end

            if j ≤ Y_n
               min_dist_candidate = abs(X[i]-Y[j])
               if min_dist > min_dist_candidate 
                   min_dist = min_dist_candidate
                   min_dist_X_i = i
                   min_dist_Y_j = j
               end
            end

            i =1
        end
    end
    
    (min_dist,min_dist_X_i,min_dist_Y_j)
end 

Here are some explanations :

The idea is to loop once over X and Y and find the first j such that Y[j] ≥ X[i]. In order to do not miss the first Y[1] we permute X and Y if necessary. As soon as Y[j] ≥ X[i], we check dist(X[i],Y[j-1]) and dist(X[i],Y[j]). If a better min_dist is found, we record it with the i,j indices defining the best pair.


update: benchmark for length(X)=100 & length(Y)=200

With all pairs comparison :

julia> @benchmark minimum( (abs(xi-yi) for xi in $X, yi in $Y) )
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  36.840 μs … 197.413 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     37.811 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   41.021 μs ±   9.457 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█ ▆  ▄  ▃          ▁▁▁    ▁▁▁▁                              ▂
  ██▄█▅▅██▃█▃▃▁▁▄▃▄▄█████████████████▇▇▆▇▇▆▅▇▆▅▆▅▆▅▅▅▅▆▅▅▅▆▆▆▅ █
  36.8 μs       Histogram: log(frequency) by time      81.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

With find_min_dist :

julia> @benchmark find_min_dist($X,$Y)
BenchmarkTools.Trial: 10000 samples with 181 evaluations.
 Range (min … max):  583.718 ns …   2.140 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     598.771 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   655.339 ns ± 134.850 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▇▄  ▂▁▁▂▂▂▃▃▃▁▁   ▁▁ ▁                                       ▁
  ███▇█████████████▇██████▆▆▇▇▇▇▇▇▇▇▇▇▆▇▆▆▆▇▆▆▆▆▅▆▅▆▅▅▆▅▅▄▅▄▃▄▄ █
  584 ns        Histogram: log(frequency) by time       1.26 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

The proposed solution is 60 times faster, and you can gain a little bit more by removing the @assert issorted.

  • Related