Let's say I've got two sorted vectors of floating point numbers - let's call them A and B. Is there a clever way of selecting an element of A and an element of B such that those elements are nearest in value of all the pairs? Right now I'm brute forcing it with two loops, but it seems like there ought to be a better way.
CodePudding user response:
This is another version, which has shorter/cleaner code and is also (on my benchmarks) slightly quicker:
function safe_min_dist(L1, L2)
@assert !isempty(L1) && !isempty(L2)
@assert issorted(L1) && issorted(L2)
return unsafe_min_dist(L1, L2)
end
function unsafe_min_dist(L1, L2)
swapped = L1[1] > L2[1]
LL, HL = swapped ? (L2, L1) : (L1, L2)
LLlen, HLlen = length(LL), length(HL)
li, hi = 1, 1
curmin = HL[1]-LL[1]
curli, curhi, curswapped = li, hi, swapped
@inbounds while true
li = 1
if li > LLlen || LL[li] > HL[hi]
if HL[hi]-LL[li-1] < curmin
curmin = HL[hi] - LL[li-1]
curli, curhi, curswapped = li-1, hi, swapped
curmin == 0 && break
end
li > LLlen && break
LL, HL, LLlen, HLlen = HL, LL, HLlen, LLlen
li, hi, swapped = hi, li, !swapped
end
end
return curswapped ? (curmin, curhi, curli) : (curmin, curli, curhi)
end
After defining these functions and find_min_dist
from Picaud's answer and its unsafe_
version. We can run a benchmark as follows:
using BenchmarkTools
using Random
Random.seed!(0x1234)
X = sort(rand(100))
Y = sort(rand(200))
julia> @btime find_min_dist($X, $Y)
945.846 ns (0 allocations: 0 bytes)
(1.0869188864059964e-5, 37, 78)
julia> @btime unsafe_find_min_dist($X, $Y)
590.511 ns (0 allocations: 0 bytes)
(1.0869188864059964e-5, 37, 78)
julia> @btime safe_min_dist($X, $Y)
842.310 ns (0 allocations: 0 bytes)
(1.0869188864059964e-5, 37, 78)
julia> @btime unsafe_min_dist($X, $Y)
458.737 ns (0 allocations: 0 bytes)
(1.0869188864059964e-5, 37, 78)
CodePudding user response:
Here is the demo :
julia> X=sort(rand(6)) # data example X
6-element Vector{Float64}:
0.03229732486724901
0.14661947289585864
0.28060083090585386
0.35640311556807047
0.8995421870143288
0.9063824527540892
julia> Y=sort(rand(5)) # data example Y (note: X and Y sizes can be different)
5-element Vector{Float64}:
0.40423308422286974
0.7126138483715454
0.8721509236032997
0.9193793976271042
0.9827581490910492
julia> minimum( (abs(xi-yi) for xi in X, yi in Y) ) # a short, but O(N^2) approach
0.012996944873014948
julia> (d,i,j)=find_min_dist(X,Y) # <- the proposed solution (see code below)
(0.012996944873014948, 6, 4)
julia> abs(X[i]-Y[j]) # <- check that the 2 methods give the same result
0.012996944873014948
Here is the complete code : this is not that short, sorry. However observe that there is only one pass, the complexity is O(N)
function find_min_dist(X::AbstractVector{T}, Y::AbstractVector{T}) where {T<:AbstractFloat}
@assert !isempty(X) && !isempty(Y)
if Y[1]<X[1]
(d,j,i) = find_min_dist(Y,X)
return (d,i,j)
end
@assert issorted(X) && issorted(Y)
X_n = length(X)
Y_n = length(Y)
min_dist = typemax(T)
min_dist_X_i = 0
min_dist_Y_j = 0
i = 1
j = 1
@inbounds while i ≤ X_n
# Find first j, such that Y[j] ≥ X[i]
# and check dist(X[i],Y[j-1]) && dist(X[i],Y[j])
#
if (j ≤ Y_n) && (Y[j]<X[i])
j =1
else
if j > 1,
min_dist_candidate = abs(X[i]-Y[j-1])
if min_dist > min_dist_candidate
min_dist = min_dist_candidate
min_dist_X_i = i
min_dist_Y_j = j-1
end
end
if j ≤ Y_n
min_dist_candidate = abs(X[i]-Y[j])
if min_dist > min_dist_candidate
min_dist = min_dist_candidate
min_dist_X_i = i
min_dist_Y_j = j
end
end
i =1
end
end
(min_dist,min_dist_X_i,min_dist_Y_j)
end
Here are some explanations :
The idea is to loop once over X
and Y
and find the first j
such that Y[j] ≥ X[i]
. In order to do not miss the first Y[1]
we permute X
and Y
if necessary. As soon as Y[j] ≥ X[i]
, we check dist(X[i],Y[j-1])
and dist(X[i],Y[j])
. If a better min_dist
is found, we record it with the i,j
indices defining the best pair.
update: benchmark for length(X)=100
& length(Y)=200
With all pairs comparison :
julia> @benchmark minimum( (abs(xi-yi) for xi in $X, yi in $Y) )
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 36.840 μs … 197.413 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 37.811 μs ┊ GC (median): 0.00%
Time (mean ± σ): 41.021 μs ± 9.457 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▇█ ▆ ▄ ▃ ▁▁▁ ▁▁▁▁ ▂
██▄█▅▅██▃█▃▃▁▁▄▃▄▄█████████████████▇▇▆▇▇▆▅▇▆▅▆▅▆▅▅▅▅▆▅▅▅▆▆▆▅ █
36.8 μs Histogram: log(frequency) by time 81.7 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
With find_min_dist
:
julia> @benchmark find_min_dist($X,$Y)
BenchmarkTools.Trial: 10000 samples with 181 evaluations.
Range (min … max): 583.718 ns … 2.140 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 598.771 ns ┊ GC (median): 0.00%
Time (mean ± σ): 655.339 ns ± 134.850 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇▄ ▂▁▁▂▂▂▃▃▃▁▁ ▁▁ ▁ ▁
███▇█████████████▇██████▆▆▇▇▇▇▇▇▇▇▇▇▆▇▆▆▆▇▆▆▆▆▅▆▅▆▅▅▆▅▅▄▅▄▃▄▄ █
584 ns Histogram: log(frequency) by time 1.26 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
The proposed solution is 60 times faster, and you can gain a little bit more by removing the @assert issorted
.