If I have two arrays, how can I count the number of matching elements?
E.g. with
x = [1,2,3,4,5]
y = [3,4,5,6]
I'd like to get the count (3
) of the three matching elements 3
,4
,and 5
.
CodePudding user response:
You can use intersect
:
julia> x = [1, 2, 3, 4, 5]
5-element Vector{Int64}:
1
2
3
4
5
julia> y = [3, 4, 5, 6]
4-element Vector{Int64}:
3
4
5
6
julia> intersect(Set(x), Set(y))
Set{Int64} with 3 elements:
5
4
3
julia> length(intersect(Set(x), Set(y)))
3
CodePudding user response:
The following algorithm can be near 4X faster than Set intersection. The idea is to sort the arrays first, that has O(n log n)
complexity for each array. Then merge-compare the sorted versions for equal elements, that has O(m n)
linear complexity. So, the overall algorithm complexity can be O(n log n)
.
This algorithm counts duplicate elements into the final matches result, but can be modified with a small overhead to behave similarly to sets. The modification can include adding a variable to keep track of the last matched elements and increment the number of matches only for new different matched pairs.
function count_matches(x,y)
sort!(x) # or x = sort(x)
sort!(y) # or y = sort(y)
i = j = 1
matches = 0
while i <= length(x) && j <= length(y)
if x[i] == y[j]
i = 1
j = 1
matches = 1
elseif x[i] < y[j]
i = 1
else
j = 1
end
end
matches
end
Comparing with:
function count_matches0(x,y)
length(intersect(Set(x), Set(y)))
end
and timing with n = 10000
arrays, we get:
@btime count_matches(x, y) setup=(x = rand(1:1000,10000); y = rand(1:1000,10000)) evals=1
@btime count_matches0(x, y) setup=(x = rand(1:1000,10000); y = rand(1:1000,10000)) evals=1
246.700 μs (31 allocations: 338.31 KiB)
63.200 μs (2 allocations: 15.88 KiB)
CodePudding user response:
A lot depends on the sizes of the arrays. If the arrays are just a few dozen integers in length, a simple O(N^2) count wins over the count_matches sorting method and the intersect count_matches0 methods above, because of zero allocation setup time:
function count_matches2(x, y)
count(n -> any(==(n), x), y)
end
@btime count_matches(x, y) setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
@btime count_matches0(x, y) setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
@btime count_matches2(x, y) setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
2.400 μs (0 allocations: 0 bytes)
3.700 μs (10 allocations: 3.59 KiB)
1.500 μs (0 allocations: 0 bytes)
The simplicity advantage vanishes with arrays of size > 1000.