Home > Software design >  How to count matches in two arrays?
How to count matches in two arrays?

Time:08-24

If I have two arrays, how can I count the number of matching elements?

E.g. with

x = [1,2,3,4,5]
y = [3,4,5,6]

I'd like to get the count (3) of the three matching elements 3,4,and 5.

CodePudding user response:

You can use intersect:

julia> x = [1, 2, 3, 4, 5]
5-element Vector{Int64}:
 1
 2
 3
 4
 5

julia> y = [3, 4, 5, 6]
4-element Vector{Int64}:
 3
 4
 5
 6

julia> intersect(Set(x), Set(y))
Set{Int64} with 3 elements:
  5
  4
  3

julia> length(intersect(Set(x), Set(y)))
3

CodePudding user response:

The following algorithm can be near 4X faster than Set intersection. The idea is to sort the arrays first, that has O(n log n) complexity for each array. Then merge-compare the sorted versions for equal elements, that has O(m n) linear complexity. So, the overall algorithm complexity can be O(n log n).

This algorithm counts duplicate elements into the final matches result, but can be modified with a small overhead to behave similarly to sets. The modification can include adding a variable to keep track of the last matched elements and increment the number of matches only for new different matched pairs.

function count_matches(x,y)
    sort!(x) # or x = sort(x) 
    sort!(y) # or y = sort(y)
    i = j = 1
    matches = 0
    while i <= length(x) && j <= length(y)
        if x[i] == y[j]
            i  = 1 
            j  = 1
            matches  = 1
        elseif x[i] < y[j]
            i  = 1
        else 
            j  = 1
        end
    end 
    matches
end

Comparing with:

function count_matches0(x,y)
    length(intersect(Set(x), Set(y)))
end

and timing with n = 10000 arrays, we get:

@btime count_matches(x, y)  setup=(x = rand(1:1000,10000); y = rand(1:1000,10000)) evals=1
@btime count_matches0(x, y) setup=(x = rand(1:1000,10000); y = rand(1:1000,10000)) evals=1
  246.700 μs (31 allocations: 338.31 KiB)
  63.200 μs (2 allocations: 15.88 KiB)

CodePudding user response:

A lot depends on the sizes of the arrays. If the arrays are just a few dozen integers in length, a simple O(N^2) count wins over the count_matches sorting method and the intersect count_matches0 methods above, because of zero allocation setup time:

function count_matches2(x, y)
    count(n -> any(==(n), x), y)
end

@btime count_matches(x, y)  setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
@btime count_matches0(x, y) setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1
@btime count_matches2(x, y)  setup=(x = rand(1:100,50); y = rand(1:100,50)) evals=1

2.400 μs (0 allocations: 0 bytes)
3.700 μs (10 allocations: 3.59 KiB)
1.500 μs (0 allocations: 0 bytes)

The simplicity advantage vanishes with arrays of size > 1000.

  • Related