which solution is the most performing and why in order to find the number of duplicates in a complex-CodePudding

I have the following arrays:

a = [1, 1, 1, 1, 3]
b = [2, 3, 2, 3, 3]
c = [1, 1, 1, 1, 3]

my goal is to calculate the amount of extra repetitions for each column. Meaning in this case that [1,2,1] appears twice, meaning 1 duplicate, and likewise for [1,3,1] so in total the amount of duplicates is 2, once for [1,2,1] and once for [1,3,1]. I have developed the following 2 solutions but I don't know to be honest which one is the most performing and why:

Solution 1:

sum = 0
zip = a.zip(b, c)
zip.group_by { |e| e}
.select { |_, value| value.size > 1 }
.each_value { |value| sum  = (value.size - 1) }
return sum

Solution 2:

zip = a.zip(b, c)
hash = Hash.new(0)
zip.each { |e| hash.store(e, hash[e] 1) }
hash.each{|e, _| hash[e] -= 1} 
return hash.sum {|e, _| hash[e] }

thanks in advance

CodePudding user response：

Illustrating Bench-marking :

require 'benchmark'
    
v1 = [1, 1, 1, 1]
v2 = [2, 3, 2, 3]
v3 = [1, 1, 1, 1 ]
    
def sol_1(a,b,c)
  sum = 0
  zip = a.zip(b, c)
  zip.group_by { |e| e}
  .select { |_, value| value.size > 1 }
  .each_value { |value| sum  = (value.size - 1) }
  return sum
end

    
def sol_2(a,b,c)
  zip = a.zip(b, c)
  hash = Hash.new(0)
  zip.each { |e| hash.store(e, hash[e] 1) }
  hash.each{|e, _| hash[e] -= 1} 
  return hash.sum {|e, _| hash[e] }
end
    
n=1_000
Benchmark.bmbm do |x|
  x.report("sol_1"){n.times{sol_1(v1, v2, v3)} }
  x.report("sol_2"){n.times{sol_2(v1, v2, v3)} }
end

Results in:

Rehearsal -----------------------------------------
sol_1   0.011076   0.000000   0.011076 (  0.011091)
sol_2   0.012276   0.000000   0.012276 (  0.012355)
-------------------------------- total: 0.023352sec

            user     system      total        real
sol_1   0.007206   0.000000   0.007206 (  0.007212)
sol_2   0.011452   0.000000   0.011452 (  0.011453)

CodePudding user response：

So, just by reading it both solutions are very similar in approach. While I am not 100% sure what you mean by most performing, but I'll guess that you mean computational complexity of both solutions - so computational cost for large inputs. When there is a lot of columns, the only element of the solution that takes the time is iterating over the array of columns - everything else will take very little time in comparison.

So in first solution, you're iterating 3 times - once to group the columns, second to select ones with duplicates and then 3rd time to count the repetitions (however here, in the worse case scenario, the array you iterate over has at most N/2 elements). So, in total you have 2.5 iterations over array of columns.

In second solution, you're also iterating 3 times. Firstly, over the array of columns to count how many times they appear, then over the result (which in a worst case scenario has same amount of elements) to subtract one from each number and finally to sum the numbers - this gives roughly 3 iterations.

So, first solution might be just slightly more performant - however when dealing with complexity we look at the type of function ignoring the number in front of it - in this case both solutions are linear. Additionally, different methods are optimized in different way in ruby. So the only hope of determining which one is more performant would go with benchmarks - repeating those algorithms 100 times for (the same) 10000 columns takes 10.5s for first solution and 18s for the second solution.

CodePudding user response：

Here is a slightly (20%) faster solution to @steenslag's benchmark:

require 'matrix'
def sol_3(matrix)
  Matrix.
    columns(matrix).
    to_a.
    each_with_object({}) { |e, a|
      digest = e.hash
      a[digest] = a[digest].nil? ? 1 : a[digest]   1
    }.sum { |_, v| v > 1 ? 1 : 0 }
end

            user     system      total        real
sol_1   0.006908   0.000008   0.006916 (  0.006917)
sol_2   0.011866   0.000018   0.011884 (  0.011902)
sol_3   0.005532   0.000008   0.005540 (  0.005555)

Complete script: https://gist.github.com/jaredbeck/edc708df10fcc0267db80bf1c31c8298