I'm trying to build a method in Ruby (2.6.3) that returns the top 100 objects with the highest count from an array of 1_000_000 elements, and it needs to be really fast. I have benchmarked two possible solutions:
arr = []
1_000_000.times { arr << Faker::Internet.public_ip_v4_address }

def each_counter(arr)
  counter = Hash.new(0)
  arr.each do |item|
    counter[item] += 1
  end
  counter.sort_by { |_k, v| -v }[0..99]
end

def each_with_obj_counter(arr)
  results = arr.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
  results.sort_by { |_k, v| -v }[0..99]
end
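For reference, the timings below were produced with a harness roughly like this (a minimal sketch using the standard Benchmark module; the exact labels and label width are assumed):

require 'benchmark'

# Assumed harness: arr and the two methods above are already defined.
Benchmark.bm(15) do |x|
  x.report('each:')          { each_counter(arr) }
  x.report('each_with_obj:') { each_with_obj_counter(arr) }
end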
And the results are:
user system total real
each: 1.360000 0.141000 1.501000 (1.505116)
each_with_obj: 0.859000 0.047000 0.906000 (0.910877)
Is there a better solution for this? I've been asked to upgrade the method in order to provide a quick response under 300ms. Is that possible?
CodePudding user response:
Try this.
def doit(arr, top_n)
  arr.each_with_object(Hash.new(0)) { |n, h| h[n] += 1 }
     .max_by(top_n, &:last)
end
For example,
a = %w| a b c d e f g h i j |
#=> ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
arr = 50.times.map { a.sample }
#=> ["i", "d", "i", "g", "f", "f", "e", "h", "g", "i",
# "b", "c", "b", "b", "a", "g", "j", "f", "e", "h",
# "b", "d", "h", "g", "d", "d", "g", "j", "b", "h",
# "g", "g", "g", "f", "d", "b", "h", "j", "c", "e",
# "d", "d", "c", "b", "f", "h", "g", "j", "d", "h"]
doit(arr, 5)
#=> [["g", 9], ["d", 8], ["h", 7], ["b", 7], ["f", 5]]
Note that in this example,
arr.each_with_object(Hash.new(0)) { |n, h| h[n] += 1 }
#=> {"i"=>3, "d"=>8, "g"=>9, "f"=>5, "e"=>3,
# "h"=>7, "b"=>7, "c"=>3, "a"=>1, "j"=>4}
Let's try a poor man's benchmark.
def each_with_obj_counter(arr, top_n)
  results = arr.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
  results.sort_by { |_k, v| -v }[0..top_n - 1]
end
require 'time'
top_n = 20
n = 1_000_000
arr = n.times.map { rand(1_000) }
arr.size
#=> 1000000
arr.first(10)
#=> [68, 259, 168, 79, 809, 38, 912, 398, 243, 850]
t = Time.now
each_with_obj_counter(arr, top_n)
Time.now - t
#=> 0.140723
t = Time.now
doit(arr, top_n)
Time.now - t
#=> 0.132715
Both methods returned the array
[[516, 1090], [740, 1086], [788, 1085], [392, 1085], [440, 1083],
[568, 1081], [890, 1081], [688, 1080], [306, 1078], [982, 1078],
[841, 1075], [447, 1074], [897, 1074], [630, 1072], [600, 1072],
[500, 1071], [20, 1071], [282, 1071], [410, 1070], [918, 1070]]
With repeated runs (and for larger array sizes) there was no clear winner. While the results were not quite what I was hoping for, they may be of general interest. I've noticed before that, for Ruby, sort is blazingly fast.
CodePudding user response:
I would use Enumerable#tally to do the counting, hoping that this built-in Ruby method is better optimized than building the counts by hand with a Ruby hash. Note that this method was added in Ruby 2.7. And just returning the elements with the maximum counters with Enumerable#max_by is certainly faster than completely sorting the hash first.
arr.tally.max_by(100, &:last)
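If you are stuck on Ruby 2.6.3, where tally doesn't exist yet, a rough pre-2.7 stand-in (a sketch, assuming a plain hash count is acceptable) is to build the counts yourself and still let max_by do the partial selection:

# Pre-2.7 approximation of tally: count occurrences into a Hash, then take the top 100.
counts  = arr.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
top_100 = counts.max_by(100, &:last)   # => [[element, count], ...]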
CodePudding user response:
Make Sure You're Testing the Right Things
First of all, your methodology is flawed because you don't have a fixture. As a result, the contents of arr will vary each time you run the experiment, which will cause your benchmarks to vary.
Secondly, the space of random IPv4 addresses is so large that a million records might yield only one or two duplicates of any given address. As an example, using the same array-building you used:
arr.count
#=> 1000000
arr.tally.invert.count
#=> 2
1_000_000 - arr.tally.keys.count
#=> 135
That means out of a million records, this test instance found only about 135 duplicated records, and no IP address appeared more than twice on this particular run. I'm not sure how you would define "the top 100" in such a case.
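One low-effort way to get a repeatable fixture is to generate the data once, persist it, and reload the same array for every run. A minimal sketch (the dump file name is arbitrary, and this assumes a Faker version that supports seeding via Faker::Config.random):

require 'faker'

FIXTURE = 'ips.dump' # arbitrary file name for the persisted fixture

unless File.exist?(FIXTURE)
  Faker::Config.random = Random.new(42) # fixed seed so the generated data is reproducible
  arr = Array.new(1_000_000) { Faker::Internet.public_ip_v4_address }
  File.binwrite(FIXTURE, Marshal.dump(arr))
end

arr = Marshal.load(File.binread(FIXTURE)) # identical data on every benchmark run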
Getting to 50 Milliseconds with Different Data
That said, given a more sensible test it isn't too hard to get to about 0.05 seconds with MRI, especially on a fast machine. For example, with Ruby 3.0.2 and IRB's built-in measure command:
arr = []
1_000_000.times { arr << (1..1_000).to_a.sample }
measure
top_100 = arr.tally.invert.sort.reverse.lazy.take(100); nil
processing time: 0.054442s
Note: With Ruby 3.0.2, the conversion back to a Hash no longer seems to guarantee insertion order, so it's up to you whether there's any point in calling #to_h on the result. It didn't affect my time much; it simply made the results less useful.
Using #sort_by isn't much faster, although in some cases it might perform slightly better depending on the size of the Hash or the data it contains. For example:
top_100 = arr.tally.sort_by { _2 }.reverse.take(100); nil
processing time: 0.052266s
TruffleRuby managed the same jobs in about 10 milliseconds less than MRI. If milliseconds count, consider using a different Ruby engine than MRI.
Next Steps
It's not hard to get the type of processing you want under 300 milliseconds. The real problem seems to be that you have to do the equivalent of a full table scan (e.g. processing a million mostly-unique records) in a way that provides few opportunities for differentiation of the results, or optimizations for parsing the data.
If you really need that sort of number crunching for slow-to-parse data, delegate to your database, storage engine, external map-reduce process, or other tools. Even if you could figure a way to use Ractors or otherwise split the job into some sort of multi-core threaded algorithm, it's unlikely that you'll be able to process a million records in pure Ruby much faster without it being a lot more trouble than it's worth. YMMV, though.
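For instance, if the addresses already live in a relational database, the whole job collapses into a single GROUP BY query. A sketch, assuming an established ActiveRecord connection and a hypothetical ips table with an address column:

# Hypothetical schema: table `ips` with a string `address` column.
sql = <<~SQL
  SELECT address, COUNT(*) AS cnt
  FROM ips
  GROUP BY address
  ORDER BY cnt DESC
  LIMIT 100
SQL

top_100 = ActiveRecord::Base.connection.exec_query(sql).rows # [[address, count], ...]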
Use External Tools to Get to Under a Millisecond
It's also worth pointing out that the features of higher-level languages like Ruby, as well as formatting and terminal I/O, are often more expensive than the data crunching itself. Just as a point of comparison, you can take the same million numbers and parse them in the shell faster than in Ruby, although Ruby does a much better job of generating large volumes of data, manipulating complex data structures, and providing convenience methods. For example:
File.open('numbers.txt', ?w) do |f|
  1_000_000.times { f.puts Random.rand(1_000) }
end
#=> 1000000
~> time printf "Count: %s, Value: %s\n" \
(string split -nr " " \
(gsort -rn numbers.txt | guniq -cd | ghead -100))
Count: 962, Value: 999
Count: 1000, Value: 998
Count: 968, Value: 997
Count: 1006, Value: 996
# remainder of items trimmed for space
________________________________________________________
Executed in 1.45 millis fish external
usr time 492.00 micros 492.00 micros 0.00 micros
sys time 964.00 micros 964.00 micros 0.00 micros
Even with the relatively expensive I/O to the terminal, this took only 1.45 milliseconds total. For further comparison, storing the output in a variable for later formatting took only 20 microseconds.
~> time set top_100 (string split -nr " " \
(gsort -rn numbers.txt | guniq -cd | ghead -100))
________________________________________________________
Executed in 20.00 micros fish external
usr time 20.00 micros 20.00 micros 0.00 micros
sys time 2.00 micros 2.00 micros 0.00 micros
The point here isn't to avoid using Ruby, especially since several of the answers are already well under your 300 millisecond threshold. Rather, the point is that if you need more raw speed then you can get it, but you may need to think outside the box if you want to get down into the realm of microseconds.