I'm trying to build a method in Ruby (2.6.3) that returns the top 100 objects with the highest count from an array of 1_000_000 elements, and it needs to be really fast. I have benchmarked two possible solutions:
arr = []
1_000_000.times { arr << Faker::Internet.public_ip_v4_address }

def each_counter(arr)
  counter = Hash.new(0)
  arr.each do |item|
    counter[item] += 1
  end
  counter.sort_by { |_k, v| -v }[0..99]
end

def each_with_obj_counter(arr)
  results = arr.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
  results.sort_by { |_k, v| -v }[0..99]
end
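For reference, the timings below were produced with a harness roughly like this (a minimal sketch using the standard Benchmark module; the exact labels and label width are assumed):

require 'benchmark'

# Assumed harness: arr and the two methods above are already defined.
Benchmark.bm(15) do |x|
  x.report('each:')          { each_counter(arr) }
  x.report('each_with_obj:') { each_with_obj_counter(arr) }
end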
And the results are:
user system total real
each: 1.360000 0.141000 1.501000 (1.505116)
each_with_obj: 0.859000 0.047000 0.906000 (0.910877)
Is there a better solution for this? I've been asked to upgrade the method in order to provide a quick response under 300ms. Is that possible?
CodePudding user response:
Try this.
def doit(arr, top_n)
  arr.each_with_object(Hash.new(0)) { |n, h| h[n] += 1 }
     .max_by(top_n, &:last)
end
For example,
a = %w| a b c d e f g h i j |
#=> ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
arr = 50.times.map { a.sample }
#=> ["i", "d", "i", "g", "f", "f", "e", "h", "g", "i",
# "b", "c", "b", "b", "a", "g", "j", "f", "e", "h",
# "b", "d", "h", "g", "d", "d", "g", "j", "b", "h",
# "g", "g", "g", "f", "d", "b", "h", "j", "c", "e",
# "d", "d", "c", "b", "f", "h", "g", "j", "d", "h"]
doit(arr, 5)
#=> [["g", 9], ["d", 8], ["h", 7], ["b", 7], ["f", 5]]
Note that in this example,
arr.each_with_object(Hash.new(0)) { |n, h| h[n] += 1 }
#=> {"i"=>3, "d"=>8, "g"=>9, "f"=>5, "e"=>3,
# "h"=>7, "b"=>7, "c"=>3, "a"=>1, "j"=>4}
Let's try a poor man's benchmark.
def each_with_obj_counter(arr, top_n)
  results = arr.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
  results.sort_by { |_k, v| -v }[0..top_n - 1]
end
require 'time'
top_n = 20
n = 1_000_000
arr = n.times.map { rand(1_000) }
arr.size
#=> 1000000
arr.first(10)
#=> [68, 259, 168, 79, 809, 38, 912, 398, 243, 850]
t = Time.now
each_with_obj_counter(arr, top_n)
Time.now - t
#=> 0.140723
t = Time.now
doit(arr, top_n)
Time.now - t
#=> 0.132715
Both methods returned the array
[[516, 1090], [740, 1086], [788, 1085], [392, 1085], [440, 1083],
[568, 1081], [890, 1081], [688, 1080], [306, 1078], [982, 1078],
[841, 1075], [447, 1074], [897, 1074], [630, 1072], [600, 1072],
[500, 1071], [20, 1071], [282, 1071], [410, 1070], [918, 1070]]
With repeated runs (and for larger array sizes) there was no clear winner. While the results were not quite what I was hoping for, they may be of general interest. I've noticed before that, for Ruby, sort is blazingly fast.
CodePudding user response:
I would use Enumerable#tally to do the counting, hoping that this built-in Ruby method is better optimized than building the counts by hand with a Ruby hash. Note that this method was added in Ruby 2.7. And just returning the elements with the maximum counters with Enumerable#max_by is certainly faster than completely sorting the hash first.
arr.tally.max_by(100, &:last)
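If you are stuck on Ruby 2.6.3, where tally doesn't exist yet, a rough pre-2.7 stand-in (a sketch, assuming a plain hash count is acceptable) is to build the counts yourself and still let max_by do the partial selection:

# Pre-2.7 approximation of tally: count occurrences into a Hash, then take the top 100.
counts  = arr.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
top_100 = counts.max_by(100, &:last)   # => [[element, count], ...]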
CodePudding user response:
Make Sure You're Testing the Right Things
First of all, your methodology is flawed because you don't have a fixture. As a result, the contents of arr will vary each time you run the experiment, which will cause your benchmarks to vary.
Secondly, the space of random IPv4 addresses is so large that a million records might yield only one or two duplicates of any given address. As an example, using the same array-building you used:
arr.count
#=> 1000000
arr.tally.invert.count
#=> 2
1_000_000 - arr.tally.keys.count
#=> 135
That means out of a million records, this test instance found only about 135 duplicated records, and no IP address appeared more than twice on this particular run. I'm not sure how you would define "the top 100" in such a case.
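One low-effort way to get a repeatable fixture is to generate the data once, persist it, and reload the same array for every run. A minimal sketch (the dump file name is arbitrary, and this assumes a Faker version that supports seeding via Faker::Config.random):

require 'faker'

FIXTURE = 'ips.dump' # arbitrary file name for the persisted fixture

unless File.exist?(FIXTURE)
  Faker::Config.random = Random.new(42) # fixed seed so the generated data is reproducible
  arr = Array.new(1_000_000) { Faker::Internet.public_ip_v4_address }
  File.binwrite(FIXTURE, Marshal.dump(arr))
end

arr = Marshal.load(File.binread(FIXTURE)) # identical data on every benchmark run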
Getting to 50 Milliseconds with Different Data
That said, given a more sensible test it isn't too hard to get to about 0.05 seconds with MRI, especially on a fast machine. For example, with Ruby 3.0.2 and IRB's built-in measure command:
arr = []
1_000_000.times { arr << (1..1_000).to_a.sample }
measure
top_100 = arr.tally.invert.sort.reverse.lazy.take(100); nil
processing time: 0.054442s
Note: With Ruby 3.0.2, the conversion back to a Hash no longer seems to guarantee insertion order, so it's up to you whether there's any point in calling #to_h on the result. It didn't affect my time much; it simply made the results less useful.
Using #sort_by isn't much faster, although in some cases it might perform slightly better depending on the size of the Hash or the data it contains. For example:
top_100 = arr.tally.sort_by { _2 }.reverse.take(100); nil
processing time: 0.052266s
TruffleRuby managed the same jobs in about 10 milliseconds less than MRI. If milliseconds count, consider using a different Ruby engine than MRI.
Next Steps
It's not hard to get the type of processing you want under 300 milliseconds. The real problem seems to be that you have to do the equivalent of a full table scan (e.g. processing a million mostly-unique records) in a way that provides few opportunities for differentiation of the results, or optimizations for parsing the data.
If you really need that sort of number crunching for slow-to-parse data, delegate to your database, storage engine, external map-reduce process, or other tools. Even if you could figure a way to use Ractors or otherwise split the job into some sort of multi-core threaded algorithm, it's unlikely that you'll be able to process a million records in pure Ruby much faster without it being a lot more trouble than it's worth. YMMV, though.
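For instance, if the addresses already live in a relational database, the whole job collapses into a single GROUP BY query. A sketch, assuming an established ActiveRecord connection and a hypothetical ips table with an address column:

# Hypothetical schema: table `ips` with a string `address` column.
sql = <<~SQL
  SELECT address, COUNT(*) AS cnt
  FROM ips
  GROUP BY address
  ORDER BY cnt DESC
  LIMIT 100
SQL

top_100 = ActiveRecord::Base.connection.exec_query(sql).rows # [[address, count], ...]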
Use External Tools to Get to Under a Millisecond
It's also worth pointing out that the features of higher-level languages like Ruby, as well as formatting and terminal I/O, are often more expensive than the data crunching itself. Just as a point of comparison, you can take the same million numbers and parse them in the shell faster than in Ruby, although Ruby does a much better job of generating large volumes of data, manipulating complex data structures, and providing convenience methods. For example:
File.open('numbers.txt', ?w) do |f|
  1_000_000.times { f.puts Random.rand(1_000) }
end
#=> 1000000
~> time printf "Count: %s, Value: %s\n" \
(string split -nr " " \
(gsort -rn numbers.txt | guniq -cd | ghead -100))
Count: 962, Value: 999
Count: 1000, Value: 998
Count: 968, Value: 997
Count: 1006, Value: 996
# remainder of items trimmed for space
________________________________________________________
Executed in 1.45 millis fish external
usr time 492.00 micros 492.00 micros 0.00 micros
sys time 964.00 micros 964.00 micros 0.00 micros
Even with the relatively expensive I/O to the terminal, this took only 1.45 milliseconds total. For further comparison, storing the output in a variable for later formatting took only 20 microseconds.
~> time set top_100 (string split -nr " " \
(gsort -rn numbers.txt | guniq -cd | ghead -100))
________________________________________________________
Executed in 20.00 micros fish external
usr time 20.00 micros 20.00 micros 0.00 micros
sys time 2.00 micros 2.00 micros 0.00 micros
The point here isn't to avoid using Ruby, especially since several of the answers are already well under your 300 millisecond threshold. Rather, the point is that if you need more raw speed then you can get it, but you may need to think outside the box if you want to get down into the realm of microseconds.