Best practice for writing complex, three-part, interchangeable "uniq" ruby block


I have an array of hashes:

array = [
  {foo: 1, bar1: 2, bar2: 3, bar3: 4},
  {foo: 2, bar1: 3, bar2: 4, bar3: 5},
  {foo: 3, bar1: 4, bar2: 5, bar4: 6},
  # etc.
]

I want to eliminate some redundant results from this array. Specifically, I want to eliminate any results where foo, bar1, and bar2 are identical across multiple objects, which can easily be done like so:

array.uniq! { |object| [object[:foo], object[:bar1], object[:bar2]] }

However, there is an additional edge case where I must also eliminate one of the following objects, which I don't know how to solve:

{foo: 1, bar1: 3, bar2: 2, ...}
{foo: 1, bar1: 2, bar2: 3, ...}

Specifically, bar1 and bar2 may be swapped in some of the data, and I only want unique results where those two are collectively the same pair ((2, 3) should be considered redundant with (3, 2)).

CodePudding user response:

After fully writing up this question I realized I had an answer, but I'm not sure how ideal it is. I simply combined the two interchangeable values into a single array and sorted it, which guarantees the key is identical even if the two values are swapped:

array.uniq! { |object| [ object[:foo], [object[:bar1], object[:bar2]].sort ] }

I'd love to know if anyone has better solutions.

Also, unsurprisingly, calling uniq! inside a large sorting operation is causing some performance issues, so I'm exploring ways to optimize it further by adding additional filters, etc. This is all for a cache for an API endpoint.
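For anyone profiling this, here is a minimal sketch of how the two keying strategies could be timed with Ruby's built-in Benchmark module; the record count and field values are made up purely for illustration:

require 'benchmark'

# Hypothetical sample data, just for timing the uniq strategies.
records = Array.new(100_000) do
  {foo: rand(1_000), bar1: rand(100), bar2: rand(100), bar3: rand(100)}
end

Benchmark.bm(20) do |x|
  # Plain three-key uniq (misses the swapped bar1/bar2 case).
  x.report('uniq, plain key:') do
    records.uniq { |r| [r[:foo], r[:bar1], r[:bar2]] }
  end

  # Sorted-pair key, treating (bar1, bar2) and (bar2, bar1) as equal.
  x.report('uniq, sorted pair:') do
    records.uniq { |r| [r[:foo], [r[:bar1], r[:bar2]].sort] }
  end
end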

CodePudding user response:

Since you have special equality rules, it seems the most performant solution would be to override the Object#hash and Object#eql? methods, as these are what Array#uniq uses. If you have millions of records, this may well be necessary for adequate performance.

require 'pp'
class MyHash < Hash
  def hash
    # XOR is commutative, so the three values can be in any order and
    # still produce the same hash. That also means records whose foo
    # and bar values are merely shuffled will collide; eql? below is
    # what keeps those distinct.
    self[:foo].hash ^ self[:bar1].hash ^ self[:bar2].hash
  end

  def eql?(other)
    # foo must match exactly, and (bar1, bar2) must match as an
    # unordered pair. The bar comparisons are grouped so a swapped
    # pair can never bypass the foo check. (Still a bit ugly;
    # suggestions for better performance and readability are welcome.)
    self[:foo] == other[:foo] && (
      (self[:bar1] == other[:bar1] && self[:bar2] == other[:bar2]) ||
      (self[:bar1] == other[:bar2] && self[:bar2] == other[:bar1])
    )
  end
end

a = MyHash[foo: 10, bar1: 2, bar2: 3, ignored: 'a']
b = MyHash[foo: 10, bar1: 3, bar2: 2, ignored: 'b']
c = MyHash[foo: 20, bar1: 2, bar2: 3, ignored: 'c']
d = MyHash[foo: 20, bar1: 3, bar2: 2, ignored: 'd']
e = MyHash[foo: 2, bar1: 20, bar2: 3, ignored: 'e']
f = MyHash[foo: 3, bar1: 2, bar2: 20, ignored: 'f']


puts a.hash #=> 3556565295874809176
puts b.hash #=> 3556565295874809176
puts c.hash #=> 2914353897173641784
puts d.hash #=> 2914353897173641784
puts e.hash #=> 2914353897173641784
puts f.hash #=> 2914353897173641784

array = [a, b, c, d, e, f]

pp array      #=> [{:foo=>10, :bar1=>2, :bar2=>3, :ignored=>"a"},
              #    {:foo=>10, :bar1=>3, :bar2=>2, :ignored=>"b"},
              #    {:foo=>20, :bar1=>2, :bar2=>3, :ignored=>"c"},
              #    {:foo=>20, :bar1=>3, :bar2=>2, :ignored=>"d"},
              #    {:foo=>2, :bar1=>20, :bar2=>3, :ignored=>"e"},
              #    {:foo=>3, :bar1=>2, :bar2=>20, :ignored=>"f"}]

pp array.uniq #=> [{:foo=>10, :bar1=>2, :bar2=>3, :ignored=>"a"},
              #    {:foo=>20, :bar1=>2, :bar2=>3, :ignored=>"c"},
              #    {:foo=>2, :bar1=>20, :bar2=>3, :ignored=>"e"},
              #    {:foo=>3, :bar1=>2, :bar2=>20, :ignored=>"f"}]

If you just have thousands of records then the solution you proposed should be completely fine.

array.uniq! { |object| [ object[:foo], [object[:bar1], object[:bar2]].sort ] }
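As a middle ground that avoids subclassing Hash, a single pass over the array with a plain Hash keyed on the canonical tuple also works. This is just a sketch; the dedupe helper name is illustrative and not part of the original code:

# One pass, O(n): key each record by foo plus the sorted bar pair and
# keep the first record seen for each key.
def dedupe(records)
  records.each_with_object({}) do |record, seen|
    key = [record[:foo], [record[:bar1], record[:bar2]].sort]
    seen[key] ||= record
  end.values
end

array = dedupe(array)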
