Ruby 2.7: How to merge a hash of arrays of hashes and eliminate the duplicates based on one key:value pair

I'm trying to complete a project-based assessment for a job interview, and they only offer it in Ruby on Rails, which I know little to nothing about. I'm trying to take one hash that contains two or more arrays of hashes and combine those arrays into one array of hashes, while eliminating duplicate hashes based on an "id" => value pair.

So I'm trying to take this:

h = {
  'first' =>
      [
        { 'authorId' => 12, 'id' => 2, 'likes' => 469 },
        { 'authorId' => 5, 'id' => 8, 'likes' => 735 },
        { 'authorId' => 8, 'id' => 10, 'likes' => 853 }
      ],
  'second' =>
      [
        { 'authorId' => 9, 'id' => 1, 'likes' => 960 },
        { 'authorId' => 12, 'id' => 2, 'likes' => 469 },
        { 'authorId' => 8, 'id' => 4, 'likes' => 728 }
      ]
}

And turn it into this:

[
  { 'authorId' => 12, 'id' => 2, 'likes' => 469 },
  { 'authorId' => 5, 'id' => 8, 'likes' => 735 },
  { 'authorId' => 8, 'id' => 10, 'likes' => 853 },
  { 'authorId' => 9, 'id' => 1, 'likes' => 960 },
  { 'authorId' => 8, 'id' => 4, 'likes' => 728 }
]

CodePudding user response:

Ruby has many ways to achieve this.

My first instinct is to group them by id and pick only the first item from each group.

h.values.flatten.group_by { |x| x['id'] }.map { |_id, group| group.first }

A much cleaner approach is to pick distinct items based on id after flattening the arrays of hashes, which is what Cary Swoveland suggested in the comments:

h.values.flatten.uniq { |hash| hash['id'] }
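
For reference, both expressions return the same Array for the posted input, preserving first-encounter order:

h.values.flatten.uniq { |hash| hash['id'] }
#=> [{"authorId"=>12, "id"=>2, "likes"=>469},
#    {"authorId"=>5, "id"=>8, "likes"=>735},
#    {"authorId"=>8, "id"=>10, "likes"=>853},
#    {"authorId"=>9, "id"=>1, "likes"=>960},
#    {"authorId"=>8, "id"=>4, "likes"=>728}]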

CodePudding user response:

TL;DR

The simplest solution to the problem that fits the data you posted is h.values.flatten.uniq. You can stop reading here unless you want to understand why you don't need to care about duplicate IDs with this particular data set, or when you might need to care and why that's often less straightforward than it seems.

Near the end I also mention some features of Rails that address edge cases that you don't need for this specific data. However, they might help with other use cases.

Skip ID-Specific Deduplication; Focus on Removing Duplicate Hashes Instead

First of all, you have no duplicate id keys that aren't also part of duplicate Hash objects. Although Ruby preserves the insertion order of Hash entries, a Hash is conceptually unordered. Pragmatically, that means two Hash objects with the same keys and values are considered equal even if their insertion orders differ. So, perhaps unintuitively:

{'authorId' => 12, 'id' => 2, 'likes' => 469} ==
  {'id' => 2, 'likes' => 469, 'authorId' => 12}
#=> true
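
This order-insensitive equality is also what makes Array#uniq do the right thing here: uniq compares elements with eql?/hash, and equal Hashes produce identical hash values regardless of insertion order.

a = { 'authorId' => 12, 'id' => 2, 'likes' => 469 }
b = { 'id' => 2, 'likes' => 469, 'authorId' => 12 }

a.eql?(b)        #=> true
a.hash == b.hash #=> true, so uniq treats a and b as duplicates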

Given your example input, you don't actually have to worry about unique IDs for this exercise. You just need to eliminate duplicate Hash objects from your merged Array, and you have only one of those.

# IDs that appear in more than one Hash
duplicate_ids =
  h.values.flatten.group_by { _1['id'] }
    .reject { _2.one? }.keys
#=> [2]

# number of IDs shared by two or more distinct Hash objects
unique_hashes_with_duplicate_ids =
  h.values.flatten.group_by { _1['id'] }
    .reject { _2.uniq.one? }.count
#=> 0

As you can see, 'id' => 2 is the only ID found in both arrays, albeit in identical Hash objects. Since you have only one duplicate Hash, the problem reduces to flattening the Array of Hash values stored in h and removing any duplicate Hash elements (not duplicate IDs) from the combined Array.

Solution to the Posted Problem

There might be use cases where you need to handle the uniqueness of Hash keys, but this is not one of them. Unless you want to sort your result by some key, all you really need is:

h.values.flatten.uniq

Since you aren't being asked to sort the Hash objects in your consolidated Array, you can avoid the need for another method call that (in this case, anyway) is a no-op.
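If a sorted result were ever required, a sort_by call could simply be chained on; here sorting by 'id', purely as an illustration:

h.values.flatten.uniq.sort_by { _1['id'] }
#=> [{"authorId"=>9, "id"=>1, "likes"=>960},
#    {"authorId"=>12, "id"=>2, "likes"=>469},
#    {"authorId"=>8, "id"=>4, "likes"=>728},
#    {"authorId"=>5, "id"=>8, "likes"=>735},
#    {"authorId"=>8, "id"=>10, "likes"=>853}]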

"Uniqueness" Can Be Tricky Absent Additional Context

The only reason to look at your id keys at all would be if you had duplicate IDs in multiple unique Hash objects, and if that were the case you'd then have to worry about which Hash was the correct one to keep. For example, given:

[ {'id' => 1, 'authorId' => 9, 'likes' => 1_920},
  {'id' => 1, 'authorId' => 9, 'likes' => 960} ]

which one of these records is the "duplicate" one? Without other data such as a timestamp, simply chaining uniq { _1['id'] } or merging the Hash objects will net you either the first or the last record, respectively. Consider:

[
  {'id' => 1, 'authorId' => 9, 'likes' => 1_920},
  {'id' => 1, 'authorId' => 9, 'likes' => 960}
].uniq { _1['id'] }
#=> [{"id"=>1, "authorId"=>9, "likes"=>1920}]

[
  {'id' => 1, 'authorId' => 9, 'likes' => 1_920},
  {'id' => 1, 'authorId' => 9, 'likes' => 960}
].reduce({}, :merge)
#=> {"id"=>1, "authorId"=>9, "likes"=>960}

Leveraging Context Like Rails-Specific Timestamp Features

While the uniqueness problem described above may seem out of scope for the question you're currently being asked, understanding the limitations of any kind of data transformation is useful. In addition, knowing that Ruby on Rails supports ActiveRecord::Timestamp and the creation and management of timestamp-related columns within database migrations may be highly relevant in a broader sense.

You don't need to know these things to answer the original question. However, knowing when a given solution fits a specific use case and when it doesn't is important too.
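
As a sketch only (the 'updatedAt' field below is hypothetical and not part of the posted data), a timestamp would let you keep the most recent record per ID instead of relying on encounter order:

records = [
  { 'id' => 1, 'authorId' => 9, 'likes' => 1_920, 'updatedAt' => '2021-05-27T10:00:00Z' },
  { 'id' => 1, 'authorId' => 9, 'likes' => 960,   'updatedAt' => '2021-05-28T09:30:00Z' }
]

# Group by ID, then keep the record with the latest timestamp in each group.
# ISO 8601 strings compare correctly as plain strings.
records.group_by { _1['id'] }
       .map { |_id, group| group.max_by { _1['updatedAt'] } }
#=> [{"id"=>1, "authorId"=>9, "likes"=>960, "updatedAt"=>"2021-05-28T09:30:00Z"}]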
