Home > Enterprise >  Rails Job to Merge Records With Similar Phone Numbers
Rails Job to Merge Records With Similar Phone Numbers

Time:12-01

I have a Postgres database with tens of thousands of User records, each having a phone number. However, there are many duplicate Users with the same phone number, but one record might be missing the country code. For example:

{name: "User1", phone: "1-123-456-7890"} {name: "User1", phone: "123-456-7890"}

Both these users are the same person, but one record has the country code, the other does not. I would want to merge these records, keeping the phone number with the country code. My goal is to create a Rails Job to find all similar matches based off the phone number. However, user's phone numbers vary in length depending on the country code (USA/Canada = 1, Argentina = 54). Since all phone numbers have a minimum length of 10 digits, I intended on querying by the LAST 10 digits of each phone number. If there are any matches, I would then merge these records.

I'm in the process of building the query, but am having some difficulty getting all matches. Here is the Active Job I have so far:

class MergeRecordsJob < ApplicationJob
  queue_as :default

  def perform(user)
    matches = User.where("phone LIKE ?", user[user.length * -1..-1])
    p "There are #{matches.count}"
  end
end

Am I headed in the right direction with this? I can call this Job in my terminal with a sample size of about 1,000 records that I know contains duplicates. Once these duplicates are found, how do you suggest I go about merging them?

Any help or suggestions would be greatly appreciated!

CodePudding user response:

To find your duplicates you should be able to do this:

dup_accts = User.group('RIGHT(phone, 10)').having("count(RIGHT(phone, 10)) > 1").count

assuming "phone" is the name of the column where your phone numbers are stored. This matches on the right 10 digits so it will ignore country codes. You will get a hash with any duplicate numbers and the count of how many duplicates there are. i.e. 2035551212 => 3. You then can loop through each duplicate number to find the actual users that are duplicated.

User.where('RIGHT(phone, 10) = ?', 2035551212)

This returns an array of the duplicate users records...

=> [#<User:0x00007f7bd4ab5c50
    id: 13625,
    phone: '12035551212'
    ...>,

   #<User:0x00007f7bd4ab5ac0
    id: 13645,
    phone: '2035551212'
    ...>]

Now you just have to figure out how you want to deal with them. I would recommend a job so it doesn't kill your server resources. I'd also process one at a time rather than trying to load them all into memory. Depending on how big the dup_accts comes back, you can either grab that, then iterate over them by phone number. Or if it's huge you can grab them in slices like...

 dup_accts.each_slice(1000) do |dup_chunk|
   dup_chunk.each do |k,v|   #you only really need the key but it's a hash
     User.where('RIGHT(phone, 10) = ?', k)
       #returns an array of user records as noted above.
       #this is where you do your duplicate removal procedure, whatever
       #you determine that to be. 
   end
 end

If the phone number is the determining factor on who a user is (i.e. not email address or username) then you could try to notify them via phone or email that they have duplicate accounts. But if you had two distinct users who put in the same number in two formats, accidentally or on purpose, you have a bigger issue. If you have no problem deleting the duplicates, then you just need to come up with a procedure for that to attach any child records to the record you are keeping. But this answer at least gets you the list of duplicate phone numbers and the associated account.

  • Related