Terminal AWK Script to Remove Duplicates in .CSV File. Edit Script

Time:09-06

For years I have spent an hour or more removing duplicates from .csv/.xls files. I found a great AWK script on this site that finds duplicate email addresses in a .csv file, removes the row of every duplicate except the first one, and writes a new .csv file with the results. You can do this in Excel using Filter > Advanced to show only unique records, but showing only unique records in a file with 200k or 300k records means I lose thousands of records, because along with the duplicate records it deletes the first, original record too. So this AWK script is fantastic: it keeps the original record and turns an hour of work into 2 or 3 minutes.

I am very happy that I found the script on this site. There is one thing that would make it totally awesome: when it finds duplicate records, instead of deleting all the duplicates and keeping only the first record, it would be amazing if it could keep the first 2 or 3 records and remove the rest. So basically allow the original plus one duplicate, but delete the entire row of anything beyond that.

Here is the script:

awk -F, '!x[$7]++' business-records.csv > business-records-deduped.csv

So the script above checks column #7 (column G), which holds the email addresses in my database; for each address it keeps the first record and deletes the entire row of any duplicates.
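As a minimal sketch of how that one-liner behaves (using a made-up three-column sample with the email in column 3 rather than the real seven-column business-records.csv, just to keep the demo short):

```shell
# Hypothetical sample data; the email is in column 3 here, not 7
cat > sample.csv <<'EOF'
Acme,NY,info@acme.com
Acme Corp,LA,info@acme.com
Bolt,SF,hi@bolt.com
Bolt Inc,TX,hi@bolt.com
Bolt LLC,FL,hi@bolt.com
EOF

# !x[$3]++ is true only the first time an email is seen,
# so only the first row per email address is printed
awk -F, '!x[$3]++' sample.csv > deduped.csv
cat deduped.csv
```

This prints only the `Acme,NY,...` and `Bolt,SF,...` rows; every later row with an already-seen email is dropped.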

Could anyone help edit the above script so that when it finds duplicate records in column #7 it keeps the original record and the first duplicate, and deletes the entire rows of any duplicates beyond the first?

I have found great stuff on this website; this is my first time posting, though. Thanks, Stack Overflow team.

CodePudding user response:

so when it finds duplicate records in column #7 it keeps the original record and the first duplicate and deletes the entire rows of any more than the first duplicate.

You can use awk like this:

awk -F, '++x[$7] <= 2' business-records.csv > business-records-deduped.csv

This will keep up to 2 records for each value in the 7th column and delete any further dupes, as you desire.
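A quick sketch of this condition on a made-up sample (email in column 3 instead of 7, purely for brevity):

```shell
# Hypothetical sample: three rows share the same email in column 3
cat > dupes.csv <<'EOF'
A,1,x@x.com
B,2,x@x.com
C,3,x@x.com
D,4,y@y.com
EOF

# ++x[$3] <= 2 is true for the first and second occurrence of each
# email only, so the third x@x.com row is dropped
awk -F, '++x[$3] <= 2' dupes.csv
```

This prints the `A`, `B`, and `D` rows; the `C` row, being the third occurrence of `x@x.com`, is removed.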

CodePudding user response:

I propose the following minimal amelioration of your code:

awk -F, '2>x[$7]++' business-records.csv > business-records-deduped.csv

Explanation: `++` is a post-increment operation, so the execution order might be somewhat counter-intuitive:

  • `x[$7]` gets the value from array `x` for the key being the content of the 7th field; if not present, 0 is assumed
  • `2>` is the test deciding about printing; if this condition holds, the line is printed
  • `++` then increases the value inside array `x`, so the next time you encounter the same 7th-field content the value will be bigger by 1

Observe that the sole thing altered is the test: for non-negative integers, `!` is true only for zero, whereas `2>` is true for 0 and 1, so the first two occurrences pass.
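If the cutoff might change later, one small generalization (my own variation, not part of the answers above) is to pass the limit in as an awk variable; sketched here on a made-up sample with the key in column 3:

```shell
# Hypothetical sample: four rows share the same email in column 3
cat > many.csv <<'EOF'
a,1,k@k.com
b,2,k@k.com
c,3,k@k.com
d,4,k@k.com
EOF

# x[$3]++ < n is true while the count of rows seen so far for this
# key is below n, so at most n rows per distinct email are kept
awk -F, -v n=3 'x[$3]++ < n' many.csv
```

With `n=3` this prints the `a`, `b`, and `c` rows and drops `d`; setting `-v n=2` would reproduce the behavior of the answers above.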
