Compare two rows and print only one if a pattern repeats between two columns in any order row

Time:09-17

This should be fairly simple (hopefully) using awk, but I can't find a solution. I have a file and I want to compare each row against every other row. If the combination of the strings in column 1 and column 2 repeats in any other row, in either order, I want to print only the first match:

cat file.csv
alpha_3,alpha_47,100,60,0,0,1,60,1,60,8.21E-29,111
alpha_47,alpha_3,100,60,0,0,1,60,1,60,8.21E-29,111
beta_86,beta_12,100,61,0,0,1,61,1,61,2.33E-29,113
beta_86,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
beta_12,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
beta_14,beta_12,100,61,0,0,1,61,1,61,2.33E-29,113


#command
This seems to work, but I have to extract the first two columns first,
and I can't print the first instance of the match in full:

awk -F "," '{print $1 , $2}' file.csv | awk -F' ' '!seen[$2 FS $1]; {seen[$0]++}'
alpha_3 alpha_47
beta_86 beta_12
beta_86 beta_14
beta_12 beta_14

But it doesn't print the whole line, and if I try it without selecting the first two columns first, it doesn't work.

#desired output
alpha_3,alpha_47,100,60,0,0,1,60,1,60,8.21E-29,111
beta_86,beta_12,100,61,0,0,1,61,1,61,2.33E-29,113
beta_86,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
beta_12,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113

I am still learning awk, so if someone can provide a solution and explain their code, that would be even better!

CodePudding user response:

The general solution when you want to compare compound values regardless of order is to sort the keys used to create the array index. With just 2 keys, that reduces to comparing them and always concatenating them in the same order (e.g. larger first), regardless of their input order:

$ awk -F, '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file.csv
alpha_3,alpha_47,100,60,0,0,1,60,1,60,8.21E-29,111
beta_86,beta_12,100,61,0,0,1,61,1,61,2.33E-29,113
beta_86,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
beta_12,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
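
Since an explanation was requested, here is the same idea spelled out step by step. Treat it as a sketch rather than a definitive implementation: the function name dedupe_pairs and the variable name key are made up for illustration, and it assumes the first two fields are plain strings like the ones in the sample, so that awk compares them as strings.

```shell
# Recreate the sample input from the question.
cat > file.csv <<'EOF'
alpha_3,alpha_47,100,60,0,0,1,60,1,60,8.21E-29,111
alpha_47,alpha_3,100,60,0,0,1,60,1,60,8.21E-29,111
beta_86,beta_12,100,61,0,0,1,61,1,61,2.33E-29,113
beta_86,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
beta_12,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
beta_14,beta_12,100,61,0,0,1,61,1,61,2.33E-29,113
EOF

# dedupe_pairs is a hypothetical helper name, not part of awk.
dedupe_pairs() {
    awk -F, '
    {
        # Build an order-independent key: always put the larger of the
        # two fields first, so "a,b" and "b,a" produce the same key.
        key = ($1 > $2) ? $1 FS $2 : $2 FS $1
    }
    # seen[key]++ yields 0 (false) the first time a key appears and
    # increments the counter afterwards, so !seen[key]++ is true only
    # for the first occurrence. A true pattern with no action block
    # prints the whole input line.
    !seen[key]++
    ' "$1"
}

dedupe_pairs file.csv
```

On the sample file this should print the same four lines as the desired output. Because the key is built from $1 and $2 while the pattern prints $0, there is no need to cut out the first two columns beforehand, which is what stopped the original attempt from printing full lines.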