This should be fairly simple (hopefully) using awk, but I can't find a solution. I have a file and I want to compare each row against every other row. If the combination of the strings in column 1 and column 2 repeats in any other row, in either order, I want to print only the first occurrence:
cat file.csv
alpha_3,alpha_47,100,60,0,0,1,60,1,60,8.21E-29,111
alpha_47,alpha_3,100,60,0,0,1,60,1,60,8.21E-29,111
beta_86,beta_12,100,61,0,0,1,61,1,61,2.33E-29,113
beta_86,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
beta_12,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
beta_14,beta_12,100,61,0,0,1,61,1,61,2.33E-29,113
#command
This seems to work, but I have to extract the first two columns beforehand,
so I can't print the whole line of the first match:
awk -F "," '{print $1 , $2}' file.csv | awk -F' ' '!seen[$2 FS $1]; {seen[$0] }'
alpha_3 alpha_47
beta_86 beta_12
beta_86 beta_14
beta_12 beta_14
It only prints the two extracted columns rather than the whole line, and if I try it without selecting the first two columns, it doesn't work.
#desired output
alpha_3,alpha_47,100,60,0,0,1,60,1,60,8.21E-29,111
beta_86,beta_12,100,61,0,0,1,61,1,61,2.33E-29,113
beta_86,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
beta_12,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
I am learning awk (still) so if someone can provide a solution and explain their code that will be even better!
CodePudding user response:
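The general solution when you want to compare compound values regardless of order is to sort the keys used to create the array index. With just 2 keys, that reduces to comparing them and always concatenating them in the same order (e.g. biggest first) regardless of their input order. The post-increment makes seen[key] evaluate to 0 the first time a key appears and non-zero afterwards, so the negated condition is true only for the first line with that key, and awk's default action prints that whole line:
$ awk -F, '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file.csv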
alpha_3,alpha_47,100,60,0,0,1,60,1,60,8.21E-29,111
beta_86,beta_12,100,61,0,0,1,61,1,61,2.33E-29,113
beta_86,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
beta_12,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
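Since the question asks for an explanation, here is a minimal sketch of the same idea written out in longer form. The function name dedupe_pairs and the variable name key are hypothetical, chosen only for illustration. The sketch also appends an empty string to each field before comparing, which is an assumption beyond the original answer: it forces a string comparison so that fields that happen to look numeric (such as 1 and 1.0) are still ordered consistently.

```shell
# Recreate the sample input from the question.
cat > file.csv <<'EOF'
alpha_3,alpha_47,100,60,0,0,1,60,1,60,8.21E-29,111
alpha_47,alpha_3,100,60,0,0,1,60,1,60,8.21E-29,111
beta_86,beta_12,100,61,0,0,1,61,1,61,2.33E-29,113
beta_86,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
beta_12,beta_14,100,61,0,0,1,61,1,61,2.33E-29,113
beta_14,beta_12,100,61,0,0,1,61,1,61,2.33E-29,113
EOF

# dedupe_pairs is a hypothetical helper name, not part of awk.
dedupe_pairs() {
    awk -F, '
    {
        # Build an order-independent key: the larger string always goes first.
        # Appending "" forces a string (not numeric) comparison.
        key = (($1 "") > ($2 "")) ? $1 FS $2 : $2 FS $1
    }
    # seen[key]++ yields 0 on the first occurrence and then increments,
    # so !seen[key]++ is true only once per key. A true pattern with no
    # action block prints the whole current line ($0).
    !seen[key]++
    ' "$@"
}

dedupe_pairs file.csv
```

Because the key is built from the fields of the full line rather than from a pre-extracted copy of the first two columns, the whole record is still available to print. For more than two key columns, the same approach would need the keys sorted before joining them, which is not shown here.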