removing duplicated strings within a column with shell

Time:11-19

I have a file with two columns separated by tabs as follows:

OG0000000   PF03169,PF03169,PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,PF00083,PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,PF00012,

I just want to remove duplicate strings within the second column, while not changing anything in the first column, so that my final output looks like this:

OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,

I tried to start this by using awk.

awk 'BEGIN{RS=ORS=","} !seen[$0]++' file.txt

But my output looks like this: duplicates remain whenever the repeated string is the first entry on its line.

OG0000000   PF03169,PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,PF07690,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,PF00012,

I realize the problem is that the first record awk grabs on each line is everything up to the first comma, including the first column, but I'm still rough with awk and couldn't figure out how to fix this without messing up the first column. Thanks in advance!
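A quick way to see the problem is to print the records awk actually sees with RS set to a comma (a hypothetical demo on one of the sample lines):

```shell
$ printf 'OG0000000\tPF03169,PF03169,MAC1_004431-T1,' |
    awk 'BEGIN{RS=","} {printf "record %d: [%s]\n", NR, $0}'
record 1: [OG0000000	PF03169]
record 2: [PF03169]
record 3: [MAC1_004431-T1]
```

The first record contains the first column glued to the first entry, so it never compares equal to the bare entry seen later.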

CodePudding user response:

This awk should work for you:

awk -F '[\t,]' '
{
   printf "%s", $1 "\t"          # first column, unchanged
   for (i=2; i<=NF; i++) {       # walk the comma-separated entries
      if (!seen[$i]++)           # print each entry only the first time
         printf "%s,", $i
   }
   print ""
   delete seen                   # reset the lookup for the next line
}' file

OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,
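To see how -F '[\t,]' carves each line into fields (and why the loop starts at i=2, and where the trailing comma in the output comes from), a small check:

```shell
$ printf 'OG0000002\tPF07690,PF00083,PF00083,\n' |
    awk -F '[\t,]' '{for (i=1; i<=NF; i++) printf "field %d: [%s]\n", i, $i}'
field 1: [OG0000002]
field 2: [PF07690]
field 3: [PF00083]
field 4: [PF00083]
field 5: []
```

Field 1 is the ID column, and the trailing comma in $2 produces a final empty field, which prints as the closing comma of the output line.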

CodePudding user response:

Using GNU sed

$ sed -E ':a;s/([^ \t]*[ \t]+)?(([[:alnum:]]+,).*)\3/\1\2/;ta' input_file
OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,
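The :a / ta pair re-runs the substitution until it stops matching; each successful pass deletes one later occurrence of a repeated token. Checking it on a single heavily duplicated line:

```shell
$ printf 'OG0000002\tPF07690,PF00083,PF00083,PF07690,PF00083,\n' |
    sed -E ':a;s/([^ \t]*[ \t]+)?(([[:alnum:]]+,).*)\3/\1\2/;ta'
OG0000002	PF07690,PF00083,
```

Because only the later copy of each token is deleted, first-occurrence order is preserved.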

CodePudding user response:

Another approach, using the same split of $2 into an array but keeping a separate counter for the position of the non-duplicated values, could be done as:

awk '
  { 
    printf "%s\t", $1
    delete seen
    n = split($2,arr,",")
    pos = 0
    for (i=1;i<=n;i++) { 
      if (! (arr[i] in seen)) { 
        printf "%s%s", pos ? "," : "", arr[i]
        seen[arr[i]]=1
        pos++
      }
    }
    print ""
  }
' file.txt

Example Output

With your input in file.txt, the output is:

OG0000000       PF03169,MAC1_004431-T1,
OG0000002       PF07690,PF00083,
OG0000003       MAC1_000127-T1,
OG0000004       PF13246,PF00689,PF00690,
OG0000005       PF00012,PF01061,PF12697,
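Note the trailing comma in each output line comes for free: because $2 ends in a comma, split() produces a final empty-string element, which is not in seen the first time and so prints as "," plus nothing. A minimal check of that behavior:

```shell
$ printf 'x\tA,B,\n' | awk '{n = split($2, arr, ","); printf "n=%d last=[%s]\n", n, arr[n]}'
n=3 last=[]
```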

CodePudding user response:

With your shown samples and attempts, please try the following awk code. There is no need to set RS and ORS (the record separator and output record separator) for this requirement; instead set the field separators, split the second column on commas, and print the fields accordingly.

awk '
BEGIN{ FS=OFS="\t" }
{
  val=""
  delete seen
  num=split($2,arr,",")
  for(i=1;i<=num;i++){
    if(!seen[arr[i]]++){
      val=(val?val ",":"") arr[i]
    }
  }
  print $1,val
}
' Input_file

CodePudding user response:

This might work for you (GNU sed):

sed -E ':a;s/(\s+.*(\b\S+,).*)\2/\1/;ta' file

Iterate through a line removing any duplicate strings after whitespace.

CodePudding user response:

Here is a ruby one-liner: split the second field on commas (which drops the trailing empty piece), uniq, then re-append the closing comma:

ruby -ane 'puts "#{$F[0]}\t#{$F[1].split(",").uniq.join(",")},"' file
OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,