I have a file with two columns separated by tabs as follows:
OG0000000 PF03169,PF03169,PF03169,MAC1_004431-T1,
OG0000002 PF07690,PF00083,PF00083,PF07690,PF00083,
OG0000003 MAC1_000127-T1,
OG0000004 PF13246,PF00689,PF00690,
OG0000005 PF00012,PF01061,PF12697,PF00012,
I just want to remove duplicate strings within the second column, while not changing anything in the first column, so that my final output looks like this:
OG0000000 PF03169,MAC1_004431-T1,
OG0000002 PF07690,PF00083,
OG0000003 MAC1_000127-T1,
OG0000004 PF13246,PF00689,PF00690,
OG0000005 PF00012,PF01061,PF12697,
I tried to start this by using awk.
awk 'BEGIN{RS=ORS=","} !seen[$0]++' file.txt
But my output looks like this, where there are still some duplicates if the duplicated string occurs first.
OG0000000 PF03169,PF03169,MAC1_004431-T1,
OG0000002 PF07690,PF00083,PF07690,
OG0000003 MAC1_000127-T1,
OG0000004 PF13246,PF00689,PF00690,
OG0000005 PF00012,PF01061,PF12697,PF00012,
I realize the problem is that the first record awk grabs is everything up to the first comma, so it includes the first column. But I'm still rough with awk and couldn't figure out how to fix this without messing up the first column. Thanks in advance!
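To illustrate the diagnosis (my sketch, not part of the original question), here is a minimal way to see what awk treats as records when RS is a comma — the first column rides along inside record 1, which is why its duplicate escapes the seen filter. The sample line is taken from the question's input:

```shell
# With RS=",", record 1 spans everything before the first comma,
# so it contains the first column glued to the first value.
printf 'OG0000002\tPF07690,PF00083,PF00083,\n' |
  awk 'BEGIN{RS=","} {printf "record %d: [%s]\n", NR, $0}'
```

Record 1 comes out as `OG0000002<TAB>PF07690`, which never matches a later bare `PF07690`.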
CodePudding user response:
This awk should work for you:
awk -F '[\t,]' '
{
    printf "%s\t", $1
    for (i=2; i<=NF; i++) {
        if ($i != "" && !seen[$i]++)
            printf "%s,", $i
    }
    print ""
    delete seen
}' file
OG0000000 PF03169,MAC1_004431-T1,
OG0000002 PF07690,PF00083,
OG0000003 MAC1_000127-T1,
OG0000004 PF13246,PF00689,PF00690,
OG0000005 PF00012,PF01061,PF12697,
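A side note (my addition, not from the answer above): the `delete seen` line matters. Without it, the array persists across lines, so a value already printed for one OG id would be silently dropped from every later line. A quick way to see the per-line reset working, using a two-line sample that repeats PF00083:

```shell
# With delete seen reset per line, PF00083 is printed for both OG ids;
# without it, the second line would lose the value entirely.
printf 'OG1\tPF00083,\nOG2\tPF00083,\n' |
  awk -F '[\t,]' '{
      printf "%s\t", $1
      for (i=2; i<=NF; i++)
          if ($i != "" && !seen[$i]++)
              printf "%s,", $i
      print ""
      delete seen
  }'
```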
CodePudding user response:
Using GNU sed
$ sed -E ':a;s/([^ \t]*[ \t]+)?(([[:alnum:]]+,).*)\3/\1\2/;ta' input_file
OG0000000 PF03169,MAC1_004431-T1,
OG0000002 PF07690,PF00083,
OG0000003 MAC1_000127-T1,
OG0000004 PF13246,PF00689,PF00690,
OG0000005 PF00012,PF01061,PF12697,
CodePudding user response:
Another approach, using the same split of $2
into an array and keeping a separate counter for the position of the non-duplicated values, could be done as:
awk '
{
printf "%s\t", $1
delete seen
n = split($2,arr,",")
pos = 0
for (i=1;i<=n;i++) {
if (! (arr[i] in seen)) {
printf "%s%s", pos ? "," : "", arr[i]
seen[arr[i]]=1
pos++
}
}
print ""
}
' file.txt
Example Output
With your input in file.txt, the output is:
OG0000000 PF03169,MAC1_004431-T1,
OG0000002 PF07690,PF00083,
OG0000003 MAC1_000127-T1,
OG0000004 PF13246,PF00689,PF00690,
OG0000005 PF00012,PF01061,PF12697,
CodePudding user response:
With your shown samples and attempts, please try the following awk code. There is no need to set RS
and ORS
(the record separator and output record separator) for this requirement. Set FS and OFS to tab, split the second field on commas, and print the fields accordingly.
awk '
BEGIN{ FS=OFS="\t" }
{
  val=""
  delete seen
  num=split($2,arr,",")
  for(i=1;i<=num;i++){
    if(!seen[arr[i]]++){
      val=(val?val ",":"") arr[i]
    }
  }
  print $1,val
}
' Input_file
CodePudding user response:
This might work for you (GNU sed):
sed -E ':a;s/(\s+.*(\b\S+,).*)\2/\1/;ta' file
Iterate through a line removing any duplicate strings after whitespace.
CodePudding user response:
Here is a ruby one-liner. Note that the split must not leave the trailing comma attached to the last element, or that element will never compare equal to the bare value; splitting on a plain comma drops trailing empty fields, so the trailing comma is re-appended after the join:
ruby -ane 'puts "#{$F[0]}\t#{$F[1].split(",").uniq.join(",")},"' file
OG0000000 PF03169,MAC1_004431-T1,
OG0000002 PF07690,PF00083,
OG0000003 MAC1_000127-T1,
OG0000004 PF13246,PF00689,PF00690,
OG0000005 PF00012,PF01061,PF12697,