Making a KO annotation table from eggnog mapper output-CodePudding

I have an annotation table from eggnog mapper and need to make this KO table:

Gene    evalue    KO
Gene1   0.00003   KO0000
Gene2   0.00005   KO0001
Gene2   0.00005   KO0003
Gene3   0.000005  KO0002

This is the table I have (test.txt):

Gene    evalue    KO
Gene1   0.00003   KO0000
Gene2   0.00005   KO0001,KO0003
Gene3   0.000005  KO0002

I have ~17,000 rows and the output is in xlsx format. The first issue I am having is that when I save the output file as a txt and view in linux (head test.txt) some of the columns look like this:

Gene,evalue,KO
Gene1  0.00003  KO0000
Gene2  0.0005   "KO0001,KO0003"
Gene3  0.00005  KO0002

How can I remove the quotes around these values? And how can I make the annotation table above?

I have tried this script from this thread (How can I split comma separated values into multiple rows?)

awk ' BEGIN { OFS="\t" }
     { $1=$1;t=$0; }
     { while(index($0,",")) {
         gsub(/,[[:alnum:],]*/,""); print;
         $0=t; gsub(OFS "[[:alnum:]]*,",OFS); t=$0;
       }
       print t
     }' file

But it seems to get stuck in an infinite loop because of the quotes around the values in the third column.

Thanks

CodePudding user response：

If your data is always structured like this, and multiple values are between double quotes separated by a comma, you might use FPAT to define the content of the fields using gnu awk

The contents of the fields is determined by the pattern:

"[^"]*"|[^[:space:],]

The pattern matches either from "..." OR 1 non whitespace chars other than a comma.

Then you can print the first 2 fields, and for the 3rd field check if there is a comma. If there is, the first part of the pattern matched with the double quotes, as that one it the only part that can contain a comma.

Then you can split on either " or , using a character class. Print all the values returned by the split, and discard the first and the last entry in the loop as those are caused by the double quotes.

awk -v OFS="\t" -v FPAT='"[^"]*"|[^[:space:],] ' '
{
  start = $1 OFS $2
  if (index($3, ",")) {
    n=split($3, a,/[,"]/)
    for(i=2;i<n;i  ) print start OFS a[i]
    next
  }
  print start OFS $3  
}
' file

Output

Gene    evalue  KO
Gene1   0.00003 KO0000
Gene2   0.0005  KO0001
Gene2   0.0005  KO0003
Gene3   0.00005 KO0002

CodePudding user response：

I could not reproduce your 'infinite loop' issue using your awk code. (In fact, there was no output generated at all from the code). Here is an alternative awk solution:

awk 'BEGIN{OFS="\t"} \
NR==1{gsub(",",OFS,$0); print} \
(NR>1 && $0 ~ /,/) {gsub("\"","",$3); split($3,a,","); $3=""; for (i in a) {print $0 a[i]}} \
(NR>1 && $3!=""){print $0}' input.txt

Output:

Gene    evalue  KO
Gene1  0.00003  KO0000
Gene2   0.0005  KO0003
Gene2   0.0005  KO0001
Gene3  0.00005  KO0002

To match your expected output formatting more exactly, pipe the awk output to column -t:

awk 'BEGIN{OFS="\t"} \
NR==1{gsub(",",OFS,$0); print} \
(NR>1 && $0 ~ /,/) {gsub("\"","",$3); split($3,a,","); $3=""; for (i in a) {print $0 a[i]}} \
(NR>1 && $3!=""){print $0}' tt.txt | column -t

Output:

Gene   evalue   KO
Gene1  0.00003  KO0000
Gene2  0.0005   KO0003
Gene2  0.0005   KO0001
Gene3  0.00005  KO0002