Home > Software engineering >  Making a KO annotation table from eggnog mapper output
Making a KO annotation table from eggnog mapper output

Time:11-01

I have an annotation table from eggnog mapper and need to make this KO table:

Gene    evalue    KO
Gene1   0.00003   KO0000
Gene2   0.00005   KO0001
Gene2   0.00005   KO0003
Gene3   0.000005  KO0002

This is the table I have (test.txt):

Gene    evalue    KO
Gene1   0.00003   KO0000
Gene2   0.00005   KO0001,KO0003
Gene3   0.000005  KO0002

I have ~17,000 rows and the output is in xlsx format. The first issue I am having is that when I save the output file as a txt and view in linux (head test.txt) some of the columns look like this:

Gene,evalue,KO
Gene1  0.00003  KO0000
Gene2  0.0005   "KO0001,KO0003"
Gene3  0.00005  KO0002

How can I remove the quotes around these values? And how can I make the annotation table above?

I have tried this script from this thread (How can I split comma separated values into multiple rows?)

awk '
BEGIN { OFS="\t" }
     { $1=$1;t=$0; }
     { while(index($0,",")) {
         gsub(/,[[:alnum:],]*/,""); print;
         $0=t; gsub(OFS "[[:alnum:]]*,",OFS); t=$0;
       }
       print t
     }' file

But it seems to get stuck in an infinite loop because of the quotes around the values in the third column.

Thanks

CodePudding user response:

If your data is always structured like this, and multiple values are between double quotes separated by a comma, you might use FPAT to define the content of the fields using gnu awk

The contents of the fields is determined by the pattern:

"[^"]*"|[^[:space:],] 

The pattern matches either from "..." OR 1 non whitespace chars other than a comma.

Then you can print the first 2 fields, and for the 3rd field check if there is a comma. If there is, the first part of the pattern matched with the double quotes, as that one it the only part that can contain a comma.

Then you can split on either " or , using a character class. Print all the values returned by the split, and discard the first and the last entry in the loop as those are caused by the double quotes.

awk -v OFS="\t" -v FPAT='"[^"]*"|[^[:space:],] ' '
{
  start = $1 OFS $2
  if (index($3, ",")) {
    n=split($3, a,/[,"]/)
    for(i=2;i<n;i  ) print start OFS a[i]
    next
  }
  print start OFS $3  
}
' file

Output

Gene    evalue  KO
Gene1   0.00003 KO0000
Gene2   0.0005  KO0001
Gene2   0.0005  KO0003
Gene3   0.00005 KO0002

CodePudding user response:

I could not reproduce your 'infinite loop' issue using your awk code. (In fact, there was no output generated at all from the code). Here is an alternative awk solution:

awk 'BEGIN{OFS="\t"} \
NR==1{gsub(",",OFS,$0); print} \
(NR>1 && $0 ~ /,/) {gsub("\"","",$3); split($3,a,","); $3=""; for (i in a) {print $0 a[i]}} \
(NR>1 && $3!=""){print $0}' input.txt

Output:

Gene    evalue  KO
Gene1  0.00003  KO0000
Gene2   0.0005  KO0003
Gene2   0.0005  KO0001
Gene3  0.00005  KO0002

To match your expected output formatting more exactly, pipe the awk output to column -t:

awk 'BEGIN{OFS="\t"} \
NR==1{gsub(",",OFS,$0); print} \
(NR>1 && $0 ~ /,/) {gsub("\"","",$3); split($3,a,","); $3=""; for (i in a) {print $0 a[i]}} \
(NR>1 && $3!=""){print $0}' tt.txt | column -t

Output:

Gene   evalue   KO
Gene1  0.00003  KO0000
Gene2  0.0005   KO0003
Gene2  0.0005   KO0001
Gene3  0.00005  KO0002
  • Related