I have an annotation table from eggnog mapper and need to make this KO table:
Gene evalue KO
Gene1 0.00003 KO0000
Gene2 0.00005 KO0001
Gene2 0.00005 KO0003
Gene3 0.000005 KO0002
This is the table I have (test.txt):
Gene evalue KO
Gene1 0.00003 KO0000
Gene2 0.00005 KO0001,KO0003
Gene3 0.000005 KO0002
I have ~17,000 rows and the output is in xlsx format. The first issue I am having is that when I save the output file as a txt and view in linux (head test.txt
) some of the columns look like this:
Gene,evalue,KO
Gene1 0.00003 KO0000
Gene2 0.0005 "KO0001,KO0003"
Gene3 0.00005 KO0002
How can I remove the quotes around these values? And how can I make the annotation table above?
I have tried this script from this thread (How can I split comma separated values into multiple rows?)
awk '
BEGIN { OFS="\t" }
{ $1=$1;t=$0; }
{ while(index($0,",")) {
gsub(/,[[:alnum:],]*/,""); print;
$0=t; gsub(OFS "[[:alnum:]]*,",OFS); t=$0;
}
print t
}' file
But it seems to get stuck in an infinite loop because of the quotes around the values in the third column.
Thanks
CodePudding user response:
If your data is always structured like this, and multiple values are between double quotes separated by a comma, you might use FPAT to define the content of the fields using gnu awk
The contents of the fields is determined by the pattern:
"[^"]*"|[^[:space:],]
The pattern matches either from "..."
OR 1 non whitespace chars other than a comma.
Then you can print the first 2 fields, and for the 3rd field check if there is a comma. If there is, the first part of the pattern matched with the double quotes, as that one it the only part that can contain a comma.
Then you can split on either "
or ,
using a character class. Print all the values returned by the split, and discard the first and the last entry in the loop as those are caused by the double quotes.
awk -v OFS="\t" -v FPAT='"[^"]*"|[^[:space:],] ' '
{
start = $1 OFS $2
if (index($3, ",")) {
n=split($3, a,/[,"]/)
for(i=2;i<n;i ) print start OFS a[i]
next
}
print start OFS $3
}
' file
Output
Gene evalue KO
Gene1 0.00003 KO0000
Gene2 0.0005 KO0001
Gene2 0.0005 KO0003
Gene3 0.00005 KO0002
CodePudding user response:
I could not reproduce your 'infinite loop' issue using your awk
code. (In fact, there was no output generated at all from the code). Here is an alternative awk
solution:
awk 'BEGIN{OFS="\t"} \
NR==1{gsub(",",OFS,$0); print} \
(NR>1 && $0 ~ /,/) {gsub("\"","",$3); split($3,a,","); $3=""; for (i in a) {print $0 a[i]}} \
(NR>1 && $3!=""){print $0}' input.txt
Output:
Gene evalue KO
Gene1 0.00003 KO0000
Gene2 0.0005 KO0003
Gene2 0.0005 KO0001
Gene3 0.00005 KO0002
To match your expected output formatting more exactly, pipe the awk
output to column -t
:
awk 'BEGIN{OFS="\t"} \
NR==1{gsub(",",OFS,$0); print} \
(NR>1 && $0 ~ /,/) {gsub("\"","",$3); split($3,a,","); $3=""; for (i in a) {print $0 a[i]}} \
(NR>1 && $3!=""){print $0}' tt.txt | column -t
Output:
Gene evalue KO
Gene1 0.00003 KO0000
Gene2 0.0005 KO0003
Gene2 0.0005 KO0001
Gene3 0.00005 KO0002