Home > Software design >  How does this awk line that counts the number of nucleotides in a fasta file work?
How does this awk line that counts the number of nucleotides in a fasta file work?

Time:09-27

I am currently learning to use awk, and found an awk command that I needed, but do not fully understand what is happening in. This line of code takes a genome file called a fasta and returns all the length of each sequence in it. For those unfamiliar with fasta files, they are txt files that can contain multiple genetic sequences called contigs. It follows the general structure of:

>Nameofsequence
Sequencedata like: ATGCATCG
GCACGACTCGCTATATTATA
>Nameofsequence2
Sequencedata

The line is found here:

cat file.fa | awk '$0 ~ ">" {if (NR > 1) {print c;} c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c =length($0);} END { print c; }'

I understand that cat is opening the fasta file, checking if its the sequence name line, and at some point counting the number of characters in the data section. But I do not understand how it is breaking down the data section in substrings, nor how it is resetting the counts with each new sequence.


EDIT by Ed Morton: here's the above awk script formatted legibly by gawk -o-:

$0 ~ ">" {
    if (NR > 1) {
        print c
    }
    c = 0
    printf substr($0, 2, 100) "\t"
}

$0 !~ ">" {
    c  = length($0)
}

END {
    print c
}

CodePudding user response:

First format the command:

awk '
  $0 ~ ">" {
    if (NR > 1) {print c;}
    c=0;
    printf substr($0,2,100) "\t";
  }
  $0 !~ ">" {
    c =length($0);
  }
  END { print c; }
  ' file.fa

The code will use c for a character count.This count starts with value 0, and will be reset to 0 every time a line with > is parsed.
The length of the inputline is added to c when the inputline is without a >.
The value of c must be printed after a sequence, so when it finds a new > (not on the first line) or when the complete file is parsed (block with END).
As you might already understand now:
breaking down the data section in substrings is by matching the inputline with a >, and
resetting the counts with each new sequence is done by using c=0 in the block with $0 ~ ">".

Look at the comment of Ed: The printf statement is used wrong. I don't know haf often %s occurs in in fasta file, but that is not important: Use %s for input strings.

CodePudding user response:

@WalterA already answered your question by explaining what the script does but in case you're interested here's an improved version including a couple of small bug fixes for your use of printf input and printing of an empty line if the input file is empty and improvements over the redundant testing of the same condition twice and testing for > and removing it separately instead of all at once:

BEGIN { OFS="\t" }
sub(/^>/,"") {
    if (lgth) { print name, lgth }
    name = $0
    lgth = 0
    next
}
{ lgth  = length($0) }
END {
    if (lgth) { print name, lgth }
}

Alternatively you could do:

BEGIN { OFS="\t" }
sub(/^>/,"") {
    if (seq != "") { print name, length(seq) }
    name = $0
    seq = ""
    next
}
{ seq = seq $0 }
END {
    if (seq != "") { print name, length(seq) }
}

but appending to a variable is slow so calling length() for each line of the sequence may actually be more efficient.

  • Related