Count the number of occurrences of a char in each sequence in a fasta file with multiple sequences

I want to count the number of occurrences of a character in each sequence of a FASTA file that contains multiple sequences, but the method I use gives the total for the whole file:

grep -o 'G' my_sequence.fasta | wc -l

Is there a way to get the count for each sequence separately?

The FASTA file looks like this:

>sequence1
CCGTGGGTCAATCCCGTA
>sequence2
CCGTGGGGCACTCCCGTA
>sequence3
TTGTGGGTCAATCCCGTC
>sequence4
CCCGGGTGCACTCCCGTA

CodePudding user response：

Here's an awk script that counts the number of G characters in each sequence (change ch to count a different character):

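awk -v ch=G '
    /^>/ {
        # Header line: report the previous sequence (if any), then start a new one
        if (label != "") {
            gsub("[^"ch"]", "", sequence)   # delete every character except ch
            print label, length(sequence)   # what remains is the count
            sequence = ""
        }
        label = $1
        next
    }
    # Sequence line: append it, so sequences wrapped over several lines also work
    { sequence = sequence $0 }
    END {
        # Report the last sequence, which has no header after it
        if (label != "") {
            gsub("[^"ch"]", "", sequence)
            print label, length(sequence)
        }
    }
' file.fasta

Output for the sample file: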

>sequence1 5
>sequence2 6
>sequence3 5
>sequence4 5
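
A shorter variant of the same idea, as a sketch: gsub returns the number of substitutions it made, so the count can be accumulated line by line without storing the sequence. The function name count_char and the temporary sample file are made up for illustration, and the character is treated as a regular expression, so this assumes a plain letter such as G. It also prints the label without the leading ">".

```shell
# Hypothetical helper: count_char CHAR FILE
# Prints "<label> <count>" for every record in a FASTA file.
count_char() {
    awk -v ch="$1" '
        /^>/ {
            if (label != "") print label, n   # report the previous record
            label = substr($1, 2)             # drop the leading ">"
            n = 0
            next
        }
        { n += gsub(ch, "") }                 # gsub returns the number of matches removed
        END { if (label != "") print label, n }
    ' "$2"
}

# Sample data from the question, written to a temporary file
sample=$(mktemp)
cat > "$sample" <<'EOF'
>sequence1
CCGTGGGTCAATCCCGTA
>sequence2
CCGTGGGGCACTCCCGTA
>sequence3
TTGTGGGTCAATCCCGTC
>sequence4
CCCGGGTGCACTCCCGTA
EOF

count_char G "$sample"
```

Because the count is summed per line, this also works when a sequence is wrapped over several lines.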

CodePudding user response：

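A terse one-liner for mawk, gawk, or nawk that reports the C, G, A, and T counts of each sequence. It assumes each sequence occupies a single line and contains only those four letters, since the T count is simply the length of what remains after C, G, and A are removed:

awk -F'^>' '(NF +=__= OFS = "")*(RS==ORS) ? ORS = " :: " \
                                  \
: $( (ORS = RS))=sprintf((_=" %s\47s = %4u |")(_)(_)_,
                          _="C", gsub(_,__), _="G", gsub(_,__),
                          _="A", gsub(_,__), _="T", length($ _))' file.fasta

Output: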

sequence1 ::  C's =    6 | G's =    5 | A's =    3 | T's =    4 |
sequence2 ::  C's =    7 | G's =    6 | A's =    2 | T's =    3 |
sequence3 ::  C's =    5 | G's =    5 | A's =    2 | T's =    6 |
sequence4 ::  C's =    8 | G's =    5 | A's =    2 | T's =    3 |
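
For comparison, here is a more readable sketch that produces the same per-base counts. It is not the one-liner above rewritten, just one way to get the same numbers: each base is counted independently with gsub on a copy of the sequence, so wrapped sequences and other letters (such as N) do not distort the T count. The function name base_counts, the A=/C=/G=/T= output format, and the temporary sample file are assumptions for illustration.

```shell
# Hypothetical helper: base_counts FILE
# Prints "<label> A=<n> C=<n> G=<n> T=<n>" for every record.
base_counts() {
    awk '
        function report(    i, tmp, n, line) {
            line = label
            for (i = 1; i <= nb; i++) {
                tmp = seq                      # count on a copy, keep seq intact
                n = gsub(bases[i], "", tmp)    # number of occurrences of this base
                line = line " " bases[i] "=" n
            }
            print line
            seq = ""
        }
        BEGIN { nb = split("A C G T", bases, " ") }
        /^>/  { if (label != "") report(); label = substr($1, 2); next }
              { seq = seq $0 }                 # join wrapped sequence lines
        END   { if (label != "") report() }
    ' "$1"
}

# Sample data from the question, written to a temporary file
sample=$(mktemp)
cat > "$sample" <<'EOF'
>sequence1
CCGTGGGTCAATCCCGTA
>sequence2
CCGTGGGGCACTCCCGTA
>sequence3
TTGTGGGTCAATCCCGTC
>sequence4
CCCGGGTGCACTCCCGTA
EOF

base_counts "$sample"
```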