I want to count the occurrences of a character in each sequence of a multi-sequence FASTA file, but the method I use counts the total for the whole file:
grep -o 'G' my_sequence.fasta | wc -l
Is there a way to get the count separately for each sequence in the file?
The FASTA file looks like this:
>sequence1
CCGTGGGTCAATCCCGTA
>sequence2
CCGTGGGGCACTCCCGTA
>sequence3
TTGTGGGTCAATCCCGTC
>sequence4
CCCGGGTGCACTCCCGTA
CodePudding user response:
Here's an awk script that counts the number of G characters in each sequence:
awk -v ch=G '
    /^>/ {
        if (label != "") {
            gsub("[^"ch"]", "", sequence)
            print label, length(sequence)
            sequence = ""
        }
        label = $1
        next
    }
    { sequence = sequence $0 }
    END {
        if (label != "") {
            gsub("[^"ch"]", "", sequence)
            print label, length(sequence)
        }
    }
' file.fasta
>sequence1 5
>sequence2 6
>sequence3 5
>sequence4 5
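A shorter variant is possible because gsub() returns the number of replacements it made, so the count can be read from its return value instead of deleting everything else and measuring the length. The sketch below is a rewrite under assumptions, not the answer's own code: count_char is a hypothetical helper name, and the character is used as a regular expression, so it should be a plain letter.

```shell
# count_char CHAR FILE: print each FASTA header with the number of CHAR in its sequence.
# count_char is a hypothetical helper name; CHAR is used as a regex, so pass a plain letter.
count_char() {
    awk -v ch="$1" '
        function report() {
            # gsub returns how many matches it replaced, which is the count we want
            if (label != "") print label, gsub(ch, "", sequence)
        }
        /^>/ { report(); label = $1; sequence = ""; next }   # a header starts a new record
        { sequence = sequence $0 }                            # a sequence may span several lines
        END { report() }
    ' "$2"
}
```

Running count_char G my_sequence.fasta on the sample file should print the same four lines as above.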
CodePudding user response:
{m,g,n}awk -F'^>' '(NF +=__= OFS = "")*(RS==ORS) ? ORS = " :: " \
    : $( (ORS = RS))=sprintf((_=" %s\47s = %4u |")(_)(_)_, _="C", gsub(_,__), _="G", gsub(_,__), _="A", gsub(_,__), _="T", length($ _))'
sequence1 :: C's = 6 | G's = 5 | A's = 3 | T's = 4 |
sequence2 :: C's = 7 | G's = 6 | A's = 2 | T's = 3 |
sequence3 :: C's = 5 | G's = 5 | A's = 2 | T's = 6 |
sequence4 :: C's = 8 | G's = 5 | A's = 2 | T's = 3 |
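For readers who find the one-liner hard to follow, the same per-sequence tally of C, G, A, and T can be written more plainly. This is a sketch under assumptions, not the answer's own code: count_bases is a hypothetical name, the sequences are assumed to be uppercase, and counts are printed without padding.

```shell
# count_bases FILE: print per-sequence counts of C, G, A and T.
# count_bases is a hypothetical helper name; sequences are assumed to be uppercase.
count_bases() {
    awk '
        function report(   i, base, copy, line) {
            if (label == "") return
            line = label " ::"
            for (i = 1; i <= 4; i++) {
                base = substr("CGAT", i, 1)
                copy = sequence        # gsub edits its target, so count on a copy
                line = line sprintf(" %s\047s = %d |", base, gsub(base, "", copy))
            }
            print line
        }
        /^>/ { report(); label = substr($1, 2); sequence = ""; next }   # drop the leading ">"
        { sequence = sequence $0 }                                      # join multi-line sequences
        END { report() }
    ' "$1"
}
```

Calling count_bases my_sequence.fasta on the sample file should give one line per sequence in the layout shown above. Counting on a copy keeps the stored sequence intact, so more characters can be added to the "CGAT" string (with the loop bound adjusted) without the counts affecting each other.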