Count specific character per every species of a Fasta file

Time:06-14

I have been trying to count the number of 1s for each species in a FASTA file that looks like this:

>111
1100101010
>102
1110000001

The desired output would be:

>111
5
>102
4

I know how to get the number of 1s in a file with:

grep -c 1 file

My problem is that I cannot find a way to count the 1s for each species separately (instead of the total for the whole file).

CodePudding user response:

Assuming your FASTA file is formatted as you show (each header line followed by exactly one sequence line), and assuming awk is acceptable, the following might work:

while read -r one ; do
    echo "${one}"    # header line, e.g. >111
    read -r two      # the sequence line that follows it
    awk -F"1" '{print NF-1}' <<< "${two}"
done <fasta.txt

(Note: awk splits the sequence on '1' and prints the number of resulting fields minus 1, which equals the number of 1s. The <<< here-string needs a shell that supports it, such as bash.)

fasta.txt:

>111
1100101010
>102
1110000001

Output:

>111
5
>102
4
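
If the sequences may be wrapped over several lines, as is common in FASTA files, reading the file two lines at a time would miscount. A minimal single-pass awk sketch that sums the 1s over every line between two headers could look like this. count_ones is a hypothetical helper name, and the sketch relies on gsub returning the number of replacements it made:

```shell
# count_ones FILE: print each header followed by the number of 1s in its sequence.
# count_ones is a hypothetical name used for illustration only.
count_ones() {
    awk '
        /^>/ {                       # a header line starts a new record
            if (seen) print n        # flush the count of the previous record
            print                    # echo the header itself
            n = 0; seen = 1
            next
        }
        { n += gsub(/1/, "") }       # gsub returns how many 1s it replaced
        END { if (seen) print n }    # flush the count of the last record
    ' "$1"
}
```

Called as count_ones fasta.txt on the sample file, this should print the same four lines as the loop.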

CodePudding user response:

grep -c 1 gives you the number of lines that contain a 1, not the total number of 1s. You could instead use grep -o, which prints each match on its own line, and then wc -l to count those lines.

while read -r species
do
    echo "$species"
    read -r seq
    echo -n "$seq" | grep -o 1 | wc -l
done < fasta_file
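
A variant of the same loop, sketched here under the same assumption of one sequence line per header, deletes everything except the 1s with tr and counts the remaining characters with wc -c. count_ones_tr is a hypothetical helper name:

```shell
# count_ones_tr FILE: hypothetical helper; assumes every header line is
# followed by exactly one sequence line, as in the question.
count_ones_tr() {
    while IFS= read -r species && IFS= read -r seq; do
        printf '%s\n' "$species"
        # tr -cd 1 deletes every character except "1"; wc -c counts what is left.
        # The final tr strips the padding that some wc implementations add.
        printf '%s' "$seq" | tr -cd 1 | wc -c | tr -d ' '
    done < "$1"
}
```

Unlike grep -o piped into wc -l, this counts characters rather than lines, so it does not depend on each match being printed on its own line.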