Count a specific character for each species in a FASTA file

I have been trying to count the number of 1s for each species in a FASTA file that looks like this:

The desired output would be:

I know how to get the number of 1s in a file with:

grep -c 1 file

My problem is that I cannot find a way to count the 1s for each species separately (instead of the total for the whole file).

CodePudding user response：

Assuming your FASTA file is formatted as you indicate (each header line followed by a single sequence line), and assuming awk is acceptable, the following might work:

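while IFS= read -r one ; do                  # header line (species name)
    echo "${one}"
    IFS= read -r two                         # sequence line that follows the header
    awk -F"1" '{print NF-1}' <<< "${two}"    # here-string (<<<) requires bash
done <fasta.txt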

(Note: the awk command splits the line on '1' and prints the number of resulting fields minus 1, which equals the number of 1s in that line.)
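If the sequences might wrap over several lines, the same idea can be done in a single awk pass without the shell loop. This is only a sketch: the function name count_ones and the sample records are made up for illustration (the real file is not shown in the question), and it assumes header lines start with '>'.

```shell
# count_ones: read FASTA-like text from stdin (or from files given as arguments)
# and print each header line followed by the number of 1s found in the
# sequence line(s) below it. Hypothetical helper, not part of any tool.
count_ones() {
    awk '
        /^>/ {                         # a new record starts at a header line
            if (name != "") print n    # flush the count of the previous record
            name = $0; n = 0
            print
            next
        }
        { n += gsub(/1/, "1") }        # gsub returns how many matches it replaced
        END { if (name != "") print n }
    ' "$@"
}

# Made-up sample input, only to show the call.
printf '%s\n' '>species_a' '0110' '100' '>species_b' '1111' '>species_c' '0000' | count_ones
```

With a real file this would be called as count_ones fasta.txt. Lines before the first header are ignored.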

fasta.txt:

Output:

CodePudding user response：

grep -c 1 gives you the number of lines that contain a 1, not the total number of 1s. You could use grep -o to print each match on its own line and then wc -l to count those lines.

while IFS= read -r species                     # header line
do
    echo "$species"
    IFS= read -r seq                           # sequence line
    printf '%s\n' "$seq" | grep -o 1 | wc -l   # one output line per 1, then count them
done < fasta_file
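A variation on the same loop, in case grep -o is not available: tr -cd deletes every character except the one you want, and wc -c counts what is left. Treat this as a sketch under the same assumption as above (one header line, then one sequence line). The function name count_char and the sample records are invented for the example.

```shell
# count_char CHAR: read header/sequence line pairs from stdin and print each
# header followed by how many times CHAR occurs in its sequence line.
# Hypothetical helper; CHAR is assumed to be a single ordinary character.
count_char() {
    while IFS= read -r header && IFS= read -r seq; do
        # keep only CHAR, then count the remaining bytes
        n=$(printf '%s' "$seq" | tr -cd "$1" | wc -c)
        # $((n)) strips the padding some wc implementations add
        printf '%s\n%s\n' "$header" "$((n))"
    done
}

# Made-up sample input, only to show the call.
printf '%s\n' '>species_a' '0110100' '>species_b' '1111' '>species_c' '0000' | count_char 1
```

For a real file: count_char 1 < fasta_file.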