Sorry for this question but I am a bioinformatician with not much programming experience and I don't know how to fix it.
I have several bam files in a server directory. I would like to count with grep
the number of occurrences both in the bam and in the corrisponding fastq file but, due to the lack of space, I also need to remove each fastq file after the analysis.
To count the number of occurrences in the BAM I tried with this (but any suggestions are welcome):
for i in *bam* ; do
echo $i >> TELO_BAM
grep 'TTAGGG' $i |
wc -l >> TELO_BAM
done
As you can see I also wanted to put the file name and the count in a txt file.
Now, to count the number in a fastq I tried with this but I don't know how to delete the fastq as soon as I've done the count (so it doesn't take up space):
for i in *bam*; do
samtools bam2fq $i >> $i\.fq |
echo $i >> TELO_FQ
grep 'TTAGGG' $i >> TELO_FQ
done
I don't know if it works, but I would like to convert the bam to fastq, put the filename in a text file, count the occurrences in the fastq and then delete it and move on to the next file.
CodePudding user response:
grep XXX | wc -l
gives you the number of lines that containXXX
, not the number of occurrences (unless each line can contain at most one occurrence).
With GNUgrep -o XXX | wc -l
you'll be able to count the total number of occurrences.Using
>> TELO_XXX
inside the loops is sub-optimal, it's better to use a file descriptor.You name the FastQ fileNo need to create any FastQ file;$i.fq
, so you just have torm "$i.fq"
after using it.samtools bam2fq
writes its output to stdout so you cangrep
it directly`.
Here's an untested code illustrating all that:
exec 3> TELO_BAM || exit 1
exec 4> TELO_FQ || exit 1
for bam in *bam*
do
printf '%s: ' "$bam" 1>&3
grep -o 'TTAGGG' "$bam" | wc -l 1>&3
printf '%s: ' "$bam.fq" 1>&4
samtools bam2fq "$bam" | grep -o 'TTAGGG' | wc -l 1>&4
done