my first question on stack overflow and I hope you can help me.
Suppose a BAM file, from which I only want to extract the reads of a certain length (42 - 65 nt; column 10), but with the information of the remaining columns. Exemplary snippet:
VH00693:3:AAANGKTM5:1:1507:7438:26974_AGTTATAGAC 256 ENST00000438504.2 352 0 32M * 0 0 CCTGCAGGAATATGGCTCCATCTTCATGGGCG CCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCC NH:i:50 HI:i:4
VH00693:3:AAANGKTM5:1:1507:7438:26974_AGTTATAGAC 256 ENST00000438504.1 352 0 32M * 0 0 CCTGCAGGAATATGGCTCCATCTTCATGGGCGCCTGCAGGAATATGGCTCCATCTTCATGGGCG CCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCC NH:i:50 HI:i:4
My try was to access the BAM file (Initial.bam) with samtools view, and to search for substrings fitting to the desired read size, which are parsed into a new BAM file (Extract.bam).
samtools view -h Initial.bam | \awk 'substr($0,1,1)=="@" || ($10>=42 && $10<=65)'| \samtools view -b > Extract.bam
However, the Extract.bam only contains the extracted header section (starting with '@') of the Initial.bam. So, header extractions works, as well as parsing into a new bam file. The initial files do contain reads of the desired range, but at that point I do not know how to adapt my code snippet. Do you have any suggestions?
CodePudding user response:
Found an adaptation that delivered the desired output.
samtools view -h Initial.bam | awk '{ if ( substr($0,1,1) == "@" || (length($10) >= 42 && length($10) <= 65)) {print $0} }' | samtools view -b > Extract.bam
Please let me know if you have any suggestions on how to reduce the code. Thanks!
CodePudding user response:
For reducing the awk
program you can:
- replace
substr($0,1,1) == "@"
with/^@/
- replace
length($10) >= 42 && length($10) <= 65
with$10 ~ /^.{42,65}$/
- remove the
{print $0}
, which is the default action
The code would become:
samtools view -h Initial.bam |
awk '/^@/ || $10 ~ /^.{42,65}$/' |
samtools view -b
CodePudding user response:
If you don't have gawk
, this should work for the others
samtools view -h Initial.bam |
mawk '((_=length($10))%__-(_-__)$)</^[@]/' __=42 | samtools view -b