Home > Net >  Bash: Extract reads from BAM files based on read length
Bash: Extract reads from BAM files based on read length

Time:06-01

my first question on stack overflow and I hope you can help me.

Suppose a BAM file, from which I only want to extract the reads of a certain length (42 - 65 nt; column 10), but with the information of the remaining columns. Exemplary snippet:

VH00693:3:AAANGKTM5:1:1507:7438:26974_AGTTATAGAC 256 ENST00000438504.2 352 0 32M * 0 0 CCTGCAGGAATATGGCTCCATCTTCATGGGCG CCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCC NH:i:50 HI:i:4

VH00693:3:AAANGKTM5:1:1507:7438:26974_AGTTATAGAC 256 ENST00000438504.1 352 0 32M * 0 0 CCTGCAGGAATATGGCTCCATCTTCATGGGCGCCTGCAGGAATATGGCTCCATCTTCATGGGCG CCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC;CCCCCCCCCCCCCCCCCCCC NH:i:50 HI:i:4

My try was to access the BAM file (Initial.bam) with samtools view, and to search for substrings fitting to the desired read size, which are parsed into a new BAM file (Extract.bam).

samtools view -h Initial.bam | \awk 'substr($0,1,1)=="@" || ($10>=42 && $10<=65)'| \samtools view -b > Extract.bam

However, the Extract.bam only contains the extracted header section (starting with '@') of the Initial.bam. So, header extractions works, as well as parsing into a new bam file. The initial files do contain reads of the desired range, but at that point I do not know how to adapt my code snippet. Do you have any suggestions?

CodePudding user response:

Found an adaptation that delivered the desired output.

samtools view -h Initial.bam | awk '{ if ( substr($0,1,1) == "@" || (length($10) >= 42 && length($10) <= 65)) {print $0} }' | samtools view -b > Extract.bam

Please let me know if you have any suggestions on how to reduce the code. Thanks!

CodePudding user response:

For reducing the awk program you can:

  • replace substr($0,1,1) == "@" with /^@/
  • replace length($10) >= 42 && length($10) <= 65 with $10 ~ /^.{42,65}$/
  • remove the {print $0}, which is the default action

The code would become:

samtools view -h Initial.bam |
awk '/^@/ || $10 ~ /^.{42,65}$/' |
samtools view -b

CodePudding user response:

If you don't have gawk, this should work for the others

samtools view -h Initial.bam |

mawk '((_=length($10))%__-(_-__)$)</^[@]/' __=42 | samtools view -b
  • Related