Home > Net >  How to extract user defined region from an fasta file with a list in other file
How to extract user defined region from an fasta file with a list in other file

Time:08-09

I have a multi-fasta sequence file: test.fasta

>Ara_001
MGIKGLTKLLADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
VTSHLQGMFNRTIRLLEAGIKPVYVFDGKPPELKRQELAKRYSKRADATADLTGAIEAGN
>Ara_002
MGIKGLTKLLADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
VTSHLQGMFNRTIRLLEAGIKPVYVFDGKPPELKRQELAKRYSKRADATADLTGAIEAGN
>Ara_003
MGIKGLTKLLAEHAPRAAAQRRVEDYRGRVIAIDASLSIYQFLVVVGRKGTEVLTNEAEG
LTVDCYARFVFDGEPPDLKKRELAKRSLRRDDASEDLNRAIEVGDEDSIEKFSKRTVKIT

I have another list file with a range: range.txt

Ara_001       3 60
Ara_002       10 80
Ara_003       20 50

I want to extract the defined region.

My expected out put would be:

>Ara_001
KGLTKLLADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
VT
>Ara_002
ADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
VTSHLQGMFNRTIRLLEAGIKPVYVFDGKP
>Ara_003
RRVEDYRGRVIAIDASLSIYQFLVVVGRKG

I tried:

#!/bin/bash
lines=$(awk 'END {print NR}' range.txt)
for ((a=1; a<= $lines ; a  ))
 do
 number=$(awk -v lines=$a 'NR == lines' range.txt)
 grep -v ">" test.fasta | awk -v lines=$a 'NR == lines' | cut -c$number
done;

CodePudding user response:

In bash:

#!/bin/bash
while read l
do
m=${l#* }
n=${m// *}
o=${m//* }
in=`cat ${l// *}`
echo ${in:n:o-n}
done<range.txt

In python:

#!/bin/python
fp = open('range.txt', 'r')
raw = fp.read()
list = raw.split('\n')
i = 0
while i < len(list) - 1:
  list_ = list[i].split(' ')
  tmp = open(list_[0], 'r')
  raw_text = tmp.read()
  print(raw_text[int(list_[1]):int(list_[2])-int(list_[1])]   '\n')
  i =1
  tmp.close()
fp.close()

CodePudding user response:

Using biopython:

# read ranges as dictionary
with open('range.txt') as f:
    ranges = {ID: (int(start), int(stop)) for ID, start, stop in map(lambda s: s.strip().split(), f)}
# {'Ara_001': (3, 60), 'Ara_002': (10, 80), 'Ara_003': (20, 50)}

# load input fasta and slice
from Bio import SeqIO
with open ('test.fasta') as handle:
    out = [r[slice(*ranges[r.id])] for r in SeqIO.parse(handle, 'fasta')]

# export sliced sequences
with open('output.fasta', 'w') as handle:
    SeqIO.write(out, handle, 'fasta')

Output file:

>Ara_001
KGLTKLLADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
>Ara_002
ADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGEVTSHLQGMFN
RTIRLLEAGI
>Ara_003
RRVEDYRGRVIAIDASLSIYQFLVVVGRKG

NB. With this quick code there must be a entry for each sequence id in range.txt, it's however quite easy to modify it to use a default behavior on case of absence of it.

  • Related