I have a multi-fasta sequence file: test.fasta
>Ara_001
MGIKGLTKLLADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
VTSHLQGMFNRTIRLLEAGIKPVYVFDGKPPELKRQELAKRYSKRADATADLTGAIEAGN
>Ara_002
MGIKGLTKLLADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
VTSHLQGMFNRTIRLLEAGIKPVYVFDGKPPELKRQELAKRYSKRADATADLTGAIEAGN
>Ara_003
MGIKGLTKLLAEHAPRAAAQRRVEDYRGRVIAIDASLSIYQFLVVVGRKGTEVLTNEAEG
LTVDCYARFVFDGEPPDLKKRELAKRSLRRDDASEDLNRAIEVGDEDSIEKFSKRTVKIT
I have another list file with a range: range.txt
Ara_001 3 60
Ara_002 10 80
Ara_003 20 50
I want to extract the defined region.
My expected out put would be:
>Ara_001
KGLTKLLADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
VT
>Ara_002
ADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
VTSHLQGMFNRTIRLLEAGIKPVYVFDGKP
>Ara_003
RRVEDYRGRVIAIDASLSIYQFLVVVGRKG
I tried:
#!/bin/bash
lines=$(awk 'END {print NR}' range.txt)
for ((a=1; a<= $lines ; a ))
do
number=$(awk -v lines=$a 'NR == lines' range.txt)
grep -v ">" test.fasta | awk -v lines=$a 'NR == lines' | cut -c$number
done;
CodePudding user response:
In bash:
#!/bin/bash
while read l
do
m=${l#* }
n=${m// *}
o=${m//* }
in=`cat ${l// *}`
echo ${in:n:o-n}
done<range.txt
In python:
#!/bin/python
fp = open('range.txt', 'r')
raw = fp.read()
list = raw.split('\n')
i = 0
while i < len(list) - 1:
list_ = list[i].split(' ')
tmp = open(list_[0], 'r')
raw_text = tmp.read()
print(raw_text[int(list_[1]):int(list_[2])-int(list_[1])] '\n')
i =1
tmp.close()
fp.close()
CodePudding user response:
Using biopython:
# read ranges as dictionary
with open('range.txt') as f:
ranges = {ID: (int(start), int(stop)) for ID, start, stop in map(lambda s: s.strip().split(), f)}
# {'Ara_001': (3, 60), 'Ara_002': (10, 80), 'Ara_003': (20, 50)}
# load input fasta and slice
from Bio import SeqIO
with open ('test.fasta') as handle:
out = [r[slice(*ranges[r.id])] for r in SeqIO.parse(handle, 'fasta')]
# export sliced sequences
with open('output.fasta', 'w') as handle:
SeqIO.write(out, handle, 'fasta')
Output file:
>Ara_001
KGLTKLLADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGE
>Ara_002
ADNAPSCMKEQKFESYFGRKIAVDASMSIYQFLIVVGRTGTEMLTNEAGEVTSHLQGMFN
RTIRLLEAGI
>Ara_003
RRVEDYRGRVIAIDASLSIYQFLVVVGRKG
NB. With this quick code there must be a entry for each sequence id in range.txt, it's however quite easy to modify it to use a default behavior on case of absence of it.