I have a file containing a sequence:
>sequence
TAGGACTGAGGGCTGGACAGGGCTGCGGGAG
and another one containing numbers refering to positions:
3
6
11
I would like to get a new file containing 'N' instead of A,C,G,T at the positions defined in the second file such as:
>sequence
TANGANTGAGNGCTGGACAGGGCTGCGGGAG
Is there a way to do it using bash awk/sed or should I use a python script with SeqIO from biopython?
EDIT:
Here is a start for python script:
from Bio import SeqIO
import sys
import string
unput1=raw_input("enter sequence:")
unput2=raw_input("enter position file:")
fasta_file=unput1
position_file=unput2
result_file="outfile.fasta"
nb_list=list()
with open(position_file) as f:
for line in f:
line=line.strip()
headerline = line.split()
position=headerline[0]
position_list.add(position)
for record in SeqIO.parse(StringIO(data), "fasta"):
if record.id in nb_list:
seq_record[position_list]="N"
SeqIO.write([seq_record], f, "fasta")
CodePudding user response:
Using awk with empty FS
. This may not work with every awk version or with arbitrarily long sequences:
$ awk 'BEGIN {
FS=OFS="" # process each char as an individual field
}
NR==FNR { # process the numbers file
a[$0] # hash numbers to a hash
next
}
/^[ACGT]/ { # process sequence file
for(i=1;i<=NF;i ) # itetate every field
if(i in a) # if i found in a
$i="N" # replace char with N
}1' no-file seq-file
Output:
>sequence
TANGANTGAGNGCTGGACAGGGCTGCGGGAG
CodePudding user response:
Using POSIX awk and substr()
to address string indexes:
awk '
FNR==NR {a[c ] = $0}
FNR!=NR && !/^[[:space:]]*[;>]|^[[:space:]]*$/ {
for (i in a) {
n=a[i]
$0 = substr($0, 1, n-1) "N" substr($0, n 1)
}
}
FNR!=NR' indexes.txt sequence.fasta