I have a FASTA file with several sequences, but the header line of every sequence is the same string (ABI). I want to replace each header with a species name stored in a separate text file.
My FASTA file looks like this:
>ABI
AGCTAGTCCCGGGTTTATCGGCTATAC
>ABI
ACCCCTTGACTGACATGGTACGATGAC
>ABI
ATTTCGACTGGTGTCGATAGGCAGCAT
>ABI
ACGTGGCTGACATGTATGTAGCGATGA
The list of species (spp) looks like this:
Alsophila cuspidata
Bunchosia argentea
Miconia cf.gracilis
Meliosma frondosa
How can I replace the ABI headers of my sequences with the names of my species, keeping that exact order?
Required output:
>Alsophila cuspidata
AGCTAGTCCCGGGTTTATCGGCTATAC
>Bunchosia argentea
ACCCCTTGACTGACATGGTACGATGAC
>Miconia cf.gracilis
ATTTCGACTGGTGTCGATAGGCAGCAT
>Meliosma frondosa
ACGTGGCTGACATGTATGTAGCGATGA
I was using something like:
awk '
FNR==NR{
a[$1]=$2
next
}
($2 in a) && /^>/{
print ">"a[$2]
next
}
1
' spp_list.txt FS="[> ]" all_spp.fasta
This is not working. Could someone guide me, please?
Thanks in advance
Regards
CodePudding user response:
Hello, not a dev so don't be rude.
Hope this will help you:
I created a file named fasta.txt that contains:
>ABI
AGCTAGTCCCGGGTTTATCGGCTATAC
>ABI
ACCCCTTGACTGACATGGTACGATGAC
>ABI
ATTTCGACTGGTGTCGATAGGCAGCAT
>ABI
ACGTGGCTGACATGTATGTAGCGATGA
I also created a file spplist.txt that contains:
Alsophila cuspidata
Bunchosia argentea
Miconia cf.gracilis
Meliosma frondosa
I then created a Python script named fasta.py, here it is:
#!/usr/bin/env python3
# import re library: https://docs.python.org/3/library/re.html
# import sys library: https://docs.python.org/3/library/sys.html
import re
import sys

# saving the reference of the standard output into "original_stdout"
original_stdout = sys.stdout

# read the species names, one per line, in file order
with open("spplist.txt", "r") as spplist:
    x = spplist.readlines()

with open("fasta.txt", "r") as fasta, open("output.txt", "w") as output_file:
    # redirecting standard output to output_file
    sys.stdout = output_file
    for line in fasta:
        if re.match(r">ABI", line):
            # keep the leading ">" so the line stays a FASTA header
            print(">" + x[0].rstrip())
            del x[0]
        else:
            print(line.rstrip())
    # restoring the native standard output
    sys.stdout = original_stdout

# Notify the user at the end of the work
print("job done")
(these three files need to be in the same directory if you want the script to work as it is)
Here is my directory tree:
❯ tree
.
├── fasta.py
├── fasta.txt
└── spplist.txt
To execute the script, open a shell, cd into the directory and type:
❯ python3 fasta.py
job done
You will see a new file named output.txt in the directory:
❯ tree
.
├── fasta.py
├── fasta.txt
├── output.txt
└── spplist.txt
and here is its content:
>Alsophila cuspidata
AGCTAGTCCCGGGTTTATCGGCTATAC
>Bunchosia argentea
ACCCCTTGACTGACATGGTACGATGAC
>Miconia cf.gracilis
ATTTCGACTGGTGTCGATAGGCAGCAT
>Meliosma frondosa
ACGTGGCTGACATGTATGTAGCGATGA
Hope this can help you out. bguess.
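If the number of headers and the number of species names might not match, it may be safer to check that while renaming. Below is a minimal sketch of the same idea (replace each ABI header with the next name, in order) written as a function. The name rename_headers and its old_header parameter are made up for this example and do not come from any library. It works on in-memory lists here so it can be run as-is, using the sample data from the question.

```python
def rename_headers(fasta_lines, names, old_header=">ABI"):
    """Yield FASTA lines, replacing each old_header line with the next name.

    Hypothetical helper for illustration: raises ValueError if the number
    of matching headers and the number of names differ.
    """
    remaining = iter(names)
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(old_header):
            try:
                # keep the ">" so the line is still a FASTA header
                yield ">" + next(remaining)
            except StopIteration:
                raise ValueError("more FASTA headers than species names") from None
        else:
            # sequence lines are passed through unchanged
            yield line
    # after the last line, no names should be left over
    if next(remaining, None) is not None:
        raise ValueError("more species names than FASTA headers")


# Sample data copied from the question
fasta = [
    ">ABI", "AGCTAGTCCCGGGTTTATCGGCTATAC",
    ">ABI", "ACCCCTTGACTGACATGGTACGATGAC",
    ">ABI", "ATTTCGACTGGTGTCGATAGGCAGCAT",
    ">ABI", "ACGTGGCTGACATGTATGTAGCGATGA",
]
species = [
    "Alsophila cuspidata",
    "Bunchosia argentea",
    "Miconia cf.gracilis",
    "Meliosma frondosa",
]

renamed = list(rename_headers(fasta, species))
print("\n".join(renamed))
```

With real files, the same function should accept open file objects in place of the lists, for example rename_headers(open("fasta.txt"), [n.strip() for n in open("spplist.txt")]), since it only iterates over lines.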