Replace the header lines of several sequences in a FASTA file with species names

Time:04-23

I have a FASTA file with several sequences, but the header line of every sequence is the same string (>ABI). I want to replace each header with the name of the species, which is stored in a separate text file.

My fasta file looks like

>ABI
AGCTAGTCCCGGGTTTATCGGCTATAC
>ABI
ACCCCTTGACTGACATGGTACGATGAC
>ABI
ATTTCGACTGGTGTCGATAGGCAGCAT
>ABI
ACGTGGCTGACATGTATGTAGCGATGA

The list of spp looks like this:

Alsophila cuspidata
Bunchosia argentea
Miconia cf.gracilis
Meliosma frondosa

How can I replace the ABI headers of my sequences with the species names, in that exact order?

Required output:

>Alsophila cuspidata
AGCTAGTCCCGGGTTTATCGGCTATAC
>Bunchosia argentea
ACCCCTTGACTGACATGGTACGATGAC
>Miconia cf.gracilis
ATTTCGACTGGTGTCGATAGGCAGCAT
>Meliosma frondosa
ACGTGGCTGACATGTATGTAGCGATGA

I was using something like:

awk '
FNR==NR{
  a[$1]=$2
  next
}
($2 in a) && /^>/{
  print ">"a[$2]
  next
}
1
' spp_list.txt FS="[> ]"  all_spp.fasta

This is not working. Could someone guide me, please?

Thanks in advance

Regards

CodePudding user response:

Hello, not a dev so don't be rude.

Hope this will help you:

I created a file named fasta.txt that contains:

>ABI
AGCTAGTCCCGGGTTTATCGGCTATAC
>ABI
ACCCCTTGACTGACATGGTACGATGAC
>ABI
ATTTCGACTGGTGTCGATAGGCAGCAT
>ABI
ACGTGGCTGACATGTATGTAGCGATGA

I also created a file spplist.txt that contains:

Alsophila cuspidata
Bunchosia argentea
Miconia cf.gracilis
Meliosma frondosa

I then created a Python script named fasta.py. Here it is:

#!/usr/bin/env python3

# Read the species names, one per line, in the order they should be applied.
with open("spplist.txt", "r") as spplist:
    names = [line.strip() for line in spplist if line.strip()]

# Opening both files in one "with" statement closes them automatically,
# so there is no need to redirect sys.stdout.
with open("fasta.txt", "r") as fasta, open("output.txt", "w") as output_file:
    for line in fasta:
        if line.startswith(">ABI"):
            # Replace the header with the next species name,
            # keeping the ">" that marks a FASTA header.
            output_file.write(">" + names.pop(0) + "\n")
        else:
            # Sequence lines are copied unchanged.
            output_file.write(line.rstrip() + "\n")

# Notify the user at the end of the work
print("job done")

(These three files need to be in the same directory if you want the script to work as it is.)

Here is my directory tree:

❯ tree
.
├── fasta.py
├── fasta.txt
└── spplist.txt

To execute the script, open a shell, cd into the directory and type:

❯ python3 fasta.py
job done

You will see a new file named output.txt in the directory:

❯ tree
.
├── fasta.py
├── fasta.txt
├── output.txt
└── spplist.txt

and here is its content:

>Alsophila cuspidata
AGCTAGTCCCGGGTTTATCGGCTATAC
>Bunchosia argentea
ACCCCTTGACTGACATGGTACGATGAC
>Miconia cf.gracilis
ATTTCGACTGGTGTCGATAGGCAGCAT
>Meliosma frondosa
ACGTGGCTGACATGTATGTAGCGATGA
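
If you want something a bit safer, here is a small sketch of the same idea written as a function. It keeps the leading ">" on every header, as in the required output, and it complains if the number of species names does not match the number of sequences. The names relabel_fasta, fasta_lines and names are just ones I made up for this example, and I assume every line starting with ">" is a header that should be renamed.

```python
def relabel_fasta(fasta_lines, names):
    """Yield FASTA lines with each header replaced by the next name, in order.

    fasta_lines: any iterable of lines (an open file works).
    names: any iterable of species names, one per sequence.
    """
    remaining = iter(names)
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            try:
                # Keep the ">" so the result is still valid FASTA.
                yield ">" + next(remaining)
            except StopIteration:
                raise ValueError("more sequences than species names") from None
        else:
            # Sequence lines pass through unchanged.
            yield line
    # Any name left over means the two files are out of step.
    if next(remaining, None) is not None:
        raise ValueError("more species names than sequences")


# Small in-memory demo using the first two records from the question.
fasta = [
    ">ABI\n",
    "AGCTAGTCCCGGGTTTATCGGCTATAC\n",
    ">ABI\n",
    "ACCCCTTGACTGACATGGTACGATGAC\n",
]
names = ["Alsophila cuspidata", "Bunchosia argentea"]

result = list(relabel_fasta(fasta, names))
print("\n".join(result))
```

To use it on real files, pass the open fasta.txt as fasta_lines and the stripped lines of spplist.txt as names, then write each yielded line to output.txt followed by a newline. Because the check for leftover names only runs at the end, a mismatch is reported after the last sequence has been processed.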

Hope this can help you out. bguess.
