Extracting information from a file using another file with list of specific text; Ubuntu/ Linux-CodePudding

I have one file which has list of IDs, index/header, let's call it list.txt

TRINITY_DN10002_c0_g1_i1.p1
TRINITY_DN10002_c0_g1_i2.p1
TRINITY_DN10002_c1_g1_i1.p1
TRINITY_DN10006_c0_g1_i1.p1
TRINITY_DN10006_c0_g1_i4.p1
TRINITY_DN10007_c0_g1_i2.p1
TRINITY_DN1000_c0_g1_i15.p1
TRINITY_DN1000_c0_g1_i31.p1
TRINITY_DN1000_c0_g1_i40.p2
TRINITY_DN1000_c0_g2_i1.p1
TRINITY_DN10012_c2_g1_i1.p1
TRINITY_DN10014_c0_g2_i1.p1
TRINITY_DN10014_c2_g1_i1.p1
TRINITY_DN10014_c2_g2_i1.p1

and another file (large) which has the data/information (sequences) I want to extract from, calling it as datafile.fasta

CAGTCAATAATTTTGACGTACTTTTCAAAACATTTCTTGCTGTTTCTTCAAAACTCTTAT
CTAATTTTTTGTTTTTCAGAAGGTTAAAAGTATATTGAAAAGTTTCTCGATTTTCAGTGG
TTAGTCTAGGATTGCTATTATTTTCAATTTGGTGTTTGCATTCTTCAAACATTACCATCA
GATCATATCGTGTTATTTCATACTGCATTTCTTTGATTTCCTGCTTTGACAAAGTTGTTA
TCGATTTTTTCATCCTGTTTCTTAATTTTGTCAAACCAT
>TRINITY_DN36094_c0_g1_i1 len=250 path=[0:0-249]
AGGAAAGTACTTAGTGAGTACATACCTGACAGTGATCACGGTTTTACGAAAAGATACATT
GAAAGAAGAATACTTCAACCTCCATTAAAACGATACACGTATTGCTTGCTCACTTCCATG
TCTTATGGCGCTTCAGACTGCTATGATCCAAATTTTTGCGGTCCCATCTTTGGTAAACGA
AATTGTTACAAGCGGTCCAAATTTACAGATAATCATTTCCCAGATATAAATCTCGGCGGG
AAACTTCATC
>TRINITY_DN36018_c0_g1_i2 len=1265 path=[0:0-408 1:409-443 2:444-1264]
AAAAGCTCCAATTGTTTGGTGGACTCCCGCTGGCTAGTCAGGATAACAGATTACGGTCTG
CCAAGTTTTAGTAGCGGCCAGTTTTTTGACGAATCAGAACAAGATGCATATAGACGTAAA
CTTTGGACCGCCCCAGAATTATTGAGGGAGAACATACCCCCTAAGAACGGTTCCCAAAAG
GGTGACGTATACAGTTTCGCCATAGTGGCGTACGAGATTATAACAAGGTCTGAACCATTT
CCCTTTGATTTAATGACTGCCAGAGACGCGGTAAATAGGGTAAGAAACGGTGAAAGCATC
CCGTTCAGACCATGCCTACCCGAGACAACGGATGTTGGCAAAGCTGTCCTTGACCTCTTG
CGAGCTTGTTGGCATGAGGTTCCAGAACACAGACCCAACTTCAACCAGGTTCGAACTGTT
[....]

I am trying to get an output which should look like this

GCGTACGAGATTATAACAAGGTCTGAACCATTT
CCCTTTGATTTAATGACTGCCAGAGACGCGGTAA

TRINITY_DN10002_c0_g1_i1.p1
TCCATTAAAACGATACACGTATTGCTTGCTCACTTCCATG
TCTTATGGCGCTTCAGACTGCTA

to illustrate a few.

How can I use the list file to extract out my sequences?

I tried grep -Fwf list.txt -A1 datafile.fasta > NRCDS_nuc.fasta but when I looked at the output file it comes out empty.

But trying samtools faidx tirnity_out.Trinity.fasta and manually add the id's seems taxing. Any help would be appreciated, thank you!

CodePudding user response：

Here's one way using awk. This doesn't produce your expected output (maybe something went wrong when it was posted?), but from your description I think the following might be what's needed here. This removes the extension (i.e. .p1, .p2, etc) from the lines in the list file and adds them to an array. Then for each header line in the FASTA file, a ternary operator sets a flag to true if the first field exists in the array and to false if it doesn't. The default action is to print, so lines are printed only when the flag is true.

awk 'FNR==NR { sub(/\.[^\.]*$/,""); a[$0]; next } /^>/ { f=substr($1,2) in a ? 1 :0 }f' list.txt datafile.fasta

CodePudding user response：

It looks that the goal is to extract specific FASTA records from a file based on FASTA IDs. Anytime you deal with FASTA records, I highly recommend that you use BioPython or BioPerl modules to make your life easier. Here is how you can extract specific FASTA records from youar FASTA database file using Python and Biopython Module:

from Bio import SeqIO
import textwrap

my_id_file     = open('fasta_id_records.txt','r')
my_fasta_file  = open('fasta_database.fasta','r')

my_dictionary = {} # fasta IDs are keys, value can be anything.
for line in my_id_file:
    my_dictionary[line[:-1]] = 'value' 

#Read fasta file records one-by-one, and print the records whose IDs in fasta_id_records.txt
for seq_record in SeqIO.parse(my_fasta_file, "fasta"):
    if seq_record.id in my_dictionary:
        sequence = str(seq_record.seq)
        fasta_record = textwrap.fill(sequence, width=60) #wrap to 60 character
        print(f">{seq_record.id}\n{fasta_record}")

my_fasta_file.close()
my_id_file.close()

This could be written in fewer lines, but it's easier to follow the way I wrote.