Using a file with specific IDs to extract data from another file into separate files and then using

I have a file with some IDs listed like this:

id1
id2
id3 
etc

I want to use those IDs to extract data from files (the IDs occur in every file) and save the output for each ID to a separate file. The IDs are protein family names, and I want to get every protein that belongs to a specific family. Once I have the name of each protein, I want to use it to fetch the protein sequences (in .fasta format) so that they stay grouped by family.

So I've tried to do it like this (I knew that it would dump all the IDs into one file):

#!/bin/bash

for file in /directory/*out
do grep -n -E 'id1|id2|id3' "$file" >> output; done

I would appreciate any help and I will gladly specify if not everything is clear to you.

EDIT: I will try to clarify, sorry for the inconvenience:

So there's a file called "pfamacc" with the following content:

PF12312
PF43555
PF34923

and so on. Those are the IDs that I need in order to access other files, which are named like "something_something.faa.out" and have the following structure:

<acc_number> <alignment_start> <alignment_end> <pfam_acc>
RXOOOA 5 250 PF12312
OC2144 6 200 PF34923

I need those accession numbers so I can then get the protein sequences from files which look like this:

>RXOOOA
ASDBSADBASDGHH

>OC2144
SADHHASDASDCJHWINF

CodePudding user response：

Assuming there is a file ids_file.txt in the current directory with the following content:

id1
id2
id3
id4

and that the same directory also contains a file called id1 with the following content:

Bla bla bla
id1
and id2
is
here id4

Then this script could help:

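#!/bin/sh

IDS=$(cat ids_file.txt)
# Join the IDs into one alternation pattern (id1|id2|id3|id4), dropping any trailing '|'
IDS_IN_ONE=$(tr '\n' '|' < ids_file.txt | sed -E 's/\|+$//')
echo "$IDS_IN_ONE"

# Each ID doubles as a file name here; IDs without a matching file are skipped
for file in $IDS; do
 [ -f "./$file" ] && grep -n -E "$IDS_IN_ONE" "./$file" >> output
done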

The file output then contains the following:

2:id1
3:and id2
5:here id4
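The script above still writes everything to a single file. To get one output file per ID, as the question asks, something along these lines might work. It is only a sketch: the sample_*.faa.out file names, the .acc suffix, and the extra rows AB0001 and ZZ9999 are made up for illustration, and the four-column layout is assumed to match the question's edit.

```shell
#!/bin/sh
# Sketch: one accession list per Pfam ID (file names and suffixes are assumptions).
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Sample inputs in the format shown in the question; the second .out file is hypothetical
printf '%s\n' PF12312 PF43555 PF34923 > pfamacc
printf '%s\n' 'RXOOOA 5 250 PF12312' 'OC2144 6 200 PF34923' > sample_one.faa.out
printf '%s\n' 'AB0001 1 90 PF12312' 'ZZ9999 3 40 PF00001' > sample_two.faa.out

# First file: remember the wanted IDs.
# Remaining files: write column 1 to <ID>.acc whenever column 4 is a wanted ID.
awk '
    FILENAME == ARGV[1] { wanted[$1] = 1; next }
    ($4 in wanted)      { print $1 > ($4 ".acc") }
' pfamacc ./*.faa.out

cat PF12312.acc
```

awk keeps each output file open for the whole run, so rows from several .out files end up in the same per-ID file. IDs that never occur (PF43555 here) simply produce no file.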

CodePudding user response：

As I read it, a list needs to be cross-referenced to get a second list, which is then used to gather the FASTA records.

Starting with the following 3 files...

starting_values.txt

PF12312
PF43555
PF34923

cross_reference.txt

<acc_number> <alignment_start> <alignment_end> <pfam_acc>
RXOOOA 5 250 PF12312
OC2144 6 200 PF34923

find_from_file.fasta

>RXOOOA
ASDBSADBASDGHH
>OC2144
SADHHASDASDCJHWINF
SADHHASDASDCJHWINF
>NC11111
IURJCNKAERJKADSF

# Collect the accession (column 1) of every row whose Pfam ID (column 4) is in the list
for i in $(cat starting_values.txt); do awk -v var="$i" '$4 == var {print $1}' cross_reference.txt; done > needed_accessions.txt

If the FASTA is multi-line, first convert it so that each sequence sits on a single line (see https://www.biostars.org/p/9262/):

awk '/^>/ {printf("%s%s\n", (n++ ? "\n" : ""), $0); next} {printf("%s", $0)} END {printf("\n")}' find_from_file.fasta > find_from_file.temp

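# Match the accession only as a whole word on a header line, then print that line and the sequence after it
for i in $(cat needed_accessions.txt); do grep -A 1 -w "^>$i" find_from_file.temp; done > found_sequences.fasta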

Final Output...

found_sequences.fasta

>RXOOOA
ASDBSADBASDGHH
>OC2144
SADHHASDASDCJHWINFSADHHASDASDCJHWINF
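
The steps above collect every matching sequence into one file. If the sequences should stay grouped by family, as the question asks, a single awk pass could write one FASTA file per family instead. This is a rough sketch, not a tested pipeline: the <family>.fasta output names are my own choice, and it assumes each accession belongs to only one of the listed families (if an accession appears under several, the last one read wins).

```shell
#!/bin/sh
# Sketch: write one FASTA file per Pfam family (output names are assumptions).
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Sample inputs copied from the three files listed above
printf '%s\n' PF12312 PF43555 PF34923 > starting_values.txt
printf '%s\n' 'RXOOOA 5 250 PF12312' 'OC2144 6 200 PF34923' > cross_reference.txt
printf '%s\n' '>RXOOOA' ASDBSADBASDGHH '>OC2144' SADHHASDASDCJHWINF \
    SADHHASDASDCJHWINF '>NC11111' IURJCNKAERJKADSF > find_from_file.fasta

awk '
    # File 1: family IDs to keep
    FILENAME == ARGV[1] { wanted[$1] = 1; next }
    # File 2: map accession -> family for wanted families only
    FILENAME == ARGV[2] { if ($4 in wanted) family[$1] = $4; next }
    # File 3, header line: pick the output file for this record ("" means skip)
    /^>/ {
        acc = substr($1, 2)
        out = (acc in family) ? family[acc] ".fasta" : ""
    }
    # File 3, every line of a kept record (header and sequence lines)
    out != "" { print > out }
' starting_values.txt cross_reference.txt find_from_file.fasta

ls PF*.fasta
```

This works on multi-line FASTA directly, so the single-line conversion is not needed here; sequence lines are copied as they are rather than joined.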