I have a file with some IDs listed like this:
id1
id2
id3
etc
I want to use those IDs to extract data from a set of files (the IDs occur in every file) and save the output for each ID to a separate file. The IDs are protein family names, and I want to get every protein that belongs to a given family. Once I have the names of those proteins, I want to use them to fetch the sequences (in .fasta format), grouped by family.
So I've tried to do it like this (I knew that it would dump all the IDs into one file):
#! /bin/bash
for file in *out
do grep -n -E 'id1|id2|id3' /directory/$file >> output; done
I would appreciate any help, and I will gladly clarify anything that is unclear.
EDIT: I will try to clarify, sorry for the inconvenience:
There's a file called "pfamacc" with the following content:
PF12312
PF43555
PF34923
and so on. Those are the IDs that I need to look up in other files, which are named like "something_something.faa.out" and are structured like this:
<acc_number> <aligment_start> <aligment_end> <pfam_acc>
RXOOOA 5 250 PF12312
OC2144 6 200 PF34923
I need those accession numbers so I can then get the protein sequences from files that look like this:
>RXOOOA
ASDBSADBASDGHH
>OC2144
SADHHASDASDCJHWINF
CodePudding user response:
Assuming there is a file ids_file.txt in the same directory with the following content:
id1
id2
id3
id4
and that the same directory also contains a file called id1 with the following content:
Bla bla bla
id1
and id2
is
here id4
Then this script could help:
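#!/bin/sh
# Join the IDs into one extended-regex alternation: id1|id2|id3|id4
IDS_IN_ONE=$(tr '\n' '|' < ids_file.txt | sed -E 's/\|+$//')
echo "$IDS_IN_ONE"
# Each ID doubles as a file name here; skip IDs that have no such file
while IFS= read -r file; do
    [ -f "./$file" ] || continue
    grep -n -E "$IDS_IN_ONE" "./$file" >> output
done < ids_file.txt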
The file output then contains:
2:id1
3:and id2
5:here id4
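The script above still appends every match to a single output file. If one file per ID is wanted, as the question asks, a loop along the following lines might work. This is only a sketch: the input name sample.faa.out and the .hits suffix are made up for illustration, the sample data is copied from the question, and it is meant to be run in an empty scratch directory.

```shell
#!/bin/sh
# Sample inputs, using the file layout described in the question.
# "sample.faa.out" is a hypothetical name standing in for the real *.out files.
cat > pfamacc <<'EOF'
PF12312
PF43555
PF34923
EOF
cat > sample.faa.out <<'EOF'
RXOOOA 5 250 PF12312
OC2144 6 200 PF34923
EOF

# One output file per ID: <ID>.hits (hypothetical suffix) collects every
# matching line from all *.out files in the current directory.
while IFS= read -r id; do
    [ -n "$id" ] || continue    # skip blank lines in the ID list
    # -h drops file name prefixes, -w matches the ID only as a whole word.
    # grep exits non-zero when nothing matched, so remove the empty file then.
    grep -h -w -- "$id" ./*.out > "$id.hits" || rm -f "$id.hits"
done < pfamacc
```

With the sample data, this should leave PF12312.hits and PF34923.hits with one line each, and no file for PF43555 because nothing matches it.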
CodePudding user response:
As I read the question, a list needs to be cross-referenced to get a second list, which is then used to gather the FASTA records.
Starting with the following 3 files...
starting_values.txt
PF12312
PF43555
PF34923
cross_reference.txt
<acc_number> <aligment_start> <aligment_end> <pfam_acc>
RXOOOA 5 250 PF12312
OC2144 6 200 PF34923
find_from_file.fasta
>RXOOOA
ASDBSADBASDGHH
>OC2144
SADHHASDASDCJHWINF
SADHHASDASDCJHWINF
>NC11111
IURJCNKAERJKADSF
for i in $(cat starting_values.txt); do awk -v var="$i" '$4 == var {print $1}' cross_reference.txt; done > needed_accessions.txt
If the FASTA is multi-line, first convert each record to a single line (see https://www.biostars.org/p/9262/):
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' find_from_file.fasta > find_from_file.temp
for i in $(cat needed_accessions.txt); do grep -A 1 -w "^>$i" find_from_file.temp; done > found_sequences.fasta
Final Output...
found_sequences.fasta
>RXOOOA
ASDBSADBASDGHH
>OC2144
SADHHASDASDCJHWINFSADHHASDASDCJHWINF
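The steps above put every sequence into one file. Since the question asks for the proteins to stay grouped by family, the same cross-referencing could be done in a single awk pass that writes one FASTA file per family. This is a rough sketch rather than a tested pipeline: it reuses the three sample files from this answer, names the outputs <family>.fasta (a naming choice made up here), assumes the accession is the first word of each header line, and is meant to be run in an empty scratch directory.

```shell
#!/bin/sh
# Recreate the three sample files from this answer.
cat > starting_values.txt <<'EOF'
PF12312
PF43555
PF34923
EOF
cat > cross_reference.txt <<'EOF'
<acc_number> <aligment_start> <aligment_end> <pfam_acc>
RXOOOA 5 250 PF12312
OC2144 6 200 PF34923
EOF
cat > find_from_file.fasta <<'EOF'
>RXOOOA
ASDBSADBASDGHH
>OC2144
SADHHASDASDCJHWINF
SADHHASDASDCJHWINF
>NC11111
IURJCNKAERJKADSF
EOF

awk '
    # File 1: remember the family IDs we care about.
    FILENAME == ARGV[1] { wanted[$1] = 1; next }
    # File 2: map each accession to its wanted families (each pair only once).
    FILENAME == ARGV[2] {
        if (($4 in wanted) && !seen[$1, $4]++) fams[$1] = fams[$1] " " $4
        next
    }
    # File 3: on each header, look up the families of that accession.
    /^>/ { n = split(fams[substr($1, 2)], out, " ") }
    # Copy header and sequence lines to <family>.fasta for every family found.
    { for (k = 1; k <= n; k++) print > (out[k] ".fasta") }
' starting_values.txt cross_reference.txt find_from_file.fasta
```

With the sample data, this should produce PF12312.fasta containing the RXOOOA record and PF34923.fasta containing the OC2144 record with its two sequence lines left as they are. A protein that belongs to several wanted families is written to each of their files. With a very large number of families, some awk implementations may run into a limit on simultaneously open output files.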