I have this result from Phobius, which looks like the following:
ID sp|Q92673|1-2157
FT SIGNAL 1 28
FT DOMAIN 1 11 N-REGION.
FT DOMAIN 12 22 H-REGION.
FT DOMAIN 23 28 C-REGION.
FT DOMAIN 29 2135 NON CYTOPLASMIC.
FT TRANSMEM 2136 2156
FT DOMAIN 2157 2157 CYTOPLASMIC.
//
---------------------------------------------------------------------
ID sp|Q5SSG8|25-479
FT DOMAIN 1 455 NON CYTOPLASMIC.
//
---------------------------------------------------------------------
ID sp|Q92854|22-734
FT DOMAIN 1 713 NON CYTOPLASMIC.
//
---------------------------------------------------------------------
ID sp|Q9Y5E9|27-686
FT DOMAIN 1 660 NON CYTOPLASMIC.
//
---------------------------------------------------------------------
ID sp|Q9Y6N8|55-613
FT DOMAIN 1 559 NON CYTOPLASMIC.
//
I wish to print the corresponding UniProt ID in front of each line of every result; the results are separated by //.
Here is the perl snippet I created
open (MYFILE, "result_phobius.txt" )||warn "Couldn't open file because $!"; #give input file name
open (FILE, ">output.txt"); #output file name
while (<MYFILE>)
{
if ($_=~/^ID (\S+?)\s/) #capture the accession number that follows "ID" and ends at whitespace
{
$id=$1;
chomp ($id);
print FILE "$id\t"; #will print accession number in a colomn
}
if ($_=~/^FT /)
{
print FILE "$_";
}
}
This prints the ID only on the first line of each result, i.e. it works fine for results with a single feature line but fails when a result has more than one.
For example:
FT SIGNAL 1 28
FT DOMAIN 1 11 N-REGION.
FT DOMAIN 12 22 H-REGION.
FT DOMAIN 23 28 C-REGION.
FT DOMAIN 29 2135 NON CYTOPLASMIC.
FT TRANSMEM 2136 2156
FT DOMAIN 2157 2157 CYTOPLASMIC.
sp|Q5SSG8|25-479 FT DOMAIN 1 455 NON CYTOPLASMIC.
sp|Q92854|22-734 FT DOMAIN 1 713 NON CYTOPLASMIC.
sp|Q9Y5E9|27-686 FT DOMAIN 1 660 NON CYTOPLASMIC.
sp|Q9Y6N8|55-613 FT DOMAIN 1 559 NON CYTOPLASMIC.
sp|Q02763|23-748 FT DOMAIN 1 726 NON CYTOPLASMIC.
sp|Q14517|22-4181 FT DOMAIN 1 4160 NON CYTOPLASMIC.
sp|O75051|35-1237 FT DOMAIN 1 1203 NON CYTOPLASMIC.
tr|D3DPA4|1-145 FT DOMAIN 1 119 CYTOPLASMIC.
FT TRANSMEM 120 144
FT DOMAIN 145 145 NON CYTOPLASMIC.
How can I make it work for results with multiple lines?
Expected output:
sp|Q92673|1-2157 FT SIGNAL 1 28
sp|Q92673|1-2157 FT DOMAIN 1 11 N-REGION.
sp|Q92673|1-2157 FT DOMAIN 12 22 H-REGION.
sp|Q92673|1-2157 FT DOMAIN 23 28 C-REGION.
sp|Q92673|1-2157 FT DOMAIN 29 2135 NON CYTOPLASMIC.
sp|Q92673|1-2157 FT TRANSMEM 2136 2156
sp|Q92673|1-2157 FT DOMAIN 2157 2157 CYTOPLASMIC.
sp|Q5SSG8|25-479 FT DOMAIN 1 455 NON CYTOPLASMIC.
sp|Q92854|22-734 FT DOMAIN 1 713 NON CYTOPLASMIC.
sp|Q9Y5E9|27-686 FT DOMAIN 1 660 NON CYTOPLASMIC.
sp|Q9Y6N8|55-613 FT DOMAIN 1 559 NON CYTOPLASMIC.
sp|Q02763|23-748 FT DOMAIN 1 726 NON CYTOPLASMIC.
sp|Q14517|22-4181 FT DOMAIN 1 4160 NON CYTOPLASMIC.
sp|O75051|35-1237 FT DOMAIN 1 1203 NON CYTOPLASMIC.
tr|D3DPA4|1-145 FT DOMAIN 1 119 CYTOPLASMIC.
tr|D3DPA4|1-145 FT TRANSMEM 120 144
tr|D3DPA4|1-145 FT DOMAIN 145 145 NON CYTOPLASMIC.
Thanks for the help in advance
CodePudding user response:
Just move the print FILE "$id\t" into the other if block, i.e. only set $id when an ID line is seen, and print it for every FT line.
You might add a check that the $id isn't empty before printing it, but it shouldn't happen if I understand the format correctly.
if (/^ID (\S+?)\s/)
{
$id = $1;
}
if (/^FT /)
{
print FILE "$id\t$_";
}
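For completeness, here is a minimal self-contained sketch of the same idea, assuming the input format shown in the question. The sub name prefix_features and the inline sample data are only illustrative, not part of the original script; the optional check for an empty $id mentioned above is included, and the ID is forgotten again at each // line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Prefix every FT line with the ID of the record it belongs to.
# Takes a list of input lines and returns a list of output lines.
# (prefix_features is a hypothetical helper name used for illustration.)
sub prefix_features {
    my @lines = @_;
    my $id;    # ID of the current record; undef until an ID line is seen
    my @out;
    for my $line (@lines) {
        chomp(my $copy = $line);
        if ($copy =~ /^ID\s+(\S+)/) {
            $id = $1;              # remember the ID for the FT lines that follow
        }
        elsif ($copy =~ m{^//}) {
            undef $id;             # end of record: forget the ID
        }
        elsif ($copy =~ /^FT\s/ && defined $id) {
            push @out, "$id\t$copy";
        }
        # any other line (e.g. the dashed separator) is ignored
    }
    return @out;
}

# Small demonstration using a few lines taken from the question.
my @sample = (
    "ID sp|Q92673|1-2157\n",
    "FT SIGNAL 1 28\n",
    "FT TRANSMEM 2136 2156\n",
    "//\n",
    "ID sp|Q5SSG8|25-479\n",
    "FT DOMAIN 1 455 NON CYTOPLASMIC.\n",
    "//\n",
);
print "$_\n" for prefix_features(@sample);
```

To run it on a real file, the lines could be read from a filehandle and passed in, for example prefix_features(<$fh>) after opening result_phobius.txt with a lexical filehandle.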