issue with loop printing in perl in file parsing

Time:10-20

I have output from Phobius that looks like this:

ID   sp|Q92673|1-2157
FT   SIGNAL        1     28       
FT   DOMAIN        1     11       N-REGION.
FT   DOMAIN       12     22       H-REGION.
FT   DOMAIN       23     28       C-REGION.
FT   DOMAIN       29   2135       NON CYTOPLASMIC.
FT   TRANSMEM   2136   2156       
FT   DOMAIN     2157   2157       CYTOPLASMIC.
//
---------------------------------------------------------------------
ID   sp|Q5SSG8|25-479
FT   DOMAIN        1    455       NON CYTOPLASMIC.
//
---------------------------------------------------------------------
ID   sp|Q92854|22-734
FT   DOMAIN        1    713       NON CYTOPLASMIC.
//
---------------------------------------------------------------------
ID   sp|Q9Y5E9|27-686
FT   DOMAIN        1    660       NON CYTOPLASMIC.
// 
---------------------------------------------------------------------
ID   sp|Q9Y6N8|55-613
FT   DOMAIN        1    559       NON CYTOPLASMIC.
//

I want to print the corresponding UniProt ID in front of each FT line of every result. The results are separated by //.

Here is the Perl snippet I wrote:

open (MYFILE, "result_phobius.txt" )||warn "Couldn't open file because $!"; #give input file name
open (FILE, ">output.txt"); #output file name
while (<MYFILE>)
{
    if ($_=~/^ID   (\S+?)\s/) #search accession number following "ID" and terminated by white space
    {
        $id=$1;
        chomp ($id);
        print FILE "$id\t"; #will print accession number in a column
    }
    if ($_=~/^FT   /)
    {
        print FILE "$_";
    }
}

This prints the ID only on the first line of each result, i.e. it works fine for results with a single domain but fails when there is more than one domain.

For example:

FT   SIGNAL        1     28       
FT   DOMAIN        1     11       N-REGION.
FT   DOMAIN       12     22       H-REGION.
FT   DOMAIN       23     28       C-REGION.
FT   DOMAIN       29   2135       NON CYTOPLASMIC.
FT   TRANSMEM   2136   2156       
FT   DOMAIN     2157   2157       CYTOPLASMIC.
sp|Q5SSG8|25-479    FT   DOMAIN        1    455       NON CYTOPLASMIC.
sp|Q92854|22-734    FT   DOMAIN        1    713       NON CYTOPLASMIC.
sp|Q9Y5E9|27-686    FT   DOMAIN        1    660       NON CYTOPLASMIC.
sp|Q9Y6N8|55-613    FT   DOMAIN        1    559       NON CYTOPLASMIC.
sp|Q02763|23-748    FT   DOMAIN        1    726       NON CYTOPLASMIC.
sp|Q14517|22-4181   FT   DOMAIN        1   4160       NON CYTOPLASMIC.
sp|O75051|35-1237   FT   DOMAIN        1   1203       NON CYTOPLASMIC.
tr|D3DPA4|1-145 FT   DOMAIN        1    119       CYTOPLASMIC.
FT   TRANSMEM    120    144       
FT   DOMAIN      145    145       NON CYTOPLASMIC.

How can I make it work for results with multiple entries?

Expected output:

sp|Q92673|1-2157    FT   SIGNAL        1     28       
sp|Q92673|1-2157    FT   DOMAIN        1     11       N-REGION.
sp|Q92673|1-2157    FT   DOMAIN       12     22       H-REGION.
sp|Q92673|1-2157    FT   DOMAIN       23     28       C-REGION.
sp|Q92673|1-2157    FT   DOMAIN       29   2135       NON CYTOPLASMIC.
sp|Q92673|1-2157    FT   TRANSMEM   2136   2156       
sp|Q92673|1-2157    FT   DOMAIN     2157   2157       CYTOPLASMIC.
sp|Q5SSG8|25-479    FT   DOMAIN        1    455       NON CYTOPLASMIC.
sp|Q92854|22-734    FT   DOMAIN        1    713       NON CYTOPLASMIC.
sp|Q9Y5E9|27-686    FT   DOMAIN        1    660       NON CYTOPLASMIC.
sp|Q9Y6N8|55-613    FT   DOMAIN        1    559       NON CYTOPLASMIC.
sp|Q02763|23-748    FT   DOMAIN        1    726       NON CYTOPLASMIC.
sp|Q14517|22-4181   FT   DOMAIN        1   4160       NON CYTOPLASMIC.
sp|O75051|35-1237   FT   DOMAIN        1   1203       NON CYTOPLASMIC.
tr|D3DPA4|1-145     FT   DOMAIN        1    119       CYTOPLASMIC.
tr|D3DPA4|1-145     FT   TRANSMEM    120    144       
tr|D3DPA4|1-145     FT   DOMAIN      145    145       NON CYTOPLASMIC.

Thanks for the help in advance

Answer:

Just move the print FILE "$id\t" into the other if block: set $id only when an ID line is seen, and print it in front of every FT line.

You might add a check that $id isn't empty before printing it, but that shouldn't happen if I understand the format correctly.

   if (/^ID   (\S+?)\s/)
   {
        $id = $1;    # remember the current ID
   }
   if (/^FT   /)
   {
        print FILE "$id\t$_";    # prefix every FT line with that ID
   }
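
For completeness, here is a minimal self-contained sketch of the same idea that can be run as is. The helper name prefix_features and the in-memory sample are my own additions for illustration, not part of the original script; the sample reuses two records from the question. It also includes the "is $id set" check mentioned above, on the assumption that FT lines before any ID line should simply be skipped.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): copy every FT line from
# $in to $out, prefixed with the most recently seen ID and a tab.
sub prefix_features {
    my ($in, $out) = @_;
    my $id;
    while (my $line = <$in>) {
        if ($line =~ /^ID\s+(\S+)/) {
            $id = $1;    # remember the ID for the FT lines that follow
        }
        elsif ($line =~ /^FT\s/) {
            # Assumption: an FT line before any ID line is skipped.
            next unless defined $id;
            print {$out} "$id\t$line";
        }
    }
    return;
}

# Demo on two records from the question, using in-memory filehandles.
my $sample = <<'END';
ID   sp|Q92673|1-2157
FT   TRANSMEM   2136   2156
FT   DOMAIN     2157   2157       CYTOPLASMIC.
//
ID   sp|Q5SSG8|25-479
FT   DOMAIN        1    455       NON CYTOPLASMIC.
//
END

open my $in, '<', \$sample or die "Cannot read sample: $!";
my $result = '';
open my $out, '>', \$result or die "Cannot write result: $!";
prefix_features($in, $out);
close $out;
close $in;

print $result;
```

To run it on the real files, open them with three-argument open and lexical filehandles instead, e.g. open my $in, '<', 'result_phobius.txt' or die "Couldn't open file: $!"; and open my $out, '>', 'output.txt' or die "Couldn't open file: $!"; then pass both to the helper.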