How can I print out multiple similar patterns I have matched in perl?-CodePudding

So I got a uniprot flat file in fasta format (https://www.uniprot.org/uniprot/P30988.txt here is the link but it's supposed to be able to work with any uniprot flat file) and I want to essentially take the transmembrane parts from the sequence and print them out. Here is what I have done so far:

#!usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
    if($_=~ m/^ID/) {
    print $_ ; }
    if($_=~ m/^AC/){
    print $_ ; }
    #if ($_=~ m/^SQ/){
    #print $_; }
    if ($_=~ m/^\s (. )/) {
    $seq.=$1;                   # Here I match ONLY the sequence
    $seq=~s/\s//g; print $seq;  # Here I removed spaces from the sequence
     }

    if($_=~ m/^FT\s TRANSMEM\s (\d )\.\.(\d )/){
    $s=$1; $e=$2;} # Now I know that
      # this is the way to match the lines that mention the
      # transmembrane parts and the parentheses are the number of
      # the amino acids on the sequence BUT as it has multiple
      # transmembrane domains I am not very sure how to handle it.

    $tr=substr ($seq, $s-1, $e-$s 1); # here I try to use the numbers
      # given in the file to get the sequences of each part

    push (@transmembrane, $tr);
    print @transmembrane; # Now this is something I tried and while
    # it printed a sequence it was not the same length as the total
    # amount of amino acids that are in the transmembrane parts. Not
    # sure whats up with that.

The issue I have is that I'm not sure how to deal with the fact that $1 and $2 are not one value but as many as there are transmembrane parts.

If any of you know how to make this work I'd appreciate it!

CodePudding user response：

[ As with @choroba I have no idea what a transmembrane is, so I'm guessing in the code below ]

First thing - you script doesn't compile - you are missing the closing } for the while block.

Next observation is in this block of code (I've removed your comments and reformatted make the code clearer)

    if($_=~ m/^FT\s TRANSMEM\s (\d )\.\.(\d )/){  
        $s=$1; 
        $e=$2;
    } 
    $tr=substr ($seq, $s-1, $e-$s 1); 
    push (@transmembrane, $tr); 
    print @transmembrane;

I'm guessing that you want the $tr variable updated every time that the regex matches and added to @transmembrane. Unfortunately, these two lines are not in the scope of the if statement, so it will only happen once.

    $tr=substr ($seq, $s-1, $e-$s 1); 
    push (@transmembrane, $tr);

Next the statement print @transmembrane; is in the scope of the while block, so will be output after every line is read from P30988.txt. It needs to be moved out of the while block scope.

The next very fundamental issue is you are running the substr statement against the $seq variable before it has been populated. Looking at P30988.txt, I see that the FT TRANSMEM lines come before the sequence data. The $seq variable will not have been populated when you run the substr statement.

That means to need to store the FT TRANSMEM offsets and run the substr commands once the while loop terminates.

Here is a rewrite that does that (note I've commented out the printing of $seq)

open (IN,'P30988.txt');
while (<IN>) {
    if($_=~ m/^ID/) {
        print $_ ; }
    if($_=~ m/^AC/){
        print $_ ; }
    #if ($_=~ m/^SQ/){
        #print $_; }
    if ($_=~ m/^\s (. )/) { 
        $seq.=$1;                        #Here I match ONLY the sequence
        $seq=~s/\s//g; 
        #print $seq;       #Here I removed spaces from the sequence
     } 
    
    
    if($_=~ m/^FT\s TRANSMEM\s (\d )\.\.(\d )/){  
        $s = $1; 
        $e = $2;

        # store the offset & length for later
        push @transmembrane, [ $s-1, $e-$s 1 ];
    } 

}

for my $item (@transmembrane)
{
    my ($from, $len) = @$item;
    print substr($seq, $from, $len) . "\n";
}

I get this when I run the modified code

ID   CALCR_HUMAN             Reviewed;         474 AA.
AC   P30988; A0A0C4DG16; A4D1G6; F5H605; O14585; Q13941; Q5ZGL8; Q659U6; Q6DJU8;
AC   Q6T712;
IVGHSLSIFTLVISLGIFVFF
MFLTYILNSMIIIIHLVEVVP
YFWMLCEGIYLHTLIVVAVFT
LRWYYLLGWGFPLVPTTIHAI
LLYIIHGPVMAALVVNFFFLL
AVKATMILVPLLGIQFVVFPW
YVMHSLIHFQGFFVATIYCFC