Why is my last line always output twice?

I have a UniProt document with a protein sequence as well as some metadata. I need to use Perl to match the sequence and print it out, but for some reason the last line always comes out twice. The code I wrote is here:

#!/usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {

if($_=~m/^\s+(\D+)/) {   #this is the pattern I used to match the sequence in the document
  $seq=$1;
  $seq=~s/\s//g;}         #removing the spaces from the sequence

  print $seq;  
}

I instead tried $seq.=$1; but it printed out the sequence 4.5 times. I'm sure I have made a mistake here, but I'm not sure what. Here is the input file: https://www.uniprot.org/uniprot/P30988.txt

CodePudding user response：

Here is your code, reformatted with consistent indentation and extra whitespace around operators to make it clearer which scope each statement runs in.

#!/usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {

    if ($_ =~ m/^\s+(\D+)/) {
        $seq = $1;
        $seq =~ s/\s//g;
    }   

    print $seq;  
}

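The placement of the print statement means that $seq is printed for every line of the input file, even lines that don't match the regex. Before the first match $seq is still empty, so nothing visible is printed. But the file ends with a // terminator line that doesn't match, so the value left over from the last sequence line is printed a second time.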
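To see the effect in isolation, here is a minimal sketch that runs the same matching logic over three made-up lines (two sequence lines followed by the // terminator) and records what would be printed. The sub name printed_values and the sample data are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the end of a UniProt entry:
# two sequence lines followed by the "//" terminator line.
my @lines = (
    "     MRFTFTSRCL ALFLLLNHPT\n",
    "     PSNRSARAAA AAAEAGDIPI\n",
    "//\n",
);

# Same logic as the loop above, but the values that would be
# printed are collected in a list so they can be inspected.
sub printed_values {
    my @input = @_;
    my $seq = '';
    my @printed;
    for my $line (@input) {
        if ($line =~ m/^\s+(\D+)/) {
            $seq = $1;
            $seq =~ s/\s//g;
        }
        push @printed, $seq;   # runs for every line, matching or not
    }
    return @printed;
}

print "$_\n" for printed_values(@lines);
```

The third value is a repeat of the second: the // line fails the match, so $seq still holds the previous line's text when it is printed.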

I suspect you want this

#!/usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {

    if ($_ =~ m/^\s+(\D+)/) {
        $seq = $1;
        $seq =~ s/\s//g;

        # only print $seq for lines that match /^\s+(\D+)/
        # Also - added a newline to make it easier to debug

        print $seq . "\n";
    } 
}

When I run that I get this

MRFTFTSRCLALFLLLNHPTPILPAFSNQTYPTIEPKPFLYVVGRKKMMDAQYKCYDRMQ 
QLPAYQGEGPYCNRTWDGWLCWDDTPAGVLSYQFCPDYFPDFDPSEKVTKYCDEKGVWFK 
HPENNRTWSNYTMCNAFTPEKLKNAYVLYYLAIVGHSLSIFTLVISLGIFVFFRSLGCQR 
VTLHKNMFLTYILNSMIIIIHLVEVVPNGELVRRDPVSCKILHFFHQYMMACNYFWMLCE 
GIYLHTLIVVAVFTEKQRLRWYYLLGWGFPLVPTTIHAITRAVYFNDNCWLSVETHLLYI 
IHGPVMAALVVNFFFLLNIVRVLVTKMRETHEAESHMYLKAVKATMILVPLLGIQFVVFP 
WRPSNKMLGKIYDYVMHSLIHFQGFFVATIYCFCNNEVQTTVKRQWAQFKIQWNQRWGRR 
PSNRSARAAAAAAEAGDIPIYICHQELRNEPANNQGEESAEIIPLNIIEQESSA
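If the goal is a single continuous sequence string rather than one line of output per input line, one option is to accumulate the matching lines and print once after the loop. The $seq .= $1 attempt from the question presumably printed so much because the print was still inside the loop, so the growing string was printed on every pass. This is only a sketch: the sub name collect_sequence and the in-memory sample are assumptions, used here so that the example runs without P30988.txt.

```perl
use strict;
use warnings;

# Collect every sequence line into one string. Assumes, as in the
# question, that sequence lines are the only lines that start with
# whitespace.
sub collect_sequence {
    my ($fh) = @_;
    my $seq = '';
    while (my $line = <$fh>) {
        next unless $line =~ m/^\s+(\D+)/;
        (my $chunk = $1) =~ s/\s+//g;   # copy the capture, then strip whitespace
        $seq .= $chunk;
    }
    return $seq;
}

# Hypothetical in-memory sample standing in for the real file
my $sample = "SQ   SEQUENCE\n     MRFTFTSRCL ALFLLLNHPT\n     PSNRSARAAA\n//\n";
open(my $fh, '<', \$sample) or die "Cannot open sample: $!";
print collect_sequence($fh), "\n";      # printed once, after the loop
close($fh);
```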

CodePudding user response：

You can simplify this a bit:

while (<IN>) {
    next unless m/^\s/;
    s/\s+//g;
    print;
    }

You want the lines that begin with whitespace, so immediately skip those that don't. Said another way, quickly reject things you don't want, which is different than accepting things you do want. This means that everything after the next knows it's dealing with a good line. Now the if disappears.

You don't need to get a capture ($1) to get the interesting text because the only other text in the line is the leading whitespace. That leading whitespace disappears when you remove all the whitespace. This gets rid of the if and the extra variable.

Finally, print what's left. Without an argument, print uses the value in the topic variable $_.

Now that's much more manageable. You also avoid the scoping issue that caused the extra output: with no if block, there is no print sitting outside it.
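For completeness, here is one way the same loop might look as a full script, with strict and warnings enabled, a lexical filehandle, and an error check on open. The sub name sequence_lines and the choice to take the file name from the command line are assumptions made for this sketch, not part of the original code. Because the substitution also removes the newline, the sketch adds one back when printing.

```perl
use strict;
use warnings;

# Return the whitespace-stripped text of every line that starts with
# whitespace. Assumes the input has no blank lines; a blank line would
# also pass the check and come back as an empty string.
sub sequence_lines {
    my ($fh) = @_;
    my @kept;
    local $_;
    while (<$fh>) {
        next unless m/^\s/;   # reject lines that don't start with whitespace
        s/\s+//g;             # strip all whitespace, including the newline
        push @kept, $_;
    }
    return @kept;
}

# Run as: perl script.pl P30988.txt
if (@ARGV) {
    open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    print "$_\n" for sequence_lines($in);
    close($in);
}
```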