I have a UniProt document containing a protein sequence as well as some metadata. I need to use Perl to match the sequence and print it out, but for some reason the last line always comes out twice. The code I wrote is here:
#!/usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
if($_=~m/^\s+(\D+)/) { #this is the pattern I used to match the sequence in the document
$seq=$1;
$seq=~s/\s//g;} #removing the spaces from the sequence
print $seq;
}
I instead tried $seq.=$1; but it printed out the sequence 4.5 times. I'm sure I have made a mistake here, but I'm not sure what. Here is the input file: https://www.uniprot.org/uniprot/P30988.txt
CodePudding user response:
Here is your code, reformatted with extra whitespace between operators, to make it clearer which scope each statement runs in.
#!/usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
    if ($_ =~ m/^\s+(\D+)/) {
        $seq = $1;
        $seq =~ s/\s//g;
    }
    print $seq;
}
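The placement of the print command means that $seq will be printed for every line from the input file, even those that don't match the regex. That is why the last sequence line shows up twice: the file ends with a "//" line that doesn't match, so $seq still holds the previous value and is printed once more.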
I suspect you want this:
#!/usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
    if ($_ =~ m/^\s+(\D+)/) {
        $seq = $1;
        $seq =~ s/\s//g;
        # only print $seq for lines that match /^\s+(\D+)/
        # Also - added a newline to make it easier to debug
        print $seq . "\n";
    }
}
When I run that I get this:
MRFTFTSRCLALFLLLNHPTPILPAFSNQTYPTIEPKPFLYVVGRKKMMDAQYKCYDRMQ
QLPAYQGEGPYCNRTWDGWLCWDDTPAGVLSYQFCPDYFPDFDPSEKVTKYCDEKGVWFK
HPENNRTWSNYTMCNAFTPEKLKNAYVLYYLAIVGHSLSIFTLVISLGIFVFFRSLGCQR
VTLHKNMFLTYILNSMIIIIHLVEVVPNGELVRRDPVSCKILHFFHQYMMACNYFWMLCE
GIYLHTLIVVAVFTEKQRLRWYYLLGWGFPLVPTTIHAITRAVYFNDNCWLSVETHLLYI
IHGPVMAALVVNFFFLLNIVRVLVTKMRETHEAESHMYLKAVKATMILVPLLGIQFVVFP
WRPSNKMLGKIYDYVMHSLIHFQGFFVATIYCFCNNEVQTTVKRQWAQFKIQWNQRWGRR
PSNRSARAAAAAAEAGDIPIYICHQELRNEPANNQGEESAEIIPLNIIEQESSA
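If the goal is the whole sequence as one string (the $seq .= $1 attempt from the question), one option is to append inside the loop and print only once, after the loop has finished. The following is a minimal sketch, not a definitive implementation. It assumes the sequence lines are the only lines in the file that begin with whitespace, and the sub name read_sequence is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collects every line that starts with whitespace
# into a single string, with all whitespace removed.
sub read_sequence {
    my ($fh) = @_;
    my $seq = '';
    while (my $line = <$fh>) {
        if ($line =~ /^\s+(\D+)/) {
            my $chunk = $1;
            $chunk =~ s/\s+//g;   # drop inner spaces and the trailing newline
            $seq .= $chunk;       # append, but do not print yet
        }
    }
    return $seq;
}

# Run as: perl script.pl P30988.txt
if (@ARGV) {
    open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    print read_sequence($in), "\n";   # printed once, after the loop
    close($in);
}
```

Printing inside the loop with .= is what produced the repeated output: the growing string was printed again on every line of the file.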
CodePudding user response:
You can simplify this a bit:
while (<IN>) {
    next unless m/^\s/;
    s/\s+//g;
    print;
}
You want the lines that begin with whitespace, so immediately skip those that don't. Said another way, quickly reject things you don't want, which is different from accepting things you do want. This means that everything after the next knows it's dealing with a good line. Now the if disappears.
You don't need a capture ($1) to get the interesting text because the only other text in the line is the leading whitespace. That leading whitespace disappears when you remove all the whitespace. This gets rid of the if and the extra variable.
Finally, print what's left. Without an argument, print uses the value in the topic variable $_.
Now that's much more manageable. You escape the scoping issue where the if caused the extra output, because there's no scope to worry about.
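For completeness, here is one way the loop above might look as a full script. This is a sketch under a few assumptions: it uses a lexical filehandle and checks that open succeeded, the sub name print_sequence_lines is hypothetical, and, unlike the loop above, it adds a newline after each chunk so the output is split across lines rather than run together.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: prints each whitespace-led line from $in to $out
# with all whitespace stripped, one sequence chunk per output line.
sub print_sequence_lines {
    my ($in, $out) = @_;
    local $_;                 # keep the caller's $_ intact
    while (<$in>) {
        next unless /^\s/;    # skip metadata lines and the "//" terminator
        s/\s+//g;             # removes leading spaces, inner spaces, newline
        print {$out} "$_\n";
    }
    return;
}

# Run as: perl script.pl P30988.txt
if (@ARGV) {
    open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    print_sequence_lines($in, \*STDOUT);
    close($in);
}
```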