Can anyone spot the problem in my for loop in PERL?-CodePudding

I have a text file which contains a protein sequence in FASTA format (https://www.uniprot.org/uniprot/P51170.fasta). FASTA files have the first line which is a header and the rest is the sequence of interest. Each letter is one amino acid. I want to write a program that finds the motifs VSEX (X being any amino acid and the others being specific ones) and prints out the motif itself and the position it was found. So far this has been my code

#!usr/bin/perl
open (IN,'P51170.fasta.txt');
while(<IN>) {
$seq.=$_;
$seq=~s/ //g;
chomp $seq;
} 
#print $seq;

$j=0;
$l= length $seq;
#print $l;
for ($i=0, $i<=$l-4,$i  ){
    $j=$i 1;
    $motif= substr ($seq,$i,4);
    if ($motif=~m/VSE(.)/) {
        print "motif $motif found in position $j \n" ;
    }
}

Im pretty sure I have messed up the loop but I dont know what went wrong. The output I get on cygwin is the following

motif  found in position 2
motif  found in position 2
motif  found in position 2

CodePudding user response：

So some general perl tips:

Always use strict; and use warnings; - that'll tell you when your code is doing something bogus.

Anyway, in trying to figure out what's going wrong (although the other post correctly points out that perl for loops need semicolons not commas) I rewrote it a little to accomplish what I think is the same result:

#!usr/bin/perl

use strict;
use warnings;

#read in source data
open (my $input_data,' '<', P51170.fasta.txt') or die $!;    
#extract 'first line':
#>sp|P51170|SCNNG_HUMAN Amiloride-sensitive sodium channel subunit gamma OS=Homo sapiens OX=9606 GN=SCNN1G PE=1 SV=4

my $header = <$input_data>;
#slurp all the rest of the file into the seq string.  
#$/ is the 'end of line' separate, thus we temporarily undefine it to read the whole file rather than just one line
my $seq = do { local $/; <$input_data> };
#remove linefeeds from the sequence
$seq =~ s/[\r\n]//g;
close $input_data;


#printing what either looks like for clarity
print $header, "\n\n";   
print $seq, "\n\n";


#iterate regex matches against $seq
while ( $seq =~ m/VSE(.)/g ) {
    # use pos() function to report where matches happened. $1 is the contents
    # of the first 'capture bracket'. 
    print "motif $1 at ", pos($seq), "\n";
}

So rather than manually for-looping through your data, we instead use the regex engine and the perl pos() function to find where any relevant matches occur.

CodePudding user response：

Use semicolons in the C-style loop:

for ($i=0; $i<=$l-4; $i  ) {

Or, use a Perl style loop:

for my $i (0 .. $l - 4) {

But you don't have to loop over the positions, Perl can do that for you (see pos):

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

open my $in, '<', 'P51170.fasta.txt' or die $!;
my $seq = "";
while (<$in>) {
    chomp;
    s/ //g;
    $seq .= $_;
}

while ($seq =~ /(VSE.)/g) {
    say "Motif $1 found at ", pos($seq) - 3;
}

Note that I followed some good practices:

I used strict and warnings;
I checked the return value of open;
I used the 3-argument version of open;
I used a lexical filehandle;
I don't remove spaces from $seq again and again, only from the newly read lines.

CodePudding user response：

Your base program can be reduced to a one-liner:

perl -0777 -nwE's/\s //g; say "motif $1 found at " . pos while /VSE(.)/g' P51170.fasta.txt

-0777 is slurp mode: the whole file is read in one go
-n uses either standard input, or a file name as argument as source for input
-E enable features (say in this case)
s/\s //g removes all whitespace from the default variable $_ -- the input from the file
$1 contains the string matched by the parentheses in the regex
pos reports the position of the match

Though of course this includes the header in the offset, which I assume you do not want. So we're going to have to skip the slurp mode -0777 and the -n switch. It will clutter the code somewhat:

perl -wE'$h = <>; $_ = do { local $/; <> }; s/\s //g; say "motif $1 found at " . pos while /VSE(.)/g' P51170.fasta.txt

Here we use the idiomatic do block slurp in combination with the diamond operator <>. The diamond operator will use either standard input, or a file name, just like -n.

The first <> reads the header, which can be used later.

The one-liner as a program file looks like this:

use strict;
use warnings;                 # never skip these pragmas
use feature 'say';

my $h = <>;                   # header, skip for now
$_ = do { local $/; <> };     # slurp rest of file
s/\s //g;                     # remove whitespace
say "motif $1 found at " . pos while /VSE(.)/g;

Use it like this:

perl fasta.pl P51170.fasta.txt