Perl: Regex not grabbing multiline C style comments in code

Time:04-13

I have a Perl program that:

  • Reads a source (SRC) file written in C
  • Uses a regex match on the source file to find specifically formatted data to use as the destination filename
  • Opens a new destination file
  • Performs another regex match to find all C-style comments /* */ that contain the keyword abcd. Note: these comments can span one line or several, so the regex looks for the first /*, then the keyword abcd, then any amount of text and whitespace before the closing */
  • Writes the regex matches to the destination file
#!/usr/bin/perl
use warnings;
use strict;

my $src = 'D:\\Scripts\\sample.c';
my $fileName;

# open source file for reading
open(SRC_FH,'<',$src) or die $!;

while(my $row = <SRC_FH>){
    if ($row =~ /([0-9]{2}\.[0-9]{2}\.[0-9]{3}\.[a-z,0-9]{2}|[0-9]{2}\.[0-9]{2}\.[0-9]{3}\.[a-z,0-9]{3})/){
        $fileName = $1;
    }
}

my $des = "D:\\Scripts\\" . $fileName . ".txt";

# open destination file for writing
open(DES_FH,'>',$des) or die $!;

print("copying content from $src to $des\n");

seek SRC_FH, 0, 0;

while(my $row = <SRC_FH>){
    if ($row =~ /(\/\*.*abcd.[\s\S]*?\*\/)/){
        print DES_FH "$1\n";
    }
}

# always close the filehandles
close(SRC_FH);

close(DES_FH);
print "File content copied successfully!\n";

My problem, I think, comes from the way the Perl code executes: although my regex is correct, my destination file only gets the one-line comments written to it. Any C-style comments that span more than one line are not written to the destination file. What am I missing in my 2nd if statement?

I checked the regex from my 2nd if statement at https://regexr.com/ and it works as it's supposed to, capturing multiline C-style comments as well as single-line comments that contain the keyword abcd.

So I tried the 1st suggestion below by zdim. Here is what I used:

#!/usr/bin/perl
use warnings;
use strict;

my $src = 'D:\\Scripts\\sample.c';
my $fileName;
my @comments;

# open source file for reading
open(SRC_FH,'<',$src) or die $!;

while(my $row = <SRC_FH>){
    if ($row =~ /([0-9]{2}\.[0-9]{2}\.[0-9]{3}\.[a-z,0-9]{2}|[0-9]{2}\.[0-9]{2}\.[0-9]{3}\.[a-z,0-9]{3})/){
        $fileName = $1;
    }
}

my $des = "D:\\Scripts\\" . $fileName . ".txt";

# open destination file for writing
open(DES_FH,'>',$des) or die $!;

print("copying content from $src to $des\n");

#seek SRC_FH, 0, 0;

my $content = do {
    #read whole file at once
    local $/;
    open (SRC_FH,'<', $src) or die $!;
    <SRC_FH>;
};

#if($content =~ /(\/\*.*abcd.[\s\S]*?\*\/)/sg){
#       my @comments = $content;
#   }

my @comments = $content =~ /(\/\*.*abcd.[\s\S]*?\*\/)/sg;

foreach (@comments){
    print DES_FH "$1\n";
}

#while(my $row = <SRC_FH>){
#   if ($row =~ /(\/\*.*abcd.[\s\S]*?\*\/)/){
#       print DES_FH "$1\n";
#   }
#}

# always close the filehandles
close(SRC_FH);

close(DES_FH);
print "File content copied successfully!\n";

The result is that all the content from sample.c is copied to the destination file, a full 1:1 copy, whereas I am looking to pull only the comments (single-line and multiline) out of the C file.

Example 1: /* abcd */

Example 2: /* some text * some more comments abcd and some more comments */

Final Solution

#!/usr/bin/perl
use warnings;
use strict;

my $src = 'D:\\Scripts\\sample.c';
my $fileName;

# open source file for reading
open(SRC_FH,'<',$src) or die $!;

while(my $row = <SRC_FH>){
    if ($row =~ /([0-9]{2}\.[0-9]{2}\.[0-9]{3}\.[a-z,0-9]{2}|[0-9]{2}\.[0-9]{2}\.[0-9]{3}\.[a-z,0-9]{3})/){
        $fileName = $1;
    }
}

my $des = "D:\\Scripts\\" . $fileName . ".txt";

# open destination file for writing
open(DES_FH,'>',$des) or die $!;

print("copying content from $src to $des\n");

seek SRC_FH, 0, 0;

my $content = do{local $/; <SRC_FH>};

my @comments = $content =~ /(\/\*.*abcd.[\s\S]*?\*\/)/g;

for(@comments){
    print DES_FH "$_\n";
}

# always close the filehandles
close(SRC_FH);

close(DES_FH);
print "File content copied successfully!\n";

CodePudding user response:

What am I missing in my 2nd if statement?

Well, nothing is missing from the statement itself. In a multiline C comment, no single line contains both /* and */, so that regex cannot match a multiline comment when the file is read line by line.
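
A small sketch of this, using a hypothetical two-line comment held in a string rather than read from sample.c: no individual line contains both delimiters, so a per-line match finds nothing, while the same text matched as one string yields the comment.

```perl
use strict;
use warnings;

# Hypothetical sample input, standing in for the contents of sample.c
my $text = "int x;\n/* first line abcd\n   second line */\nint y;\n";

# Line by line: no single line holds both the opening and closing delimiter
my @per_line = grep { m{/\*.*\*/} } split /^/, $text;

# Whole string at once: /s lets . cross the newline inside the comment
my @whole = $text =~ m{(/\*.*?\*/)}sg;

printf "per line: %d, whole string: %d\n", scalar @per_line, scalar @whole;
```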

To catch such comments either:

  • Read the whole file into a string ("slurp" it), and add the /s modifier to the regex so that . matches a newline as well. Also use the /g modifier to capture all such patterns in the string. One way:

    my $content = do { 
        local $/;  # undef record separator so the whole file is read at once
        open my $src_fh, '<', $src_file or die $!;  # have to re-open
        <$src_fh>;                                  # reads it all
    };  # lexical filehandle gets closed as we leave scope
    
    # NOTE -- there may be difficulties in capturing comments in a C source file
    my @comments = $content =~ /.../sg;  # your regex
    

    Or use a library to slurp a file, like

    use Path::Tiny;
    my $content = path($src_file)->slurp; 
    

Or,

  • Set a flag when you see /*, copy all lines until you hit the closing */, then unset the flag. Here is a rudimentary version of that

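    # assumes 'use feature qw(say);' and open filehandles $src_fh, $des_fh
    my $inside_comment = 0;
    while (<$src_fh>) {
        chomp;                    # say adds the newline back
        if (!$inside_comment && m{(/\*.*)}) {
            say $des_fh $1;       # opening line for the comment
            # stay "inside" only if the comment is not closed on this same line
            $inside_comment = 1 unless $1 =~ m{\*/};
        }
        elsif ($inside_comment && m{(.*\*/)}) {   # closing line for the comment
            say $des_fh $1;
            $inside_comment = 0;
        }
        elsif ($inside_comment) { say $des_fh $_ }
    }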
    

    I tested all this but please check and improve. For one, this plays funny with leading spaces.
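
For the slurp route, one caveat with the regex from the question: under /s its leading .* is greedy and can run from the first /* to the last abcd in the file, which may swallow most of the source. A minimal sketch of one way around this, assuming comments never contain a nested /* and using a hypothetical inline string in place of the slurped file: pull out each comment with a non-greedy match, then keep only those that mention the keyword.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the slurped contents of the C source
my $content = <<'END_C';
/* plain comment */
int a; /* abcd */
/* some text
 * some more comments abcd and some more comments
 */
int b;
END_C

# Non-greedy .*? stops at the first closing delimiter, so each match is one comment
my @all_comments = $content =~ m{(/\*.*?\*/)}sg;

# Keep only the comments that mention the keyword
my @comments = grep { /abcd/ } @all_comments;

print "$_\n" for @comments;
```

Separating "find a comment" from "does it contain abcd" keeps each pattern simple, at the cost of a second pass over the matches.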

Note: Getting all comments out of a C program in general may be rather tricky.
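
One illustration of why, sketched on a hypothetical line of C: a /* inside a string literal is not a comment, yet a plain pattern treats it as the start of one.

```perl
use strict;
use warnings;

# Hypothetical C source line: the first "/*" sits inside a string literal
my $line = 'char *s = "/* not a comment"; /* real comment */';

my @found = $line =~ m{(/\*.*?\*/)}sg;

# The single match starts inside the string literal and runs to the end
# of the real comment
print "$_\n" for @found;
```

Handling such cases reliably generally needs a pass that also tracks string and character literals (and // comments) rather than a single regex.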


Here is a one-line version of slurping

my $file_content = do { local (@ARGV, $/) = $file_name; <> };