Perl anchored regex performance


Problem and Data

At the bottom of this post is the entire script from which this NYTProf data was generated. The script builds a hash and then attempts to delete keys that contain certain bad patterns. Running the code through NYTProf generates the following:

delete @$hash{ grep { /\Q$bad_pattern\E/ } sort keys %$hash };
# spent  7.29ms making    2 calls to main::CORE:sort, avg 3.64ms/call
# spent   808µs making 7552 calls to main::CORE:match, avg 107ns/call
# spent   806µs making 7552 calls to main::CORE:regcomp, avg 107ns/call

There are over 7,000 calls being made to main::CORE:match and main::CORE:regcomp. The assumption is that this is a sufficient number of calls to reduce noise.

Moving on! The bad patterns only need to be deleted if they appear at the beginning of a key. Sounds great! Adding a ^ to anchor the regex should improve performance. However, NYTProf generates the following (it has been run many times, and the results are quite consistent):

delete @$hash{ grep { /^\Q$bad_pattern\E/ } sort keys %$hash };
# spent  7.34ms making    2 calls to main::CORE:sort, avg 3.67ms/call
# spent  1.62ms making 7552 calls to main::CORE:regcomp, avg 214ns/call
# spent   723µs making 7552 calls to main::CORE:match, avg 96ns/call

Questions

The anchored regex nearly doubles the time spent in these main::CORE:* calls, with almost all of the increase in regcomp. But an anchored regex should improve performance. What is unique about this dataset that causes the anchored regex to take so much additional time?

Entire Script

use strict;
use Devel::NYTProf;

my @states = qw(KansasCity MississippiState ColoradoMountain IdahoInTheNorthWest AnchorageIsEvenFurtherNorth);
my @cities = qw(WitchitaHouston ChicagoDenver);
my @streets = qw(DowntownMainStreetInTheCity CenterStreetOverTheHill HickoryBasketOnTheWall);
my @seasoncode = qw(8000S 8000P 8000F 8000W);
my @historycode = qw(7000S 7000P 7000F 7000W 7000A 7000D 7000G 7000H);
my @sides = qw(left right up down);

my $hash;
for my $state (@states) {
    for my $city (@cities) {
        for my $street (@streets) {
            for my $season (@seasoncode) {
                for my $history (@historycode) {
                    for my $side (@sides) {
                        $hash->{$state . '[0].' . $city . '[1].' . $street . '[2].' . $season . '.' . $history . '.' . $side} = 1;
                    }
                }
            }
        }
    }
}

sub CleanseHash {
    my @bad_patterns = (
        'KansasCity[0].WitchitaHouston[1].DowntownMainStreetInTheCity[2]',
        'ColoradoMountain[0].ChicagoDenver[1].HickoryBasketOnTheWall[2].8000F'
    );

    for my $bad_pattern (@bad_patterns) {
        delete @$hash{ grep { /^\Q$bad_pattern\E/ } sort keys %$hash };
    }
}

DB::enable_profile();
CleanseHash();
DB::finish_profile();

CodePudding user response:

It's very unlikely you can optimise the regex engine. If performance is your goal, though, you can concentrate on other parts of the code. For example, try this:

for my $bad_pattern (@bad_patterns) {
    my $re = qr/^\Q$bad_pattern\E/;
    delete @$hash{ grep /$re/, sort keys %$hash };
}

On my machine, it runs much faster (regardless of the presence of the anchor), because the expression form of grep doesn't have to create a new scope for each key, and the relatively expensive compilation of the regex happens just once per bad pattern.
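
To see how much the per-key trip through the regex compiler costs on its own, here is a minimal Benchmark sketch (the keys below are made up for illustration, not taken from the question's data). It compares the pattern interpolated directly inside grep's block against a pattern precompiled once with qr//:

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical keys, shaped loosely like the ones in the question.
my @keys = map { "KansasCity[0].WitchitaHouston[1].Street$_.left" } 1 .. 5000;
my $bad_pattern = 'KansasCity[0].WitchitaHouston[1].DowntownMainStreetInTheCity[2]';

cmpthese( -3, {
    # Pattern interpolated inside the block: the compiler is consulted
    # (the regcomp entries in the profile) for every key grep examines.
    inline => sub {
        my @hits = grep { /^\Q$bad_pattern\E/ } @keys;
    },
    # Pattern compiled once with qr//, then reused for every key.
    precompiled => sub {
        my $re = qr/^\Q$bad_pattern\E/;
        my @hits = grep /$re/, @keys;
    },
});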

CodePudding user response:

That's a fairly straightforward match, with the pattern being a fixed string, so the anchored pattern should indeed be faster in general. The profiling confirms as much for the match itself: 96 ns/call vs 107 ns/call.

But when I benchmark anchored and un-anchored versions of the code, they run neck and neck. That is down to the rest of the code, which overwhelms the cost of the match itself: the sort of the keys is unnecessary for this comparison, and the regex is needlessly recompiled inside grep's loop.

Once those are removed, I do get the anchored call to be 10-15% faster (over different runs):

use warnings;
use strict;
use feature 'say';

use Data::Dump;
use Storable qw(dclone);
use Benchmark qw(cmpthese);

my $runfor = shift // 3;

my @states = qw(KansasCity MississippiState ColoradoMountain IdahoInTheNorthWest AnchorageIsEvenFurtherNorth);
my @cities = qw(WitchitaHouston ChicagoDenver);
my @streets = qw(DowntownMainStreetInTheCity CenterStreetOverTheHill HickoryBasketOnTheWall);
my @seasoncode = qw(8000S 8000P 8000F 8000W);
my @historycode = qw(7000S 7000P 7000F 7000W 7000A 7000D 7000G 7000H);
my @sides = qw(left right up down);

my @bad_patterns = ( 
    'KansasCity[0].WitchitaHouston[1].DowntownMainStreetInTheCity[2]',
    'ColoradoMountain[0].ChicagoDenver[1].HickoryBasketOnTheWall[2].8000F'
);

my $hash_1;
for my $state (@states) {
    for my $city (@cities) {
        for my $street (@streets) {
            for my $season (@seasoncode) {
                for my $history (@historycode) {
                    for my $side (@sides) {
                        $hash_1->{$state . '[0].' . $city . '[1].' . $street . '[2].' . $season . '.' . $history . '.' . $side} = 1;
                    }
                }
            }
        }
    }   
}    
my $hash_2 = dclone $hash_1;

#say for @bad_patterns; say '---'; dd $hash_1; exit;

sub no_anchor {
    for my $bad_pattern (@bad_patterns) {
        my $re = qr/\Q$bad_pattern\E/;
        delete @$hash_2{ grep { /$re/ } keys %$hash_2 };
    }   
}   
sub w_anchor {
    for my $bad_pattern (@bad_patterns) {
        my $re = qr/^\Q$bad_pattern\E/;
        delete @$hash_1{ grep { /$re/ } keys %$hash_1 };
    }
}


cmpthese( -$runfor, {
    'no_anchor' => sub { no_anchor() },
    'w_anchor'  => sub { w_anchor() },
});

I have the comparison subs use external data (not passed in as arguments, as would be usual) to cut out any extra work, and I give each sub its own copy of the hashref, obtained with Storable::dclone, so the deletes in one sub don't affect the other.

The output of the benchmark above, run for 10 seconds (pass 10 to the program on the command line):

           Rate no_anchor  w_anchor
no_anchor 296/s        --      -13%
w_anchor  341/s       15%        --

So the anchored version does win, albeit by a smaller margin than I'd expect: with this data the regex fails to match most of the time, so the un-anchored version has to do more work on the great majority of keys, and I'd expect a larger difference. The remaining overhead in the code further dilutes the difference in the matching efficiency itself.

This carries an important lesson about timing code: it can be subtle. One needs to ensure that only the relevant sections are compared, and that they are compared fairly (under equal conditions).
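
As a sanity check, the match itself can be timed in isolation, with no grep, sort, or hash deletion in the way. A minimal sketch under my own assumptions: the key below is one of the generated keys that neither bad pattern matches, chosen for illustration:

use strict;
use warnings;
use Benchmark qw(cmpthese);

# A generated key that neither bad pattern matches: the un-anchored
# regex searches the whole string, while the anchored one can give up
# after failing at position 0.
my $key = 'MississippiState[0].ChicagoDenver[1].CenterStreetOverTheHill[2].8000W.7000A.left';
my $bad = 'KansasCity[0].WitchitaHouston[1].DowntownMainStreetInTheCity[2]';

my $re_plain    = qr/\Q$bad\E/;
my $re_anchored = qr/^\Q$bad\E/;

cmpthese( -3, {
    plain    => sub { my $hit = $key =~ $re_plain },
    anchored => sub { my $hit = $key =~ $re_anchored },
});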
