I have this Bash program:
dict_fn='/usr/share/dict/american-english'
benchmark_repetitions=100
for i in `seq 1 $benchmark_repetitions` ; do
readarray -t syllables < 'syllables.txt'
for syllable in "${syllables[@]}" ; do
matches=$(grep "$syllable" $dict_fn | wc -l)
#echo "${syllable}: $matches"
done
done
It runs in about 4 seconds on my laptop
$ time t.sh
real 0m3,488s
user 0m4,103s
sys 0m0,910s
with this input file syllables.txt:
art
bur
cic
dun
eik
fum
gaw
hiw
irc
jaw
kry
lus
mac
noq
old
pew
qla
rub
sur
tif
uks
Then I tried to do the same in Perl:
use feature qw(say);
use strict;
use warnings;
use experimental qw(declared_refs refaliasing signatures);
{
my $benchmark_repetitions = 100;
for (1..$benchmark_repetitions) {
my \@lines = read_dict();
my \@syllables = read_syllables();
for my $syllable (@syllables) {
my @matches = grep /\Q$syllable/, @lines;
my $num_matches = scalar @matches;
#say "${syllable}: ${num_matches}";
}
}
}
sub read_syllables() {
my $fn = 'syllables.txt';
return read_file($fn);
}
sub read_dict() {
my $fn = '/usr/share/dict/american-english'; # A file with more than 100,000 words
return read_file($fn);
}
sub read_file( $fn ) {
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
chomp(my @lines = <$fh>);
close $fh;
return \@lines;
}
But this script ran to my surprise much slower:
$ time p.pl
real 0m13,392s
user 0m13,369s
sys 0m0,021s
I thought that the perl script would be faster since it did not call the external commands grep
and wc
like the bash script does. Any ideas why the Bash script is so fast, or alternatively, why the Perl script is so slow?
CodePudding user response:
The crux of this is clearly in the filtering of so many lines, where Perl's grep
runs a regex engine every time; it appears that system's grep
is just that much faster, what I find unsurprising. (Even though the actual performance ratio is!)
As a little proof, replace
my $num_matches = () = grep /\Q$syllable/, @$lines;
with
my $num_matches = () = grep { index($_,$syllable) != -1 } @$lines;
and my runtime gets cut in half!
I get rid of the explicit array (@matches
) since bash code doesn't have one for that but directly pipes the list to wc
(small speedup). I also removed the experimental features so that this is more accessible (to older Perls) but that doesn't make much of a difference.
While this brings us reasonably close we'd still need a considerable speedup to catch up but I don't see where to get that. Perhaps system's grep
and handling of files is just faster.
CodePudding user response:
As commented by AnFi, the real bottleneck here is that you read the entire file into memory needlessly in your Perl version.
As a simple baseline, reading the syllables into memory and then looping over the big file line by line brings my baseline down to 0.5 seconds, down from 1.5 seconds from reading both files into memory and then quitting. But even this no-op baseline is slower than grep
in a loop! (0.18 seconds with your code; marginally faster with grep -Fc
.)
The fastest Perl implementation I could pull off takes 4 seconds; a slightly incorrect Perl version which produces wrong results got it down to 3.8 seconds.
use feature qw(say);
use strict;
use warnings;
{
open(my $fh, '<', shift @ARGV);
chomp (my @syllables = <$fh>);
close $fh;
my %syllable = map { $_ => 0 } @syllables;
while (<>) {
for my $match (@syllables) {
$syllable{$match} = 1 if index($_,$match) != -1;
}
}
for my $match (@syllables) {
say "${match}: $syllable{$match}";
}
}
(Credit for the index
hack goes to @zdim.)
The slightly incorrect one fails because there are overlaps between the regular expressions, so if you have "art" and "tar" in the syllables, "tartan" will match "tar" but then not match "art".
my $regex = "(" . join("|", @syllables) . ")";
while (<>) {
for my $match ($_ =~ qr($regex)) {
$syllable{$match};
}
}
I tried to fool around with non-overlapping regexes and named regex groups but all the alternative solutions I could come up with were much slower.
Wait, hang on, here is 3.1 seconds with a simple lookahead.
use feature qw(say);
use strict;
use warnings;
{
open(my $fh, '<', shift @ARGV);
chomp (my @syllables = <$fh>);
close $fh;
my %syllable = map { $_ => 0 } @syllables;
my $regex = "(?=(" . join("|", @syllables) . "))";
my $re = qr{$regex(?:.|$)};
while (<>) {
my %seen;
for my $match ($_ =~ m/$re/go) {
next if $seen{$match} ;
$syllable{$match};
}
}
for my $syll (@syllables) {
say "${syll}: $syllable{$syll}";
}
}
(This is on a Mac, with 10 iterations and input from /usr/share/dict/web2
which is probably different from your american-english
file, but not too different. 235886 words, 2493109 bytes. I refactored the tests to all run in a Bash loop and had the Perl script just produce the result once.)
I would vaguely speculate that grep
does some sort of memory mapping which behaves more or less ideally with the OS buffering, but I have not looked into this in any detail. The grep
on MacOS is the BSD heritage, and different from what you probably have on Linux. (You haven't revealed your platform, so I'm speculating some more here.)