I have a two dimensional hash with 100K keys(primary) like this and I need to get the primary key - the name of the fruit only if a particular condition is satisfied;
like - if the price is between 35 and 55; Desired output is Orange and Grape.
And there is a list (hundreds in count) of unique price ranges for which I need a list of fruits within each range.
Iterating through hash again and again for each price range takes a lot of time. Is there a way we can do it quickly instead of looping through the entire hash for each price range?
Hash format :
$fruits{"Mango"}{color}=Yellow
$fruits{"Mango"}{price}=80
$fruits{"Orange"}{color}=Orange
$fruits{"Orange"}{price}=40
$fruits{"Grape"}{color}=Green
$fruits{"Grape"}{price}=50
CodePudding user response:
Here is an example of how you can do a single scan through the fruits by sorting the the prices in numerical order. This should be faster than scanning the whole hash once for each price range:
package Main;
use v5.20.0;
use feature qw(say);
use strict;
use warnings;
use experimental qw(signatures);
{
my %fruits;
$fruits{Mango}{color} = "Yellow";
$fruits{Mango}{price} = 80;
$fruits{Orange}{color} = "Orange";
$fruits{Orange}{price} = 40;
$fruits{Grape}{color} = "Green";
$fruits{Grape}{price} = 50;
my @ranges = ( [35, 55], [45, 55], [2, 85] );
my $self = Main->new(
fruits => \%fruits,
ranges => \@ranges
);
$self->init_mapping_arrays();
my $names = $self->get_fruit_names_for_all_ranges();
}
sub init_mapping_arrays( $self ) {
my @prices;
my @names;
for my $fruit (keys %{ $self->{fruits} }) {
push @names, $fruit;
push @prices, $self->{fruits}{$fruit}{price};
}
my @idx = map { $_->[0] }
sort { $a->[1] <=> $b->[1] } map { [$_, $prices[$_]] } 0..$#prices;
$self->{prices} = [@prices[@idx]];
$self->{names} = [@names[@idx]];
}
sub get_fruit_names_for_all_ranges ($self) {
my @names;
my $prices = $self->{prices};
my $ranges = $self->{ranges};
for my $i (0..$#$prices) {
for my $range (0..$#$ranges) {
if ( ($ranges->[$range][0] <= $prices->[$i])
&& ($ranges->[$range][1] >= $prices->[$i]))
{
push @{$names[$range]}, $self->{names}[$i];
}
}
}
return \@names;
}
sub new( $class, %args ) { bless \%args, $class }
If this is not fast enough, the get_fruit_names_for_all_ranges()
sub can be optimized further by also sorting the ranges.
CodePudding user response:
If the fruits are sorted, two binary searches would find the fruits quickly.
sub search_cmp
my @fruits = (
{ name => "Orange", price => 40, ... },
...
);
my @ranges = (
[ 35, 55 ],
...
);
my @sorted_fruits = sort { $a->{price} <=> $b->{price} } @fruits;
for my $range (@ranges) {
my $i = binsearch { $a <=> $b->{price} } $range[0], @sorted_fruits, 0;
$i = ~$i if $i < 0;
my $j = binsearch { $a <=> $b->{price} } $range[1], @sorted_fruits, $i;
$j = ~$j - 1 if $j < 0;
say "[$range->{min}, $range->{max}]: @fruits[$i..$j]";
}
sub _unsigned_to_signed { unpack('j', pack('J', $_[0])) }
sub binsearch(&$\@;$$) {
my $compare = $_[0];
#my $value = $_[1];
my $array = $_[2];
my $min = $_[3] // 0;
my $max = $_[4] // $#$array;
return -1 if $max == -1;
my $ap = do { no strict 'refs'; \*{caller().'::a'} }; local *$ap;
my $bp = do { no strict 'refs'; \*{caller().'::b'} }; local *$bp;
*$ap = \($_[1]);
while ($min <= $max) {
my $mid = int(($min $max)/2);
*$bp = \($array->[$mid]);
my $cmp = $compare->()
or return $mid;
if ($cmp < 0) {
$max = $mid - 1;
} else {
$min = $mid 1;
}
}
return _unsigned_to_signed(~$min);
}
Performance analysis
The best possible worst case is O(R * F) because every range could match every fruit.
The naive approach described by the OP asked to replace is O(R * F). So is it as fast as it can be? No, because the naive approach always adopts its worse case.
In practice, if we can assume that each range matches just a few fruits, we can get far far better results on average from the above: O( ( F R ) log F )