Perl Hash : get keys if the value is in-between a range-CodePudding

I have a two dimensional hash with 100K keys(primary) like this and I need to get the primary key - the name of the fruit only if a particular condition is satisfied;

like - if the price is between 35 and 55; Desired output is Orange and Grape.

And there is a list (hundreds in count) of unique price ranges for which I need a list of fruits within each range.

Iterating through hash again and again for each price range takes a lot of time. Is there a way we can do it quickly instead of looping through the entire hash for each price range?

Hash format :

$fruits{"Mango"}{color}=Yellow

$fruits{"Mango"}{price}=80

$fruits{"Orange"}{color}=Orange

$fruits{"Orange"}{price}=40

$fruits{"Grape"}{color}=Green

$fruits{"Grape"}{price}=50

CodePudding user response：

Here is an example of how you can do a single scan through the fruits by sorting the the prices in numerical order. This should be faster than scanning the whole hash once for each price range:

package Main;
use v5.20.0;
use feature qw(say);
use strict;
use warnings;
use experimental qw(signatures);
{
    my %fruits;
    $fruits{Mango}{color}  = "Yellow";
    $fruits{Mango}{price}  = 80;
    $fruits{Orange}{color} = "Orange";
    $fruits{Orange}{price} = 40;
    $fruits{Grape}{color}  = "Green";
    $fruits{Grape}{price}  = 50;
    my @ranges = ( [35, 55], [45, 55], [2, 85] );
    my $self = Main->new(
        fruits => \%fruits,
        ranges => \@ranges
    );
    $self->init_mapping_arrays();
    my $names = $self->get_fruit_names_for_all_ranges();
}


sub init_mapping_arrays( $self ) {
    my @prices;
    my @names;
    for my $fruit (keys %{ $self->{fruits} }) {
        push @names, $fruit;
        push @prices, $self->{fruits}{$fruit}{price};
    }
    my @idx = map { $_->[0] }
      sort { $a->[1] <=> $b->[1] } map { [$_, $prices[$_]] } 0..$#prices;
    $self->{prices} = [@prices[@idx]];
    $self->{names} = [@names[@idx]];
}

sub get_fruit_names_for_all_ranges ($self) {
    my @names;
    my $prices = $self->{prices};
    my $ranges = $self->{ranges};
    for my $i (0..$#$prices) {
        for my $range (0..$#$ranges) {
            if (   ($ranges->[$range][0] <= $prices->[$i])
                && ($ranges->[$range][1] >= $prices->[$i]))
            {
                push @{$names[$range]}, $self->{names}[$i];
            }
        }
    }
    return \@names;
}

sub new( $class, %args ) { bless \%args, $class }

If this is not fast enough, the get_fruit_names_for_all_ranges() sub can be optimized further by also sorting the ranges.

CodePudding user response：

If the fruits are sorted, two binary searches would find the fruits quickly.

sub search_cmp 

my @fruits = (
   { name => "Orange", price => 40, ... },
   ...
);

my @ranges = (
   [ 35, 55 ],
   ...   
);

my @sorted_fruits = sort { $a->{price} <=> $b->{price} } @fruits;

for my $range (@ranges) {
   my $i = binsearch { $a <=> $b->{price} } $range[0], @sorted_fruits, 0;
   $i = ~$i if $i < 0;

   my $j = binsearch { $a <=> $b->{price} } $range[1], @sorted_fruits, $i;
   $j = ~$j - 1 if $j < 0;

   say "[$range->{min}, $range->{max}]: @fruits[$i..$j]";
}

sub _unsigned_to_signed { unpack('j', pack('J', $_[0])) }

sub binsearch(&$\@;$$) {
   my  $compare = $_[0];
   #my $value   = $_[1];
   my  $array   = $_[2];
   my  $min     = $_[3] // 0;
   my  $max     = $_[4] // $#$array;

   return -1 if $max == -1;

   my $ap = do { no strict 'refs'; \*{caller().'::a'} };  local *$ap;
   my $bp = do { no strict 'refs'; \*{caller().'::b'} };  local *$bp;

   *$ap = \($_[1]);
   while ($min <= $max) {
      my $mid = int(($min $max)/2);
      *$bp = \($array->[$mid]);

      my $cmp = $compare->()
         or return $mid;

      if ($cmp < 0) {
         $max = $mid - 1;
      } else {
         $min = $mid   1;
      }
   }

   return _unsigned_to_signed(~$min);
}

Performance analysis

The best possible worst case is O(R * F) because every range could match every fruit.

The naive approach described by the OP asked to replace is O(R * F). So is it as fast as it can be? No, because the naive approach always adopts its worse case.

In practice, if we can assume that each range matches just a few fruits, we can get far far better results on average from the above: O( ( F R ) log F )