Home > Software engineering >  Intersect two arrays with minimal duplicate values
Intersect two arrays with minimal duplicate values

Time:07-26

For optimisation purposes, I need to intersect two arrays and keep the least number of duplicate values from the two initial arrays in the resulting array.

The order of values in the resulting array is not important.

Another important constraint is time complexity as this will be executed in a big loop.

Why array_intersect doesn't work :

From Shawn Pyle in the PHP docs :

array_intersect handles duplicate items in arrays differently. If there are duplicates in the first array, all matching duplicates will be returned. If there are duplicates in any of the subsequent arrays they will not be returned.

Rules :

  • Returns values of $arr1 that are in $arr2
  • If $arr1 or $arr2 contain duplicate values, return the least number of values between the two

Examples :

  • intersect([1, 1, 2, 3, 4, 4, 5], [1, 3, 3, 5, 5]) returns [1, 3, 5]
  • intersect([1, 1, 2, 3, 4, 4, 5], [1, 1, 1, 3, 3, 5, 5]) returns [1, 1, 3, 5]
  • intersect([1, 1, 2, 3, 4, 4, 5, 5], [1, 3, 3, 5, 5]) returns [1, 3, 5, 5]
  • intersect([1, 1, 1], [1, 1, 1]) returns [1, 1, 1]
  • intersect([1, 2, 3], [1, 3, 2]) returns [1, 2, 3]

CodePudding user response:

Here's my attempt, I'm basically looking for a faster way to do this if possible :

function intersect($a, $b)
{
    $a_values_count = array_count_values($a);
    $b_values_count = array_count_values($b);

    $res = array_values(array_intersect($a, $b));
    $res_values_count = array_count_values($res);
    foreach ($res as $key => $val)
    {
        if ($res_values_count[$val] > $a_values_count[$val] || $res_values_count[$val] > $b_values_count[$val])
        {
            unset($res[$key]);
            $res_values_count[$val]--;
        }
    }

    return array_values($res);
}

assert(intersect([1, 1, 2, 3, 4, 4, 5], [1, 3, 3, 5, 5]) == [1, 3, 5]);
assert(intersect([1, 1, 2, 3, 4, 4, 5], [1, 1, 1, 3, 3, 5, 5]) == [1, 1, 3, 5]);
assert(intersect([1, 1, 2, 3, 4, 4, 5, 5], [1, 3, 3, 5, 5]) == [1, 3, 5, 5]);
assert(intersect([1, 1, 1], [1, 1, 1]) == [1, 1, 1]);
assert(intersect([1, 2, 3], [1, 3, 2]) == [1, 2, 3]);

CodePudding user response:

Hi I have to say that at first look the @Aderrahim answer look really nice, but then I tried to use a simple approach and test the performance.

Here is the code:

function intersectSimple($a, $b)
{
    $result = array();
    $short = count($a) < count($b) ? $a : $b;
    $long = count($a) < count($b) ? $b : $a;
    foreach ($short as $v) {
        if (in_array($v, $long)) {
            //if found add to results and remove from b
            $result[] = $v;
            unset($long[array_search($v, $long)]);
        }
    }
    return $result;
}

function intersectAderrahim($a, $b)
{
    $a_values_count = array_count_values($a);
    $b_values_count = array_count_values($b);

    $res = array_values(array_intersect($a, $b));
    $res_values_count = array_count_values($res);
    foreach ($res as $key => $val)
    {
        if ($res_values_count[$val] > $a_values_count[$val] || $res_values_count[$val] > $b_values_count[$val])
        {
            unset($res[$key]);
            $res_values_count[$val]--;
        }
    }

    return array_values($res);
}

//Start timer
$start = microtime(true);

echo "Start Test\n";
//Test code print each assert result
//Run code 100000 times
for ($i = 0; $i < 100000; $i  )
{
    $a = array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
    $b = array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
    $result = intersectSimple($a, $b);
    assert(count($result) == 10);
}

//Stop timer
$end = microtime(true);
$time = $end - $start;
//Print performance in microseconds
echo "Performance Simple: $time\n";

//Start timer
$start = microtime(true);

echo "Start Test\n";
//Test code print each assert result
//Run code 100000 times
for ($i = 0; $i < 100000; $i  )
{
    $a = array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
    $b = array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
    $result = intersectAderrahim($a, $b);
    assert(count($result) == 10);
}
//Stop timer
$end = microtime(true);
$time = $end - $start;
//Print performance in microseconds
echo "Performance Aderrahim: $time\n";

So I write a fast performance test and run it, the results are:

Start Test
Performance Simple: 0.060362815856934
Start Test
Performance Aderrahim: 0.16634893417358

I don't know if this can be extrapolated to a real case, but you can try in your scenario and test which is better. I really love to know which is the best with real data.

  • Related