Find similar words in an array and eliminate them

Time:09-21

$a[] = "paris";
$a[] = "london";
$a[] = "paris";
$a[] = "london tour";
$a[] = "london tours";
$a[] = "london";
$a[] = "londonn";

foreach($a as $name) {

echo $name;
echo '<br>';

}

Output:

paris
london
paris
london tour
london tours
london
londonn

I can eliminate exact duplicates with array_unique:

foreach(array_unique($a) as $name) {

echo $name;
echo '<br>';

}

Output:

paris
london
london tour
london tours
londonn

I want to take this further and eliminate similar words as well. For example, if there is a "london", I want to eliminate "londonn".

So the output will be:

paris
london
london tour

I tried similar_text($name, $name, $percent) but it did not help.

Here is what I tried with my limited knowledge:

foreach(array_unique($a) as $name) {
    $test = $a;
    foreach($test as $test1) {
        similar_text($name, $test1, $percent);
        if ($percent > 90) {
            echo $name;
            echo '<br>';
        }
    }
}

Output:

paris
paris
london
london
london
london tour
london tour
london tours
london tours
londonn
londonn
londonn

The source of the words is a search list:

$a[] = "$popular_search";

CodePudding user response:

The main problem seems to be the way you use the two nested loops. Here's a very explicit example, without anything fancy, showing how you could do this:

$a[] = "paris";
$a[] = "london";
$a[] = "paris";
$a[] = "london tour";
$a[] = "london tours";
$a[] = "london";
$a[] = "londonn";

$b = [];
foreach($a as $outerName) {
    // start optimistic, no similar string found
    $isUnique = true;
    foreach($b as $innerName) {
        // check whether the string already has a similar entry
        similar_text($outerName, $innerName, $percent);
        if ($percent > 90) {
            $isUnique = false;
            break;
        }
    }
    if ($isUnique) {
        $b[] = $outerName;
    }
}

print_r($b);


The output is:

Array
(
    [0] => paris
    [1] => london
    [2] => london tour
)

How does it work? The outer loop simply goes through all the strings in array $a. Inside it, an inner loop goes through the strings in $b, which have already been identified as unique enough. If a string from $a is more than 90% similar to any string in $b, we skip it; otherwise we append it to $b. That's all.
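The same idea can be wrapped in a reusable function so the threshold is easy to tune. This is only a minimal sketch: the function name filterSimilar and the $threshold parameter are made up for illustration and are not part of the answer above.

```php
<?php
// Hypothetical helper (name and signature are assumptions):
// keeps the first word of every group of "similar enough" words.
function filterSimilar(array $words, float $threshold = 90.0): array
{
    $kept = [];
    foreach ($words as $candidate) {
        // start optimistic, no similar string found yet
        $isUnique = true;
        foreach ($kept as $existing) {
            // similar_text() writes the similarity percentage into $percent
            similar_text($candidate, $existing, $percent);
            if ($percent > $threshold) {
                $isUnique = false;
                break;
            }
        }
        if ($isUnique) {
            $kept[] = $candidate;
        }
    }
    return $kept;
}

$a = ["paris", "london", "paris", "london tour", "london tours", "london", "londonn"];
print_r(filterSimilar($a));
```

With the default threshold of 90 this keeps paris, london and london tour. A stricter threshold such as 96 would also keep "london tours" and "londonn", because they score roughly 95.7% and 92.3% against their closest kept entry.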

CodePudding user response:

You can use the percentage that similar_text() provides: it writes the similarity between the two inputs, as a percentage, into its third argument ($percent in your code).

For a word game I implemented, I used this approach. Treating a pair as a match at a threshold somewhere between 60% and 80% worked for most of my test cases; it depends on how picky you want to be.
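To get a feel for a suitable threshold, it can help to print the percentage for a few pairs from the question. As far as I can tell, PHP derives the percentage from the number of matching characters: twice that count, divided by the combined length of both strings, times 100. The sketch below prints both the built-in value and that hand-computed value; the pairs are just examples.

```php
<?php
$pairs = [
    ["london", "londonn"],
    ["london tour", "london tours"],
    ["london", "london tour"],
];

foreach ($pairs as [$s1, $s2]) {
    // $sim is the number of matching characters; $percent is set by reference
    $sim = similar_text($s1, $s2, $percent);
    // The same percentage computed by hand, for comparison
    $manual = $sim * 2 / (strlen($s1) + strlen($s2)) * 100;
    printf("%-12s | %-12s | sim=%d | %.2f%% | %.2f%%\n", $s1, $s2, $sim, $percent, $manual);
}
```

The first two pairs score above 90%, while "london" against "london tour" stays near 70%, which is why a threshold of 90 removes "londonn" and "london tours" but keeps "london tour".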

For my case, to get it accurate, I actually converted the test words to metaphones first:

// Written as a plain function so the usage below works as shown.
// Log::info() assumes a Laravel-style Log facade; remove that line if you have none.
function testMetaphone($s1 = "", $s2 = "", $phonemes = 4)
{
    if (empty($s1) || empty($s2)) {
        return false;
    }

    // Compare the phonetic keys instead of the raw strings
    $m1 = metaphone($s1, $phonemes);
    $m2 = metaphone($s2, $phonemes);
    $sim = similar_text($m1, $m2, $perc);
    $logMessage = "M1: {$m1}, M2: {$m2}, Similarity: $sim ($perc %) - Original text: {$s1} | {$s2}";
    Log::info("testMetaphone: " . $logMessage);

    // Test accuracy
    return $perc >= 85;
}

Usage:

$answerCheck = testMetaphone("Toyota", "Totota", 6);

See it in action: https://3v4l.org/KceXD. The above fails if the threshold is 85% but passes with a lower threshold, so again you may need to experiment to find the accuracy you are happy with.

For your case, you can loop over the array, compare each element with every other element using this function, keep track of each word checked and how many similar entries there are, and then delete the 'duplicates' accordingly.
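One way to read that suggestion is sketched below. It is only an illustration: the function names isPhoneticallySimilar and groupPhoneticDuplicates, the threshold of 85, and the choice to call metaphone() without a phoneme limit are all assumptions, not part of the answer above. Each kept word is stored together with the number of entries that matched it.

```php
<?php
// Hypothetical helper: compares the metaphone keys of two strings.
function isPhoneticallySimilar(string $s1, string $s2, float $threshold = 85.0): bool
{
    if ($s1 === "" || $s2 === "") {
        return false;
    }
    similar_text(metaphone($s1), metaphone($s2), $percent);
    return $percent >= $threshold;
}

// Hypothetical helper: returns [kept word => number of entries that matched it].
function groupPhoneticDuplicates(array $words, float $threshold = 85.0): array
{
    $groups = [];
    foreach ($words as $word) {
        $matched = false;
        foreach (array_keys($groups) as $existing) {
            if (isPhoneticallySimilar($word, $existing, $threshold)) {
                // count the entry against the word that was kept first
                $groups[$existing]++;
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            $groups[$word] = 1;
        }
    }
    return $groups;
}

$a = ["paris", "london", "paris", "london tour", "london tours", "london", "londonn"];
$groups = groupPhoneticDuplicates($a);
print_r($groups);             // kept words with their match counts
print_r(array_keys($groups)); // just the de-duplicated words
```

Because the comparison works on phonetic keys, the results can differ from comparing the raw strings, so the threshold may need separate tuning.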
