Fuzzy phrase similarity in SQL Server-CodePudding

Using SQL Server, how should I perform a fuzzy ranked search across all rows in a large table for similarity to a long phrase on one column?

In other words, if my data looks like this:

Id	Data
1	the quick brown fox jumps over the lazy dog
2	the quick brown cat jumps over the lazy frog
3	the lazy quick brown frog jumps over the cat
4	lorem ipsum dolor sit amet

And I search for "the quick brown ox jumps over a lazy dog", I want results that roughly resemble:

Id	Score
1	95
2	80
3	40
4	0

Actual data would have more rows and longer phrases.

Obviously I don't want an exact string match, so using LIKE or CONTAINS apparently wouldn't work.

Word order matters, so searching each word individually would not work either.

Full-text indices sound like they're only useful for word similarity, so I haven't seen a way to apply that to phrase similarity.

Levenshtein distance sounds like it should only be used on smaller data, so I assume it would be too slow here?

I've seen TFIDF and cosine similarity mentioned, but I'm not sure whether that's the right thing to use here, or how it might be implemented on SQL Server.

CodePudding user response：

Use the FUNCTION F_INFERENCE_BASIQUE I wrote about that...

https://sqlpro.developpez.com/cours/sql/comparaisons-motifs/#LVI

Demo :

CREATE TABLE T_TEST (id int, val VARCHAR(256));
INSERT INTO T_TEST VALUES
(1,     'the quick brown fox jumps over the lazy dog'),
(2,     'the quick brown cat jumps over the lazy frog'),
(3,     'the lazy quick brown frog jumps over the cat'),
(4,     'lorem ipsum dolor sit amet');

SELECT id, 100.0 * dbo.F_INFERENCE_BASIQUE(val, 'the quick brown fox jumps over the lazy dog') 
       / LEN('the quick brown fox jumps over the lazy dog') AS PERCENT_MATCH
FROM   T_TEST

id          PERCENT_MATCH
----------- ---------------------------------------
1           100.000000000000
2           46.511627906976
3           81.395348837209
4           6.976744186046

You can adapt the code for you own convenience, as an example in eliminatig the two ways comparizon with only the one way... In this case the match indice is :

id          PERCENT_MATCH
----------- ---------------------------------------
1           100.000000000000
2           46.511627906976
3           25.581395348837
4           4.651162790697

Wich is nearly close to your demand !

CodePudding user response：

There are many methods and algorithms that perform fuzzy string comparison.

In my system I use a CLR function that calculates a Jaro–Winkler distance. I use it when a user tries to create a new company. Before creating a new entry I calculate the Jaro–Winkler distance between a new company name and all existing companies in the database to see if it already exists and allow for some mistypes and slightly different spelling.

I show few top matches to the user hoping that they would not create duplicates.

This is how it works for your example:

DECLARE @T TABLE (id int, val VARCHAR(256));
INSERT INTO @T VALUES
(1,  'the quick brown fox jumps over the lazy dog'),
(2,  'the quick brown cat jumps over the lazy frog'),
(3,  'the lazy quick brown frog jumps over the cat'),
(4,  'lorem ipsum dolor sit amet'),
(12, 'the quick brown ox jumps over the lazy dog'),
(13, 'the quick brown fox jumps over a lazy dog'),
(14, 'the quick brown ox jumps over a lazy dog')
;

SELECT
    T.*
    ,dbo.StringSimilarityJaroWinkler(T.val, 
        'the quick brown ox jumps over a lazy dog') AS dist
FROM @T AS T
ORDER BY dist desc
;

 ---- ---------------------------------------------- ------------------- 
| id |                     val                      |       dist        |
 ---- ---------------------------------------------- ------------------- 
| 14 | the quick brown ox jumps over a lazy dog     | 1                 |
| 13 | the quick brown fox jumps over a lazy dog    | 0.995121951219512 |
| 12 | the quick brown ox jumps over the lazy dog   | 0.975586080586081 |
|  1 | the quick brown fox jumps over the lazy dog  | 0.971267143709004 |
|  2 | the quick brown cat jumps over the lazy frog | 0.931560196560197 |
|  3 | the lazy quick brown frog jumps over the cat | 0.836212121212121 |
|  4 | lorem ipsum dolor sit amet                   | 0.584472934472934 |
 ---- ---------------------------------------------- -------------------

Here is a C# code for the CLR function:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    /*
    The Winkler modification will not be applied unless the percent match
    was at or above the WeightThreshold percent without the modification.
    Winkler's paper used a default value of 0.7
    */
    private static readonly double m_dWeightThreshold = 0.7;

    /*
    Size of the prefix to be concidered by the Winkler modification.
    Winkler's paper used a default value of 4
    */
    private static readonly int m_iNumChars = 4;

    [Microsoft.SqlServer.Server.SqlFunction(DataAccess = DataAccessKind.None, SystemDataAccess = SystemDataAccessKind.None, IsDeterministic = true, IsPrecise = true)]
    public static SqlDouble StringSimilarityJaroWinkler(SqlString string1, SqlString string2)
    {
        if (string1.IsNull || string2.IsNull)
        {
            return 0.0;
        }

        return GetStringSimilarityJaroWinkler(string1.Value, string2.Value);
    }

    private static double GetStringSimilarityJaroWinkler(string string1, string string2)
    {
        int iLen1 = string1.Length;
        int iLen2 = string2.Length;
        if (iLen1 == 0)
        {
            return iLen2 == 0 ? 1.0 : 0.0;
        }

        int iSearchRange = Math.Max(0, Math.Max(iLen1, iLen2) / 2 - 1);

        bool[] Matched1 = new bool[iLen1];
        for (int i = 0; i < Matched1.Length;   i)
        {
            Matched1[i] = false;
        }
        bool[] Matched2 = new bool[iLen2];
        for (int i = 0; i < Matched2.Length;   i)
        {
            Matched2[i] = false;
        }

        int iNumCommon = 0;
        for (int i = 0; i < iLen1;   i)
        {
            int iStart = Math.Max(0, i - iSearchRange);
            int iEnd = Math.Min(i   iSearchRange   1, iLen2);
            for (int j = iStart; j < iEnd;   j)
            {
                if (Matched2[j]) continue;
                if (string1[i] != string2[j]) continue;

                Matched1[i] = true;
                Matched2[j] = true;
                  iNumCommon;
                break;
            }
        }
        if (iNumCommon == 0) return 0.0;

        int iNumHalfTransposed = 0;
        int k = 0;
        for (int i = 0; i < iLen1;   i)
        {
            if (!Matched1[i]) continue;
            while (!Matched2[k])
            {
                  k;
            }
            if (string1[i] != string2[k])
            {
                  iNumHalfTransposed;
            }
              k;
            // even though length of Matched1 and Matched2 can be different,
            // number of elements with true flag is the same in both arrays
            // so, k will never go outside the array boundary
        }
        int iNumTransposed = iNumHalfTransposed / 2;

        double dWeight =
            (
                (double)iNumCommon / (double)iLen1  
                (double)iNumCommon / (double)iLen2  
                (double)(iNumCommon - iNumTransposed) / (double)iNumCommon
            ) / 3.0;

        if (dWeight > m_dWeightThreshold)
        {
            int iComparisonLength = Math.Min(m_iNumChars, Math.Min(iLen1, iLen2));
            int iCommonChars = 0;
            while (iCommonChars < iComparisonLength && string1[iCommonChars] == string2[iCommonChars])
            {
                  iCommonChars;
            }
            dWeight = dWeight   0.1 * iCommonChars * (1.0 - dWeight);
        }
        return dWeight;
    }

};