Home > OS >  How to make this IsSimilar(varchar,varchar) function more performant
How to make this IsSimilar(varchar,varchar) function more performant

Time:02-11

I need to implement a function to look for similar names. It's for the development of a new system that has migrated the data of a previous system. One of the features would be that on account creation, it would try and look for person records that are there with a similar name and propose them. Examples for similar names for "John Johnson" could be:

  • John Johnson
  • Jon Jonsen
  • John & Jane Johnson-Peters
  • Fam. Johnsen
  • J. Johnson

To achieve this I've created some SQL functions and functional wise they work:

  • [dbo].[Levenshtein]: A copy of the top rated answer from enter image description here It takes 12-13s to find this record based on my mispelled name, which off course is too long. The full table is only 512 records at the moment.

    Any help on getting this more performant? I know looping isn't recomended in SQL, so probably something to gain there. I'm not a DBA or SQL specialist, no idea how to write that differently without the loops. Didn't think I could use a join as there's no equality.

    CodePudding user response:

    After implementing the suggestions in the comments on the OP, I managed to cut down the same SELECT statement from 12-13s to about 1s, which is a lot more acceptable.

    The SplitString has been changed to an inline function:

    Create FUNCTION [dbo].[SplitString](
        @s nvarchar(4000)
    )
    RETURNS TABLE
    WITH SCHEMABINDING
    AS
    RETURN(
        SELECT TRIM(value) AS value FROM STRING_SPLIT(REPLACE(REPLACE(REPLACE(REPLACE(@s, '/', '-'),'&', '-'),' ', '-'),'\', '-'), '-', NULL)  
        WHERE RTRIM(value) <> ''
    );
    

    Cutting down on the variables and using a Join statement for the IsSimilar function gives a boost as well:

    CREATE FUNCTION [dbo].[IsSimilar](
        @s nvarchar(4000)
        , @t nvarchar(4000)
    )
    RETURNS BIT
    AS
    BEGIN
    
        IF (@s = 'fam' OR @s = 'fam.' OR @t = 'fam' OR @t = 'fam.')
            return 1
    
        IF (LEN(TRIM(@s)) > 0 AND LEN(TRIM(@t)) > 0)
        BEGIN
            RETURN (SELECT IIF (EXISTS(SELECT [sT].[value] FROM (SELECT [value] FROM [dbo].[SplitString](@s)) AS sT INNER JOIN (SELECT [value] FROM [dbo].[SplitString](@t)) AS tT ON (((LEN([sT].[value]) = 1 OR LEN([tT].[value]) = 1 OR SUBSTRING([sT].[value], 2, 1) IN ('.', ' ') OR SUBSTRING([tT].[value], 2, 1) IN ('.', ' ')) AND SUBSTRING([sT].[value], 1, 1) = SUBSTRING([tT].[value], 1, 1)) OR (NOT(SUBSTRING([sT].[value], 2, 1) IN ('.', ' ') OR SUBSTRING([tT].[value], 2, 1) IN ('.', ' ')) AND (SELECT [dbo].[Levenshtein]([sT].[value],[tT].[value], 5)) IS NOT NULL)) ), 1, 0))
        END
        RETURN 0
    END
    

    I don't know how much this boost will hold up to real big data, but in this case that'll not be the case as Person records get linked to Account records with every new account creation and only Person records with AccountID = NULL will be considered.

  • Related