I need to implement a function to look for similar names. It's for the development of a new system that has migrated the data of a previous system. One of the features would be that on account creation, it would try and look for person records that are there with a similar name and propose them. Examples for similar names for "John Johnson" could be:
- John Johnson
- Jon Jonsen
- John & Jane Johnson-Peters
- Fam. Johnsen
- J. Johnson
To achieve this I've created some SQL functions and functional wise they work:
[dbo].[Levenshtein]: A copy of the top rated answer from It takes 12-13s to find this record based on my mispelled name, which off course is too long. The full table is only 512 records at the moment.
Any help on getting this more performant? I know looping isn't recomended in SQL, so probably something to gain there. I'm not a DBA or SQL specialist, no idea how to write that differently without the loops. Didn't think I could use a join as there's no equality.
CodePudding user response:
After implementing the suggestions in the comments on the OP, I managed to cut down the same SELECT statement from 12-13s to about 1s, which is a lot more acceptable.
The SplitString has been changed to an inline function:
Create FUNCTION [dbo].[SplitString]( @s nvarchar(4000) ) RETURNS TABLE WITH SCHEMABINDING AS RETURN( SELECT TRIM(value) AS value FROM STRING_SPLIT(REPLACE(REPLACE(REPLACE(REPLACE(@s, '/', '-'),'&', '-'),' ', '-'),'\', '-'), '-', NULL) WHERE RTRIM(value) <> '' );
Cutting down on the variables and using a Join statement for the IsSimilar function gives a boost as well:
CREATE FUNCTION [dbo].[IsSimilar]( @s nvarchar(4000) , @t nvarchar(4000) ) RETURNS BIT AS BEGIN IF (@s = 'fam' OR @s = 'fam.' OR @t = 'fam' OR @t = 'fam.') return 1 IF (LEN(TRIM(@s)) > 0 AND LEN(TRIM(@t)) > 0) BEGIN RETURN (SELECT IIF (EXISTS(SELECT [sT].[value] FROM (SELECT [value] FROM [dbo].[SplitString](@s)) AS sT INNER JOIN (SELECT [value] FROM [dbo].[SplitString](@t)) AS tT ON (((LEN([sT].[value]) = 1 OR LEN([tT].[value]) = 1 OR SUBSTRING([sT].[value], 2, 1) IN ('.', ' ') OR SUBSTRING([tT].[value], 2, 1) IN ('.', ' ')) AND SUBSTRING([sT].[value], 1, 1) = SUBSTRING([tT].[value], 1, 1)) OR (NOT(SUBSTRING([sT].[value], 2, 1) IN ('.', ' ') OR SUBSTRING([tT].[value], 2, 1) IN ('.', ' ')) AND (SELECT [dbo].[Levenshtein]([sT].[value],[tT].[value], 5)) IS NOT NULL)) ), 1, 0)) END RETURN 0 END
I don't know how much this boost will hold up to real big data, but in this case that'll not be the case as Person records get linked to Account records with every new account creation and only Person records with AccountID = NULL will be considered.