Home > Enterprise >  String matching and ranking logic
String matching and ranking logic

Time:10-07

I was wondering if anyone here could come up with some solution for me that would take 2 strings and compare then word by word to give me a percentage match of the entire string.

Example: If I wanted to compare these 2 strings

  1. The Elf on the Shelf: A Christmas Musical (Touring)
  2. The Elf on the Shelf Musical Baltimore

In SQL if I do a like comparison it won't match. But if I can break up each word and say if it matches we would see that 6 of the 7 words match from string 2 to string 1. And then could say 85% match

Thanks!

CodePudding user response:

You need to calculate the similarity between two strings. There are many algorithms you can accomplish this with. Let's try out the Levenshtein distance and the Longest Common Subsequence; each with their own advantages.

-- Sample strings
DECLARE 
 @string1 VARCHAR(100) = 'The Elf on the Shelf: A Christmas Musical (Touring)',
 @string2 VARCHAR(100) = 'The Elf on the Shelf Musical Baltimore';

--uncomment to test:
--SELECT @string1 = 'Their', @string2 = 'Theirs'

-- Longest Common Subsequence Solution
SELECT Similarity = 1.*LEN(dbo.LongestCommonSubsequence(@string1,@string2))/L2
FROM
(
  SELECT MIN(f.S), MAX(f.S)
  FROM  (VALUES(LEN(@string1)),(LEN(@string2))) AS f(S)
) f(L1,L2);

-- Levenshtein
SELECT Similarity = (1.*L1-Lev)/L2
FROM
(
  SELECT MIN(f.S), MAX(f.S), dbo.LEVENSHTEIN(@string1,@string2)
  FROM  (VALUES(LEN(@string1)),(LEN(@string2))) AS f(S)
) f(L1,L2,Lev);

Each return:

Similarity
---------------------------------------
0.62745098039


Similarity
---------------------------------------
0.31372549019

For "their" and "theirs" you get:

Similarity
---------------------------------------
0.83333333333

Similarity
---------------------------------------
0.66666666666

CodePudding user response:

Here's one possible solution for getting the percentage of matching words, this assumes the words match between the two strings irrespective of position.

I appreciate it may not be exactly what you are after and doesn't do "similar" words, but hopefully meets the criteria for a percentage match. There's plenty of scope to tweak if it's not exactly what's needed.

Here the two strings are broken up into rows and merged into a single table, after stripping out common punctuation. Then a row_number window function partitions them by matching words and counts each pair. Finally this is counted for just the matching pairs and summed with a count of the repeating words common in both, then as a percental of the shorter string.

declare 
    @s1 varchar(100)='The Elf on the Shelf: A Christmas Musical. (Touring)',
    @s2 varchar(100)='The Elf on the Shelf Musical, Baltimore';

with words as (
  select 1 s, Replace(Translate(value,'!"*():;,.','|||||||||'),'|','') word
  from String_Split(@s1,' ')
  union all
  select 2, Replace(Translate(value,'!"*():;,.','|||||||||'),'|','')
  from String_Split(@s2,' ')
), matching as (
    select *, Row_Number() over(partition by word order by s) rn
    from words
), final as (
    select * , Count(*) over(partition by word, s) repeating, Count(*) over() * 1.0 totwords, sum(Iif(s=1,1,0)) over() s1words
    from matching
    outer apply(values(Iif(rn=2 and rn=s,1,0)))x(p)
)
select (Sum (p)   max(case when s=1 and repeating>1 then repeating end))
/ Max(Iif(totwords/s1words>0.5, totwords-s1words, s1words)) * 100 [Matching Words %]
from final

Here, 6 words in each string match, so it results in the desired 6 being 85.7% of the shorter 7-word string.

Example DB<>Fiddle

CodePudding user response:

Maybe create a function to do the job?

 Public Function fCompareStrings(strSource As String, strSearch As String) As Single
Dim intCounter As Integer
Dim strTest As String
Dim intScore As Integer
Const conWORD_LENGTH As Integer = 3
 
For intCounter = 1 To Len(strSource) - conWORD_LENGTH
  strTest = Mid(strSource, intCounter, conWORD_LENGTH)
    If InStr(1, strSearch, strTest, vbTextCompare) > 0 Then
      intScore = intScore   1
    End If
Next
 
fCompareStrings = 100 * (intScore / (Len(strSource) - conWORD_LENGTH))
End Function
  • Related