I was wondering if anyone here could come up with some solution for me that would take 2 strings and compare then word by word to give me a percentage match of the entire string.
Example: If I wanted to compare these 2 strings
- The Elf on the Shelf: A Christmas Musical (Touring)
- The Elf on the Shelf Musical Baltimore
In SQL if I do a like
comparison it won't match.
But if I can break up each word and say if it matches we would see that 6 of the 7 words match from string 2 to string 1. And then could say 85% match
Thanks!
CodePudding user response:
You need to calculate the similarity between two strings. There are many algorithms you can accomplish this with. Let's try out the Levenshtein distance and the Longest Common Subsequence; each with their own advantages.
-- Sample strings
DECLARE
@string1 VARCHAR(100) = 'The Elf on the Shelf: A Christmas Musical (Touring)',
@string2 VARCHAR(100) = 'The Elf on the Shelf Musical Baltimore';
--uncomment to test:
--SELECT @string1 = 'Their', @string2 = 'Theirs'
-- Longest Common Subsequence Solution
SELECT Similarity = 1.*LEN(dbo.LongestCommonSubsequence(@string1,@string2))/L2
FROM
(
SELECT MIN(f.S), MAX(f.S)
FROM (VALUES(LEN(@string1)),(LEN(@string2))) AS f(S)
) f(L1,L2);
-- Levenshtein
SELECT Similarity = (1.*L1-Lev)/L2
FROM
(
SELECT MIN(f.S), MAX(f.S), dbo.LEVENSHTEIN(@string1,@string2)
FROM (VALUES(LEN(@string1)),(LEN(@string2))) AS f(S)
) f(L1,L2,Lev);
Each return:
Similarity
---------------------------------------
0.62745098039
Similarity
---------------------------------------
0.31372549019
For "their" and "theirs" you get:
Similarity
---------------------------------------
0.83333333333
Similarity
---------------------------------------
0.66666666666
CodePudding user response:
Here's one possible solution for getting the percentage of matching words, this assumes the words match between the two strings irrespective of position.
I appreciate it may not be exactly what you are after and doesn't do "similar" words, but hopefully meets the criteria for a percentage match. There's plenty of scope to tweak if it's not exactly what's needed.
Here the two strings are broken up into rows and merged into a single table, after stripping out common punctuation. Then a row_number window function partitions them by matching words and counts each pair. Finally this is counted for just the matching pairs and summed with a count of the repeating words common in both, then as a percental of the shorter string.
declare
@s1 varchar(100)='The Elf on the Shelf: A Christmas Musical. (Touring)',
@s2 varchar(100)='The Elf on the Shelf Musical, Baltimore';
with words as (
select 1 s, Replace(Translate(value,'!"*():;,.','|||||||||'),'|','') word
from String_Split(@s1,' ')
union all
select 2, Replace(Translate(value,'!"*():;,.','|||||||||'),'|','')
from String_Split(@s2,' ')
), matching as (
select *, Row_Number() over(partition by word order by s) rn
from words
), final as (
select * , Count(*) over(partition by word, s) repeating, Count(*) over() * 1.0 totwords, sum(Iif(s=1,1,0)) over() s1words
from matching
outer apply(values(Iif(rn=2 and rn=s,1,0)))x(p)
)
select (Sum (p) max(case when s=1 and repeating>1 then repeating end))
/ Max(Iif(totwords/s1words>0.5, totwords-s1words, s1words)) * 100 [Matching Words %]
from final
Here, 6 words in each string match, so it results in the desired 6 being 85.7% of the shorter 7-word string.
CodePudding user response:
Maybe create a function to do the job?
Public Function fCompareStrings(strSource As String, strSearch As String) As Single
Dim intCounter As Integer
Dim strTest As String
Dim intScore As Integer
Const conWORD_LENGTH As Integer = 3
For intCounter = 1 To Len(strSource) - conWORD_LENGTH
strTest = Mid(strSource, intCounter, conWORD_LENGTH)
If InStr(1, strSearch, strTest, vbTextCompare) > 0 Then
intScore = intScore 1
End If
Next
fCompareStrings = 100 * (intScore / (Len(strSource) - conWORD_LENGTH))
End Function