I want to search for duplicates in my database, but they could be near-matches like:
"The smallest thing, and nothing more"
"The Smallest Things, And Nothing More"
"The smallest thing, and nothing more."
"The smallest thing, and nothing"
Is there an easy way to design a fuzzy == function that gives a weight of matching, instead of a binary true/false result?
CodePudding user response:
Ruby ships with a library called did_you_mean. It is used to suggest corrections when you make a mistake in your code: for example, "abc".downcsae will ask you "Did you mean? downcase".
This library includes a module called DidYouMean::Levenshtein, which has a method called distance. This distance is the minimum number of single-character edits (insertions, deletions, or substitutions) required to make the 2 strings equal.
Example:
s = "The smallest thing, and nothing more"
x = "The Smallest Things, And Nothing More"
DidYouMean::Levenshtein.distance(s, x)
#=> 6
DidYouMean::Levenshtein.distance(s.downcase, x.downcase)
#=> 1
This may be useful in your case although you would need to determine the threshold.
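Since the question asks for a weight rather than true/false, one option is to scale the distance by the length of the longer string. The following is only a sketch: the similarity method is a hypothetical helper, not part of did_you_mean, and dividing by the longer length is just one common normalization. It returns 1.0 for identical strings and 0.0 when every character differs.

```ruby
require "did_you_mean"

# Hypothetical helper (not part of did_you_mean): maps the edit distance
# to a score between 0.0 (nothing in common) and 1.0 (identical).
def similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero? # two empty strings are identical

  1.0 - DidYouMean::Levenshtein.distance(a, b).to_f / longest
end

s = "The smallest thing, and nothing more"
x = "The Smallest Things, And Nothing More"

similarity(s, x).round(2)
#=> 0.84
similarity(s.downcase, x.downcase).round(2)
#=> 0.97
```

A score like this is easier to threshold than a raw distance, because the same cutoff works for short and long strings.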
An implementation is also available via the Gem::Text module, which you could mix into a class if needed, e.g.
require "rubygems/text"

class MyClass
  extend Gem::Text

  # True when the edit distance between x and y is within the threshold.
  def self.fuzzy_equal?(x:, y:, threshold: 3)
    levenshtein_distance(x, y) <= threshold
  end
end
MyClass.fuzzy_equal?(x: s, y: x)
#=> false
MyClass.fuzzy_equal?(x: s.downcase, y: x.downcase)
#=> true
MyClass.fuzzy_equal?(x: s, y: x, threshold: 10)
#=> true
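For the duplicate search itself, it may help to normalize case and punctuation before measuring the distance, so that only real wording differences count. This is a rough sketch under assumptions: normalize and near_duplicates are hypothetical names, the threshold of 3 is arbitrary, and the titles are assumed to be already loaded from the database into an array.

```ruby
require "did_you_mean"

# Hypothetical normalizer: lowercases, drops punctuation, collapses whitespace.
def normalize(text)
  text.downcase.gsub(/[^[:alnum:][:space:]]/, "").split.join(" ")
end

# Hypothetical helper: returns [a, b, distance] for every pair of titles
# whose normalized forms are within the given edit distance.
def near_duplicates(titles, threshold: 3)
  pairs = titles.combination(2).map do |a, b|
    distance = DidYouMean::Levenshtein.distance(normalize(a), normalize(b))
    [a, b, distance] if distance <= threshold
  end
  pairs.compact
end

titles = [
  "The smallest thing, and nothing more",
  "The Smallest Things, And Nothing More",
  "The smallest thing, and nothing more.",
  "The smallest thing, and nothing"
]

near_duplicates(titles).map(&:last)
#=> [1, 0, 1]
```

With this threshold the first three titles are paired with each other, while the truncated fourth title (distance 5 or 6 from the others) is not. Comparing every pair grows quadratically with the number of rows, so for a large table you would likely want to narrow the candidates first.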