Remove similar strings in a Python list by Levenshtein ratio

I am trying to remove similar words from a Python list by calculating the Levenshtein ratio between them. If the ratio is > 0.9, the word should be removed. So I need a loop that iterates through the list, compares each word with every other word, and removes a word if the ratio is too high. Unfortunately, I haven't figured out how to implement that in Python, since I cannot remove items from a list while iterating over it. My first naive idea was something like the code below, which doesn't work because it skips half the elements. Any help would be appreciated.

word_list = ['asf', 'bcd', 'ase', 'cdf', 'asl']

The goal here is to remove the elements 'ase' and 'asl' because they are similar to 'asf'. My first attempt looked like this:

# Broken attempt: mutates word_list while looping over it
for word in word_list:
    for x in word_list:
        if Levenshtein.ratio(word, x) >= 0.9:
            word_list.remove(word)
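
The skipping described above comes from removing items from the list that the for loop is currently walking: the loop's position keeps advancing while the remaining items shift left. A minimal sketch that reproduces the effect without any similarity library (the lists `items` and `fresh` are illustrative only):

```python
# Removing from a list while iterating over it skips elements,
# because the loop position advances while later items shift left.
items = ['a', 'b', 'c', 'd']
visited = []
for item in items:
    visited.append(item)
    items.remove(item)  # mutates the list being iterated

print(visited)  # ['a', 'c']
print(items)    # ['b', 'd']

# Iterating over a copy visits every element.
fresh = ['a', 'b', 'c', 'd']
visited_all = []
for item in list(fresh):
    visited_all.append(item)
    fresh.remove(item)

print(visited_all)  # ['a', 'b', 'c', 'd']
print(fresh)        # []
```

Note that the attempt above has a second problem: every word is also compared with itself, which gives a ratio of 1.0.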

CodePudding user response:

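Using difflib, you can find the strings in the list that are similar to a given word and then remove them. Note that difflib computes its own similarity ratio rather than the Levenshtein ratio, and get_close_matches defaults to a cutoff of 0.6 and at most 3 matches.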

import difflib

word_list = ['asf', 'bcd', 'ase', 'cdf', 'asl']

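# Find the words similar to 'asf'
close_match = difflib.get_close_matches('asf', word_list)
# 'asf' matches itself, so take it out of the matches to keep it in the result
close_match.remove('asf')

# Keep every word that is not a close match of 'asf'
result = [word for word in word_list if word not in close_match]
print(result)

Output: ['asf', 'bcd', 'cdf']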
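
The snippet above only removes the neighbours of the single word 'asf'. A minimal sketch of extending the same idea to the whole list follows. The helper name `drop_close_matches` and the sample words are hypothetical, and the similarity score is difflib's `SequenceMatcher` ratio, so cutoffs are not interchangeable with Levenshtein ratios.

```python
import difflib


def drop_close_matches(words, cutoff=0.9):
    """Keep each word unless it closely matches a word that was already kept.

    Hypothetical helper: similarity is difflib's SequenceMatcher ratio,
    not the Levenshtein ratio.
    """
    kept = []
    for word in words:
        # n=1: a single close match among the kept words is enough to drop the word.
        if not difflib.get_close_matches(word, kept, n=1, cutoff=cutoff):
            kept.append(word)
    return kept


words = ['apple', 'apples', 'banana', 'bananas', 'cherry']
print(drop_close_matches(words))  # ['apple', 'banana', 'cherry']
```

On the question's sample list this helper removes nothing at `cutoff=0.9`, since 'asf' and 'ase' share only two of three characters (a `SequenceMatcher` ratio of about 0.67), so a lower cutoff would be needed there.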

CodePudding user response:

You could try something like below:

import Levenshtein

# Output list
output = []

for word in word_list:
    # Keep a word only if its ratio to every word already kept is below 0.9
    if all(Levenshtein.ratio(word, kept) < 0.9 for kept in output):
        output.append(word)
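
To see what a threshold of 0.9 means for three-letter words, the build-a-new-list idea can be tried without a third-party package. The sketch below uses a plain edit distance and a hypothetical `similarity` helper normalized by the longer word's length. That normalization is an assumption for illustration and is not guaranteed to equal what `Levenshtein.ratio` returns.

```python
def edit_distance(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute all cost 1)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # delete ch_a
                current[j - 1] + 1,                # insert ch_b
                previous[j - 1] + (ch_a != ch_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical normalization: 1 minus edit distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def remove_similar(words, threshold):
    """Keep a word only if it scores below the threshold against every kept word."""
    kept = []
    for word in words:
        if all(similarity(word, other) < threshold for other in kept):
            kept.append(word)
    return kept


word_list = ['asf', 'bcd', 'ase', 'cdf', 'asl']
print(round(similarity('asf', 'ase'), 2))  # 0.67
print(remove_similar(word_list, 0.9))      # nothing is removed at this threshold
print(remove_similar(word_list, 0.6))      # ['asf', 'bcd', 'cdf']
```

With this measure, words that differ in one of three characters score about 0.67, so the desired result for the sample list only appears once the threshold is lowered below that value.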