String similarity between list of words

Time:06-22

I have a series, transition, of string sequences, with the elements of each sequence separated by '>'. The last element of each sequence is always the same, e.g.:

0                    b>c>d>a
1                    d>c>c>a
2                    e>e>c>a
3                    d>b>c>a
4                    d>c>c>a

I want to calculate the similarity between each sequence and all other sequences, express that similarity as a percentage, and find the most frequent sequences in the dataset. I know this is a general question, but what is the best approach?

This is what I have tried so far, but it just returns a matrix, not the level of similarity or the most frequent sequences:

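import numpy as np
n = transition.shape[0]
sim = np.zeros((n, n), dtype=int)
seqs = [np.array(s.split('>')) for s in transition]  # split each string into its elements
for i, p1 in enumerate(seqs):
    for j, p2 in enumerate(seqs[i:]):
        # count the positions at which both sequences hold the same element
        sim[i, j + i] = sim[j + i, i] = np.sum(p1 == p2)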

CodePudding user response:

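One possible solution is to use the Levenshtein distance (edit distance), which counts the minimum number of insertions, deletions and substitutions needed to turn one string into another.

In Python, your code would look something like this: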

pip install python-Levenshtein

import Levenshtein
dist = Levenshtein.distance('Levenshtein', 'Lenvinsten')
print(dist)

You will then have to compute the distance for every pair of strings and collect the results in one place, for example in a pivot table (a distance matrix).
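
As a rough sketch of how this could fit together for the data in the question: the snippet below splits each sequence on '>', computes the edit distance between every pair of sequences, and turns it into a percentage. It assumes a plain Python list in place of the pandas series, uses a small pure-Python edit distance instead of the python-Levenshtein package so that it runs without extra installs, and normalises by the length of the longer sequence, which is only one of several reasonable choices. The names levenshtein and similarity_pct are made up for this illustration.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two sequences (lists of tokens or strings)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity_pct(a, b):
    """Similarity from 0 to 100, normalised by the longer sequence (an assumption)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


# Sample data from the question, as a plain list instead of a pandas series.
transition = ["b>c>d>a", "d>c>c>a", "e>e>c>a", "d>b>c>a", "d>c>c>a"]
tokens = [s.split(">") for s in transition]

# Pairwise similarity matrix: sim[i][j] is the similarity of sequence i and j.
sim = [[similarity_pct(p, q) for q in tokens] for p in tokens]

# Most frequent sequence(s) in the dataset, with their counts.
most_common = Counter(transition).most_common(1)

for row in sim:
    print(row)
print(most_common)  # [('d>c>c>a', 2)]
```

Comparing the split lists rather than the raw strings means the '>' separators do not count towards the distance. If the data is already in a pandas series, transition.value_counts() should give the same frequency information as the Counter.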
