I have a series, transition, of string sequences. Within each sequence the strings are separated by '>', and the last element of every sequence is always the same, e.g.:
0 b>c>d>a
1 d>c>c>a
2 e>e>c>a
3 d>b>c>a
4 d>c>c>a
I want to calculate the similarity between each sequence and every other sequence, express that similarity as a percentage, and find the most frequent sequences in the dataset. I know this is general, but what is the best approach?
This is what I have tried so far, but it just returns a matrix, not the level of similarity or the most frequent sequences:
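import numpy as np
n = transition.shape[0]
sim = np.zeros((n, n))  # pairwise match counts
for i, p1 in enumerate(transition):
    for j, p2 in enumerate(transition[i:]):
        # count matching positions; assumes equal-length sequences split on '>'
        sim[i, j + i] = sim[j + i, i] = np.sum(np.array(p1.split('>')) == np.array(p2.split('>')))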
CodePudding user response:
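One possible solution is to use the Levenshtein distance (edit distance): the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. Install the package first:
pip install python-Levenshtein
Then, in Python, your code would look something like this: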
import Levenshtein
dist = Levenshtein.distance('Levenshtein', 'Lenvinsten')
print(dist)
You'll then have to build a pivot table (a pairwise distance matrix) to collect the distances between all of your strings in one place.
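As a rough sketch of that last step, something like the following could work. It uses only the standard library, so edit_distance and similarity_pct are hypothetical helpers written for this example, not part of python-Levenshtein. It also compares whole elements ('b', 'c', ...) rather than single characters, and it assumes that "level of similarity" means 100 * (1 - distance / longer length), which is just one reasonable normalisation:

```python
from collections import Counter

# Sample data copied from the question.
transition = ["b>c>d>a", "d>c>c>a", "e>e>c>a", "d>b>c>a", "d>c>c>a"]


def edit_distance(a, b):
    """Levenshtein distance between two token lists (hypothetical helper)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity_pct(s1, s2, sep=">"):
    """Similarity in percent, assuming 100 * (1 - distance / longer length)."""
    a, b = s1.split(sep), s2.split(sep)
    return 100.0 * (1 - edit_distance(a, b) / max(len(a), len(b)))


# Pairwise similarity matrix: sim[i][j] compares sequence i with sequence j.
sim = [[similarity_pct(p, q) for q in transition] for p in transition]

# Most frequent sequences, as (sequence, count) pairs.
most_common = Counter(transition).most_common(3)

for row in sim:
    print(row)
print(most_common)
```

With the sample data, sequences 1 and 4 are identical and so score 100%, and 'd>c>c>a' comes out as the most frequent sequence with a count of 2. If you would rather keep python-Levenshtein, the nested list comprehension stays the same and only the distance function changes.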