I have a dataframe (more than 1 million rows) with an open text column where customers can write whatever they want. Misspelled words appear frequently, and I'm trying to group comments that say essentially the same thing.
For example:
| ID | Comment |
|----|---------|
| 1 | I want to change my credit card |
| 2 | I wannt change my creditt card |
| 3 | I want change credit caurd |
I have tried using Levenshtein distance, but it is computationally very expensive. Can you suggest another way to do this task?
Thanks!
CodePudding user response:
Computing the Levenshtein distance between two strings of length N has time complexity O(N²).
If you define a maximum distance you are interested in, say m, you can reduce that to O(N·m) by only filling the cells of the DP matrix that lie within m of the diagonal. The maximum distance, in your context, is the maximum number of typos you accept while still considering two comments identical (see the sketch below).
If you cannot do that, you may try to parallelize the task.
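A minimal sketch of that banded computation, as my own illustration in Python (the function name and the example strings are placeholders, not part of the original answer):

```python
def levenshtein_within(a: str, b: str, max_dist: int):
    """Banded Levenshtein: return the distance between a and b if it is
    <= max_dist, otherwise None. Only cells within max_dist of the matrix
    diagonal are filled, so the cost is O(N * max_dist) instead of O(N^2)."""
    if abs(len(a) - len(b)) > max_dist:
        return None                      # length gap alone exceeds the budget
    if len(a) > len(b):
        a, b = b, a                      # keep a as the shorter string

    prev = list(range(len(a) + 1))       # DP column for the empty prefix of b
    for j in range(1, len(b) + 1):
        lo = max(1, j - max_dist)        # band limits for this column
        hi = min(len(a), j + max_dist)
        curr = [j] + [max_dist + 1] * len(a)   # cells outside the band act as "infinity"
        for i in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[i] = min(prev[i] + 1,          # deletion
                          curr[i - 1] + 1,      # insertion
                          prev[i - 1] + cost)   # substitution / match
        if min(curr[lo:hi + 1]) > max_dist:
            return None                  # every path already exceeds max_dist
        prev = curr

    return prev[len(a)] if prev[len(a)] <= max_dist else None


# Typo-level differences stay under the cut-off, unrelated text does not:
print(levenshtein_within("credit card", "creditt caurd", max_dist=2))  # -> 2
print(levenshtein_within("credit card", "change card", max_dist=2))    # -> None
```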
CodePudding user response:
This is not a trivial task. If faced with this problem, my approach would be:
- Tokenise your sentences. There are many ways to tokenise a sentence; the most straightforward is to convert the sentence to a list of words, e.g. `I want to change my credit card` becomes `[I, want, to, change, my, credit, card]`. Another way is to roll a window of size n across your sentence, e.g. `I want to` becomes `['I w', ' wa', 'wan', 'ant', ...]` for window size 3.
- After tokenising, create an embedding (vectorising), i.e. convert your tokens to a vector of numbers. The simplest way is to use a ready-made library like sklearn's TfidfVectorizer. If your data cares about the order of the words, then a more sophisticated vectoriser is needed.
- After vectorising, use a clustering algorithm. The simplest one is K-Means. (A short sketch of this pipeline follows the list.)
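As a rough sketch of that pipeline, assuming character 3-grams and K-Means (the toy comments, the 3-character window, and n_clusters=2 are placeholder choices you would tune):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

comments = [
    "I want to change my credit card",
    "I wannt change my creditt card",
    "I want change credit caurd",
    "How do I close my account",
    "I want to close my acount",
]

# Character 3-grams ("windows of size 3") are tolerant to typos:
# "credit" and "creditt" still share most of their 3-grams.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vectorizer.fit_transform(comments)

# K-Means is the simplest choice; for ~1 million rows you would likely
# switch to MiniBatchKMeans. n_clusters=2 is just a guess for this toy data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for label, comment in sorted(zip(labels, comments)):
    print(label, comment)
```

With these settings the three "credit card" variants should land in one cluster and the two "close my account" variants in the other, despite the misspellings.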
Of course, this is a complicated task and there are many ways to approach it. What I described is the simplest out-of-the-box solution; some clever people have used different strategies to get better results, for example https://www.youtube.com/watch?v=nlKE4gvJjMo. You will need to do your own research in this field.
Edit: of course your approach works for a small dataset, but the difficult part lies in doing better than O(n²) complexity.