I want to remove rows from a pandas DataFrame that have the same value in column A and are an approximate match in column B (by edit distance). For example:
Index | A | B |
---|---|---|
0 | 'Apple' | 'bicycle' |
1 | 'Apple' | 'gigantic bicycle' |
2 | 'Apple' | 'a bicycle' |
3 | 'Peach' | 'bicycle' |
4 | 'Apple' | '~bicycle**' |
5 | 'Peach' | 'airplane' |
6 | 'Apple' | 'car' |
7 | 'Apple' | 'cars' |
In this case I want the resulting data frame to be:
Index | A | B |
---|---|---|
0 | 'Apple' | 'bicycle' |
1 | 'Apple' | 'gigantic bicycle' |
3 | 'Peach' | 'bicycle' |
5 | 'Peach' | 'airplane' |
6 | 'Apple' | 'car' |
Rows 0 and 3 both remain because their values in column A differ, and row 1 remains because 'gigantic bicycle' is different enough from 'bicycle'. I want to remove matches within an edit distance of, say, 3 of a row that is already kept.
I found that get_close_matches() in the difflib library has a decent matching system, but it doesn't seem to compare by edit distance, so it's not quite what I need.
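To illustrate the mismatch, here is a small sketch using only the standard library. It prints the similarity ratio that difflib computes for a few of the strings from the table. The `cutoff=0.8` value is an arbitrary choice for the demonstration, not something that corresponds to an edit distance of 3.

```python
from difflib import SequenceMatcher, get_close_matches

candidates = ["a bicycle", "gigantic bicycle", "~bicycle**", "car"]

# difflib scores similarity as a ratio in [0, 1] (2 * matched chars / total chars),
# not as a count of single-character edits.
ratios = {
    text: SequenceMatcher(None, text, "bicycle").ratio()
    for text in candidates
}
for text, ratio in ratios.items():
    print(f"{text!r}: ratio={ratio:.3f}")

# get_close_matches keeps candidates whose ratio is at least `cutoff`.
# The cutoff of 0.8 is an assumed value chosen only for this demo.
close = get_close_matches("bicycle", candidates, n=len(candidates), cutoff=0.8)
print(close)
```

Because the ratio depends on string length, a fixed cutoff does not translate into a fixed number of edits, which is why this doesn't fit the requirement above.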
CodePudding user response:
I'm not sure exactly what type of edit distance you're talking about, but fuzzywuzzy might have some functions that could help you. Personally I found this article to be pretty helpful in understanding the different distance functions fuzzywuzzy has.
CodePudding user response:
I would suggest looking into using Levenshtein distance. I believe there is a Python library for calculating this.
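As a rough sketch of how that could look, the code below implements Levenshtein distance in plain Python (so no extra package is needed) and builds a keep/drop mask per value of column A. The function names `levenshtein` and `near_duplicate_keep_mask` are made up for this example, and the rule "drop a row if it is within `max_dist` edits of an earlier kept row with the same A value, inclusive" is my assumption about what the question wants.

```python
def levenshtein(s, t):
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(s) < len(t):
        s, t = t, s
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def near_duplicate_keep_mask(a_values, b_values, max_dist=3):
    """Return a list of booleans: True for rows to keep (hypothetical helper).

    Assumption: a row is dropped when an earlier kept row with the same A
    value has a B value within max_dist edits (inclusive).
    """
    kept = {}  # A value -> list of B strings already kept
    mask = []
    for a, b in zip(a_values, b_values):
        seen = kept.setdefault(a, [])
        is_dup = any(levenshtein(b, other) <= max_dist for other in seen)
        mask.append(not is_dup)
        if not is_dup:
            seen.append(b)
    return mask


# The rows from the question, as (A, B) pairs in index order.
rows = [
    ("Apple", "bicycle"), ("Apple", "gigantic bicycle"), ("Apple", "a bicycle"),
    ("Peach", "bicycle"), ("Apple", "~bicycle**"), ("Peach", "airplane"),
    ("Apple", "car"), ("Apple", "cars"),
]
mask = near_duplicate_keep_mask([a for a, _ in rows], [b for _, b in rows])
kept_indices = [i for i, keep in enumerate(mask) if keep]
print(kept_indices)  # [0, 1, 3, 5, 6]
```

With a DataFrame `df`, something like `df[near_duplicate_keep_mask(df["A"], df["B"])]` should select the kept rows. Note that this compares each row against every kept row in its group, so it is quadratic in group size; for large data a compiled distance function from a dedicated library would likely be faster.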