Remove Pandas rows that contain approximate string matches

I want to remove rows from a Pandas dataframe that have the same value in column A and are an approximate match in column B (by edit distance). For example:

Index	A	B
0	'Apple'	'bicycle'
1	'Apple'	'gigantic bicycle'
2	'Apple'	'a bicycle'
3	'Peach'	'bicycle'
4	'Apple'	'~bicycle**'
5	'Peach'	'airplane'
6	'Apple'	'car'
7	'Apple'	'cars'

In this case I want the resulting data frame to be:

Index	A	B
0	'Apple'	'bicycle'
1	'Apple'	'gigantic bicycle'
3	'Peach'	'bicycle'
5	'Peach'	'airplane'
6	'Apple'	'car'

Rows 0 and 3 both remain because their values in column A differ, and row 1 remains because 'gigantic bicycle' is different enough from 'bicycle'. Rows 2, 4, and 7 are removed because each is a near match of an earlier row with the same value in column A. I want to remove matches within an edit distance of, say, 3.

I found that get_close_matches() in the difflib library has a decent matching system, but it doesn't seem to compare by edit distance, so it's not quite what I need.

CodePudding user response:

I'm not sure exactly which type of edit distance you mean, but fuzzywuzzy might have some functions that could help. Personally, I found this article pretty helpful for understanding the different distance functions fuzzywuzzy has.
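To make the difference between ratio-style scores and edit distance concrete, here is a minimal sketch using only the standard library's difflib.SequenceMatcher, which is what get_close_matches() relies on; as far as I know, fuzzywuzzy's basic ratio is a similar kind of score. The helper name `similarity` is made up for illustration, and the sample strings are taken from the question.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


# Compare each candidate from the question against 'bicycle'.
for candidate in ["a bicycle", "~bicycle**", "gigantic bicycle", "car"]:
    print(candidate, round(similarity("bicycle", candidate), 2))
```

A ratio like this is relative to the lengths of the strings, so a fixed cutoff does not translate directly into "at most 3 edits".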

CodePudding user response:

I would suggest looking into Levenshtein distance. I believe there is a Python library for calculating it:

https://pypi.org/project/python-Levenshtein/
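A rough sketch of how this could work, assuming a greedy rule: walk the rows in order and keep a row only if its B value is more than 3 edits away from every B value already kept for the same A. The `levenshtein` function below is a plain-Python stand-in so the example runs without extra packages, and `keep_indices` and `max_distance` are names invented for illustration, not part of any library.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def keep_indices(rows, max_distance=3):
    """rows yields (index, a, b) tuples; returns the indices of the rows to keep.

    Greedy and order-dependent: the first row of each near-duplicate cluster wins.
    """
    kept_by_group = defaultdict(list)  # A value -> B values kept so far
    keep = []
    for idx, a, b in rows:
        if all(levenshtein(b, seen) > max_distance for seen in kept_by_group[a]):
            kept_by_group[a].append(b)
            keep.append(idx)
    return keep


# The sample data from the question as (index, A, B) tuples.
rows = [
    (0, "Apple", "bicycle"),
    (1, "Apple", "gigantic bicycle"),
    (2, "Apple", "a bicycle"),
    (3, "Peach", "bicycle"),
    (4, "Apple", "~bicycle**"),
    (5, "Peach", "airplane"),
    (6, "Apple", "car"),
    (7, "Apple", "cars"),
]
print(keep_indices(rows))  # [0, 1, 3, 5, 6]
```

With a dataframe df, something like df.loc[keep_indices(df[["A", "B"]].itertuples(name=None))] should select the surviving rows. Every row is compared against all kept rows in its group, so this is quadratic per group in the worst case; swapping in Levenshtein.distance from the python-Levenshtein package for the pure-Python function should speed up the individual comparisons.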