Remove Pandas rows that contain approximate string matches

Time:06-29

I want to remove rows from a Pandas DataFrame that have the same value in column A and are an approximate match in column B (by edit distance). For example:

Index A B
0 'Apple' 'bicycle'
1 'Apple' 'gigantic bicycle'
2 'Apple' 'a bicycle'
3 'Peach' 'bicycle'
4 'Apple' '~bicycle**'
5 'Peach' 'airplane'
6 'Apple' 'car'
7 'Apple' 'cars'

In this case I want the resulting data frame to be:

Index A B
0 'Apple' 'bicycle'
1 'Apple' 'gigantic bicycle'
3 'Peach' 'bicycle'
5 'Peach' 'airplane'
6 'Apple' 'car'

Rows 0 and 3 both remain because their values in column A differ, and row 1 remains because 'gigantic bicycle' is different enough from 'bicycle'. We want to remove matches within an edit distance of, say, 3 or less (row 4, '~bicycle**', is exactly 3 edits away from 'bicycle' and should be removed).

I found that get_close_matches() in the difflib library has a decent matching system, but it doesn't seem to compare by edit distance, so it's not quite what I need.
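
To illustrate the point about difflib, here is a small sketch using only the standard library. get_close_matches() filters on SequenceMatcher's ratio, a similarity score between 0 and 1 that is relative to string length, not a count of edits. The sample strings come from the table above; the variable names are only for this example.

```python
from difflib import SequenceMatcher, get_close_matches

candidates = ["gigantic bicycle", "a bicycle", "~bicycle**", "car"]

# get_close_matches keeps candidates whose similarity ratio is >= cutoff
# (0.6 is the library default) and returns at most n of them, best first.
matches = get_close_matches("bicycle", candidates, n=len(candidates), cutoff=0.6)
print(matches)

# The ratio is 2 * (matched characters) / (total characters in both strings),
# so it depends on string length and is not an edit count.
for candidate in candidates:
    ratio = SequenceMatcher(None, "bicycle", candidate).ratio()
    print(f"{candidate!r}: ratio={ratio:.3f}")
```

Because the score is relative to length, a fixed cutoff does not correspond to a fixed number of edits, which is why an absolute edit-distance threshold needs a different tool.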

CodePudding user response:

I'm not sure exactly what type of edit distance you're talking about, but fuzzywuzzy might have some functions that could help you. Personally I found this article to be pretty helpful in understanding the different distance functions fuzzywuzzy has.

CodePudding user response:

I would suggest looking into using Levenshtein distance. I believe there is a Python library for calculating this.

https://pypi.org/project/python-Levenshtein/
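
As a rough sketch of how this could fit together, the code below uses a small pure-Python Levenshtein function in place of the library (so nothing needs installing) and a greedy pass: each row is compared only against rows already kept with the same value in column A. The function names `levenshtein` and `keep_mask` are made up for this example, and the threshold is assumed to be "3 edits or fewer", which is what reproduces the expected output in the question.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(s) < len(t):
        s, t = t, s
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # delete a character from s
                current[j - 1] + 1,            # insert a character into s
                previous[j - 1] + (cs != ct),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def keep_mask(a_values, b_values, max_distance=3):
    """Return one boolean per row: True if the row should be kept.

    A row is dropped when an earlier kept row has the same A value and a
    B value within max_distance edits (assumption: threshold is inclusive).
    """
    kept_by_group = {}
    mask = []
    for a, b in zip(a_values, b_values):
        kept = kept_by_group.setdefault(a, [])
        is_duplicate = any(levenshtein(b, other) <= max_distance for other in kept)
        mask.append(not is_duplicate)
        if not is_duplicate:
            kept.append(b)
    return mask


# Columns A and B from the question, as plain lists.
a_col = ["Apple", "Apple", "Apple", "Peach", "Apple", "Peach", "Apple", "Apple"]
b_col = ["bicycle", "gigantic bicycle", "a bicycle", "bicycle",
         "~bicycle**", "airplane", "car", "cars"]

mask = keep_mask(a_col, b_col)
kept_rows = [i for i, keep in enumerate(mask) if keep]
print(kept_rows)  # [0, 1, 3, 5, 6]
```

With a DataFrame `df`, the same mask should work as a boolean filter, e.g. `df[keep_mask(df["A"], df["B"])]`. The pass is quadratic within each group of column A, so for large groups a C-backed distance function such as the one in python-Levenshtein would likely be a better fit than the pure-Python version above. The result also depends on row order, since the first row of each near-duplicate cluster is the one that survives.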
