I'm trying to compare the values in two dataframes. This is my code:
for i in df1['Searches']:
    for j in df['Tags']:
        if i == j:
            print(i, j)
The code works. However, I want to account for cases where the strings don't match exactly because of spacing, misspelling, or punctuation, but are similar enough that they should still be treated as a match.
For instance:
Searches | Tags
----------------------------------
lightblue | light blue
light-blue | light blue
light blu | light blue
lite blue | light blue
liteblue | light blue
liteblu | light blue
light b l u e | light blue
light.blue | light blue
l i ght blue | light blue
I listed variations of possible strings that could show up under Searches, and the string each one should match under Tags. Is there a way to account for those variations and still have them match?
Thank you for taking the time to read my question and help in any way you can.
CodePudding user response:
You are getting into fuzzy string matching. One way to do that is to use a similarity metric such as jaro_similarity from the Natural Language Toolkit (NLTK):
from nltk.metrics.distance import jaro_similarity
df['jaro_similarity'] = df.apply(lambda row: jaro_similarity(row['Searches'], row['Tags']), axis=1)
Result:
Searches Tags jaro_similarity
lightblue light blue 0.966667
light-blue light blue 0.933333
light blu light blue 0.966667
lite blue light blue 0.896296
liteblue light blue 0.858333
liteblu light blue 0.819048
light b l u e light blue 0.923077
light.blue light blue 0.933333
l i ght blue light blue 0.877778
You have to pick a cut-off point by experimenting with your data. Documentation for the nltk.metrics.distance module: https://www.nltk.org/api/nltk.metrics.distance.html#module-nltk.metrics.distance
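For intuition about what the metric rewards, here is a minimal pure-Python sketch of the Jaro formula (an illustration only, not NLTK's implementation): characters count as a match if they are equal and appear within a window of roughly half the longer string's length, and out-of-order matches are penalized as transpositions.

```python
def jaro(s1, s2):
    """Jaro similarity: rewards shared characters in similar positions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters may match only within this window of each other
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Transpositions: half the matched characters that appear out of order
    s1_matches = [s1[i] for i in range(len1) if matched1[i]]
    s2_matches = [s2[j] for j in range(len2) if matched2[j]]
    t = sum(a != b for a, b in zip(s1_matches, s2_matches)) // 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

print(round(jaro('lightblue', 'light blue'), 6))  # 0.966667, matching the table above
```

A cut-off is then just a comparison, e.g. `jaro(search, tag) >= 0.85`.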
CodePudding user response:
You can use a string similarity metric to determine a match. For example, here I use edit_distance from the nltk library:
import pandas as pd
from nltk.metrics.distance import edit_distance

searches = [
    'lightblue',
    'light-blue',
    'light blu',
    'lite blue',
    'liteblue',
    'liteblu',
    'light b l u e',
    'light.blue',
    'l i ght blue',
    'totally different string',
]

df = pd.DataFrame()
df['Searches'] = searches
df['Tags'] = 'light blue'

matches = []
distance_threshold = 5
for i in df['Searches']:
    for j in df['Tags']:
        if edit_distance(i, j) < distance_threshold:
            # print(i, j)
            matches.append(i)

print(list(set(matches)))
Output:
['light.blue',
'light b l u e',
'lightblue',
'light blu',
'light-blue',
'lite blue',
'l i ght blue',
'liteblu',
'liteblue']
But you will have to adjust distance_threshold to your liking, or choose another metric that works better. See a list of metrics here: https://www.nltk.org/api/nltk.metrics.distance.html
There are many other libraries that you can try as well; a quick search will turn up several.
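One option that needs no extra installation is difflib.get_close_matches from Python's standard library. A small sketch matching each search against the known tags (the 0.6 cutoff is difflib's default and just a starting point to tune):

```python
from difflib import get_close_matches

tags = ['light blue']
searches = ['lightblue', 'lite blu', 'light b l u e', 'totally different string']

for s in searches:
    # Returns up to n closest tags with similarity ratio >= cutoff
    best = get_close_matches(s, tags, n=1, cutoff=0.6)
    print(s, '->', best[0] if best else 'no match')
```

Here the first three searches resolve to 'light blue' while the unrelated string falls below the cutoff.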