I'm trying to compare the values in two dataframes. This is my code:
for i in df1['Searches']:
    for j in df['Tags']:
        if i == j:
            print(i, j)
The code works. However, I want to account for cases where the strings don't match exactly because of spacing, misspelling, or punctuation, but are similar enough that they should still be treated as a match.
For instance:
Searches | Tags
----------------------------------
lightblue | light blue
light-blue | light blue
light blu | light blue
lite blue | light blue
liteblue | light blue
liteblu | light blue
light b l u e | light blue
light.blue | light blue
l i ght blue | light blue
I listed variations of possible strings that could show up under Searches, and the string each one should match under Tags. Is there a way to account for those variations and still have them match?
Thank you for taking the time to read my question and help in any way you can.
CodePudding user response:
You are getting into fuzzy string matching. One way to do that is to use a similarity metric such as jaro_similarity from the Natural Language Toolkit (NLTK):
from nltk.metrics.distance import jaro_similarity
df['jaro_similarity'] = df.apply(lambda row: jaro_similarity(row['Searches'], row['Tags']), axis=1)
Result:
Searches Tags jaro_similarity
lightblue light blue 0.966667
light-blue light blue 0.933333
light blu light blue 0.966667
lite blue light blue 0.896296
liteblue light blue 0.858333
liteblu light blue 0.819048
light b l u e light blue 0.923077
light.blue light blue 0.933333
l i ght blue light blue 0.877778
You have to pick a cut-off point by experimenting with your data. Documentation for the nltk.metrics.distance module: https://www.nltk.org/api/nltk.metrics.distance.html#module-nltk.metrics.distance
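For intuition about what the metric rewards, here is a minimal pure-Python sketch of the Jaro formula (an illustration only, not NLTK's implementation): characters count as a match if they are equal and appear within a window of roughly half the longer string's length, and out-of-order matches are penalized as transpositions.

```python
def jaro(s1, s2):
    """Jaro similarity: rewards shared characters in similar positions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters may match only within this window of each other
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Transpositions: half the matched characters that appear out of order
    s1_matches = [s1[i] for i in range(len1) if matched1[i]]
    s2_matches = [s2[j] for j in range(len2) if matched2[j]]
    t = sum(a != b for a, b in zip(s1_matches, s2_matches)) // 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

print(round(jaro('lightblue', 'light blue'), 6))  # 0.966667, matching the table above
```

A cut-off is then just a comparison, e.g. `jaro(search, tag) >= 0.85`.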
CodePudding user response:
You can use a string similarity metric to determine a match. For example, here I use edit_distance from the nltk library:
import pandas as pd
from nltk.metrics.distance import edit_distance

searches = [
    'lightblue',
    'light-blue',
    'light blu',
    'lite blue',
    'liteblue',
    'liteblu',
    'light b l u e',
    'light.blue',
    'l i ght blue',
    'totally different string',
]

df = pd.DataFrame()
df['Searches'] = searches
df['Tags'] = 'light blue'

matches = []
distance_threshold = 5
for i in df['Searches']:
    for j in df['Tags']:
        if edit_distance(i, j) < distance_threshold:
            # print(i, j)
            matches.append(i)

print(list(set(matches)))
Output:
['light.blue',
'light b l u e',
'lightblue',
'light blu',
'light-blue',
'lite blue',
'l i ght blue',
'liteblu',
'liteblue']
But you will have to adjust distance_threshold to your liking, or choose another metric that works better. See a list of metrics here: https://www.nltk.org/api/nltk.metrics.distance.html
There are many other libraries that you can try as well; a quick search will turn up several.
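One option that needs no extra installation is difflib.get_close_matches from Python's standard library. A small sketch matching each search against the known tags (the 0.6 cutoff is difflib's default and just a starting point to tune):

```python
from difflib import get_close_matches

tags = ['light blue']
searches = ['lightblue', 'lite blu', 'light b l u e', 'totally different string']

for s in searches:
    # Returns up to n closest tags with similarity ratio >= cutoff
    best = get_close_matches(s, tags, n=1, cutoff=0.6)
    print(s, '->', best[0] if best else 'no match')
```

Here the first three searches resolve to 'light blue' while the unrelated string falls below the cutoff.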