How to compare strings from 2 dataframes, and create new column containing matching words?-CodePudding

I have two datasets readed in Pandas dataframes.

categories_df:

id category
0  foot 
1  electricity
2  car moto
3  driving licence

sentences_df

sentence_id sentence
0           I love football
1           Yesterday I didn't have elec
2           I only have a car
3           I have driving licence car since 2020
4           Hard do find examples

I would like to do this:

For each category, if one is retrieve, or contained or is in (i.e. not full match) one of sentence in sentences_df, create a new column 'category_matched' with a list of category matched

Expected output for sentences_df (a new column):

sentence_id sentence                               category_matched
0           I love football                        [foot]
1           Yesterday I didn't have elec           [electricity]
2           I only have a car                      [car moto]
3           I have driving licence car since 2020  [car moto, driving licence]
4           Hard do find examples                  [no match]

Category column in categories_df can have several word for one category. Sentences in sentence column (sentences_df) can contain a lot of words and numbers/special characters (like -, /, &, etc).

How can I do this ?

CodePudding user response：

Assuming categories_df is an existing variable within this scope:

def get_overlapping_categories(text):
    overlapping_categories = []
    tokens = text.split()
    for c in categories_df["category"]:
        if any([t in c.split() for t in tokens]):
            overlapping_categories.append(c)
    return overlapping_categories

sentences_df["category_matched"] = sentences_df["sentence"].apply(get_overlapping_categories)

This uses a simple split on whitespaces, so in something like a, b the a would not be a token that is checked, so you might want to improve this by replacing .split() by a better tokenizing method such as nltk.word_tokenize(sentence).

CodePudding user response：

You can loop through the two series and check:

result = ''
if category in sentence:
    result = f'{result}, {category}'

add result to corresponding line of the first series after a split.

But yes, as mentioned in the comment, don't add lists to your dataframe. Keep it as string and then when you need to calculate, split the individual strings.

CodePudding user response：

Fuzzy string matching, e.g., with fuzzywuzzy, is probably your best bet. It returns a similarity score according to different heuristics, and in your case, partial_ratio could work. You can apply it to all sentence-category combinations like so:

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz # !pip install fuzzywuzzy

# Generate dataframes
cat_df = pd.DataFrame({'category': ['foot', 
                                    'electricity',
                                    'car moto', 
                                    'driving licence']},
                      index = pd.Series([0, 1, 2, 3], name="id"))

sent_df = pd.DataFrame({'sentence': ['I love football',
                                     "Yesterday I didn't have elec",
                                     'I only have a car',
                                     'I have driving licence car since 2020',
                                     'Hard do find examples']},
                        index = pd.Series([0, 1, 2, 3, 4], name="sentence_id"))


# Make vectorized fuzzy string matching function
partial_ratio_vec = np.vectorize(fuzz.partial_ratio)

# Apply function to all sentence-category-combinations
result = partial_ratio_vec(cat_df.category, sent_df.sentence.values[:, None])

# Generate result dataframe
df = pd.DataFrame(result, columns = cat_df.category, index = sent_df.sentence)

The resulting dataframe looks like this:

sentence	foot	electricity	car moto	driving licence
I love football	100	19	38	20
Yesterday I didn't have elec	25	53	22	33
I only have a car	25	33	55	12
I have driving licence car since 2020	0	45	50	100
Hard do find examples	25	32	50	40

That is, you get a score for each category and could go from there with filtering etc. It works well for some examples (like the first sentence), but for others it's not so clear; e.g., car moto scores 50 for both the last and the second to last sentences, so with this approach you can't define a threshold that would match one but not the other.