How do I map a dictionary with a set of strings to the column of a pandas data frame in Python?

I have a data frame with a column named text and want to assign values in a new column if the text in the first column contains one or more substrings from a dictionary. If the text column contains a substring, I want the key of the dictionary to be assigned to the new column category.

This is what my code looks like:

import pandas as pd

some_strings = ['Apples and pears and cherries and bananas', 
                'VW and Ford and Lamborghini and Chrysler and Hyundai', 
                'Berlin and Paris and Athens and London']
categories = ['fruits', 'cars', 'capitals']

test_df = pd.DataFrame(some_strings, columns = ['text'])

cat_map = {'fruits': {'apples', 'pears', 'cherries', 'bananas'}, 
           'cars': {'VW', 'Ford', 'Lamborghini', 'Chrysler', 'Hyundai'}, 
           'capitals': {'Berlin', 'Paris', 'Athens', 'London'}}

The dictionary cat_map contains sets of strings as values. If the text column in test_df contains any of those words, then I want the key of the dictionary to be assigned as value to the new category column. The output dataframe should look like this:

output_frame = pd.DataFrame({'text': some_strings, 
                            'category': categories})

Any help on this would be appreciated.

CodePudding user response：

You can try

d = {v:k for k, s in cat_map.items() for v in s}

test_df['category'] = (test_df['text'].str.extractall('(' + '|'.join(d) + ')')
                       [0].map(d)
                       .groupby(level=0).agg(set))

print(d)

{'cherries': 'fruits', 'pears': 'fruits', 'bananas': 'fruits', 'apples': 'fruits', 'Chrysler': 'cars', 'Hyundai': 'cars', 'Lamborghini': 'cars', 'Ford': 'cars', 'VW': 'cars', 'Berlin': 'capitals', 'Athens': 'capitals', 'London': 'capitals', 'Paris': 'capitals'}


print(test_df)

                                                   text    category
0             Apples and pears and cherries and bananas    {fruits}
1  VW and Ford and Lamborghini and Chrysler and Hyundai      {cars}
2                Berlin and Paris and Athens and London  {capitals}
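
The category column holds sets because a single text can match words from more than one category. Here is a minimal sketch of that case, using a shortened cat_map and a made-up second row (both are assumptions for illustration, not part of the question's data). It joins the labels into a sorted string so the result is deterministic:

```python
import pandas as pd

# Shortened version of the question's cat_map (assumption, for brevity)
cat_map = {'fruits': {'apples', 'pears'},
           'cars': {'VW', 'Ford'},
           'capitals': {'Berlin', 'Paris'}}

# Invert the mapping: word -> category
d = {v: k for k, s in cat_map.items() for v in s}

# Hypothetical data: the second row mentions two categories
df = pd.DataFrame({'text': ['pears and apples', 'Ford and Paris']})

# One match per row of the result, indexed by (original row, match number)
matches = df['text'].str.extractall('(' + '|'.join(d) + ')')[0].map(d)

# Collapse back to one value per original row
df['category'] = matches.groupby(level=0).agg(lambda s: ', '.join(sorted(set(s))))
print(df)
```

Note that the match is case-sensitive, so 'Apples' in the question's first row is not matched by 'apples'; that row is still labelled because 'pears', 'cherries' and 'bananas' match.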

CodePudding user response：

I'm not exactly sure what you're trying to achieve, but if I understood correctly, you could check whether any of the words in the string is present in your cat_map:

import pandas as pd

results = {"text": [], "category": []}

for element in some_strings:
    for key, value in cat_map.items():
        # Check whether any word of the current string belongs to the current category
        if set(element.split(' ')).intersection(value):
            results["text"].append(element)
            results["category"].append(key)

df = pd.DataFrame.from_dict(results)
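
A loop like this appends one row per matching category, so a text with no match disappears from the result and a text matching two categories appears twice. If one row per text is needed, a possible variant is sketched below. The helper name categories_for and the sample strings are hypothetical, and lowercasing the tokens is an added assumption so that 'Apples' matches 'apples':

```python
import pandas as pd

cat_map = {'fruits': {'apples', 'pears', 'cherries', 'bananas'},
           'cars': {'VW', 'Ford', 'Lamborghini', 'Chrysler', 'Hyundai'},
           'capitals': {'Berlin', 'Paris', 'Athens', 'London'}}

# Hypothetical sample: one match, two matches, no match
some_strings = ['Apples and pears', 'Ford and Paris', 'nothing here']

def categories_for(text, mapping):
    """Return the sorted labels whose word set shares a lowercased token with text."""
    tokens = set(text.lower().split())
    return sorted(label for label, words in mapping.items()
                  if tokens & {w.lower() for w in words})

df = pd.DataFrame({'text': some_strings})
# Extra keyword arguments to Series.apply are passed through to the function
df['category'] = df['text'].apply(categories_for, mapping=cat_map)
print(df)
```

Each row keeps its text, and the category cell is a list that is empty when nothing matched.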

CodePudding user response：

One approach:

lookup = { word : label for label, words in cat_map.items() for word in words }
pattern = fr"\b({'|'.join(lookup)})\b"

test_df["category"] = test_df["text"].str.extract(pattern, expand=False).map(lookup)
print(test_df)

Output

                                                text  category
0          Apples and pears and cherries and bananas    fruits
1  VW and Ford and Lamborghini and Chrysler and H...      cars
2             Berlin and Paris and Athens and London  capitals
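
The pattern above is case-sensitive, so the first row is labelled through 'pears' rather than 'Apples'. If matching should ignore case, one possible variant is sketched here. The sample rows are made up, and lowercasing the lookup keys and escaping the words are assumptions added for robustness:

```python
import re
import pandas as pd

cat_map = {'fruits': {'apples', 'pears', 'cherries', 'bananas'},
           'cars': {'VW', 'Ford', 'Lamborghini', 'Chrysler', 'Hyundai'},
           'capitals': {'Berlin', 'Paris', 'Athens', 'London'}}

# Hypothetical rows: different capitalisation, and one row with no match
test_df = pd.DataFrame({'text': ['APPLES only', 'a Ford', 'nothing here']})

# Lowercase the keys so the lookup works however the text is capitalised
lookup = {word.lower(): label for label, words in cat_map.items() for word in words}

# re.escape guards against words that contain regex metacharacters
pattern = r"\b(" + "|".join(map(re.escape, lookup)) + r")\b"

# First match per row, ignoring case; rows without a match become NaN
found = test_df["text"].str.extract(pattern, flags=re.IGNORECASE, expand=False)
test_df["category"] = found.str.lower().map(lookup)
print(test_df)
```

Like the original, this keeps only the first match in each row, so a text that mentions several categories gets the label of whichever word appears first.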