I have a data frame with a column named text and want to assign values in a new column if the text in the first column contains one or more substrings from a dictionary. If the text column contains a substring, I want the key of the dictionary to be assigned to the new column category.
This is what my code looks like:
import pandas as pd
some_strings = ['Apples and pears and cherries and bananas',
                'VW and Ford and Lamborghini and Chrysler and Hyundai',
                'Berlin and Paris and Athens and London']
categories = ['fruits', 'cars', 'capitals']
test_df = pd.DataFrame(some_strings, columns = ['text'])
cat_map = {'fruits': {'apples', 'pears', 'cherries', 'bananas'},
           'cars': {'VW', 'Ford', 'Lamborghini', 'Chrysler', 'Hyundai'},
           'capitals': {'Berlin', 'Paris', 'Athens', 'London'}}
The dictionary cat_map contains sets of strings as values. If the text column in test_df contains any of those words, then I want the key of the dictionary to be assigned as the value in the new category column. The output dataframe should look like this:
output_frame = pd.DataFrame({'text': some_strings,
'category': categories})
Any help on this would be appreciated.
CodePudding user response:
You can try
d = {v:k for k, s in cat_map.items() for v in s}
test_df['category'] = (test_df['text'].str.extractall('(' + '|'.join(d) + ')')
                       [0].map(d)
                       .groupby(level=0).agg(set))
print(d)
{'cherries': 'fruits', 'pears': 'fruits', 'bananas': 'fruits', 'apples': 'fruits', 'Chrysler': 'cars', 'Hyundai': 'cars', 'Lamborghini': 'cars', 'Ford': 'cars', 'VW': 'cars', 'Berlin': 'capitals', 'Athens': 'capitals', 'London': 'capitals', 'Paris': 'capitals'}
print(test_df)
text category
0 Apples and pears and cherries and bananas {fruits}
1 VW and Ford and Lamborghini and Chrysler and Hyundai {cars}
2 Berlin and Paris and Athens and London {capitals}
CodePudding user response:
Not exactly sure what you're trying to achieve, but if I understood correctly, you could check whether any of the words in the string are present in your cat_map:
import pandas as pd
results = {"text": [], "category": []}
for element in some_strings:
    for key, value in cat_map.items():
        # Check whether any word of the current string is in the current category
        if set(element.split(' ')).intersection(value):
            results["text"].append(element)
            results["category"].append(key)
df = pd.DataFrame.from_dict(results)
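The same word-overlap check can also be sketched with Series.apply, so the category is written straight onto test_df. The helper name first_category is made up for illustration. It keeps only the first matching category and returns None when nothing matches, which is an assumption about the desired behavior rather than something stated in the question.

```python
import pandas as pd

cat_map = {
    "fruits": {"apples", "pears", "cherries", "bananas"},
    "cars": {"VW", "Ford", "Lamborghini", "Chrysler", "Hyundai"},
    "capitals": {"Berlin", "Paris", "Athens", "London"},
}
test_df = pd.DataFrame({"text": [
    "Apples and pears and cherries and bananas",
    "VW and Ford and Lamborghini and Chrysler and Hyundai",
    "Berlin and Paris and Athens and London",
]})


def first_category(text):
    """Hypothetical helper: label of the first category sharing a word with text."""
    words = set(text.split())
    for label, keywords in cat_map.items():
        # A non-empty intersection means at least one keyword occurs as a whole word
        if words & keywords:
            return label
    return None  # assumption: rows without any match get no category


test_df["category"] = test_df["text"].apply(first_category)
print(test_df)
```

Note that this comparison is case-sensitive, so "Apples" does not match "apples"; the first row is still labeled because "pears" matches.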
CodePudding user response:
One approach:
lookup = { word : label for label, words in cat_map.items() for word in words }
pattern = fr"\b({'|'.join(lookup)})\b"
test_df["category"] = test_df["text"].str.extract(pattern, expand=False).map(lookup)
print(test_df)
Output
text category
0 Apples and pears and cherries and bananas fruits
1 VW and Ford and Lamborghini and Chrysler and H... cars
2 Berlin and Paris and Athens and London capitals
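If the keywords may contain regex metacharacters, or the case in the text may differ from the case in cat_map (for example "Apples" vs. "apples"), a slightly more defensive variant of the extract approach might look like the sketch below. Escaping with re.escape and matching case-insensitively are my additions, not requirements from the question, and str.extract still returns only the first match per row.

```python
import re
import pandas as pd

cat_map = {
    "fruits": {"apples", "pears", "cherries", "bananas"},
    "cars": {"VW", "Ford", "Lamborghini", "Chrysler", "Hyundai"},
    "capitals": {"Berlin", "Paris", "Athens", "London"},
}
test_df = pd.DataFrame({"text": [
    "Apples and pears and cherries and bananas",
    "VW and Ford and Lamborghini and Chrysler and Hyundai",
    "Berlin and Paris and Athens and London",
]})

# Lower-cased keyword -> category label
lookup = {word.lower(): label for label, words in cat_map.items() for word in words}

# Escape each keyword so characters such as "." or "+" are matched literally
pattern = r"\b(" + "|".join(re.escape(word) for word in lookup) + r")\b"

# First keyword found in each row, regardless of case (NaN if none is found)
first_match = test_df["text"].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Normalize the matched text to lower case before looking up its category
test_df["category"] = first_match.str.lower().map(lookup)
print(test_df)
```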