I am struggling to find a "pythonic" way to make this logic work:
I have a dataframe of payment transactions with a column of strings( "beneficiary"):
index | beneficiary |
---|---|
12 | REWE SAGT DANKE. ... |
13 | NaN |
14 | OBI BAU- U. HEIMWER//BERLIN/DE / OBI SAGT DANKE |
15 | NETFLIX INTERNATIONAL B.V. |
I need to create another column in my dataframe which would be the category that every line belongs to.
index | beneficiary | category |
---|---|---|
12 | REWE SAGT DANKE. ... | Groceries |
14 | OBI BAU- U. HEIMWER//BERLIN/DE / OBI SAGT DANKE | Groceries |
15 | NETFLIX INTERNATIONAL B.V. | Entertainment |
I am thinking to make a dictionary like this below and somehow reverse look-up the sub-string values from the categories dict with the column above:
categories = {"Groceries": ["EDEKA", "REWE", "OBI"],
"Entertainment": ["NETFLIX"]}
The logic is: if sub-string "REWE" is in df['beneficiary'] then df['category'] equals the key of the dict element where the sub-string is.
I am open to other mapping logic.
CodePudding user response:
You can use a regex, for this you need to rework your dictionary to have the matched strings as key and categories as values:
categories = {"Groceries": ["EDEKA", "REWE", "OBI"],
"Entertainment": ["NETFLIX"]}
cat_sub = {v:k for k,l in categories.items() for v in l}
regex = r'(%s)' % '|'.join(fr'\b{c}\b' for c in cat_sub)
# regex looks like this: (\bEDEKA\b|\bREWE\b|\bOBI\b|\bNETFLIX\b)
df['category'] = df['beneficiary'].str.extract(regex, expand=False).map(cat_sub)
NB. I used word boundaries (\b
) to ensure matching full words, use regex = r'(%s)' % '|'.join(cat_sub)
if you don't want this behaviour
output:
index beneficiary category
0 12 REWE SAGT DANKE. ... Groceries
1 13 NaN NaN
2 14 OBI BAU- U. HEIMWER//BERLIN/DE / OBI SAGT DANKE Groceries
3 15 NETFLIX INTERNATIONAL B.V. Entertainment
NB. if needed to drop the NaNs, use dropna
CodePudding user response:
Try reversing your dictionary to map each word to a category and applying this to each word in "beneficiary":
word_cat = {w: k for k,v in categories.items() for w in v}
df["category"] = df["beneficiary"].str.split(expand=True).apply(lambda x: x.map(word_cat)).bfill(axis=1).iloc[:,0]
df = df.dropna()
>>> df
index beneficiary category
0 12 REWE SAGT DANKE. Groceries
2 14 OBI BAU-U. HEIMWER//BERLIN/DE /OBI SAGT DANKE Groceries
3 15 NETFLIX INTERNATIONAL B.V. Entertainment