I have a dataframe df1
:
ID | item |
---|---|
11111 | chair |
11112 | desk¥blue chair |
11113 | bed¥pen¥cable |
11114 | mug¥old sofa¥toy |
11115 | old mug¥wine glass |
11116 | blue chair¥old mug |
etc.
and a large dictionary item_dict
(from csv) which categorize some but not all items into categories:
item | category |
---|---|
chair | furniture |
blue chair | furniture |
mug | tableware |
old mug | tableware |
wine glass | tableware |
etc.
The goal is to make a new column in df1
of all category of interest for items, if any, in each row:
ID | item | category |
---|---|---|
11111 | chair | furniture |
11112 | desk¥blue chair | furniture |
11113 | bed¥pen¥cable | |
11114 | mug¥old sofa¥toy | mug |
11115 | old mug¥wine glass | mug |
11116 | blue chair¥old mug | furniture, mug |
I’ve tried
df1[‘category’] = df1[‘item’].replace(item_dict)
which will only replace values that only have a single item;
adding regex=True
would allow multi-item entries to be replaced but would turn desk¥blue chair
into desk¥blue furniture
bc a shorter one existed.
I know replace
isn’t the best option since I want the new column to only contain categories of interest without other item names, but am unsure of what to use.
CodePudding user response:
You can try this:
# !pip install -U swifter #### only in case of very large dataframe
import swifter
delimiter = '¥'
df1['category'] = df1['item'].swifter.apply(
lambda _: delimiter.join([item_dict.get(el, '') for el in _.split(delimiter)])
)