I have a dataframe of strings columns as shown below
id | item | attributes
------------------------------
1 | apple | fruit, red, round
1 | apple | natural, organic
1 | apple | red
2 | pea | a green vegetable
2 | pea | round
Notice that an attribute can be a single word or a sentence e.g. a green vegetable
. If a row has more than one attribute, they are separated by a comma. Now, attribute red
occurs once in the first row and once in the third row for the same item
apple
. Clearly this is a redundancy. The attribute round
also occurs twice, once in the first row for apple
and once in fifth row for pea
. But because round
occurs for two different items, it is not a redundancy.
For each Item, I want to group by id
and item
and merge all the unique attributes into a single cell separated by a comma so that there is no redundancy as shown below:
Expected output
id | item | attributes
------------------------------------------------
1 | apple | fruit, red, round, natural, organic
2 | pea | a green vegetable, round
How can I do this in Python? I have tried
df['merged'] = df.groupby(['id','item']).attributes.transform(lambda x:", ".join(list(set(x))))
But his merges all the attributes and repeats Red
. How do I merge without repetition?
CodePudding user response:
Set up data: (I add one more rows to make the sample more representative)
import pandas as pd
import numpy as np
import re
df = pd.DataFrame({'item': ['apple']*3 ['pea']*3,
'atb':['fruit, red, round', 'natural, organic', 'red', 'a green vegetable', 'round', 'dummy, vegetable']})
df
Create a self-define function:
def string_split(row):
# empty list:
counts = list()
# split string by separators: if you want multiple separtors use: | between separators, ex: re.split(' |, |;')
# this will split by either: (1) ' ', (2) ', ', or (3) ';'
rowwords = re.split(', ', row)
# obtain unique words:
for word in rowwords:
if word not in counts:
counts.append(word)
return counts
Using groupby
to combine all rows into one row, separated by ', '
:
out = df.groupby('item')['atb'].apply(lambda g: ', '.join(g)).reset_index()
out
Using apply
to get unique words desired:
out['new'] = out['atb'].apply(string_split)
out