How to group by columns and merge only the unique strings of another column which are separated by a-CodePudding

I have a dataframe of strings columns as shown below

id | item  | attributes
------------------------------
1  | apple | fruit, red, round
1  | apple | natural, organic
1  | apple | red
2  | pea   | a green vegetable
2  | pea   | round

Notice that an attribute can be a single word or a sentence e.g. a green vegetable. If a row has more than one attribute, they are separated by a comma. Now, attribute red occurs once in the first row and once in the third row for the same item apple. Clearly this is a redundancy. The attribute round also occurs twice, once in the first row for apple and once in fifth row for pea. But because round occurs for two different items, it is not a redundancy.

For each Item, I want to group by id and item and merge all the unique attributes into a single cell separated by a comma so that there is no redundancy as shown below:

Expected output

id | item  | attributes
------------------------------------------------
1  | apple | fruit, red, round, natural, organic
2  | pea   | a green vegetable, round

How can I do this in Python? I have tried

df['merged'] = df.groupby(['id','item']).attributes.transform(lambda x:", ".join(list(set(x))))

But his merges all the attributes and repeats Red. How do I merge without repetition?

CodePudding user response：

Set up data: (I add one more rows to make the sample more representative)

import pandas as pd
import numpy as np
import re

df = pd.DataFrame({'item': ['apple']*3 ['pea']*3,
                   'atb':['fruit, red, round', 'natural, organic', 'red', 'a green vegetable', 'round', 'dummy, vegetable']})

df

Create a self-define function:

def string_split(row):
    # empty list:
    counts = list()
    
    # split string by separators: if you want multiple separtors use: | between separators, ex: re.split(' |, |;')
    # this will split by either: (1) ' ', (2) ', ', or (3) ';'
    rowwords = re.split(', ', row)
    
    # obtain unique words:
    for word in rowwords:
        if word not in counts:
            counts.append(word)    
    
    return counts

Using groupby to combine all rows into one row, separated by ', ':

out = df.groupby('item')['atb'].apply(lambda g: ', '.join(g)).reset_index()
out

Using apply to get unique words desired:

out['new'] = out['atb'].apply(string_split)
out