I want to remove duplicate words or groups of words in the cells of the two columns below, while still keeping " , " between each word or group of words. I've tried a function using return ' , '.join(dict.fromkeys(text.split())) and applying it to each column, but it splits apart words I don't want separated and adds unneeded commas (e.g. three four should not be split by a comma). The solution will be applied to many more rows in col2 and col3.
code:
import pandas as pd

df0 = pd.DataFrame(data={'col1': [123, 123, 123],
                         'col2': ['one , two , three four', 'two', 'three four'],
                         'col3': ['many numbers , another number', 'number', 'another number , number']})
df0['col2'] = df0.groupby(['col1'])['col2'].transform(lambda x : ' , '.join(x))
df0['col3'] = df0.groupby(['col1'])['col3'].transform(lambda x : ' , '.join(x))
df0 = df0.drop_duplicates()
df0
Current output:
col1 col2 col3
0 123 one , two , three four , two , three four many numbers , another number , number , another number , number
desired output:
col1 col2 col3
0 123 one , two , three four many numbers , another number , number
CodePudding user response:
.transform() will maintain the number of rows present in your original group. Since it seems you are using .drop_duplicates() on your dataframe to negate this, you are better off using .agg() in the first place.
From there, the solution is similar to yours, but using set instead of dict (similar, but simpler) and passing your delimiter ' , ' to split.
Assuming the final order of elements does not matter, this will work:
delim = ' , '
df0 = df0.groupby('col1', as_index=False)[['col2', 'col3']].agg(
    lambda s: delim.join(set(delim.join(s).split(delim))))
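If the order of elements does matter, a small variation keeps the first occurrence of each phrase: swap set for dict.fromkeys, which preserves insertion order (Python 3.7+). A minimal sketch, using the same sample data as the question:

```python
import pandas as pd

df0 = pd.DataFrame(data={'col1': [123, 123, 123],
                         'col2': ['one , two , three four', 'two', 'three four'],
                         'col3': ['many numbers , another number', 'number', 'another number , number']})

delim = ' , '

def dedup_keep_order(s):
    # Join the group's cells into one delimited string, split on the full
    # delimiter (so 'three four' stays intact), then drop duplicates while
    # preserving first-seen order via dict.fromkeys.
    return delim.join(dict.fromkeys(delim.join(s).split(delim)))

out = df0.groupby('col1', as_index=False)[['col2', 'col3']].agg(dedup_keep_order)
print(out)
```

This yields 'one , two , three four' in col2 and 'many numbers , another number , number' in col3, matching the desired output deterministically, whereas the set-based version may reorder the phrases between runs.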