I want to remove duplicate words or groups of words in the cells of the two columns below, while still keeping " , " between each word or group of words. I've tried a function using return ' , '.join(dict.fromkeys(text.split())) and applying it to each column, but it splits apart words I don't want separated and adds unneeded commas (e.g. three four should not be split by a comma). The solution will be applied to many more rows in col2 and col3.
code:
import pandas as pd

df0 = pd.DataFrame(data={'col1': [123, 123, 123],
                         'col2': ['one , two , three four', 'two', 'three four'],
                         'col3': ['many numbers , another number', 'number', 'another number , number']})
df0['col2'] = df0.groupby(['col1'])['col2'].transform(lambda x : ' , '.join(x))
df0['col3'] = df0.groupby(['col1'])['col3'].transform(lambda x : ' , '.join(x))
df0 = df0.drop_duplicates()
df0
Current output:
col1 col2 col3
0 123 one , two , three four , two , three four many numbers , another number , number , another number , number
desired output:
col1 col2 col3
0 123 one , two , three four many numbers , another number , number
CodePudding user response:
.transform() will maintain the number of rows present in your original group. Since it seems you are using .drop_duplicates() on your dataframe to negate this, you are better off using .agg() in the first place.
From there, the solution is similar to yours, but using set instead of dict (similar, but simpler) and passing your delimiter ' , ' to split.
Assuming the final order of elements does not matter, this will work:
delim = ' , '
df0 = df0.groupby('col1', as_index=False)[['col2', 'col3']].agg(
    lambda s: delim.join(set(delim.join(s).split(delim))))
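If the order of elements does matter, a small variation keeps the first occurrence of each phrase: swap set for dict.fromkeys, which preserves insertion order (Python 3.7+). A minimal sketch, using the same sample data as the question:

```python
import pandas as pd

df0 = pd.DataFrame(data={'col1': [123, 123, 123],
                         'col2': ['one , two , three four', 'two', 'three four'],
                         'col3': ['many numbers , another number', 'number', 'another number , number']})

delim = ' , '

def dedup_keep_order(s):
    # Join the group's cells into one delimited string, split on the full
    # delimiter (so 'three four' stays intact), then drop duplicates while
    # preserving first-seen order via dict.fromkeys.
    return delim.join(dict.fromkeys(delim.join(s).split(delim)))

out = df0.groupby('col1', as_index=False)[['col2', 'col3']].agg(dedup_keep_order)
print(out)
```

This yields 'one , two , three four' in col2 and 'many numbers , another number , number' in col3, matching the desired output deterministically, whereas the set-based version may reorder the phrases between runs.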