Home > Back-end >  remove duplicates values in a column seperated by comma in Pandas
remove duplicates values in a column seperated by comma in Pandas

Time:03-22

I want to remove duplicates in a column via Pandas. I tried df.drop_duplicates() but no luck. How to achieve this in Pandas?

Input:

A
team=red, Manager=Travis
team=Blue, Manager=John, team=Blue
Manager=David, Bank=HDFC, team=XYZ, Bank=HDFC

Expected_Output:

A
team=red, Manager=Travis
team=Blue, Manager=John
Manager=David, Bank=HDFC, team=XYZ

Code

df = df.drop_duplicates('A', keep='last')

CodePudding user response:

Try:

df['A'].str.split(',').explode().str.strip(' ')\
       .drop_duplicates().groupby(level=0).agg(','.join)

Output:

0             team=red,Manager=Travis
1              team=Blue,Manager=John
2    Manager=David,Bank=HDFC,team=XYZ
Name: A, dtype: object

CodePudding user response:

You can use some data structures to achieve this result.

  1. split entries
  2. convert to set (or some non duplicated structure)
  3. join back to string
print(df['A'])
0                         team=red, Manager=Travis
1               team=Blue, Manager=John, team=Blue
2    Manager=David, Bank=HDFC, team=XYZ, Bank=HDFC
Name: A, dtype: object


out = (
    df['A'].str.split(r',\s ')
    .map(set)
    .str.join(", ")
)

print(out)
0              Manager=Travis, team=red
1               team=Blue, Manager=John
2    Bank=HDFC, team=XYZ, Manager=David
Name: A, dtype: object

Alternatively, if the order of your string entries is important, you can use dict.fromkeys instead of a set. Since dictionaries are implicitly ordered as of Py > 3.6

out = (
    df['A'].str.split(r',\s ')
    .map(dict.fromkeys)
    .str.join(", ")
)

print(out)
0              team=red, Manager=Travis
1               team=Blue, Manager=John
2    Manager=David, Bank=HDFC, team=XYZ
Name: A, dtype: object
  • Related