I want to remove duplicates in a column via Pandas.
I tried df.drop_duplicates()
but no luck.
How to achieve this in Pandas?
Input:
A
team=red, Manager=Travis
team=Blue, Manager=John, team=Blue
Manager=David, Bank=HDFC, team=XYZ, Bank=HDFC
Expected_Output:
A
team=red, Manager=Travis
team=Blue, Manager=John
Manager=David, Bank=HDFC, team=XYZ
Code
df = df.drop_duplicates('A', keep='last')
CodePudding user response:
Try:
df['A'].str.split(',').explode().str.strip(' ')\
.drop_duplicates().groupby(level=0).agg(','.join)
Output:
0 team=red,Manager=Travis
1 team=Blue,Manager=John
2 Manager=David,Bank=HDFC,team=XYZ
Name: A, dtype: object
CodePudding user response:
You can use some data structures to achieve this result.
- split entries
- convert to set (or some non duplicated structure)
- join back to string
print(df['A'])
0 team=red, Manager=Travis
1 team=Blue, Manager=John, team=Blue
2 Manager=David, Bank=HDFC, team=XYZ, Bank=HDFC
Name: A, dtype: object
out = (
df['A'].str.split(r',\s ')
.map(set)
.str.join(", ")
)
print(out)
0 Manager=Travis, team=red
1 team=Blue, Manager=John
2 Bank=HDFC, team=XYZ, Manager=David
Name: A, dtype: object
Alternatively, if the order of your string entries is important, you can use dict.fromkeys
instead of a set
. Since dictionaries are implicitly ordered as of Py > 3.6
out = (
df['A'].str.split(r',\s ')
.map(dict.fromkeys)
.str.join(", ")
)
print(out)
0 team=red, Manager=Travis
1 team=Blue, Manager=John
2 Manager=David, Bank=HDFC, team=XYZ
Name: A, dtype: object