I want to drop duplicates from DF where column's values are equal for one unique key. Example:
In:
KEY SYSTEM
TD-438426 AAA
TD-438426 BBB
TD-438426 AAA
TD-438709 BBB
TD-438709 BBB
TD-438750 CCC
TD-438750 CCC
TD-438750 CCC
TD-438874 AAA
TD-438874 BBB
Out:
KEY SYSTEM
TD-438426 AAA
TD-438426 BBB
TD-438709 BBB
TD-438750 CCC
TD-438874 AAA
TD-438874 BBB
P.S. Of course there are some exceptions that I want to catch.
In:
KEY TEST SYSTEM
TD-438426 ABC AAA
TD-438426 ABC BBB
Out:
KEY TEST SYSTEM
TD-438426 ABC AAA
TD-438426 ABC BBB
And
In:
KEY TEST SYSTEM
TD-438426 ABC AAA
TD-438426 CBA AAA
Out:
KEY TEST SYSTEM
TD-438426 ABC AAA
CodePudding user response:
Like @mcsioni mentioned in the comments, what you are looking for is df.drop_duplicates()
Also, it is useful to understand two arguments of this method, namely, subset
and keep
.
E.g., You want to retain only unique values in the KEY
column and keep the first SYSTEM
value for each unique KEY
, you'd do:
df.drop_duplicates(subset=['KEY'], keep='first')
If you just used df.drop_duplicates()
without any arguments, the subset will be all the columns, which is what your desired output is asking for.
EDIT
To keep up with your new requirement, do this:
df.drop_duplicates(subset=['KEY', 'SYSTEM'], keep='first')
Note: The default behavior for the keep
argument is 'first'
but doesn't hurt to be explicit when working with high-level libraries like pandas.