Home > front end >  DATAFRAME: drop duplicates where column's values are equal for unique key
DATAFRAME: drop duplicates where column's values are equal for unique key

Time:04-15

I want to drop duplicates from DF where column's values are equal for one unique key. Example:

In:

KEY         SYSTEM
TD-438426   AAA
TD-438426   BBB
TD-438426   AAA
TD-438709   BBB
TD-438709   BBB
TD-438750   CCC
TD-438750   CCC
TD-438750   CCC
TD-438874   AAA
TD-438874   BBB

Out:

KEY         SYSTEM
TD-438426   AAA
TD-438426   BBB
TD-438709   BBB
TD-438750   CCC
TD-438874   AAA
TD-438874   BBB

P.S. Of course there are some exceptions that I want to catch.

In:

KEY         TEST    SYSTEM
TD-438426   ABC     AAA
TD-438426   ABC     BBB

Out:

KEY         TEST    SYSTEM
TD-438426   ABC     AAA
TD-438426   ABC     BBB

And

In:

KEY         TEST    SYSTEM
TD-438426   ABC     AAA
TD-438426   CBA     AAA

Out:

KEY         TEST    SYSTEM
TD-438426   ABC     AAA

CodePudding user response:

Like @mcsioni mentioned in the comments, what you are looking for is df.drop_duplicates()

Also, it is useful to understand two arguments of this method, namely, subset and keep.

E.g., You want to retain only unique values in the KEY column and keep the first SYSTEM value for each unique KEY, you'd do:

df.drop_duplicates(subset=['KEY'], keep='first')

If you just used df.drop_duplicates() without any arguments, the subset will be all the columns, which is what your desired output is asking for.

EDIT

To keep up with your new requirement, do this:

df.drop_duplicates(subset=['KEY', 'SYSTEM'], keep='first')

Note: The default behavior for the keep argument is 'first' but doesn't hurt to be explicit when working with high-level libraries like pandas.

  • Related