how to use drop_duplicates() with a condition?-CodePudding

I have a dataframe consisting of two columns. Column A consists of strings, column B consists of numbers. Column A has duplicates that I want to remove. However, I only want to retain those duplicates that have the highest number in column B. This is an example of how my dataframe looks like:

 columnA | columnB
---------------------
 a       | 1
 a       | 2
 b       | 2
 b       | 1

What I want is this:

 columnA | columnB
---------------------
 a       | 2
 b       | 2

using drop_duplicates()

CodePudding user response：

You can sort your dataframe in descending order based on 'columnB', and use drop_duplicates() on your columnA keeping the first occurence:

df.sort_values(by='columnB',ascending=False).drop_duplicates('columnA',keep='first')

   columnA  columnB
13       d      555
27       h        6
16       f        6
6        c        3
1        a        2
2        b        2
15       e        1

Sample data (slightly enhanced than your sample):

df.to_dict()

{'columnA': {0: 'a',
  1: 'a',
  2: 'b',
  3: 'b',
  4: 'c',
  5: 'c',
  6: 'c',
  7: 'd',
  8: 'd',
  9: 'd',
  10: 'd',
  11: 'd',
  12: 'd',
  13: 'd',
  14: 'e',
  15: 'e',
  16: 'f',
  17: 'f',
  18: 'f',
  19: 'f',
  20: 'f',
  21: 'f',
  22: 'h',
  23: 'h',
  24: 'h',
  25: 'h',
  26: 'h',
  27: 'h'},
 'columnB': {0: 1,
  1: 2,
  2: 2,
  3: 1,
  4: 1,
  5: 2,
  6: 3,
  7: 33,
  8: 223,
  9: 3,
  10: 2,
  11: 1,
  12: 3,
  13: 555,
  14: 1,
  15: 1,
  16: 6,
  17: 5,
  18: 4,
  19: 3,
  20: 2,
  21: 1,
  22: 1,
  23: 2,
  24: 3,
  25: 4,
  26: 5,
  27: 6}}

CodePudding user response：

Just group by 'A' and take the max 'B'

df.groupby('A').max()

CodePudding user response：

Grouping your dataframe by column a, taking only the max of column b and creating a new dataframe by this method can also help as it retains the original dataframe as it is:

df.groupby('columnA')['columB'].max()