I have a dataframe consisting of two columns. Column A consists of strings, column B consists of numbers. Column A has duplicates that I want to remove. However, I only want to retain those duplicates that have the highest number in column B. This is an example of how my dataframe looks like:
columnA | columnB
---------------------
a | 1
a | 2
b | 2
b | 1
What I want is this:
columnA | columnB
---------------------
a | 2
b | 2
using drop_duplicates()
CodePudding user response:
You can sort your dataframe in descending
order based on 'columnB', and use drop_duplicates()
on your columnA keeping the first occurence:
df.sort_values(by='columnB',ascending=False).drop_duplicates('columnA',keep='first')
columnA columnB
13 d 555
27 h 6
16 f 6
6 c 3
1 a 2
2 b 2
15 e 1
Sample data (slightly enhanced than your sample):
df.to_dict()
{'columnA': {0: 'a',
1: 'a',
2: 'b',
3: 'b',
4: 'c',
5: 'c',
6: 'c',
7: 'd',
8: 'd',
9: 'd',
10: 'd',
11: 'd',
12: 'd',
13: 'd',
14: 'e',
15: 'e',
16: 'f',
17: 'f',
18: 'f',
19: 'f',
20: 'f',
21: 'f',
22: 'h',
23: 'h',
24: 'h',
25: 'h',
26: 'h',
27: 'h'},
'columnB': {0: 1,
1: 2,
2: 2,
3: 1,
4: 1,
5: 2,
6: 3,
7: 33,
8: 223,
9: 3,
10: 2,
11: 1,
12: 3,
13: 555,
14: 1,
15: 1,
16: 6,
17: 5,
18: 4,
19: 3,
20: 2,
21: 1,
22: 1,
23: 2,
24: 3,
25: 4,
26: 5,
27: 6}}
CodePudding user response:
Just group by 'A' and take the max 'B'
df.groupby('A').max()
CodePudding user response:
Grouping your dataframe by column a
, taking only the max of column b
and creating a new dataframe by this method can also help as it retains the original dataframe as it is:
df.groupby('columnA')['columB'].max()