How to drop duplicates in pandas but keep more than the first

Time:08-27

Let's say I have a pandas DataFrame:

import pandas as pd

df = pd.DataFrame({'a': [1,2,2,2,2,1,1,1,2,2]})
>> df
   a
0  1
1  2
2  2
3  2
4  2
5  1
6  1
7  1
8  2
9  2

I want to drop duplicates when a run of repeated values exceeds a certain threshold n, keeping at most n of them. Let's say that n=3. Then, my target DataFrame is

>> df
   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  2
9  2

EDIT: Each set of consecutive repetitions is considered separately. In this example, rows 8 and 9 should be kept.

CodePudding user response:

You can create a unique value for each consecutive group, then use groupby and head:


import numpy as np

group_value = np.cumsum(df.a.shift() != df.a)
df.groupby(group_value).head(3)

# result:

   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  2
9  2
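To see how the trick works, you can print the run IDs that the cumsum produces; a small sketch using the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})

# A new group starts whenever the value differs from the previous row
group_value = np.cumsum(df.a.shift() != df.a)
print(group_value.tolist())  # [1, 2, 2, 2, 2, 3, 3, 3, 4, 4]

# Keep at most the first 3 rows of each consecutive run
result = df.groupby(group_value).head(3)
print(result.index.tolist())  # [0, 1, 2, 3, 5, 6, 7, 8, 9]
```

Only row 4 (the fourth consecutive 2) is dropped; the later run of 2s at rows 8-9 forms its own group and survives.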

CodePudding user response:

Use boolean indexing with groupby.cumcount:

N = 3
df[df.groupby('a').cumcount().lt(N)]

Output (note that rows 8 and 9 are dropped here, because cumcount numbers occurrences of each value over the whole column, not per consecutive run):

   a
0  1
1  2
2  2
3  2
5  1
6  1
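Since `cumcount` numbers each value globally rather than per run, printing the counter for the question's data makes the behavior concrete; a quick sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
N = 3

# cumcount numbers each occurrence of a value across the whole column
counts = df.groupby('a').cumcount()
print(counts.tolist())  # [0, 0, 1, 2, 3, 1, 2, 3, 4, 5]

# counts < N keeps only the first N occurrences of each value overall
print(df[counts.lt(N)].index.tolist())  # [0, 1, 2, 3, 5, 6]
```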

For the last N:

df[df.groupby('a').cumcount(ascending=False).lt(N)]
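As a quick check of the ascending=False variant against the question's data (a sketch, printing the surviving indices):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
N = 3

# cumcount(ascending=False) numbers occurrences from the end of the column,
# so the mask keeps the LAST N occurrences of each value
mask = df.groupby('a').cumcount(ascending=False).lt(N)
print(df[mask].index.tolist())  # [4, 5, 6, 7, 8, 9]
```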

To apply it on consecutive repetitions only:

df[df.groupby(df['a'].ne(df['a'].shift()).cumsum()).cumcount().lt(3)]

Output:

   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1  # this is #3 of the local group
8  2
9  2

Advantages of boolean indexing

You can use it for many other operations, such as setting values or masking:

group = df['a'].ne(df['a'].shift()).cumsum()
m = df.groupby(group).cumcount().lt(N)

df.where(m)
     a
0  1.0
1  2.0
2  2.0
3  2.0
4  NaN
5  1.0
6  1.0
7  1.0
8  2.0
9  2.0
df.loc[~m] = -1

   a
0  1
1  2
2  2
3  2
4 -1
5  1
6  1
7  1
8  2
9  2
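Putting the mask approach together, it generalizes naturally to a small reusable helper; the name `limit_consecutive` is illustrative, not from the original answers:

```python
import pandas as pd

def limit_consecutive(s: pd.Series, n: int) -> pd.Series:
    """Boolean mask: True for at most the first n rows of each consecutive run."""
    run_id = s.ne(s.shift()).cumsum()          # label each consecutive run
    return s.groupby(run_id).cumcount().lt(n)  # position within run < n

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
m = limit_consecutive(df['a'], 3)
print(df[m].index.tolist())  # [0, 1, 2, 3, 5, 6, 7, 8, 9]
```

The same mask can then feed `df[m]`, `df.where(m)`, or `df.loc[~m] = ...` as shown above.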