Pandas, remove consecutive duplicates of value ONLY


Current dataframe is as follows:

import pandas as pd
df = pd.read_csv('filename.csv', delimiter=',')

print(df)
idx   uniqueID      String 

0        1           'hello'
1        1           'goodbye'
2        1           'goodbye'
3        1           'happy'
4        2           'hello'
5        2           'hello'
6        2           'hello'
7        3           'goodbye' 
8        3           'hello'
9        3           'hello'
10       3           'hello'
11       3           'hello'
12       3           'hi' 
13       4           'goodbye'
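
For anyone reproducing this without filename.csv, the frame can be built directly. A minimal sketch, assuming the column is named 'String' as in the answers below and that the quotes are part of the stored values:

import pandas as pd

# Reconstruction of the sample data shown above; the quotes are kept as part
# of the strings, matching the "'hello'" comparison in the first answer below.
df = pd.DataFrame({
    'uniqueID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4],
    'String': ["'hello'", "'goodbye'", "'goodbye'", "'happy'",
               "'hello'", "'hello'", "'hello'",
               "'goodbye'", "'hello'", "'hello'", "'hello'", "'hello'", "'hi'",
               "'goodbye'"],
})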

Expected Output:

idx   uniqueID      String 

0        1           'hello'
1        1           'goodbye'
2        1           'goodbye'
3        1           'happy'
4        2           'hello'
5        2           'hello'
**6        2           'hello'**
7        3           'goodbye' 
8        3           'hello'
9        3           'hello'
**10       3           'hello'**
**11       3           'hello'**
12       3           'hi' 
13       4           'goodbye'

The rows marked in bold are the ones that should be removed.

Question: How do I remove the 3rd (or greater) consecutive duplicate of a specific value only, within the same uniqueID?

What I've tried to do thus far:

df = df[(df['String '].shift() != df['String ']) | (df['uniqueID'].shift() != df['uniqueID'])]

I'm not sure how to target the specific string 'hello' only when it is the 3rd or greater consecutive duplicate. The line above removes all consecutive duplicates, regardless of value.

CodePudding user response:

You can group by 'uniqueID', then within each group use shift(1) and shift(2) to compare each row with the previous row and the row two back; a row is dropped only when it matches both of them and is 'hello'.

# Within each uniqueID group, drop rows that equal both of the two
# preceding rows and are 'hello'; keep everything else.
msk = df.groupby('uniqueID')['String'].apply(lambda x: ~((x == x.shift()) & (x == x.shift(2)) & (x == "'hello'")))
df = df[msk]
print(df)

Output:

    idx  uniqueID      String
0     0         1     'hello'
1     1         1   'goodbye'
2     2         1   'goodbye'
3     3         1     'happy'
4     4         2     'hello'
5     5         2     'hello'
7     7         3  'goodbye' 
9     9         3     'hello'
12   12         3       'hi' 
13   13         4   'goodbye'
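
If the hard-coded "'hello'" needs to change, the same mask can be parameterised. A minimal sketch building on the answer above; the TARGET and keep_mask names are illustrative, not from the original, and group_keys=False is passed so the returned mask keeps the original row index:

TARGET = "'hello'"   # the quotes are part of the stored strings in this sample

def keep_mask(s, value=TARGET):
    # Keep a row unless it equals both of the two preceding rows AND equals `value`.
    return ~((s == s.shift()) & (s == s.shift(2)) & (s == value))

# group_keys=False keeps the original index so the boolean mask aligns with df
msk = df.groupby('uniqueID', group_keys=False)['String'].apply(keep_mask)
out = df[msk]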

CodePudding user response:

You can use groupby.cumcount to number each row within its run of consecutive duplicates, then use that count for boolean indexing:

N = 3

# Start a new run whenever the String changes (or the uniqueID changes).
group = df['String'].ne(df.groupby('uniqueID')['String'].shift()).cumsum()

# Keep only the first N-1 rows of each run.
m = df.groupby(group).cumcount().lt(N - 1)

out = df[m]
print(out)

Output:

    idx  uniqueID     String
0     0         1    'hello'
1     1         1  'goodbye'
2     2         1  'goodbye'
3     3         1    'happy'
4     4         2    'hello'
5     5         2    'hello'
7     7         3  'goodbye'
8     8         3    'hello'
9     9         3    'hello'
12   12         3       'hi'
13   13         4  'goodbye'

If the data is already sorted/grouped by uniqueID, then you can simplify to:

N = 3

# Count duplicates per (uniqueID, String) pair and keep the first N-1 of each.
m = df.groupby(['uniqueID', 'String']).cumcount().lt(N - 1)

out = df[m]
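
Both masks above trim the 3rd and later rows of any run, which happens to match the expected output only because 'hello' is the sole value repeated three or more times in this data. If runs of 'hello' alone should be trimmed, as the question asks, the keep-mask can be combined with a value check; a small sketch reusing the m defined above (the only_hello name is illustrative):

# Keep a row if it sits within the first N-1 rows of its run, or if it is not 'hello'.
only_hello = df['String'].eq("'hello'")
out = df[m | ~only_hello]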