Current dataframe is as follows:
df = pd.read_csv('filename.csv', delimiter=',')
print(df)
idx uniqueID String
0 1 'hello'
1 1 'goodbye'
2 1 'goodbye'
3 1 'happy'
4 2 'hello'
5 2 'hello'
6 2 'hello'
7 3 'goodbye'
8 3 'hello'
9 3 'hello'
10 3 'hello'
11 3 'hello'
12 3 'hi'
13 4 'goodbye'
Expected Output:
idx uniqueID String
0 1 'hello'
1 1 'goodbye'
2 1 'goodbye'
3 1 'happy'
4 2 'hello'
5 2 'hello'
**6 2 'hello'**
7 3 'goodbye'
8 3 'hello'
9 3 'hello'
**10 3 'hello'**
**11 3 'hello'**
12 3 'hi'
13 4 'goodbye'
Those in bold should be removed
Question: How do I remove the 3rd (or greater) consecutive duplicates of specific value only of the same uniqueID?
What I've tried to do thus far:
df = df[(df['String '].shift() != df['String ']) | (df['uniqueID'].shift() != df['uniqueID'])]
I'm not sure how to get the specific string of 'hello', where it's the 3rd or greater consecutive duplicate. This gets all consecutive duplicates, nonspecific.
CodePudding user response:
You can use groupby
base 'uniqueID'
then each group use shift(1)
and shift(2)
and check each row with previous and two previous rows and keep rows that are different from previous and two previous rows.
msk = df.groupby('uniqueID')['String'].apply(lambda x: ~((x==x.shift()) & (x==x.shift(2)) & (x=="'hello'")))
df = df[msk]
print(df)
Output:
idx uniqueID String
0 0 1 'hello'
1 1 1 'goodbye'
2 2 1 'goodbye'
3 3 1 'happy'
4 4 2 'hello'
5 5 2 'hello'
7 7 3 'goodbye'
9 9 3 'hello'
12 12 3 'hi'
13 13 4 'goodbye'
CodePudding user response:
You can use groupby.cumcount
to get the number of the duplicate and use it for boolean indexing:
N = 3
group = df['String'].ne(df.groupby('uniqueID')['String'].shift()).cumsum()
m = df.groupby(group).cumcount().lt(N-1)
print(df[m])
out = df[m]
output:
idx uniqueID String
0 0 1 'hello'
1 1 1 'goodbye'
2 2 1 'goodbye'
3 3 1 'happy'
4 4 2 'hello'
5 5 2 'hello'
7 7 3 'goodbye'
8 8 3 'hello'
9 9 3 'hello'
12 12 3 'hi'
13 13 4 'goodbye'
If the data is already sorted/grouped by uniqueID, then you can simplify to:
N = 3
m = df.groupby(['uniqueID', 'String']).cumcount().lt(N-1)
out = df[m]