Current dataframe is as follows:
df = pd.read_csv('filename.csv', delimiter=',')
print(df)
idx  uniqueID   String
0    1          'hello'
1    1          'goodbye'
2    1          'goodbye'
3    1          'happy'
4    2          'hello'
5    2          'hello'
6    2          'goodbye'
7    3          'goodbye'
8    3          'hello'
9    3          'hello'
10   4          'hello'
11   5          'goodbye'
Expected Output:
idx  uniqueID   String
0    1          'hello'
1    1          'goodbye'
3    1          'happy'
4    2          'hello'
6    2          'goodbye'
7    3          'goodbye'
8    3          'hello'
10   4          'hello'
11   5          'goodbye'
Question: How do I remove the consecutive duplicates only of the same uniqueID?
What I've tried to do thus far:
df = df[(df['String '].shift() != df['String ']) | (df['uniqueID'] != df['uniqueID'])]
I'm not sure what case I need to include to ensure it looks specifically at the uniqueID. Any and all suggestions are appreciated. Thanks
CodePudding user response:
Use drop_duplicates with the list of columns you want to consider when determining duplicates. Note that this removes every repeated (uniqueID, String) pair, not only consecutive ones; for the sample data the result happens to be the same.
df.drop_duplicates(subset=['uniqueID','String'])
    idx  uniqueID   String
0     0         1  'hello'
1     1         1  'goodbye'
3     3         1  'happy'
4     4         2  'hello'
6     6         2  'goodbye'
7     7         3  'goodbye'
8     8         3  'hello'
10   10         4  'hello'
11   11         5  'goodbye'
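For reference, here is a runnable sketch of this approach on the question's data (the column names uniqueID and String are taken from the post; the dataframe is rebuilt inline rather than read from the CSV):

```python
import pandas as pd

# Rebuild the question's dataframe (column names assumed from the post)
df = pd.DataFrame({
    "uniqueID": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 5],
    "String": ["hello", "goodbye", "goodbye", "happy",
               "hello", "hello", "goodbye",
               "goodbye", "hello", "hello",
               "hello", "goodbye"],
})

# Keep only the first occurrence of each (uniqueID, String) pair
out = df.drop_duplicates(subset=["uniqueID", "String"])
print(out)
```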
CodePudding user response:
You forgot to shift the uniqueID column as well, so your second condition compares each row's ID with itself and never fires. Shifting also only works if rows with the same uniqueID are consecutive; you can sort by the uniqueID column first to guarantee that.
out = (df.sort_values('uniqueID')
         [lambda df: (df['String'].shift() != df['String'])
                     | (df['uniqueID'].shift() != df['uniqueID'])]
       # .sort_index()  # sort_values changes the row order; uncomment to restore the original index order
      )
print(out)
    idx  uniqueID   String
0     0         1  'hello'
1     1         1  'goodbye'
3     3         1  'happy'
4     4         2  'hello'
6     6         2  'goodbye'
7     7         3  'goodbye'
8     8         3  'hello'
10   10         4  'hello'
11   11         5  'goodbye'
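If you'd rather not reorder the rows at all, a variant (a sketch, with the same assumed column names and inline sample data) compares each row against the previous row of its own uniqueID group via groupby().shift(); this drops only consecutive repeats within a group, even when the groups are interleaved:

```python
import pandas as pd

# Same sample data as in the question (column names assumed from the post)
df = pd.DataFrame({
    "uniqueID": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 5],
    "String": ["hello", "goodbye", "goodbye", "happy",
               "hello", "hello", "goodbye",
               "goodbye", "hello", "hello",
               "hello", "goodbye"],
})

# Within each uniqueID group, compare each String with the previous one;
# the first row of a group compares against NaN and is always kept.
out = df[df["String"] != df.groupby("uniqueID")["String"].shift()]
print(out)
```

Because no sorting happens, the original index order is preserved and no sort_index() step is needed.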