Home > Enterprise >  Pandas, Remove consecutive duplicates grouped by unique identifier
Pandas, Remove consecutive duplicates grouped by unique identifier

Time:09-15

Current dataframe is as follows:

df = pd.read_csv('filename.csv', delimiter=',')

print(df)
idx   uniqueID      String 

0        1           'hello'
1        1           'goodbye'
2        1           'goodbye'
3        1           'happy'
4        2           'hello'
5        2           'hello'
6        2           'goodbye'
7        3           'goodbye' 
8        3           'hello'
9        3           'hello'
10       4           'hello'
11       5           'goodbye'

Expected Output:

idx   uniqueID      String 

0        1           'hello'
1        1           'goodbye'
3        1           'happy'
4        2           'hello'
6        2           'goodbye'
7        3           'goodbye' 
8        3           'hello'
10       4           'hello'
11       5           'hello'

Question: How do I remove the consecutive duplicates only of the same uniqueID?

What I've tried to do thus far:

df = df[(df['String '].shift() != df['String ']) | (df['uniqueID'] != df['uniqueID'])]

I'm not sure what case I need to include to ensure it looks specifically at the uniqueID. Any and all suggestions are appreciated. Thanks

CodePudding user response:

use drop_duplicates with the columns list you like to consider in determining duplicate

df.drop_duplicates(subset=['uniqueID','String'])
    idx     uniqueID    String
0   0   1   'hello'
1   1   1   'goodbye'
3   3   1   'happy'
4   4   2   'hello'
6   6   2   'goodbye'
7   7   3   'goodbye'
8   8   3   'hello'
10  10  4   'hello'
11  11  5   'goodbye'

CodePudding user response:

You forget to shift the uniqueID column to ensure same uniqueID. But shifting only works if uniqueID is consecutive, you can sort by uniqueID column first to ensure it is consecutive.

out = (df.sort_values('uniqueID')
       [lambda df: (df['String'].shift() != df['String']) | (df['uniqueID'].shift() != df['uniqueID'])]
       #.sort_index() # since sort values changes index order, you can optionally convert back to original with sort_index
      )
print(out)

    idx  uniqueID     String
0     0         1    'hello'
1     1         1  'goodbye'
3     3         1    'happy'
4     4         2    'hello'
6     6         2  'goodbye'
7     7         3  'goodbye'
8     8         3    'hello'
10   10         4    'hello'
11   11         5  'goodbye'
  • Related