Home > Net >  Python Pandas: drop duplicate rows (keep only first row) based on same id and same date
Python Pandas: drop duplicate rows (keep only first row) based on same id and same date

Time:10-18

I have a dataframe that looks like this:

id seen year month day dayname
f907942e330ac3653f8a9bd655770872 2021-06-02 16:34:56 2021 6 1 Monday
042b60106231fa8a8e43dd750432d5bc 2021-06-02 16:13:29 2021 6 1 Monday

The ID column has repeated values, due to the fact that it creates an entry when a user enters a building and a second one when it leaves it.

What I want to do is delete all the repeated id values for each day. For example, a person can go to that building on monday 01/01/2021 and again on wednesday 01/03/2021, given that, 4 entries are created, 2 for monday and 2 for wednesday, I just want to keep one for each specific date.

As you can imagine, I've tried things like df_filtered = df.sort_values(["seen"]).drop_duplicates("id") but it doesn't work for me as it deletes ALL of the duplicated values across the dataframe.

How can I drop the duplicated ID's (and their respective rows) on each single day without affecting the other days?

Thanks in advance.

CodePudding user response:

You can also try:

#Your Data frame:

df=pd.DataFrame({'id':['f907942e330ac3653f8a9bd655770872','042b60106231fa8a8e43dd750432d5bc'],\
                 'seen':['2021-06-02 16:34:56','2021-06-02 16:13:29'],
                 'year':['2021','2021'],\
                'month':[6,6],'day':[1,1],'dayname':['Monday','Monday']})

#Use drop_duplicates

df_nodups=df.drop_duplicates(subset=['id','year','month','day'])

CodePudding user response:

You can try group by id and the date only (without time) of column seen using pd.to_datetime dt.normalize(), and use GroupBy.first() to get the first entry of each group, as follows:

# Optionally convert to datetime if not already in datetime format
df['seen'] = pd.to_datetime(df['seen'])

df.groupby(['id', df['seen'].dt.normalize()], as_index=False, sort=False).first()

Demo

Data Input:

(Added some rows for more comprehensive testing):

df

                                 id                 seen  year  month  day    dayname
0  f907942e330ac3653f8a9bd655770872  2021-06-02 16:34:56  2021      6    2     Monday
1  f907942e330ac3653f8a9bd655770872  2021-06-02 17:54:56  2021      6    2     Monday
2  042b60106231fa8a8e43dd750432d5bc  2021-06-02 16:13:29  2021      6    2     Monday
3  f907942e330ac3653f8a9bd655770872  2021-06-04 16:22:56  2021      6    4  Wednesday
4  f907942e330ac3653f8a9bd655770872  2021-06-04 17:43:56  2021      6    4  Wednesday

Output:

                                 id                 seen  year  month  day    dayname
0  f907942e330ac3653f8a9bd655770872  2021-06-02 16:34:56  2021      6    2     Monday
1  042b60106231fa8a8e43dd750432d5bc  2021-06-02 16:13:29  2021      6    2     Monday
2  f907942e330ac3653f8a9bd655770872  2021-06-04 16:22:56  2021      6    4  Wednesday
  • Related