I have a dataframe that looks like this:
id | seen | year | month | day | dayname |
---|---|---|---|---|---|
f907942e330ac3653f8a9bd655770872 | 2021-06-02 16:34:56 | 2021 | 6 | 1 | Monday |
042b60106231fa8a8e43dd750432d5bc | 2021-06-02 16:13:29 | 2021 | 6 | 1 | Monday |
The ID column has repeated values, due to the fact that it creates an entry when a user enters a building and a second one when it leaves it.
What I want to do is delete all the repeated id values for each day. For example, a person can go to that building on monday 01/01/2021 and again on wednesday 01/03/2021, given that, 4 entries are created, 2 for monday and 2 for wednesday, I just want to keep one for each specific date.
As you can imagine, I've tried things like df_filtered = df.sort_values(["seen"]).drop_duplicates("id")
but it doesn't work for me as it deletes ALL of the duplicated values across the dataframe.
How can I drop the duplicated ID's (and their respective rows) on each single day without affecting the other days?
Thanks in advance.
CodePudding user response:
You can also try:
#Your Data frame:
df=pd.DataFrame({'id':['f907942e330ac3653f8a9bd655770872','042b60106231fa8a8e43dd750432d5bc'],\
'seen':['2021-06-02 16:34:56','2021-06-02 16:13:29'],
'year':['2021','2021'],\
'month':[6,6],'day':[1,1],'dayname':['Monday','Monday']})
#Use drop_duplicates
df_nodups=df.drop_duplicates(subset=['id','year','month','day'])
CodePudding user response:
You can try group by id
and the date only (without time) of column seen
using pd.to_datetime
dt.normalize()
, and use GroupBy.first()
to get the first entry of each group, as follows:
# Optionally convert to datetime if not already in datetime format
df['seen'] = pd.to_datetime(df['seen'])
df.groupby(['id', df['seen'].dt.normalize()], as_index=False, sort=False).first()
Demo
Data Input:
(Added some rows for more comprehensive testing):
df
id seen year month day dayname
0 f907942e330ac3653f8a9bd655770872 2021-06-02 16:34:56 2021 6 2 Monday
1 f907942e330ac3653f8a9bd655770872 2021-06-02 17:54:56 2021 6 2 Monday
2 042b60106231fa8a8e43dd750432d5bc 2021-06-02 16:13:29 2021 6 2 Monday
3 f907942e330ac3653f8a9bd655770872 2021-06-04 16:22:56 2021 6 4 Wednesday
4 f907942e330ac3653f8a9bd655770872 2021-06-04 17:43:56 2021 6 4 Wednesday
Output:
id seen year month day dayname
0 f907942e330ac3653f8a9bd655770872 2021-06-02 16:34:56 2021 6 2 Monday
1 042b60106231fa8a8e43dd750432d5bc 2021-06-02 16:13:29 2021 6 2 Monday
2 f907942e330ac3653f8a9bd655770872 2021-06-04 16:22:56 2021 6 4 Wednesday