Home > Mobile >  Python - Filter dataframe based on condition about different time for each of dates
Python - Filter dataframe based on condition about different time for each of dates

Time:08-27

I have date_time_df dataframe with date_time column as like this one:

import datetime
date_time_df = pd.DataFrame({'date_time':[datetime.datetime(2020, 7, 10, 9, 15, 3),
 datetime.datetime(2020, 7, 9, 9, 50, 0),
 datetime.datetime(2020, 7, 9, 10, 50, 3),
 datetime.datetime(2020, 7, 9, 16, 50, 0),
 datetime.datetime(2020, 7, 9, 20, 30, 0),
 datetime.datetime(2020, 7, 8, 9, 50, 0),
 datetime.datetime(2020, 7, 8, 10, 50, 3),
 datetime.datetime(2020, 7, 8, 16, 50, 0),
 datetime.datetime(2020, 7, 8, 20, 30, 0),
 datetime.datetime(2020, 7, 7, 10, 50, 3),
 datetime.datetime(2020, 7, 7, 16, 50, 0),
 datetime.datetime(2020, 7, 6, 16, 50, 0),
 datetime.datetime(2020, 7, 6, 20, 30, 0),
 datetime.datetime(2020, 7, 5, 9, 50, 0),
 datetime.datetime(2020, 7, 5, 20, 30, 0),
 datetime.datetime(2020, 7, 4, 16, 50, 0),
 datetime.datetime(2020, 7, 3, 9, 50, 0),
 datetime.datetime(2020, 7, 3, 10, 50, 3),
 datetime.datetime(2020, 7, 3, 16, 50, 0),
 datetime.datetime(2020, 7, 2, 9, 50, 0),
 datetime.datetime(2020, 7, 2, 17, 0, 0),
 datetime.datetime(2020, 7, 1, 10, 45, 3),
 datetime.datetime(2020, 7, 1, 17, 0, 0),
 datetime.datetime(2020, 7, 1, 20, 30, 0)]})

I need to sort it by the following conditions:

  1. for each date select only one time-slot around 10 hours (time >= 10 hours);
  2. in case if for some date there is only one timeslot -> keep it in case if it after 10 h (i.e time >=10 hours);
  3. if date doesn't have timeslot around 10 h, include the next one (it might be equals 16:50 or 20:30 or 17:00 in the above example)

I tried to solve it by the following steps:

  1. split date_time_df data frame to date, time columns. So, date_time_df has 3 columns as like ['date_time', 'date', 'time']
  2. create new gr_df as a result of:
gr_df  = date_time_df.groupby('date')['time'].nunique().reset_index()
gr_df.rename(columns={'time':'count_timeslots'}, inplace=True)
  1. filter gr_df and select rows which have only one timeslot:
dates_with_one_timeslot = list(gr_df[gr_df['count_timeslots']==1].date_time.unique())
  1. filter gr_df and select rows which have slot around 10 hours:
req_timeslots = list(date_time_df[date_time_df['time'].dt.strftime('%H')=='10'].date_time.unique())
  1. I don't have idea of How to cover condition for dates which doesn't have times lot around 10 hours, but it has next one after 10h (time might be equals 16:50 or 20:30 or 17:00)

Correct output:

result_lst = [
 datetime.datetime(2020, 7, 9, 10, 50, 3),
 datetime.datetime(2020, 7, 8, 10, 50, 3),
 datetime.datetime(2020, 7, 7, 10, 50, 3),
 datetime.datetime(2020, 7, 6, 16, 50, 0),
 datetime.datetime(2020, 7, 5, 20, 30, 0),
 datetime.datetime(2020, 7, 4, 16, 50, 0),
 datetime.datetime(2020, 7, 3, 10, 50, 3),
 datetime.datetime(2020, 7, 2, 17, 0, 0),
 datetime.datetime(2020, 7, 1, 10, 45, 3)
]

I'd appreciate for any idea, Thanks!

CodePudding user response:

Filter the rows where hour is greater than or equal to 10 then sort the values by date_time in ascending order and drop duplicates by date component

m = df['date_time'].dt.hour >= 10
df['date'] = df['date_time'].dt.date
df[m].sort_values('date_time').drop_duplicates('date')

             date_time        date
21 2020-07-01 10:45:03  2020-07-01
20 2020-07-02 17:00:00  2020-07-02
17 2020-07-03 10:50:03  2020-07-03
15 2020-07-04 16:50:00  2020-07-04
14 2020-07-05 20:30:00  2020-07-05
11 2020-07-06 16:50:00  2020-07-06
9  2020-07-07 10:50:03  2020-07-07
6  2020-07-08 10:50:03  2020-07-08
2  2020-07-09 10:50:03  2020-07-09
  • Related