How do I find the hour with most rides taken?-CodePudding

I have a dataset below regarding the start time commuters book a car. I'd like to

create a function to discretise all bookings into their respective hours,
and find the hour (in AM/PM format) with the most bookings

The pandas dataframe looks like this:

BookingID	RideStart
01	2022-01-01 00:07:52.943
02	2022-01-01 00:09:31.745
03	2022-01-01 00:14:37.187
04	2022-01-02 00:18:09.127

Desired output: printf("{x} am/pm is the the hour with the highest bookings made")

I tried the pd.grouper method but it dosent work, with an error "Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex".

Would really appreciate your help to solve this, thank you!

CodePudding user response：

You can use pd.DatetimeIndex for this. And then apply s.value_counts, followed by s.idxmax:

import pandas as pd

# just adding a couple of different hours
data = {'BookingID': {0: 1, 1: 2, 2: 3, 3: 4},
 'RideStart': {0: '2022-01-01 00:07:52.943',
  1: '2022-01-01 18:09:31.745',
  2: '2022-01-01 18:14:37.187',
  3: '2022-01-02 19:18:09.127'}}

df = pd.DataFrame(data)
print(df)

   BookingID                RideStart
0          1  2022-01-01 00:07:52.943
1          2  2022-01-01 18:09:31.745
2          3  2022-01-01 18:14:37.187
3          4  2022-01-02 19:18:09.127

max_hour = pd.DatetimeIndex(df['RideStart']).hour.value_counts().idxmax()
print(f'{max_hour} {"pm" if max_hour>12 else "am"} is the hour with the highest bookings made')

6 pm is the hour with the highest bookings made

CodePudding user response：

You don't need the pd.grouper method, pandas already has tools for resampling values based on Datetime. The problem is, the dataframe doesn't currently have Datetime values, just strings. You can use the pd.to_datetime() method as described in this tutorial, and then downsample your data to the hour.

>>> a = ['2022-01-01 00:07:52.943',
    '2022-01-01 00:09:31.745',
    '2022-01-01 01:12:37.187',
    '2022-01-01 02:45:42.834',
    '2022-01-01 02:56:58.152']

>>> df = pd.DataFrame(data=a)
>>> print(df.head())
                         0
0  2022-01-01 00:07:52.943
1  2022-01-01 00:09:31.745
2  2022-01-01 01:12:37.187
3  2022-01-01 02:45:42.834
4  2022-01-01 02:56:58.152

>>> df.index = pd.to_datetime(df[0])
>>> df.resample('H').count()[0] # [0] is to get rid of extra, all-containing column
0   
2022-01-01 00:00:00 2
2022-01-01 01:00:00 1
2022-01-01 02:00:00 2