I have a dataset below regarding the start time commuters book a car. I'd like to
- create a function to discretise all bookings into their respective hours,
- and find the hour (in AM/PM format) with the most bookings
The pandas dataframe looks like this:
BookingID | RideStart |
---|---|
01 | 2022-01-01 00:07:52.943 |
02 | 2022-01-01 00:09:31.745 |
03 | 2022-01-01 00:14:37.187 |
04 | 2022-01-02 00:18:09.127 |
Desired output:
printf("{x} am/pm is the the hour with the highest bookings made")
I tried the pd.grouper method but it dosent work, with an error "Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex".
Would really appreciate your help to solve this, thank you!
CodePudding user response:
You can use pd.DatetimeIndex
for this. And then apply s.value_counts
, followed by s.idxmax
:
import pandas as pd
# just adding a couple of different hours
data = {'BookingID': {0: 1, 1: 2, 2: 3, 3: 4},
'RideStart': {0: '2022-01-01 00:07:52.943',
1: '2022-01-01 18:09:31.745',
2: '2022-01-01 18:14:37.187',
3: '2022-01-02 19:18:09.127'}}
df = pd.DataFrame(data)
print(df)
BookingID RideStart
0 1 2022-01-01 00:07:52.943
1 2 2022-01-01 18:09:31.745
2 3 2022-01-01 18:14:37.187
3 4 2022-01-02 19:18:09.127
max_hour = pd.DatetimeIndex(df['RideStart']).hour.value_counts().idxmax()
print(f'{max_hour} {"pm" if max_hour>12 else "am"} is the hour with the highest bookings made')
6 pm is the hour with the highest bookings made
CodePudding user response:
You don't need the pd.grouper
method, pandas already has tools for resampling values based on Datetime. The problem is, the dataframe doesn't currently have Datetime values, just strings. You can use the pd.to_datetime()
method as described in this tutorial, and then downsample your data to the hour.
>>> a = ['2022-01-01 00:07:52.943',
'2022-01-01 00:09:31.745',
'2022-01-01 01:12:37.187',
'2022-01-01 02:45:42.834',
'2022-01-01 02:56:58.152']
>>> df = pd.DataFrame(data=a)
>>> print(df.head())
0
0 2022-01-01 00:07:52.943
1 2022-01-01 00:09:31.745
2 2022-01-01 01:12:37.187
3 2022-01-01 02:45:42.834
4 2022-01-01 02:56:58.152
>>> df.index = pd.to_datetime(df[0])
>>> df.resample('H').count()[0] # [0] is to get rid of extra, all-containing column
0
2022-01-01 00:00:00 2
2022-01-01 01:00:00 1
2022-01-01 02:00:00 2