I have a Pandas Dataframe (data) with a column ['Date'] with values like "yyyy-mm-dd HH:MM:SS" that I converted into Pandas Datetime format.
data['Date'] = pd.to_datetime(data['Date'])
How can I create a new categorical column ['Time'] with categories 'Early', 'Mid-day', 'Late' based on the hour of each day?
I tried this:
def time(x):
if '03:00:00' < x <= '11:00:00':
return 'Early'
elif '11:00:00' < x <= '17:00:00':
return 'Mid-day'
return 'Late'
data['Time'] = data['Date'].dt.time.apply(time)
But I get this error: " TypeError: '<' not supported between instances of 'str' and 'datetime.time' "
I would appreciate your help!
CodePudding user response:
You can use pandas.cut
. It is however a bit tricky as you need to convert first your datetime to timedelta:
df = pd.DataFrame({'date': ['2022-04-27 01:00:00', '2022-04-27 04:00:00',
'2022-04-27 12:00:00', '2022-04-27 17:00:00']})
# define bins and labels
bins = ['00:00:00', '03:00:00', '11:00:00', '17:00:00', '23:59:59']
labels = ['Late', 'Early', 'Mid-Day', 'Late']
# convert to timedelta
s = pd.to_timedelta(pd.to_datetime(df['date']).dt.time.astype(str))
df['time'] = pd.cut(s, bins=pd.to_timedelta(bins), labels=labels, ordered=False)
output:
date time
0 2022-04-27 01:00:00 Late
1 2022-04-27 04:00:00 Early
2 2022-04-27 12:00:00 Mid-Day
3 2022-04-27 17:00:00 Mid-Day
CodePudding user response:
How about just wrapping all of your conditions in your function in a to_datetime() ?
def time(x):
x=pd.to_datetime(x,format='%H:%M:%S')
if pd.to_datetime('03:00:00') < x <= pd.to_datetime('11:00:00'):
return 'Early'
elif pd.to_datetime('11:00:00') < x <= pd.to_datetime('17:00:00'):
return 'Mid-day'
return 'Late'
df['Time'] = df['Date'].dt.time.apply(time)
CodePudding user response:
You can use np.select
together with the DatetimeIndex.indexer_between_time
functionality to create the labels. indexer_between_time
returns the array indices where the time is between the provided endpoints, so you need to form the Boolean series from an in
check with an array the length of the DataFrame.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': pd.date_range('2010-01-01', freq='3H', periods=11)},
index=list('ABCDEFGHIJK'))
idx = pd.DatetimeIndex(df['Date'])
ilocs = np.arange(len(df))
conds = [np.in1d(ilocs, idx.indexer_between_time('03:00:00', '11:00:00', include_start=False, include_end=True)),
np.in1d(ilocs, idx.indexer_between_time('11:00:00', '17:00:00', include_start=False, include_end=True))]
choices = ['early', 'mid-day']
df['time_of_day'] = np.select(conds, choices, default='late')
Date time_of_day
A 2010-01-01 00:00:00 late
B 2010-01-01 03:00:00 late
C 2010-01-01 06:00:00 early
D 2010-01-01 09:00:00 early
E 2010-01-01 12:00:00 mid-day
F 2010-01-01 15:00:00 mid-day
G 2010-01-01 18:00:00 late
H 2010-01-01 21:00:00 late
I 2010-01-02 00:00:00 late
J 2010-01-02 03:00:00 late
K 2010-01-02 06:00:00 early