Python/Pandas Create categorical column based on DateTime values-CodePudding

I have a Pandas Dataframe (data) with a column ['Date'] with values like "yyyy-mm-dd HH:MM:SS" that I converted into Pandas Datetime format.

data['Date'] = pd.to_datetime(data['Date'])

How can I create a new categorical column ['Time'] with categories 'Early', 'Mid-day', 'Late' based on the hour of each day?

I tried this:

def time(x):
    if '03:00:00' < x <= '11:00:00':
        return 'Early'
    elif '11:00:00' < x <= '17:00:00':
        return 'Mid-day'
    return 'Late'

data['Time'] = data['Date'].dt.time.apply(time)

But I get this error: " TypeError: '<' not supported between instances of 'str' and 'datetime.time' "

I would appreciate your help!

CodePudding user response：

You can use pandas.cut. It is however a bit tricky as you need to convert first your datetime to timedelta:

df = pd.DataFrame({'date': ['2022-04-27 01:00:00', '2022-04-27 04:00:00',
                            '2022-04-27 12:00:00', '2022-04-27 17:00:00']})

# define bins and labels
bins = ['00:00:00', '03:00:00', '11:00:00', '17:00:00', '23:59:59']
labels = ['Late', 'Early', 'Mid-Day', 'Late']

# convert to timedelta
s = pd.to_timedelta(pd.to_datetime(df['date']).dt.time.astype(str))
df['time'] = pd.cut(s, bins=pd.to_timedelta(bins), labels=labels, ordered=False)

output:

                  date     time
0  2022-04-27 01:00:00     Late
1  2022-04-27 04:00:00    Early
2  2022-04-27 12:00:00  Mid-Day
3  2022-04-27 17:00:00  Mid-Day

CodePudding user response：

How about just wrapping all of your conditions in your function in a to_datetime() ?

def time(x):
    x=pd.to_datetime(x,format='%H:%M:%S')
    if pd.to_datetime('03:00:00') < x <= pd.to_datetime('11:00:00'):
        return 'Early'
    elif pd.to_datetime('11:00:00') < x <= pd.to_datetime('17:00:00'):
        return 'Mid-day'
    return 'Late'

df['Time'] = df['Date'].dt.time.apply(time)

CodePudding user response：

You can use np.select together with the DatetimeIndex.indexer_between_time functionality to create the labels. indexer_between_time returns the array indices where the time is between the provided endpoints, so you need to form the Boolean series from an in check with an array the length of the DataFrame.

import pandas as pd
import numpy as np

df = pd.DataFrame({'Date': pd.date_range('2010-01-01', freq='3H', periods=11)},
                  index=list('ABCDEFGHIJK'))

idx = pd.DatetimeIndex(df['Date'])
ilocs = np.arange(len(df))

conds = [np.in1d(ilocs, idx.indexer_between_time('03:00:00', '11:00:00', include_start=False, include_end=True)),
         np.in1d(ilocs, idx.indexer_between_time('11:00:00', '17:00:00', include_start=False, include_end=True))]

choices = ['early', 'mid-day']

df['time_of_day'] = np.select(conds, choices, default='late')

                 Date time_of_day
A 2010-01-01 00:00:00        late
B 2010-01-01 03:00:00        late
C 2010-01-01 06:00:00       early
D 2010-01-01 09:00:00       early
E 2010-01-01 12:00:00     mid-day
F 2010-01-01 15:00:00     mid-day
G 2010-01-01 18:00:00        late
H 2010-01-01 21:00:00        late
I 2010-01-02 00:00:00        late
J 2010-01-02 03:00:00        late
K 2010-01-02 06:00:00       early