pandas: extract time-of-day as a feature for classification

Time:01-05

I have time series data with a Unix-seconds column (among other columns):

import pandas as pd

df = pd.DataFrame(
    {
        'user': [3,3,3,3,3,6,6,6],
        'timestamp': [1459467971, 1459468020, 1459468026, 1459468031, 
                      1459468036,1513974852, 1513974853, 1513974854]
    }
)

The dataset is for classifying events: some are frequent in the morning, others in the afternoon.

I think time-of-day could be an important discriminant as well in this task.

How do I extract the time of day, as the number of seconds since midnight, from this?

Note: I know that converting with pandas to_datetime and then taking .dt.time gives HH:MM:SS:

df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
#df['date'] = df['timestamp'].dt.date
df['time'] = df['timestamp'].dt.time
df
    user      timestamp          time
0   3   2016-03-31 23:46:11     23:46:11
1   3   2016-03-31 23:47:00     23:47:00
2   3   2016-03-31 23:47:06     23:47:06
3   3   2016-03-31 23:47:11     23:47:11
4   3   2016-03-31 23:47:16     23:47:16
5   6   2017-12-22 20:34:12     20:34:12
6   6   2017-12-22 20:34:13     20:34:13
7   6   2017-12-22 20:34:14     20:34:14

Isn't there an easy way to pick out the number of seconds representing the time, like this:

    user      timestamp       time-of-day
0   3   2016-03-31 23:46:11       85571
1   3   2016-03-31 23:47:00       85620
2   3   2016-03-31 23:47:06       85626
3   3   2016-03-31 23:47:11       85631
4   3   2016-03-31 23:47:16       85636
5   6   2017-12-22 20:34:12       74052
6   6   2017-12-22 20:34:13       74053
7   6   2017-12-22 20:34:14       74054

CodePudding user response:

You can use dt.normalize, which resets each timestamp to midnight, and subtract it to remove the date part:

df['time-of-day'] = (df['timestamp'].sub(df['timestamp'].dt.normalize())
                                    .dt.total_seconds().astype(int))
print(df)

# Output
   user           timestamp        date      time  time-of-day
0     3 2016-03-31 23:46:11  2016-03-31  23:46:11        85571
1     3 2016-03-31 23:47:00  2016-03-31  23:47:00        85620
2     3 2016-03-31 23:47:06  2016-03-31  23:47:06        85626
3     3 2016-03-31 23:47:11  2016-03-31  23:47:11        85631
4     3 2016-03-31 23:47:16  2016-03-31  23:47:16        85636
5     6 2017-12-22 20:34:12  2017-12-22  20:34:12        74052
6     6 2017-12-22 20:34:13  2017-12-22  20:34:13        74053
7     6 2017-12-22 20:34:14  2017-12-22  20:34:14        74054

Note: I'm not sure time-of-day is relevant as a continuous variable. Maybe you should consider discretizing it into 8 blocks of 3 hours?
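The 3-hour discretization suggested above could be sketched as follows. This is only one possible way to do it, assuming the blocks start at midnight; the column name time-block is made up for illustration and does not appear in the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'user': [3, 3, 3, 3, 3, 6, 6, 6],
        'timestamp': [1459467971, 1459468020, 1459468026, 1459468031,
                      1459468036, 1513974852, 1513974853, 1513974854]
    }
)
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')

# Integer-divide the hour by 3 to get a block index from 0 to 7:
# 0 covers 00:00-02:59, 1 covers 03:00-05:59, ..., 7 covers 21:00-23:59.
# 'time-block' is a hypothetical column name chosen for this sketch.
df['time-block'] = df['timestamp'].dt.hour // 3
print(df)
```

With the sample data, the 23:xx rows land in block 7 and the 20:xx rows in block 6. The block index can then be treated as a categorical feature rather than a continuous one.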

CodePudding user response:

You can do it using hour, minute and second and some basic arithmetic:

df['time-of-day'] = df['timestamp'].dt.hour * 3600 + df['timestamp'].dt.minute * 60 + df['timestamp'].dt.second

CodePudding user response:

You can also use this:

df['time-of-day'] = df['timestamp'].apply(lambda x: x.second + x.minute * 60 + x.hour * 3600)
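As a quick sanity check, the three approaches from the answers can be run side by side on the question's sample data. This is a minimal sketch; the variable names ts, via_normalize, via_fields and via_apply are made up for the comparison.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'user': [3, 3, 3, 3, 3, 6, 6, 6],
        'timestamp': [1459467971, 1459468020, 1459468026, 1459468031,
                      1459468036, 1513974852, 1513974853, 1513974854]
    }
)
ts = pd.to_datetime(df['timestamp'], unit='s')

# Approach 1: subtract midnight of the same day, then convert the timedelta to seconds.
via_normalize = ts.sub(ts.dt.normalize()).dt.total_seconds().astype(int)

# Approach 2: vectorized arithmetic on the hour, minute and second fields.
via_fields = ts.dt.hour * 3600 + ts.dt.minute * 60 + ts.dt.second

# Approach 3: element-wise apply on each Timestamp.
via_apply = ts.apply(lambda x: x.second + x.minute * 60 + x.hour * 3600)

# All three should give the same seconds-since-midnight values.
assert via_normalize.tolist() == via_fields.tolist() == via_apply.tolist()
print(via_fields.tolist())
```

For whole-second timestamps like these, the results match the time-of-day column requested in the question. The apply version calls a Python function per row, so the two vectorized versions are likely the better choice on large frames.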