I have up to three different timestamps for each day in a dataframe. In a new column called 'Category' I want to give them a number from 1 to 3 based on the time of the timestamp, much like PARTITION BY with RANK in SQL.

Something like: for each day, check the time of each run and assign a rank depending on whether it was the first, second, or third run (if there is a third run).

The dataframe has about half a million rows: a few years of data with 2-3 runs every day, at hourly resolution.

Any suggestions on how to do this most efficiently?

Example of how it is supposed to look:

Timestamp	Category
2020-01-17 08:18:00	1
2020-01-17 11:57:00	2
2020-01-17 15:35:00	3

2020-01-18 09:00:00	1
2020-01-18 12:00:00	2
2020-01-18 17:00:00	3

CodePudding user response：

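Use groupby() and .cumcount(). cumcount() numbers rows in the order they appear, so sort by timestamp first if the data is not already in chronological order: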

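import pandas as pd
df['Timestamp'] = pd.to_datetime(df['Timestamp'])  # the sample data is ISO-formatted, so no explicit format is needed
df = df.sort_values('Timestamp')  # cumcount() follows row order, so make it chronological
df['Category'] = df.groupby(df['Timestamp'].dt.to_period('D')).cumcount().add(1)  # 1-based counter per day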
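To try the groupby/cumcount idea end to end, here is a minimal sketch on the sample rows from the question. It assumes the column is named Timestamp as in the question; the rows are shuffled on purpose (an assumption for illustration) to show why sorting before cumcount() matters. Grouping by dt.normalize() is just one way to get a per-day key.

```python
import pandas as pd

# Sample rows from the question, deliberately out of order (illustrative assumption).
df = pd.DataFrame({
    "Timestamp": [
        "2020-01-17 15:35:00",
        "2020-01-18 09:00:00",
        "2020-01-17 08:18:00",
        "2020-01-18 17:00:00",
        "2020-01-17 11:57:00",
        "2020-01-18 12:00:00",
    ]
})
df["Timestamp"] = pd.to_datetime(df["Timestamp"])

# Sort so that row order within each day is chronological.
df = df.sort_values("Timestamp").reset_index(drop=True)

# Number the rows within each calendar day, starting at 1.
df["Category"] = df.groupby(df["Timestamp"].dt.normalize()).cumcount() + 1

print(df)
```

Without the sort_values() call, the numbers would follow the original row order rather than the time of day.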

CodePudding user response：

df['Category'] = df.groupby(pd.Grouper(freq='D', key='Timestamp')).cumcount().add(1)

Output:

>>> df
            Timestamp  Category
0 2020-01-17 08:18:00         1
1 2020-01-17 11:57:00         2
2 2020-01-17 15:35:00         3
3 2020-01-18 09:00:00         1
4 2020-01-18 12:00:00         2
5 2020-01-18 17:00:00         3
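Both answers count rows, which matches the example because each run timestamp appears exactly once. If the same run timestamp repeats on several rows (an assumption, suggested by the mention of hourly-resolution data but not shown in the question), a row counter would keep increasing past 3. A sketch closer to SQL's DENSE_RANK, using hypothetical repeated data, could look like this:

```python
import pandas as pd

# Hypothetical data: each run timestamp is repeated on more than one row.
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2020-01-17 08:18:00", "2020-01-17 08:18:00",
        "2020-01-17 11:57:00", "2020-01-17 11:57:00",
        "2020-01-17 15:35:00",
        "2020-01-18 09:00:00", "2020-01-18 09:00:00",
        "2020-01-18 12:00:00",
    ])
})

# Per-day grouping key: the timestamp truncated to midnight.
day = df["Timestamp"].dt.normalize()

# Dense rank within each day: equal timestamps share a rank and ranks have no gaps.
# rank() returns floats, so cast to int for a clean 1/2/3 category.
df["Category"] = df.groupby(day)["Timestamp"].rank(method="dense").astype(int)

print(df)
```

Unlike cumcount(), rank() compares the values themselves, so this variant does not depend on the rows being sorted.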