I have up to three different timestamps for each day in a DataFrame. In a new column called 'Category' I want to number them from 1 to 3 according to the time of the timestamp, much like a PARTITION BY with RANK in SQL.
Something like: for each day, look at the time of each run and assign a rank depending on whether it was the first, second, or third run (if there is a third run).
The DataFrame has about half a million rows: a few years of data with 2-3 runs every day, at hourly resolution.
What is the most efficient way to do this?
Example of the expected result:
Timestamp | Category |
---|---|
2020-01-17 08:18:00 | 1 |
2020-01-17 11:57:00 | 2 |
2020-01-17 15:35:00 | 3 |
2020-01-18 09:00:00 | 1 |
2020-01-18 12:00:00 | 2 |
2020-01-18 17:00:00 | 3 |
CodePudding user response:
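Use groupby() with cumcount(). Parse the column with pd.to_datetime first (the sample data is in ISO format, so no format string is needed). Note that cumcount() numbers rows in their existing order, so the frame should already be sorted by Timestamp:
import pandas as pd
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['Category'] = df.groupby(df['Timestamp'].dt.to_period('D')).cumcount().add(1)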
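As a minimal, self-contained sketch of the groupby/cumcount idea: the sample frame below is made up for illustration and is deliberately out of order, to show why sorting by Timestamp before numbering matters.

```python
import pandas as pd

# Hypothetical sample data, intentionally shuffled
df = pd.DataFrame({"Timestamp": [
    "2020-01-17 15:35:00",
    "2020-01-18 09:00:00",
    "2020-01-17 08:18:00",
    "2020-01-18 17:00:00",
    "2020-01-17 11:57:00",
    "2020-01-18 12:00:00",
]})
df["Timestamp"] = pd.to_datetime(df["Timestamp"])

# cumcount() follows the current row order, so sort chronologically first
df = df.sort_values("Timestamp").reset_index(drop=True)

# Number the rows within each calendar day, starting at 1
df["Category"] = df.groupby(df["Timestamp"].dt.to_period("D")).cumcount().add(1)
print(df)
```

After sorting, each day's rows are numbered 1, 2, 3 in chronological order.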
CodePudding user response:
df['Category'] = df.groupby(pd.Grouper(freq='D', key='Timestamp')).cumcount().add(1)
Output:
>>> df
Timestamp Category
0 2020-01-17 08:18:00 1
1 2020-01-17 11:57:00 2
2 2020-01-17 15:35:00 3
3 2020-01-18 09:00:00 1
4 2020-01-18 12:00:00 2
5 2020-01-18 17:00:00 3
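One caveat, stated as an assumption about the data: if the same run timestamp appears on many rows (the row count in the question suggests this may be the case), cumcount() numbers rows rather than runs. A possible alternative, closer to SQL's DENSE_RANK() OVER (PARTITION BY day ORDER BY timestamp), is a grouped rank with method="dense". The sample frame below is hypothetical.

```python
import pandas as pd

# Hypothetical sample: unsorted, and the 08:18 run appears on two rows
df = pd.DataFrame({"Timestamp": pd.to_datetime([
    "2020-01-17 11:57:00",
    "2020-01-17 08:18:00",
    "2020-01-17 08:18:00",
    "2020-01-17 15:35:00",
    "2020-01-18 12:00:00",
    "2020-01-18 09:00:00",
])})

# Calendar day of each row (time set to midnight), used as the partition key
day = df["Timestamp"].dt.normalize()

# Dense rank within each day: equal timestamps share a category,
# and the result does not depend on the row order
df["Category"] = df.groupby(day)["Timestamp"].rank(method="dense").astype(int)
print(df)
```

Here both 08:18 rows get category 1, and no prior sorting is needed. The astype(int) step assumes there are no missing timestamps.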