Is there a way to use groupby with a time threshold? For example, to group rows whose timestamps are within 1 minute of each other:
|    | Cents | ED_YF | Vals    | EventTimestamp      |
|----|-------|-------|---------|---------------------|
| 10 | 182.5 | 3     | 323.473 | 2022-06-28 13:41:01 |
| 11 | 182.5 | 3     | 323.473 | 2022-06-28 13:41:01 |
| 12 | 153.5 | 2     | 384.767 | 2022-06-28 14:36:15 |
| 13 | 153.5 | 2     | 384.767 | 2022-06-28 14:37:11 |
| 14 | 43    | 2     | 107.785 | 2022-06-28 16:40:37 |
| 15 | 43    | 2     | 107.785 | 2022-06-28 16:40:37 |
| 4  | 152   | 2     | 381.007 | 2022-06-28 16:41:00 |
| 5  | 152   | 2     | 381.007 | 2022-06-28 16:41:00 |
I've seen this done with merge_asof, where a tolerance is a provided input (sketched below, after the desired output), but I haven't seen anything similar in the pandas docs for a GroupBy... In this example, I'm looking to add a column ["GroupID"] (via .ngroup()) based on the column EventTimestamp:
|    | Cents | ED_YF | Vals    | EventTimestamp      | GroupID |
|----|-------|-------|---------|---------------------|---------|
| 10 | 182.5 | 3     | 323.473 | 2022-06-28 13:41:01 | 0       |
| 11 | 182.5 | 3     | 323.473 | 2022-06-28 13:41:01 | 0       |
| 12 | 153.5 | 2     | 384.767 | 2022-06-28 14:36:15 | 1       |
| 13 | 153.5 | 2     | 384.767 | 2022-06-28 14:37:11 | 1       |
| 14 | 43    | 2     | 107.785 | 2022-06-28 16:40:37 | 2       |
| 15 | 43    | 2     | 107.785 | 2022-06-28 16:40:37 | 2       |
| 4  | 152   | 2     | 381.007 | 2022-06-28 16:41:00 | 2       |
| 5  | 152   | 2     | 381.007 | 2022-06-28 16:41:00 | 2       |
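For reference, the merge_asof tolerance I'm referring to looks roughly like this, where left and right are placeholder DataFrames sorted on the key:

pd.merge_asof(left, right, on='EventTimestamp', tolerance=pd.Timedelta('1min'))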
Thanks!
CodePudding user response:
Try comparing consecutive timestamps with diff and turning the gaps into group ids with cumsum:
# a gap of 1 minute or more starts a new group ('1min' replaces the deprecated '1T' alias)
df['GroupID'] = (df['EventTimestamp'].diff() >= pd.Timedelta('1min')).cumsum()
Cents ED_YF Vals EventTimestamp GroupID
10 182.5 3 323.473 2022-06-28 13:41:01 0
11 182.5 3 323.473 2022-06-28 13:41:01 0
12 153.5 2 384.767 2022-06-28 14:36:15 1
13 153.5 2 384.767 2022-06-28 14:37:11 1
14 43.0 2 107.785 2022-06-28 16:40:37 2
15 43.0 2 107.785 2022-06-28 16:40:37 2
4 152.0 2 381.007 2022-06-28 16:41:00 2
5 152.0 2 381.007 2022-06-28 16:41:00 2
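Note that diff() only marks true gaps if the frame is sorted by time. A minimal self-contained sketch, using only the EventTimestamp values from the question:

import pandas as pd

df = pd.DataFrame({'EventTimestamp': pd.to_datetime([
    '2022-06-28 13:41:01', '2022-06-28 13:41:01',
    '2022-06-28 14:36:15', '2022-06-28 14:37:11',
    '2022-06-28 16:40:37', '2022-06-28 16:40:37',
    '2022-06-28 16:41:00', '2022-06-28 16:41:00'])})
df = df.sort_values('EventTimestamp')  # diff-based grouping assumes sorted timestamps
# cumsum turns the boolean "new group starts here" flags into consecutive group ids
df['GroupID'] = (df['EventTimestamp'].diff() >= pd.Timedelta('1min')).cumsum()
print(df)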
CodePudding user response:
You can use the datetime accessor methods, specifically strftime, to truncate the timestamps to the level of specificity you want and use the result as the grouping key. Note that this buckets by calendar minute rather than by a 1-minute gap, so 14:36:15 and 14:37:11 land in different groups:
df = pd.DataFrame({
    'Vals': [323.473, 323.473, 384.767, 384.767, 107.785, 107.785],
    'EventTimestamp': ['2022-06-28 13:41:01', '2022-06-28 13:41:01', '2022-06-28 14:36:15', '2022-06-28 14:37:11', '2022-06-28 16:40:37', '2022-06-28 16:40:37'],
})
result = df.assign(GroupId=(
    df
    # %M is the minute; %I (as originally written) is the 12-hour clock hour
    .groupby(pd.to_datetime(df['EventTimestamp']).dt.strftime('%Y-%m-%d %H:%M'))
    .ngroup()
))
Output:
Vals EventTimestamp GroupId
0 323.473 2022-06-28 13:41:01 0
1 323.473 2022-06-28 13:41:01 0
2 384.767 2022-06-28 14:36:15 1
3 384.767 2022-06-28 14:37:11 2
4 107.785 2022-06-28 16:40:37 3
5 107.785 2022-06-28 16:40:37 3
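If calendar-minute bucketing is in fact what you want, the same grouping can be written more directly with dt.floor, which truncates each timestamp to its minute (a sketch reusing the df above):

result = df.assign(GroupId=(
    df
    # truncate to the minute, then label each distinct minute with ngroup
    .groupby(pd.to_datetime(df['EventTimestamp']).dt.floor('min'))
    .ngroup()
))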
CodePudding user response:
Here is my answer:
import pandas as pd
df = pd.DataFrame({'EventTimestamp': ['2022-06-28 13:41:01', '2022-06-28 13:41:01',
                                      '2022-06-28 14:36:15', '2022-06-28 14:37:11',
                                      '2022-06-28 16:40:37', '2022-06-28 16:40:37',
                                      '2022-06-28 16:41:00', '2022-06-28 16:41:00']})
df['EventTimestamp'] = pd.to_datetime(df['EventTimestamp'])
# use total_seconds(), not .dt.seconds, so gaps longer than a day are not missed
groups = df['EventTimestamp'].diff().dt.total_seconds().gt(60).cumsum()
df['GroupID'] = groups
print(df)
Output is like:
EventTimestamp GroupID
0 2022-06-28 13:41:01 0
1 2022-06-28 13:41:01 0
2 2022-06-28 14:36:15 1
3 2022-06-28 14:37:11 1
4 2022-06-28 16:40:37 2
5 2022-06-28 16:40:37 2
6 2022-06-28 16:41:00 2
7 2022-06-28 16:41:00 2
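One caveat worth knowing: .dt.seconds returns only the seconds component of each timedelta and drops whole days, which is why total_seconds() is used above. A small hypothetical illustration:

import pandas as pd

gaps = pd.Series(pd.to_datetime(['2022-06-28 13:41:01',
                                 '2022-06-29 13:41:01'])).diff()
print(gaps.dt.seconds.iloc[1])          # 0 -- the 1-day gap is invisible
print(gaps.dt.total_seconds().iloc[1])  # 86400.0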