Pandas GroupBy DateTime with time threshold


Is there a way to use groupby with a time threshold? For example, to group timestamps that are within 1 minute of each other:

    Cents  ED_YF     Vals      EventTimestamp
10  182.5      3  323.473 2022-06-28 13:41:01
11  182.5      3  323.473 2022-06-28 13:41:01
12  153.5      2  384.767 2022-06-28 14:36:15
13  153.5      2  384.767 2022-06-28 14:37:11
14     43      2  107.785 2022-06-28 16:40:37
15     43      2  107.785 2022-06-28 16:40:37
4     152      2  381.007 2022-06-28 16:41:00
5     152      2  381.007 2022-06-28 16:41:00

I've seen something similar in merge_asof, where a tolerance threshold is a provided input (see the sketch after the table below), but I haven't seen anything like it in the pandas docs for a GroupBy... In this example, I'm looking to add a column ["GroupID"] (via .ngroup()) based on the column EventTimestamp:

    Cents  ED_YF     Vals      EventTimestamp  GroupID
10  182.5      3  323.473 2022-06-28 13:41:01        0
11  182.5      3  323.473 2022-06-28 13:41:01        0
12  153.5      2  384.767 2022-06-28 14:36:15        1
13  153.5      2  384.767 2022-06-28 14:37:11        1
14     43      2  107.785 2022-06-28 16:40:37        2
15     43      2  107.785 2022-06-28 16:40:37        2
4     152      2  381.007 2022-06-28 16:41:00        2
5     152      2  381.007 2022-06-28 16:41:00        2
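
For reference, this is the kind of tolerance input I mean from merge_asof (a minimal sketch; the frames are made up purely for illustration):

import pandas as pd

# Made-up frames, only to illustrate the tolerance argument
left = pd.DataFrame({'EventTimestamp': pd.to_datetime(['2022-06-28 13:41:01'])})
right = pd.DataFrame({'EventTimestamp': pd.to_datetime(['2022-06-28 13:41:30']),
                      'Vals': [323.473]})

# Rows only match when the timestamps are within 1 minute of each other
pd.merge_asof(left, right, on='EventTimestamp',
              tolerance=pd.Timedelta('1min'), direction='nearest')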

Thanks!

CodePudding user response:

Try diff plus cumsum: compare consecutive timestamps, and every gap of a minute or more starts a new group:

# A gap of one minute or more between consecutive timestamps starts a new group
df['GroupID'] = (df['EventTimestamp'].diff() >= pd.Timedelta('1min')).cumsum()

    Cents  ED_YF     Vals      EventTimestamp  GroupID
10  182.5      3  323.473 2022-06-28 13:41:01        0
11  182.5      3  323.473 2022-06-28 13:41:01        0
12  153.5      2  384.767 2022-06-28 14:36:15        1
13  153.5      2  384.767 2022-06-28 14:37:11        1
14   43.0      2  107.785 2022-06-28 16:40:37        2
15   43.0      2  107.785 2022-06-28 16:40:37        2
4   152.0      2  381.007 2022-06-28 16:41:00        2
5   152.0      2  381.007 2022-06-28 16:41:00        2
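
Note that diff() compares consecutive rows, so this assumes the frame is sorted by EventTimestamp. The new column then behaves like any other groupby key; for example (a sketch, with the aggregation chosen arbitrarily):

# Sort first if the timestamps aren't already in order
df = df.sort_values('EventTimestamp')
df['GroupID'] = (df['EventTimestamp'].diff() >= pd.Timedelta('1min')).cumsum()

# Example aggregation: sum Vals within each cluster
df.groupby('GroupID')['Vals'].sum()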

CodePudding user response:

You can use the datetime accessor methods, specifically strftime, to truncate the timestamps to the level of specificity you want, and pass that as the key to the groupby method.

import pandas as pd

df = pd.DataFrame({
    'Vals': [323.473, 323.473, 384.767, 384.767, 107.785, 107.785],
    'EventTimestamp': ['2022-06-28 13:41:01', '2022-06-28 13:41:01', '2022-06-28 14:36:15', '2022-06-28 14:37:11', '2022-06-28 16:40:37', '2022-06-28 16:40:37'],
})
# Truncate each timestamp to the minute ('%M', not '%I') and group on that key
result = df.assign(GroupId=(
    df
    .groupby(pd.to_datetime(df['EventTimestamp']).dt.strftime('%Y-%m-%d %H:%M'))
    .ngroup()
))

Output:

      Vals       EventTimestamp  GroupId
0  323.473  2022-06-28 13:41:01        0
1  323.473  2022-06-28 13:41:01        0
2  384.767  2022-06-28 14:36:15        1
3  384.767  2022-06-28 14:37:11        2
4  107.785  2022-06-28 16:40:37        3
5  107.785  2022-06-28 16:40:37        3
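
One caveat: truncating to the clock minute groups by calendar minute, not by proximity, so two timestamps 56 seconds apart that straddle a minute boundary (like 14:36:15 and 14:37:11 above) land in different groups; the diff-based answers are the ones that implement a true rolling threshold. If calendar-minute buckets are what you want, a slightly cleaner sketch uses dt.floor instead of string formatting:

# Equivalent calendar-minute bucketing via dt.floor (sketch)
ts = pd.to_datetime(df['EventTimestamp'])
result = df.assign(GroupId=df.groupby(ts.dt.floor('min')).ngroup())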

CodePudding user response:

Here is my answer:

import pandas as pd

df = pd.DataFrame({'EventTimestamp': ['2022-06-28 13:41:01', '2022-06-28 13:41:01',
                                      '2022-06-28 14:36:15', '2022-06-28 14:37:11',
                                      '2022-06-28 16:40:37', '2022-06-28 16:40:37',
                                      '2022-06-28 16:41:00', '2022-06-28 16:41:00']})
df['EventTimestamp'] = pd.to_datetime(df['EventTimestamp'])
# total_seconds() is safer than .dt.seconds, which wraps around for gaps over a day;
# the first diff is NaT, so gt(60) is False and the first row lands in group 0
groups = df['EventTimestamp'].diff().dt.total_seconds().gt(60).cumsum()
df['GroupID'] = groups
print(df)

Output:

       EventTimestamp  GroupID
0 2022-06-28 13:41:01        0
1 2022-06-28 13:41:01        0
2 2022-06-28 14:36:15        1
3 2022-06-28 14:37:11        1
4 2022-06-28 16:40:37        2
5 2022-06-28 16:40:37        2
6 2022-06-28 16:41:00        2
7 2022-06-28 16:41:00        2
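
If you want the threshold to be a parameter, in the spirit of merge_asof's tolerance, the same idea wraps into a small helper (group_by_gap is a hypothetical name; this is only a sketch):

def group_by_gap(ts: pd.Series, threshold: pd.Timedelta) -> pd.Series:
    """Label runs of sorted timestamps whose consecutive gaps stay within threshold."""
    # The first diff is NaT, which compares False, so the first row starts group 0
    return ts.diff().gt(threshold).cumsum()

df['GroupID'] = group_by_gap(df['EventTimestamp'], pd.Timedelta('1min'))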