The problem is exactly the same as the one described below, only in Python rather than R. What is the best way to handle this in Python?
I'm unclear what to proceed with next. There seems to be a way to get tidyverse in Python, but I'm wondering whether this can be done with standard Python packages?
Thanks
Here is the issue from the other question:
I have a dataframe of timestamps when an individual animal (ID) is detected in specific locations. Here is an example of the data:
timestampUTC location ID
2017-10-02 19:23:27 JB12 A69-1601-47272
2017-10-02 19:26:48 JB12 A69-1601-47272
2017-10-02 19:27:23 JB12 A69-1601-47272
2017-10-02 19:31:46 JB12 A69-1601-47272
2017-10-02 23:52:15 JB12 A69-1601-47272
2017-10-02 23:53:26 JB12 A69-1601-47272
2017-10-02 23:55:13 JB12 A69-1601-47272
2017-10-03 19:53:50 JB13 A69-1601-47272
2017-10-03 19:55:23 JB13 A69-1601-47272
2017-10-03 19:58:26 JB13 A69-1601-47272
2017-10-04 13:15:13 JB12 A69-1601-47280
2017-10-04 13:16:42 JB12 A69-1601-47280
2017-10-04 13:21:39 JB12 A69-1601-47280
2017-10-04 19:34:54 JB12 A69-1601-47280
2017-10-04 19:55:28 JB12 A69-1601-47280
2017-10-04 20:08:23 JB12 A69-1601-47280
2017-10-04 20:21:43 JB12 A69-1601-47280
2017-10-05 04:55:48 JB13 A69-1601-47280
2017-10-05 04:57:04 JB13 A69-1601-47280
2017-10-05 05:18:40 JB13 A69-1601-47280
2017-10-07 21:24:19 JB13 A69-1601-47280
2017-10-07 21:25:36 JB13 A69-1601-47280
2017-10-07 21:29:25 JB13 A69-1601-47280
My real dataframe is almost 200,000 lines long and has 4 different locations and 13 different IDs.
I want to sort these into discrete events (an ID at a location) with start and end times based on the timestampUTC column. An event ends at the timestampUTC after which the next detection of that ID at that location is more than half an hour later; the next event begins at that later detection.
Using the example data above, I would want to generate another dataframe that looks something like this:
ID location event start event end
A69-1601-47272 JB12 2017-10-02 19:23:27 2017-10-02 19:31:46
A69-1601-47272 JB12 2017-10-02 23:52:15 2017-10-02 23:55:13
A69-1601-47272 JB13 2017-10-03 19:53:50 2017-10-03 19:58:26
A69-1601-47280 JB12 2017-10-04 13:15:13 2017-10-04 13:21:39
A69-1601-47280 JB12 2017-10-04 19:34:54 2017-10-04 20:21:43
A69-1601-47280 JB13 2017-10-05 04:55:48 2017-10-05 05:18:40
A69-1601-47280 JB13 2017-10-07 21:24:19 2017-10-07 21:29:25
For each ID detected at a location, this gives the ID, the location, and the start and end of its time there.
For example, there are 2 discrete events for ID 47272 at location JB12 on the same day (2017-10-02), but the gap between the end of the first event and the start of the second is >30 min (~4 hrs 20 min), so they are separate events.
I would add what code I have tried, but I don't know where to start with tidyverse and would prefer to do this in Python.
Thanks in advance!
And here is the R solution from that question:
Here is an option
library(tidyverse)
df %>%
  mutate(
    timestampUTC = as.POSIXct(timestampUTC),
    diff = c(0, diff(timestampUTC) / 60),
    grp = cumsum(diff > 30)) %>%
  group_by(grp) %>%
  summarise(
    ID = first(ID),
    location = first(location),
    `event start` = first(timestampUTC),
    `event end` = last(timestampUTC))
## A tibble: 7 x 5
# grp ID location `event start` `event end`
# <int> <fct> <fct> <dttm> <dttm>
#1 0 A69-1601-47272 JB12 2017-10-02 19:23:27 2017-10-02 19:31:46
#2 1 A69-1601-47272 JB12 2017-10-02 23:52:15 2017-10-02 23:55:13
#3 2 A69-1601-47272 JB13 2017-10-03 19:53:50 2017-10-03 19:58:26
#4 3 A69-1601-47280 JB12 2017-10-04 13:15:13 2017-10-04 13:21:39
#5 4 A69-1601-47280 JB12 2017-10-04 19:34:54 2017-10-04 20:21:43
#6 5 A69-1601-47280 JB13 2017-10-05 04:55:48 2017-10-05 05:18:40
#7 6 A69-1601-47280 JB13 2017-10-07 21:24:19 2017-10-07 21:29:25
I've kept some of the intermediate steps (columns) to help with readability and understanding. In short, we convert the timestamps to POSIXct, calculate the time differences in minutes between successive timestamps with diff, and create groups of observations based on whether the next timestamp is > 30 minutes away. The rest is grouping by grp and summarising entries from the relevant columns.
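Since the question asks for Python, here is a rough pandas sketch of the same pipeline (my own translation, not part of the R answer; it assumes df is a pandas DataFrame with the columns shown above):
import pandas as pd

# convert to datetime (the as.POSIXct step)
df['timestampUTC'] = pd.to_datetime(df['timestampUTC'])
# minutes between successive timestamps (the diff step); the first value is NaN
gap_min = df['timestampUTC'].diff().dt.total_seconds().div(60)
# start a new group whenever the gap exceeds 30 minutes (the cumsum(diff > 30) step)
df['grp'] = gap_min.gt(30).cumsum()

events = (df.groupby('grp')
            .agg(ID=('ID', 'first'),
                 location=('location', 'first'),
                 event_start=('timestampUTC', 'first'),
                 event_end=('timestampUTC', 'last')))
The resulting events frame should match the 7 rows of the tibble above.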
The same pipeline, more succinctly (perhaps at the expense of readability):
df %>%
  group_by(grp = cumsum(c(0, diff(as.POSIXct(timestampUTC)) / 60) > 30)) %>%
  summarise(
    ID = first(ID),
    location = first(location),
    `event start` = first(timestampUTC),
    `event end` = last(timestampUTC)) %>%
  select(-grp)
Sample data
df <- read.table(text =
"timestampUTC location ID
'2017-10-02 19:23:27' JB12 A69-1601-47272
'2017-10-02 19:26:48' JB12 A69-1601-47272
'2017-10-02 19:27:23' JB12 A69-1601-47272
'2017-10-02 19:31:46' JB12 A69-1601-47272
'2017-10-02 23:52:15' JB12 A69-1601-47272
'2017-10-02 23:53:26' JB12 A69-1601-47272
'2017-10-02 23:55:13' JB12 A69-1601-47272
'2017-10-03 19:53:50' JB13 A69-1601-47272
'2017-10-03 19:55:23' JB13 A69-1601-47272
'2017-10-03 19:58:26' JB13 A69-1601-47272
'2017-10-04 13:15:13' JB12 A69-1601-47280
'2017-10-04 13:16:42' JB12 A69-1601-47280
'2017-10-04 13:21:39' JB12 A69-1601-47280
'2017-10-04 19:34:54' JB12 A69-1601-47280
'2017-10-04 19:55:28' JB12 A69-1601-47280
'2017-10-04 20:08:23' JB12 A69-1601-47280
'2017-10-04 20:21:43' JB12 A69-1601-47280
'2017-10-05 04:55:48' JB13 A69-1601-47280
'2017-10-05 04:57:04' JB13 A69-1601-47280
'2017-10-05 05:18:40' JB13 A69-1601-47280
'2017-10-07 21:24:19' JB13 A69-1601-47280
'2017-10-07 21:25:36' JB13 A69-1601-47280
'2017-10-07 21:29:25' JB13 A69-1601-47280", header = T)
CodePudding user response:
If data.txt is:
2017-10-02 19:23:27 JB12 A69-1601-47272
2017-10-02 19:26:48 JB12 A69-1601-47272
2017-10-02 19:27:23 JB12 A69-1601-47272
2017-10-02 19:31:46 JB12 A69-1601-47272
2017-10-02 23:52:15 JB12 A69-1601-47272
2017-10-02 23:53:26 JB12 A69-1601-47272
2017-10-02 23:55:13 JB12 A69-1601-47272
2017-10-03 19:53:50 JB13 A69-1601-47272
2017-10-03 19:55:23 JB13 A69-1601-47272
2017-10-03 19:58:26 JB13 A69-1601-47272
2017-10-04 13:15:13 JB12 A69-1601-47280
2017-10-04 13:16:42 JB12 A69-1601-47280
2017-10-04 13:21:39 JB12 A69-1601-47280
2017-10-04 19:34:54 JB12 A69-1601-47280
2017-10-04 19:55:28 JB12 A69-1601-47280
2017-10-04 20:08:23 JB12 A69-1601-47280
2017-10-04 20:21:43 JB12 A69-1601-47280
2017-10-05 04:55:48 JB13 A69-1601-47280
2017-10-05 04:57:04 JB13 A69-1601-47280
2017-10-05 05:18:40 JB13 A69-1601-47280
2017-10-07 21:24:19 JB13 A69-1601-47280
2017-10-07 21:25:36 JB13 A69-1601-47280
2017-10-07 21:29:25 JB13 A69-1601-47280
Using pandas, the solution is pretty straightforward:
from datetime import datetime
from pandas import DataFrame, concat
from numpy import timedelta64
# this is just to read the data: the first 19 characters are the timestamp,
# the rest is whitespace-separated location and ID
with open('data.txt') as f:
    df = DataFrame(
        [datetime.strptime(line[:19], '%Y-%m-%d %H:%M:%S'), *line[19:].split()]
        for line in f)
# the solution starts here, only three statements
# first create a column 'period' to assist with grouping periods: start at 0 and
# increase the counter whenever the gap to the previous detection exceeds 30 minutes
df['period'] = [x := 0] + list(
    x if a - b < timedelta64(30, 'm') else (x := x + 1)
    for a, b in zip(df[0][1:], df[0][0:]))
# this is the grouping you want: by period, ID (column 2) and location (column 1)
dfg = df.groupby(by=['period', 2, 1])
# resulting dataframe, adding start and end, turning index levels into columns and renaming
result = concat([dfg.min(), dfg.max()], axis=1)\
    .reset_index(level=[1, 2])\
    .set_axis(['id', 'location', 'start', 'end'], axis=1)
print(result)
Output:
id location start end
period
0 A69-1601-47272 JB12 2017-10-02 19:23:27 2017-10-02 19:31:46
1 A69-1601-47272 JB12 2017-10-02 23:52:15 2017-10-02 23:55:13
2 A69-1601-47272 JB13 2017-10-03 19:53:50 2017-10-03 19:58:26
3 A69-1601-47280 JB12 2017-10-04 13:15:13 2017-10-04 13:21:39
4 A69-1601-47280 JB12 2017-10-04 19:34:54 2017-10-04 20:21:43
5 A69-1601-47280 JB13 2017-10-05 04:55:48 2017-10-05 05:18:40
6 A69-1601-47280 JB13 2017-10-07 21:24:19 2017-10-07 21:29:25
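As a side note on the reading step: the file could also be parsed by pandas directly; a sketch, assuming the whitespace-separated layout of data.txt shown above (the column names here are my own):
import pandas as pd

# the timestamp spans two whitespace-separated fields, so read date and time
# separately and combine them afterwards
raw = pd.read_csv('data.txt', sep=r'\s+', header=None,
                  names=['date', 'time', 'location', 'ID'])
df = pd.DataFrame({
    'timestampUTC': pd.to_datetime(raw['date'] + ' ' + raw['time']),
    'location': raw['location'],
    'ID': raw['ID'],
})
The rest of the code would then refer to these columns by name rather than by the integer labels 0, 1 and 2.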
You asked for "the best" solution - there is of course no such thing. Whether a solution is sufficient for your use case is what matters, and perhaps this one is. "Better" can then be anything like 'faster', 'easier to read', 'taking less memory', 'shorter', etc.
CodePudding user response:
Let's take the time difference in minutes between consecutive rows of the timestampUTC column (this assumes timestampUTC is already a datetime column):
df['minutes'] = (df['timestampUTC'] - df['timestampUTC'].shift()).dt.total_seconds() / 60
print(df)
'''
timestampUTC location ID minutes
0 2017-10-02 19:23:27 JB12 A69-1601-47272 NaN
1 2017-10-02 19:26:48 JB12 A69-1601-47272 3.350000
2 2017-10-02 19:27:23 JB12 A69-1601-47272 0.583333
3 2017-10-02 19:31:46 JB12 A69-1601-47272 4.383333
4 2017-10-02 23:52:15 JB12 A69-1601-47272 260.483333
5 2017-10-02 23:53:26 JB12 A69-1601-47272 1.183333
6 2017-10-02 23:55:13 JB12 A69-1601-47272 1.783333
7 2017-10-03 19:53:50 JB13 A69-1601-47272 1198.616667
8 2017-10-03 19:55:23 JB13 A69-1601-47272 1.550000
9 2017-10-03 19:58:26 JB13 A69-1601-47272 3.050000
10 2017-10-04 13:15:13 JB12 A69-1601-47280 1036.783333
11 2017-10-04 13:16:42 JB12 A69-1601-47280 1.483333
12 2017-10-04 13:21:39 JB12 A69-1601-47280 4.950000
13 2017-10-04 19:34:54 JB12 A69-1601-47280 373.250000
14 2017-10-04 19:55:28 JB12 A69-1601-47280 20.566667
15 2017-10-04 20:08:23 JB12 A69-1601-47280 12.916667
16 2017-10-04 20:21:43 JB12 A69-1601-47280 13.333333
17 2017-10-05 04:55:48 JB13 A69-1601-47280 514.083333
18 2017-10-05 04:57:04 JB13 A69-1601-47280 1.266667
19 2017-10-05 05:18:40 JB13 A69-1601-47280 21.600000
20 2017-10-07 21:24:19 JB13 A69-1601-47280 3845.650000
21 2017-10-07 21:25:36 JB13 A69-1601-47280 1.283333
22 2017-10-07 21:29:25 JB13 A69-1601-47280 3.816667
'''
Then let's form groups according to the 30-minute limit and assign each one an id: whenever the elapsed time is more than 30 minutes, a new id starts.
s = df['minutes'].gt(30).groupby(df['ID']).cumsum()
df['idx'] = df.groupby(['ID', s]).ngroup() + 1
print(df)
'''
timestampUTC location ID minutes idx
0 2017-10-02 19:23:27 JB12 A69-1601-47272 NaN 1
1 2017-10-02 19:26:48 JB12 A69-1601-47272 3.350000 1
2 2017-10-02 19:27:23 JB12 A69-1601-47272 0.583333 1
3 2017-10-02 19:31:46 JB12 A69-1601-47272 4.383333 1
4 2017-10-02 23:52:15 JB12 A69-1601-47272 260.483333 2
5 2017-10-02 23:53:26 JB12 A69-1601-47272 1.183333 2
6 2017-10-02 23:55:13 JB12 A69-1601-47272 1.783333 2
7 2017-10-03 19:53:50 JB13 A69-1601-47272 1198.616667 3
8 2017-10-03 19:55:23 JB13 A69-1601-47272 1.550000 3
9 2017-10-03 19:58:26 JB13 A69-1601-47272 3.050000 3
10 2017-10-04 13:15:13 JB12 A69-1601-47280 1036.783333 4
11 2017-10-04 13:16:42 JB12 A69-1601-47280 1.483333 4
12 2017-10-04 13:21:39 JB12 A69-1601-47280 4.950000 4
13 2017-10-04 19:34:54 JB12 A69-1601-47280 373.250000 5
14 2017-10-04 19:55:28 JB12 A69-1601-47280 20.566667 5
15 2017-10-04 20:08:23 JB12 A69-1601-47280 12.916667 5
16 2017-10-04 20:21:43 JB12 A69-1601-47280 13.333333 5
17 2017-10-05 04:55:48 JB13 A69-1601-47280 514.083333 6
18 2017-10-05 04:57:04 JB13 A69-1601-47280 1.266667 6
19 2017-10-05 05:18:40 JB13 A69-1601-47280 21.600000 6
20 2017-10-07 21:24:19 JB13 A69-1601-47280 3845.650000 7
21 2017-10-07 21:25:36 JB13 A69-1601-47280 1.283333 7
22 2017-10-07 21:29:25 JB13 A69-1601-47280 3.816667 7
'''
Finally, let's group by the idx column and aggregate the event start and end times.
final = df.groupby('idx').agg(ID=('ID', 'first'),
                              location=('location', 'first'),
                              event_start=('timestampUTC', 'min'),
                              event_end=('timestampUTC', 'max'))
print(final)
'''
ID location event_start event_end
idx
1 A69-1601-47272 JB12 2017-10-02 19:23:27 2017-10-02 19:31:46
2 A69-1601-47272 JB12 2017-10-02 23:52:15 2017-10-02 23:55:13
3 A69-1601-47272 JB13 2017-10-03 19:53:50 2017-10-03 19:58:26
4 A69-1601-47280 JB12 2017-10-04 13:15:13 2017-10-04 13:21:39
5 A69-1601-47280 JB12 2017-10-04 19:34:54 2017-10-04 20:21:43
6 A69-1601-47280 JB13 2017-10-05 04:55:48 2017-10-05 05:18:40
7 A69-1601-47280 JB13 2017-10-07 21:24:19 2017-10-07 21:29:25
'''
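One caveat (my note, not part of the answer above): the grouping key here is ID plus the 30-minute gap, so if the same ID could be detected at two different locations less than 30 minutes apart, those rows would end up in one event. A minimal sketch that also splits on a change of location, under the same assumptions as above (timestampUTC already a datetime column, rows ordered as in the example):
# start a new event on a gap of more than 30 minutes OR when ID/location changes
gap = df['timestampUTC'].diff().dt.total_seconds().div(60).gt(30)
changed = (df[['ID', 'location']] != df[['ID', 'location']].shift()).any(axis=1)
df['idx'] = (gap | changed).cumsum()

final = df.groupby('idx').agg(ID=('ID', 'first'),
                              location=('location', 'first'),
                              event_start=('timestampUTC', 'min'),
                              event_end=('timestampUTC', 'max'))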