The problem is exactly the same as the one described below, only in Python rather than R. What is the best way to handle this in Python?
I'm unclear what to proceed with next. There seems to be a way to get tidyverse in Python, but I'm wondering whether this can be done with standard Python packages?
Thanks
Here is the issue from the other question:
I have a dataframe of timestamps when an individual animal (ID) is detected in specific locations. Here is an example of the data:
timestampUTC location ID
2017-10-02 19:23:27 JB12 A69-1601-47272
2017-10-02 19:26:48 JB12 A69-1601-47272
2017-10-02 19:27:23 JB12 A69-1601-47272
2017-10-02 19:31:46 JB12 A69-1601-47272
2017-10-02 23:52:15 JB12 A69-1601-47272
2017-10-02 23:53:26 JB12 A69-1601-47272
2017-10-02 23:55:13 JB12 A69-1601-47272
2017-10-03 19:53:50 JB13 A69-1601-47272
2017-10-03 19:55:23 JB13 A69-1601-47272
2017-10-03 19:58:26 JB13 A69-1601-47272
2017-10-04 13:15:13 JB12 A69-1601-47280
2017-10-04 13:16:42 JB12 A69-1601-47280
2017-10-04 13:21:39 JB12 A69-1601-47280
2017-10-04 19:34:54 JB12 A69-1601-47280
2017-10-04 19:55:28 JB12 A69-1601-47280
2017-10-04 20:08:23 JB12 A69-1601-47280
2017-10-04 20:21:43 JB12 A69-1601-47280
2017-10-05 04:55:48 JB13 A69-1601-47280
2017-10-05 04:57:04 JB13 A69-1601-47280
2017-10-05 05:18:40 JB13 A69-1601-47280
2017-10-07 21:24:19 JB13 A69-1601-47280
2017-10-07 21:25:36 JB13 A69-1601-47280
2017-10-07 21:29:25 JB13 A69-1601-47280
My real dataframe is almost 200,000 lines long and has 4 different locations and 13 different IDs.
I want to sort these into discrete events (an ID at a location) with start and end times based on the timestampUTC column. An event ends at the timestampUTC after which the next detection of that ID at that location is more than half an hour later; the next event begins at that later detection.
Using the example data above, I would want to generate another dataframe that looks something like this:
ID location event start event end
A69-1601-47272 JB12 2017-10-02 19:23:27 2017-10-02 19:31:46
A69-1601-47272 JB12 2017-10-02 23:52:15 2017-10-02 23:55:13
A69-1601-47272 JB13 2017-10-03 19:53:50 2017-10-03 19:58:26
A69-1601-47280 JB12 2017-10-04 13:15:13 2017-10-04 13:21:39
A69-1601-47280 JB12 2017-10-04 19:34:54 2017-10-04 20:21:43
A69-1601-47280 JB13 2017-10-05 04:55:48 2017-10-05 05:18:40
A69-1601-47280 JB13 2017-10-07 21:24:19 2017-10-07 21:29:25
For each ID detected at a location, this gives the ID, the location, and the start and end of its time there.
For example, there are 2 discrete events for ID 47272 at location JB12 on the same day (2017-10-02), but the gap between the end of the first event and the start of the second is >30 min (~4 hrs 20 min), so they are separate events.
I would add what code I have tried, but I don't know where to start with tidyverse and would prefer to do this in Python.
Thanks in advance!
And here is the R solution from that question:
Here is an option
library(tidyverse)
df %>%
  mutate(
    timestampUTC = as.POSIXct(timestampUTC),
    diff = c(0, diff(timestampUTC) / 60),
    grp = cumsum(diff > 30)) %>%
  group_by(grp) %>%
  summarise(
    ID = first(ID),
    location = first(location),
    `event start` = first(timestampUTC),
    `event end` = last(timestampUTC))
## A tibble: 7 x 5
# grp ID location `event start` `event end`
# <int> <fct> <fct> <dttm> <dttm>
#1 0 A69-1601-47272 JB12 2017-10-02 19:23:27 2017-10-02 19:31:46
#2 1 A69-1601-47272 JB12 2017-10-02 23:52:15 2017-10-02 23:55:13
#3 2 A69-1601-47272 JB13 2017-10-03 19:53:50 2017-10-03 19:58:26
#4 3 A69-1601-47280 JB12 2017-10-04 13:15:13 2017-10-04 13:21:39
#5 4 A69-1601-47280 JB12 2017-10-04 19:34:54 2017-10-04 20:21:43
#6 5 A69-1601-47280 JB13 2017-10-05 04:55:48 2017-10-05 05:18:40
#7 6 A69-1601-47280 JB13 2017-10-07 21:24:19 2017-10-07 21:29:25
I've kept some of the intermediate steps (columns) to help with readability and understanding. In short, we convert the timestamps to POSIXct, calculate the time differences in minutes between successive timestamps with diff, and create groups of observations based on whether the next timestamp is > 30 minutes away. The rest is grouping by grp and summarising entries from the relevant columns.
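Since the question asks for Python, here is a rough pandas sketch of the same pipeline (my own translation, not part of the R answer; it assumes df is a pandas DataFrame with the columns shown above):
import pandas as pd

# convert to datetime (the as.POSIXct step)
df['timestampUTC'] = pd.to_datetime(df['timestampUTC'])
# minutes between successive timestamps (the diff step); the first value is NaN
gap_min = df['timestampUTC'].diff().dt.total_seconds().div(60)
# start a new group whenever the gap exceeds 30 minutes (the cumsum(diff > 30) step)
df['grp'] = gap_min.gt(30).cumsum()

events = (df.groupby('grp')
            .agg(ID=('ID', 'first'),
                 location=('location', 'first'),
                 event_start=('timestampUTC', 'first'),
                 event_end=('timestampUTC', 'last')))
The resulting events frame should match the 7 rows of the tibble above.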
The same pipeline, more succinctly (perhaps at the expense of readability):
df %>%
  group_by(grp = cumsum(c(0, diff(as.POSIXct(timestampUTC)) / 60) > 30)) %>%
  summarise(
    ID = first(ID),
    location = first(location),
    `event start` = first(timestampUTC),
    `event end` = last(timestampUTC)) %>%
  select(-grp)
Sample data
df <- read.table(text =
"timestampUTC location ID
'2017-10-02 19:23:27' JB12 A69-1601-47272
'2017-10-02 19:26:48' JB12 A69-1601-47272
'2017-10-02 19:27:23' JB12 A69-1601-47272
'2017-10-02 19:31:46' JB12 A69-1601-47272
'2017-10-02 23:52:15' JB12 A69-1601-47272
'2017-10-02 23:53:26' JB12 A69-1601-47272
'2017-10-02 23:55:13' JB12 A69-1601-47272
'2017-10-03 19:53:50' JB13 A69-1601-47272
'2017-10-03 19:55:23' JB13 A69-1601-47272
'2017-10-03 19:58:26' JB13 A69-1601-47272
'2017-10-04 13:15:13' JB12 A69-1601-47280
'2017-10-04 13:16:42' JB12 A69-1601-47280
'2017-10-04 13:21:39' JB12 A69-1601-47280
'2017-10-04 19:34:54' JB12 A69-1601-47280
'2017-10-04 19:55:28' JB12 A69-1601-47280
'2017-10-04 20:08:23' JB12 A69-1601-47280
'2017-10-04 20:21:43' JB12 A69-1601-47280
'2017-10-05 04:55:48' JB13 A69-1601-47280
'2017-10-05 04:57:04' JB13 A69-1601-47280
'2017-10-05 05:18:40' JB13 A69-1601-47280
'2017-10-07 21:24:19' JB13 A69-1601-47280
'2017-10-07 21:25:36' JB13 A69-1601-47280
'2017-10-07 21:29:25' JB13 A69-1601-47280", header = T)
CodePudding user response:
If data.txt is:
2017-10-02 19:23:27 JB12 A69-1601-47272
2017-10-02 19:26:48 JB12 A69-1601-47272
2017-10-02 19:27:23 JB12 A69-1601-47272
2017-10-02 19:31:46 JB12 A69-1601-47272
2017-10-02 23:52:15 JB12 A69-1601-47272
2017-10-02 23:53:26 JB12 A69-1601-47272
2017-10-02 23:55:13 JB12 A69-1601-47272
2017-10-03 19:53:50 JB13 A69-1601-47272
2017-10-03 19:55:23 JB13 A69-1601-47272
2017-10-03 19:58:26 JB13 A69-1601-47272
2017-10-04 13:15:13 JB12 A69-1601-47280
2017-10-04 13:16:42 JB12 A69-1601-47280
2017-10-04 13:21:39 JB12 A69-1601-47280
2017-10-04 19:34:54 JB12 A69-1601-47280
2017-10-04 19:55:28 JB12 A69-1601-47280
2017-10-04 20:08:23 JB12 A69-1601-47280
2017-10-04 20:21:43 JB12 A69-1601-47280
2017-10-05 04:55:48 JB13 A69-1601-47280
2017-10-05 04:57:04 JB13 A69-1601-47280
2017-10-05 05:18:40 JB13 A69-1601-47280
2017-10-07 21:24:19 JB13 A69-1601-47280
2017-10-07 21:25:36 JB13 A69-1601-47280
2017-10-07 21:29:25 JB13 A69-1601-47280
Using pandas, the solution is pretty straightforward:
from datetime import datetime
from pandas import DataFrame, concat
from numpy import timedelta64
# this is just to read the data: the first 19 characters are the timestamp,
# the rest is whitespace-separated location and ID
with open('data.txt') as f:
    df = DataFrame(
        [datetime.strptime(line[:19], '%Y-%m-%d %H:%M:%S'), *line[19:].split()]
        for line in f)
# the solution starts here, only three statements
# first create a column 'period' to assist with grouping periods: start at 0 and
# increase the counter whenever the gap to the previous detection exceeds 30 minutes
df['period'] = [x := 0] + list(
    x if a - b < timedelta64(30, 'm') else (x := x + 1)
    for a, b in zip(df[0][1:], df[0][0:]))
# this is the grouping you want: by period, ID (column 2) and location (column 1)
dfg = df.groupby(by=['period', 2, 1])
# resulting dataframe, adding start and end, turning index levels into columns and renaming
result = concat([dfg.min(), dfg.max()], axis=1)\
    .reset_index(level=[1, 2])\
    .set_axis(['id', 'location', 'start', 'end'], axis=1)
print(result)
Output:
id location start end
period
0 A69-1601-47272 JB12 2017-10-02 19:23:27 2017-10-02 19:31:46
1 A69-1601-47272 JB12 2017-10-02 23:52:15 2017-10-02 23:55:13
2 A69-1601-47272 JB13 2017-10-03 19:53:50 2017-10-03 19:58:26
3 A69-1601-47280 JB12 2017-10-04 13:15:13 2017-10-04 13:21:39
4 A69-1601-47280 JB12 2017-10-04 19:34:54 2017-10-04 20:21:43
5 A69-1601-47280 JB13 2017-10-05 04:55:48 2017-10-05 05:18:40
6 A69-1601-47280 JB13 2017-10-07 21:24:19 2017-10-07 21:29:25
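As a side note on the reading step: the file could also be parsed by pandas directly; a sketch, assuming the whitespace-separated layout of data.txt shown above (the column names here are my own):
import pandas as pd

# the timestamp spans two whitespace-separated fields, so read date and time
# separately and combine them afterwards
raw = pd.read_csv('data.txt', sep=r'\s+', header=None,
                  names=['date', 'time', 'location', 'ID'])
df = pd.DataFrame({
    'timestampUTC': pd.to_datetime(raw['date'] + ' ' + raw['time']),
    'location': raw['location'],
    'ID': raw['ID'],
})
The rest of the code would then refer to these columns by name rather than by the integer labels 0, 1 and 2.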
You asked for "the best" solution - there is of course no such thing. Whether a solution is sufficient for your use case is what matters, and perhaps this one is. "Better" can then be anything like 'faster', 'easier to read', 'taking less memory', 'shorter', etc.
CodePudding user response:
Let's take the time difference in minutes between consecutive rows of the timestampUTC column (this assumes timestampUTC is already a datetime column):
df['minutes'] = (df['timestampUTC'] - df['timestampUTC'].shift()).dt.total_seconds() / 60
print(df)
'''
timestampUTC location ID minutes
0 2017-10-02 19:23:27 JB12 A69-1601-47272 NaN
1 2017-10-02 19:26:48 JB12 A69-1601-47272 3.350000
2 2017-10-02 19:27:23 JB12 A69-1601-47272 0.583333
3 2017-10-02 19:31:46 JB12 A69-1601-47272 4.383333
4 2017-10-02 23:52:15 JB12 A69-1601-47272 260.483333
5 2017-10-02 23:53:26 JB12 A69-1601-47272 1.183333
6 2017-10-02 23:55:13 JB12 A69-1601-47272 1.783333
7 2017-10-03 19:53:50 JB13 A69-1601-47272 1198.616667
8 2017-10-03 19:55:23 JB13 A69-1601-47272 1.550000
9 2017-10-03 19:58:26 JB13 A69-1601-47272 3.050000
10 2017-10-04 13:15:13 JB12 A69-1601-47280 1036.783333
11 2017-10-04 13:16:42 JB12 A69-1601-47280 1.483333
12 2017-10-04 13:21:39 JB12 A69-1601-47280 4.950000
13 2017-10-04 19:34:54 JB12 A69-1601-47280 373.250000
14 2017-10-04 19:55:28 JB12 A69-1601-47280 20.566667
15 2017-10-04 20:08:23 JB12 A69-1601-47280 12.916667
16 2017-10-04 20:21:43 JB12 A69-1601-47280 13.333333
17 2017-10-05 04:55:48 JB13 A69-1601-47280 514.083333
18 2017-10-05 04:57:04 JB13 A69-1601-47280 1.266667
19 2017-10-05 05:18:40 JB13 A69-1601-47280 21.600000
20 2017-10-07 21:24:19 JB13 A69-1601-47280 3845.650000
21 2017-10-07 21:25:36 JB13 A69-1601-47280 1.283333
22 2017-10-07 21:29:25 JB13 A69-1601-47280 3.816667
'''
Then let's form groups according to the 30-minute limit and assign each one an id: whenever the elapsed time is more than 30 minutes, a new id starts.
s = df['minutes'].gt(30).groupby(df['ID']).cumsum()
df['idx'] = df.groupby(['ID', s]).ngroup() + 1
print(df)
'''
timestampUTC location ID minutes idx
0 2017-10-02 19:23:27 JB12 A69-1601-47272 NaN 1
1 2017-10-02 19:26:48 JB12 A69-1601-47272 3.350000 1
2 2017-10-02 19:27:23 JB12 A69-1601-47272 0.583333 1
3 2017-10-02 19:31:46 JB12 A69-1601-47272 4.383333 1
4 2017-10-02 23:52:15 JB12 A69-1601-47272 260.483333 2
5 2017-10-02 23:53:26 JB12 A69-1601-47272 1.183333 2
6 2017-10-02 23:55:13 JB12 A69-1601-47272 1.783333 2
7 2017-10-03 19:53:50 JB13 A69-1601-47272 1198.616667 3
8 2017-10-03 19:55:23 JB13 A69-1601-47272 1.550000 3
9 2017-10-03 19:58:26 JB13 A69-1601-47272 3.050000 3
10 2017-10-04 13:15:13 JB12 A69-1601-47280 1036.783333 4
11 2017-10-04 13:16:42 JB12 A69-1601-47280 1.483333 4
12 2017-10-04 13:21:39 JB12 A69-1601-47280 4.950000 4
13 2017-10-04 19:34:54 JB12 A69-1601-47280 373.250000 5
14 2017-10-04 19:55:28 JB12 A69-1601-47280 20.566667 5
15 2017-10-04 20:08:23 JB12 A69-1601-47280 12.916667 5
16 2017-10-04 20:21:43 JB12 A69-1601-47280 13.333333 5
17 2017-10-05 04:55:48 JB13 A69-1601-47280 514.083333 6
18 2017-10-05 04:57:04 JB13 A69-1601-47280 1.266667 6
19 2017-10-05 05:18:40 JB13 A69-1601-47280 21.600000 6
20 2017-10-07 21:24:19 JB13 A69-1601-47280 3845.650000 7
21 2017-10-07 21:25:36 JB13 A69-1601-47280 1.283333 7
22 2017-10-07 21:29:25 JB13 A69-1601-47280 3.816667 7
'''
Finally, let's group by the idx column and aggregate the event start and end times.
final = df.groupby('idx').agg(ID=('ID', 'first'),
                              location=('location', 'first'),
                              event_start=('timestampUTC', 'min'),
                              event_end=('timestampUTC', 'max'))
print(final)
'''
ID location event_start event_end
idx
1 A69-1601-47272 JB12 2017-10-02 19:23:27 2017-10-02 19:31:46
2 A69-1601-47272 JB12 2017-10-02 23:52:15 2017-10-02 23:55:13
3 A69-1601-47272 JB13 2017-10-03 19:53:50 2017-10-03 19:58:26
4 A69-1601-47280 JB12 2017-10-04 13:15:13 2017-10-04 13:21:39
5 A69-1601-47280 JB12 2017-10-04 19:34:54 2017-10-04 20:21:43
6 A69-1601-47280 JB13 2017-10-05 04:55:48 2017-10-05 05:18:40
7 A69-1601-47280 JB13 2017-10-07 21:24:19 2017-10-07 21:29:25
'''
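One caveat (my note, not part of the answer above): the grouping key here is ID plus the 30-minute gap, so if the same ID could be detected at two different locations less than 30 minutes apart, those rows would end up in one event. A minimal sketch that also splits on a change of location, under the same assumptions as above (timestampUTC already a datetime column, rows ordered as in the example):
# start a new event on a gap of more than 30 minutes OR when ID/location changes
gap = df['timestampUTC'].diff().dt.total_seconds().div(60).gt(30)
changed = (df[['ID', 'location']] != df[['ID', 'location']].shift()).any(axis=1)
df['idx'] = (gap | changed).cumsum()

final = df.groupby('idx').agg(ID=('ID', 'first'),
                              location=('location', 'first'),
                              event_start=('timestampUTC', 'min'),
                              event_end=('timestampUTC', 'max'))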