The data:
id station start stop
<int> <dbl> <dttm> <dttm>
1 1 1 1899-12-31 00:05:04 1899-12-31 00:13:36
2 2 1 1899-12-31 00:14:04 1899-12-31 00:21:32
3 3 1 1899-12-31 00:21:32 1899-12-31 00:26:56
4 4 1 1899-12-31 00:26:57 1899-12-31 00:27:10
5 1 1 1899-12-31 00:38:32 1899-12-31 00:38:38
6 3 1 1899-12-31 00:43:23 1899-12-31 00:43:47
7 5 1 1899-12-31 00:53:00 1899-12-31 00:53:15
8 6 1 1899-12-31 00:53:29 1899-12-31 00:55:49
9 6 1 1899-12-31 00:56:15 1899-12-31 00:56:42
10 7 1 1899-12-31 01:14:30 1899-12-31 01:28:09
11 3 1 1899-12-31 01:28:09 1899-12-31 01:31:17
12 3 2 1899-12-31 01:34:22 1899-12-31 01:35:11
13 7 3 1899-12-31 01:36:44 1899-12-31 01:39:54
14 8 3 1899-12-31 01:41:49 1899-12-31 01:44:45
15 9 2 1899-12-31 01:44:47 1899-12-31 01:52:07
The need:
I need to merge the times of repeated id's so that they count as just one. For example, id #6 has two consecutive actions, each with a start and stop times (lines 8 and 9). The desired output would be that instead of this:
11 3 1 1899-12-31 01:28:09 1899-12-31 01:31:17
12 3 2 1899-12-31 01:34:22 1899-12-31 01:35:11
Something like this would be shown
11 3 1 1899-12-31 01:28:09 1899-12-31 01:35:11
So, both events as a single event with the first start timestamp, and the stop of the second one.
The caveat
This solution should merge times only if they belong to the same station
, so for example in the sample data id #3 in line 11 would not be merged with line 12, since they belong to different stations.
A solution using tools from tidyverse is preferred but not necessary.
CodePudding user response:
Perhaps you could try the following. Assign a unique second id id2
that will be the same for repeat, consecutive id
's. Then, after grouping by the new id2
, summarise
and take the first
start
and the last
stop
dates/times.
library(tidyverse)
library(data.table)
df %>%
mutate(id2 = rleid(id)) %>%
group_by(id2) %>%
summarise(id = first(id), start = first(start), stop = last(stop)) %>%
ungroup() %>%
select(id, start, stop)
Output
id start stop
<int> <dttm> <dttm>
1 1 1899-12-31 00:05:04 1899-12-31 00:13:36
2 2 1899-12-31 00:14:04 1899-12-31 00:21:32
3 3 1899-12-31 00:21:32 1899-12-31 00:26:56
4 4 1899-12-31 00:26:57 1899-12-31 00:27:10
5 1 1899-12-31 00:38:32 1899-12-31 00:38:38
6 3 1899-12-31 00:43:23 1899-12-31 00:43:47
7 5 1899-12-31 00:53:00 1899-12-31 00:53:15
8 6 1899-12-31 00:53:29 1899-12-31 00:56:42
9 7 1899-12-31 01:14:30 1899-12-31 01:28:09
10 3 1899-12-31 01:28:09 1899-12-31 01:35:11
11 7 1899-12-31 01:36:44 1899-12-31 01:39:54
12 8 1899-12-31 01:41:49 1899-12-31 01:44:45
13 9 1899-12-31 01:44:47 1899-12-31 01:52:07
For the caveat with different stations for a given id
you can add station
to rleid
, such that you will get a unique id2
for each combination of id
and station
combined.
In addition, you can use unique
as alternative to first
in summarise
to keep the column values you need (just an option).
df %>%
mutate(id2 = rleid(id, station)) %>%
group_by(id2) %>%
summarise(id = unique(id), station = unique(station), start = first(start), stop = last(stop)) %>%
ungroup() %>%
select(id, station, start, stop)
Output
id station start stop
<int> <dbl> <dttm> <dttm>
1 1 1 1899-12-31 00:05:04 1899-12-31 00:13:36
2 2 1 1899-12-31 00:14:04 1899-12-31 00:21:32
3 3 1 1899-12-31 00:21:32 1899-12-31 00:26:56
4 4 1 1899-12-31 00:26:57 1899-12-31 00:27:10
5 1 1 1899-12-31 00:38:32 1899-12-31 00:38:38
6 3 1 1899-12-31 00:43:23 1899-12-31 00:43:47
7 5 1 1899-12-31 00:53:00 1899-12-31 00:53:15
8 6 1 1899-12-31 00:53:29 1899-12-31 00:56:42
9 7 1 1899-12-31 01:14:30 1899-12-31 01:28:09
10 3 1 1899-12-31 01:28:09 1899-12-31 01:31:17
11 3 2 1899-12-31 01:34:22 1899-12-31 01:35:11
12 7 3 1899-12-31 01:36:44 1899-12-31 01:39:54
13 8 3 1899-12-31 01:41:49 1899-12-31 01:44:45
14 9 2 1899-12-31 01:44:47 1899-12-31 01:52:07