This is a follow up question to Removing rows if they occur within a certain time of each other by a group value in R.
I have modified the df
presented there with a particular case I am trying to filter.
This is the code
with a quick way to filter out rows with less than a 5 minutes interval using tidyverse (similar to the solution posted by PaulS in the original question).
library(tidyverse)
df <- tribble(
~Row, ~Timestamp, ~ID,
1, "0020-06-29 12:14:00", "B",
2, "0020-06-29 12:27:00", "A",
3, "0020-06-29 12:27:22", "B",
4, "0020-06-29 12:28:30", "A",
5, "0020-06-29 12:31:30", "A",
6, "0020-06-29 12:33:30", "A",
7, "0020-06-29 12:33:45", "A",
8, "0020-06-29 12:43:00", "B",
9, "0020-06-29 12:44:00", "C",
10, "0020-06-29 12:45:00", "B",
11, "0020-06-29 12:55:00", "A",
12, "0020-06-29 12:57:00", "C",
13, "0020-06-29 13:04:00", "B",
)
df %>%
group_by(ID) %>%
mutate(d = abs(difftime(lag(Timestamp), Timestamp)),
keep = is.na(d) | d > 5*60) %>%
filter(keep) %>%
select(-d, -keep) %>%
arrange(Row)
This is the result:
# A tibble: 8 × 3
# Groups: ID [3]
Row Timestamp ID
<dbl> <chr> <chr>
1 0020-06-29 12:14:00 B
2 0020-06-29 12:27:00 A
3 0020-06-29 12:27:22 B
8 0020-06-29 12:43:00 B
9 0020-06-29 12:44:00 C
11 0020-06-29 12:55:00 A
12 0020-06-29 12:57:00 C
13 0020-06-29 13:04:00 B
This is not what I want because the time difference is always obtained to the previous row (using lag
). This means that Row 6 and Row 7 are removed because they are less than 5 minutes away from each other and also from Row 5. The truth is, though, that Row 6 is more than five minutes away from the first instance in group A. It should be kept and become the first instance for a new interval and then Row 7 would be removed accordingly to the distance to the new first instance in Row 6.
What I have been unable to obtain (without a loop) is how to define this time interval groups of 5 minutes, not from successive rows, but from the first instance after 5 minues.
EDIT 1: Possible solution using an hybrid tidyverse-loop approach:
library(tidyverse)
df <- tribble(
~Row, ~Timestamp, ~ID,
1, "0020-06-29 12:14:00", "B",
2, "0020-06-29 12:27:00", "A",
3, "0020-06-29 12:27:22", "B",
4, "0020-06-29 12:28:30", "A",
5, "0020-06-29 12:31:30", "A",
6, "0020-06-29 12:33:30", "A",
7, "0020-06-29 12:33:45", "A",
8, "0020-06-29 12:43:00", "B",
9, "0020-06-29 12:44:00", "C",
10, "0020-06-29 12:45:00", "B",
11, "0020-06-29 12:55:00", "A",
12, "0020-06-29 12:57:00", "C",
13, "0020-06-29 13:04:00", "B",
)
loop <- TRUE
while(loop) {
df <- df %>%
group_by(ID) %>%
mutate(d = abs(difftime(lag(Timestamp), Timestamp, units="secs")),
first = is.na(d) | d > 300,
prev_first = lag(first),
keep = first | is.na(d) | (prev_first & d > 300) | (!first & !prev_first))
if (all(df$keep)) loop <- FALSE
df <- df %>% filter(keep)
}
df <- df %>%
select(Row, Timestamp, ID) %>%
arrange(Row)
This is the real expected result:
# A tibble: 9 × 3
# Groups: ID [3]
Row Timestamp ID
<dbl> <chr> <chr>
1 0020-06-29 12:14:00 B
2 0020-06-29 12:27:00 A
3 0020-06-29 12:27:22 B
6 0020-06-29 12:33:30 A
8 0020-06-29 12:43:00 B
9 0020-06-29 12:44:00 C
11 0020-06-29 12:55:00 A
12 0020-06-29 12:57:00 C
13 0020-06-29 13:04:00 B
EDIT 2: More clarifiations. This is the df
ordered by ID and with a comment on each row on whether it should be kept or removed:
Row Timestamp ID
<dbl> <chr> <chr>
2 0020-06-29 12:27:00 A <-- Keep (first in interval)
4 0020-06-29 12:28:30 A <-- Remove (<5 mins from Row 2)
5 0020-06-29 12:31:30 A <-- Remove (<5 mins from Row 2)
6 0020-06-29 12:33:30 A <-- Keep (first in interval, >5 mins from previous first Row 2)
7 0020-06-29 12:33:45 A <-- Remove (<5 mins from Row 6)
11 0020-06-29 12:55:00 A <-- Keep (first in interval, >5 mins from previous first Row 6)
1 0020-06-29 12:14:00 B <-- Keep (first in interval)
3 0020-06-29 12:27:22 B <-- Keep (first in interval, >5 mins from previous first Row 1)
8 0020-06-29 12:43:00 B <-- Keep (first in interval, >5 mins from previous first Row 3)
10 0020-06-29 12:45:00 B <-- Remove (<5 mins from Row 8)
13 0020-06-29 13:04:00 B <-- Keep (first in interval, >5 mins from previous first Row 8)
9 0020-06-29 12:44:00 C <-- Keep (first in interval)
12 0020-06-29 12:57:00 C <-- Keep (first in interval, >5 mins from previous first Row 9)
Even if this solution is not terrible, do you know whether the while statement can be removed in any way?
CodePudding user response:
You can create a recursive function, such as this based on this answer:
f <- function(d, ind = 1) {
ind.next <- first(which(difftime(d, d[ind], units = "mins") > 5))
if (is.na(ind.next))
return(ind)
else
return(c(ind, f(d, ind.next)))
}
Then for each ID
use slice
for rows based on Timestamp
with the custom function:
library(tidyverse)
df %>%
mutate(Timestamp = as.POSIXct(Timestamp)) %>%
group_by(ID) %>%
slice(f(Timestamp))
Output
Row Timestamp ID
<dbl> <dttm> <chr>
1 2 0020-06-29 12:27:00 A
2 6 0020-06-29 12:33:30 A
3 11 0020-06-29 12:55:00 A
4 1 0020-06-29 12:14:00 B
5 3 0020-06-29 12:27:22 B
6 8 0020-06-29 12:43:00 B
7 13 0020-06-29 13:04:00 B
8 9 0020-06-29 12:44:00 C
9 12 0020-06-29 12:57:00 C