Removing rows if they occur within a certain time of first instance by a group value in R-CodePudding

This is a follow up question to Removing rows if they occur within a certain time of each other by a group value in R.

I have modified the df presented there with a particular case I am trying to filter.

This is the code with a quick way to filter out rows with less than a 5 minutes interval using tidyverse (similar to the solution posted by PaulS in the original question).

library(tidyverse)

df <- tribble(
  ~Row,   ~Timestamp,                ~ID,
  1,      "0020-06-29 12:14:00",     "B",
  2,      "0020-06-29 12:27:00",     "A", 
  3,      "0020-06-29 12:27:22",     "B",  
  4,      "0020-06-29 12:28:30",     "A", 
  5,      "0020-06-29 12:31:30",     "A", 
  6,      "0020-06-29 12:33:30",     "A", 
  7,      "0020-06-29 12:33:45",     "A", 
  8,      "0020-06-29 12:43:00",     "B", 
  9,      "0020-06-29 12:44:00",     "C", 
  10,     "0020-06-29 12:45:00",     "B", 
  11,     "0020-06-29 12:55:00",     "A", 
  12,     "0020-06-29 12:57:00",     "C", 
  13,     "0020-06-29 13:04:00",     "B", 
)

df %>% 
  group_by(ID) %>% 
  mutate(d = abs(difftime(lag(Timestamp), Timestamp)),
         keep = is.na(d) | d > 5*60) %>% 
  filter(keep) %>% 
  select(-d, -keep) %>% 
  arrange(Row)

This is the result:

# A tibble: 8 × 3
# Groups:   ID [3]
  Row Timestamp           ID   
<dbl> <chr>               <chr>
    1 0020-06-29 12:14:00 B    
    2 0020-06-29 12:27:00 A    
    3 0020-06-29 12:27:22 B    
    8 0020-06-29 12:43:00 B    
    9 0020-06-29 12:44:00 C    
   11 0020-06-29 12:55:00 A    
   12 0020-06-29 12:57:00 C    
   13 0020-06-29 13:04:00 B

This is not what I want because the time difference is always obtained to the previous row (using lag). This means that Row 6 and Row 7 are removed because they are less than 5 minutes away from each other and also from Row 5. The truth is, though, that Row 6 is more than five minutes away from the first instance in group A. It should be kept and become the first instance for a new interval and then Row 7 would be removed accordingly to the distance to the new first instance in Row 6.

What I have been unable to obtain (without a loop) is how to define this time interval groups of 5 minutes, not from successive rows, but from the first instance after 5 minues.

EDIT 1: Possible solution using an hybrid tidyverse-loop approach:

library(tidyverse)

df <- tribble(
  ~Row,   ~Timestamp,                ~ID,
  1,      "0020-06-29 12:14:00",     "B",
  2,      "0020-06-29 12:27:00",     "A", 
  3,      "0020-06-29 12:27:22",     "B",  
  4,      "0020-06-29 12:28:30",     "A", 
  5,      "0020-06-29 12:31:30",     "A", 
  6,      "0020-06-29 12:33:30",     "A", 
  7,      "0020-06-29 12:33:45",     "A", 
  8,      "0020-06-29 12:43:00",     "B", 
  9,      "0020-06-29 12:44:00",     "C", 
  10,     "0020-06-29 12:45:00",     "B", 
  11,     "0020-06-29 12:55:00",     "A", 
  12,     "0020-06-29 12:57:00",     "C", 
  13,     "0020-06-29 13:04:00",     "B", 
)

loop <- TRUE

while(loop) {
  
  df <- df %>% 
    group_by(ID) %>% 
    mutate(d = abs(difftime(lag(Timestamp), Timestamp, units="secs")),
           first = is.na(d) | d > 300,
           prev_first = lag(first),
           keep = first | is.na(d) | (prev_first & d > 300) | (!first & !prev_first))
  
  if (all(df$keep)) loop <- FALSE
  
  df <- df %>% filter(keep)
  
}

df <- df %>% 
  select(Row, Timestamp, ID) %>% 
  arrange(Row)

This is the real expected result:

# A tibble: 9 × 3
# Groups:   ID [3]
  Row Timestamp           ID   
<dbl> <chr>               <chr>
    1 0020-06-29 12:14:00 B    
    2 0020-06-29 12:27:00 A    
    3 0020-06-29 12:27:22 B    
    6 0020-06-29 12:33:30 A    
    8 0020-06-29 12:43:00 B    
    9 0020-06-29 12:44:00 C    
   11 0020-06-29 12:55:00 A    
   12 0020-06-29 12:57:00 C    
   13 0020-06-29 13:04:00 B

EDIT 2: More clarifiations. This is the df ordered by ID and with a comment on each row on whether it should be kept or removed:

  Row Timestamp           ID   
<dbl> <chr>               <chr>
    2 0020-06-29 12:27:00 A  <-- Keep (first in interval)  
    4 0020-06-29 12:28:30 A  <-- Remove (<5 mins from Row 2)   
    5 0020-06-29 12:31:30 A  <-- Remove (<5 mins from Row 2)  
    6 0020-06-29 12:33:30 A  <-- Keep (first in interval, >5 mins from previous first Row 2)  
    7 0020-06-29 12:33:45 A  <-- Remove (<5 mins from Row 6)    
   11 0020-06-29 12:55:00 A  <-- Keep (first in interval, >5 mins from previous first Row 6)  
    1 0020-06-29 12:14:00 B  <-- Keep (first in interval)
    3 0020-06-29 12:27:22 B  <-- Keep (first in interval, >5 mins from previous first Row 1)
    8 0020-06-29 12:43:00 B  <-- Keep (first in interval, >5 mins from previous first Row 3)
   10 0020-06-29 12:45:00 B  <-- Remove (<5 mins from Row 8)
   13 0020-06-29 13:04:00 B  <-- Keep (first in interval, >5 mins from previous first Row 8)  
    9 0020-06-29 12:44:00 C  <-- Keep (first in interval)  
   12 0020-06-29 12:57:00 C  <-- Keep (first in interval, >5 mins from previous first Row 9)

Even if this solution is not terrible, do you know whether the while statement can be removed in any way?

CodePudding user response：

You can create a recursive function, such as this based on this answer:

f <- function(d, ind = 1) {
  ind.next <- first(which(difftime(d, d[ind], units = "mins") > 5))
  if (is.na(ind.next))
    return(ind)
  else
    return(c(ind, f(d, ind.next)))
}

Then for each ID use slice for rows based on Timestamp with the custom function:

library(tidyverse)

df %>%
  mutate(Timestamp = as.POSIXct(Timestamp)) %>%
  group_by(ID) %>%
  slice(f(Timestamp))

Output

    Row Timestamp           ID   
  <dbl> <dttm>              <chr>
1     2 0020-06-29 12:27:00 A    
2     6 0020-06-29 12:33:30 A    
3    11 0020-06-29 12:55:00 A    
4     1 0020-06-29 12:14:00 B    
5     3 0020-06-29 12:27:22 B    
6     8 0020-06-29 12:43:00 B    
7    13 0020-06-29 13:04:00 B    
8     9 0020-06-29 12:44:00 C    
9    12 0020-06-29 12:57:00 C