Home > Software engineering >  Removing rows if they occur within a certain time of first instance by a group value in R
Removing rows if they occur within a certain time of first instance by a group value in R

Time:09-05

This is a follow up question to Removing rows if they occur within a certain time of each other by a group value in R.

I have modified the df presented there with a particular case I am trying to filter.

This is the code with a quick way to filter out rows with less than a 5 minutes interval using tidyverse (similar to the solution posted by PaulS in the original question).

library(tidyverse)

df <- tribble(
  ~Row,   ~Timestamp,                ~ID,
  1,      "0020-06-29 12:14:00",     "B",
  2,      "0020-06-29 12:27:00",     "A", 
  3,      "0020-06-29 12:27:22",     "B",  
  4,      "0020-06-29 12:28:30",     "A", 
  5,      "0020-06-29 12:31:30",     "A", 
  6,      "0020-06-29 12:33:30",     "A", 
  7,      "0020-06-29 12:33:45",     "A", 
  8,      "0020-06-29 12:43:00",     "B", 
  9,      "0020-06-29 12:44:00",     "C", 
  10,     "0020-06-29 12:45:00",     "B", 
  11,     "0020-06-29 12:55:00",     "A", 
  12,     "0020-06-29 12:57:00",     "C", 
  13,     "0020-06-29 13:04:00",     "B", 
)

df %>% 
  group_by(ID) %>% 
  mutate(d = abs(difftime(lag(Timestamp), Timestamp)),
         keep = is.na(d) | d > 5*60) %>% 
  filter(keep) %>% 
  select(-d, -keep) %>% 
  arrange(Row)

This is the result:

# A tibble: 8 × 3
# Groups:   ID [3]
  Row Timestamp           ID   
<dbl> <chr>               <chr>
    1 0020-06-29 12:14:00 B    
    2 0020-06-29 12:27:00 A    
    3 0020-06-29 12:27:22 B    
    8 0020-06-29 12:43:00 B    
    9 0020-06-29 12:44:00 C    
   11 0020-06-29 12:55:00 A    
   12 0020-06-29 12:57:00 C    
   13 0020-06-29 13:04:00 B   

This is not what I want because the time difference is always obtained to the previous row (using lag). This means that Row 6 and Row 7 are removed because they are less than 5 minutes away from each other and also from Row 5. The truth is, though, that Row 6 is more than five minutes away from the first instance in group A. It should be kept and become the first instance for a new interval and then Row 7 would be removed accordingly to the distance to the new first instance in Row 6.

What I have been unable to obtain (without a loop) is how to define this time interval groups of 5 minutes, not from successive rows, but from the first instance after 5 minues.

EDIT 1: Possible solution using an hybrid tidyverse-loop approach:

library(tidyverse)

df <- tribble(
  ~Row,   ~Timestamp,                ~ID,
  1,      "0020-06-29 12:14:00",     "B",
  2,      "0020-06-29 12:27:00",     "A", 
  3,      "0020-06-29 12:27:22",     "B",  
  4,      "0020-06-29 12:28:30",     "A", 
  5,      "0020-06-29 12:31:30",     "A", 
  6,      "0020-06-29 12:33:30",     "A", 
  7,      "0020-06-29 12:33:45",     "A", 
  8,      "0020-06-29 12:43:00",     "B", 
  9,      "0020-06-29 12:44:00",     "C", 
  10,     "0020-06-29 12:45:00",     "B", 
  11,     "0020-06-29 12:55:00",     "A", 
  12,     "0020-06-29 12:57:00",     "C", 
  13,     "0020-06-29 13:04:00",     "B", 
)

loop <- TRUE

while(loop) {
  
  df <- df %>% 
    group_by(ID) %>% 
    mutate(d = abs(difftime(lag(Timestamp), Timestamp, units="secs")),
           first = is.na(d) | d > 300,
           prev_first = lag(first),
           keep = first | is.na(d) | (prev_first & d > 300) | (!first & !prev_first))
  
  if (all(df$keep)) loop <- FALSE
  
  df <- df %>% filter(keep)
  
}

df <- df %>% 
  select(Row, Timestamp, ID) %>% 
  arrange(Row)

This is the real expected result:

# A tibble: 9 × 3
# Groups:   ID [3]
  Row Timestamp           ID   
<dbl> <chr>               <chr>
    1 0020-06-29 12:14:00 B    
    2 0020-06-29 12:27:00 A    
    3 0020-06-29 12:27:22 B    
    6 0020-06-29 12:33:30 A    
    8 0020-06-29 12:43:00 B    
    9 0020-06-29 12:44:00 C    
   11 0020-06-29 12:55:00 A    
   12 0020-06-29 12:57:00 C    
   13 0020-06-29 13:04:00 B 

EDIT 2: More clarifiations. This is the df ordered by ID and with a comment on each row on whether it should be kept or removed:

  Row Timestamp           ID   
<dbl> <chr>               <chr>
    2 0020-06-29 12:27:00 A  <-- Keep (first in interval)  
    4 0020-06-29 12:28:30 A  <-- Remove (<5 mins from Row 2)   
    5 0020-06-29 12:31:30 A  <-- Remove (<5 mins from Row 2)  
    6 0020-06-29 12:33:30 A  <-- Keep (first in interval, >5 mins from previous first Row 2)  
    7 0020-06-29 12:33:45 A  <-- Remove (<5 mins from Row 6)    
   11 0020-06-29 12:55:00 A  <-- Keep (first in interval, >5 mins from previous first Row 6)  
    1 0020-06-29 12:14:00 B  <-- Keep (first in interval)
    3 0020-06-29 12:27:22 B  <-- Keep (first in interval, >5 mins from previous first Row 1)
    8 0020-06-29 12:43:00 B  <-- Keep (first in interval, >5 mins from previous first Row 3)
   10 0020-06-29 12:45:00 B  <-- Remove (<5 mins from Row 8)
   13 0020-06-29 13:04:00 B  <-- Keep (first in interval, >5 mins from previous first Row 8)  
    9 0020-06-29 12:44:00 C  <-- Keep (first in interval)  
   12 0020-06-29 12:57:00 C  <-- Keep (first in interval, >5 mins from previous first Row 9)

Even if this solution is not terrible, do you know whether the while statement can be removed in any way?

CodePudding user response:

You can create a recursive function, such as this based on this answer:

f <- function(d, ind = 1) {
  ind.next <- first(which(difftime(d, d[ind], units = "mins") > 5))
  if (is.na(ind.next))
    return(ind)
  else
    return(c(ind, f(d, ind.next)))
}

Then for each ID use slice for rows based on Timestamp with the custom function:

library(tidyverse)

df %>%
  mutate(Timestamp = as.POSIXct(Timestamp)) %>%
  group_by(ID) %>%
  slice(f(Timestamp))

Output

    Row Timestamp           ID   
  <dbl> <dttm>              <chr>
1     2 0020-06-29 12:27:00 A    
2     6 0020-06-29 12:33:30 A    
3    11 0020-06-29 12:55:00 A    
4     1 0020-06-29 12:14:00 B    
5     3 0020-06-29 12:27:22 B    
6     8 0020-06-29 12:43:00 B    
7    13 0020-06-29 13:04:00 B    
8     9 0020-06-29 12:44:00 C    
9    12 0020-06-29 12:57:00 C 
  • Related