Home > database >  Recursive filtering in R
Recursive filtering in R

Time:08-06

I have data I need to clean, but not sure how. I need to remove all records that occured less than 7 days after the last observation, excluding those that need to be removed.

Data example:

library(dplyr)
library(lubridate)
df = data.frame(id = c(rep(1,5), rep(2,3)),
                date = c(ymd("2022-01-01"), ymd("2022-01-03"), ymd("2022-01-05"), ymd("2022-01-09"), ymd("2022-01-20"),
                         ymd("2022-01-02"), ymd("2022-01-03"), ymd("2022-01-09"))) %>%
  arrange(id, date)

  id       date
1  1 2022-01-01
2  1 2022-01-03
3  1 2022-01-05
4  1 2022-01-09
5  1 2022-01-20
6  2 2022-01-02
7  2 2022-01-03
8  2 2022-01-09

And I want it to look like this

  id       date
1  1 2022-01-01
2  1 2022-01-09
3  1 2022-01-20
4  2 2022-01-02
5  2 2022-01-09

I tried using filter() and lag(), but they alone do not quite do it:

df %>% 
  group_by(id) %>% 
  mutate(prev = lag(date   days(7))) %>% 
  ungroup() %>% 
  filter(is.na(prev) | (date - prev >= 0))

     id       date      
1     1 2022-01-01
2     1 2022-01-20
3     2 2022-01-02

CodePudding user response:

The following code worked well for me.

library(dplyr)
library(lubridate)

df = data.frame(id = c(rep(1,5), rep(2,3)),
                date = c(seq.Date(from = ymd("2022-01-01"), to = ymd("2022-01-15"), by = "weeks"), ymd("2022-01-03"), ymd("2022-01-05"),
                         ymd("2022-01-02"), ymd("2022-01-03"), ymd("2022-01-09"))) %>%
  arrange(id, date)


df %>% 
  group_by(id) %>% 
  mutate(k = first(date)) %>% 
  ungroup() %>% 
  mutate(l = as.numeric(date-k)) %>% 
  filter(l%%7 == 0) %>% 
  select(id, date)

CodePudding user response:

As Ben suggested to look at this question, it does contain an answer. It did not work for me without any changes though, so I am posting slightly modified aichao's code here

library(rlang)
f <- function(d, ind = 1, minDiff = 7) {
  ind.next <- first(which(difftime(d,d[ind], units="days") >= all_of(minDiff))
  if (is_empty(ind.next))
    return(ind)
  else
    return(c(ind, f(d,ind.next)))
}

result <- df %>% 
  group_by(id) %>% 
  slice(f(date)) %>% 
  ungroup()

  • Related