I have data I need to clean, but not sure how. I need to remove all records that occured less than 7 days after the last observation, excluding those that need to be removed.
Data example:
library(dplyr)
library(lubridate)
df = data.frame(id = c(rep(1,5), rep(2,3)),
date = c(ymd("2022-01-01"), ymd("2022-01-03"), ymd("2022-01-05"), ymd("2022-01-09"), ymd("2022-01-20"),
ymd("2022-01-02"), ymd("2022-01-03"), ymd("2022-01-09"))) %>%
arrange(id, date)
id date
1 1 2022-01-01
2 1 2022-01-03
3 1 2022-01-05
4 1 2022-01-09
5 1 2022-01-20
6 2 2022-01-02
7 2 2022-01-03
8 2 2022-01-09
And I want it to look like this
id date
1 1 2022-01-01
2 1 2022-01-09
3 1 2022-01-20
4 2 2022-01-02
5 2 2022-01-09
I tried using filter()
and lag()
, but they alone do not quite do it:
df %>%
group_by(id) %>%
mutate(prev = lag(date days(7))) %>%
ungroup() %>%
filter(is.na(prev) | (date - prev >= 0))
id date
1 1 2022-01-01
2 1 2022-01-20
3 2 2022-01-02
CodePudding user response:
The following code worked well for me.
library(dplyr)
library(lubridate)
df = data.frame(id = c(rep(1,5), rep(2,3)),
date = c(seq.Date(from = ymd("2022-01-01"), to = ymd("2022-01-15"), by = "weeks"), ymd("2022-01-03"), ymd("2022-01-05"),
ymd("2022-01-02"), ymd("2022-01-03"), ymd("2022-01-09"))) %>%
arrange(id, date)
df %>%
group_by(id) %>%
mutate(k = first(date)) %>%
ungroup() %>%
mutate(l = as.numeric(date-k)) %>%
filter(l%%7 == 0) %>%
select(id, date)
CodePudding user response:
As Ben suggested to look at this question, it does contain an answer. It did not work for me without any changes though, so I am posting slightly modified aichao's code here
library(rlang)
f <- function(d, ind = 1, minDiff = 7) {
ind.next <- first(which(difftime(d,d[ind], units="days") >= all_of(minDiff))
if (is_empty(ind.next))
return(ind)
else
return(c(ind, f(d,ind.next)))
}
result <- df %>%
group_by(id) %>%
slice(f(date)) %>%
ungroup()