I have a have a dataset that looks something like this:
data <- data.frame(group = c("09081997", "13122006", "09081997", "22031969", "09081997"),
date1 = c("2021-08-09", "2021-08-10", "2021-08-21", "2021-07-19", "2021-07-15"),
date2 = c("2021-08-31", "2021-08-22", "2021-08-29", "2021-07-20", "2021-07-19"))
There are duplicated numbers in "group" variable. For example, I need to filter between the dates "2021-08-01" to "2021-08-31". By doing that, I would "delete" the last two numbers from "group", but I need to keep all duplicates, even if they're not between the time period I want to filter. I'd need to keep all "09081997" registers.
Is it possible to do that?
CodePudding user response:
If I understood correctly, this will work. I created an auxiliary variable counting by group, and only applied the filter for dates of groups that appear only one time.
library(dplyr)
library(lubridate)
data %>%
mutate(across(.cols = starts_with("date"),.fns = ymd)) %>%
add_count(group) %>%
filter(!(n == 1 & (date1 >= ymd("2021-08-01") & date2 <= ymd("2021-08-31"))))
group date1 date2 n
1 09081997 2021-08-09 2021-08-31 3
2 09081997 2021-08-21 2021-08-29 3
3 22031969 2021-07-19 2021-07-20 1
4 09081997 2021-07-15 2021-07-19 3
I am just unsure how date1 and date2 are supposed to be filtered.
CodePudding user response:
Using ave
, you could groupwise grepl
for '2021-08'
pat
tern and check if it occurs any
where. Since date*
columns are character we get "false" booleans, but we can easily turn the mode
to "logical"
. Finally we check if the rowSums
of the boolean are greater than zero, i.e. if any of the two dates in that row falls in '2021-08'
, which yields the desired boolean vector to subset the data frame.
data[with(data, ave(cbind(date1, date2), group, FUN=\(x) any(grepl(x, pat='2021-08')))) |>
`mode<-`('logical') |> rowSums() |> base::`>`(0), ]
# group date1 date2
# 1 09081997 2021-08-09 2021-08-31
# 2 13122006 2021-08-10 2021-08-22
# 3 09081997 2021-08-21 2021-08-29
# 5 09081997 2021-07-15 2021-07-19
Data:
data <- structure(list(group = c("09081997", "13122006", "09081997",
"22031969", "09081997"), date1 = c("2021-08-09", "2021-08-10",
"2021-08-21", "2021-07-19", "2021-07-15"), date2 = c("2021-08-31",
"2021-08-22", "2021-08-29", "2021-07-20", "2021-07-19")), class = "data.frame", row.names = c(NA,
-5L))