Home > Blockchain >  How to filter between dates but keep duplicates that aren't included in the time period?
How to filter between dates but keep duplicates that aren't included in the time period?

Time:12-04

I have a have a dataset that looks something like this:

data <- data.frame(group = c("09081997", "13122006", "09081997", "22031969", "09081997"),
                   date1 = c("2021-08-09", "2021-08-10", "2021-08-21", "2021-07-19", "2021-07-15"),
                   date2 = c("2021-08-31", "2021-08-22", "2021-08-29", "2021-07-20", "2021-07-19"))

There are duplicated numbers in "group" variable. For example, I need to filter between the dates "2021-08-01" to "2021-08-31". By doing that, I would "delete" the last two numbers from "group", but I need to keep all duplicates, even if they're not between the time period I want to filter. I'd need to keep all "09081997" registers.

Is it possible to do that?

CodePudding user response:

If I understood correctly, this will work. I created an auxiliary variable counting by group, and only applied the filter for dates of groups that appear only one time.

library(dplyr)
library(lubridate)

data %>% 
  mutate(across(.cols = starts_with("date"),.fns = ymd)) %>% 
  add_count(group) %>% 
  filter(!(n == 1 & (date1 >= ymd("2021-08-01") & date2 <= ymd("2021-08-31"))))

     group      date1      date2 n
1 09081997 2021-08-09 2021-08-31 3
2 09081997 2021-08-21 2021-08-29 3
3 22031969 2021-07-19 2021-07-20 1
4 09081997 2021-07-15 2021-07-19 3

I am just unsure how date1 and date2 are supposed to be filtered.

CodePudding user response:

Using ave, you could groupwise grepl for '2021-08' pattern and check if it occurs anywhere. Since date* columns are character we get "false" booleans, but we can easily turn the mode to "logical". Finally we check if the rowSums of the boolean are greater than zero, i.e. if any of the two dates in that row falls in '2021-08', which yields the desired boolean vector to subset the data frame.

data[with(data, ave(cbind(date1, date2), group, FUN=\(x) any(grepl(x, pat='2021-08')))) |> 
       `mode<-`('logical') |> rowSums() |> base::`>`(0), ]
#      group      date1      date2
# 1 09081997 2021-08-09 2021-08-31
# 2 13122006 2021-08-10 2021-08-22
# 3 09081997 2021-08-21 2021-08-29
# 5 09081997 2021-07-15 2021-07-19
 

Data:

data <- structure(list(group = c("09081997", "13122006", "09081997", 
"22031969", "09081997"), date1 = c("2021-08-09", "2021-08-10", 
"2021-08-21", "2021-07-19", "2021-07-15"), date2 = c("2021-08-31", 
"2021-08-22", "2021-08-29", "2021-07-20", "2021-07-19")), class = "data.frame", row.names = c(NA, 
-5L))
  • Related