Home > other >  How to filter one data frame on difference in date between other dataframe in R
How to filter one data frame on difference in date between other dataframe in R

Time:11-21

So I have a main dataframe 'df' that contains date ranges grouped by categories 'name' and 'values'. Values is a subcategory of name.

I have a second dataframe with a list of dates also by category, 'filterdf'. What I need to do is say given data from filterdf$baddates, exclude all dates in gooddates 0 to 2 days before the date in baddates by category.

df <- data.frame (name  = c("name_1", "name_1", "name_2", "name_2", "name_2", "name_3", "name_3", "name_3"),
                  values = c("value_1", "value_1", "value_2","value_4","value_4","value_3","value_3","value_3"),
                  gooddates = c("2022-02-02","2022-02-03","2022-02-04","2022-02-03","2022-02-04","2022-02-03","2022-02-04","2022-02-06"))

name  values  gooddates
1 name_1 value_1 2022-02-02
2 name_1 value_1 2022-02-03
3 name_2 value_2 2022-02-04
4 name_2 value_4 2022-02-03
5 name_2 value_4 2022-02-04
6 name_3 value_3 2022-02-03
7 name_3 value_3 2022-02-04
8 name_3 value_3 2022-02-06

filterdf <- data.frame(name  = c("name_1", "name_2", "name_3", "name_3"),
baddates = c("2022-02-03","2022-02-03","2022-02-04","2022-02-05"))

    name   baddates
1 name_1 2022-02-03
2 name_2 2022-02-03
3 name_3 2022-02-04
4 name_3 2022-02-05

Since I need an asymmetrical filter, I can't use the strategy I had hoped (when the dates were both in the original df), which is:

result <- df %>% filter( abs(baddates - gooddates) < 2 )

I need the result to be:

result <- data.frame (name  = c( "name_2", "name_2","name_3"),
                  values = c( "value_2","value_4","value_3"),
                  gooddates = c("2022-02-04","2022-02-04","2022-02-06"))

    name  values  gooddates
1 name_2 value_2 2022-02-04
2 name_2 value_4 2022-02-04
3 name_3 value_3 2022-02-06

This will be on a larger dataframe where name and values will need to be filtered by as a group on both name and value so I would like a dplyr solution if possible.

CodePudding user response:

Maybe this works with your data set

library(dplyr)

left_join(df, filterdf) %>% 
  mutate(dif = difftime(baddates, gooddates, units = "days")) %>% 
  filter(dif < 0 & dif >-2) %>% 
  select(name:gooddates)
Joining, by = "name"
    name  values  gooddates
1 name_2 value_2 2022-02-04
2 name_2 value_4 2022-02-04
3 name_3 value_3 2022-02-06

CodePudding user response:

With dplyr, we may do a join_by

library(dplyr)
df %>% 
  inner_join(filterdf, join_by(name,
      closest(gooddates > baddates))) %>%   
  select(-baddates)

-output

   name  values  gooddates
1 name_2 value_2 2022-02-04
2 name_2 value_4 2022-02-04
3 name_3 value_3 2022-02-06

Or use powerjoin

library(powerjoin)
df$gooddates <- as.Date(df$gooddates)
filterdf$baddates <- as.Date(filterdf$baddates)
power_inner_join(df, filterdf, by = c("name",
   ~   (.x$gooddates - .y$baddates) >= 1)) %>% 
  distinct(name, values, gooddates)

-output

   name  values  gooddates
1 name_2 value_2 2022-02-04
2 name_2 value_4 2022-02-04
3 name_3 value_3 2022-02-06
  • Related