Filter a timeseries for multiple ranges-CodePudding

Let's say I have this timeseries dataframe:

         date value1
1  2021-10-12  1.015
2  2021-10-13     NA
3  2021-10-14     NA
4  2021-10-15  1.015
5  2021-10-16  1.015
6  2021-10-17  1.015
7  2021-10-18  1.015
8  2021-10-19  1.015
9  2021-10-20  1.015
10 2021-10-21  1.015
11 2021-10-22  1.015
12 2021-10-23  1.015

df1 <- structure(list(date = structure(c(18912, 18913, 18914, 18915, 
                                       18916, 18917, 18918, 18919, 
                                       18920, 18921, 18922, 18923), class = "Date"), 
                       value1 = c(1.015, NA, NA, 1.015, 1.015, 1.015, 1.015, 1.015, 
                                  1.015, 1.015, 1.015, 1.015)), 
                  row.names = c(NA, -12L), class = "data.frame")

I want to filter this dataset for ranges of dates stored in a dataframe, e.g.

Start      End
2021-10-12 2021-10-14
2021-10-16 2021-10-18
2021-10-22 2021-10-23

dtr <- structure(list(Start = structure(c(18912, 18916, 18922), class = "Date"), 
                      End = structure(c(18914, 18918, 18923), class = "Date")), 
                 class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -3L))

If I want to do this manually, I can write a case_when and use between or something to that effect to filter for each range. But there should be a way to loop through the ranges and use a vectorized solution for filtering.

Here's the case_when method:

df1 %>%
  filter(case_when(
    between(date, dtr$Start[1] & dtr$End[1]) ~ T,
    between(date, dtr$Start[2] & dtr$End[2]) ~ T,
    between(date, dtr$Start[3] & dtr$End[3]) ~ T,
    TRUE ~ F
  )


##         date value1
## 1 2021-10-12  1.015
## 2 2021-10-13     NA
## 3 2021-10-14     NA
## 4 2021-10-16  1.015
## 5 2021-10-17  1.015
## 6 2021-10-18  1.015
## 7 2021-10-22  1.015
## 8 2021-10-23  1.015

How can I vectorize filtering?

CodePudding user response：

We can loop through dtr rows, apply the filter, and then bind_rows:

library(dplyr)

bind_rows(lapply(1:nrow(dtr), 
                 function(i) filter(df1, between(date, dtr$Start[i], dtr$End[i]))))

##         date value1
## 1 2021-10-12  1.015
## 2 2021-10-13     NA
## 3 2021-10-14     NA
## 4 2021-10-16  1.015
## 5 2021-10-17  1.015
## 6 2021-10-18  1.015
## 7 2021-10-22  1.015
## 8 2021-10-23  1.015

data.table has %inrange% function which will be really efficient for this case; see Convenience functions for range subsets.

library(data.table)
setDT(df1)[date %inrange% setDT(dtr)]

CodePudding user response：

In the devel version of dplyr, we could use join_by

library(dplyr)
inner_join(df1, dtr, join_by(between(date, Start, End))) %>% 
   select(names(df1))

-output

        date value1
1 2021-10-12  1.015
2 2021-10-13     NA
3 2021-10-14     NA
4 2021-10-16  1.015
5 2021-10-17  1.015
6 2021-10-18  1.015
7 2021-10-22  1.015
8 2021-10-23  1.015

Or using pmap/map2

library(purrr)
pmap_dfr(dtr, ~ filter(df1, between(date, .x, .y)))
        date value1
1 2021-10-12  1.015
2 2021-10-13     NA
3 2021-10-14     NA
4 2021-10-16  1.015
5 2021-10-17  1.015
6 2021-10-18  1.015
7 2021-10-22  1.015
8 2021-10-23  1.015

CodePudding user response：

Here is an alternative option (also a join) with fuzzyjoin. What I like is the match_fun argument:

library(dplyr)
library(fuzzyjoin)
df1 %>%
  fuzzy_inner_join(y = dtr,
                   by = c("date" = "Start", "date" = "End"),
                   match_fun = list(`>=`, `<=`))

        date value1      Start        End
1 2021-10-12  1.015 2021-10-12 2021-10-14
2 2021-10-13     NA 2021-10-12 2021-10-14
3 2021-10-14     NA 2021-10-12 2021-10-14
4 2021-10-16  1.015 2021-10-16 2021-10-18
5 2021-10-17  1.015 2021-10-16 2021-10-18
6 2021-10-18  1.015 2021-10-16 2021-10-18
7 2021-10-22  1.015 2021-10-22 2021-10-23
8 2021-10-23  1.015 2021-10-22 2021-10-23