Let's say I have this timeseries dataframe:
date value1
1 2021-10-12 1.015
2 2021-10-13 NA
3 2021-10-14 NA
4 2021-10-15 1.015
5 2021-10-16 1.015
6 2021-10-17 1.015
7 2021-10-18 1.015
8 2021-10-19 1.015
9 2021-10-20 1.015
10 2021-10-21 1.015
11 2021-10-22 1.015
12 2021-10-23 1.015
df1 <- structure(list(date = structure(c(18912, 18913, 18914, 18915,
18916, 18917, 18918, 18919,
18920, 18921, 18922, 18923), class = "Date"),
value1 = c(1.015, NA, NA, 1.015, 1.015, 1.015, 1.015, 1.015,
1.015, 1.015, 1.015, 1.015)),
row.names = c(NA, -12L), class = "data.frame")
I want to filter this dataset for ranges of dates stored in a dataframe, e.g.
Start End
2021-10-12 2021-10-14
2021-10-16 2021-10-18
2021-10-22 2021-10-23
dtr <- structure(list(Start = structure(c(18912, 18916, 18922), class = "Date"),
End = structure(c(18914, 18918, 18923), class = "Date")),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -3L))
If I want to do this manually, I can write a case_when
and use between
or something to that effect to filter for each range. But there should be a way to loop through the ranges and use a vectorized solution for filtering.
Here's the case_when
method:
df1 %>%
filter(case_when(
between(date, dtr$Start[1] & dtr$End[1]) ~ T,
between(date, dtr$Start[2] & dtr$End[2]) ~ T,
between(date, dtr$Start[3] & dtr$End[3]) ~ T,
TRUE ~ F
)
## date value1
## 1 2021-10-12 1.015
## 2 2021-10-13 NA
## 3 2021-10-14 NA
## 4 2021-10-16 1.015
## 5 2021-10-17 1.015
## 6 2021-10-18 1.015
## 7 2021-10-22 1.015
## 8 2021-10-23 1.015
How can I vectorize filtering?
CodePudding user response:
We can loop through dtr
rows, apply the filter
, and then bind_rows
:
library(dplyr)
bind_rows(lapply(1:nrow(dtr),
function(i) filter(df1, between(date, dtr$Start[i], dtr$End[i]))))
## date value1
## 1 2021-10-12 1.015
## 2 2021-10-13 NA
## 3 2021-10-14 NA
## 4 2021-10-16 1.015
## 5 2021-10-17 1.015
## 6 2021-10-18 1.015
## 7 2021-10-22 1.015
## 8 2021-10-23 1.015
data.table has %inrange%
function which will be really efficient for this case; see Convenience functions for range subsets.
library(data.table)
setDT(df1)[date %inrange% setDT(dtr)]
CodePudding user response:
In the devel
version of dplyr
, we could use join_by
library(dplyr)
inner_join(df1, dtr, join_by(between(date, Start, End))) %>%
select(names(df1))
-output
date value1
1 2021-10-12 1.015
2 2021-10-13 NA
3 2021-10-14 NA
4 2021-10-16 1.015
5 2021-10-17 1.015
6 2021-10-18 1.015
7 2021-10-22 1.015
8 2021-10-23 1.015
Or using pmap/map2
library(purrr)
pmap_dfr(dtr, ~ filter(df1, between(date, .x, .y)))
date value1
1 2021-10-12 1.015
2 2021-10-13 NA
3 2021-10-14 NA
4 2021-10-16 1.015
5 2021-10-17 1.015
6 2021-10-18 1.015
7 2021-10-22 1.015
8 2021-10-23 1.015
CodePudding user response:
Here is an alternative option (also a join) with fuzzyjoin
. What I like is the match_fun
argument:
library(dplyr)
library(fuzzyjoin)
df1 %>%
fuzzy_inner_join(y = dtr,
by = c("date" = "Start", "date" = "End"),
match_fun = list(`>=`, `<=`))
date value1 Start End
1 2021-10-12 1.015 2021-10-12 2021-10-14
2 2021-10-13 NA 2021-10-12 2021-10-14
3 2021-10-14 NA 2021-10-12 2021-10-14
4 2021-10-16 1.015 2021-10-16 2021-10-18
5 2021-10-17 1.015 2021-10-16 2021-10-18
6 2021-10-18 1.015 2021-10-16 2021-10-18
7 2021-10-22 1.015 2021-10-22 2021-10-23
8 2021-10-23 1.015 2021-10-22 2021-10-23