I have a data frame with daily observations from several years. Some days are missing from the dataset:
library(tibble)
df <- tibble(time = seq(as.Date("2010/1/1"), as.Date("2020/12/31"), "days"),
             value = runif(4018))
# reproducing missing days
df <- df[-sample.int(nrow(df), 100), ]
I am trying to use dplyr::group_by to group my data frame by the same date range in every year. However, the range starts in one year and ends in the next, e.g. a range between November 15th and February 15th across the whole time series. I would like one group for each date range, e.g. one group for 2010-11-15 to 2011-02-15, another for 2011-11-15 to 2012-02-15, and so on.
Any tips?
CodePudding user response:
One approach is to create a separate data.frame that transparently shows the groups assigned and their date ranges. You can then use that data.frame with fuzzy_inner_join to assign rows to groups, which lets you use group_by with these group numbers. Alternatives to consider would be data.table, cut, and/or findInterval. Let me know if this addresses your needs.
library(lubridate)
library(tidyverse)
library(fuzzyjoin)
df_group <- data.frame(
  group = seq.int(max(year(df$time)) - min(year(df$time)) + 1),
  start = seq.Date(as.Date(paste0(min(year(df$time)), "-11-15")), as.Date(paste0(max(year(df$time)), "-11-15")), "years"),
  end = seq.Date(as.Date(paste0(min(year(df$time)) + 1, "-02-15")), as.Date(paste0(max(year(df$time)) + 1, "-02-15")), "years")
)
fuzzy_inner_join(
df,
df_group,
by = c("time" = "start", "time" = "end"),
match_fun = list(`>=`, `<=`)
)
Output
time value group start end
<date> <dbl> <int> <date> <date>
1 2010-11-15 0.901 1 2010-11-15 2011-02-15
2 2010-11-16 0.991 1 2010-11-15 2011-02-15
3 2010-11-17 0.430 1 2010-11-15 2011-02-15
4 2010-11-18 0.394 1 2010-11-15 2011-02-15
5 2010-11-19 0.142 1 2010-11-15 2011-02-15
6 2010-11-20 0.280 1 2010-11-15 2011-02-15
7 2010-11-21 0.565 1 2010-11-15 2011-02-15
8 2010-11-22 0.935 1 2010-11-15 2011-02-15
9 2010-11-23 0.358 1 2010-11-15 2011-02-15
10 2010-11-24 0.842 1 2010-11-15 2011-02-15
# … with 941 more rows
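For the findInterval alternative mentioned above, here is a minimal sketch rather than a tested drop-in. It assumes the windows always run from November 15 to February 15 of the following year. The names years, starts, ends and grouped are made up for illustration, and set.seed is only added so the example data is reproducible.

```r
library(dplyr)

# Example data as in the question (set.seed added here for reproducibility)
set.seed(1)
df <- tibble(time = seq(as.Date("2010-01-01"), as.Date("2020-12-31"), by = "day"),
             value = runif(4018))
df <- df[-sample.int(nrow(df), 100), ]

# One window per starting year: Nov 15 of that year to Feb 15 of the next
years  <- 2010:2020
starts <- as.Date(paste0(years, "-11-15"))
ends   <- as.Date(paste0(years + 1, "-02-15"))

grouped <- df %>%
  # index of the latest window start on or before each date (0 if there is none)
  mutate(group = findInterval(as.numeric(time), as.numeric(starts))) %>%
  # keep only rows that also fall on or before the end of that window;
  # pmax() avoids indexing ends[0] for rows that precede the first window
  filter(group > 0, time <= ends[pmax(group, 1L)]) %>%
  group_by(group)

summarise(grouped, first_day = min(time), last_day = max(time), n = n())
```

This avoids the join entirely, which may matter for large data, but it only works because the windows do not overlap and the start dates are sorted.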
CodePudding user response:
Convert all dates to the same year, and check whether they fall within a fixed date range:
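library(dplyr)
# Map each date onto 2012 (a leap year, so Feb 29 stays valid) and return a Date
f <- function(x) as.Date(paste0("2012-", format(x, "%m-%d")))
# bet is TRUE from Feb 15 through Nov 15 (inclusive); FALSE marks the remaining Nov-to-Feb dates
df %>%
  group_by(bet = between(f(time), as.Date("2012-02-15"), as.Date("2012-11-15")))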
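Note that the grouping above only separates dates inside the window from dates outside it. If you want one group per window, as the question asks, one possible sketch is to label each row with the year in which its window started. The column names mmdd and season and the object df_season are hypothetical, and the window is assumed to be November 15 to February 15 inclusive.

```r
library(dplyr)

# Example data as in the question (set.seed added here for reproducibility)
set.seed(1)
df <- tibble(time = seq(as.Date("2010-01-01"), as.Date("2020-12-31"), by = "day"),
             value = runif(4018))
df <- df[-sample.int(nrow(df), 100), ]

df_season <- df %>%
  mutate(
    mmdd = as.integer(format(time, "%m%d")),   # e.g. 1115 for Nov 15, 215 for Feb 15
    yr   = as.integer(format(time, "%Y")),
    season = case_when(
      mmdd >= 1115 ~ yr,          # Nov 15 to Dec 31: window starts this year
      mmdd <= 215  ~ yr - 1L,     # Jan 1 to Feb 15: window started the previous year
      TRUE         ~ NA_integer_  # outside the window
    )
  ) %>%
  filter(!is.na(season)) %>%
  group_by(season)

summarise(df_season, first_day = min(time), last_day = max(time), n = n())
```

With this data, the rows from January 1 to February 15, 2010 end up in season 2009, which is a partial window; filter it out if you only want windows that start inside the series.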