How do I group my date variable by a date range between years in R?

Time:10-29

I have a data frame with daily observations from several years. Some days are missing from the dataset:

library(tibble)

df <- tibble(time = seq(as.Date("2010/1/1"), as.Date("2020/12/31"), "days"),
             value = runif(4018))

# simulate missing days by dropping 100 random rows
df <- df[-sample.int(nrow(df), 100),]

I am trying to use dplyr::group_by to group my data frame by the same date range in every year. However, the range starts in one year and ends in the next, e.g. November 15th to February 15th across the whole time series. I would like one group per date range, e.g. one group for 2010-11-15 to 2011-02-15, another for 2011-11-15 to 2012-02-15, and so on.

Any tips?

CodePudding user response:

One approach is to create a separate data.frame that transparently shows the groups assigned and date ranges. Then, you can use the data.frame with fuzzy_inner_join to assign rows to groups, allowing you to use group_by with these group numbers. Alternatives to consider would be using data.table, cut, and/or findInterval. Let me know if this will address your needs.

library(lubridate)
library(tidyverse)
library(fuzzyjoin)

df_group <- data.frame(
  group = seq.int(max(year(df$time)) - min(year(df$time)) + 1),
  start = seq.Date(as.Date(paste0(min(year(df$time)), "-11-15")), as.Date(paste0(max(year(df$time)), "-11-15")), "years"),
  end = seq.Date(as.Date(paste0(min(year(df$time)) + 1, "-02-15")), as.Date(paste0(max(year(df$time)) + 1, "-02-15")), "years")
)

fuzzy_inner_join(
  df,
  df_group,
  by = c("time" = "start", "time" = "end"),
  match_fun = list(`>=`, `<=`)
)

Output

   time       value group start      end       
   <date>     <dbl> <int> <date>     <date>    
 1 2010-11-15 0.901     1 2010-11-15 2011-02-15
 2 2010-11-16 0.991     1 2010-11-15 2011-02-15
 3 2010-11-17 0.430     1 2010-11-15 2011-02-15
 4 2010-11-18 0.394     1 2010-11-15 2011-02-15
 5 2010-11-19 0.142     1 2010-11-15 2011-02-15
 6 2010-11-20 0.280     1 2010-11-15 2011-02-15
 7 2010-11-21 0.565     1 2010-11-15 2011-02-15
 8 2010-11-22 0.935     1 2010-11-15 2011-02-15
 9 2010-11-23 0.358     1 2010-11-15 2011-02-15
10 2010-11-24 0.842     1 2010-11-15 2011-02-15
# … with 941 more rows
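
The findInterval alternative mentioned above could look roughly like the following. This is only a sketch in base R, not part of the original answer: the toy `df`, the hard-coded years 2010 to 2020, and the names `starts`, `ends`, `idx`, `in_window`, and `winter` are all assumptions made for illustration.

```r
# Toy data: one row per day, standing in for the question's df (assumption)
df <- data.frame(time = seq(as.Date("2010-01-01"), as.Date("2020-12-31"), by = "day"))

# One window per year: November 15th to February 15th of the following year
years  <- 2010:2020
starts <- as.Date(paste0(years, "-11-15"))
ends   <- as.Date(paste0(years + 1, "-02-15"))

# findInterval() gives the index of the last start date <= each time (0 if none)
idx <- findInterval(as.numeric(df$time), as.numeric(starts))

# A row belongs to a window only if it also falls on or before that window's end
in_window <- idx > 0 & df$time <= ends[pmax(idx, 1)]
df$group  <- ifelse(in_window, idx, NA_integer_)

# Rows outside every window have group NA and are dropped here
winter <- df[!is.na(df$group), ]
```

From here, `winter` can be passed to dplyr::group_by(group) in the same way as the joined result above.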

CodePudding user response:

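Convert every date to the same year, then check whether it falls inside the desired range. Because the range wraps around the new year, the check below tests for dates outside February 16th to November 14th. Note that this yields a single TRUE/FALSE split rather than one group per season:

library(dplyr)

# Map every date onto the leap year 2012 so that only month and day matter
f <- function(x) as.Date(paste0("2012-", format(x, "%m-%d")))

df %>% 
  # bet is TRUE from November 15th through February 15th of any year
  group_by(bet = !between(f(time), as.Date("2012-02-16"), as.Date("2012-11-14")))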
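
To get one group per season instead of a single TRUE/FALSE flag, one option is to label each row with the year in which its window started. The sketch below uses base R only; the toy `df` and the names `md`, `yr`, `season`, and `winter` are assumptions, not part of the original answer.

```r
# Toy data: one row per day, standing in for the question's df (assumption)
df <- data.frame(time = seq(as.Date("2010-01-01"), as.Date("2020-12-31"), by = "day"))

md <- as.integer(format(df$time, "%m%d"))  # month-day as a number, e.g. 1115 for Nov 15
yr <- as.integer(format(df$time, "%Y"))

# Label each row with the year in which its window started
df$season <- ifelse(md >= 1115, yr,          # Nov 15 to Dec 31: window starts this year
             ifelse(md <= 215,  yr - 1L,     # Jan 1 to Feb 15: window started last year
                    NA_integer_))            # outside the window

# Keep only rows inside a window
winter <- df[!is.na(df$season), ]
```

Grouping `winter` by `season` then gives one group per window. Under these assumptions, days from January 1st to February 15th of the first year are labelled with the previous year (2009 here), a partial window that may need to be filtered out.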