Examining two date variables in R to determine if they contain a fixed date for every year

Time:11-01

Apologies if the wording of the title is confusing; it's difficult to describe exactly what I'm looking for. I've got some data with two date fields, let's call them start_date and end_date. I'm interested in knowing whether or not a particular observation "covered" June 30th of any given year (the data spans multiple years).

So, for instance, if start_date = "02-25-2021" and end_date = "01-12-2022", this observation would fit my criteria. By contrast, an observation with start_date = "07-02-2015" and end_date = "08-25-2015" would not, since June 30th does not occur in between the start and end date variables.

The issue is that because my data spans multiple years, it's not straightforward to me how I can identify cases which pass over a date regardless of year. How can I do this type of filtering without having to manually specify a range for every single year? Hope this is clear enough -- thanks for any assistance you can provide.

CodePudding user response:

You could use lubridate to add a column with your test date, and then test for it being %within% each interval. If you could share a sample of your data with dput() it might be easier to provide a clear example. Off the top of my head, I'd try something like:

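library(tidyverse)
library(lubridate)
# Assumes start_date and end_date are already Date columns (e.g. parsed with mdy())
df %>%
  mutate(june30 = make_date(year(start_date), 6, 30),
         # First June 30 on or after start_date; using only year(end_date)
         # would miss intervals that cross a year boundary
         test_date = if_else(june30 < start_date, june30 + years(1), june30),
         in_range = test_date %within% interval(start_date, end_date))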
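
As a quick check against the two examples from the question, here is a minimal base R sketch. The helper name covers_june30 and the sample data frame are made up for illustration, and it assumes both columns are Date vectors with no missing values and start_date <= end_date. It picks the first June 30 on or after start_date and tests whether that date is still on or before end_date, which also handles intervals that cross a year boundary.

```r
# Hypothetical helper: TRUE when [start_date, end_date] contains a June 30 of any year.
covers_june30 <- function(start_date, end_date) {
  start_year <- as.integer(format(start_date, "%Y"))
  june30_same_year <- as.Date(sprintf("%d-06-30", start_year), format = "%Y-%m-%d")
  # If June 30 of the start year is already past, the next candidate is one year later.
  target_year <- start_year + as.integer(june30_same_year < start_date)
  first_june30 <- as.Date(sprintf("%d-06-30", target_year), format = "%Y-%m-%d")
  first_june30 <= end_date
}

# The two observations described in the question (month-day-year strings).
df <- data.frame(
  start_date = as.Date(c("02-25-2021", "07-02-2015"), format = "%m-%d-%Y"),
  end_date   = as.Date(c("01-12-2022", "08-25-2015"), format = "%m-%d-%Y")
)
df$in_range <- covers_june30(df$start_date, df$end_date)
print(df)
```

The first row should come out as covered and the second as not covered, matching the expectation stated in the question.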

CodePudding user response:

Here is a solution with base R that can be used in a Tidyverse context. It is a bit hacky, but it does work.

The idea is to create a vector of all dates between start_date and end_date and then strip the year and the following dash from each one, leaving only the month and day. Because the year is removed after the sequence is built, the date of interest is matched once for every year in which it actually occurs in the range. From there, basic dplyr functions let you filter, count, and so on.

# Packages
lapply(c("dplyr","tibble","stringr","lubridate"), library, character.only = TRUE)

# Create vector of dates without year
prep_m_d_vec <- function(start_date,
                         end_date){
  out <- seq.Date(from = start_date,
                  to = end_date,
                  by = 1) %>% 
    str_remove_all(pattern = "^[1-3]{1}[0-9]{3}-")
  
  return(out)
}

# Optional: remove the year from the date of choice
rm_year <- function(d){
  out <- format(d,
                format="%m-%d")
  return(out)
}

# Does not include date_of_choice
date_vec <- prep_m_d_vec(start_date = dmy("30-03-2021"),
                         end_date = dmy("30-05-2021"))

# Set date
date_of_choice <- rm_year(dmy("30-06-2022"))

# Filter rows
tibble(date = date_vec) %>% 
  filter(date == date_of_choice)

# date_of_choice is included 40x
date_vec <- prep_m_d_vec(start_date = dmy("30-03-2000"),
                         end_date = dmy("30-03-2040"))

# Filter rows
tibble(date = date_vec) %>% 
  filter(date == date_of_choice)

# Check if present and count
tibble(date = date_vec) %>% 
  summarise(n_date_of_choice = sum(date %in% date_of_choice),
            date_of_choice_present = (date_of_choice %in% date))
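
To apply the same sequence-based idea to a data frame with one start_date and one end_date per row, something like the following base R sketch might work. The helper count_month_day and the sample rows are hypothetical, and it assumes Date columns with start_date <= end_date in every row (seq() errors otherwise). Expanding every interval into daily dates can get slow on large data, so treat this as an illustration rather than a tuned solution.

```r
# Hypothetical helper: how many times a month-day ("%m-%d") occurs in [start_date, end_date].
count_month_day <- function(start_date, end_date, month_day = "06-30") {
  days <- seq(from = start_date, to = end_date, by = "day")
  sum(format(days, "%m-%d") == month_day)
}

# Made-up sample rows: the two cases from the question plus a 40-year span.
df <- data.frame(
  start_date = as.Date(c("2021-02-25", "2015-07-02", "2000-03-30")),
  end_date   = as.Date(c("2022-01-12", "2015-08-25", "2040-03-30"))
)

# Loop over row indices so each element keeps its Date class.
df$n_june30 <- vapply(
  seq_len(nrow(df)),
  function(i) count_month_day(df$start_date[i], df$end_date[i]),
  integer(1)
)
df$covers_june30 <- df$n_june30 > 0
print(df)
```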