How to aggregate dates by "Week starting July 01" "Week starting July 08" etc ov-CodePudding

Data

I'm working with the following data from 2019 to 2022

library(tidyverse)
library(lubridate)

set.seed(123)

df <- 
  tibble(
    date = sample(
      seq(as.Date('2019/01/01'), as.Date('2022/09/01'), by = "day"), 
      size = 100, 
      replace = T))

Desired output

I would like to create a new variable that summarises each week into the same month-day range across 2019-2022. The result should be the week staring the same month-day regardless of what weekday or week of the year that date happens to be each year.

For example

2019-07-03 should become "July 01" (Short for "Week starting July 01)
2021-07-03 should become "July 01" (New year, same month-day)
2021-07-04 should become "July 01"
2021-07-05 should become "July 01"
2021-07-06 should become "July 01"
2021-07-07 should become "July 01"
2021-07-08 should become "July 08" (New week has started, increment the month-day by 7)
2021-07-14 should become "July 08"
2021-07-15 should become "July 15" (New week has started, increment the month-day by 7)
2021-07-16 should become "July 15"
2022-07-15 should become "July 15" (New year, same month-day)
Etc.

Failed attempt

I've tried using cat.Date() and lubridate::floor_date() but these return the floor date for that particular year. For example

df %>% 
  mutate(
    date_month_week = lubridate::floor_date(date, unit = "week", week_start = 7),
    date_month_week = format(as.Date(date_month_week), "%m-%d")) %>% 
  arrange(desc(date_month_week)) %>% 
  distinct(date_month_week)

Yields

# A tibble: 77 × 1
   date_month_week
   <chr>          
 1 12-20          
 2 12-15          
 3 12-12          
 4 12-08          
 5 12-06          
 6 12-01          
 7 11-28          
 8 11-24          
 9 11-22          
10 11-21

Many thanks for your help.

CodePudding user response：

You could group_by week numbers format = "%V". The thing is that weeks starts at different days depending on the year, so you could summarise by using the first day of a week for a given year instead of week numbers (but you'll have to group by week numbers):

df %>% 
  mutate(date_month_week = floor_date(date, unit = "week", week_start = 7),
         week_number = format(date_month_week, "%V"),
         string = format(date_month_week, "Week of %b %d")) %>% 
  group_by(week_number) %>% 
  summarise(string = min(string),
            date_month_week = toString(date_month_week)) %>% 
  ungroup() %>% 
  select(-week_number)

# A tibble: 46 × 2
   string         date_month_week                               
   <chr>          <chr>                                         
 1 Week of Jan 05 2020-01-05, 2021-01-10, 2020-01-05, 2022-01-09
 2 Week of Jan 13 2022-01-16, 2021-01-17, 2019-01-13, 2022-01-16
 3 Week of Jan 20 2019-01-20, 2021-01-24                        
 4 Week of Feb 03 2021-02-07, 2019-02-03, 2022-02-06            
 5 Week of Feb 09 2022-02-13, 2019-02-10, 2022-02-13, 2020-02-09
 6 Week of Feb 16 2020-02-16                                    
 7 Week of Feb 27 2022-02-27                                    
 8 Week of Mar 01 2022-03-06, 2020-03-01, 2019-03-03            
 9 Week of Mar 10 2019-03-10, 2019-03-10                        
10 Week of Mar 27 2022-03-27, 2022-03-27                        
# … with 36 more rows

CodePudding user response：

I've developed a kludge solution, but would love hear a more elegant alternative.

We can create a separate df that matches every month-day to a corresponding "Week of %b-%d"

ls_groups <- 
  tibble(
    date = seq(
      as.Date('2018/01/01'), # 2018 starts on a Monday
      as.Date('2018/12/31'), # Dec 31 is week 53, need to fix later
      by = "day"),
    month_day = format(date, "%Y-%m-%d"), 
    group = sort(c(rep(seq(1,52), 7), 53))) %>% # Add week 53 manually
  group_by(group) %>% 
  mutate(
    group = min(month_day), # Take the minimum date so it's always starting on Monday
    group = format(as.Date(group), "Week start %b-%d"), # Convert to text format
    month_day = format(date, "%m-%d")) %>%   # Create a join key
  select(-date)

Then we join it into the original data frame

df %>% 
  mutate(month_day = format(date, "%m-%d")) %>% # Create a join key
  left_join(
    x = ., 
    y = ls_groups, 
    by = "month_day") %>% 
  select(-month_day) %>% 
  arrange(group) # Arrange to check that diff. years have same month-day

Which gives us

# A tibble: 100 × 2
   date       group            
   <date>     <chr>            
 1 2020-04-07 Week start Apr-02
 2 2021-04-19 Week start Apr-16
 3 2022-04-16 Week start Apr-16
 4 2021-04-23 Week start Apr-23
 5 2020-04-24 Week start Apr-23
 6 2020-05-04 Week start Apr-30
 7 2019-05-01 Week start Apr-30
 8 2021-08-10 Week start Aug-06
 9 2022-08-06 Week start Aug-06
10 2020-08-10 Week start Aug-06
# … with 90 more rows
# ℹ Use `print(n = ...)` to see more rows