Home > front end >  Aggregate daily data into weeks
Aggregate daily data into weeks

Time:11-09

I have data resembling the following structure, where the when variable denotes the day of measurement:

## Generate data.
set.seed(1986)

n <- 1000
y <- rnorm(n)

when <- as.POSIXct(strftime(seq(as.POSIXct("2021-11-01 23:00:00 UTC", tryFormats = "%Y-%m-%d"), 
                 as.POSIXct("2022-11-01 23:00:00 UTC", tryFormats = "%Y-%m-%d"), 
                 length.out = n), format = "%Y-%m-%d"))

dta <- data.frame(y, when)
head(dta)
#>             y       when
#> 1 -0.04625141 2021-11-01
#> 2  0.28000082 2021-11-01
#> 3  0.25317063 2021-11-01
#> 4 -0.96411077 2021-11-02
#> 5  0.49222664 2021-11-02
#> 6 -0.69874551 2021-11-02

I need to compute averages of y over time. For instance, the following computes daily averages:

## Compute daily averages of y.
library(dplyr)

daily_avg <- dta %>% 
  group_by(when) %>%
  summarise(daily_mean = mean(y)) %>% 
  ungroup()

daily_avg
#> # A tibble: 366 × 2
#>    when                daily_mean
#>    <dttm>                   <dbl>
#>  1 2021-11-01 00:00:00     0.162 
#>  2 2021-11-02 00:00:00    -0.390 
#>  3 2021-11-03 00:00:00    -0.485 
#>  4 2021-11-04 00:00:00    -0.152 
#>  5 2021-11-05 00:00:00     0.425 
#>  6 2021-11-06 00:00:00     0.726 
#>  7 2021-11-07 00:00:00     0.855 
#>  8 2021-11-08 00:00:00     0.0608
#>  9 2021-11-09 00:00:00    -0.995 
#> 10 2021-11-10 00:00:00     0.395 
#> # … with 356 more rows

I am having a hard time computing weekly averages. Here is what I have tried so far:

## Fail - compute weekly averages of y.
library(lubridate)

dta$week <- week(dta$when) # This is wrong.
dta[165: 171, ]
#>              y       when week
#> 165  0.9758333 2021-12-30   52
#> 166 -0.8630091 2021-12-31   53
#> 167  0.3054031 2021-12-31   53
#> 168  1.2814421 2022-01-01    1
#> 169  0.1025440 2022-01-01    1
#> 170  1.3665411 2022-01-01    1
#> 171 -0.5373058 2022-01-02    1

Using the week function from the lubridate package ignores the fact that my data spawn across years. So, if I were to use a code similar to the one I used for the daily averages, I would aggregate observations belonging to different years (but to the same week number). How can I solve this?

CodePudding user response:

You can use %V (from ?strptime) for weeks, combining it with the year.

dta %>%
  group_by(week = format(when, format = "%Y-%V")) %>%
  summarize(daily_mean = mean(y)) %>%
  ungroup()
# # A tibble: 54 x 2
#    week    daily_mean
#    <chr>        <dbl>
#  1 2021-44     0.179 
#  2 2021-45     0.0477
#  3 2021-46     0.0340
#  4 2021-47     0.356 
#  5 2021-48     0.0544
#  6 2021-49    -0.0948
#  7 2021-50    -0.0419
#  8 2021-51     0.209 
#  9 2021-52     0.251 
# 10 2022-01    -0.197 
# # ... with 44 more rows

There are different variants of "week", depending on your preference.

%V
    Week of the year as decimal number (01–53) as defined in ISO 8601.
 If the week (starting on Monday) containing 1 January has four or more
 days in the new year, then it is considered week 1. Otherwise, it is
 the last week of the previous year, and the next week is week 1.
 (Accepted but ignored on input.)

%W
    Week of the year as decimal number (00–53) using Monday as the first
 day of week (and typically with the first Monday of the year as day 1
 of week 1). The UK convention.

CodePudding user response:

You can extract year and week from the dates and group by both:

dta %>% 
  mutate(year = year(when),
         week = week(when)) %>%
  group_by(year, week) %>%
  summarise(y_mean = mean(y)) %>%
  ungroup()

#  # A tibble: 54 x 3
#  # Groups:   year, week [54]
#     year  week  y_mean
#    <dbl> <dbl>   <dbl>
#  1  2021    44 -0.222 
#  2  2021    45  0.234 
#  3  2021    46  0.0953
#  4  2021    47  0.206 
#  5  2021    48  0.192 
#  6  2021    49 -0.0831
#  7  2021    50  0.0282
#  8  2021    51  0.196 
#  9  2021    52  0.132 
# 10  2021    53 -0.279 
#  # ... with 44 more rows
  • Related