How to separate daily data into weekly or monthly data in R

I have daily discharge data from a local stream. I want to aggregate it (sum and mean) into weekly or monthly chunks so that I can plot discharge_m3d (discharge) and Qs_sum (depletion) over weekly and monthly timeframes. Does anyone know how I can do this? I attached a figure showing what my data frame looks like.

CodePudding user response：

One way to approach this is with the lubridate and dplyr packages from the tidyverse. I assume here that your dates are in year-month-day format, which they appear to be, and that you have only one calendar year of data, or at least no months or weeks that repeat across two years.


library(dplyr)
library(lubridate)

monthly_discharge <- discharge %>%
  filter(variable == "discharge") %>% # first select just the rows that represent discharge (not clear if that's necessary here)
  mutate(date = ymd(date), # parse the date column into a Date object
         month = month(date), # extract the numbered month from the date
         week = week(date)) %>% # extract the numbered week in a year from the date
  group_by(month, stream) %>% # group your data by month and stream
  summarize(discharge_summary = mean(discharge_m3d)) # summarize your data so that each month has a single row with a single (mean) discharge value
# you can include multiple summary variables within the summarize function

This should produce a data frame with one row per month for each stream and a summary value for discharge. To summarize by week instead, replace month with week in group_by().
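If the data does span more than one year, the single-year assumption above can be relaxed by grouping on the year as well. The following is only a minimal sketch on a hypothetical toy data frame (the `toy` object and its values are made up for illustration and are not the asker's data):

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: January appears in two different years
toy <- tibble(
  date = as.Date(c("2014-01-10", "2014-01-20", "2015-01-10", "2015-02-10")),
  discharge_m3d = c(10, 20, 30, 40)
)

monthly_by_year <- toy %>%
  mutate(year = year(date),    # calendar year
         month = month(date)) %>% # numbered month within the year
  group_by(year, month) %>%    # year + month keeps the two Januaries apart
  summarize(discharge_mean = mean(discharge_m3d), .groups = "drop")

monthly_by_year
```

Here the result has three rows, because January 2014 and January 2015 are kept as separate groups instead of being averaged together.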

CodePudding user response：

Make use of the functions week() and month() from the package lubridate to get the corresponding values for your date column. Afterwards we can find the means per week (or month):

library(dplyr)
library(lubridate)

data <- data %>% mutate(Week = week(date), Month = month(date)) %>% group_by(Week, Month) %>% 
  mutate(mean_Week_Qs = mean(Qs_sum)) %>% ungroup()

> head(data)
# A tibble: 6 x 6
  date       discharge_m3d  Qs_sum  Week Month mean_Week_Qs
  <date>             <dbl>   <dbl> <int> <int>        <dbl>
1 2014-03-01           797 0           9     3       0.0409
2 2014-03-02           826 0.00833     9     3       0.0409
3 2014-03-03          3760 0.114       9     3       0.0409
4 2014-03-04          4330 0.292      10     3       0.785 
5 2014-03-05          2600 0.480      10     3       0.785 
6 2014-03-06          4620 0.656      10     3       0.785

Now we can plot, for example, Qs_sum per week and add the mean as a red dot:

library(ggplot2)

ggplot(data, aes(factor(Week), Qs_sum)) +
  geom_point(size = 2) +
  geom_point(aes(factor(Week), mean_Week_Qs), color = "red", size = 5, alpha = 0.6)

Data

data <- structure(list(date = structure(16130:16140, class = "Date"), 
discharge_m3d = c(797, 826, 3760, 4330, 2600, 4620, 2510, 
1620, 2270, 5650, 2530), Qs_sum = c(0, 0.00833424, 0.114224781, 
0.291812109, 0.479780482, 0.656321971, 0.816140731, 0.959334606, 
1.087579095, 1.20284046, 1.30695595), Week = c(9L, 9L, 9L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L), Month = c(3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L)), row.names = c(NA, -11L
 ), class = c("tbl_df", "tbl", "data.frame"))
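If one row per week is wanted rather than the weekly mean repeated on every daily row, summarise() can be used in place of mutate(). This is a rough sketch that rebuilds only the date, Qs_sum and Week columns of the sample data above; it reuses the Week values as stored in that data instead of recomputing them, and the names `sample_data`, `weekly`, `n_days`, `sum_Qs` and `mean_Qs` are chosen here for illustration:

```r
library(dplyr)

# Subset of the sample data above, with the Week column as stored there
sample_data <- tibble(
  date = as.Date(16130:16140, origin = "1970-01-01"),
  Qs_sum = c(0, 0.00833424, 0.114224781, 0.291812109, 0.479780482,
             0.656321971, 0.816140731, 0.959334606, 1.087579095,
             1.20284046, 1.30695595),
  Week = c(9L, 9L, 9L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L)
)

weekly <- sample_data %>%
  group_by(Week) %>%
  summarise(
    n_days = n(),            # number of daily rows in the week
    sum_Qs = sum(Qs_sum),    # weekly total
    mean_Qs = mean(Qs_sum),  # weekly mean
    .groups = "drop"
  )

weekly
```

The n_days column makes partial weeks visible (here the first and last weeks are incomplete), which is worth checking before comparing weekly sums.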

CodePudding user response：

People often use floor_date() from lubridate for these purposes. You can floor to a unit of month or week and then group by the resulting date column. Then you can use summarize() to compute the monthly or weekly sums/averages. From there you can use your plotting library of choice to visualize the result (like ggplot2, not shown).

This works even if you have more than one year of data (i.e. where the month or week number might repeat).
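To illustrate why flooring avoids the repeated-month problem, compare month() and floor_date() on two dates exactly one year apart (the two dates are arbitrary examples picked for this sketch):

```r
library(lubridate)

dates <- as.Date(c("2014-03-15", "2015-03-15"))

# month() drops the year, so both dates would land in the same group
month(dates)
#> [1] 3 3

# floor_date() keeps the year, so each month of each year is its own group
floored <- floor_date(dates, unit = "month")
floored
#> [1] "2014-03-01" "2015-03-01"
```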

library(dplyr)
library(lubridate)

set.seed(123)

df <- tibble(
  date = seq(
    from = as.Date("2014-03-01"), 
    to = as.Date("2016-12-31"),
    by = 1
  ),
  Qs_sum = runif(length(date)),
  discharge_m3d = runif(length(date))
)

df
#> # A tibble: 1,037 × 3
#>    date       Qs_sum discharge_m3d
#>    <date>      <dbl>         <dbl>
#>  1 2014-03-01 0.288         0.560 
#>  2 2014-03-02 0.788         0.427 
#>  3 2014-03-03 0.409         0.448 
#>  4 2014-03-04 0.883         0.833 
#>  5 2014-03-05 0.940         0.720 
#>  6 2014-03-06 0.0456        0.457 
#>  7 2014-03-07 0.528         0.521 
#>  8 2014-03-08 0.892         0.242 
#>  9 2014-03-09 0.551         0.0759
#> 10 2014-03-10 0.457         0.391 
#> # … with 1,027 more rows

df %>%
  mutate(date = floor_date(date, unit = "month")) %>%
  group_by(date) %>%
  summarise(
    n = n(),
    qs_total = sum(Qs_sum),
    qs_average = mean(Qs_sum),
    discharge_total = sum(discharge_m3d),
    discharge_average = mean(discharge_m3d),
    .groups = "drop"
  )
#> # A tibble: 34 × 6
#>    date           n qs_total qs_average discharge_total discharge_average
#>    <date>     <int>    <dbl>      <dbl>           <dbl>             <dbl>
#>  1 2014-03-01    31     18.1      0.585            15.3             0.494
#>  2 2014-04-01    30     12.9      0.429            15.2             0.507
#>  3 2014-05-01    31     15.5      0.500            15.3             0.493
#>  4 2014-06-01    30     15.8      0.525            16.3             0.542
#>  5 2014-07-01    31     15.1      0.487            13.9             0.449
#>  6 2014-08-01    31     14.8      0.478            16.2             0.522
#>  7 2014-09-01    30     15.3      0.511            13.1             0.436
#>  8 2014-10-01    31     15.6      0.504            14.7             0.475
#>  9 2014-11-01    30     16.0      0.532            15.1             0.502
#> 10 2014-12-01    31     14.2      0.458            15.5             0.502
#> # … with 24 more rows

# Set the "start of the week" to Sunday.
# So groups are made of data from [Sunday -> Saturday]
sunday <- 7L

df %>%
  mutate(date = floor_date(date, unit = "week", week_start = sunday)) %>%
  group_by(date) %>%
  summarise(
    n = n(),
    qs_total = sum(Qs_sum),
    qs_average = mean(Qs_sum),
    discharge_total = sum(discharge_m3d),
    discharge_average = mean(discharge_m3d),
    .groups = "drop"
  )
#> # A tibble: 149 × 6
#>    date           n qs_total qs_average discharge_total discharge_average
#>    <date>     <int>    <dbl>      <dbl>           <dbl>             <dbl>
#>  1 2014-02-23     1    0.288      0.288           0.560             0.560
#>  2 2014-03-02     7    4.49       0.641           3.65              0.521
#>  3 2014-03-09     7    3.77       0.539           3.88              0.554
#>  4 2014-03-16     7    4.05       0.579           3.45              0.493
#>  5 2014-03-23     7    4.43       0.632           3.08              0.440
#>  6 2014-03-30     7    4.00       0.572           4.74              0.677
#>  7 2014-04-06     7    2.50       0.357           3.15              0.449
#>  8 2014-04-13     7    2.48       0.355           2.44              0.349
#>  9 2014-04-20     7    2.30       0.329           2.45              0.349
#> 10 2014-04-27     7    3.44       0.492           4.40              0.629
#> # … with 139 more rows

Created on 2022-04-13 by the reprex package (v2.0.1)
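The floor_date() answer leaves the plotting step out. One possible way to plot the monthly result with ggplot2 is sketched below. It regenerates the same simulated data so that it runs on its own; the names `monthly`, `month_start` and `p`, and the choice of a line plot, are assumptions made for this illustration rather than part of the original answer:

```r
library(dplyr)
library(lubridate)
library(ggplot2)

set.seed(123)

# Same simulated daily data as in the answer above
df <- tibble(
  date = seq(as.Date("2014-03-01"), as.Date("2016-12-31"), by = 1),
  Qs_sum = runif(length(date)),
  discharge_m3d = runif(length(date))
)

# One row per calendar month, keyed by the first day of that month
monthly <- df %>%
  mutate(month_start = floor_date(date, unit = "month")) %>%
  group_by(month_start) %>%
  summarise(
    qs_total = sum(Qs_sum),
    discharge_average = mean(discharge_m3d),
    .groups = "drop"
  )

# Because month_start is a Date, ggplot2 places it on a continuous time axis
p <- ggplot(monthly, aes(x = month_start, y = discharge_average)) +
  geom_line() +
  geom_point() +
  labs(x = "Month", y = "Mean discharge_m3d")

p
```

For the weekly version, swapping unit = "month" for unit = "week" (optionally with week_start) should be the only change needed in the summary step.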