I have some time series temperature measurements taken at half-hour intervals. I want to calculate an average cumulative growing-degree-days-style metric. I am using the variable "datetime", but leaving out actual datetimes for simplicity's sake. Also, don't worry about whether this is actually the right calculation for growing degree days; it isn't. The following toy data emulate the challenge.
library(data.table)
# generate some approximate data
dt <- data.table(datetime = seq(1, 10.9, by = 0.1),
                 date = rep(1:10, each = 10),
                 T = floor(runif(100, 1, 10)))
Now I calculate a 'daily' average:
dt[,T_mean_daily:=mean(T), by=date]
Now what I want to do is calculate the cumulative sum of T_mean_daily and display it in a new column, repeated for every datetime on a date just like T_mean_daily. I am having some trouble visualizing that with cumsum. The final output would look like:
datetime date T T_mean_daily T_sum
1: 1.0 1 4 5.6 5.6
2: 1.1 1 6 5.6 5.6
3: 1.2 1 9 5.6 5.6
4: 1.3 1 7 5.6 5.6
5: 1.4 1 3 5.6 5.6
6: 1.5 1 8 5.6 5.6
7: 1.6 1 3 5.6 5.6
8: 1.7 1 7 5.6 5.6
9: 1.8 1 8 5.6 5.6
10: 1.9 1 1 5.6 5.6
11: 2.0 2 2 3.6 9.2
12: 2.1 2 5 3.6 9.2
13: 2.2 2 4 3.6 9.2
14: 2.3 2 1 3.6 9.2
15: 2.4 2 9 3.6 9.2
16: 2.5 2 5 3.6 9.2
17: 2.6 2 2 3.6 9.2
18: 2.7 2 5 3.6 9.2
19: 2.8 2 2 3.6 9.2
20: 2.9 2 1 3.6 9.2
21: 3.0 3 1 5.9 15.1
22: 3.1 3 4 5.9 15.1
Looking for the data.table solution. This is not the cumsum by group; I am looking for the cumsum of each group's first row (or unique value per group) across all groups.
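For contrast, a plain grouped cumsum, sketched below with a hypothetical column name, just restarts the accumulation within each date, which is not what I am after:
# not what I want: the running total resets at the start of every date
dt[, T_cumsum_within_day := cumsum(T_mean_daily), by = date]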
Answer:
Here is another data.table approach:
setnafill(dt[!duplicated(date), T_sum := cumsum(T_mean_daily)], "locf", cols = "T_sum")
explanation
Since we only need to use the first row of each date, we can select these rows using !duplicated(date) in the i of data.table. In j, we can then calculate the cumulative sum of T_mean_daily.
Now we are left with a column that has the correct cumulative sum on each first date-row and NAs in between, so we use setnafill to locf-fill the value over the NA rows in the T_sum column.
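For readability, the same pipeline can be written as two explicit steps (a sketch operating on the same dt, producing the same result):
# step 1: write the cumulative sum only to the first row of each date;
# every other row of T_sum is left as NA for now
dt[!duplicated(date), T_sum := cumsum(T_mean_daily)]
# step 2: carry the last non-NA value forward over the remaining rows
setnafill(dt, type = "locf", cols = "T_sum")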
benchmarks
set.seed(42)
dt <- data.table(datetime = seq(1, 10.9, by = 0.1),
                 date = rep(1:10, each = 10),
                 T = floor(runif(100, 1, 10)))
dt[,T_mean_daily:=mean(T), by=date]
microbenchmark::microbenchmark(
  r2evans = {
    test <- copy(dt)
    test[test[, .SD[1, ], by = date][, T_mean_daily := cumsum(T_mean_daily)],
         T_sum := i.T_mean_daily, on = .(date)]
  },
  wimpel = {
    test <- copy(dt)
    setnafill(test[!duplicated(date), T_sum := cumsum(T_mean_daily)], "locf", cols = "T_sum")
  }
)
Unit: microseconds
expr min lq mean median uq max neval cld
r2evans 3287.9 3488.20 3662.044 3560.65 3758.85 4833.1 100 b
wimpel 425.4 437.45 465.313 451.75 485.35 608.3 100 a
Answer:
If we do a temporary subset to just the first row of each date, we can then use cumsum and join it back into the original data.
set.seed(42)
dt <- data.table(datetime = seq(1, 10.9, by = 0.1),
                 date = rep(1:10, each = 10),
                 T = floor(runif(100, 1, 10)))
dt[,T_mean_daily:=mean(T), by=date]
dt
# datetime date T T_mean_daily
# <num> <int> <num> <num>
# 1: 1.0 1 9 6.2
# 2: 1.1 1 9 6.2
# 3: 1.2 1 3 6.2
# 4: 1.3 1 8 6.2
# 5: 1.4 1 6 6.2
# 6: 1.5 1 5 6.2
# 7: 1.6 1 7 6.2
# 8: 1.7 1 2 6.2
# 9: 1.8 1 6 6.2
# 10: 1.9 1 7 6.2
# ---
# 91: 10.0 10 7 5.6
# 92: 10.1 10 1 5.6
# 93: 10.2 10 2 5.6
# 94: 10.3 10 9 5.6
# 95: 10.4 10 9 5.6
# 96: 10.5 10 7 5.6
# 97: 10.6 10 3 5.6
# 98: 10.7 10 5 5.6
# 99: 10.8 10 7 5.6
# 100: 10.9 10 6 5.6
The aggregation is simply:
dt[, .SD[1,], by = date][, T_mean_daily := cumsum(T_mean_daily)][]
# date datetime T T_mean_daily
# <int> <num> <num> <num>
# 1: 1 1 9 6.2
# 2: 2 2 5 12.2
# 3: 3 3 9 18.3
# 4: 4 4 7 23.6
# 5: 5 5 4 29.6
# 6: 6 6 4 34.1
# 7: 7 7 7 40.1
# 8: 8 8 1 43.1
# 9: 9 9 6 47.0
# 10: 10 10 7 52.6
which we can join back onto the original data (the i. update-join syntax is annotated in a sketch after the output below):
dt[ dt[, .SD[1,], by = date][, T_mean_daily := cumsum(T_mean_daily)], T_sum := i.T_mean_daily, on = .(date)]
dt
# datetime date T T_mean_daily T_sum
# <num> <int> <num> <num> <num>
# 1: 1.0 1 9 6.2 6.2
# 2: 1.1 1 9 6.2 6.2
# 3: 1.2 1 3 6.2 6.2
# 4: 1.3 1 8 6.2 6.2
# 5: 1.4 1 6 6.2 6.2
# 6: 1.5 1 5 6.2 6.2
# 7: 1.6 1 7 6.2 6.2
# 8: 1.7 1 2 6.2 6.2
# 9: 1.8 1 6 6.2 6.2
# 10: 1.9 1 7 6.2 6.2
# ---
# 91: 10.0 10 7 5.6 52.6
# 92: 10.1 10 1 5.6 52.6
# 93: 10.2 10 2 5.6 52.6
# 94: 10.3 10 9 5.6 52.6
# 95: 10.4 10 9 5.6 52.6
# 96: 10.5 10 7 5.6 52.6
# 97: 10.6 10 3 5.6 52.6
# 98: 10.7 10 5 5.6 52.6
# 99: 10.8 10 7 5.6 52.6
# 100: 10.9 10 6 5.6 52.6
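As a side note on the update-join used above, the i. prefix refers to columns of the table supplied in i. A commented restatement, with the intermediate table given a hypothetical name agg:
# per-date table holding the running total of the daily means (same as above)
agg <- dt[, .SD[1, ], by = date][, T_mean_daily := cumsum(T_mean_daily)]
# update-join: for every row of dt whose date matches a row of agg,
# copy that row's (cumulative) T_mean_daily into dt's T_sum column
dt[agg, T_sum := i.T_mean_daily, on = .(date)]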