I have some time series temperature measurements taken at half-hour intervals. I want to calculate an average cumulative growing-degree-days-style metric. I am using the variable "datetime", but leaving out actual datetimes for simplicity's sake. Also, don't worry about whether this is actually the right calculation for growing degree days; it isn't. The following toy data emulate the challenge.
library(data.table)
# generate some approximate data
dt <- data.table(datetime = seq(1, 10.9, by = 0.1),
                 date = rep(1:10, each = 10),
                 T = floor(runif(100, 1, 10)))
Now I calculate a 'daily' average:
dt[,T_mean_daily:=mean(T), by=date]
Now what I want to do is calculate the cumulative sum of T_mean_daily and display it in a new column, repeated for every datetime on a date just like T_mean_daily. I am having some trouble visualizing that with cumsum. The final output would look like:
datetime date T T_mean_daily T_sum
1: 1.0 1 4 5.6 5.6
2: 1.1 1 6 5.6 5.6
3: 1.2 1 9 5.6 5.6
4: 1.3 1 7 5.6 5.6
5: 1.4 1 3 5.6 5.6
6: 1.5 1 8 5.6 5.6
7: 1.6 1 3 5.6 5.6
8: 1.7 1 7 5.6 5.6
9: 1.8 1 8 5.6 5.6
10: 1.9 1 1 5.6 5.6
11: 2.0 2 2 3.6 9.2
12: 2.1 2 5 3.6 9.2
13: 2.2 2 4 3.6 9.2
14: 2.3 2 1 3.6 9.2
15: 2.4 2 9 3.6 9.2
16: 2.5 2 5 3.6 9.2
17: 2.6 2 2 3.6 9.2
18: 2.7 2 5 3.6 9.2
19: 2.8 2 2 3.6 9.2
20: 2.9 2 1 3.6 9.2
21: 3.0 3 1 5.9 15.1
22: 3.1 3 4 5.9 15.1
Looking for the data.table solution. This is not the cumsum by group; I am looking for the cumsum of each group's first row (or unique value per group) across all groups.
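For contrast, a plain grouped cumsum, sketched below with a hypothetical column name, just restarts the accumulation within each date, which is not what I am after:
# not what I want: the running total resets at the start of every date
dt[, T_cumsum_within_day := cumsum(T_mean_daily), by = date]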
Answer:
Here is another data.table approach:
setnafill(dt[!duplicated(date), T_sum := cumsum(T_mean_daily)], "locf", cols = "T_sum")
explanation
Since we only need to use the first row of each date, we can select these rows using !duplicated(date) in the i of data.table. In j, we can then calculate the cumulative sum of T_mean_daily.
Now we are left with a column that has the correct cumulative sum on each first date-row and NAs in between, so we use setnafill to locf-fill the value over the NA rows in the T_sum column.
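For readability, the same pipeline can be written as two explicit steps (a sketch operating on the same dt, producing the same result):
# step 1: write the cumulative sum only to the first row of each date;
# every other row of T_sum is left as NA for now
dt[!duplicated(date), T_sum := cumsum(T_mean_daily)]
# step 2: carry the last non-NA value forward over the remaining rows
setnafill(dt, type = "locf", cols = "T_sum")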
benchmarks
set.seed(42)
dt <- data.table(datetime = seq(1, 10.9, by = 0.1),
                 date = rep(1:10, each = 10),
                 T = floor(runif(100, 1, 10)))
dt[,T_mean_daily:=mean(T), by=date]
microbenchmark::microbenchmark(
  r2evans = {
    test <- copy(dt)
    test[test[, .SD[1, ], by = date][, T_mean_daily := cumsum(T_mean_daily)],
         T_sum := i.T_mean_daily, on = .(date)]
  },
  wimpel = {
    test <- copy(dt)
    setnafill(test[!duplicated(date), T_sum := cumsum(T_mean_daily)], "locf", cols = "T_sum")
  }
)
Unit: microseconds
expr min lq mean median uq max neval cld
r2evans 3287.9 3488.20 3662.044 3560.65 3758.85 4833.1 100 b
wimpel 425.4 437.45 465.313 451.75 485.35 608.3 100 a
Answer:
If we do a temporary subset to just the first row of each date, we can then use cumsum and join it back into the original data.
set.seed(42)
dt <- data.table(datetime = seq(1, 10.9, by = 0.1),
                 date = rep(1:10, each = 10),
                 T = floor(runif(100, 1, 10)))
dt[,T_mean_daily:=mean(T), by=date]
dt
# datetime date T T_mean_daily
# <num> <int> <num> <num>
# 1: 1.0 1 9 6.2
# 2: 1.1 1 9 6.2
# 3: 1.2 1 3 6.2
# 4: 1.3 1 8 6.2
# 5: 1.4 1 6 6.2
# 6: 1.5 1 5 6.2
# 7: 1.6 1 7 6.2
# 8: 1.7 1 2 6.2
# 9: 1.8 1 6 6.2
# 10: 1.9 1 7 6.2
# ---
# 91: 10.0 10 7 5.6
# 92: 10.1 10 1 5.6
# 93: 10.2 10 2 5.6
# 94: 10.3 10 9 5.6
# 95: 10.4 10 9 5.6
# 96: 10.5 10 7 5.6
# 97: 10.6 10 3 5.6
# 98: 10.7 10 5 5.6
# 99: 10.8 10 7 5.6
# 100: 10.9 10 6 5.6
The aggregation is simply:
dt[, .SD[1,], by = date][, T_mean_daily := cumsum(T_mean_daily)][]
# date datetime T T_mean_daily
# <int> <num> <num> <num>
# 1: 1 1 9 6.2
# 2: 2 2 5 12.2
# 3: 3 3 9 18.3
# 4: 4 4 7 23.6
# 5: 5 5 4 29.6
# 6: 6 6 4 34.1
# 7: 7 7 7 40.1
# 8: 8 8 1 43.1
# 9: 9 9 6 47.0
# 10: 10 10 7 52.6
which we can join back onto the original data (the i. update-join syntax is annotated in a sketch after the output below):
dt[ dt[, .SD[1,], by = date][, T_mean_daily := cumsum(T_mean_daily)], T_sum := i.T_mean_daily, on = .(date)]
dt
# datetime date T T_mean_daily T_sum
# <num> <int> <num> <num> <num>
# 1: 1.0 1 9 6.2 6.2
# 2: 1.1 1 9 6.2 6.2
# 3: 1.2 1 3 6.2 6.2
# 4: 1.3 1 8 6.2 6.2
# 5: 1.4 1 6 6.2 6.2
# 6: 1.5 1 5 6.2 6.2
# 7: 1.6 1 7 6.2 6.2
# 8: 1.7 1 2 6.2 6.2
# 9: 1.8 1 6 6.2 6.2
# 10: 1.9 1 7 6.2 6.2
# ---
# 91: 10.0 10 7 5.6 52.6
# 92: 10.1 10 1 5.6 52.6
# 93: 10.2 10 2 5.6 52.6
# 94: 10.3 10 9 5.6 52.6
# 95: 10.4 10 9 5.6 52.6
# 96: 10.5 10 7 5.6 52.6
# 97: 10.6 10 3 5.6 52.6
# 98: 10.7 10 5 5.6 52.6
# 99: 10.8 10 7 5.6 52.6
# 100: 10.9 10 6 5.6 52.6
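As a side note on the update-join used above, the i. prefix refers to columns of the table supplied in i. A commented restatement, with the intermediate table given a hypothetical name agg:
# per-date table holding the running total of the daily means (same as above)
agg <- dt[, .SD[1, ], by = date][, T_mean_daily := cumsum(T_mean_daily)]
# update-join: for every row of dt whose date matches a row of agg,
# copy that row's (cumulative) T_mean_daily into dt's T_sum column
dt[agg, T_sum := i.T_mean_daily, on = .(date)]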