Hello I am interested in making an hourly electric load forecast for more than 7 months ahead. My data set includes about 5 and a half years of hourly load and temperature data. The model I am trying to implement is a multiple linear regression that includes temperature as an independent variable and month, weekday and hour as classification variables, as well as 24 variables of load lags; lag1 is the value of the electric load in the previous hour, lag2 is the value of energy load 2 hours before the current value and so on.
my_df <- read.csv(file = "https://raw.githubusercontent.com/Argiro1983/Load/main/my_df.csv", sep=";")
library(Hmisc)
mod_lm <- lm(LOAD ~ TEMPERATURE MONTH WEEKDAY HOUR Lag(LOAD,1) Lag(LOAD, 2) Lag(LOAD, 3) Lag(LOAD, 4) Lag(LOAD,5)
Lag(LOAD, 6) Lag(LOAD, 7) Lag(LOAD, 8) Lag(LOAD,9) Lag(LOAD, 10) Lag(LOAD,11) Lag(LOAD, 12)
Lag(LOAD, 13) Lag(LOAD, 14) Lag(LOAD, 15) Lag(LOAD, 16) Lag(LOAD, 17) Lag(LOAD, 18)
Lag(LOAD, 19) Lag(LOAD,20) Lag(LOAD, 21) Lag(LOAD, 22) Lag(LOAD, 23)
Lag(LOAD,24), data=my_df)
summary(mod_lm)
The model looks like this:
Call:
lm(formula = dyn(LOAD ~ TEMPERATURE MONTH WEEKDAY HOUR
lag(LOAD, 1) lag(LOAD, 2) lag(LOAD, 3) lag(LOAD, 4)
lag(LOAD, 5) lag(LOAD, 6) lag(LOAD, 7) lag(LOAD, 8)
lag(LOAD, 9) lag(LOAD, 10) lag(LOAD, 11) lag(LOAD,
12) lag(LOAD, 13) lag(LOAD, 14) lag(LOAD, 15) lag(LOAD,
16) lag(LOAD, 17) lag(LOAD, 18) lag(LOAD, 19) lag(LOAD,
20) lag(LOAD, 21) lag(LOAD, 22) lag(LOAD, 23) lag(LOAD,
24)), data = my_df)
Residuals:
Min 1Q Median 3Q Max
-1155.48 -76.38 -3.80 72.12 1540.34
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 96.297801 5.399303 17.835 < 2e-16 ***
TEMPERATURE 0.147311 0.087598 1.682 0.092638 .
MONTH -0.013815 0.186592 -0.074 0.940980
WEEKDAY -20.396726 0.361845 -56.369 < 2e-16 ***
HOUR 1.290159 0.171243 7.534 5.01e-14 ***
lag(LOAD, 1) 1.375390 0.004447 309.307 < 2e-16 ***
lag(LOAD, 2) -0.666860 0.007378 -90.379 < 2e-16 ***
lag(LOAD, 3) 0.205219 0.007890 26.010 < 2e-16 ***
lag(LOAD, 4) -0.176901 0.007905 -22.377 < 2e-16 ***
lag(LOAD, 5) 0.128568 0.007932 16.208 < 2e-16 ***
lag(LOAD, 6) -0.028096 0.007960 -3.530 0.000417 ***
lag(LOAD, 7) -0.058609 0.007950 -7.372 1.71e-13 ***
lag(LOAD, 8) 0.164145 0.007905 20.765 < 2e-16 ***
lag(LOAD, 9) -0.225412 0.007868 -28.650 < 2e-16 ***
lag(LOAD, 10) 0.133046 0.007940 16.757 < 2e-16 ***
lag(LOAD, 11) 0.014815 0.007948 1.864 0.062318 .
lag(LOAD, 12) -0.035893 0.007951 -4.515 6.36e-06 ***
lag(LOAD, 13) 0.025532 0.007956 3.209 0.001332 **
lag(LOAD, 14) -0.028748 0.007962 -3.611 0.000306 ***
lag(LOAD, 15) -0.095531 0.007928 -12.050 < 2e-16 ***
lag(LOAD, 16) 0.227563 0.007876 28.894 < 2e-16 ***
lag(LOAD, 17) -0.189406 0.007912 -23.939 < 2e-16 ***
lag(LOAD, 18) 0.070704 0.007947 8.897 < 2e-16 ***
lag(LOAD, 19) 0.020112 0.007954 2.528 0.011462 *
lag(LOAD, 20) -0.103368 0.007936 -13.025 < 2e-16 ***
lag(LOAD, 21) 0.181176 0.007901 22.931 < 2e-16 ***
lag(LOAD, 22) -0.204949 0.007907 -25.919 < 2e-16 ***
lag(LOAD, 23) 0.533351 0.007334 72.723 < 2e-16 ***
lag(LOAD, 24) -0.271700 0.004480 -60.654 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 131.8 on 46795 degrees of freedom
(24 observations deleted due to missingness)
Multiple R-squared: 0.9871, Adjusted R-squared: 0.9871
F-statistic: 1.28e 05 on 28 and 46795 DF, p-value: < 2.2e-16
How do I form my predict
function so that it produces a forecast that has the length of my forecasted temperature table (5736 values) and takes into account "forecasted" lagged load variables? I have been having difficulties using the dyn package , for some reason the lagged variables produce zero estimates.
forecast_df <- read.csv(file = "https://raw.githubusercontent.com/Argiro1983/Load/main/forecast_df.csv", sep=";")
This clearly won't work:
pred <-predict(mod_lm, newdata = forecast_df)
Thank you in advance, any idea on the matter is appreciated.
CodePudding user response:
I wasn't able to load your data, but I think you'd need to run the regression inside a loop - at the end of each loop the new forecast gets recoded to lag1
, the old lag1
becomes lag2
etc... at the loop runs from the first time (presumably now?) through all 5736 values.
You could store the results in a long table, with time as one column and forecast in the other column. Then at the end (or start) of each loop spread the most recent 24 values into new columns to be used as predictors. Add on your other predictors (temperature etc.) then run predict
again, add the new prediction to your long table... Repeat until done.
CodePudding user response:
Using my_df from the question convert it to zoo and then run dyn$lm. The problem was that there was a datetime field that was character. If we convert it to POSIXct then it works. Also note that there are date times that conflict with daylight savings time so the timezone must be specified as UTC. Be sure that dplyr is not loaded since it clobbers lag.
library(dyn)
z <- read.zoo(my_df, format = "%d/%m/%Y %H:%M", tz = "UTC")
fm <- dyn$lm(LOAD ~ TEMPERATURE MONTH WEEKDAY HOUR lag(LOAD, -(1:24)), z)