Making a long-term forecast with multiple linear regression and lagged variables in R

Time:05-20

Hello, I am interested in making an hourly electric load forecast for more than 7 months ahead. My data set includes about five and a half years of hourly load and temperature data. The model I am trying to implement is a multiple linear regression with temperature as an independent variable, month, weekday and hour as classification variables, and 24 lagged load variables: lag1 is the electric load in the previous hour, lag2 is the load two hours before the current value, and so on.

my_df <- read.csv(file = "https://raw.githubusercontent.com/Argiro1983/Load/main/my_df.csv", sep=";")

    
library(Hmisc)
mod_lm <- lm(LOAD ~ TEMPERATURE + MONTH + WEEKDAY + HOUR + Lag(LOAD, 1) + Lag(LOAD, 2) + Lag(LOAD, 3) + Lag(LOAD, 4) + Lag(LOAD, 5) +
               Lag(LOAD, 6) + Lag(LOAD, 7) + Lag(LOAD, 8) + Lag(LOAD, 9) + Lag(LOAD, 10) + Lag(LOAD, 11) + Lag(LOAD, 12) +
               Lag(LOAD, 13) + Lag(LOAD, 14) + Lag(LOAD, 15) + Lag(LOAD, 16) + Lag(LOAD, 17) + Lag(LOAD, 18) +
               Lag(LOAD, 19) + Lag(LOAD, 20) + Lag(LOAD, 21) + Lag(LOAD, 22) + Lag(LOAD, 23) +
               Lag(LOAD, 24), data = my_df)

summary(mod_lm)

The model looks like this:

Call:
lm(formula = dyn(LOAD ~ TEMPERATURE + MONTH + WEEKDAY + HOUR + 
    lag(LOAD, 1) + lag(LOAD, 2) + lag(LOAD, 3) + lag(LOAD, 4) + 
    lag(LOAD, 5) + lag(LOAD, 6) + lag(LOAD, 7) + lag(LOAD, 8) + 
    lag(LOAD, 9) + lag(LOAD, 10) + lag(LOAD, 11) + lag(LOAD, 
    12) + lag(LOAD, 13) + lag(LOAD, 14) + lag(LOAD, 15) + lag(LOAD, 
    16) + lag(LOAD, 17) + lag(LOAD, 18) + lag(LOAD, 19) + lag(LOAD, 
    20) + lag(LOAD, 21) + lag(LOAD, 22) + lag(LOAD, 23) + lag(LOAD, 
    24)), data = my_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1155.48   -76.38    -3.80    72.12  1540.34 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    96.297801   5.399303  17.835  < 2e-16 ***
TEMPERATURE     0.147311   0.087598   1.682 0.092638 .  
MONTH          -0.013815   0.186592  -0.074 0.940980    
WEEKDAY       -20.396726   0.361845 -56.369  < 2e-16 ***
HOUR            1.290159   0.171243   7.534 5.01e-14 ***
lag(LOAD, 1)    1.375390   0.004447 309.307  < 2e-16 ***
lag(LOAD, 2)   -0.666860   0.007378 -90.379  < 2e-16 ***
lag(LOAD, 3)    0.205219   0.007890  26.010  < 2e-16 ***
lag(LOAD, 4)   -0.176901   0.007905 -22.377  < 2e-16 ***
lag(LOAD, 5)    0.128568   0.007932  16.208  < 2e-16 ***
lag(LOAD, 6)   -0.028096   0.007960  -3.530 0.000417 ***
lag(LOAD, 7)   -0.058609   0.007950  -7.372 1.71e-13 ***
lag(LOAD, 8)    0.164145   0.007905  20.765  < 2e-16 ***
lag(LOAD, 9)   -0.225412   0.007868 -28.650  < 2e-16 ***
lag(LOAD, 10)   0.133046   0.007940  16.757  < 2e-16 ***
lag(LOAD, 11)   0.014815   0.007948   1.864 0.062318 .  
lag(LOAD, 12)  -0.035893   0.007951  -4.515 6.36e-06 ***
lag(LOAD, 13)   0.025532   0.007956   3.209 0.001332 ** 
lag(LOAD, 14)  -0.028748   0.007962  -3.611 0.000306 ***
lag(LOAD, 15)  -0.095531   0.007928 -12.050  < 2e-16 ***
lag(LOAD, 16)   0.227563   0.007876  28.894  < 2e-16 ***
lag(LOAD, 17)  -0.189406   0.007912 -23.939  < 2e-16 ***
lag(LOAD, 18)   0.070704   0.007947   8.897  < 2e-16 ***
lag(LOAD, 19)   0.020112   0.007954   2.528 0.011462 *  
lag(LOAD, 20)  -0.103368   0.007936 -13.025  < 2e-16 ***
lag(LOAD, 21)   0.181176   0.007901  22.931  < 2e-16 ***
lag(LOAD, 22)  -0.204949   0.007907 -25.919  < 2e-16 ***
lag(LOAD, 23)   0.533351   0.007334  72.723  < 2e-16 ***
lag(LOAD, 24)  -0.271700   0.004480 -60.654  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 131.8 on 46795 degrees of freedom
  (24 observations deleted due to missingness)
Multiple R-squared:  0.9871,    Adjusted R-squared:  0.9871 
F-statistic: 1.28e+05 on 28 and 46795 DF,  p-value: < 2.2e-16

How do I form my predict call so that it produces a forecast with the same length as my forecasted temperature table (5736 values) and takes into account "forecasted" lagged load variables? I have been having difficulties using the dyn package; for some reason the lagged variables produce zero estimates.

forecast_df <- read.csv(file = "https://raw.githubusercontent.com/Argiro1983/Load/main/forecast_df.csv", sep=";")

This clearly won't work:

pred <- predict(mod_lm, newdata = forecast_df)

Thank you in advance, any idea on the matter is appreciated.

CodePudding user response:

I wasn't able to load your data, but I think you'd need to run the prediction step inside a loop. At the end of each iteration the new forecast gets recoded to lag1, the old lag1 becomes lag2, and so on, and the loop runs from the first forecast time (presumably now?) through all 5736 values.

You could store the results in a long table, with time as one column and forecast in the other column. Then at the end (or start) of each loop spread the most recent 24 values into new columns to be used as predictors. Add on your other predictors (temperature etc.) then run predict again, add the new prediction to your long table... Repeat until done.
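The loop described above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses synthetic hourly data instead of my_df and forecast_df, builds explicit lag1..lag24 columns with base R's embed() rather than Hmisc or dyn, and keeps only temperature and hour as extra predictors. All variable names (train, fit, lags, pred) are made up for the example.

```r
# Recursive one-step-ahead forecast with a lagged linear model (base R only).
# Synthetic data stands in for my_df / forecast_df; names are illustrative.
set.seed(1)
n_lags <- 24
n_hist <- 24 * 60   # 60 days of hourly history
h      <- 24 * 7    # forecast horizon (it would be 5736 in the question)

hour <- (seq_len(n_hist + h) - 1) %% 24
temp <- 15 + 8 * sin(2 * pi * hour / 24) + rnorm(n_hist + h)
load_hist <- 1000 + 20 * temp[seq_len(n_hist)] +
  150 * sin(2 * pi * hour[seq_len(n_hist)] / 24) + rnorm(n_hist, sd = 30)

# embed() returns y[t], y[t-1], ..., y[t-24] in columns 1..25
emb <- embed(load_hist, n_lags + 1)
train <- data.frame(LOAD = emb[, 1], emb[, -1])
names(train)[-1] <- paste0("lag", seq_len(n_lags))
idx <- (n_lags + 1):n_hist              # rows that have all 24 lags available
train$TEMPERATURE <- temp[idx]
train$HOUR <- factor(hour[idx], levels = 0:23)

# Fit once on the history
fit <- lm(LOAD ~ ., data = train)

# Last 24 observed loads, most recent first (lag1, lag2, ...)
lags <- rev(tail(load_hist, n_lags))
pred <- numeric(h)
for (i in seq_len(h)) {
  k <- n_hist + i                       # position of the hour being forecast
  newrow <- as.data.frame(as.list(setNames(lags, paste0("lag", seq_len(n_lags)))))
  newrow$TEMPERATURE <- temp[k]         # forecasted temperature for that hour
  newrow$HOUR <- factor(hour[k], levels = 0:23)
  pred[i] <- predict(fit, newdata = newrow)
  lags <- c(pred[i], lags[-n_lags])     # new forecast becomes lag1, others shift
}
```

The model is fitted once and only predict is called inside the loop. Because every forecast is fed back in as a lag, errors accumulate over the horizon, so after the first day or so the forecast is driven mostly by temperature and the calendar terms.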

CodePudding user response:

Using my_df from the question, convert it to zoo and then run dyn$lm. The problem was that the datetime field was stored as character; once it is converted to POSIXct, it works. Also note that some of the date-times conflict with daylight saving time, so the time zone must be specified as UTC. Be sure that dplyr is not loaded, since it clobbers lag.

library(dyn)

z <- read.zoo(my_df, format = "%d/%m/%Y %H:%M", tz = "UTC")
fm <- dyn$lm(LOAD ~ TEMPERATURE + MONTH + WEEKDAY + HOUR + lag(LOAD, -(1:24)), z)
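
The time zone point can be checked with base R alone. The sketch below uses made-up timestamps around a spring clock change (they are not taken from my_df) in the same "%d/%m/%Y %H:%M" format. Parsed as UTC, each string maps to one distinct instant and the index stays evenly spaced; in a zone that observes daylight saving time the same strings may not.

```r
# Hypothetical timestamps around a spring clock change; not taken from my_df.
stamps <- c("29/03/2020 02:00", "29/03/2020 03:00", "29/03/2020 04:00")

# Parsed as UTC, every wall-clock string maps to exactly one instant.
utc <- as.POSIXct(stamps, format = "%d/%m/%Y %H:%M", tz = "UTC")

# Checks a regular hourly index should pass before building a zoo series.
ok_no_na   <- !anyNA(utc)
ok_unique  <- anyDuplicated(utc) == 0
ok_regular <- all(as.numeric(diff(utc), units = "hours") == 1)
```

Running the same three checks on the parsed datetime column of the real data is a quick way to confirm the index is usable before calling read.zoo.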