I'm working to forecast the number of employees in the United States by month. The data is located at:
library(tidyverse)
library(fpp3)
# Source: https://beta.bls.gov/dataViewer/view/timeseries/CES0000000001
All_Employees <- read_csv('https://raw.githubusercontent.com/InfiniteCuriosity/predicting_labor/main/All_Employees.csv', col_select = c(Label, Value), show_col_types = FALSE)
All_Employees <- All_Employees %>%
  rename(Month = Label, Total_Employees = Value)
All_Employees <- All_Employees %>%
  mutate(Month = yearmonth(Month)) %>%
  as_tsibble(index = Month)
I'm following the cross-validation section of the excellent text Forecasting: Principles and Practice, 3rd Edition.
Here is the code I'm running using cross-validation:
# Build expanding-window training sets; the default .init = 1 starts with a single observation
All_Employees_train <- All_Employees %>%
  stretch_tsibble()
# Fit every model on each training set, forecast 3 months ahead, and rank by RMSE
All_Employees_train %>%
  model(
    linear = TSLM(Total_Employees ~ trend() + season()),
    Exponential = TSLM(log(Total_Employees) ~ trend() + season()),
    Arima = ARIMA(Total_Employees ~ trend() + season()),
    Ets = ETS(Total_Employees),
    Mean = MEAN(Total_Employees),
    Naive = NAIVE(Total_Employees),
    SNaive = SNAIVE(Total_Employees),
    Drift = SNAIVE(Total_Employees ~ drift())) %>%
  forecast(h = 3) %>%
  accuracy(All_Employees) %>%
  arrange(RMSE)
That code returns the results below, along with more than 50 warnings:
# A tibble: 8 × 10
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Naive Test 162. 2168. 747. 0.101 0.541 0.226 0.487 0.685
2 Ets Test 214. 4227. 774. 0.145 0.563 0.234 0.949 0.608
3 SNaive Test 806. 4453. 3303. 0.515 2.36 1 1.00 0.866
4 Drift Test 535. 4469. 3170. 0.343 2.28 0.960 1.00 0.868
5 Exponential Test 1861. 4692. 3942. 1.27 2.81 1.19 1.05 0.934
6 linear Test 1887. 4697. 3952. 1.29 2.81 1.20 1.05 0.934
7 Mean Test 3565. 6724. 5410. 2.37 3.77 1.64 1.51 0.959
8 Arima Test -488. 11113. 2290. -0.383 1.65 0.693 2.50 0.673
There were 50 or more warnings (use warnings() to see the first 50)
Here are a few of the warnings:
Warning messages:
1: In for (i in namD) if (is.character(data[[i]])) data[[i]] <- factor(data[[i]]) :
closing unused connection 12 (<-localhost:11913)
11: Provided exogenous regressors are rank deficient, removing regressors: `season()year2`, `season()year3`, `season()year4`, `season()year5`, `season()year6`, `season()year7`, `season()year8`, `season()year9`, `season()year10`, `season()year11`, `season()year12`
24: In sqrt(diag(best$var.coef)) : NaNs produced
27: 12 errors (2 unique) encountered for Arima
28: 3 errors (2 unique) encountered for Ets
[2] Not enough data to estimate this ETS model.
[1] only 1 case, but 2 variables
50: Problem while computing `Exponential = (function (object, ...) ...`.
ℹ prediction from a rank-deficient fit may be misleading
However, if I simply make a single training set and run the exact same models on it, no warnings are returned, the best results have a much lower RMSE than with cross-validation, and the results come back much faster (for obvious reasons). Here is the code to make the training set, and the results:
All_Employees_train <- All_Employees %>%
  filter(Month <= yearmonth("2022 Feb"))
# A tibble: 8 × 10
.model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Naive Test 819. 885. 819. 0.541 0.541 0.252 0.201 -0.000688
2 Ets Test 825. 891. 825. 0.545 0.545 0.254 0.202 -0.000688
3 Arima Test 1656. 1861. 1656. 1.09 1.09 0.509 0.422 -0.120
4 Exponential Test 3075. 3178. 3075. 2.03 2.03 0.946 0.720 -0.151
5 linear Test 3172. 3265. 3172. 2.10 2.10 0.976 0.740 -0.143
6 Drift Test 5810. 5810. 5810. 3.84 3.84 1.79 1.32 -0.378
7 SNaive Test 6521. 6522. 6521. 4.31 4.31 2.01 1.48 -0.378
8 Mean Test 11457. 11462. 11457. 7.57 7.57 3.53 2.60 -0.000688
How can the cross-validation method be run without errors (and hopefully better results)?
Answer:
Your stretched data set contains very short time series, and fitting models to them is causing these warnings. When you use stretch_tsibble(), set .init to a larger number; this controls the length of the smallest time series. For example, use at least 2 years of data in each of the training sets:
All_Employees_train <- All_Employees %>%
  stretch_tsibble(.init = 24)
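To see why .init matters, it can help to count the observations in each training set. The sketch below is only an illustration: it uses a small made-up monthly series (toy) instead of the real employment data so that it runs without a download, and fold_sizes() is a hypothetical helper written for this example, not a tsibble function. It then runs a reduced version of the cross-validation pipeline on the folds built with .init = 24.

```r
library(dplyr)
library(tsibble)
library(fable)

# Hypothetical monthly series: 60 months of made-up values (assumption, not the BLS data)
k <- 0:59
toy <- tsibble(
  Month = yearmonth("2018 Jan") + k,
  Total_Employees = 1000 + 2 * k + 5 * sin(2 * pi * k / 12) + (k %% 7),
  index = Month
)

# Hypothetical helper: number of observations in each training set (fold)
fold_sizes <- function(x) {
  x %>%
    as_tibble() %>%
    count(.id, name = "n_obs")
}

# Default .init = 1: the first fold holds a single month, too short for
# models with trend and seasonal terms
sizes_default <- toy %>% stretch_tsibble() %>% fold_sizes()

# .init = 24: every fold holds at least two years of data
sizes_init24 <- toy %>% stretch_tsibble(.init = 24) %>% fold_sizes()

min(sizes_default$n_obs)
min(sizes_init24$n_obs)

# Reduced cross-validation pipeline on the longer folds (two models only)
cv_accuracy <- toy %>%
  stretch_tsibble(.init = 24) %>%
  model(
    linear = TSLM(Total_Employees ~ trend() + season()),
    Naive = NAIVE(Total_Employees)
  ) %>%
  forecast(h = 3) %>%
  accuracy(toy) %>%
  arrange(RMSE)

cv_accuracy
```

The same pattern should carry over to All_Employees by swapping toy for the real tsibble and restoring the full model list. Because the last few folds forecast past the end of the series, accuracy() may still warn that the future data is incomplete; that is a separate issue from the short-series warnings above.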