Manipulating data for Regression Model using dplyr in R

Time:10-06

I have data like this.

library(lubridate)
set.seed(2021)
gen_date <- seq(ymd_h("2021-01-01-00"), ymd_h("2021-09-30-23"), by = "hours")
hourx <- hour(gen_date)
datex <- date(gen_date)
sales <- round(runif(length(datex), 10, 50), 0)*100
mydata <- data.frame(datex, hourx, sales)

head(mydata)
#       datex hourx sales
#1 2021-01-01     0  2800
#2 2021-01-01     1  4100
#3 2021-01-01     2  3800
#4 2021-01-01     3  2500
#5 2021-01-01     4  3500
#6 2021-01-01     5  3800

tail(mydata)
#          datex hourx sales
#6547 2021-09-30    18  3900
#6548 2021-09-30    19  3600
#6549 2021-09-30    20  3000
#6550 2021-09-30    21  4700
#6551 2021-09-30    22  4700
#6552 2021-09-30    23  3600

My task is to build a linear regression model, but the data setup is tricky. Assume we have data from January to March and need to use it to forecast April. Here are the steps:

  • We use the January and February data as independent variables (X) and the March data as the dependent variable (Y) to build the regression model. Because February has the fewest days (28), we cut the January and March data down to 28 days too.
data_jan <- mydata[1:672,]
data_feb <- mydata[745:1416,]
data_mar <- mydata[1417:2088,]
  • Fit the regression model using the lm function.
mydata_reg <- data.frame(x1 = data_jan$sales, 
                         x2 = data_feb$sales,
                         y = data_mar$sales)
model_reg <- lm(y~., data = mydata_reg)
  • After fitting the model, we use the February and March data as the new independent variables (X).
mydata_reg_for <- data.frame(x1 = data_feb$sales, 
                             x2 = data_mar$sales)
pred_data_apr <- predict(model_reg, newdata = mydata_reg_for)
  • Check the length of the month. April has 30 days and we only have 28 days of forecast data, so we still need 2 more days to complete the forecast. February only has 28 days, so for x1 we use the first two dates of March, "2021-03-01" & "2021-03-02". March has 31 days, so for x2 we simply add "2021-03-29" & "2021-03-30".
data_feb_add <- mydata[1417:1464,]
data_mar_add <- mydata[2089:2136,]
mydata_reg_add <- data.frame(x1 = data_feb_add$sales,
                             x2 = data_mar_add$sales)
  • After that, we predict with the same model_reg as before and combine all the April forecasts.
pred_data_apr_add <- predict(model_reg, newdata = mydata_reg_add)
data_apr <- c(as.numeric(pred_data_apr), as.numeric(pred_data_apr_add))

My question is: how do we make this process run automatically every month using the dplyr package, given that every month has a different number of days? I used the February data here because it has the fewest days, and the same condition applies to other months. Many thanks.

CodePudding user response:

You can simply split data by group_split

library(dplyr)

mydata %>%
  group_split(month(datex))

This code splits mydata into a list of data frames, one per month that appears in datex (nine for the January to September data above).
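
To check what the split actually returns, here is a minimal sketch that rebuilds the question's data and counts the days in each piece. The month_id column is a helper introduced only for this illustration; it is not part of the original answer.

```r
library(dplyr)
library(lubridate)

# Rebuild the example data from the question
set.seed(2021)
gen_date <- seq(ymd_h("2021-01-01-00"), ymd_h("2021-09-30-23"), by = "hours")
mydata <- data.frame(datex = date(gen_date),
                     hourx = hour(gen_date),
                     sales = round(runif(length(gen_date), 10, 50), 0) * 100)

# One data frame per calendar month.
# month_id (first day of the month) is a helper column assumed for this sketch.
monthly <- mydata %>%
  mutate(month_id = floor_date(datex, "month")) %>%
  group_split(month_id)

n_months <- length(monthly)                    # number of months present in the data
days_per_month <- sapply(monthly, nrow) / 24   # 24 hourly rows per day
```

Grouping on the first day of the month instead of month(datex) keeps the same month from different years apart, which matters once the data spans more than one year.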

CodePudding user response:

If you want to control the number of days taken from each month, you could filter by date rather than by row number.

I'm sure it can be tidied up more than this, but you would just need to change forcast_date <- as.Date("2021-04-01") to whichever month you want to forecast.

##set the forecast month. This should be straightforward to automate with a list or an increment
forcast_date <- as.Date("2021-04-01") # April

##get the forecast month length. This would be used for the data_feb_add and data_mar_add step.
forcast_month_length <- days_in_month(forcast_date) #30 days

##get dates for the previous 3 months
month_1_date <- forcast_date %m-% months(3)
month_2_date <- forcast_date %m-% months(2)
month_3_date <- forcast_date %m-% months(1)

##find the shortest month in that time range.
shortest_month <- min(c(days_in_month(month_1_date), 
                        days_in_month(month_2_date), 
                        days_in_month(month_3_date))) #28 days

##select the first 28 days (the shortest month) for each of the months used for the variables
##seq() keeps the window as Date values, so %in% compares dates with dates
data_month_1 <- mydata[mydata$datex %in% seq(month_1_date, by = "day", length.out = shortest_month),]
data_month_2 <- mydata[mydata$datex %in% seq(month_2_date, by = "day", length.out = shortest_month),]
data_month_3 <- mydata[mydata$datex %in% seq(month_3_date, by = "day", length.out = shortest_month),]

##select the number of days needed for each month for the forecast data (30 days for april)
month_2_forecast_length <- mydata[mydata$datex %in% seq(month_2_date, by = "day", length.out = forcast_month_length),]
month_3_forecast_length <- mydata[mydata$datex %in% seq(month_3_date, by = "day", length.out = forcast_month_length),]
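
The steps above stop before the model is fitted. As a rough sketch of how they might be wrapped into one reusable function, the block below combines the date-window selection with the lm and predict calls from the question. The names sales_window and forecast_month are hypothetical helpers made up for this illustration, and the data is rebuilt so the block runs on its own.

```r
library(lubridate)

# Rebuild the example data from the question
set.seed(2021)
gen_date <- seq(ymd_h("2021-01-01-00"), ymd_h("2021-09-30-23"), by = "hours")
mydata <- data.frame(datex = date(gen_date),
                     hourx = hour(gen_date),
                     sales = round(runif(length(gen_date), 10, 50), 0) * 100)

# Hypothetical helper: hourly sales for n_days consecutive days starting at start
sales_window <- function(data, start, n_days) {
  days <- seq(start, by = "day", length.out = n_days)
  data$sales[data$datex %in% days]
}

# Hypothetical wrapper: forecast one month from the three months before it
forecast_month <- function(data, forecast_date) {
  m1 <- forecast_date %m-% months(3)
  m2 <- forecast_date %m-% months(2)
  m3 <- forecast_date %m-% months(1)

  n_fit  <- min(days_in_month(c(m1, m2, m3)))  # shortest training month
  n_pred <- days_in_month(forecast_date)       # days to forecast

  # Fit: months 1 and 2 as predictors, month 3 as response, all cut to n_fit days
  fit_data <- data.frame(x1 = sales_window(data, m1, n_fit),
                         x2 = sales_window(data, m2, n_fit),
                         y  = sales_window(data, m3, n_fit))
  model <- lm(y ~ ., data = fit_data)

  # Predict: months 2 and 3 as predictors, each window n_pred days long
  new_data <- data.frame(x1 = sales_window(data, m2, n_pred),
                         x2 = sales_window(data, m3, n_pred))
  as.numeric(predict(model, newdata = new_data))
}

pred_apr <- forecast_month(mydata, as.Date("2021-04-01"))
n_hours <- length(pred_apr)  # 30 days * 24 hours
```

One caveat under these assumptions: when the forecast month is longer than the month before it (for example May after April), the last prediction window runs into the forecast month itself, so those rows would not exist in real use and the window lengths would need a different rule.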
