I want to make sure the weights_column argument in h2o.glm() behaves the same as the weights argument in glm(). To compare, I am looking at the RMSE of both models using the Seatbelts dataset in R. I don't think a weight is needed in this model, but for the sake of demonstration I added one.
head(Seatbelts)
Seatbelts <- as.data.frame(Seatbelts)  # the built-in dataset is a time-series matrix
Seatbelts <- Seatbelts[complete.cases(Seatbelts), ]
## 75% of the sample size
smp_size <- floor(0.75 * nrow(Seatbelts))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(Seatbelts)), size = smp_size)
train <- Seatbelts[train_ind, ]
test <- Seatbelts[-train_ind, ]
# glm()
m1 <- glm(DriversKilled ~ front + rear + kms + PetrolPrice + VanKilled + law,
family=poisson(link = "log"),
weights = drivers,
data=train)
pred <- predict(m1, test)
caret::RMSE(pred = pred, obs = test$DriversKilled)  # RMSE() is assumed to come from the caret package
The rmse is 120.5797.
# h2o.glm()
library(h2o)
h2o.init()
train <- as.h2o(train)
test <- as.h2o(test)
m2 <- h2o.glm(x = c("front", "rear", "kms", "PetrolPrice", "VanKilled", "law"),
y = "DriversKilled",
training_frame = train,
family = 'poisson',
link = 'log',
lambda = 0,
weights_column = "drivers")
# performance metrics on test data
h2o.performance(m2, test)
The rmse is 18.65627. Why do these models have such different rmse? Am I using the weights_column argument in h2o.glm() incorrectly?
Answer:
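With glm(), predict() returns values on the scale of the link function by default (type = "link"), so your predictions are the logs of the expected counts. To compare them with the observed values you need to exponentiate them, or call predict(m1, test, type = "response") instead.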
Metrics::rmse(exp(pred), test$DriversKilled)
[1] 18.09796
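To make the link-versus-response point concrete, here is a minimal sketch that uses only base R and the built-in Seatbelts data. The rmse() helper is a hypothetical stand-in defined inline (it is not the Metrics or caret function), and the split simply mirrors the one in the question, so treat the exact figures it prints as illustrative rather than authoritative.

```r
# Self-contained sketch: base R only, no h2o required.
df <- as.data.frame(Seatbelts)  # Seatbelts ships as a multivariate time series

set.seed(123)
train_ind <- sample(seq_len(nrow(df)), size = floor(0.75 * nrow(df)))
train <- df[train_ind, ]
test  <- df[-train_ind, ]

m1 <- glm(DriversKilled ~ front + rear + kms + PetrolPrice + VanKilled + law,
          family  = poisson(link = "log"),
          weights = drivers,
          data    = train)

pred_link <- predict(m1, test)                     # default type = "link" (log scale)
pred_resp <- predict(m1, test, type = "response")  # expected counts

# Hypothetical helper so the sketch needs no extra package
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Exponentiating the link-scale predictions gives the response-scale ones
print(all.equal(exp(pred_link), pred_resp))

# Log-scale predictions compared with raw counts give a misleadingly large RMSE
print(rmse(pred_link, test$DriversKilled))
print(rmse(pred_resp, test$DriversKilled))
```

Only the response-scale RMSE is comparable with the figure reported by h2o.performance().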
If you make a prediction with h2o you will see that it has already taken care of the exponential operation.
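Note that the two models still differ slightly in RMSE. That is expected: h2o.glm does more in the background than glm() (for example, it standardizes the predictors by default and uses its own solver and convergence criteria), so the fitted coefficients will not match exactly.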