I have a 2.2 million row dataset. randomForest throws an error if the training set has more than 1 000 000 rows, so I split the data into two pieces and trained a model on each separately. How do I combine() the models so I can make a prediction with both of their knowledge?
rtask <- makeClassifTask(data=Originaldaten,target="geklaut")
set.seed(1)
ho = makeResampleInstance("CV",task=rtask, iters = 20)
rtask.train = subsetTask(rtask, subset = 1:1000000)
rtask.train2 = subsetTask(rtask, subset = 1000001:2000000)
rtask.test = subsetTask(rtask, subset = 2000001:2227502)
rlearn_lm <- makeWeightedClassesWrapper(makeLearner("classif.randomForest"), wcw.weight = 0.1209123724417812)
param_lm <- makeParamSet(
makeIntegerParam("ntree", lower = 500, upper = 500),
makeLogicalParam("norm.votes", default = FALSE, tunable = FALSE),
makeLogicalParam("importance", default = TRUE, tunable = TRUE),
makeIntegerParam("maxnodes", lower = 4, upper = 4)
)
tune_lm <- tuneParams(rlearn_lm,
rtask.train,
cv5, # 5-fold cross-validation
mmce, # misclassification error
param_lm,
makeTuneControlGrid(resolution = 5)) # grid over the value ranges
rlearn_lm <- setHyperPars(rlearn_lm,par.vals = tune_lm$x)
model_lm <- train(rlearn_lm,rtask.train)
model_lm2 <- train(rlearn_lm,rtask.train2)
modelGesamt <- combine(model_lm, model_lm2)
EDIT
You guys are right; actually reading my own code helped me a lot. Here is a working resampling setup for anyone interested in the future:
ho = makeResampleInstance("CV",task=rtask, iters = 20)
rtask.train = subsetTask(rtask,ho$train.inds[[1]])
rtask.test = subsetTask(rtask,ho$test.inds[[1]] )
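For completeness, the resampling split above can be wired into training and evaluation like this. This is a minimal sketch assuming the mlr and randomForest packages are installed; iris stands in for the original data, and the task/learner names follow mlr's API:

```r
library(mlr)

# iris as a small stand-in for the original 2.2M-row data frame
task <- makeClassifTask(data = iris, target = "Species")

set.seed(1)
ho <- makeResampleInstance("CV", task = task, iters = 20)
task.train <- subsetTask(task, ho$train.inds[[1]])
task.test  <- subsetTask(task, ho$test.inds[[1]])

# train on the training fold, evaluate on the held-out fold
lrn  <- makeLearner("classif.randomForest")
mod  <- train(lrn, task.train)
pred <- predict(mod, task.test)

performance(pred, measures = mmce)  # misclassification error on the test fold
```

The point of using `ho$train.inds`/`ho$test.inds` instead of hand-picked row ranges is that the rows are shuffled, so the train and test subsets cannot overlap or follow the original row order.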
CodePudding user response:
This is not possible, and you should not do it anyway. Train one model on the full data, even if it takes longer. Two models trained on different datasets cannot be merged into a single model that fuses their knowledge.
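If the error on the full dataset is a memory problem, one way to still train a single model is randomForest's `sampsize` argument, which caps the number of rows each tree is grown on while every tree still draws from the whole dataset. This is a hedged sketch, not the asker's setup: iris stands in for the full data, and the cap of 50 rows is an illustrative value:

```r
library(randomForest)

set.seed(1)
# sampsize bounds the bootstrap sample per tree, so per-tree memory stays
# small even when the full data frame is very large
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500,
                   sampsize = 50)
print(rf)
```

Each tree then sees a different random subset of the full data, so the forest as a whole still uses all rows, unlike two separate models trained on disjoint halves.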