I need help understanding and implementing K-Fold cross validation.
As far as I understand cross validation, it divides the dataset into k subgroups so that there is greater variation in the training and test data.
So my first question is: does it replace an ordinary train_test_split(), or is it used in addition to it when I want to save and reuse the model?
When I create a model like this:
X = df["A"]
y = df["B"]
model = LinearRegression()
kfold = KFold(shuffle=True, n_splits=10)
scores = cross_val_score(model, X, y, cv=kfold)
When I save it to reuse on real data, I get the error that the model has no predict method and has not been fitted yet. If I just call model.fit(X, y) at the end, the cross validation scores don't get considered, do they?
How do I correctly fit the model using cross validation, so I can save and use it afterwards?
CodePudding user response:
To your first question:
It is an alternative method. You can think of it as doing a train_test_split 10 times on the dataset with no overlap between the test sets. What you get is a more reliable score, because every sample is used for testing exactly once.
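To see the non-overlapping test sets concretely, here is a minimal sketch on a toy array (the 20-sample array is just an illustration, not from the question):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(-1, 1)  # 20 toy samples
kf = KFold(n_splits=10)
for train_ids, test_ids in kf.split(X):
    print(test_ids)  # each sample lands in exactly one test fold; the test folds never overlap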
To your second question:
You just need to fit the model. You have three options: define a training set with train_test_split(), fit on that, and use CV only for evaluation; fit on the whole dataset; or, the most sophisticated, loop over your folds and fit a separate model for each training set, as in the loop below. For the final prediction you then average over your models.
import joblib
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=42)
for fold, (train_ids, valid_ids) in enumerate(kf.split(df)):
    Xtrain = df.iloc[train_ids][["A"]]  # kf.split returns positional indices, so use iloc
    ytrain = df.iloc[train_ids]["B"]
    Xvalid = df.iloc[valid_ids][["A"]]
    yvalid = df.iloc[valid_ids]["B"]
    model = LinearRegression()          # a fresh model for each fold
    model.fit(Xtrain, ytrain)
    print(fold, model.score(Xvalid, yvalid))
    joblib.dump(model, f"model_fold{fold}.joblib")  # scikit-learn models are persisted with joblib; there is no save_model()
Then you can either use the best performing model or, as mentioned above, use all 10 models and average their predictions.
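A minimal sketch of the averaging step, assuming the fold models were saved as above and a hypothetical Xnew holding new data with the same "A" feature column:

import numpy as np
import joblib

models = [joblib.load(f"model_fold{fold}.joblib") for fold in range(10)]
preds = np.mean([m.predict(Xnew) for m in models], axis=0)  # Xnew is hypothetical new data; average the 10 predictions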
CodePudding user response:
You can't. It doesn't really make sense in that context. In your code there are actually 10 individual models, and if you shuffle, the models themselves will change. cross_val_score is essentially there for you to evaluate whether your features and model type work well for your problem.
You need a separate fit step in order to use it downstream.
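A minimal sketch of that separate fit step, assuming df with columns "A" and "B" as in the question, and using joblib (the usual way to persist scikit-learn models):

import joblib
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LinearRegression(), df[["A"]], df["B"], cv=10)  # evaluation only

model = LinearRegression().fit(df[["A"]], df["B"])  # the separate fit used downstream
joblib.dump(model, "final_model.joblib")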
CodePudding user response:
I think you may want to train and test your model first, then utilize k-fold cross validation to further validate it by splitting your sample set, as sketched below.
The two concepts are separate.
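A minimal sketch of that combination, assuming df with columns "A" and "B" as in the question: hold out a test set, cross validate on the training portion, then confirm on the held-out data.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(df[["A"]], df["B"], test_size=0.2, random_state=42)

model = LinearRegression()
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, Xtrain, ytrain, cv=kfold)  # validation on the training portion only

model.fit(Xtrain, ytrain)
print(cv_scores.mean(), model.score(Xtest, ytest))  # CV estimate vs held-out test score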