Save scikit-learn model without datasets

I've trained a RandomForestClassifier model with the sklearn library and saved it with joblib. The resulting joblib file is nearly 1 GB, and I'm deploying it on an Nginx/Flask/Gunicorn stack. I need an efficient way to load this model from file and serve API requests. Is it possible to save the model without the datasets when doing:

joblib.dump(model, '/kaggle/working/mymodel.joblib')
print("random classifier saved")

CodePudding user response：

The persistent representation of Scikit-Learn estimators DOES NOT include any training data.
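
One way to check this yourself is to fit the same depth-limited forest on a small and a much larger training set and compare the serialized sizes. The sketch below is only an illustration: the synthetic data, the helper name fit_and_measure, and the parameter values are assumptions chosen for the example, not anything taken from the question.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier


def fit_and_measure(n_samples, seed=0):
    """Fit a small depth-limited forest; return (pickled model bytes, training data bytes)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, 20))      # synthetic features (assumption)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic binary labels (assumption)
    model = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=seed)
    model.fit(X, y)
    return len(pickle.dumps(model)), X.nbytes


small_model_bytes, small_data_bytes = fit_and_measure(1_000)
large_model_bytes, large_data_bytes = fit_and_measure(20_000)

print(f"1,000 rows:  data {small_data_bytes:,} B, pickled model {small_model_bytes:,} B")
print(f"20,000 rows: data {large_data_bytes:,} B, pickled model {large_model_bytes:,} B")
```

If the training data were stored inside the estimator, the second pickle would be at least as large as the 20,000-row array. With the tree depth held fixed, the two pickles should instead come out at roughly the same size.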

For decision trees and their ensembles (such as random forests), the size of the estimator object scales with the total number of tree nodes, and the number of nodes can grow exponentially with the depth of the trees: a binary tree limited to max_depth levels has at most 2^(max_depth + 1) - 1 nodes. This is because each fitted tree is stored as a set of per-node arrays (child indices, split feature, threshold, per-class values and so on), most of them 64-bit.
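
To see how tree size relates to depth in practice, you can inspect the fitted trees directly through the tree_ attribute of each sub-estimator. This is a minimal sketch on synthetic data (the data, the helper name fit_forest, and the depths tried are assumptions for illustration); node_count, threshold, and value are existing attributes of scikit-learn's fitted tree object.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 10))          # synthetic features (assumption)
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # XOR-like labels that need deeper trees


def fit_forest(max_depth):
    """Fit a 5-tree forest with the given depth limit (hypothetical helper)."""
    forest = RandomForestClassifier(n_estimators=5, max_depth=max_depth, random_state=0)
    return forest.fit(X, y)


node_counts = {}
for depth in (2, 4, 8):
    forest = fit_forest(depth)
    # Sum the node counts of all trees in the ensemble.
    node_counts[depth] = sum(est.tree_.node_count for est in forest.estimators_)
    print(f"max_depth={depth}: {node_counts[depth]} nodes in total")

# Each tree keeps one entry per node in arrays such as threshold and value.
first_tree = fit_forest(8).estimators_[0].tree_
print("threshold shape:", first_tree.threshold.shape)
print("value shape:", first_tree.value.shape)
```

The first dimension of every per-node array equals node_count, so the memory footprint of a tree follows the number of nodes it ends up with.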

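You can make your random forest objects smaller by limiting the max_depth parameter. If you're worried about potential loss of predictive performance, you may increase the number of child estimators (the n_estimators parameter), keeping in mind that the model size grows roughly linearly with the number of trees.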
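
The trade-off can be measured before deploying anything. The following sketch compares the pickled size and held-out accuracy of a few configurations. The synthetic dataset, the parameter values, and the labels used as dictionary keys are all assumptions, so the numbers only indicate the kind of comparison you might run on your own data.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(3_000, 10))  # synthetic features (assumption)
noise = rng.normal(scale=0.5, size=3_000)
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 + noise > 0.5).astype(int)  # noisy labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

configs = {
    "unlimited depth, 20 trees": dict(max_depth=None, n_estimators=20),
    "max_depth=6, 20 trees": dict(max_depth=6, n_estimators=20),
    "max_depth=6, 40 trees": dict(max_depth=6, n_estimators=40),
}

results = {}
for label, params in configs.items():
    model = RandomForestClassifier(random_state=0, **params).fit(X_train, y_train)
    # Store (serialized size in bytes, accuracy on the held-out split).
    results[label] = (len(pickle.dumps(model)), model.score(X_test, y_test))

for label, (size_bytes, accuracy) in results.items():
    print(f"{label}: {size_bytes:,} B, test accuracy {accuracy:.3f}")
```

On your real data the accuracy column tells you how much depth you can give up, and the size column tells you what you gain for it.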

Longer term, you may wish to explore alternative representations for Scikit-Learn models, for example converting them to the PMML data format using the SkLearn2PMML package.

CodePudding user response：

The easiest and recommended way of saving sklearn models is pickle: https://scikit-learn.org/stable/modules/model_persistence.html Try this.
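
A minimal round trip with pickle might look like the sketch below. The toy data, the temporary directory, and the file name mymodel.pkl are assumptions for illustration; in a real deployment you would write to a path of your own and load the file once at application start-up rather than per request.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))   # synthetic features (assumption)
y = (X[:, 0] > 0).astype(int)   # synthetic labels (assumption)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "mymodel.pkl")  # hypothetical file name
    # Save the fitted estimator.
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Load it back, as a serving process would do at start-up.
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should predict exactly like the original one.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print("Predictions identical after reload:", same_predictions)
```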

I don't know why joblib would save your model with the dataset (are you sure about that?). Also, could you please provide a short piece of code (model initialization, fitting, your dataset shape and dtype, model memory usage)?