How to create reproducible machine learning models in Python using n_jobs=-1?

I've read at https://towardsdatascience.com/random-seeds-and-reproducibility-933da79446e3 that, to create reproducible machine learning models in Python, you need to set the random seed and pin the package versions.

I would like to be able to save models after training (e.g. using pickle.dump()), load them again, and then get the same results.

At https://docs.python.org/3/library/random.html#notes-on-reproducibility it says:

"Sometimes it is useful to be able to reproduce the sequences given by a pseudo-random number generator. By re-using a seed value, the same sequence should be reproducible from run to run as long as multiple threads are not running."

I'm using a RandomForestClassifier with n_jobs=-1, so I'm wondering whether I need to do more or whether this is already handled internally.

For the random seed now I have:

import os
import random
import numpy as np
os.environ['PYTHONHASHSEED'] = str(42)
random.seed(42)
np.random.seed(42)

And for the classifier I'm setting the random state:

rf = RandomForestClassifier(random_state=42)

User response:

According to the documentation, you must also set the random_state parameter in the RandomForestClassifier:

random_state: int, RandomState instance or None, default=None

Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See Glossary for details.

For example:

from sklearn.ensemble import RandomForestClassifier

SEED = 42  # any fixed integer works; reuse the same value everywhere
clf = RandomForestClassifier(random_state=SEED)

CLARIFICATIONS:

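For an experiment to be fully reproducible, every step that involves randomness must be seeded, including the preparation of the dataset (e.g. the train/test split), not only the model. Calling np.random.seed alone is not a reliable way to fix the random state for sklearn: anything left at random_state=None draws from NumPy's global random state, so any other code that consumes random numbers in between changes the result. Set the random_state parameter of each sklearn function and estimator explicitly to ensure repeatability.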
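As a minimal sketch of seeding every step, the snippet below runs the same small experiment twice and compares the results. The synthetic dataset from make_classification and the helper name run_experiment are only stand-ins for illustration, not part of the question's code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42


def run_experiment(seed=SEED):
    # Synthetic data stands in for a real dataset; seeded so it is identical on every run.
    X, y = make_classification(n_samples=300, n_features=10, random_state=seed)
    # The split is a source of randomness too, so it gets the seed as well.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    clf = RandomForestClassifier(n_estimators=25, random_state=seed)
    clf.fit(X_train, y_train)
    return clf.predict(X_test), clf.score(X_test, y_test)


pred_a, acc_a = run_experiment()
pred_b, acc_b = run_experiment()
# Both comparisons are expected to be True when every step is seeded.
print((pred_a == pred_b).all(), acc_a == acc_b)
```

If you drop random_state from either train_test_split or the classifier, the two runs will generally no longer match.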

Setting random_state is also sufficient when training runs in parallel (n_jobs=-1); no extra seeding is needed for the workers. If possible, use the latest version of sklearn to avoid bugs present in earlier versions.
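You can check both points from the question yourself. This is a rough sketch on synthetic data (the dataset and variable names are assumptions for illustration): it fits the same forest with n_jobs=1 and n_jobs=-1, then round-trips the parallel model through pickle and compares predictions on held-out rows.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

SEED = 42

# Synthetic stand-in data: first 300 rows for training, last 100 for checking.
X, y = make_classification(n_samples=400, n_features=10, random_state=SEED)
X_train, y_train, X_check = X[:300], y[:300], X[300:]

# Same seed, different degrees of parallelism.
rf_serial = RandomForestClassifier(n_estimators=25, random_state=SEED, n_jobs=1)
rf_parallel = RandomForestClassifier(n_estimators=25, random_state=SEED, n_jobs=-1)
rf_serial.fit(X_train, y_train)
rf_parallel.fit(X_train, y_train)
same_across_n_jobs = np.array_equal(
    rf_serial.predict(X_check), rf_parallel.predict(X_check)
)

# Round-trip through pickle in memory; pickle.dump / pickle.load behave the same with a file.
restored = pickle.loads(pickle.dumps(rf_parallel))
same_after_pickle = np.array_equal(
    rf_parallel.predict(X_check), restored.predict(X_check)
)

print(same_across_n_jobs, same_after_pickle)
```

Keep in mind that a pickled model should be loaded with the same scikit-learn version it was trained with, which is one more reason to pin the package versions.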