sklearn RandomForestClassifier.fit() not reproducible despite set random state and same input


While tuning a random forest model with scikit-learn, I noticed that its accuracy score differed between runs, even though I used the same RandomForestClassifier instance and the same input data. I tried Google and the Stack Exchange search, but the only vaguely similar case I could find is this post; there, however, the problem was instantiating the classifier without a proper random state, which is not the case for my problem.

I'm using the following code:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

clf = RandomForestClassifier(n_estimators=65, max_features='sqrt', max_depth=9,
                             random_state=np.random.RandomState(123))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=np.random.RandomState(159))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

X and y are my data and the corresponding labels, but I found that the dataset does not affect the problem. When I run the train_test_split line I get the same split every time, so there is no randomness there. Running predict() with the same fitted model also gives the same result every time, which suggests my problem is different from the post linked above. However, every time I rerun fit(), predict() gives a different prediction, even when I don't touch X_train and y_train. So just running these two lines

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

gives a different result every time. As far as I can tell from the documentation .fit() is not supposed to do anything random. Without reproducible output it is impossible to tune the model, so I'm pretty sure there is an error somewhere. What am I missing? Has anyone encountered this before, or does anyone have any clue as to why this is happening?

CodePudding user response:

Do not use a NumPy RandomState object as random_state if you'll be rerunning fits and expect identical results. Pass a plain integer instead.

From sklearn's Glossary, on passing a numpy RandomState:

Calling the function multiple times will reuse the same instance, and will produce different results.

The RandomState object gets seeded once (with your 123) but then persists across calls to fit, continuing to draw new random numbers from its stream without ever being reset.
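You can see this stream-advancing behavior with a bare RandomState, independent of sklearn (a minimal sketch; the draws themselves are arbitrary):

```python
import numpy as np

rs = np.random.RandomState(123)
first = rs.randint(0, 100, size=5)   # consumes the start of the stream
second = rs.randint(0, 100, size=5)  # continues where the first draw stopped, so it differs
# a freshly seeded RandomState replays the stream from the beginning
replay = np.random.RandomState(123).randint(0, 100, size=5)
```

Here `first` and `second` differ, while `replay` equals `first`. Each call to fit with a shared RandomState is in the position of `second`: it picks up the stream wherever the previous fit left it.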

A quick check:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(random_state=0)  # any small dataset works

clf = RandomForestClassifier(random_state=314)
preds = {}
for i in range(10):
    preds[i] = clf.fit(X, y).predict_proba(X)
all(np.allclose(preds[i], preds[i + 1]) for i in range(9))
# > True

clf = RandomForestClassifier(random_state=np.random.RandomState(314))
preds = {}
for i in range(10):
    preds[i] = clf.fit(X, y).predict_proba(X)
all(np.allclose(preds[i], preds[i + 1]) for i in range(9))
# > False
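Applied to the setup from the question, the fix is to pass plain integers everywhere (a sketch with stand-in data from make_classification, leaving out the other hyperparameters for brevity):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# stand-in dataset; any X, y works
X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=159)  # int instead of RandomState(159)

clf = RandomForestClassifier(n_estimators=65, random_state=123)  # int seed
preds = [clf.fit(X_train, y_train).predict(X_test) for _ in range(3)]
# every refit now produces identical predictions
```

With an integer seed, each call to fit constructs a fresh RandomState internally, so repeated fits on the same data are bit-for-bit reproducible.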