Cannot Reproduce Results in databricks for sklearn Random forest


I'm running some machine learning experiments in Databricks. For the random forest algorithm, the training output changes every time I restart the cluster, even though random_state is set. Does anyone have any clue about this issue?

Note: I tried the same algorithm with the same code in an Anaconda environment on my local machine, and there is no difference in the result even when I restart the machine.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn import metrics

# Train a random forest with a fixed random_state
clf_rf = RandomForestClassifier(n_estimators=10, random_state=123)
clf_rf.fit(X_train, y_train)
y_pred = clf_rf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
f1_score = metrics.f1_score(y_test, y_pred)

print(f"TP:{tp}")
print(f"FP:{fp}")
print(f"TN:{tn}")
print(f"FN:{fn}")
print(f"Accuracy : {accuracy}")
print(f"Precision : {precision}")
print(f"Recall : {recall}")
print(f"F1 Score : {f1_score}")

The output of this code changes every time I restart the cluster.

CodePudding user response:

Try seeding NumPy's global random number generator as well:

from numpy.random import seed

# Pin NumPy's global RNG in addition to the estimator's random_state
seed(1)

clf_rf = RandomForestClassifier(n_estimators=10, random_state=123)
clf_rf.fit(X_train, y_train)
y_pred = clf_rf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
f1_score = metrics.f1_score(y_test, y_pred)

print(f"TP:{tp}")
print(f"FP:{fp}")
print(f"TN:{tn}")
print(f"FN:{fn}")
print(f"Accuracy : {accuracy}")
print(f"Precision : {precision}")
print(f"Recall : {recall}")
print(f"F1 Score : {f1_score}")

CodePudding user response:

Randomness can also enter your workflow when you do the train-test split. If you set random_state in train_test_split, I think you would be fine.
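A minimal sketch of that effect (the make_classification toy data is an illustrative assumption, not from the original post): two splits without random_state will almost certainly differ, while two splits with the same random_state are identical.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=0)

# Without random_state, each call draws from NumPy's global RNG,
# so the resulting split (and any model trained on it) varies.
a, _, _, _ = train_test_split(X, y)
b, _, _, _ = train_test_split(X, y)
print(np.array_equal(a, b))  # almost certainly False

# With random_state fixed, the split is the same on every call.
a, _, _, _ = train_test_split(X, y, random_state=12)
b, _, _, _ = train_test_split(X, y, random_state=12)
print(np.array_equal(a, b))  # True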

An example to show that fixing the randomness in the dataset, the split, and the model produces reproducible results:

from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Fix the randomness of the synthetic data, the split, and the model
X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=12)
clf_rf = RandomForestClassifier(n_estimators=10, random_state=123)
clf_rf.fit(X_train, y_train)
y_pred = clf_rf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
f1_score = metrics.f1_score(y_test, y_pred)

print(f"TP:{tp}")
print(f"FP:{fp}")
print(f"TN:{tn}")
print(f"FN:{fn}")
print(f"Accuracy : {accuracy}")
print(f"Precision : {precision}")
print(f"Recall : {recall}")
print(f"F1 Score : {f1_score}")

Output:

TP:9
FP:1
TN:12
FN:3
Accuracy : 0.84
Precision : 0.9
Recall : 0.75
F1 Score : 0.8181818181818182
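
If the seeds are all fixed and Databricks still gives different numbers after a restart, it is worth checking whether the input data itself is identical across runs. A quick sketch (assuming X_train and y_train are NumPy arrays, as in the code above):

import hashlib
import numpy as np

# If these digests differ across cluster restarts, the training data
# (or its row order) is changing, not the model's internal randomness.
print(hashlib.md5(np.ascontiguousarray(X_train).tobytes()).hexdigest())
print(hashlib.md5(np.ascontiguousarray(y_train).tobytes()).hexdigest())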