When this code executes, the results are not consistent. Where is the randomness coming from?
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
seed = 42
iris = datasets.load_iris()
X = iris.data
y = iris.target
pipeline = Pipeline([('std', StandardScaler()),
                     ('pca', PCA(n_components = 4)),
                     ('Decision_tree', DecisionTreeClassifier())],
                    verbose = False)
kfold = KFold(n_splits = 10, random_state = seed, shuffle = True)
results = cross_val_score(pipeline, X, y, cv = kfold)
print(results.mean())
Output from five consecutive runs:
0.9466666666666667
0.9266666666666665
0.9466666666666667
0.9400000000000001
0.9266666666666665
CodePudding user response:
DecisionTreeClassifier is itself randomized: the features are randomly permuted at each split, and when several candidate splits improve the criterion equally, one of them is chosen at random. You assigned the seed to KFold, but not to DecisionTreeClassifier, so the fitted tree can differ from run to run. PCA also accepts a random state.

See the documentation for DecisionTreeClassifier and PCA.
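For example, a minimal sketch of the fix, reusing the same seed for every step that exposes a random_state; the printed mean then stays identical across runs:

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

seed = 42
X, y = datasets.load_iris(return_X_y = True)

# Pass the seed to every randomized component, not only to KFold
pipeline = Pipeline([('std', StandardScaler()),
                     ('pca', PCA(n_components = 4, random_state = seed)),
                     ('Decision_tree', DecisionTreeClassifier(random_state = seed))],
                    verbose = False)

kfold = KFold(n_splits = 10, random_state = seed, shuffle = True)
results = cross_val_score(pipeline, X, y, cv = kfold)
print(results.mean())  # same value on every run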