I am trying to learn about machine learning, and I am having trouble understanding when and how to use the validation set. I have understood that it is used to evaluate the candidate models, before checking with the test set, but I don't understand how to properly write it in code. Take for example this code I am working on:
# Split the set into train, validation, and test set (70:15:15 for train:valid:test)
X_train, X_rem, y_train, y_rem = train_test_split(X,y, train_size=0.7) # Split the data in training and remaining set
X_valid, X_test, y_valid, y_test = train_test_split(X_rem,y_rem, test_size=0.5) # Split the remaining data 50/50 into validation and test set
print("Properties (shapes):\nTraining set: {}\nValidation set: {}\nTest set: {}".format(X_train.shape, X_valid.shape, X_test.shape))
import warnings # supress warnings
warnings.filterwarnings('ignore')
# SCALING
std = StandardScaler()
minmax = MinMaxScaler()
rob = RobustScaler()
# Transforming the TRAINING set
X_train_Standard = std.fit_transform(X_train) # Standardization: each value has mean = 0 and std = 1
X_train_MinMax = minmax.fit_transform(X_train) # Normalization: each value is between 0 and 1
X_train_Robust = rob.fit_transform(X_train) # Robust scales each values variance and quartiles (ignores outliers)
# Transforming the TEST set
X_test_Standard = std.fit_transform(X_test)
X_test_MinMax = minmax.fit_transform(X_test)
X_test_Robust = rob.fit_transform(X_test)
# Test scalers for decision tree classifier
treeStd = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_Standard, y_train)
treeMinMax = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_MinMax, y_train)
treeRobust = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train_Robust, y_train)
print("Decision tree with standard scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeStd.score(X_train_Standard, y_train), treeStd.score(X_test_Standard, y_test)))
print("Decision tree with min/max scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeMinMax.score(X_train_MinMax, y_train), treeMinMax.score(X_test_MinMax, y_test)))
print("Decision tree with robust scaler:\nTraining set score: {:.4f}\nTest set score: {:.4f}\n".format(treeRobust.score(X_train_Robust, y_train), treeRobust.score(X_test_Robust, y_test)))
# Now we train our model for different values of `max_depth`, ranging from 1 to 20.
max_depths = range(1, 30)
training_error = []
for max_depth in max_depths:
model_1 = DecisionTreeRegressor(max_depth=max_depth)
model_1.fit(X,y)
training_error.append(mean_squared_error(y, model_1.predict(X)))
testing_error = []
for max_depth in max_depths:
model_2 = DecisionTreeRegressor(max_depth=max_depth)
model_2.fit(X, y)
testing_error.append(mean_squared_error(y_test, model_2.predict(X_test)))
plt.plot(max_depths, training_error, color='blue', label='Training error')
plt.plot(max_depths, testing_error, color='green', label='Testing error')
plt.xlabel('Tree depth')
plt.axvline(x=25, color='orange', linestyle='--')
plt.annotate('optimum = 25', xy=(20, 20), color='red')
plt.ylabel('Mean squared error')
plt.title('Hyperparameters tuning', pad=20, size=30)
plt.legend()
Where would I run the tests on the validation set? How do I incorporate it into the code?
CodePudding user response:
First of all make sure to only create one model keep using this one model. Currently you create a model in every training step and overwrite the old one. Otherwise your model will never improve.
Secondly: The Idea behind the validation set is to evaluate the progress of your training, to see how your model performs on data it hasn't seen before. Therefore you need to incorporate it into your training process.
So in your case it would look like that.
model = DecisionTreeRegressor(max_depth=max_depth) # here we create the model we want to use
for max_depth in max_depths:
model.fit(X_train,y_train) # here we train the model
training_error.append(mean_squared_error(y_train, model.predict(X_train))) # here we calculate the training error
val_error.append(mean_squared_error(y_val, model.predict(X_val))) # here we calculate the validation error
test_error = mean_squared_error(y_test, model.predict(X_test)) # here we calculate the test error
Make sure that you only train on your training data, never on your validation or test data.