How to use k-fold cross-validation instead of train_test_split for Regression Neural Network-CodePudding

We have developed an Artificial Neural Network (ANN), where we split our data into training and testing data with train_test_split. As we want a better and more generalized estimate of our performance scores, we would like to split data with k-fold instead.

Now, we split the data into 70% training and 30% testing data with train_test_split

def splitData(dataframe):

    X = dataframe[Predictors].values
    y = dataframe[TargetVariable].values

    ### Sandardization of data ###
    PredictorScaler = StandardScaler()
    TargetVarScaler = StandardScaler()

    # Storing the fit object for later reference
    PredictorScalerFit = PredictorScaler.fit(X)
    TargetVarScalerFit = TargetVarScaler.fit(y)

    # Generating the standardized values of X and y
    X = PredictorScalerFit.transform(X)
    y = TargetVarScalerFit.transform(y)

    # Split the data into training and testing set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    return X_train, X_test, y_train, y_test, PredictorScalerFit, TargetVarScalerFit

How do we split our data with k-fold instead?

And are we right in our assumptions that the performance scores will result in better and more generalized estimates if using k-fold cross-validation instead of train_test_split?

Our main

if __name__ == '__main__':

    print('--------------')

    # ...
    dataframe = pd.read_csv("/.../file.csv")

    # Splitting data into training and tesing data
    X_train, X_test, y_train, y_test, PredictorScalerFit, TargetVarScalerFit = splitData(dataframe=dataframe)

    # Making the Regression Artificial Neural Network (ANN)
    ann = ANN(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test, PredictorScalerFit=PredictorScalerFit, TargetVarScalerFit=TargetVarScalerFit)

    # Evaluation of the performance of the Aritifical Neural Network (ANN)
    performanceEvaluation(y_test_orig=ann['temp'], y_test_pred=ann['Predicted_temp'])

Our ANN Prediction function

def ANN(X_train, y_train, X_test, y_test, TargetVarScalerFit, PredictorScalerFit):

    # Making the regression Artificial Neural Network (ANN) Model
    model = make_regression_ann()

    # Fitting the ANN to the Training set
    model.fit(X_train, y_train, batch_size=5, epochs=50, verbose=1)

    # Generating Predictions on testing data
    Predictions = model.predict(X_test)

    # Scaling the predicted temp data back to original price scale
    Predictions = TargetVarScalerFit.inverse_transform(Predictions)

    # Scaling the y_test temp data back to original temp scale
    y_test_orig = TargetVarScalerFit.inverse_transform(y_test)

    # Scaling the test data back to original scale
    Test_Data = PredictorScalerFit.inverse_transform(X_test)

    TestingData = pd.DataFrame(data=Test_Data, columns=Predictors)
    TestingData['temp'] = y_test_orig
    TestingData['Predicted_temp'] = Predictions
    TestingData.head()

    TestingData.to_csv("TestingData.csv")
    
    return TestingData

Making the regression ann model

def make_regression_ann(initializer='normal', activation='relu', optimizer='adam', loss='mse'):
    # create ANN model
    model = Sequential()

    # Defining the Input layer and FIRST hidden layer, both are same!
    model.add(Dense(units=8, input_dim=7, kernel_initializer=initializer, activation=activation))

    # Defining the Second layer of the model
    # after the first layer we don't have to specify input_dim as keras configure it automatically
    model.add(Dense(units=6, kernel_initializer=initializer, activation=activation))

    # The output neuron is a single fully connected node
    # Since we will be predicting a single number
    model.add(Dense(1, kernel_initializer=initializer))

    # Compiling the model
    model.compile(loss=loss, optimizer=optimizer)

    return model

Our function to generate performance scores

def performanceEvaluation(y_test_orig, y_test_pred):

    # Computing the Mean Absolute Percent Error
    MAPE = mean_absolute_percentage_error(y_test_orig, y_test_pred)

    # Computing R2 Score
    r2 = r2_score(y_test_orig, y_test_pred)

    # Computing Mean Square Error (MSE)
    MSE = mean_squared_error(y_test_orig, y_test_pred)

    # Computing Root Mean Square Error (RMSE)
    RMSE = mean_squared_error(y_test_orig, y_test_pred, squared=False)

    # Computing Mean Absolute Error (MAE)
    MAE = mean_absolute_error(y_test_orig, y_test_pred)

    # Computing Mean Bias Error (MBE)
    MBE = np.mean(y_test_pred - y_test_orig)  # here we calculate MBE

    print('--------------')

    print('The Coefficient of Determination (R2) of ANN model is:', r2)
    print("The Root Mean Squared Error (RMSE) of ANN model is:", RMSE)
    print("The Mean Squared Error (MSE) of ANN model is:", MSE)
    print('The Mean Absolute Percent Error (MAPE) of ANN model is:', MAPE)
    print("The Mean Absolute Error (MAE) of ANN model is:", MAE)
    print("The Mean Bias Error (MBE) of ANN model is:", MBE)

    print('--------------')

    eval_list = [r2, RMSE, MSE, MAPE, MAE, MBE]
    columns = ['Coefficient of Determination (R2)', 'Root Mean Square Error (RMSE)', 'Mean Squared Error (MSE)',
               'Mean Absolute Percent Error (MAPE)', 'Mean Absolute Error (MAE)', 'Mean Bias Error (MBE)']

    dataframe = pd.DataFrame([eval_list], columns=columns)

    return dataframe

Our performance scores

Coefficient of Determination (R2)    Root Mean Square Error (RMSE)    Mean Squared Error (MSE)    Mean Absolute Percent Error (MAPE)    Mean Absolute Error (MAE)    Mean Bias Error (MBE)
0.982052940563799                    0.7293977725380798               0.5320211105835124          0.0894734801108027                    0.5711224962560332           0.049541171482965995

What we tried to reach our goal

Splitting whole dataset into X (Predictors) , y (TargetVariable)

def splitData(dataframe):

    X = dataframe[Predictors].values
    y = dataframe[TargetVariable].values

    ### Sandardization of data ###
    PredictorScaler = StandardScaler()
    TargetVarScaler = StandardScaler()

    # Storing the fit object for later reference
    PredictorScalerFit = PredictorScaler.fit(X)
    TargetVarScalerFit = TargetVarScaler.fit(y)

    # Generating the standardized values of X and y
    X = PredictorScalerFit.transform(X)
    y = TargetVarScalerFit.transform(y)

    return X, y

Utilize ANN

def ANN_test(X,y):

    # Fitting the ANN to the Training set
    model = KerasRegressor(build_fn=make_regression_ann(), epochs=50, batch_size=5)

    cv = KFold(n_splits=10, random_state=1, shuffle=True)

    test = cross_val_score(model, X=X, y=y, cv=cv, scoring="neg_mean_squared_error", n_jobs=1)

    print(test)

    mean = test.mean()

    print(mean)

Error receiving from using this function:

2021-12-24 16:16:47.909705: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
/Users/yusuf/PycharmProjects/MasterThesis-ArtificialNeuralNetwork/main.py:168: DeprecationWarning: KerasRegressor is deprecated, use Sci-Keras (https://github.com/adriangb/scikeras) instead.
  model = KerasRegressor(build_fn=make_regression_ann(), epochs=50, batch_size=5)
2021-12-24 16:16:48.193312: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
[nan nan nan nan nan nan nan nan nan nan]
nan
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/model_selection/_validation.py:372: FitFailedWarning: 
10 fits failed out of a total of 10.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 681, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/keras/wrappers/scikit_learn.py", line 152, in fit
    self.model = self.build_fn(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/keras/engine/base_layer.py", line 3076, in _split_out_first_arg
    raise ValueError(
ValueError: The first argument to `Layer.call` must always be passed.

  warnings.warn(some_fits_failed_message, FitFailedWarning)

We are operating on:
Mac Monterey, Version: 12.0.1
Tensorflow Version: 2.7.0
Keras Version: 2.7.0

Libraries

import os
import time

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from pathlib import Path
from sklearn.metrics import accuracy_score, make_scorer, mean_absolute_percentage_error, mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score
from keras.wrappers.scikit_learn import KerasRegressor

from keras.models import Sequential
from keras.layers import Dense

CodePudding user response：

You need to use KerasRegressor to wrap your keras model as a scikit learn model.

Take a look at example 1 here

from keras.wrappers.scikit_learn import KerasRegressor

model = KerasRegressor(build_fn=make_regression_ann, epochs=512, batch_size=3)
kfold = KFold(n_splits=10, random_state=seed)
scores = cross_val_score(model, x, y, cv=kfold)