Splitting a dataset into training and test datasets given a ratio

Time:09-21

For a school project, I need to split a dataset into training and testing sets given a ratio. The ratio is the fraction of the data to use for training; the rest is used for testing. I wrote a base implementation following my professor's requirements, but I can't get it to pass the tests he created. Below is my implementation, along with what the parameters and return values represent.

def splitData(X, y, split_ratio = 0.8):
    '''
    X: numpy.ndarray. Shape = [n+1, m]
    y: numpy.ndarray. Shape = [m, ]
    split_ratio: the ratio of examples go into the Training, Validation, and Test sets.
    Split the whole dataset into Training, Validation, and Test sets.
    :return: return (training_X, training_y), (test_X, test_y).
            training_X is a (n+1, m_tr) matrix with m_tr training examples;
            training_y is a (m_tr, ) column vector;
            test_X is a (n+1, m_test) matrix with m_test test examples;
            test_y is a (m_test, ) column vector.
    '''
    ## Need to possibly shuffle X array and Y array

    ## amount used for training
    m_tr = len(X) * train_ratio

    ##m_test = len(X) - m_tr Amount that is used for testing

    training_X = X[1:m_tr]
    training_y = y[1:m_tr]
    test_X = [m_tr:len(X)]
    test_y = [m_tr:len(y)]
    return training_X, training_y, test_X, test_y

I left in my comment declaring m_test because of the instructions, but I'm pretty sure that slicing each array from the first element up to m_tr gives the training portion and the rest is the testing data, i.e. the slice from m_tr to len(X) or len(y). Am I misunderstanding how the splitting works?

PS - the professor said we can skip the splitting for validation.

CodePudding user response:

There are 3 main issues:

  1. The docstring specifies that the samples are stored as columns of X, so you need to slice columns, not rows.
  2. You are supposed to return two pairs, not a single tuple of length 4.
  3. You drop the 0th sample because you slice with "1:" instead of "0:".

def splitData(X, y, split_ratio = 0.8):
    '''
    X: numpy.ndarray. Shape = [n+1, m]
    y: numpy.ndarray. Shape = [m, ]
    split_ratio: the ratio of examples go into the Training, Validation, and Test sets.
    Split the whole dataset into Training, Validation, and Test sets.
    :return: return (training_X, training_y), (test_X, test_y).
            training_X is a (n+1, m_tr) matrix with m_tr training examples;
            training_y is a (m_tr, ) column vector;
            test_X is a (n+1, m_test) matrix with m_test test examples;
            test_y is a (m_test, ) column vector.
    '''
    # Samples are the columns of X, so count them with X.shape[1], not len(X).
    m_tr = int(X.shape[1] * split_ratio)
    training_X = X[:, :m_tr]
    training_y = y[:m_tr]
    test_X = X[:, m_tr:]
    test_y = y[m_tr:]
    return (training_X, training_y), (test_X, test_y)
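
A quick way to sanity-check a column-wise split like this is to run it on a toy array and look at the shapes. The sketch below is only an illustration: split_columns and the toy data are made up for the example, and it assumes the samples are the columns of X as the docstring says.

```python
import numpy as np

def split_columns(X, y, split_ratio=0.8):
    # Hypothetical helper: samples are the columns of X.
    m_tr = int(X.shape[1] * split_ratio)
    return (X[:, :m_tr], y[:m_tr]), (X[:, m_tr:], y[m_tr:])

# Toy data: 3 rows (features plus bias) and 10 samples.
X = np.arange(30).reshape(3, 10)
y = np.arange(10)

# len(X) counts rows, not samples, which is why X.shape[1] is used above.
print(len(X), X.shape[1])

(training_X, training_y), (test_X, test_y) = split_columns(X, y)
print(training_X.shape, training_y.shape)
print(test_X.shape, test_y.shape)
```

With 10 samples and a ratio of 0.8, the training pair should hold 8 columns and the test pair the remaining 2.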

CodePudding user response:

  1. The function argument is called split_ratio, but the function body uses train_ratio.
  2. m_tr is the length of the data multiplied by split_ratio, so it can be a floating-point number, but slice indices must be integers.
  3. For test_X and test_y, you didn't put the array name before the slice.
  4. For training_X and training_y, the slice starts at index 1 instead of 0, so you lose the first data element.
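
Point 2 can be reproduced in isolation. This is a small sketch on a made-up 1-D array (data is not from the question); it assumes a reasonably recent NumPy, where a float slice bound raises a TypeError.

```python
import numpy as np

data = np.arange(10)
m_tr = len(data) * 0.8  # a float, even though its value is a whole number

try:
    data[:m_tr]
    slice_error = None
except TypeError as exc:
    # Slice bounds must be integers, so the float is rejected.
    slice_error = exc
    print("float slice bound rejected:", exc)

# Converting to int (which truncates toward zero) makes the slices valid.
m_tr = int(m_tr)
print(data[:m_tr])
print(data[m_tr:])
```

Note that int() truncates, so when the product is not a whole number the extra sample ends up in the test part.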

I corrected your mistakes:

def splitData(X, y, split_ratio = 0.8):
    # Number of training samples; samples are the columns of X.
    m_tr = int(X.shape[1] * split_ratio)
    training_X = X[:, :m_tr]
    training_y = y[:m_tr]
    test_X = X[:, m_tr:]
    test_y = y[m_tr:]
    return (training_X, training_y), (test_X, test_y)
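
The question's code also has a comment about possibly shuffling X and y first. If the tests allow it, one way to do that is to apply the same random permutation to the columns of X and to y before slicing. The function name shuffle_and_split and the seed parameter are made up for this sketch, and whether the professor's tests expect shuffling at all is not known.

```python
import numpy as np

def shuffle_and_split(X, y, split_ratio=0.8, seed=0):
    # Hypothetical variant: shuffle samples (columns of X) before splitting.
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    # Use the same permutation for X and y so each label stays with its sample.
    X, y = X[:, perm], y[perm]
    m_tr = int(m * split_ratio)
    return (X[:, :m_tr], y[:m_tr]), (X[:, m_tr:], y[m_tr:])

# Toy data: a bias row of ones and one feature row; the label is 2 * feature.
X = np.vstack([np.ones(10), np.arange(10)])
y = np.arange(10) * 2

(tr_X, tr_y), (te_X, te_y) = shuffle_and_split(X, y)
print(tr_X.shape, te_X.shape)
```

Because the toy labels are derived from the feature row, it is easy to confirm that samples and labels are still aligned after the shuffle.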

CodePudding user response:

You might want to take a look at: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
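
For reference, here is a rough sketch of how train_test_split could be applied to data laid out as in the question. It assumes scikit-learn is installed; the toy arrays are made up. train_test_split splits along the first axis (samples as rows), so X is transposed before the call and transposed back afterwards. shuffle=False keeps the original order; the default is to shuffle.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data in the question's layout: 3 rows, 10 samples as columns.
X = np.arange(30).reshape(3, 10)
y = np.arange(10)

# Transpose so that each sample is a row, which is what train_test_split expects.
X_tr, X_te, training_y, test_y = train_test_split(
    X.T, y, train_size=0.8, shuffle=False
)

# Transpose back to the (n+1, m) layout used in the assignment.
training_X, test_X = X_tr.T, X_te.T
print(training_X.shape, test_X.shape)
```

Since the assignment asks for a hand-written splitData, this is probably more useful as a cross-check than as a submission.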
