For a school project, I need to split a dataset into training and testing sets according to a given ratio. The ratio is the fraction of the data used for training; the rest is used for testing. I wrote a base implementation following my professor's requirements, but I can't get it to pass the tests he created. Below is my implementation, along with what the parameters and return values represent:
def splitData(X, y, split_ratio = 0.8):
    '''
    X: numpy.ndarray. Shape = [n+1, m]
    y: numpy.ndarray. Shape = [m, ]
    split_ratio: the ratio of examples go into the Training, Validation, and Test sets.
    Split the whole dataset into Training, Validation, and Test sets.
    :return: return (training_X, training_y), (test_X, test_y).
        training_X is a (n+1, m_tr) matrix with m_tr training examples;
        training_y is a (m_tr, ) column vector;
        test_X is a (n+1, m_test) matrix with m_test test examples;
        test_y is a (m_test, ) column vector.
    '''
    ## Need to possible shuffle X array and Y array
    ## amount used for training
    m_tr = len(X) * train_ratio
    ##m_test = len(X) - m_tr Amount that is used for testing
    training_X = X[1:m_tr]
    training_y = y[1:m_tr]
    test_X = [m_tr:len(X)]
    test_y = [m_tr:len(y)]
    return training_X, training_y, test_X, test_y
I kept the commented-out m_test declaration because of the instructions, but I'm fairly sure that slicing each array from the first element up to m_tr gives the training data and that the remainder, from m_tr to len(X) or len(y), is the testing data. Am I misunderstanding how the splitting works?
PS - the professor said we can skip the splitting for validation.
CodePudding user response:
There are 3 main issues:
- The docstring specifies X with shape [n+1, m], so the examples are the columns and you need to slice columns, not rows.
- You are supposed to return 2 pairs, not a tuple of length 4.
- You drop the 0th sample because you slice with "1:" instead of "0:".
def splitData(X, y, split_ratio = 0.8):
    '''
    X: numpy.ndarray. Shape = [n+1, m]
    y: numpy.ndarray. Shape = [m, ]
    split_ratio: the ratio of examples go into the Training, Validation, and Test sets.
    Split the whole dataset into Training, Validation, and Test sets.
    :return: return (training_X, training_y), (test_X, test_y).
        training_X is a (n+1, m_tr) matrix with m_tr training examples;
        training_y is a (m_tr, ) column vector;
        test_X is a (n+1, m_test) matrix with m_test test examples;
        test_y is a (m_test, ) column vector.
    '''
    # Examples are the columns of X, so count columns (X.shape[1]), not rows (len(X)).
    # int() is needed because slice bounds must be integers.
    m_tr = int(X.shape[1] * split_ratio)
    training_X = X[:, :m_tr]
    training_y = y[:m_tr]
    test_X = X[:, m_tr:]
    test_y = y[m_tr:]
    return (training_X, training_y), (test_X, test_y)
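To see why the slice has to go on the second axis, here is a small check on a toy array. The sizes (3 rows by 10 examples) and the values are made up for illustration; only the [n+1, m] layout comes from the docstring.

```python
import numpy as np

# Toy data, assumed for illustration: n+1 = 3 rows, m = 10 columns (examples).
X = np.arange(30).reshape(3, 10)

# len(X) is the size of the first axis, i.e. the number of rows,
# not the number of examples.
n_rows = len(X)
n_examples = X.shape[1]

m_tr = int(n_examples * 0.8)

row_slice = X[:m_tr]       # slices rows: all 3 rows survive, no example is removed
col_slice = X[:, :m_tr]    # slices columns: keeps the first m_tr examples

print(n_rows, n_examples)  # 3 10
print(row_slice.shape)     # (3, 10)
print(col_slice.shape)     # (3, 8)
```

With this layout, a row slice such as X[:m_tr] leaves every example in place, which is why the split must be written as X[:, :m_tr].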
CodePudding user response:
- The function argument is called split_ratio, but inside the function you use train_ratio, which is never defined.
- m_tr is the length of the data multiplied by split_ratio, so it can be a floating-point number, while slice bounds must be integers. Convert it with int().
- For test_X and test_y, you didn't put an array name before the slice, so those two lines are syntax errors.
- For training_X and training_y, you start the slice at 1 instead of 0, so you lose the first data element.
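The float issue is easy to reproduce. A minimal sketch with a made-up 10-element array: the product of a length and a ratio is a float even when it is a whole number, and NumPy rejects it as a slice bound until it is converted with int().

```python
import numpy as np

y = np.arange(10)              # toy labels, assumed for illustration
split_ratio = 0.8

m_tr = len(y) * split_ratio    # 8.0, a float

try:
    y[:m_tr]
    float_slice_ok = True
except TypeError:
    float_slice_ok = False     # float slice bounds are rejected

m_tr = int(m_tr)               # 8, now usable as a slice bound
training_y = y[:m_tr]          # starts at index 0, so no element is lost
test_y = y[m_tr:]

print(float_slice_ok, training_y, test_y)
```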
I corrected your mistakes:
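def splitData(X, y, split_ratio = 0.8):
    # Number of training examples; examples are the columns of X.
    # int() makes the result usable as a slice bound.
    m_tr = int(X.shape[1] * split_ratio)
    training_X = X[:, :m_tr]
    training_y = y[:m_tr]
    test_X = X[:, m_tr:]
    test_y = y[m_tr:]
    return (training_X, training_y), (test_X, test_y)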
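The question's code comment also asks about shuffling. One possible sketch, not part of the original answer, draws a single random permutation of the column indices and applies it to both X and y so that each example stays paired with its label. The name splitDataShuffled and the seed parameter are hypothetical, and whether the professor's tests expect shuffling is not known.

```python
import numpy as np

def splitDataShuffled(X, y, split_ratio=0.8, seed=None):
    # Hypothetical variant: shuffle the examples (columns of X) before splitting.
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)      # one permutation shared by X and y
    X, y = X[:, perm], y[perm]
    m_tr = int(m * split_ratio)
    return (X[:, :m_tr], y[:m_tr]), (X[:, m_tr:], y[m_tr:])

# Toy data where row 0 of X equals y, so the pairing can be checked.
y = np.arange(10)
X = np.vstack([y, y * 2, y * 3])   # shape (3, 10)

(training_X, training_y), (test_X, test_y) = splitDataShuffled(X, y, seed=0)
print(training_X.shape, test_X.shape)
```

Indexing with the same perm on both arrays is what keeps column j of X aligned with y[j]; shuffling the two arrays independently would break that pairing.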
CodePudding user response:
You might want to take a look at: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
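A sketch of how the linked function could be applied here, assuming scikit-learn is installed. train_test_split splits along the first axis, so with examples stored as columns X has to be transposed on the way in and on the way out. It also shuffles by default and returns the pieces in the order X_train, X_test, y_train, y_test rather than as two pairs. The toy data and the random_state value are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 3 rows, 10 examples as columns; row 0 equals y so pairing can be checked.
y = np.arange(10)
X = np.vstack([y, y * 2, y * 3])   # shape (3, 10)

# Transpose so that examples lie along the first axis, as train_test_split expects.
X_tr, X_te, training_y, test_y = train_test_split(
    X.T, y, train_size=0.8, random_state=0
)

# Transpose back to the (n+1, m) layout and regroup into two pairs.
training_X, test_X = X_tr.T, X_te.T
result = (training_X, training_y), (test_X, test_y)
print(training_X.shape, test_X.shape)
```

If the grading tests expect the original order to be preserved, passing shuffle=False turns the shuffling off.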