Why does sampling the DataFrame of my entire dataset give better results in a prediction model than sampling just the training set?


Let's say I have a dataframe, called original_df, of 20,000 rows. I split the first 18,000 rows to be used as my training set and the last 2,000 rows to be used as my testing set. When I use the sample function on the original_df before splitting and run a classifier model on the training set, it produces reasonable prediction results: some false positives, some false negatives, some true positives, and some true negatives.

However, when I use the sample function on the training set and the testing set after splitting the non-shuffled original_df, the classifier is never able to make a positive prediction: I get only true negatives and false negatives, and zero false positives and true positives.

I'm just trying to understand why this happens despite using the same sampling technique. Below are some example snippets.

# This example samples the original dataset directly

import math

training_len = math.ceil(len(X) * 0.9)
X = X.sample(frac=1, random_state=2)  # Features, shuffled
Y = Y.sample(frac=1, random_state=2)  # Labels, shuffled with the same seed
X_train = X.loc[:training_len]
Y_train = Y.loc[:training_len]
X_test = X.loc[training_len + 1:]
Y_test = Y.loc[training_len + 1:]

# fp, fn, tp, tn
# 1314, 1703, 455, 8842
# This example samples the training set directly

training_len = math.ceil(len(X) * 0.9)
# X (features) and Y (labels) are left unshuffled here
X_train = X.loc[:training_len].sample(frac=1, random_state=2)
Y_train = Y.loc[:training_len].sample(frac=1, random_state=2)
X_test = X.loc[training_len + 1:]
Y_test = Y.loc[training_len + 1:]

# fp, fn, tp, tn
# 0, 425, 0, 2518

I'm using GaussianNB() from sklearn.naive_bayes.
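For context, the fit/predict step presumably looks something like this (a minimal sketch; only GaussianNB itself is stated above, and the confusion_matrix bookkeeping is my assumption):

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

model = GaussianNB()
model.fit(X_train, Y_train)
predictions = model.predict(X_test)

# For binary labels, confusion_matrix returns [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(Y_test, predictions).ravel()
print(fp, fn, tp, tn)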

I tried checking whether there was any index mismatch between the training and testing sets, but there wasn't.

I also tried not sampling anything from either the training set or the original set, and it produced the same prediction results as sampling just the training set. This made me think that X_train and Y_train were not being shuffled at all, but I printed the contents of the training sets after sampling and they were indeed shuffled (with matching indices for X_train and Y_train).

CodePudding user response:

Is your initial dataset sorted by label? If so, in the second case your training set might contain only one label (negative), and your classifier learns to just always predict that label.
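A quick way to check this (a sketch using standard pandas; Y_train and Y_test as in the snippets above):

# Inspect the class balance of each split; if the original data is
# sorted by label, one split will contain a single class (or be
# heavily skewed toward it).
print(Y_train.value_counts(normalize=True))
print(Y_test.value_counts(normalize=True))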

CodePudding user response:

It's because I had to reset_index() the training set after sampling. X.loc[:training_len] slices by label, not by position: it returns everything up to and including the first row whose index is training_len. If that row is shuffled to the front of the dataframe, the slice contains just that one row, which would train my model on a single row; thus producing such bad results.
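A self-contained sketch of the pitfall and the reset_index() fix (toy data of my own, not the asker's):

import pandas as pd

df = pd.DataFrame({"a": range(5)})
shuffled = df.sample(frac=1, random_state=2)

# .loc slices by label, not position: the slice stops at the FIRST row
# whose index label is 4, so its length depends on where that row
# landed after shuffling.
print(shuffled.index.tolist())
print(len(shuffled.loc[:4]))

# Resetting the index restores clean 0..n-1 labels, so .loc[:4] once
# again returns all 5 rows.
fixed = shuffled.reset_index(drop=True)
print(len(fixed.loc[:4]))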
