Low accuracy in K-NN model due to OneHotEncoding


I tried building a K-Nearest Neighbor model for a dataset in which the dependent variable can take 3 different categorical values.

I built 2 different models: one where I one-hot encoded the dependent variable and one where I didn't use any encoding.

x_3class = class3.iloc[:,:-1].values
y_3class = class3.iloc[:,-1:].values 

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(categories="auto")
y_3class_ohencoded = ohe.fit_transform(y_3class).toarray() 

from sklearn.model_selection import train_test_split
#non-encoded split
x3c_train,x3c_test,y3c_train,y3c_test = train_test_split(x_3class,y_3class,test_size=0.2,random_state=1)

#onehotencoded split
x_train3,x_test3,y_train3,y_test3 = train_test_split(x_3class,y_3class_ohencoded,test_size=0.2,random_state=1)

#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_3class = StandardScaler()
x3c_train = sc_3class.fit_transform(x3c_train)
x3c_test = sc_3class.transform(x3c_test)
sc_3class_ohe = StandardScaler()
x_train3 = sc_3class_ohe.fit_transform(x_train3)
x_test3 = sc_3class_ohe.transform(x_test3)

#Model Building 
from sklearn.neighbors import KNeighborsClassifier 
knn_classifier_3class = KNeighborsClassifier(n_neighbors=18)
knn_classifier_ohe = KNeighborsClassifier(n_neighbors=18)

knn_classifier_3class.fit(x3c_train,y3c_train)
knn_classifier_ohe.fit(x_train3,y_train3)

#Accuracy Evaluation
from sklearn.model_selection import cross_val_score
nonencoded_accuracy = cross_val_score(knn_classifier_3class,x3c_test,y3c_test,cv=10)
onehotencoded_accuracy = cross_val_score(knn_classifier_ohe,x_test3,y_test3,cv=10)

print("NonEncoded Model Accuracy: %0.2f" %(nonencoded_accuracy.mean()),"\n",
"OHEncoded Model Accuracy: %0.2f"%(onehotencoded_accuracy.mean()))

The accuracy score of the non-encoded model was 13 percentage points higher than that of the one-hot encoded model.

NonEncoded Model Accuracy: 0.63 
 OHEncoded Model Accuracy: 0.50

What would be the reason for such a big difference?

CodePudding user response:

When you one-hot encode the target, sklearn sees multiple columns and assumes you have a multilabel problem; that is, that each row can have more than one (or even no) label.

In kNN, this likely results in some points receiving no label at all. With k=18 as in your case, consider a point whose 18 nearest neighbors split 8, 6, 4 across classes 0, 1, 2. Without encoding, it gets label 0. With encoding, you effectively have a separate kNN vote per column, in a one-vs-rest fashion. (The first column is 1 for class 0 and 0 for class 1 or 2, etc.) So the first column sees 8 positive versus 10 negative neighbors and predicts "not class 0"; similarly the other two columns predict 0, so the output row is all zeros, i.e. no class is predicted. If you use cross_val_predict instead, I expect you'll see this (see the sketch below).
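A minimal sketch of that check, reusing the x_test3 and y_test3 names from the question (I haven't run this against your data): cross_val_predict returns the multilabel indicator predictions, so any row that sums to zero received no label at all.

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Predictions come back as a multilabel indicator array, one column per class.
pred = cross_val_predict(KNeighborsClassifier(n_neighbors=18), x_test3, y_test3, cv=10)

# Rows that sum to zero were assigned no class at all.
no_label = (pred.sum(axis=1) == 0).sum()
print(f"{no_label} of {len(pred)} points were assigned no class")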

The default scoring for multilabel problems (subset accuracy, which requires the entire row of labels to match) is also pretty harsh, but in this case it hardly matters: your model will only ever predict zero or exactly one class (except perhaps for ties).
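For reference, a small made-up example (not your data) of how subset accuracy treats an empty prediction:

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 0, 0]])  # second point got no label at all

# With 2-D indicator arrays, accuracy_score is subset accuracy:
# the whole row must match, so the empty prediction simply counts as wrong.
print(accuracy_score(y_true, y_pred))  # 0.5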

CodePudding user response:

I notice that you are one-hot encoding the labels, which is quite unusual: one would normally one-hot encode a categorical feature that has no inherent ordering, not the target (sketched below).
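A rough sketch of the usual division of labour; the DataFrame and column names here are made up for illustration:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({"colour": ["red", "green", "blue", "red"],
                   "label":  ["A", "B", "C", "A"]})

# Features with no inherent ordering get one-hot encoded...
X = OneHotEncoder().fit_transform(df[["colour"]])

# ...while the target stays a single 1-D array of class labels.
y = LabelEncoder().fit_transform(df["label"])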

I also notice that you aren't setting a power parameter or passing a metric to your KNeighborsClassifier. By default the power parameter is 2 and the metric is the Minkowski metric; a Minkowski metric with a power parameter of 2 is the Euclidean distance. Now, if you are dealing with one-hot vectors, a Euclidean metric doesn't really make sense, since the distance between any two distinct one-hot vectors is identical (√2).
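A quick numeric check of that point (plain NumPy, nothing from your data): under the Euclidean (Minkowski, p = 2) metric, every pair of distinct one-hot vectors is exactly the same distance apart, so "nearest" carries no information.

import numpy as np

a = np.array([1, 0, 0])
b = np.array([0, 1, 0])
c = np.array([0, 0, 1])

# Every pair of distinct one-hot vectors is sqrt(2) apart.
print(np.linalg.norm(a - b), np.linalg.norm(a - c), np.linalg.norm(b - c))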

If all distances are the same, your kNN model is basically playing a guessing game (for example, it might always return the same option from the possible labels). Given that suspiciously round 50% accuracy, I have a strong suspicion that your dataset has 2 classes and is perfectly balanced; in that case a kNN that always picks the same option will indeed achieve 50% accuracy.
