I want to use the class sklearn.impute.KNNImputer to impute missing values in my dataset.
I have two questions about this:
1. I have seen multiple implementations on Medium, as well as the example on the official scikit-learn website. None of them normalize the data. Shouldn't one normalize the data before using KNN? Or does KNNImputer normalize the data behind the scenes?
2. KNNImputer only accepts numerical input. So for categorical data, should I one-hot encode the columns and then run the imputer?
Thank you
Answer:
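No, KNNImputer does not normalise the data implicitly. As the source shows, it finds the nearest neighbours using a NaN-aware Euclidean distance and fills each missing value with the (optionally distance-weighted) mean of that feature across those neighbours. Because that distance is sensitive to feature scale, you should scale the features yourself before imputing.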
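A minimal sketch of scaling before imputing, assuming purely numeric data. The toy array and its column meanings (age, income, score) are made up for illustration. StandardScaler ignores NaNs when fitting and leaves them in place when transforming, so it can run before the imputer; the result is mapped back to the original units afterwards.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: age (years), income (dollars), score (sometimes missing).
X = np.array([
    [25.0, 50_000.0, 1.0],
    [26.0, 90_000.0, np.nan],
    [60.0, 52_000.0, 9.0],
    [27.0, 88_000.0, 2.0],
    [58.0, 51_000.0, np.nan],
    [61.0, 91_000.0, 8.0],
])

# Scale first so that income does not dominate the neighbour distances.
# NaNs are disregarded in fit and kept as NaN in transform.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Impute in the scaled space.
imputer = KNNImputer(n_neighbors=2)
X_imputed_scaled = imputer.fit_transform(X_scaled)

# Map everything back to the original units.
X_imputed = scaler.inverse_transform(X_imputed_scaled)
print(X_imputed)
```

Whether StandardScaler, MinMaxScaler or another scaler is the better choice depends on the data; the point is only that the scaling step has to be done explicitly.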
Correct, you need to one-hot encode them first. Afterwards you need to take the argmax over each group of one-hot columns, because the imputer averages the neighbours' values and so produces rows that are no longer one-hot (e.g. [0.2, 0.1, 0.4]).
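A rough sketch of the one-hot, impute, argmax round trip, assuming a single categorical column. The DataFrame, the column names and the use of pandas.get_dummies are illustrative choices, not part of KNNImputer. One detail to watch: get_dummies encodes a missing category as an all-zero row, so NaN has to be put back into those rows, otherwise the imputer sees nothing to fill.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data with a missing category in the last two rows.
df = pd.DataFrame({
    "height": [1.60, 1.62, 1.85, 1.88, 1.61, 1.86],
    "colour": ["red", "red", "blue", "blue", np.nan, np.nan],
})

# One-hot encode, then restore NaN where the category was missing.
dummies = pd.get_dummies(df["colour"], dtype=float)
dummies.loc[df["colour"].isna(), :] = np.nan

# Impute on the numeric feature plus the one-hot columns.
X = pd.concat([df[["height"]], dummies], axis=1)
imputed = KNNImputer(n_neighbors=2).fit_transform(X)

# The imputed one-hot columns hold averages, so pick the largest per row.
onehot_imputed = imputed[:, 1:]
categories = dummies.columns.to_numpy()
df["colour_imputed"] = categories[onehot_imputed.argmax(axis=1)]
print(df)
```

Note that argmax returns the first column on ties, so an odd n_neighbors or distance weighting may be preferable when ties are likely. With several categorical columns, the argmax has to be taken separately over each column's group of one-hot columns.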