I'm having an issue with the scikit-learn library. I want to practice ML by doing some projects on kaggle.com, but a problem I'm running into is that scikit-learn only accepts integers (or floats, I think) as inputs to its algorithms. One solution would be to map every non-integer value to an integer while keeping track of which category each one belongs to, but this seems tedious. Are there other solutions to this?
CodePudding user response:
If I understand correctly, you could use one-hot encoding (scikit-learn has a very useful preprocessor, OneHotEncoder, for exactly this purpose). There is a wonderful tutorial on how to do this by Jason Brownlee. In general, if you want to use a categorical variable (e.g. 'red', 'amber', 'green') in a machine learning model, it needs to be converted into a numerical representation, as you have already noted. One-hot encoding represents each possible categorical value as its own binary (1 or 0) variable: 'Is it red?' if yes, the variable light_is_red = 1, else it is 0; 'Is it green?' if yes, the variable light_is_green = 1, else it is 0. Very importantly, when you decompose a categorical variable this way (say 'light', with three potential values: 'red', 'amber' and 'green'), exactly one of the resulting variables takes the value 1 and all the others must be 0, because the original variable cannot assume two values at the same time; the categories are mutually exclusive.
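As a rough sketch of how this looks in practice (the 'light' column and its values are invented for illustration), scikit-learn's OneHotEncoder handles the mapping for you:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# A toy dataframe with one categorical column (made up for this example).
df = pd.DataFrame({"light": ["red", "amber", "green", "red"]})

# OneHotEncoder returns a sparse matrix by default; .toarray() makes it dense.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["light"]]).toarray()

# Categories are sorted alphabetically: amber, green, red.
print(encoder.categories_)
print(encoded)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```

Note that every row contains exactly one 1 and zeros everywhere else, which is the mutual-exclusivity property described above.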
CodePudding user response:
Luuk, you first need to read about encoding categorical variables. Look at this post from Analytics Vidhya: they go through each of the most popular methods and their pros and cons.
I sincerely urge you to read about it before acting. Additionally, I would recommend computing the cardinality of the categorical variable you would like to encode, that is, the number of unique categories that exist in that column of your dataframe. This can tell you a lot about which encoding technique to pick in the end.
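A quick sketch of the cardinality check described above, using pandas (the dataframe and column names are invented for illustration):

```python
import pandas as pd

# Toy dataframe mixing categorical and numeric columns (made up for this example).
df = pd.DataFrame({
    "color": ["red", "green", "blue", "red"],
    "city": ["London", "Paris", "Tokyo", "Berlin"],
    "price": [10.0, 12.5, 9.9, 11.0],
})

# Select only the non-numeric columns, then count unique values per column.
categorical_cols = df.select_dtypes(include=["object", "category"]).columns
cardinality = df[categorical_cols].nunique()
print(cardinality)
# color    3
# city     4
```

As a rough rule of thumb, low-cardinality columns are good candidates for one-hot encoding, while high-cardinality columns (many unique values) may call for alternatives such as target or hash encoding, since one-hot would explode the number of features.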