Conversion of categorical attributes from text to numbers


Currently, I am trying to convert categorical text data to numbers using the encoders provided by the scikit-learn library. I have tested the OrdinalEncoder and OneHotEncoder encoders. This is what I understand:

When a categorical attribute (e.g. species_cat) has a large number of possible categories (e.g. species), one-hot encoding (OneHotEncoder) results in a large number of input features. This may slow down training and degrade performance.

from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
# species_cat should be a 2D array/DataFrame of shape (n_samples, 1)
species_cat_1hot = cat_encoder.fit_transform(species_cat)
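As a minimal, runnable sketch (the species values below are made up; in the question, species_cat would be the real column):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical species column; scikit-learn encoders expect a 2D array.
species_cat = np.array([["cat"], ["dog"], ["horse"], ["cat"]])

cat_encoder = OneHotEncoder()
species_cat_1hot = cat_encoder.fit_transform(species_cat)

# The result is a SciPy sparse matrix with one column per distinct category,
# so the number of input features grows with the number of categories.
print(species_cat_1hot.shape)   # (4, 3)
```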

Similarly, we can't use an ordinal encoder (OrdinalEncoder) to encode such categorical attributes, because learning algorithms will assume that two nearby values are more similar than two distant values.

from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
# species_cat should be a 2D array/DataFrame of shape (n_samples, 1)
species_cat_encoded = ordinal_encoder.fit_transform(species_cat)
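A quick sketch of why this is a problem (hypothetical data again): OrdinalEncoder assigns integers in alphabetical order by default, so the resulting numbers imply an ordering that species don't actually have:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

species_cat = np.array([["cat"], ["dog"], ["horse"]])

ordinal_encoder = OrdinalEncoder()
species_cat_encoded = ordinal_encoder.fit_transform(species_cat)

# cat -> 0, dog -> 1, horse -> 2 (alphabetical), which wrongly suggests
# that "cat" is closer to "dog" than to "horse".
print(species_cat_encoded.ravel())   # [0. 1. 2.]
```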

Thus, my question is: how can I convert categorical attributes with large numbers of text categories to numbers using the scikit-learn library, without degrading the algorithm's performance? Thank you!

CodePudding user response:

OneHotEncoder has (since version 1.1) two options: min_frequency and max_categories. You can use these to group infrequent categories together into a miscellaneous bucket.

If v1.1 is not available, you can do the same thing by hand. I'd first count the different categories, e.g. with value_counts(), and then group anything that occurs infrequently enough.
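A rough sketch of the manual version with pandas (the threshold and column values are made up):

```python
import pandas as pd

# Hypothetical species column with a long tail of rare values.
s = pd.Series(["cat"] * 5 + ["dog"] * 4 + ["horse", "ferret"])

counts = s.value_counts()
rare = counts[counts < 3].index              # categories seen fewer than 3 times
s_grouped = s.where(~s.isin(rare), "other")  # lump them into "other"

print(s_grouped.value_counts())
```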

You are right that OrdinalEncoder shouldn't be used with categories that can't be ordered in a meaningful way, e.g. "cold", "cool", "warm" can be thought of as an ordinal variable, whereas "cat", "dog", "horse" can't.
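When the variable really is ordinal, you can pass the order explicitly so the encoded integers match it (a small sketch):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

temps = np.array([["warm"], ["cold"], ["cool"]])

# Explicit category order: cold < cool < warm.
enc = OrdinalEncoder(categories=[["cold", "cool", "warm"]])
print(enc.fit_transform(temps).ravel())   # [2. 0. 1.]
```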

CodePudding user response:

Take a look at the category_encoders package (https://contrib.scikit-learn.org/category_encoders/) and this nice encoding-method selection flowchart.

CodePudding user response:

I think you are correct in your understanding of the two encoding methods.

What I want to say is that a 'large number of input features' usually won't be a problem for machine-learning models, as most PCs have enough computing power to handle these sparse matrices easily.

If you really want to deal with this problem, my suggestions are:

  • Follow what njp said: limit the number of attributes by setting min_frequency. The frequency threshold can be determined by tf-idf (for text data) or by your PC's RAM.
  • First compute the sparse matrix, then use PCA, t-SNE, or another dimensionality-reduction algorithm to reduce the feature size.
  • (Case-specific) If you need to deal with special cat names, maybe you can convert their names using binomial nomenclature (e.g. cat -> Felis domestica). Species names can then be handled with the ordinal encoder, since the name order is now meaningful.
  • (Other than sklearn) Maybe you can try word-embedding methods?
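For the dimensionality-reduction suggestion, note that TruncatedSVD works directly on the sparse matrix that OneHotEncoder produces, whereas scikit-learn's PCA typically requires a dense array. A sketch with made-up data:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Hypothetical high-cardinality column: 500 samples, 50 distinct categories.
species_cat = rng.integers(0, 50, size=500).astype(str).reshape(-1, 1)

X = OneHotEncoder().fit_transform(species_cat)   # sparse, ~50 columns

# Reduce the one-hot features to 10 dense components.
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)   # (500, 10)
```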

Sorry for my bad English :( but I think you can get the idea.
