A huge number of discrete features


I'm developing a regression model, but I ran into a problem while preparing the data. 17 of my 20 features are categorical, and each of them has many categories. With one-hot encoding, my data table is transformed into a 10000x6000 table. I tried PCA to reduce the dimensionality, but even capturing 70% of the variance takes 2500 components, which is why I'm asking here. Unfortunately, I can't attach the dataset, as it is confidential. How should I prepare this kind of data to get the best results during training?

CodePudding user response:

It really depends on exactly what you are trying to do. Computing a covariance matrix (or a PCA decomposition) will give you great insight into which categories tend to occur together (this does require one-hot encoded data), but training a model directly off of that might be problematic.
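To make that concrete, here is a minimal sketch of the one-hot-plus-PCA inspection idea. Since the real dataset is confidential, the DataFrame below is a synthetic stand-in; every column name and cardinality in it is invented:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import PCA

# Synthetic stand-in for the confidential table: 17 categorical columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {f"cat{i}": rng.integers(0, 50, size=1000).astype(str) for i in range(17)}
)

enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(df)  # sparse one-hot matrix, ~17 * 50 columns

pca = PCA(n_components=10)
pca.fit(X.toarray())  # PCA wants a dense array

# Category levels with large weights in the same component tend to
# co-occur across rows.
loadings = pd.DataFrame(pca.components_, columns=enc.get_feature_names_out())
print(loadings.iloc[0].abs().sort_values(ascending=False).head(10))
```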

In general, it really depends on the model you want to use.

One option would be a random forest. They can definitely be used for regression, though they need to be trained specifically for that. SKLearn has a class just for this:

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

The benefit of a random forest is that it works well with tabular data (as is the case here), and it can easily be trained using plain numerical codes for the categorical features, meaning your data vector can stay at dimension 20! Decision-tree models (such as random forests) have been shown to outperform deep learning on tabular data in many cases, and this may be one of them.

TLDR; If you use a random forest, it can learn even with numerical codes for the categories, and you can avoid creating incredibly large data vectors; see the sketch below.
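A minimal sketch of that route, assuming ordinary integer codes for the categories via scikit-learn's OrdinalEncoder. The synthetic data and target again stand in for the confidential table, so the names, cardinalities, and score here are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 17 high-cardinality categorical columns + 3 numeric.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame(
    {f"cat{i}": rng.integers(0, 300, size=n).astype(str) for i in range(17)}
)
for i in range(3):
    df[f"num{i}"] = rng.normal(size=n)
y = rng.normal(size=n)  # placeholder regression target

cat_cols = [c for c in df.columns if c.startswith("cat")]
# Each categorical column becomes a single integer column, so a row
# stays 20-dimensional instead of ~6000.
df[cat_cols] = OrdinalEncoder().fit_transform(df[cat_cols])

X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("held-out R^2:", model.score(X_test, y_test))
```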
