Building machine learning with a dataset which has only string values

Time:10-20

I am working with a dataset consisting of 190 columns and more than 3 million rows.

But unfortunately it has all the data as string values.

Is there any way of building a model with this kind of data, other than tokenizing?

Thank you and regards!

CodePudding user response:

This may not answer your question fully, but I hope it will shed some light on what you can and cannot do.

Firstly, I don't think there is a straightforward answer, because the right approach depends on what the string values actually represent. If a column is simply Yes or No (binary), it can easily be dealt with by mapping Yes to 1 and No to 0, but that does not work for something like country, which has many values and no natural order.
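As a rough illustration of the Yes/No case, here is a minimal sketch in plain Python. The sample values and the names `YES_NO` and `encode_binary` are made up for the example, since the question does not show the actual columns.

```python
# Hypothetical sample column; the real data is not shown in the question.
answers = ["Yes", "No", "No", "Yes"]

# Explicit mapping: an unexpected value raises KeyError instead of
# being silently turned into a number.
YES_NO = {"Yes": 1, "No": 0}

def encode_binary(values, mapping=YES_NO):
    """Return the values as 0/1 integers according to `mapping`."""
    return [mapping[value] for value in values]

encoded = encode_binary(answers)
```

Failing loudly on unknown values is a deliberate choice here: with millions of rows, stray entries such as "yes" or "N/A" are likely, and it is safer to find them early than to encode them by accident.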

  1. One-Hot Encoding - For features like country, it would be fairly simple to one-hot encode them and train the model on the resulting numerical data. But each column expands into one new column per unique value, so with 190 columns and this many rows the dimensionality could increase by a lot.

  2. Assigning numeric values - We also cannot simply assign arbitrary numbers to the strings and expect the model to work, as there is a very high chance that the model will pick up on a numeric order that does not exist in the original data.

  3. Bag of words, Word2Vec - Since you excluded tokenization, I don't know if you want to do this, but these options also exist.

  4. Breiman's randomForest in R - "This implementation would allow you to use actual strings as inputs." I am not familiar with R, so I cannot confirm how far this is true.

  5. One-Hot Vector Assembler - I only came up with this in theory. If you manage to convert your string data into numeric values (using one-hot or other encoders) while preserving the underlying representation of the data, the numerical features can be combined into a single vector column using VectorAssembler in PySpark.
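To make point 1 more concrete, here is a small dependency-free sketch of one-hot encoding, together with a check of how wide the data would become. The rows, column names, and the helpers `one_hot_encode` and `one_hot_width` are invented for illustration; on the real dataset a library routine such as `pandas.get_dummies` would be the more practical choice.

```python
def one_hot_encode(rows, column):
    """Replace `column` in each row with one 0/1 indicator per distinct value."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}={category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded, categories

def one_hot_width(rows):
    """Number of columns produced if every column were one-hot encoded."""
    columns = rows[0].keys()
    return sum(len({row[col] for row in rows}) for col in columns)

# Hypothetical sample rows, all values stored as strings.
rows = [
    {"country": "France", "subscribed": "Yes"},
    {"country": "Japan", "subscribed": "No"},
    {"country": "France", "subscribed": "No"},
]

encoded_rows, categories = one_hot_encode(rows, "country")
total_width = one_hot_width(rows)
```

Running something like `one_hot_width` on a sample of the data first gives a quick estimate of the dimensionality problem mentioned in point 1: columns with only a handful of unique values are cheap to one-hot encode, while columns with thousands of unique values probably need a different treatment.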
