Should I change my object variables to integers or create dummy variables?

Time:10-02

I am trying to create a model to predict whether or not someone is at risk of a stroke. My data contains some "object" variables that could easily be coded to 0 and 1 (like sex). However, I have some object variables with 4 categories (e.g. type of job).

I'm trying to encode these objects into integers so that my models can ingest them. I've come across two methods to do so:

  1. Create dummy variables for each feature, which creates more columns and encodes them as 0 and 1
  2. Convert the object into an integer using LabelEncoder, which assigns values to each category like 0, 1, 2, 3, and so on within the same column.
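For reference, the two methods above can be sketched roughly as follows. This is only an illustration: the column name `work_type` and its category values are made up, and it assumes pandas and scikit-learn are installed.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: "work_type" and its values are illustrative only.
df = pd.DataFrame({"work_type": ["Private", "Self-employed", "Govt", "Private"]})

# Method 1: dummy variables, one 0/1 column per category.
dummies = pd.get_dummies(df["work_type"], prefix="work_type", dtype=int)

# Method 2: LabelEncoder, a single integer column.
# Codes follow the sorted order of the category names.
encoder = LabelEncoder()
label_encoded = encoder.fit_transform(df["work_type"])

print(dummies)
print(list(encoder.classes_), list(label_encoded))
```

Method 1 turns one column into three here, while method 2 keeps a single column of integer codes.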

Is there a difference between these two methods? If so, what is the recommended best path forward?

CodePudding user response:

Yes, the two are different. The first method creates more columns, which means more features for the model to fit. The second method creates only one feature for the model to fit. In machine learning, each approach has its own pros and cons.

Which one to recommend depends on the ML algorithm you use, feature importance, and so on.

CodePudding user response:

Go the dummy variable route.

Say you have a column that consists of 5 job types: construction worker, data scientist, retail associate, machine learning engineer, and bartender. If you use a label encoder (0-4) to keep your data narrow, your model is going to interpret the job title of "data scientist" as 1 greater than the job title of "construction worker". It would also interpret the job title of "bartender" as 4 greater than "construction worker".

The problem here is that these job types really have no relation to each other as they are purely categorical variables. If you dummy out the column, it does widen your data but you have a far more accurate representation of what the data actually represents.
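A small sketch of that point, assuming NumPy is available. The integer codes are assigned in the order the jobs are listed above, which is an assumption made for illustration (scikit-learn's LabelEncoder would assign them in alphabetical order instead).

```python
import numpy as np

jobs = [
    "construction worker",
    "data scientist",
    "retail associate",
    "machine learning engineer",
    "bartender",
]

# Integer codes 0-4 in listing order (an assumption for illustration).
label_codes = {job: code for code, job in enumerate(jobs)}

# With integer codes, a model that treats the column as numeric sees a gap of 4.
label_gap = label_codes["bartender"] - label_codes["construction worker"]

# With dummy (one-hot) columns, each job is a row of the identity matrix,
# so every pair of distinct jobs is the same distance apart.
one_hot = np.eye(len(jobs))
distances = [
    float(np.linalg.norm(one_hot[i] - one_hot[j]))
    for i in range(len(jobs))
    for j in range(i + 1, len(jobs))
]

print(label_gap)
print(min(distances), max(distances))
```

The integer codes impose an order and spacing that the job titles do not have, whereas the dummy columns treat every pair of jobs as equally different.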

CodePudding user response:

Use dummy variables, which create more columns/features for fitting your data. Assuming your data is scaled beforehand, the extra columns should not cause problems later. Overall, a model's accuracy depends partly on the features it has to work with, and informative extra features can help it predict more accurately, although adding columns does not guarantee better results.
