PySpark: Performing One-Hot Encoding

I am new to PySpark and need to perform a classification task on a dataset that contains categorical variables. I one-hot encoded the data, but I am not sure whether I am doing it the right way.

Step 1: Let's say, for example, this is the dataset:

+-----+-----+
| name|class|
+-----+-----+
| Alex|    B|
|  Bob|    A|
|Cathy|    B|
| Dave|    C|
| Eric|    D|
+-----+-----+

Step 2: After one-hot encoding, I get this data:

+-----+-----+-------------+-------------+
| name|class|class_numeric| class_onehot|
+-----+-----+-------------+-------------+
| Alex|    B|          0.0|(3,[0],[1.0])|
|  Bob|    A|          1.0|(3,[1],[1.0])|
|Cathy|    B|          0.0|(3,[0],[1.0])|
| Dave|    C|          2.0|(3,[2],[1.0])|
| Eric|    D|          3.0|    (3,[],[])|
+-----+-----+-------------+-------------+

Step 3: Here the fourth column is a sparse vector, which is different from what Pandas produces. So I expanded this vector into separate DataFrame columns like this:

+-----+-----+-------------+-------------+-------------+
| name|class|col_onehot[0]|col_onehot[1]|col_onehot[2]|
+-----+-----+-------------+-------------+-------------+
| Alex|    B|          1.0|          0.0|          0.0|
|  Bob|    A|          0.0|          1.0|          0.0|
|Cathy|    B|          1.0|          0.0|          0.0|
| Dave|    C|          0.0|          0.0|          1.0|
| Eric|    D|          0.0|          0.0|          0.0|
+-----+-----+-------------+-------------+-------------+

I did this because I need to train machine learning models such as random forest, decision tree, and naive Bayes on this dataset.

But I wonder whether I actually need to convert each value into its own column, as in step 3. Can I train the model on a dataset that keeps the one-hot encoded vector from step 2?

I am asking because, with my original dataset, the step 3 version takes too much driver memory and keeps giving errors after some time.

User response:

Use OneHotEncoder from the Spark ML library after you index the categorical feature (for example with StringIndexer), and keep its vector output as it is instead of exploding it into multiple columns.

In fact, the classification models in Spark ML expect the input features as a single vector-type column, not as multiple columns. So if you explode the one-hot vector into separate columns, you have to re-assemble them into a vector again (for example with VectorAssembler) before training. You can check the example here: https://spark.apache.org/docs/latest/ml-features#onehotencoder