Home > Back-end >  How to substitute the one-hot-encoded data for the raw data associated with them using scikit-learn?
How to substitute the one-hot-encoded data for the raw data associated with them using scikit-learn?

Time:10-20

I have a csv file like below. It includes 2 data columns (X and Y), and 2 label columns (C1 and C2), and 4 records.

X Y C1 C2

0.2 75 1 2

0.8 12 1 4

0.9 23 2 3

0.33 24 3 4

I need to one-hot-encode the label columns in such a way that the encoded values are substituted for their current values.

Here is what I have tried:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("some_file.csv", delimiter=" ")

categorical_cols = ["C1", "C2"]
one_hot_encoder = OneHotEncoder()
transformed = one_hot_encoder.fit_transform(df[categorical_cols])
encoded_data = pd.DataFrame(transformed, index=df.index)
df = pd.concat([df, encoded_data], axis=1).drop(["C1", "C2"], axis=1)

However, what I get lacks C1 and C2 of the header:

      X   Y                             0
0  0.20  75    (0, 0)\t1.0\n  (0, 3)\t1.0
1  0.80  12    (0, 0)\t1.0\n  (0, 5)\t1.0
2  0.90  23    (0, 1)\t1.0\n  (0, 4)\t1.0
3  0.33  24    (0, 2)\t1.0\n  (0, 5)\t1.0

How can I preserve them after concatenation and also replace those tab delimiters with whitespaces?

CodePudding user response:

transformed is a dense array, you want to convert it to a usual numpy array first:

encoded_data = pd.DataFrame(transformed.toarray(), index=df.index)

However, Pandas has a built-in function for that:

pd.get_dummies( df, columns=['C1','C2'])

Output:

      X   Y  C1_1  C1_2  C1_3  C2_2  C2_3  C2_4
0  0.20  75     1     0     0     1     0     0
1  0.80  12     1     0     0     0     0     1
2  0.90  23     0     1     0     0     1     0
3  0.33  24     0     0     1     0     0     1
  • Related