I have a csv
file like below. It includes 2 data columns (X
and Y
), and 2 label columns (C1
and C2
), and 4 records.
X Y C1 C2
0.2 75 1 2
0.8 12 1 4
0.9 23 2 3
0.33 24 3 4
I need to one-hot-encode the label columns in such a way that the encoded values are substituted for their current values.
Here is what I have tried:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.read_csv("some_file.csv", delimiter=" ")
categorical_cols = ["C1", "C2"]
one_hot_encoder = OneHotEncoder()
transformed = one_hot_encoder.fit_transform(df[categorical_cols])
encoded_data = pd.DataFrame(transformed, index=df.index)
df = pd.concat([df, encoded_data], axis=1).drop(["C1", "C2"], axis=1)
However, what I get lacks C1
and C2
of the header:
X Y 0
0 0.20 75 (0, 0)\t1.0\n (0, 3)\t1.0
1 0.80 12 (0, 0)\t1.0\n (0, 5)\t1.0
2 0.90 23 (0, 1)\t1.0\n (0, 4)\t1.0
3 0.33 24 (0, 2)\t1.0\n (0, 5)\t1.0
How can I preserve them after concatenation and also replace those tab delimiters with whitespaces?
CodePudding user response:
transformed
is a dense array, you want to convert it to a usual numpy array first:
encoded_data = pd.DataFrame(transformed.toarray(), index=df.index)
However, Pandas has a built-in function for that:
pd.get_dummies( df, columns=['C1','C2'])
Output:
X Y C1_1 C1_2 C1_3 C2_2 C2_3 C2_4
0 0.20 75 1 0 0 1 0 0
1 0.80 12 1 0 0 0 0 1
2 0.90 23 0 1 0 0 1 0
3 0.33 24 0 0 1 0 0 1