I am using the classical Titanic dataset. I used OneHotEncoder
to encode surnames of people.
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Surname']), remainder = "drop")
encoded_surname = transformer.fit_transform(titanic)
titanic['Encoded_Surname'] = list(encoded_surname.astype(np.float64))
Here is what my data frame looks like:
This is what I get when I look for the .info()
:
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 891 non-null int64
1 Pclass 891 non-null int64
2 Sex 891 non-null int64
3 SibSp 891 non-null int64
4 Parch 891 non-null int64
5 Fare 891 non-null float64
6 Encoded_Surname 891 non-null object
dtypes: float64(1), int64(5), object(1)
Since the Encoded_Surname
label is an object and not numeric like the rest, I cannot fit the data into the classifier model.
How do I turn the np.array
I got from OneHotEncoder
into numeric data?
CodePudding user response:
IIUC, create a new dataframe for encoded_surname
data and join it to your original dataset:
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Surname']), remainder = "drop")
encoded_surname = transformer.fit_transform(titanic)
titanic = titanic.join(pd.DataFrame(encoded_surname, dtype=int).add_prefix('Encoded_Surname'))
CodePudding user response:
I would suggest you use pd.get_dummies
instead of OneHotEncoder
. If you really want to use the OneHotEncoder
:
ohe_df = pd.DataFrame(encoded_surname, columns=transformer.get_feature_names())
#concat with original data
titanic = pd.concat([titanic, ohe_df], axis=1).drop(['Surname'], axis=1)
If you can use pd.get_dummies
:
titanic = pd.get_dummies(titanic, prefix=['Surname'], columns=['Surname'], drop_first=True)