How do I convert a pandas Series of np.arrays into numeric values?

I am using the classical Titanic dataset. I used OneHotEncoder to encode surnames of people.

transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Surname']), remainder = "drop")
encoded_surname = transformer.fit_transform(titanic)
titanic['Encoded_Surname'] = list(encoded_surname.astype(np.float64))

This is what .info() reports for the resulting data frame:

Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Survived         891 non-null    int64  
 1   Pclass           891 non-null    int64  
 2   Sex              891 non-null    int64  
 3   SibSp            891 non-null    int64  
 4   Parch            891 non-null    int64  
 5   Fare             891 non-null    float64
 6   Encoded_Surname  891 non-null    object 
dtypes: float64(1), int64(5), object(1)

Since the Encoded_Surname column has dtype object rather than a numeric dtype like the rest, I cannot fit the data to the classifier model.

How do I turn the np.array I got from OneHotEncoder into numeric data?

CodePudding user response:

IIUC, create a new dataframe for encoded_surname data and join it to your original dataset:

transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Surname']), remainder = "drop")
encoded_surname = transformer.fit_transform(titanic)

titanic = titanic.join(pd.DataFrame(encoded_surname, index=titanic.index, dtype=int).add_prefix('Encoded_Surname_'))
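To illustrate why the join works, here is a minimal sketch on made-up data. The three-row frame and the 3x2 one-hot matrix are hypothetical stand-ins for the Titanic frame and the OneHotEncoder output. Storing one array per row produces an object column, while spreading the matrix over one column per category keeps everything numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the Titanic frame and the one-hot matrix.
titanic = pd.DataFrame({"Survived": [0, 1, 1], "Fare": [7.25, 71.28, 7.92]})
encoded_surname = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])

# One array per row: pandas stores the arrays as Python objects.
as_object = titanic.assign(Encoded_Surname=list(encoded_surname))
print(as_object["Encoded_Surname"].dtype)

# One column per category: every column has a numeric dtype.
encoded_df = pd.DataFrame(
    encoded_surname.astype(int), index=titanic.index
).add_prefix("Encoded_Surname_")
joined = titanic.join(encoded_df)
print(joined.dtypes)
```

Passing index=titanic.index keeps the rows aligned even if the original frame does not use the default 0..n-1 index.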

CodePudding user response:

I would suggest you use pd.get_dummies instead of OneHotEncoder. If you really want to use the OneHotEncoder:

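# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
ohe_df = pd.DataFrame(encoded_surname, columns=transformer.get_feature_names_out(), index=titanic.index)
# Concatenate with the original data and drop the raw Surname column
titanic = pd.concat([titanic, ohe_df], axis=1).drop(['Surname'], axis=1)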

If you can use pd.get_dummies:

titanic = pd.get_dummies(titanic, prefix=['Surname'], columns=['Surname'], drop_first=True)
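A quick sketch of what that call produces, again on hypothetical toy data. With drop_first=True the first category (in sorted order) is dropped, so two surnames yield a single indicator column. Depending on the pandas version the indicator dtype may be bool or uint8; passing dtype=int to get_dummies is one way to force integers if the model needs them.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic frame.
titanic = pd.DataFrame(
    {"Survived": [0, 1, 1], "Surname": ["Braund", "Cumings", "Braund"]}
)

# The Surname column is replaced by indicator columns; "Braund" is dropped as the first category.
dummies = pd.get_dummies(
    titanic, prefix=["Surname"], columns=["Surname"], drop_first=True
)
print(list(dummies.columns))
```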