ValueError: could not convert string to float: 'Mme'

When I run the following code in Jupyter Lab:

import numpy as np
from sklearn.feature_selection import SelectKBest,f_classif
import matplotlib.pyplot as plt

predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","FamilySize","Title","NameLength"]
selector = SelectKBest(f_classif,k=5)
selector.fit(titanic[predictors],titanic["Survived"])

The last line fails with ValueError: could not convert string to float: 'Mme'. The details are as follows:

  ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    C:\Users\ADMINI~1\AppData\Local\Temp/ipykernel_17760/1637555559.py in <module>
          5 predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","FamilySize","Title","NameLength"]
          6 selector = SelectKBest(f_classif,k=5)
    ----> 7 selector.fit(titanic[predictors],titanic["Survived"])
     ......
    
    ValueError: could not convert string to float: 'Mme'

I printed titanic[predictors] and titanic["Survived"], and the output is as follows:

     Pclass  Sex   Age  SibSp  Parch     Fare  Embarked  FamilySize  Title  NameLength
0         3    0  22.0      1      0   7.2500         0           1      1          23
1         1    1  38.0      1      0  71.2833         1           1      3          51
2         3    1  26.0      0      0   7.9250         0           0      2          22
3         1    1  35.0      1      0  53.1000         0           1      3          44
4         3    0  35.0      0      0   8.0500         0           0      1          24
...     ...  ...   ...    ...    ...      ...       ...         ...    ...         ...
886       2    0  27.0      0      0  13.0000         0           0      6          21
887       1    1  19.0      0      0  30.0000         0           0      2          28
888       3    1  28.0      1      2  23.4500         0           3      2          40
889       1    0  26.0      0      0  30.0000         1           0      1          21
890       3    0  32.0      0      0   7.7500         2           0      1          19

891 rows × 10 columns

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

How can I solve this problem?

CodePudding user response：

Is the first row being read in as column labels? If so, assign the data starting from the second row, array[1:, :]. Otherwise, find where the "Mme" string is located in your data so that you understand how the code is picking it up.
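To find where "Mme" sits, one option is to coerce each column to numbers and look at the cells that fail. This is only a minimal sketch: the small DataFrame below is a made-up stand-in for the asker's titanic data, and it assumes that Title was mostly mapped to integers but one raw string was left behind.

```python
import pandas as pd

# Hypothetical stand-in for the question's DataFrame (values are invented):
# "Title" is mostly integer codes, but one raw string survived the mapping.
titanic = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Title": [1, 3, "Mme", 2],
})

offenders = {}
for column in titanic.columns:
    # errors="coerce" turns every value that is not a number into NaN
    converted = pd.to_numeric(titanic[column], errors="coerce")
    # Keep cells that became NaN only because of the conversion
    bad_cells = titanic.loc[converted.isna() & titanic[column].notna(), column]
    if not bad_cells.empty:
        offenders[column] = bad_cells.to_dict()

# Maps each problem column to {row index: offending value}
print(offenders)
```

If a leftover title shows up this way, extending whatever mapping produced the Title codes so that it also covers that value should let the column convert cleanly.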

CodePudding user response：

Before you fit any algorithm (SelectKBest in your case), you need to know what your data contains, and you will almost always need to preprocess it.

Take a look at your data:

Are your features categorical, numerical, or a mix?
Do you have NaN values?
...
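These checks can be sketched with pandas. The DataFrame here is an invented example, not the asker's data. A column with dtype object usually means strings are mixed in, and isna().sum() counts the missing values per column.

```python
import numpy as np
import pandas as pd

# Invented example data: one missing Age and one string inside Title
titanic = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
    "Title": [1, "Mme", 2],
})

# Columns stored as generic Python objects rather than as numbers
object_columns = titanic.select_dtypes(include="object").columns.tolist()

# Number of missing values in each column
nan_counts = titanic.isna().sum()

print(object_columns)
print(nan_counts)
```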

Most algorithms do not accept categorical features directly, so you need to transform them into numerical ones (consider using OneHotEncoder).
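As a rough illustration of that transformation, here is a small OneHotEncoder sketch. The Title values are invented for the example, and it assumes the column holds only strings (a column mixing integers and strings would need to be converted to one type first).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Invented categorical column, all strings
titles = pd.DataFrame({"Title": ["Mr", "Mrs", "Mme", "Mr"]})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform returns a sparse matrix by default; toarray() makes it dense
encoded = encoder.fit_transform(titles[["Title"]]).toarray()

print(encoder.categories_[0])  # categories learned from the column
print(encoded)                 # one 0/1 column per category
```

The encoded array can then be joined with the numerical columns before being passed to selector.fit.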

You will run into the same kind of problem with NaN values.
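One common way to deal with NaN values is to fill them with the column median. This is a sketch on invented data; whether the median is the right choice for your columns is a judgment call, and dropping the rows with dropna() is another option.

```python
import numpy as np
import pandas as pd

# Invented example with one missing Age
titanic = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0]})

# median() skips NaN by default, so it is computed from the known ages only
age_median = titanic["Age"].median()
titanic["Age"] = titanic["Age"].fillna(age_median)

print(titanic)
```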

In conclusion, you have to preprocess your data before you start fitting.