When I run the following code in Jupyter Lab:
import numpy as np
from sklearn.feature_selection import SelectKBest,f_classif
import matplotlib.pyplot as plt
predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","FamilySize","Title","NameLength"]
selector = SelectKBest(f_classif,k=5)
selector.fit(titanic[predictors],titanic["Survived"])
It fails with ValueError: could not convert string to float: 'Mme'. The traceback is:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
C:\Users\ADMINI~1\AppData\Local\Temp/ipykernel_17760/1637555559.py in <module>
5 predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","FamilySize","Title","NameLength"]
6 selector = SelectKBest(f_classif,k=5)
----> 7 selector.fit(titanic[predictors],titanic["Survived"])
......
ValueError: could not convert string to float: 'Mme'
I printed titanic[predictors] and titanic["Survived"], and the output is as follows:
Pclass Sex Age SibSp Parch Fare Embarked FamilySize Title NameLength
0 3 0 22.0 1 0 7.2500 0 1 1 23
1 1 1 38.0 1 0 71.2833 1 1 3 51
2 3 1 26.0 0 0 7.9250 0 0 2 22
3 1 1 35.0 1 0 53.1000 0 1 3 44
4 3 0 35.0 0 0 8.0500 0 0 1 24
... ... ... ... ... ... ... ... ... ... ...
886 2 0 27.0 0 0 13.0000 0 0 6 21
887 1 1 19.0 0 0 30.0000 0 0 2 28
888 3 1 28.0 1 2 23.4500 0 3 2 40
889 1 0 26.0 0 0 30.0000 1 0 1 21
890 3 0 32.0 0 0 7.7500 2 0 1 19
891 rows × 10 columns
0 0
1 1
2 1
3 1
4 0
..
886 0
887 1
888 0
889 1
890 0
Name: Survived, Length: 891, dtype: int64
How can I solve this problem?
CodePudding user response:
Is it including the column labels as the first row? If so, assign the data properly by starting the array from the second row, array[1:,:]. Otherwise, look for where the string "Mme" is located so that you understand how the code is picking it up.
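One way to locate the offending string is sketched below. Note that the printed preview in the question only shows the first and last five rows, so a stray "Mme" could sit in the hidden rows (for example in Title, if a title-to-number mapping missed it; that is a guess, not something the output confirms). The frame `toy` and the helper `find_non_numeric` are hypothetical names with made-up values, standing in for `titanic`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `titanic`; the values are made up.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Title": [1, 3, "Mme", 2],   # one string left over among numeric codes
    "Survived": [0, 1, 1, 1],
})

def find_non_numeric(df):
    """Return {column: [values that cannot be converted to a number]}."""
    problems = {}
    for col in df.columns:
        converted = pd.to_numeric(df[col], errors="coerce")
        # A value is a problem if it was present but became NaN on conversion.
        bad = df.loc[converted.isna() & df[col].notna(), col]
        if not bad.empty:
            problems[col] = bad.unique().tolist()
    return problems

offenders = find_non_numeric(toy)
print(offenders)
```

On the real data, calling the same helper on titanic[predictors] should report which column still holds strings.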
CodePudding user response:
When you fit an algorithm (in your case SelectKBest), you need to know your data, and almost always you need to preprocess it.
Take a look at your data:
- Do you have categorical features, numerical features, or a mix?
- Do you have NaN values?
- ...
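The checks in this list can be run with a few pandas calls. This is a minimal sketch on a made-up frame named `toy` (a hypothetical stand-in for `titanic`), not the asker's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
    "Title": [1, "Mme", 2],      # mixed numbers and a string
})

dtypes = toy.dtypes                # object dtype usually means strings or mixed types
nan_counts = toy.isna().sum()      # number of missing values per column
object_cols = toy.select_dtypes(include="object").columns.tolist()

print(dtypes)
print(nan_counts)
print(object_cols)
```

A column that shows up in `object_cols`, or has a non-zero entry in `nan_counts`, needs attention before fitting.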
Most algorithms don't accept categorical features, so you need to transform them into numerical ones (consider using OneHotEncoder).
NaN values cause the same kind of problem.
In conclusion, before you start fitting, you have to preprocess your data.