I have a dataframe that looks like this:
_id  Points  Averages  Averages_2  Media  Rank
a    324     858.2     NaN         0      Good
b    343     873.2     4.465e+06   1      Good
c    934     113.4     NaN         0      Bad
d    222     424.2     NaN         1      Bad
e    432     234.2     3.605e+06   1      Good
I want to predict the Rank column. Note that this is just a sample of a dataframe with 2000 rows and ca. 20 columns, but I wanted to point out that there are columns, such as Averages_2, with lots of NaNs, and there are columns whose values are only 0 or 1.
I did the following:
import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
data = 'C:\\me\\my_table.csv'
df = pd.read_csv(data)
cols_to_drop = ['_id']  # a list isn't needed for a single column,
                        # but my original df is much bigger, so I use it
                        # to drop multiple columns at once
df.drop(cols_to_drop, axis=1, inplace=True)
X = df.drop('Rank', axis=1)
y = df['Rank']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
lc = LabelEncoder()
lc = lc.fit(y)
lc_y = lc.transform(y)
model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
predictions = [round(int(value)) for value in y_pred]
I get ValueError: invalid literal for int() with base 10: 'Good'
I thought encoding the classes would work, but what else does one do when the classes are strings?
CodePudding user response:
It fails because your y_pred contains strings like ["Good", "Bad"], so your last line ends up calling e.g. round(int("Good")), which of course cannot work (try print(y_pred[:5]) and see what it shows).
You are actually not using your label encoder on either your training set or your test set (you fit it on y but never use it to transform y_train or y_pred), and there is no need to when using XGBoost: it handles the classes automatically.
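As a minimal sketch of the above, reusing the imports and variables from the question (and assuming your XGBoost version accepts the string labels, as it evidently did when model.fit ran), you can simply drop the int() conversion:
# reusing model, lc, X_test, y_test from the question
y_pred = model.predict(X_test)              # string labels, e.g. 'Good' / 'Bad'
accuracy = accuracy_score(y_test, y_pred)   # accuracy_score works fine with strings
print("Accuracy: %.2f%%" % (accuracy * 100.0))

# if you do want numeric class codes, reuse the fitted LabelEncoder instead of int()
pred_codes = lc.transform(y_pred)           # e.g. 'Bad' -> 0, 'Good' -> 1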
CodePudding user response:
Generally, machine learning models accept only numerical values, both in the features and in the target. Before splitting the data into train and test sets, you can apply a LabelEncoder
or replace the values directly, e.g. df['Rank'].replace({"Good": 1, "Bad": 0})
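A minimal sketch of that approach, assuming the same df as in the question:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# encode the target before splitting; LabelEncoder sorts the classes,
# so 'Bad' becomes 0 and 'Good' becomes 1
lc = LabelEncoder()
df['Rank'] = lc.fit_transform(df['Rank'])

X = df.drop('Rank', axis=1)
y = df['Rank']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)

model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)          # integer predictions now
labels = lc.inverse_transform(y_pred)   # map back to 'Good'/'Bad' if needed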