I have a dataframe that looks like this:
_id  Points  Averages  Averages_2  Media  Rank
a    324     858.2     NaN         0      Good
b    343     873.2     4.465e+06   1      Good
c    934     113.4     NaN         0      Bad
d    222     424.2     NaN         1      Bad
e    432     234.2     3.605e+06   1      Good
I want to predict the Rank column. Note that this is just a sample of a dataframe with 2000 rows and ca. 20 columns, but I wanted to point out that there are columns, such as Averages_2, with lots of NaNs, and there are columns whose values are only 0 or 1.
I did the following:
import xgboost as xgb
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
data = 'C:\\me\\my_table.csv'
df = pd.read_csv(data)
cols_to_drop = ['_id']  # a list isn't needed for a single column,
                        # but my original df is much bigger, so I use it
                        # to drop multiple columns at once
df.drop(cols_to_drop, axis=1, inplace=True)
X = df.drop('Rank', axis=1)
y = df['Rank']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
lc = LabelEncoder()
lc = lc.fit(y)
lc_y = lc.transform(y)
model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
predictions = [round(int(value)) for value in y_pred]
I get ValueError: invalid literal for int() with base 10: 'Good'
I thought encoding the classes would work, but what else does one do when the classes are strings?
CodePudding user response:
It fails because your y_pred contains strings like ["Good", "Bad"], so your last line ends up calling e.g. round(int("Good")), which of course cannot work (try print(y_pred[:5]) and see what it shows).
You are actually not using your label encoder on either your training set or your test set (you fit it on y but never use it to transform y_train or y_pred), and there is no need to when using XGBoost: it handles the classes automatically.
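As a minimal sketch of the above, reusing the imports and variables from the question (and assuming your XGBoost version accepts the string labels, as it evidently did when model.fit ran), you can simply drop the int() conversion:
# reusing model, lc, X_test, y_test from the question
y_pred = model.predict(X_test)              # string labels, e.g. 'Good' / 'Bad'
accuracy = accuracy_score(y_test, y_pred)   # accuracy_score works fine with strings
print("Accuracy: %.2f%%" % (accuracy * 100.0))

# if you do want numeric class codes, reuse the fitted LabelEncoder instead of int()
pred_codes = lc.transform(y_pred)           # e.g. 'Bad' -> 0, 'Good' -> 1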
CodePudding user response:
Generally, machine learning models accept only numerical values, both in the features and in the target. Before splitting the data into train and test sets, you can apply a LabelEncoder
or replace the values directly, e.g. df['Rank'].replace({"Good": 1, "Bad": 0})
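A minimal sketch of that approach, assuming the same df as in the question:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# encode the target before splitting; LabelEncoder sorts the classes,
# so 'Bad' becomes 0 and 'Good' becomes 1
lc = LabelEncoder()
df['Rank'] = lc.fit_transform(df['Rank'])

X = df.drop('Rank', axis=1)
y = df['Rank']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)

model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)          # integer predictions now
labels = lc.inverse_transform(y_pred)   # map back to 'Good'/'Bad' if needed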