I am building a classification model on a data set whose independent variables are categorical. Since `fit()` does not accept non-numeric values, I need to apply either `LabelEncoder` or `OneHotEncoder`.
My data set looks like this:
index | outlook | temperature | humidity | windy | play |
---|---|---|---|---|---|
0 | sunny | hot | high | false | no |
1 | sunny | hot | high | true | no |
2 | overcast | hot | high | false | yes |
3 | rainy | mild | high | false | yes |
4 | rainy | cool | normal | false | yes |
5 | rainy | cool | normal | true | no |
My code is as follows:
```python
import pandas as pd
from sklearn import model_selection, preprocessing
from sklearn.linear_model import LinearRegression

w = pd.read_csv("/content/drive/MyDrive/weather.csv")

# Encode each categorical column to integers
lencoder = preprocessing.LabelEncoder()
w['humidity'] = lencoder.fit_transform(w['humidity'])
w['outlook'] = lencoder.fit_transform(w['outlook'])
w['temperature'] = lencoder.fit_transform(w['temperature'])
w['windy'] = lencoder.fit_transform(w['windy'])

x = w.iloc[:, :4].values   # features
y = w.iloc[:, -1].values   # target ('play')
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(x, y, test_size=0.1)

model = LinearRegression()
model.fit(X_train, Y_train)
```
How can I now predict an individual test sample such as `[sunny, hot, high, false]`?
CodePudding user response:
You need to encode it with the same values that the `LabelEncoder` assigned to each of these values in each column, so it will probably look like `[0, 0, 0, 0]`.

However, you should note that the docs say "This transformer should be used to encode target values, i.e. y, and not the input X." (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)

So you may want to use `OneHotEncoder` instead: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder
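A minimal sketch of that approach, assuming the four feature columns from your question (`handle_unknown='ignore'` is optional but protects against categories unseen during training):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cols = ['outlook', 'temperature', 'humidity', 'windy']

# Fit one encoder across all feature columns at once.
# Use sparse=False instead of sparse_output=False on scikit-learn < 1.2.
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X = ohe.fit_transform(w[cols])

# The same fitted encoder transforms a new sample consistently:
test = pd.DataFrame([['sunny', 'hot', 'high', 'false']], columns=cols)
test_enc = ohe.transform(test)
```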
CodePudding user response:
You need to know what the mapping from categories to encodings is in order to apply it to new data. When you use the same `lencoder` object on the second line:

```python
w['humidity'] = lencoder.fit_transform(w['humidity'])
w['outlook'] = lencoder.fit_transform(w['outlook'])
```

you (and Python) have completely forgotten what that mapping is for the `humidity` column, because refitting the same encoder overwrites it.
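You can see the overwrite directly on the encoder's `classes_` attribute (a small illustration against the raw data from your question):

```python
lencoder.fit_transform(w['humidity'])
print(lencoder.classes_)   # ['high' 'normal']

lencoder.fit_transform(w['outlook'])
print(lencoder.classes_)   # ['overcast' 'rainy' 'sunny'] -- humidity's mapping is gone
```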
You could set up separate encoders, then, and just use `.transform` on your test data. However, that's not the real solution. As Nohman says, `LabelEncoder` is meant for labels, i.e. target values; the better solution is `OrdinalEncoder`, which has roughly the same effect but can be applied to a 2D array, encoding each column separately. So:
```python
oencoder = preprocessing.OrdinalEncoder()
# Keep the column order consistent between fitting and the test sample below.
cat_cols = ['outlook', 'temperature', 'humidity', 'windy']
w[cat_cols] = oencoder.fit_transform(w[cat_cols])
...
# I don't know what other columns you have, so I can't make this completely faithful:
test = [['sunny', 'hot', 'high', 'false']]
test_enc = oencoder.transform(test)
model.predict(test_enc)
```
You can make this even nicer using `Pipeline` and (if you have non-categorical columns) `ColumnTransformer`.
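A sketch of that pattern, under the assumption that your features are the four columns shown (I keep `LinearRegression` from your code and label-encode the target so `fit` accepts it, per the docs quote above):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

cat_cols = ['outlook', 'temperature', 'humidity', 'windy']

# Encode the categorical features; any other (numeric) columns pass through unchanged.
pre = ColumnTransformer(
    [('cat', OrdinalEncoder(), cat_cols)],
    remainder='passthrough',
)
pipe = Pipeline([('pre', pre), ('model', LinearRegression())])

# LabelEncoder is the right tool for the *target* column.
y = LabelEncoder().fit_transform(w['play'])
pipe.fit(w[cat_cols], y)

# Raw categories go straight in; the pipeline handles the encoding.
test = pd.DataFrame([['sunny', 'hot', 'high', 'false']], columns=cat_cols)
pipe.predict(test)
```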
Finally, a bit of advice given your example: the categorical features you have may well have an implicit order to them. You can specify that order via the `categories` parameter of `OrdinalEncoder`, which is likely to produce better results than encoding them in the default (alphabetical) order.
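For example (the specific orderings below are assumptions about your data's semantics):

```python
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ['outlook', 'temperature', 'humidity', 'windy']

# One ordered list per column, in the same order as cat_cols.
oencoder = OrdinalEncoder(categories=[
    ['sunny', 'overcast', 'rainy'],  # outlook
    ['cool', 'mild', 'hot'],         # temperature: cool=0 < mild=1 < hot=2
    ['normal', 'high'],              # humidity
    ['false', 'true'],               # windy
])
w[cat_cols] = oencoder.fit_transform(w[cat_cols])
```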