I am building a classification model on a data set whose independent variables are categorical. Since `fit()` does not accept non-numeric values, I need to apply either `LabelEncoder` or `OneHotEncoder`.
My data set looks like this:
index | outlook | temperature | humidity | windy | play |
---|---|---|---|---|---|
0 | sunny | hot | high | false | no |
1 | sunny | hot | high | true | no |
2 | overcast | hot | high | false | yes |
3 | rainy | mild | high | false | yes |
4 | rainy | cool | normal | false | yes |
5 | rainy | cool | normal | true | no |
My code is as follows:
```python
import pandas as pd
from sklearn import model_selection, preprocessing
from sklearn.linear_model import LinearRegression

w = pd.read_csv("/content/drive/MyDrive/weather.csv")

# Encode each categorical column to integers
lencoder = preprocessing.LabelEncoder()
w['humidity'] = lencoder.fit_transform(w['humidity'])
w['outlook'] = lencoder.fit_transform(w['outlook'])
w['temperature'] = lencoder.fit_transform(w['temperature'])
w['windy'] = lencoder.fit_transform(w['windy'])

x = w.iloc[:, :4].values   # features
y = w.iloc[:, -1].values   # target ('play')
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(x, y, test_size=0.1)

model = LinearRegression()
model.fit(X_train, Y_train)
```
How can I now predict an individual test sample such as `[sunny, hot, high, false]`?
CodePudding user response:
You need to encode it with the same values that the `LabelEncoder` assigned to each of these values in each column, so it will probably look like `[0, 0, 0, 0]`.

However, you should note that the docs say "This transformer should be used to encode target values, i.e. y, and not the input X." (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)

So you may want to use `OneHotEncoder` instead: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder
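A minimal sketch of that approach, assuming the four feature columns from your question (`handle_unknown='ignore'` is optional but protects against categories unseen during training):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cols = ['outlook', 'temperature', 'humidity', 'windy']

# Fit one encoder across all feature columns at once.
# Use sparse=False instead of sparse_output=False on scikit-learn < 1.2.
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X = ohe.fit_transform(w[cols])

# The same fitted encoder transforms a new sample consistently:
test = pd.DataFrame([['sunny', 'hot', 'high', 'false']], columns=cols)
test_enc = ohe.transform(test)
```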
CodePudding user response:
You need to know what the mapping from categories to encodings is in order to apply it to new data. When you use the same `lencoder` object on the second line:

```python
w['humidity'] = lencoder.fit_transform(w['humidity'])
w['outlook'] = lencoder.fit_transform(w['outlook'])
```

you (and Python) have completely forgotten what that mapping is for the `humidity` column, because refitting the same encoder overwrites it.
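You can see the overwrite directly on the encoder's `classes_` attribute (a small illustration against the raw data from your question):

```python
lencoder.fit_transform(w['humidity'])
print(lencoder.classes_)   # ['high' 'normal']

lencoder.fit_transform(w['outlook'])
print(lencoder.classes_)   # ['overcast' 'rainy' 'sunny'] -- humidity's mapping is gone
```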
You could set up separate encoders, then, and just use `.transform` on your test data. However, that's not the real solution. As Nohman says, `LabelEncoder` is meant for labels, i.e. target values; the better solution is `OrdinalEncoder`, which has roughly the same effect but can be applied to a 2D array, encoding each column separately. So:
```python
oencoder = preprocessing.OrdinalEncoder()
# Keep the column order consistent between fitting and the test sample below.
cat_cols = ['outlook', 'temperature', 'humidity', 'windy']
w[cat_cols] = oencoder.fit_transform(w[cat_cols])
...
# I don't know what other columns you have, so I can't make this completely faithful:
test = [['sunny', 'hot', 'high', 'false']]
test_enc = oencoder.transform(test)
model.predict(test_enc)
```
You can make this even nicer using `Pipeline` and (if you have non-categorical columns) `ColumnTransformer`.
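A sketch of that pattern, under the assumption that your features are the four columns shown (I keep `LinearRegression` from your code and label-encode the target so `fit` accepts it, per the docs quote above):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

cat_cols = ['outlook', 'temperature', 'humidity', 'windy']

# Encode the categorical features; any other (numeric) columns pass through unchanged.
pre = ColumnTransformer(
    [('cat', OrdinalEncoder(), cat_cols)],
    remainder='passthrough',
)
pipe = Pipeline([('pre', pre), ('model', LinearRegression())])

# LabelEncoder is the right tool for the *target* column.
y = LabelEncoder().fit_transform(w['play'])
pipe.fit(w[cat_cols], y)

# Raw categories go straight in; the pipeline handles the encoding.
test = pd.DataFrame([['sunny', 'hot', 'high', 'false']], columns=cat_cols)
pipe.predict(test)
```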
Finally, a bit of advice given your example: the categorical features you have may well have an implicit order to them. You can specify that order via the `categories` parameter of `OrdinalEncoder`, which is likely to produce better results than encoding them in the default (alphabetical) order.
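For example (the specific orderings below are assumptions about your data's semantics):

```python
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ['outlook', 'temperature', 'humidity', 'windy']

# One ordered list per column, in the same order as cat_cols.
oencoder = OrdinalEncoder(categories=[
    ['sunny', 'overcast', 'rainy'],  # outlook
    ['cool', 'mild', 'hot'],         # temperature: cool=0 < mild=1 < hot=2
    ['normal', 'high'],              # humidity
    ['false', 'true'],               # windy
])
w[cat_cols] = oencoder.fit_transform(w[cat_cols])
```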