OneHotEncoder returns unexpected result


I have an array that looks like this:

[['Team A', 'Team B', 5000],
 ['Team C', 'Team D', 4000]]

I am using this OneHotEncoder (inside a ColumnTransformer) to transform the team names:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import numpy as np

ct = ColumnTransformer(
    transformers=[(
        'encoder',
        OneHotEncoder(),
        [0, 1]  # one-hot encode the two team-name columns
    )],
    remainder='passthrough'  # keep the numeric column untouched
)

X = np.array(
    ct.fit_transform(X)
)

What I would expect is X to be something like:

[[1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 5000]
[0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 4000]]

However, when I print it, it looks more like this:

(0, 12) 1.0
(0, 17) 1.0
(0, 28) 5000.0
(1, 5)  1.0
(1, 25) 1.0
(1, 28) 4000.0

What am I missing in this case? If I include more columns in the X variable to begin with, I get the expected result:

[['Team A' 'Team B' 2 0 1 0.0 0.0 0 1 2 0.0 1.0 5000]
 ['Team C' 'Team D' 2 3 0 nan nan 2 1 1 nan nan 4000]]

I do not understand what the difference is between these cases. Why do I get different results for the same columns when only the other columns have changed?

CodePudding user response:

Cannot reproduce the issue with the sample data you have posted; for

[['Team A', 'Team B', 5000],
 ['Team C', 'Team D', 4000]]

I get X as

array([[1.0, 0.0, 1.0, 0.0, 5000],
       [0.0, 1.0, 0.0, 1.0, 4000]], dtype=object)

Nevertheless, the issue is most probably the default setting sparse=True of the OneHotEncoder (see the docs) - the representation you end up getting looks like a sparse one. If you don't want it like that, change the OneHotEncoder() definition in your ColumnTransformer to OneHotEncoder(sparse=False). The reason I cannot reproduce it is most likely ColumnTransformer's sparse_threshold parameter (default 0.3): the stacked output is returned as a sparse matrix only when its overall density falls below that threshold. With just the two team-name columns the output stays dense enough to remain above the threshold, so you get a plain array; once many more one-hot encoded columns are added, the density drops below 0.3 and the sparse representation kicks in.
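Here is a minimal sketch of the fix, assuming a recent scikit-learn (the sparse parameter was renamed to sparse_output in version 1.2 and removed in 1.4; on older versions use OneHotEncoder(sparse=False) as above):

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = np.array([
    ['Team A', 'Team B', 5000],
    ['Team C', 'Team D', 4000],
], dtype=object)

ct = ColumnTransformer(
    transformers=[(
        'encoder',
        # force dense output from the encoder itself;
        # on scikit-learn < 1.2 this is OneHotEncoder(sparse=False)
        OneHotEncoder(sparse_output=False),
        [0, 1]
    )],
    remainder='passthrough'
)

X = ct.fit_transform(X)
print(X)
# [[1.0 0.0 1.0 0.0 5000]
#  [0.0 1.0 0.0 1.0 4000]]

Alternatively, leave the encoder as it is and either pass sparse_threshold=0 to the ColumnTransformer (which forces a dense stacked result) or call .toarray() on the sparse matrix you get back.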
