I have an array that looks like this:
[['Team A', 'Team B', 5000],
['Team C', 'Team D', 4000]]
I am using this OneHotEncoder setup to transform the team names:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    transformers=[(
        'encoder',
        OneHotEncoder(),
        [0, 1]
    )],
    remainder='passthrough'
)
X = np.array(
    ct.fit_transform(X)
)
What I would expect is X to be something like:
[[1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 5000]
[0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 4000]]
However, when I print X it looks more like this:
(0, 12) 1.0
(0, 17) 1.0
(0, 28) 5000.0
(1, 5) 1.0
(1, 25) 1.0
(1, 28) 4000.0
What am I missing in this case? If I include more columns in the X variable to begin with, I get the expected result:
[['Team A' 'Team B' 2 0 1 0.0 0.0 0 1 2 0.0 1.0 5000]
['Team C' 'Team D' 2 3 0 nan nan 2 1 1 nan nan 4000]]
I do not understand what the difference is between these cases. Why do I get different results for the same columns when only the other columns have changed?
CodePudding user response:
I cannot reproduce the issue with the sample data you have posted; for
[['Team A', 'Team B', 5000],
['Team C', 'Team D', 4000]]
I get X as
array([[1.0, 0.0, 1.0, 0.0, 5000],
       [0.0, 1.0, 0.0, 1.0, 4000]], dtype=object)
Nevertheless, most probably the issue is with the default setting sparse=True
of the OneHotEncoder (see the docs) - the representation you end up getting looks like a sparse one. If you don't want that, change the OneHotEncoder()
definition in your ColumnTransformer
to OneHotEncoder(sparse=False)
(probably the reason I cannot reproduce it is that your sample dataset is way too small, and the sparse representation simply does not kick in in such cases).
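As a minimal sketch of a version-agnostic workaround (assuming scikit-learn and SciPy are installed): instead of relying on the encoder's default, you can densify whatever ct.fit_transform returns, which works whether or not the sparse representation kicked in. Note also that in scikit-learn 1.2+ the sparse parameter of OneHotEncoder was renamed to sparse_output.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = np.array([['Team A', 'Team B', 5000],
              ['Team C', 'Team D', 4000]], dtype=object)

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0, 1])],
    remainder='passthrough'
)

result = ct.fit_transform(X)

# ColumnTransformer may return either a dense ndarray or a SciPy
# sparse matrix (depending on its sparse_threshold and the data),
# so densify defensively:
X_dense = result.toarray() if sp.issparse(result) else np.asarray(result)
print(X_dense)
```

Calling .toarray() on a sparse result gives the dense 2-D array you expected; alternatively, pass sparse=False (or sparse_output=False on newer versions) to OneHotEncoder so the encoder itself emits dense output.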