I am encoding categorical labels. I know that I need to use OneHotEncoder() because the feature names differ in the test data, so I cannot use pd.get_dummies. In train I have x rows and in test 1 row; after OHE the test row is shorter and I have no idea how to compare it with train.
le = LabelEncoder()
dfle = df.apply(le.fit_transform)
X = dfle.values
ohe = OneHotEncoder(handle_unknown='ignore')
X = ohe.fit_transform(X).toarray()
le = LabelEncoder()
testle = test.apply(le.fit_transform)
y = testle.values
two = OneHotEncoder(handle_unknown='ignore')
y = two.fit_transform(y).toarray()
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)
rf.predict([[ ? ]])
Output of X and y:
X:
[[0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0.
0. 1.]
[0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0.
1. 1.]
[0. 1. 0. 0. 0. 1. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1.
0. 1.]
[1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1.
0. 1.]]
y:
[[1. 1. 1. 1. 1. 1. 1. 1. 1.]]
CodePudding user response:
First, I think you misunderstand what X and y mean. X represents your features, y your target(s). That is different from X_train, X_test, y_train, y_test. If y represents your test data, you should rename it to make that clear.
Here, it seems y is your test data:
"In train I have x rows and in test 1 row"
You should reuse the transformers already fitted on X to transform (and only transform, not fit!) your test data.
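To make the naming concrete, here is a minimal sketch. It assumes a hypothetical toy DataFrame with a label column as the target; neither the data nor the column names come from the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: two categorical features and one target column.
df = pd.DataFrame({
    'country': ['USA', 'France', 'USA', 'France'],
    'language': ['EN', 'FR', 'EN', 'FR'],
    'label': [1, 0, 1, 0],
})

X = df[['country', 'language']]  # features
y = df['label']                  # target

# Both X and y are split row-wise into a training part and a test part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)  # (3, 2) (1, 2)
print(y_train.shape, y_test.shape)  # (3,) (1,)
```

So X and y describe columns (features versus target), while the _train and _test suffixes describe rows.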
What you should not do:
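import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df1 = pd.DataFrame({'country': ['USA', 'France'], 'language': ['EN', 'FR']})
ohe = OneHotEncoder(sparse=False)  # scikit-learn >= 1.2: use sparse_output=False
X_train = ohe.fit_transform(df1)
df2 = pd.DataFrame({'country': ['USA'], 'language': ['EN']})
ohe = OneHotEncoder(sparse=False)  # wrong: a second encoder, fitted on the test data only
X_test = ohe.fit_transform(df2)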
# X_train
# array([[0., 1., 1., 0.],
# [1., 0., 0., 1.]])
# X_test
# array([[1., 1.]]) # shape differs from X_train
What you should do:
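import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df1 = pd.DataFrame({'country': ['USA', 'France'], 'language': ['EN', 'FR']})
ohe = OneHotEncoder(sparse=False)  # scikit-learn >= 1.2: use sparse_output=False
X_train = ohe.fit_transform(df1)  # fit on the training data only
df2 = pd.DataFrame({'country': ['USA'], 'language': ['EN']})
X_test = ohe.transform(df2)  # reuse the fitted encoder: transform, do not fit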
# X_train
# array([[0., 1., 1., 0.],
# [1., 0., 0., 1.]])
# X_test
# array([[0., 1., 1., 0.]]) # same shape as X_train
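Applied to the situation in the question, the same pattern could look like the sketch below. The data, the column names, and the label target are made up for illustration; only the fit-on-train, transform-on-test pattern matters. It keeps handle_unknown='ignore' from the question, which encodes a category not seen during fitting as all zeros instead of raising an error, and it calls .toarray() on the default sparse output so it does not depend on the name of the sparse parameter.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: categorical features plus a target column.
train = pd.DataFrame({
    'country': ['USA', 'France', 'USA', 'France'],
    'language': ['EN', 'FR', 'EN', 'FR'],
    'label': [1, 0, 1, 0],
})
# A single test row; 'DE' never appears in the training data.
test = pd.DataFrame({'country': ['USA'], 'language': ['DE']})

feature_cols = ['country', 'language']

# Fit the encoder on the training features only.
ohe = OneHotEncoder(handle_unknown='ignore')
X_train = ohe.fit_transform(train[feature_cols]).toarray()
y_train = train['label']

# Reuse the fitted encoder: transform only, no second fit.
X_test = ohe.transform(test[feature_cols]).toarray()

print(X_train.shape)  # (4, 4)
print(X_test)         # [[0. 1. 0. 0.]]

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
prediction = rf.predict(X_test)  # one prediction for the one test row
```

The test row has the same four columns as the training matrix, so it can be passed straight to rf.predict. The unseen language 'DE' contributes zeros in both language columns.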