Found input variables with inconsistent numbers of samples: OHE

I am encoding categorical labels. I know I need to use OneHotEncoder() because the feature names differ in the test data, so I cannot use pd.get_dummies. In train I have x rows and in test 1 row; after OHE the test row is shorter, and I have no idea how to compare it with train.

le = LabelEncoder()
dfle = df.apply(le.fit_transform)
X = dfle.values
ohe = OneHotEncoder(handle_unknown='ignore')
X = ohe.fit_transform(X).toarray()


le = LabelEncoder()
testle = test.apply(le.fit_transform)
y = testle.values
two = OneHotEncoder(handle_unknown='ignore')
y = two.fit_transform(y).toarray()


rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

rf.predict([[ ? ]])

Output of X and y:

X:
[[0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0.
  0. 1.]
 [0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0.
  1. 1.]
 [0. 1. 0. 0. 0. 1. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1.
  0. 1.]
 [1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1.
  0. 1.]]

y:
[[1. 1. 1. 1. 1. 1. 1. 1. 1.]]

CodePudding user response:

First, I think you misunderstand what X and y mean. X represents your features and y your target(s). That is a different split from train/test (X_train, X_test, y_train, y_test). If y holds your test data, you should rename it to make that clear.
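To illustrate the distinction, here is a minimal sketch. The DataFrame, its column names, and the 'label' target are hypothetical and not taken from the question. It separates features from target first and only then splits rows into train and test, so every X part has the same number of samples as its y part. The error in the title is raised when the X and y passed to fit() have different row counts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 'country' and 'language' are features, 'label' is the target.
df = pd.DataFrame({
    'country': ['USA', 'France', 'USA', 'France'],
    'language': ['EN', 'FR', 'EN', 'FR'],
    'label': [1, 0, 1, 0],
})

X = df[['country', 'language']]  # features (columns the model learns from)
y = df['label']                  # target (column the model predicts)

# X and y are split row-wise together, so sample counts always match.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
```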

Here, it seems y is your test data:

"In train I have x rows and in test 1 row"

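You should reuse the transformer you fitted on X to transform (and only transform, not fit!) your test data. A second encoder fitted on the test row alone learns only the categories present in that row, which is why your encoded test row comes out shorter.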

What you should not do:

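import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df1 = pd.DataFrame({'country': ['USA', 'France'], 'language': ['EN', 'FR']})
# scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False
ohe = OneHotEncoder(sparse_output=False)
X_train = ohe.fit_transform(df1)

df2 = pd.DataFrame({'country': ['USA'], 'language': ['EN']})
ohe = OneHotEncoder(sparse_output=False)  # wrong: a new encoder, fitted on the test row only
X_test = ohe.fit_transform(df2)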

# X_train
# array([[0., 1., 1., 0.],
#        [1., 0., 0., 1.]])

# X_test
# array([[1., 1.]])  # shape differs from X_train

What you should do:

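import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df1 = pd.DataFrame({'country': ['USA', 'France'], 'language': ['EN', 'FR']})
# scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False
ohe = OneHotEncoder(sparse_output=False)
X_train = ohe.fit_transform(df1)  # fit on the training data only

df2 = pd.DataFrame({'country': ['USA'], 'language': ['EN']})
X_test = ohe.transform(df2)  # reuse the fitted encoder: transform, do not fit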

# X_train
# array([[0., 1., 1., 0.],
#        [1., 0., 0., 1.]])

# X_test
# array([[0., 1., 1., 0.]])  # same shape as X_train
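
Applied to the question's setup, the same idea might look like the sketch below. The data, the column names, and the 'target' column are hypothetical, since the original DataFrames are not shown. The encoder is fitted once on the training features, the single test row is only transformed, and handle_unknown='ignore' (as in the question) encodes a category unseen during fit as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: two categorical features plus a target column.
train = pd.DataFrame({
    'country': ['USA', 'France', 'USA', 'France'],
    'language': ['EN', 'FR', 'EN', 'FR'],
    'target': [1, 0, 1, 0],
})
feature_cols = ['country', 'language']

# Fit the encoder once, on the training features only.
ohe = OneHotEncoder(handle_unknown='ignore')
X_train = ohe.fit_transform(train[feature_cols]).toarray()
y_train = train['target'].to_numpy()  # one target value per training row

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# A single test row with a country that was never seen during fit.
test = pd.DataFrame({'country': ['Germany'], 'language': ['EN']})
X_test = ohe.transform(test[feature_cols]).toarray()  # transform only

# X_test has the same number of columns as X_train, so predict() accepts it.
prediction = rf.predict(X_test)
```

Note that the target passed to fit() has one value per training row. The test row is never passed to fit(); it is only encoded and handed to predict().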