KeyError when passing a list of inputs to .predict() using a Pipeline

From what I found by experimenting and reading here on Stack Overflow, when I pass a pandas DataFrame to .predict(), it successfully returns a prediction, like below:

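import pickle
from sklearn.pipeline import Pipeline

# encoder, scaler and RFregsr are created earlier (see the EDIT below for the encoder)
pipe = Pipeline([('OneHotEncoder', encoder), ('RobustScaler', scaler), ('RandomForestRegressor', RFregsr)])
pipe.fit(X_train, y_train)
with open('trained_RFregsr.pkl', 'wb') as f:
    pickle.dump(pipe, f)
test = pipe.predict(X[0:1])  # X[0:1] is a one-row DataFrame slice
print(test)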

>> [10.82638889]

But when I try to pass a list of all the required input values (25 in my case), it raises a KeyError. I think this is related to how iterating over a pandas DataFrame yields only the column names, not the values.

test = pipe.predict([['M', 15, 'U', 'LE3', 'T', 4, 3, 'teacher', 'services', 1, 3, 0,
        'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 5, 4, 4, 2, 15, 16]])
print(test)
>> KeyError : 'sex'

I trained the model on 25 features, a mix of categorical and numerical values, to predict a single integer value. I am pickling the pipeline because I have to deploy it with FastAPI, where it will receive input from API endpoints. If required, I can post the complete code somewhere. How can I safely pass a list of the required inputs so that the model can predict on them?

EDIT: This is how I have used the OneHotEncoder:

import category_encoders as ce
encoder = ce.OneHotEncoder()

x_train = encoder.fit_transform(X_train)

x_test = encoder.transform(X_test)

CodePudding user response:

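This looks like an error caused by an encoder that selects columns by name (a ColumnTransformer, for example) and therefore expects a pandas DataFrame. pipe.predict is looking for a column named sex, but a plain list carries no column names, so the lookup fails.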
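The failing lookup can be reproduced with pandas alone. This is only a rough sketch, and it assumes the encoder wraps the incoming list in a DataFrame before selecting columns by name; that is an assumption about its internals, not something confirmed in the question. The three values are just the first entries of the list from the question.

```python
import pandas as pd

# A plain nested list has no column names, so pandas labels the columns 0, 1, 2, ...
row = [['M', 15, 'U']]
frame = pd.DataFrame(row)
print(list(frame.columns))  # [0, 1, 2]

# Looking a column up by the name used during training then fails.
error_seen = False
try:
    frame['sex']
except KeyError as exc:
    error_seen = True
    print("KeyError:", exc)
```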

For example, this:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
import pandas as pd

df = pd.DataFrame({
    "zero":  ["A", "B", "C", "A", "B", "C"],
    "one":   [1, 1, 2, 1, 2, 1],
    "two":   [0.5, 0.3, 0.2, 0.1, 0.9, 0.7],
    "label": [0, 0, 0, 1, 1, 1]})

encoder = ColumnTransformer(
    [('ohe', OneHotEncoder(), ["zero", "one"])], remainder="passthrough")

X, y = df.drop(["label"], axis=1), df["label"]

pipe = Pipeline([('ohe', encoder), ('clf', RandomForestClassifier())])
pipe.fit(X, y)
pipe.predict([["A", 1, 0.5]])

Results in (scikit-learn==1.2.0):

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

But switching to a DataFrame that uses the training column names works:

X_test = pd.DataFrame([["A", 1, 0.5]], columns=["zero", "one", "two"])
print(pipe.predict(X_test))
# [0]
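For the deployment case in the question, where values arrive as a plain list, one possible approach is to remember the training column names and rebuild a DataFrame before every call to predict. The sketch below reuses the toy data from this answer. The helper predict_from_rows and the variable feature_columns are hypothetical names introduced for illustration, not part of scikit-learn.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data, same as in the example above.
df = pd.DataFrame({
    "zero":  ["A", "B", "C", "A", "B", "C"],
    "one":   [1, 1, 2, 1, 2, 1],
    "two":   [0.5, 0.3, 0.2, 0.1, 0.9, 0.7],
    "label": [0, 0, 0, 1, 1, 1]})
X, y = df.drop(columns=["label"]), df["label"]

encoder = ColumnTransformer(
    [("ohe", OneHotEncoder(), ["zero", "one"])], remainder="passthrough")
pipe = Pipeline([("ohe", encoder), ("clf", RandomForestClassifier(random_state=0))])
pipe.fit(X, y)

# Assumption: the column order is kept next to the model,
# e.g. pickled together with the pipeline.
feature_columns = list(X.columns)


def predict_from_rows(model, rows, columns):
    """Hypothetical helper: wrap plain lists in a named DataFrame, then predict."""
    for row in rows:
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} values, got {len(row)}")
    frame = pd.DataFrame(rows, columns=columns)
    return model.predict(frame)


predictions = predict_from_rows(pipe, [["A", 1, 0.5]], feature_columns)
print(predictions)
```

In an API handler, the same helper could be called with the list received from the endpoint, provided the values arrive in the same order as feature_columns.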