From my own experiments and from reading answers here on Stack Overflow, passing a pandas DataFrame to .predict() successfully returns a prediction, as shown below:
pipe = Pipeline([('OneHotEncoder', encoder), ('RobustScaler', scaler),('RandomForestRegressor',RFregsr)])
pipe.fit(X_train, y_train)
with open('trained_RFregsr.pkl', 'wb') as f:
    pickle.dump(pipe, f)
test = pipe.predict(X[0:1])
print(test)
>> [10.82638889]
But when I pass a list containing all the required input values (25 in my case), it raises a KeyError. I think this is related to how iterating over a pandas DataFrame yields the column names rather than the values.
test = pipe.predict([['M', 15, 'U', 'LE3', 'T', 4, 3, 'teacher', 'services', 1, 3, 0,
'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 5, 4, 4, 2, 15, 16]])
print(test)
>> KeyError : 'sex'
I trained the model on 25 input features, a mix of categorical and numerical values, to predict a single integer value. As for why I am pickling the pipeline: I have to deploy it with FastAPI, where it will receive input from API endpoints. I can post the complete code somewhere if required. How can I safely pass a list of the required inputs so that the model can predict on them?
EDIT: This is how I have used the OneHotEncoder:
import category_encoders as ce
encoder = ce.OneHotEncoder()
x_train = encoder.fit_transform(X_train)
x_test = encoder.transform(X_test)
CodePudding user response:
This looks like an error where encoder is a ColumnTransformer expecting a pandas DataFrame. pipe.predict is looking for a column named sex, but not finding one.
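To make the missing-column lookup concrete, here is a minimal sketch using pandas only. It assumes the encoder ends up with a DataFrame that has no column names. The names age and address are hypothetical stand-ins for the real feature names, since only sex appears in the error message.

```python
import pandas as pd

# One input row as a plain list; a list carries no column names.
row = [["M", 15, "U"]]

# Wrapping the list without names gives integer labels 0, 1, 2.
unnamed = pd.DataFrame(row)
print(list(unnamed.columns))

# Looking up a column by the name learned at fit time then fails.
try:
    unnamed["sex"]
except KeyError as err:
    print("KeyError:", err)

# With explicit names (hypothetical here), the same lookup succeeds.
named = pd.DataFrame(row, columns=["sex", "age", "address"])
print(named["sex"].iloc[0])
```

Any transformer that selects columns by name will hit the same failure when the names are absent.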
For example, this:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
import pandas as pd
df = pd.DataFrame({
    "zero": ["A", "B", "C", "A", "B", "C"],
    "one": [1, 1, 2, 1, 2, 1],
    "two": [0.5, 0.3, 0.2, 0.1, 0.9, 0.7],
    "label": [0, 0, 0, 1, 1, 1]})
encoder = ColumnTransformer(
    [('ohe', OneHotEncoder(), ["zero", "one"])], remainder="passthrough")
X, y = df.drop(["label"], axis=1), df["label"]
pipe = Pipeline([('ohe', encoder), ('clf', RandomForestClassifier())])
pipe.fit(X, y)
pipe.predict([["A", 1, 0.5]])
Results in (scikit-learn==1.2.0):
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
But switching to a DataFrame with the same column names as the training data works:
X_test = pd.DataFrame([["A", 1, 0.5]], columns=["zero", "one", "two"])
print(pipe.predict(X_test))
# [0]
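For the deployment case in the question, where values arrive as plain lists, one possible approach is to keep the training column order and rebuild a DataFrame before calling predict. The sketch below reuses the toy data from the example above. The helper rows_to_frame and the constant FEATURE_COLUMNS are hypothetical names introduced for illustration, not part of scikit-learn or pandas. The handle_unknown="ignore" and random_state=0 settings are my additions, not something the original pipeline is known to use.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data, same shape as the example above.
df = pd.DataFrame({
    "zero": ["A", "B", "C", "A", "B", "C"],
    "one": [1, 1, 2, 1, 2, 1],
    "two": [0.5, 0.3, 0.2, 0.1, 0.9, 0.7],
    "label": [0, 0, 0, 1, 1, 1]})
X, y = df.drop(columns=["label"]), df["label"]

# Assumption: the column order is stored alongside the pickled pipeline.
FEATURE_COLUMNS = list(X.columns)

encoder = ColumnTransformer(
    [("ohe", OneHotEncoder(handle_unknown="ignore"), ["zero", "one"])],
    remainder="passthrough")
pipe = Pipeline([("ohe", encoder),
                 ("clf", RandomForestClassifier(random_state=0))])
pipe.fit(X, y)


def rows_to_frame(rows, columns=FEATURE_COLUMNS):
    """Hypothetical helper: wrap plain lists in a DataFrame that
    carries the training column names, checking each row's length."""
    for row in rows:
        if len(row) != len(columns):
            raise ValueError(
                f"expected {len(columns)} values, got {len(row)}")
    return pd.DataFrame(rows, columns=columns)


frame = rows_to_frame([["A", 1, 0.5]])
prediction = pipe.predict(frame)
print(prediction)
```

In an API handler, the incoming values would go through rows_to_frame before pipe.predict, so the pipeline always receives named columns. The values must be listed in the same order as the training columns.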