I have finished training a scikit-learn model and saved it as a pickle file. Now I want to load the model and run predictions, but I don't know how to preprocess the input data.
import pandas as pd

dataset = {
    'airline': ['SpiceJet', 'Indigo', 'Air_India']
}
df = pd.DataFrame.from_dict(dataset)
The airline column has 3 airlines, which will be used to create dummy columns with this code:
def preprocessing(df):
    dummies = pd.get_dummies(df["airline"], drop_first=True)
    return dummies
The training dataset will have a schema like this:
| airline_SpiceJet | airline_Indigo | airline_Air_India |
My question is: given the input below, how can I map it to the corresponding columns?
input = {
    'airline': ['SpiceJet']
}
The expected output for this input:
| airline_SpiceJet | airline_Indigo | airline_Air_India |
| ---------------- | -------------- | ----------------- |
| 1 | 0 | 0 |
Answer:
I think the problem is that pandas' get_dummies() derives the dummy columns from the values present in the input data, so any category missing from the input gets no column. This is discussed in the question "Dummy variables when not all categories are present".
Based on the answers there, you can adjust your code to get dummies like this:
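import pandas as pd

dataset = {
    'airline': ['SpiceJet', 'Indigo', 'Air_India']
}
input = {
    'airline': ['SpiceJet']
}
# All categories seen during training, in the training column order
possible_categories = dataset["airline"]
dummy_input = pd.Series(input["airline"])
# Casting to a categorical dtype makes get_dummies emit one column per known category.
# dtype=int keeps the 1/0 output; pandas 2.0+ returns booleans by default.
# display() is available in Jupyter/IPython; use print() in a plain script.
display(pd.get_dummies(dummy_input.astype(pd.CategoricalDtype(categories=possible_categories)), dtype=int))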
Output:
| SpiceJet | Indigo | Air_India |
| -------- | ------ | --------- |
| 1 | 0 | 0 |
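To make the difference concrete, here is a minimal sketch comparing plain get_dummies() with the categorical version on the single-row input. The variable names `single`, `plain`, and `typed` are only illustrative, and `dtype=int` is passed so the values are 1/0 regardless of pandas version.

```python
import pandas as pd

possible_categories = ["SpiceJet", "Indigo", "Air_India"]
single = pd.Series(["SpiceJet"])

# Plain get_dummies only creates columns for values present in the input.
plain = pd.get_dummies(single, dtype=int)

# With an explicit categorical dtype, every known category gets a column.
typed = pd.get_dummies(
    single.astype(pd.CategoricalDtype(categories=possible_categories)),
    dtype=int,
)

print(list(plain.columns))  # ['SpiceJet']
print(list(typed.columns))  # ['SpiceJet', 'Indigo', 'Air_India']
```

The first result has a single column, which would not match the three columns the model was trained on; the second has all three in the order given by `possible_categories`.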
With more input data, it could look like this:
input_2 = {
    'airline': ['SpiceJet', 'Indigo', 'SpiceJet', 'Indigo', 'Air_India']
}
dummy_input_2 = pd.Series(input_2["airline"])
# dtype=int keeps the 1/0 output; pandas 2.0+ returns booleans by default
display(pd.get_dummies(dummy_input_2.astype(pd.CategoricalDtype(categories=possible_categories)), dtype=int))
Output:
| SpiceJet | Indigo | Air_India |
| -------- | ------ | --------- |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
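If you want the column names to match the `airline_...` schema from the question, the same idea can be wrapped in a small function. This is only a sketch: `encode_airline` and `TRAIN_AIRLINES` are hypothetical names, and it assumes the model was trained on all three columns in this order (note that `drop_first=True`, as in the question's preprocessing function, would drop the first category's column).

```python
import pandas as pd

# Categories seen at training time (taken from the question's dataset).
TRAIN_AIRLINES = ["SpiceJet", "Indigo", "Air_India"]


def encode_airline(df, categories=TRAIN_AIRLINES):
    """Hypothetical helper: one-hot encode df['airline'] with a fixed column set."""
    typed = df["airline"].astype(pd.CategoricalDtype(categories=categories))
    # prefix="airline" yields airline_SpiceJet, airline_Indigo, airline_Air_India
    return pd.get_dummies(typed, prefix="airline", dtype=int)


new_data = pd.DataFrame({"airline": ["SpiceJet"]})
encoded = encode_airline(new_data)
print(encoded)

# A value that was not in the training categories becomes NaN after the cast,
# so its row is all zeros instead of adding a new column.
unknown = encode_airline(pd.DataFrame({"airline": ["Vistara"]}))
print(unknown)
```

The resulting frame can then be passed to the unpickled model's predict(), provided the columns are in the same order as during training.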