How to create dummy columns for prediction?

I have trained a scikit-learn model and saved it as a pickle file. Now I want to load the model and run predictions, but I don't know how to preprocess the input data.

import pandas as pd

dataset = {
    'airline': ['SpiceJet', 'Indigo', 'Air_India']
}
df = pd.DataFrame.from_dict(dataset)

The airline column contains three airlines, which are turned into dummy columns with this code:

def preprocessing(df):
    dummies = pd.get_dummies(df["airline"], drop_first=True)
    return dummies

The training dataset will have a schema like this:

| airline_SpiceJet | airline_Indigo | airline_Air_India |

My question: given the input below, how can I map it to the corresponding columns?

input = {
    'airline': ['SpiceJet']
}

The expected output for the dataset:

| airline_SpiceJet | airline_Indigo | airline_Air_India |
| ---------------- | -------------- | ----------------- |
|                1 |              0 |                 0 |

CodePudding user response:

I think the problem is that pandas' get_dummies() derives the dummy columns from the categories present in the data it is given, so a single-row input produces only a single column. This is discussed in the question "Dummy variables when not all categories are present".
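To make the behaviour concrete, here is a minimal sketch that compares the columns produced for the three-airline training data with those produced for a one-row input. The names `train`, `single`, `train_cols` and `single_cols` are only illustrative and do not come from the question.

```python
import pandas as pd

# Training data contains all three airlines.
train = pd.DataFrame({"airline": ["SpiceJet", "Indigo", "Air_India"]})
# A single prediction row contains only one of them.
single = pd.DataFrame({"airline": ["SpiceJet"]})

train_cols = list(pd.get_dummies(train["airline"]).columns)
single_cols = list(pd.get_dummies(single["airline"]).columns)

print(train_cols)   # one column per airline, sorted alphabetically
print(single_cols)  # only the airline that appears in the input
```

The two column lists differ, which is why a model trained on the first layout cannot consume the second one directly.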

Based on the answers there, you can adjust your code to get dummies like this:

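import pandas as pd

dataset = {
    'airline': ['SpiceJet', 'Indigo', 'Air_India']
}

input = {
    'airline': ['SpiceJet']
}

# All categories seen during training, in the desired column order
possible_categories = dataset["airline"]

# Casting to a categorical dtype makes get_dummies emit one column per category,
# even for categories that do not occur in this input
dummy_input = pd.Series(input["airline"])
# dtype=int gives 1/0 instead of True/False on pandas >= 2.0
# display() is available in Jupyter/IPython; use print() in a plain script
display(pd.get_dummies(dummy_input.astype(pd.CategoricalDtype(categories=possible_categories)), dtype=int))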

Output:

SpiceJet	Indigo	Air_India
1	0	0

With more input data, it could look like this:

input_2 = {
    'airline': ['SpiceJet','Indigo','SpiceJet','Indigo','Air_India']
}

dummy_input_2 = pd.Series(input_2["airline"])
# dtype=int gives 1/0 instead of True/False on pandas >= 2.0
display(pd.get_dummies(dummy_input_2.astype(pd.CategoricalDtype(categories=possible_categories)), dtype=int))

SpiceJet	Indigo	Air_India
1	0	0
0	1	0
1	0	0
0	1	0
0	0	1
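If you want the `airline_` prefix from the question's schema, another option is to remember the dummy column names produced at training time and align every new input to them with `reindex`. This is a sketch under assumptions rather than the only way to do it: the helper `make_dummies` and the variable `train_columns` are hypothetical names, and pandas sorts the dummy columns alphabetically, so the order differs from the table in the question.

```python
import pandas as pd

def make_dummies(df, train_columns=None):
    """One-hot encode 'airline'; optionally align to the training columns."""
    dummies = pd.get_dummies(df["airline"], prefix="airline", dtype=int)
    if train_columns is not None:
        # Add missing columns filled with 0, drop unexpected ones, fix the order.
        dummies = dummies.reindex(columns=train_columns, fill_value=0)
    return dummies

# At training time: encode and remember the resulting column names.
train_df = pd.DataFrame({"airline": ["SpiceJet", "Indigo", "Air_India"]})
train_dummies = make_dummies(train_df)
train_columns = list(train_dummies.columns)

# At prediction time: encode the new row and align it to the training layout.
new_df = pd.DataFrame({"airline": ["SpiceJet"]})
aligned = make_dummies(new_df, train_columns)
print(aligned)
```

Storing `train_columns` next to the pickled model keeps the two in sync. If the model was trained with `drop_first=True`, encode the prediction input without it and let `reindex` discard the reference column; otherwise a one-row input would lose its only column.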