How to keep the number and names of columns in training and test dataset equal after one hot encoding

The original dataset has shape 82580×30, with multiple string columns. Example dataset:

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

df = pd.DataFrame({'Nationality': {0: 'DEU', 1: 'PRT', 2: 'PRT', 3: 'PRT', 4: 'FRA', 5: 'DEU', 6: 'CHE', 7: 'DEU', 8: 'GBR', 9: 'AUT', 10: 'PRT', 11: 'FRA', 12: 'OTR', 13: 'GBR', 14: 'ESP', 15: 'PRT', 16: 'OTR', 17: 'PRT', 18: 'ESP', 19: 'AUT'},
                   'Age': {0: 27.0, 1: 45.46, 2: 45.46, 3: 58.0, 4: 57.0, 5: 27.0, 6: 49.0, 7: 62.0, 8: 44.0, 9: 61.0, 10: 54.0, 11: 53.0, 12: 50.0, 13: 30.0, 14: 51.0, 15: 45.46, 16: 40.0, 17: 49.0, 18: 49.0, 19: 14.0},
                   'DaysSinceCreation': {0: 370, 1: 213, 2: 206, 3: 1018, 4: 835, 5: 52, 6: 597, 7: 217, 8: 999, 9: 1004, 10: 402, 11: 879, 12: 393, 13: 923, 14: 249, 15: 52, 16: 159, 17: 929, 18: 49, 19: 131},
                   'BookingsCheckedIn': {0: 1, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1, 7: 2, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 0, 16: 0, 17: 1, 18: 1, 19: 0}})


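# Encoding Variables
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`, and 1.4 removed `sparse`;
# on those versions use OneHotEncoder(sparse_output=False) instead.
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough')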
    
transformed = transformer.fit_transform(df)
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
    
# Concat the two tables
transformed_df.reset_index(drop=True, inplace=True)
df.reset_index(drop=True, inplace=True)
df = pd.concat([transformed_df, df], axis=1)
    
# Remove old columns
df.drop(['Nationality'], axis = 1, inplace = True)
print('The shape after encoding: {}'.format(df.shape))
print(df.columns.unique())

The shape after encoding: (20, 14)
Index(['onehotencoder__Nationality_AUT', 'onehotencoder__Nationality_CHE',
       'onehotencoder__Nationality_DEU', 'onehotencoder__Nationality_ESP',
       'onehotencoder__Nationality_FRA', 'onehotencoder__Nationality_GBR',
       'onehotencoder__Nationality_OTR', 'onehotencoder__Nationality_PRT',
       'remainder__Age', 'remainder__DaysSinceCreation',
       'remainder__BookingsCheckedIn', 'Age', 'DaysSinceCreation',
       'BookingsCheckedIn'],
      dtype='object')

After modeling, I try to test on a completely different test set:

df = pd.DataFrame({'Nationality': {0: 'CAN', 1: 'DEU', 2: 'PRT', 3: 'PRT', 4: 'FRA'},
                   'Age': {0: 27.0, 1: 29.0, 2: 24.0, 3: 24.0, 4: 46.0},
                   'DaysSinceCreation': {0: 222, 1: 988, 2: 212, 3: 685, 4: 1052},
                   'BookingsCheckedIn': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}})

# Encoding Variables
transformer = make_column_transformer(
    (OneHotEncoder(sparse=False), ['Nationality']),
    remainder='passthrough')

transformed = transformer.fit_transform(df)
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())

# Concat the two tables
transformed_df.reset_index(drop=True, inplace=True)
df.reset_index(drop=True, inplace=True)
df = pd.concat([transformed_df, df], axis=1)

# Remove old columns
df.drop(['Nationality'], axis = 1, inplace = True)
print('The shape after encoding: {}'.format(df.shape))
print(df.columns.unique())

The shape after encoding: (5, 10)
Index(['onehotencoder__Nationality_CAN', 'onehotencoder__Nationality_DEU',
       'onehotencoder__Nationality_FRA', 'onehotencoder__Nationality_PRT',
       'remainder__Age', 'remainder__DaysSinceCreation',
       'remainder__BookingsCheckedIn', 'Age', 'DaysSinceCreation',
       'BookingsCheckedIn'],
      dtype='object')

As can be seen, the test dataset has features that were not present in the original training set (e.g. Nationality_CAN), and many features of the training set are missing from the test set. If I pass only the .values of X_train, y_train, X_test and y_test, I can run everything from logistic regression to a neural net with >99% accuracy, but that feels like cheating, and it does not work with decision trees. How do we deal with this?

CodePudding user response:

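I would like to contribute two points:

(1) The categories in the test set should be a subset of the categories in the training set, so the unseen Nationality 'CAN' is not allowed: an encoder fitted on the training set has no column for it and, by default, raises an error when asked to transform it. Either include 'CAN' in the training data, or replace it in the test data with a category the encoder has already seen, such as 'GBR'.

(2) Do not call fit_transform() separately on the training set and the test set. The right way is to fit on the training set only, then use that same fitted transformer to transform the training set and to transform the test set. To illustrate: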

# Encoding Variables
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough')
    
####transformed = transformer.fit_transform(df)  #delete this
transformer.fit(df)                              #use this instead
transformed = transformer.transform(df)          #use this instead
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
    
# Concat the two tables
<truncated>

print('The shape after encoding: {}'.format(df.shape))
The shape after encoding: (20, 14)

For the second part, note that I have replaced 'CAN' with 'GBR', and that only the previously fitted transformer is used to transform the test set:

df = pd.DataFrame({'Nationality': {0: 'GBR', 1: 'DEU', 2: 'PRT', 3: 'PRT', 4: 'FRA'},
                   'Age': {0: 27.0, 1: 29.0, 2: 24.0, 3: 24.0, 4: 46.0},
                   'DaysSinceCreation': {0: 222, 1: 988, 2: 212, 3: 685, 4: 1052},
                   'BookingsCheckedIn': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}})

# Encoding Variables
####transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough')    #do not repeat, use the previous fitted model

####transformed = transformer.fit_transform(df)   #delete this, NO fitting on test set
transformed = transformer.transform(df)           #only do transform on test set
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())

# Concat the two tables
<truncated>

print('The shape after encoding: {}'.format(df.shape))
The shape after encoding: (5, 14)

So the number of columns (14) is the same for both the training set and the test set.