How to keep the number and names of columns in training and test dataset equal after one hot encoding

Time:08-24

The original dataset has shape 82580×30 and contains multiple string columns. An example dataset:

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

df = pd.DataFrame({'Nationality': {0: 'DEU', 1: 'PRT', 2: 'PRT', 3: 'PRT', 4: 'FRA', 5: 'DEU', 6: 'CHE', 7: 'DEU', 8: 'GBR', 9: 'AUT', 10: 'PRT', 11: 'FRA', 12: 'OTR', 13: 'GBR', 14: 'ESP', 15: 'PRT', 16: 'OTR', 17: 'PRT', 18: 'ESP', 19: 'AUT'},
                   'Age': {0: 27.0, 1: 45.46, 2: 45.46, 3: 58.0, 4: 57.0, 5: 27.0, 6: 49.0, 7: 62.0, 8: 44.0, 9: 61.0, 10: 54.0, 11: 53.0, 12: 50.0, 13: 30.0, 14: 51.0, 15: 45.46, 16: 40.0, 17: 49.0, 18: 49.0, 19: 14.0},
                   'DaysSinceCreation': {0: 370, 1: 213, 2: 206, 3: 1018, 4: 835, 5: 52, 6: 597, 7: 217, 8: 999, 9: 1004, 10: 402, 11: 879, 12: 393, 13: 923, 14: 249, 15: 52, 16: 159, 17: 929, 18: 49, 19: 131},
                   'BookingsCheckedIn': {0: 1, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1, 7: 2, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 0, 16: 0, 17: 1, 18: 1, 19: 0}})


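# Encoding Variables
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer releases need OneHotEncoder(sparse_output=False)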
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough')
    
transformed = transformer.fit_transform(df)
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
    
# Concat the two tables
transformed_df.reset_index(drop=True, inplace=True)
df.reset_index(drop=True, inplace=True)
df = pd.concat([transformed_df, df], axis=1)
    
# Remove old columns
df.drop(['Nationality'], axis = 1, inplace = True)
print('The shape after encoding: {}'.format(df.shape))
print(df.columns.unique())

The shape after encoding: (20, 14)
Index(['onehotencoder__Nationality_AUT', 'onehotencoder__Nationality_CHE',
       'onehotencoder__Nationality_DEU', 'onehotencoder__Nationality_ESP',
       'onehotencoder__Nationality_FRA', 'onehotencoder__Nationality_GBR',
       'onehotencoder__Nationality_OTR', 'onehotencoder__Nationality_PRT',
       'remainder__Age', 'remainder__DaysSinceCreation',
       'remainder__BookingsCheckedIn', 'Age', 'DaysSinceCreation',
       'BookingsCheckedIn'],
      dtype='object')

After modeling, I try to evaluate on a completely separate test set:

df = pd.DataFrame({'Nationality': {0: 'CAN', 1: 'DEU', 2: 'PRT', 3: 'PRT', 4: 'FRA'},
                   'Age': {0: 27.0, 1: 29.0, 2: 24.0, 3: 24.0, 4: 46.0},
                   'DaysSinceCreation': {0: 222, 1: 988, 2: 212, 3: 685, 4: 1052},
                   'BookingsCheckedIn': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}})

# Encoding Variables
transformer = make_column_transformer(
    (OneHotEncoder(sparse=False), ['Nationality']),
    remainder='passthrough')

transformed = transformer.fit_transform(df)
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())

# Concat the two tables
transformed_df.reset_index(drop=True, inplace=True)
df.reset_index(drop=True, inplace=True)
df = pd.concat([transformed_df, df], axis=1)

# Remove old columns
df.drop(['Nationality'], axis = 1, inplace = True)
print('The shape after encoding: {}'.format(df.shape))
print(df.columns.unique())

The shape after encoding: (5, 10)
Index(['onehotencoder__Nationality_CAN', 'onehotencoder__Nationality_DEU',
       'onehotencoder__Nationality_FRA', 'onehotencoder__Nationality_PRT',
       'remainder__Age', 'remainder__DaysSinceCreation',
       'remainder__BookingsCheckedIn', 'Age', 'DaysSinceCreation',
       'BookingsCheckedIn'],
      dtype='object')

As can be seen, the test set has a feature that was not present in the training set ('CAN'), and many features of the training set are missing from the test set. If I only use the .values of X_train, y_train, X_test, and y_test, I can run everything from logistic regression to a neural net with >99% accuracy, but that feels like cheating, and it does not work with decision trees. How do we deal with this?

CodePudding user response:

Two suggestions:

(1) Every category in the test set should already appear in the training set, so the unseen Nationality 'CAN' is a problem. Either include 'CAN' in the training data, or replace it in the test data with a category the encoder has already seen (I use 'GBR' in the example further down).

(2) You should not call fit_transform() separately on the training set and the test set. The right way is to fit on the training set only, then transform both the training set and the test set with that same fitted transformer. To illustrate:

# Encoding Variables
transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough')
    
####transformed = transformer.fit_transform(df)  #delete this
transformer.fit(df)                              #use this instead
transformed = transformer.transform(df)          #use this instead
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())
    
# Concat the two tables
<truncated>

print('The shape after encoding: {}'.format(df.shape))
The shape after encoding: (20, 14)

For the second part, note that I have replaced 'CAN' with 'GBR', and that only the previously fitted transformer is used to transform the test set:

df = pd.DataFrame({'Nationality': {0: 'GBR', 1: 'DEU', 2: 'PRT', 3: 'PRT', 4: 'FRA'},
                   'Age': {0: 27.0, 1: 29.0, 2: 24.0, 3: 24.0, 4: 46.0},
                   'DaysSinceCreation': {0: 222, 1: 988, 2: 212, 3: 685, 4: 1052},
                   'BookingsCheckedIn': {0: 0, 1: 1, 2: 1, 3: 1, 4: 0}})

# Encoding Variables
####transformer = make_column_transformer((OneHotEncoder(sparse=False), ['Nationality']), remainder='passthrough')    #do not repeat, use the previous fitted model

####transformed = transformer.fit_transform(df)   #delete this, NO fitting on test set
transformed = transformer.transform(df)           #only do transform on test set
transformed_df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())

# Concat the two tables
<truncated>

print('The shape after encoding: {}'.format(df.shape))
The shape after encoding: (5, 14)

So the number of columns (14) is the same for both the training set and the test set.
