How to make scikit-learn's ColumnTransformer automatically drop extra columns?

Good day. I searched for this without luck. It seems like it should be possible, but I might be reading the API wrong. How can I have scikit-learn automatically drop the extra columns in my testing pandas DataFrame, instead of having to drop those columns explicitly?

I am currently running Python 3.6 in my environment and v 0.24.2 of sklearn.

Here is an example that shows the problem:

import random
from random import choice, randint

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Seed the random module so the generated data is reproducible
random.seed(42)

df = pd.DataFrame({
    'cont_A': [randint(1,10) for _ in range(10)], 
    'cont_B': [randint(-20,20) for _ in range(10)],
    'cat_A': [choice('ABC') for _ in range(10)],
    'cat_B': [choice('XYZ') for _ in range(10)],
})

This will create a dataframe with two categorical columns and two continuous columns.

t = [
    ('cat', OneHotEncoder(), ['cat_A', 'cat_B']),
    ('nums', MinMaxScaler(), ['cont_A', 'cont_B'])
]

columnTransformer = ColumnTransformer(t, remainder='drop')
X_train = columnTransformer.fit_transform(df)
X_train

We can fit-transform our columnTransformer on our initial training data. Now let's say we generate our testing or input data before we want to run our model.

df_test = pd.DataFrame({
    'cont_A': [randint(2,9) for _ in range(3)], 
    'cont_B': [randint(-19,19) for _ in range(3)],
    'cat_A': [choice('ABC') for _ in range(3)],
    'cat_B': [choice('XYZ') for _ in range(3)],
    'extra_A': [randint(1,5) for _ in range(3)], 
    'extra_B': [randint(1,5) for _ in range(3)], 
    'extra_C': [randint(1,5) for _ in range(3)], 
})

This testing dataframe has 3 extra columns that are of no value to us. I want the columnTransformer to drop them automatically and process the remaining columns (if this is possible), without my having to drop them explicitly.

If I run the columnTransformer on this data:

LST = columnTransformer.transform(df_test)

It will cause an error:

ValueError: X has 7 features, but ColumnTransformer is expecting 4 features as input.

I thought setting remainder='drop' would address this, but it does not seem to help. However, if I explicitly drop those columns first, it runs:

df_test_dropped = df_test.drop(['extra_A', 'extra_B', 'extra_C'], axis=1)
LST = columnTransformer.transform(df_test_dropped)

How can I (if it's even possible) have columnTransformer automatically drop non-relevant columns (instead of having to explicitly drop them)?

CodePudding user response:

remainder='drop' tells the transformer to drop the columns of the training set that are not assigned to any of the transformers. It does not tell it to ignore additional columns in a test set, and there is currently (as of the 0.24.2 release mentioned in the question) no way to accomplish that: all estimators expect to receive inputs in the same format at fitting and at transform/prediction time.
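Since the input has to look the same at fit and at transform time, one workaround is to remember the column names seen during fitting and select only those from the test frame before calling transform. A minimal sketch, assuming the training frame (or its column list) is still available; the names train, test and fit_columns are illustrative and not part of scikit-learn:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Small hand-written frames standing in for the question's random data
train = pd.DataFrame({
    "cont_A": [1, 5, 10],
    "cont_B": [-20, 0, 20],
    "cat_A": ["A", "B", "C"],
    "cat_B": ["X", "Y", "Z"],
})
test = pd.DataFrame({
    "cont_A": [2, 9],
    "cont_B": [-19, 19],
    "cat_A": ["A", "C"],
    "cat_B": ["Z", "X"],
    "extra_A": [1, 2],  # column that was not present at fit time
})

ct = ColumnTransformer(
    [
        ("cat", OneHotEncoder(), ["cat_A", "cat_B"]),
        ("nums", MinMaxScaler(), ["cont_A", "cont_B"]),
    ],
    remainder="drop",
)
ct.fit(train)

# Remember the columns (and their order) used at fit time
fit_columns = list(train.columns)

# Select exactly those columns, so the extra ones never reach the transformer
X_test = ct.transform(test[fit_columns])
```

Indexing with the stored list also restores the original column order, and it raises a KeyError if a required column is missing from the test frame, which is usually what you want.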

CodePudding user response:

In this line

columnTransformer = ColumnTransformer(t, remainder='drop')

the keyword argument remainder='drop' already drops the columns that are not listed in the transformer list t.
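To see what remainder does on the data passed to fit, here is a small sketch comparing 'drop' with 'passthrough'. The frame and the column name unused are made up for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({
    "cont_A": [1, 5, 10],
    "cont_B": [-20, 0, 20],
    "unused": [7, 8, 9],  # not assigned to any transformer
})

# remainder='drop': the unassigned column is left out of the output
ct_drop = ColumnTransformer(
    [("nums", MinMaxScaler(), ["cont_A", "cont_B"])],
    remainder="drop",
)
X_drop = ct_drop.fit_transform(train)

# remainder='passthrough': the unassigned column is appended unchanged
ct_pass = ColumnTransformer(
    [("nums", MinMaxScaler(), ["cont_A", "cont_B"])],
    remainder="passthrough",
)
X_pass = ct_pass.fit_transform(train)

print(X_drop.shape, X_pass.shape)
```

In both cases the column unused was present when fit was called. The setting decides what happens to such columns; it says nothing about columns that first appear at transform time.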