I'm trying to build a data preprocessing pipeline with sklearn Pipeline and ColumnTransformer. The preprocessing steps consist of imputing values and applying transformations (power transform, scaling, or OHE) to specific columns in parallel. This preprocessing ColumnTransformer works perfectly.
However, after doing some analysis on the preprocessed result, I decided to exclude some columns from the final result. My goal is to have one pipeline starting from the original dataframe that imputes and transforms values, excludes pre-selected columns, and triggers the model fitting all in one. So to be clear, I don't want to drop columns after the pipeline is fitted/transformed; I want the process of dropping columns to be part of the column transformation itself.
It's easy to remove the numerical columns from the model (by simply not adding them), but how can I exclude the columns created by OHE? I don't want to exclude all columns created by OHE, just some of them. For example, if categorical column "Example" becomes Example_1, Example_2, and Example_3, how can I exclude only Example_2?
Example code:
### Importing libraries
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
SimpleImputer.get_feature_names_out = (lambda self, names = None: self.feature_names_in_) # SimpleImputer does not have get_feature_names_out, so we need to add it manually.
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
### Dummy dataframe
df_foo = pd.DataFrame({'Num_col' : [1, 2, np.nan, 4, 5, 6, 7, np.nan, 9],
                       'Example' : ['A', 'B', 'C', 'A', 'B', 'A', 'A', 'C', 'C'],
                       'another_col' : range(10, 100, 10)})
### Pipelines
SimpImpMean_MinMaxScaler = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="mean")),
    ('MinMaxScaler', MinMaxScaler()),
])
SimpImpConstNoAns_OHE = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="constant", fill_value='no_answer')),
    ('OHE', OneHotEncoder(sparse=False, drop='if_binary', categories='auto')),
])
### ColumnTransformer
preprocessor_transformer = ColumnTransformer([
    ('pipeline-1', SimpImpMean_MinMaxScaler, ['Num_col']),
    ('pipeline-2', SimpImpConstNoAns_OHE, ['Example'])
    ],
    remainder='drop',
    verbose_feature_names_out=False)
preprocessor_transformer
### Preprocessing dummy dataframe
df_foo = pd.DataFrame(preprocessor_transformer.fit_transform(df_foo),
                      columns=preprocessor_transformer.get_feature_names_out())
print(df_foo)
Finally, I've seen this solution out there (Adding Dropping Column instance into a Pipeline), but I didn't manage to make the custom columnDropperTransformer work in my case. Adding the columnDropperTransformer to my pipeline raises a ValueError: A given column is not a column of the dataframe, referring to column "Example" no longer existing in the dataframe at that point.
class columnDropperTransformer():
    def __init__(self, columns):
        self.columns = columns
    def transform(self, X, y=None):
        return X.drop(self.columns, axis=1)
    def fit(self, X, y=None):
        return self

processor = make_pipeline(preprocessor_transformer, columnDropperTransformer([]))
processor.fit_transform(df_foo)
Any suggestions?
CodePudding user response:
As per my experience, and as of today, automating these kinds of treatments in sklearn is not that easy, for the following reasons:

- The steps performed by the Pipeline (when calling .fit_transform() on it) make you lose the DataFrame structure (the pandas DataFrame becomes a numpy array). I'd suggest reading "How to be sure that sklearn piepline applies fit_transform method when using feature selection and ML model in piepline?" and "how to use ColumnTransformer() to return a dataframe?" for some details on what happens when calling .fit() or .fit_transform() on a Pipeline instance.
- In turn, "columns" in a numpy array can't be referenced via their names anymore, but only positionally.
Here are a couple of solutions which are not scalable imo, but that can work for your case:
You can add a step in your pipeline which is only intended to transform the numpy array (the standard output of the intermediate transformations performed by the pipeline) back into a pandas DataFrame, via the ColumnExtractor transformer. Once you have a DataFrame, you can exploit the columnDropperTransformer of the referenced link to drop the column Example_B via its name.

from sklearn.base import BaseEstimator, TransformerMixin

class ColumnExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns
    def transform(self, X, *_):
        return pd.DataFrame(X, columns=self.columns)
    def fit(self, *_):
        return self

class columnDropperTransformer():
    def __init__(self, columns):
        self.columns = columns
    def transform(self, X, y=None):
        return X.drop(self.columns, axis=1)
    def fit(self, X, y=None):
        return self
The ColumnExtractor transformer is only intended to map the resulting array into a DataFrame. The clear disadvantage is that you'll need to manually specify the columns you would like your DataFrame to be made of.

from sklearn.impute import SimpleImputer
SimpleImputer.get_feature_names_out = (lambda self, names=None: self.feature_names_in_)  # SimpleImputer does not have get_feature_names_out, so we need to add it manually.
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
import pandas as pd
import numpy as np

### Dummy dataframe
df_foo = pd.DataFrame({'Num_col' : [1, 2, np.nan, 4, 5, 6, 7, np.nan, 9],
                       'Example' : ['A', 'B', 'C', 'A', 'B', 'A', 'A', 'C', 'C'],
                       'another_col' : range(10, 100, 10)})

SimpImpMean_MinMaxScaler = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="mean")),
    ('MinMaxScaler', MinMaxScaler()),
])
SimpImpConstNoAns_OHE = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="constant", fill_value='no_answer')),
    ('OHE', OneHotEncoder(sparse=False, drop='if_binary', categories='auto')),
])

### ColumnTransformer
preprocessor_transformer = ColumnTransformer([
    ('pipeline-1', SimpImpMean_MinMaxScaler, ['Num_col']),
    ('pipeline-2', SimpImpConstNoAns_OHE, ['Example'])
    ],
    remainder='drop',
    verbose_feature_names_out=False)

processor = make_pipeline(
    preprocessor_transformer,
    ColumnExtractor(['Num_col', 'Example_A', 'Example_B', 'Example_C']))

f_processor = make_pipeline(
    processor,
    columnDropperTransformer('Example_B'))

f_processor.fit_transform(df_foo)
You can reference the "columns" of the array which comes out of the Pipeline transformations positionally (i.e. by index). For instance, you might define such a dummy transformer:

class NumpyColumnSelector():
    def __init__(self):
        pass
    def transform(self, X, y=None):
        return X[:, [0, 1, 3]]
    def fit(self, X, y=None):
        return self

which retains all columns but the one corresponding to Example_B.

from sklearn.impute import SimpleImputer
SimpleImputer.get_feature_names_out = (lambda self, names=None: self.feature_names_in_)  # SimpleImputer does not have get_feature_names_out, so we need to add it manually.
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
import numpy as np

### Dummy dataframe
df_foo = pd.DataFrame({'Num_col' : [1, 2, np.nan, 4, 5, 6, 7, np.nan, 9],
                       'Example' : ['A', 'B', 'C', 'A', 'B', 'A', 'A', 'C', 'C'],
                       'another_col' : range(10, 100, 10)})

SimpImpMean_MinMaxScaler = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="mean")),
    ('MinMaxScaler', MinMaxScaler()),
])
SimpImpConstNoAns_OHE = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="constant", fill_value='no_answer')),
    ('OHE', OneHotEncoder(sparse=False, drop='if_binary', categories='auto')),
])

### ColumnTransformer
preprocessor_transformer = ColumnTransformer([
    ('pipeline-1', SimpImpMean_MinMaxScaler, ['Num_col']),
    ('pipeline-2', SimpImpConstNoAns_OHE, ['Example'])
    ],
    remainder='drop',
    verbose_feature_names_out=False)

f_processor = make_pipeline(preprocessor_transformer, NumpyColumnSelector())
f_processor.fit_transform(df_foo)
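The positional selection itself can be checked in isolation. Here is a small self-contained sketch with the index list turned into a parameter (an assumption on my part; the version above hard-codes [0, 1, 3]):

```python
import numpy as np

class NumpyColumnSelector:
    """Keep only the given positional columns of a numpy array."""
    def __init__(self, keep):
        self.keep = keep
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X[:, self.keep]

X = np.arange(8).reshape(2, 4)   # two rows, columns 0..3
out = NumpyColumnSelector([0, 1, 3]).transform(X)
print(out)
# [[0 1 3]
#  [4 5 7]]
```

The obvious fragility is that the indices silently go stale if the upstream encoder ever produces a different number or order of columns.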
CodePudding user response:
For the very specific use-case of removing dummy columns generated by the OHE (+1 to amiola for a more generic answer), you can specify categories and handle_unknown='ignore'. In your example, replacing the OHE line with this:
('OHE', OneHotEncoder(sparse=False, categories=[['A', 'C']], handle_unknown='ignore')),
produces this:
Num_col Example_A Example_C
0 0.000000 1.0 0.0
1 0.125000 0.0 0.0
2 0.482143 0.0 1.0
3 0.375000 1.0 0.0
4 0.500000 0.0 0.0
5 0.625000 1.0 0.0
6 0.750000 1.0 0.0
7 0.482143 0.0 1.0
8 1.000000 0.0 1.0
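A minimal, self-contained sketch of the trick on a toy column (the dense-output keyword changed name across sklearn versions — sparse vs. sparse_output — so .toarray() is used here to stay version-agnostic):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Only 'A' and 'C' are declared as categories; 'B' is treated as unknown
# and, with handle_unknown='ignore', encoded as all-zeros instead of
# raising an error at fit/transform time.
enc = OneHotEncoder(categories=[['A', 'C']], handle_unknown='ignore')
X = np.array([['A'], ['B'], ['C']])
out = enc.fit_transform(X).toarray()
print(out)
# [[1. 0.]
#  [0. 0.]
#  [0. 1.]]
```

This is why the Example_B column simply never appears in the output above: 'B' is not a declared category, so its rows get zeros in every dummy column.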