I'm trying to build a data preprocessing pipeline with sklearn Pipeline and ColumnTransformer. The preprocessing steps consist of imputing values and applying transformations (power transform, scaling, or OHE) to specific columns in parallel. This preprocessing ColumnTransformer works perfectly.
However, after doing some analysis on the preprocessed result, I decided to exclude some columns from the final result. My goal is to have one pipeline starting from the original dataframe that imputes and transforms values, excludes pre-selected columns, and triggers the model fitting all in one. So to be clear, I don't want to drop columns after the pipeline is fitted/transformed; I want the process of dropping columns to be part of the column transformation itself.
It's easy to remove the numerical columns from the model (by simply not adding them), but how can I exclude the columns created by OHE? I don't want to exclude all columns created by OHE, just some of them. For example, if categorical column "Example" becomes Example_1, Example_2, and Example_3, how can I exclude only Example_2?
Example code:
### Importing libraries
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
SimpleImputer.get_feature_names_out = (lambda self, names = None: self.feature_names_in_) # SimpleImputer does not have get_feature_names_out, so we need to add it manually.
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
### Dummy dataframe
df_foo = pd.DataFrame({'Num_col' : [1, 2, np.nan, 4, 5, 6, 7, np.nan, 9],
                       'Example' : ['A', 'B', 'C', 'A', 'B', 'A', 'A', 'C', 'C'],
                       'another_col' : range(10, 100, 10)})
### Pipelines
SimpImpMean_MinMaxScaler = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="mean")),
    ('MinMaxScaler', MinMaxScaler()),
])
SimpImpConstNoAns_OHE = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="constant", fill_value='no_answer')),
    ('OHE', OneHotEncoder(sparse=False, drop='if_binary', categories='auto')),
])
### ColumnTransformer
preprocessor_transformer = ColumnTransformer([
    ('pipeline-1', SimpImpMean_MinMaxScaler, ['Num_col']),
    ('pipeline-2', SimpImpConstNoAns_OHE, ['Example'])
    ],
    remainder='drop',
    verbose_feature_names_out=False)
preprocessor_transformer
### Preprocessing dummy dataframe
df_foo = pd.DataFrame(preprocessor_transformer.fit_transform(df_foo),
                      columns=preprocessor_transformer.get_feature_names_out())
print(df_foo)
Finally, I've seen this solution out there (Adding Dropping Column instance into a Pipeline), but I didn't manage to make the custom columnDropperTransformer work in my case. Adding the columnDropperTransformer to my pipeline raises a ValueError: A given column is not a column of the dataframe, referring to column "Example" no longer existing in the dataframe at that point.
class columnDropperTransformer():
    def __init__(self, columns):
        self.columns = columns
    def transform(self, X, y=None):
        return X.drop(self.columns, axis=1)
    def fit(self, X, y=None):
        return self

processor = make_pipeline(preprocessor_transformer, columnDropperTransformer([]))
processor.fit_transform(df_foo)
Any suggestions?
CodePudding user response:
As per my experience, and as of today, automating these kinds of treatments in sklearn is not that easy, for the following reasons:

- The steps performed by the Pipeline (when calling .fit_transform() on it) make you lose the DataFrame structure (the pandas DataFrame becomes a numpy array). I'd suggest reading "How to be sure that sklearn piepline applies fit_transform method when using feature selection and ML model in piepline?" and "how to use ColumnTransformer() to return a dataframe?" for some details on what happens when calling .fit() or .fit_transform() on a Pipeline instance.
- In turn, "columns" in a numpy array can't be referenced via their names anymore, but only positionally.
Here are a couple of solutions which are not scalable imo, but that can work for your case:
You can add a step in your pipeline which is only intended to transform the numpy array (the standard output of the intermediate transformations performed by the pipeline) back into a pandas DataFrame, via the ColumnExtractor transformer. Once you have a DataFrame, you can exploit the columnDropperTransformer of the referenced link to drop the column Example_B via its name.

from sklearn.base import BaseEstimator, TransformerMixin

class ColumnExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns
    def transform(self, X, *_):
        return pd.DataFrame(X, columns=self.columns)
    def fit(self, *_):
        return self

class columnDropperTransformer():
    def __init__(self, columns):
        self.columns = columns
    def transform(self, X, y=None):
        return X.drop(self.columns, axis=1)
    def fit(self, X, y=None):
        return self
The ColumnExtractor transformer is only intended to map the resulting array into a DataFrame. The clear disadvantage is that you'll need to manually specify the columns you would like your DataFrame to be made of.

from sklearn.impute import SimpleImputer
SimpleImputer.get_feature_names_out = (lambda self, names=None: self.feature_names_in_)  # SimpleImputer does not have get_feature_names_out, so we need to add it manually.
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
import pandas as pd
import numpy as np

### Dummy dataframe
df_foo = pd.DataFrame({'Num_col' : [1, 2, np.nan, 4, 5, 6, 7, np.nan, 9],
                       'Example' : ['A', 'B', 'C', 'A', 'B', 'A', 'A', 'C', 'C'],
                       'another_col' : range(10, 100, 10)})

SimpImpMean_MinMaxScaler = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="mean")),
    ('MinMaxScaler', MinMaxScaler()),
])
SimpImpConstNoAns_OHE = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="constant", fill_value='no_answer')),
    ('OHE', OneHotEncoder(sparse=False, drop='if_binary', categories='auto')),
])

### ColumnTransformer
preprocessor_transformer = ColumnTransformer([
    ('pipeline-1', SimpImpMean_MinMaxScaler, ['Num_col']),
    ('pipeline-2', SimpImpConstNoAns_OHE, ['Example'])
    ],
    remainder='drop',
    verbose_feature_names_out=False)

processor = make_pipeline(
    preprocessor_transformer,
    ColumnExtractor(['Num_col', 'Example_A', 'Example_B', 'Example_C']))

f_processor = make_pipeline(
    processor,
    columnDropperTransformer('Example_B'))

f_processor.fit_transform(df_foo)
You can reference the "columns" of the array which comes out of the Pipeline transformations positionally (i.e. by index). For instance, you might define such a dummy transformer:

class NumpyColumnSelector():
    def __init__(self):
        pass
    def transform(self, X, y=None):
        return X[:, [0, 1, 3]]
    def fit(self, X, y=None):
        return self

which retains all columns but the one corresponding to Example_B.

from sklearn.impute import SimpleImputer
SimpleImputer.get_feature_names_out = (lambda self, names=None: self.feature_names_in_)  # SimpleImputer does not have get_feature_names_out, so we need to add it manually.
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
import numpy as np

### Dummy dataframe
df_foo = pd.DataFrame({'Num_col' : [1, 2, np.nan, 4, 5, 6, 7, np.nan, 9],
                       'Example' : ['A', 'B', 'C', 'A', 'B', 'A', 'A', 'C', 'C'],
                       'another_col' : range(10, 100, 10)})

SimpImpMean_MinMaxScaler = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="mean")),
    ('MinMaxScaler', MinMaxScaler()),
])
SimpImpConstNoAns_OHE = Pipeline([
    ('SimpleImputer', SimpleImputer(strategy="constant", fill_value='no_answer')),
    ('OHE', OneHotEncoder(sparse=False, drop='if_binary', categories='auto')),
])

### ColumnTransformer
preprocessor_transformer = ColumnTransformer([
    ('pipeline-1', SimpImpMean_MinMaxScaler, ['Num_col']),
    ('pipeline-2', SimpImpConstNoAns_OHE, ['Example'])
    ],
    remainder='drop',
    verbose_feature_names_out=False)

f_processor = make_pipeline(preprocessor_transformer, NumpyColumnSelector())
f_processor.fit_transform(df_foo)
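The positional selection itself can be checked in isolation. Here is a small self-contained sketch with the index list turned into a parameter (an assumption on my part; the version above hard-codes [0, 1, 3]):

```python
import numpy as np

class NumpyColumnSelector:
    """Keep only the given positional columns of a numpy array."""
    def __init__(self, keep):
        self.keep = keep
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X[:, self.keep]

X = np.arange(8).reshape(2, 4)   # two rows, columns 0..3
out = NumpyColumnSelector([0, 1, 3]).transform(X)
print(out)
# [[0 1 3]
#  [4 5 7]]
```

The obvious fragility is that the indices silently go stale if the upstream encoder ever produces a different number or order of columns.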
CodePudding user response:
For the very specific use-case of removing dummy columns generated by the OHE (+1 to amiola for a more generic answer), you can specify categories and handle_unknown='ignore'. In your example, replacing the OHE line with this:
('OHE', OneHotEncoder(sparse=False, categories=[['A', 'C']], handle_unknown='ignore')),
produces this:
Num_col Example_A Example_C
0 0.000000 1.0 0.0
1 0.125000 0.0 0.0
2 0.482143 0.0 1.0
3 0.375000 1.0 0.0
4 0.500000 0.0 0.0
5 0.625000 1.0 0.0
6 0.750000 1.0 0.0
7 0.482143 0.0 1.0
8 1.000000 0.0 1.0
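A minimal, self-contained sketch of the trick on a toy column (the dense-output keyword changed name across sklearn versions — sparse vs. sparse_output — so .toarray() is used here to stay version-agnostic):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Only 'A' and 'C' are declared as categories; 'B' is treated as unknown
# and, with handle_unknown='ignore', encoded as all-zeros instead of
# raising an error at fit/transform time.
enc = OneHotEncoder(categories=[['A', 'C']], handle_unknown='ignore')
X = np.array([['A'], ['B'], ['C']])
out = enc.fit_transform(X).toarray()
print(out)
# [[1. 0.]
#  [0. 0.]
#  [0. 1.]]
```

This is why the Example_B column simply never appears in the output above: 'B' is not a declared category, so its rows get zeros in every dummy column.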