Good day. I googled this without luck. It seems like it should be possible, but I might be reading the API wrong. How can I have scikit-learn automatically drop the extra columns in my testing pandas DataFrame, instead of having to drop those columns explicitly?
I am currently running Python 3.6 in my environment and v0.24.2 of sklearn.
To show this with an example here's the code:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
from random import randint
from random import choice
import random
random.seed(42)
df = pd.DataFrame({
    'cont_A': [randint(1,10) for _ in range(10)],
    'cont_B': [randint(-20,20) for _ in range(10)],
    'cat_A': [choice('ABC') for _ in range(10)],
    'cat_B': [choice('XYZ') for _ in range(10)],
})
This will create a dataframe with two categorical columns and two continuous columns.
t = [
    ('cat', OneHotEncoder(), ['cat_A', 'cat_B']),
    ('nums', MinMaxScaler(), ['cont_A', 'cont_B'])
]
columnTransformer = ColumnTransformer(t, remainder='drop')
X_train = columnTransformer.fit_transform(df)
X_train
We can fit-transform our columnTransformer on our initial training data. Now let's say we generate our testing or input data before we want to run our model.
df_test = pd.DataFrame({
    'cont_A': [randint(2,9) for _ in range(3)],
    'cont_B': [randint(-19,19) for _ in range(3)],
    'cat_A': [choice('ABC') for _ in range(3)],
    'cat_B': [choice('XYZ') for _ in range(3)],
    'extra_A': [randint(1,5) for _ in range(3)],
    'extra_B': [randint(1,5) for _ in range(3)],
    'extra_C': [randint(1,5) for _ in range(3)],
})
This testing dataframe has 3 extra columns that are of no value to us. I want the columnTransformer to automatically drop them and process the remaining columns (if this is possible) without my having to drop them explicitly.
If I run the columnTransformer on this data:
LST = columnTransformer.transform(df_test)
It will cause an error:
ValueError: X has 7 features, but ColumnTransformer is expecting 4 features as input.
I thought setting remainder='drop' would have addressed this issue, but it does not seem to help. However, if I explicitly drop those columns, it runs:
df_test_dropped = df_test.drop(['extra_A', 'extra_B', 'extra_C'], axis=1)
LST = columnTransformer.transform(df_test_dropped)
How can I (if it's even possible) have columnTransformer automatically drop non-relevant columns (instead of having to explicitly drop them)?
CodePudding user response:
remainder='drop' tells the transformer to drop columns in the training set that aren't assigned to any of the transformers. It doesn't say to ignore additional columns in a test set, and in the version used in the question (0.24.2) there is no way to accomplish that: all estimators expect to receive inputs in the same format at fitting and at transform/prediction.
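One workaround, offered as a sketch rather than as a scikit-learn feature, is to remember the columns used at fit time and select only those by name before calling transform. The names feature_cols and select_fitted_columns are made up for this illustration, and the small hand-written frames stand in for the random data in the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Columns the transformer is fitted on (hypothetical name for this sketch).
feature_cols = ['cat_A', 'cat_B', 'cont_A', 'cont_B']

df = pd.DataFrame({
    'cont_A': [1, 5, 10, 3],
    'cont_B': [-20, 0, 20, 7],
    'cat_A': ['A', 'B', 'C', 'A'],
    'cat_B': ['X', 'Y', 'Z', 'X'],
})

columnTransformer = ColumnTransformer(
    [
        ('cat', OneHotEncoder(), ['cat_A', 'cat_B']),
        ('nums', MinMaxScaler(), ['cont_A', 'cont_B']),
    ],
    remainder='drop',
)
columnTransformer.fit(df[feature_cols])


def select_fitted_columns(frame, columns=feature_cols):
    """Return only the fit-time columns, in fit-time order (helper for this sketch)."""
    missing = [c for c in columns if c not in frame.columns]
    if missing:
        raise KeyError(f"missing expected columns: {missing}")
    return frame[columns]


df_test = pd.DataFrame({
    'cont_A': [2, 9],
    'cont_B': [-19, 19],
    'cat_A': ['B', 'C'],
    'cat_B': ['Z', 'X'],
    'extra_A': [1, 2],
    'extra_B': [3, 4],
})

# The extra columns are filtered out by name before the transformer sees them.
X_test = columnTransformer.transform(select_fitted_columns(df_test))
```

This still drops the columns explicitly, but by keeping the fit-time column list rather than naming each unwanted column, so new extra columns in later test frames need no code change. Newer scikit-learn releases may treat extra DataFrame columns differently; the sketch does not depend on that.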
CodePudding user response:
In this line
columnTransformer = ColumnTransformer(t, remainder='drop')
the kwarg remainder='drop' already drops the columns not specified by the list of transformers t.
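To make the scope of remainder='drop' concrete, here is a minimal sketch with made-up data. A column that is present at fit time but not listed in any transformer (called unused here, a hypothetical name) is left out of the output, while remainder='passthrough' would keep it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    'cont_A': [1, 5, 10],
    'unused': [7, 8, 9],  # present at fit time, but not listed in any transformer
})

dropper = ColumnTransformer([('nums', MinMaxScaler(), ['cont_A'])], remainder='drop')
keeper = ColumnTransformer([('nums', MinMaxScaler(), ['cont_A'])], remainder='passthrough')

X_drop = dropper.fit_transform(df)  # only the scaled cont_A column
X_keep = keeper.fit_transform(df)   # scaled cont_A plus the untouched 'unused' column

print(X_drop.shape, X_keep.shape)
```

In both cases the frame passed to fit_transform is the one that defines which columns count as the remainder, which is why the setting alone did not help with the extra test columns in the question.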