Implementing sklearn PCA on limited number of variables in a pipeline

I'm setting up a machine learning pipeline to classify some data. One source of the data is a very good candidate for PCA and makes up the last $n$ dimensions of the dataset. I would like to use PCA on these variables but not on the preceding ones. From searching Stack Exchange, this seems to be a common issue: people often want to apply PCA to just a portion of their data.

Obviously I could run the PCA first, concatenate the datasets, and then pass the result to the pipeline, but as far as I know the PCA should be part of the pipeline, as otherwise information from the test samples bleeds into the training data.
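As background, this leakage concern is exactly why transformers belong inside the pipeline: during cross-validation, each fold refits the PCA on its training split only, so the held-out samples never influence the components. A minimal sketch on synthetic data (the dimensions and the plain-PCA step are illustrative assumptions, not my actual dataset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))    # 100 samples, 10 features
y = rng.integers(0, 2, size=100)  # binary labels

# Because PCA is a pipeline step, it is refit on each CV training fold;
# the test fold never contributes to the fitted components.
pipe = Pipeline([("pca", PCA(n_components=3)), ("svc", SVC())])
scores = cross_val_score(pipe, X, y, cv=5)  # one score per fold
```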

I want to use sklearn's PCA class (though I'm open to suggestions), but it doesn't take any argument specifying which variables to apply the PCA to, so it's difficult to incorporate into a pipeline.

My current workaround defines a PCA subclass that selects the desired features:

import numpy as np
from sklearn.decomposition import PCA

# NUMBER_NONPCA_FEATURES (the count of leading columns to leave
# untouched) is assumed to be defined elsewhere.

class new_PCA(PCA):

    @staticmethod
    def reduce(X):
        # Keep only the trailing columns that PCA should operate on
        return X.iloc[:, NUMBER_NONPCA_FEATURES:]

    # Fit the PCA on the selected columns only
    def _fit(self, X):
        part_X = self.reduce(X)
        return PCA._fit(self, part_X)

    def transform(self, X):
        # Transform the selected columns, then re-attach the untouched ones
        part_X = self.reduce(X)
        pca_part = PCA.transform(self, part_X)
        X_new = np.concatenate(
            [X.iloc[:, :NUMBER_NONPCA_FEATURES], pca_part], axis=1)
        return X_new

    def score_samples(self, X):
        part_X = self.reduce(X)
        return PCA.score_samples(self, part_X)

    def inverse_transform(self, X):
        part_X = self.reduce(X)
        return PCA.inverse_transform(self, part_X)

So the base _fit() and transform() methods are both restricted to the selected variables. This seems to work, and from inspecting the source code I think this covers all the methods that take the training data (X) as input.

I'm just slightly concerned that I may have overlooked something and that this may have some unintended consequences somewhere. Does this look OK?

CodePudding user response:

Your workaround is not necessary, since this use case is already covered by sklearn. Applying different transformations to different subsets of features can be done by including a ColumnTransformer in the pipeline.

Consider the example below:

from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([
    ("ct", ColumnTransformer([
        ("PCA", PCA(), [0, 1, 2]),
        ("pass", "passthrough", [3, 4, 5]),
    ])),
    ("svc", SVC()),
])

The ColumnTransformer stage of the pipeline acts as a junction: it splits the incoming columns into subsets and applies the specified transformer to each subset (columns are selected here by index, but column names work too).
In the example, PCA is applied only to the columns with indices 0, 1 and 2, while columns 3, 4 and 5 are passed to the next stage of the pipeline untransformed.
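Putting it together, here is a runnable sketch on synthetic data (the array shapes, random labels, and n_components=2 are assumptions chosen for illustration; the column split mirrors the example above):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))     # columns 0-2 get PCA, 3-5 pass through
y = rng.integers(0, 2, size=100)

ct = ColumnTransformer([
    ("PCA", PCA(n_components=2), [0, 1, 2]),
    ("pass", "passthrough", [3, 4, 5]),
])
pipe = Pipeline([("ct", ct), ("svc", SVC())])
pipe.fit(X, y)  # Pipeline fits its steps in place, so ct is now fitted

# Inspect the junction output: 2 PCA components + 3 passthrough columns
Xt = ct.transform(X)
print(Xt.shape)  # (100, 5)
```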
