Custom resampler for a one-v-one in a pipeline

I am implementing a custom undersampler based on SVM. The class takes binary-class data and undersamples the majority class to the size of the minority class, by selecting the majority examples nearest to that class's support vectors.

Here's the code:


import numpy as np
from collections import Counter

from sklearn.svm import SVC
from sklearn.utils import check_random_state

class NearSVUmdersampler():
  def __init__(self, random_state=None):
    self.random_state = random_state

  def fit_resample(self, X, y):
    random_state = check_random_state(self.random_state)
    # class distribution
    counter = Counter(y)
    maj_class = counter.most_common()[0][0]
    min_class = counter.most_common()[-1][0]
    # number of minority examples
    num_minority = counter[min_class]
    svc = SVC(kernel='rbf', random_state=random_state)
    svc.fit(X, y)
    # majority class support vectors: svc.support_ holds the training-set
    # indices of all support vectors, so filter them by their label
    # (svc.support_vectors_[maj_class] would pick a single row by position)
    maj_sup_vectors = X[svc.support_][y[svc.support_] == maj_class]
    # distance from each majority example to its nearest majority support vector
    X_maj = X[y == maj_class]
    distances = np.array(
        [np.linalg.norm(maj_sup_vectors - x, axis=1).min() for x in X_maj])
    # indices of the num_minority closest majority examples (ascending distance)
    index = np.argsort(distances)[:num_minority]
    X_ds = np.concatenate((X_maj[index], X[y == min_class]))
    y_ds = np.concatenate((y[y == maj_class][index], y[y == min_class]))

    return X_ds, y_ds

The resampled data returned by this class is balanced, with the majority class reduced to the size of the minority.

I want to use this class in a pipeline for multiclass classification. My intention is to do this in a one-v-one (ovo) scenario, so that in each ovo case the undersampling is invoked to resample the data for the two participating classes.

So, for example, with this dummy data:

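from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# sample data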
X, y = make_classification(n_samples=2000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, n_classes=4, weights=[0.08, 0.12, 0.2], flip_y=0, random_state=162)

xtrain, xtest, ytrain, ytest = train_test_split(X, y, 
                test_size=.2, random_state=12)

Counter(ytrain)
Counter({0: 126, 1: 192, 2: 330, 3: 952})

Here I would have 4(4-1)/2 = 6 models in the ovo case. In each ovo model, the majority class should be undersampled like so:

Model 1 = Class 0 Vs Class 1 # maj:1=192; undersampled to 126 -> 0:126, 1:126
Model 2 = Class 0 Vs Class 2 # maj:2=330; undersampled to 126 -> 0:126, 2:126
Model 3 = Class 0 Vs Class 3 # maj:3=952; undersampled to 126 -> 0:126, 3:126
Model 4 = Class 1 Vs Class 2 # maj:2=330; undersampled to 192 -> 1:192, 2:192
Model 5 = Class 1 Vs Class 3 # maj:3=952; undersampled to 192 -> 1:192, 3:192
Model 6 = Class 2 Vs Class 3 # maj:3=952; undersampled to 330 -> 2:330, 3:330
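
The per-pair counts listed above can be checked with a few lines of standard-library Python. This is only a sketch of the arithmetic: train_counts is copied from the Counter(ytrain) output, and expected_ovo_counts is a hypothetical helper name, not something from scikit-learn or imbalanced-learn.

```python
from itertools import combinations

# class counts copied from Counter(ytrain) above
train_counts = {0: 126, 1: 192, 2: 330, 3: 952}

def expected_ovo_counts(counts):
    """For each pair of classes, both classes end up at the smaller count."""
    result = {}
    for a, b in combinations(sorted(counts), 2):
        target = min(counts[a], counts[b])
        result[(a, b)] = {a: target, b: target}
    return result

# one line per ovo model, in the same order as the list above
for pair, balanced in expected_ovo_counts(train_counts).items():
    print(pair, balanced)
```

With four classes this gives six pairs, and each pair is balanced at the size of its smaller class.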

With this in mind, I want to use SVC as the estimator for OneVsOneClassifier, as follows:

from imblearn.pipeline import Pipeline
from sklearn.multiclass import OneVsOneClassifier

model = OneVsOneClassifier(
    estimator=SVC(kernel='rbf'), n_jobs=-1)

resampler = NearSVUmdersampler(random_state=123)

And fit this as:

classifier = Pipeline([('sampler', resampler), ('clf', model) ])
classifier.fit(xtrain, ytrain)
Pipeline(steps=[('sampler',
                 <__main__.NearSVUmdersampler object at 0x7f4386fa30d0>),
                ('clf', OneVsOneClassifier(estimator=SVC(), n_jobs=-1))])

Problem:

It appears the resampler is invoked only once, on the full training data containing all classes. So it returns only the overall majority and minority classes, with the majority undersampled to the size of the minority, and the model ends up trained on just two classes.

In the above MWE for instance, it returns:

{0: 126, 3: 126} # the majority: 3=952; undersampled to minority: 0=126

That is the case of Model 3, and nothing is done for the other cases.

How do I make this work in an ovo setting with the pipeline I have?

CodePudding user response：

Try this:

model = SVC(kernel='rbf')
resampler = NearSVUmdersampler(random_state=123)
base_estimator = Pipeline([('sampler', resampler), ('clf', model)])
classifier = OneVsOneClassifier(estimator=base_estimator)

Now when you call classifier.fit, the OneVsOneClassifier fits a copy of your base_estimator pipeline on each pairwise subset of the data, so the resampling runs once for each pair of classes.
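
One way to convince yourself of this is to hand OneVsOneClassifier an estimator that just records what its fit method receives. The sketch below assumes scikit-learn is installed. CountingClassifier is a hypothetical stand-in for the pipeline (it is not the sampler from the question), and the toy labels only reuse the class sizes from the question.

```python
from collections import Counter

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.multiclass import OneVsOneClassifier


class CountingClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical stand-in estimator that records the labels passed to fit."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # class sizes seen by this particular ovo sub-model, smallest first
        self.seen_counts_ = sorted(Counter(np.asarray(y).tolist()).values())
        return self

    def predict(self, X):
        # always predict the first class; only fit() matters for this check
        return np.full(len(X), self.classes_[0])


# toy labels with the same class sizes as ytrain; the features are arbitrary
y = np.repeat([0, 1, 2, 3], [126, 192, 330, 952])
X = np.zeros((len(y), 2))

ovo = OneVsOneClassifier(CountingClassifier()).fit(X, y)
for est in ovo.estimators_:
    print(est.seen_counts_)
```

Each of the six fitted sub-models should report exactly two class sizes, which is the data a pipeline placed in the estimator slot would pass to its sampler step.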