I am implementing a custom undersampler based on SVM. The class takes binary-class data and undersamples the majority class to the size of the minority class by selecting the majority examples nearest the majority class's support vectors.
Here's the code:
import numpy as np
from collections import Counter
from sklearn.svm import SVC
from sklearn.utils import check_random_state


class NearSVUmdersampler():
    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit_resample(self, X, y):
        random_state = check_random_state(self.random_state)
        # class distribution
        counter = Counter(y)
        maj_class = counter.most_common()[0][0]
        min_class = counter.most_common()[-1][0]
        # number of minority examples
        num_minority = len(X[y == min_class])
        svc = SVC(kernel='rbf', random_state=32)
        svc.fit(X, y)
        # majority class support vectors (svc.support_ holds their row indices in X)
        maj_sup_vector = svc.support_vectors_[y[svc.support_] == maj_class]
        # distance from each majority example to its nearest majority support vector
        distances = []
        for i, x in enumerate(X[y == maj_class]):
            d = np.linalg.norm(maj_sup_vector - x, axis=1).min()
            distances.append((i, d))
        # sort distances (ascending)
        distances.sort(key=lambda tup: tup[1])
        index = [i for i, d in distances][:num_minority]
        X_ds = np.concatenate((X[y == maj_class][index], X[y == min_class]))
        y_ds = np.concatenate((y[y == maj_class][index], y[y == min_class]))
        return X_ds, y_ds
The resampled data returned by this class is balanced: the majority class is reduced to the same number of examples as the minority class.
I want to use this class in a pipeline for multiclass classification. My intention is to do this in a one-vs-one (ovo) scenario, so that in each ovo case the undersampling is invoked to resample the data for the two classes participating in that case.
So, for example, with this dummy data:
# sample data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, n_classes=4,
                           weights=[0.08, 0.12, 0.2], flip_y=0, random_state=162)
xtrain, xtest, ytrain, ytest = train_test_split(X, y,
                                                test_size=.2, random_state=12)
Counter(ytrain)
Counter({0: 126, 1: 192, 2: 330, 3: 952})
Here I would have 4(4-1)/2 = 6 models in the ovo setting. In each ovo model, the majority class should be undersampled like so:
Model 1 = Class 0 Vs Class 1 # maj:1=192; undersampled to 126 -> 0:126, 1:126
Model 2 = Class 0 Vs Class 2 # maj:2=330; undersampled to 126 -> 0:126, 2:126
Model 3 = Class 0 Vs Class 3 # maj:3=952; undersampled to 126 -> 0:126, 3:126
Model 4 = Class 1 Vs Class 2 # maj:2=330; undersampled to 192 -> 1:192, 2:192
Model 5 = Class 1 Vs Class 3 # maj:3=952; undersampled to 192 -> 1:192, 3:192
Model 6 = Class 2 Vs Class 3 # maj:3=952; undersampled to 330 -> 2:330, 3:330
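The table above can be reproduced with a few lines of standard-library Python. This is only a sketch of the intended arithmetic, not part of the sampler itself; the helper name expected_ovo_sizes is made up for illustration, and the counts are copied from Counter(ytrain) above.

```python
from collections import Counter
from itertools import combinations

# Class counts copied from Counter(ytrain) in the question
counts = Counter({0: 126, 1: 192, 2: 330, 3: 952})


def expected_ovo_sizes(counts):
    """Hypothetical helper: map each class pair (a, b) to the per-class size
    after the pair's majority is undersampled to the pair's minority."""
    return {
        (a, b): min(counts[a], counts[b])
        for a, b in combinations(sorted(counts), 2)
    }


sizes = expected_ovo_sizes(counts)
n = len(counts)
# n(n-1)/2 pairwise models: 4 * 3 / 2 = 6
assert len(sizes) == n * (n - 1) // 2

for (a, b), size in sizes.items():
    print(f"Class {a} Vs Class {b} -> {a}:{size}, {b}:{size}")
```

Each pair ends up with twice the size of its smaller class, which is what the per-pair resampling is supposed to deliver.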
With this in mind, I want to use SVC as the estimator for OneVsOneClassifier, as follows:
from imblearn.pipeline import Pipeline
from sklearn.multiclass import OneVsOneClassifier

model = OneVsOneClassifier(
    estimator=SVC(kernel='rbf'), n_jobs=-1)
resampler = NearSVUmdersampler(random_state=123)
And fit this as:
classifier = Pipeline([('sampler', resampler), ('clf', model)])
classifier.fit(xtrain, ytrain)
Pipeline(steps=[('sampler',
                 <__main__.NearSVUmdersampler object at 0x7f4386fa30d0>),
                ('clf', OneVsOneClassifier(estimator=SVC(), n_jobs=-1))])
Problem:
It appears the resampler is invoked only once, on the full training data containing all classes. So it returns only the overall majority and minority classes from the original data, with the majority undersampled to the size of the minority, and the classifier is then trained on only two classes.
In the above MWE, for instance, it returns:
{0: 126, 3: 126} # the majority: 3=952; undersampled to minority: 0=126
That is the case of Model 3, and nothing is done for the other cases.
How do I make this work in an ovo setting, given the pipeline I have?
CodePudding user response:
Try this:
model = SVC(kernel='rbf')
resampler = NearSVUmdersampler(random_state=123)
base_estimator = Pipeline([('sampler', resampler), ('clf', model)])
classifier = OneVsOneClassifier(estimator=base_estimator)
Now when you call classifier.fit, the OneVsOneClassifier will fit a copy of your base_estimator pipeline on each pairwise slice of the data, so the resampling runs separately for each pair of classes.
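To see why the nesting order matters, here is a rough standard-library sketch that tracks only the labels. It is not scikit-learn or imblearn code: undersample_labels is a hypothetical stand-in that mimics what NearSVUmdersampler does to the class counts (it ignores the SVM-based selection), and the two functions imitate the two ways of nesting the sampler and the one-vs-one split.

```python
from collections import Counter
from itertools import combinations


def undersample_labels(labels):
    """Hypothetical stand-in for fit_resample, acting on labels only:
    keep the most and least common classes, both at the minority size."""
    ranked = Counter(labels).most_common()
    (maj, _), (mino, n_min) = ranked[0], ranked[-1]
    return [maj] * n_min + [mino] * n_min


def sampler_outside_ovo(labels):
    """Mimics Pipeline([sampler, OneVsOneClassifier]):
    resample once on all classes, then split into pairs."""
    resampled = undersample_labels(labels)
    classes = sorted(set(resampled))
    return {pair: Counter(l for l in resampled if l in pair)
            for pair in combinations(classes, 2)}


def sampler_inside_ovo(labels):
    """Mimics OneVsOneClassifier(Pipeline([sampler, clf])):
    split into pairs first, then resample each pair on its own."""
    classes = sorted(set(labels))
    return {pair: Counter(undersample_labels([l for l in labels if l in pair]))
            for pair in combinations(classes, 2)}


# Same class counts as Counter(ytrain) in the question
labels = [0] * 126 + [1] * 192 + [2] * 330 + [3] * 952

outside = sampler_outside_ovo(labels)
inside = sampler_inside_ovo(labels)

print("sampler outside ovo:", outside)
for pair, pair_counts in inside.items():
    print("sampler inside ovo:", pair, dict(pair_counts))
```

With the sampler outside, only the (0, 3) pair survives, which matches the behaviour described in the question. With the sampler inside, all six pairs are present and each is balanced at its own minority size.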