Dataset with new outliers after removing the outliers

Time:02-18

I'm new to machine learning and I'm trying to train a model on the "Rain in Australia" dataset. I'm currently at the preprocessing step: after using KNNImputer to fill all NaN values, I tried to remove the outliers with the following custom transformer class.

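import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class OutliersRemover(BaseEstimator, TransformerMixin):
  def __init__(self, cols_indexes):
    self.cols_indexes = cols_indexes

  def fit(self, X, y=None):
    return self

  def transform(self, X, y=None):
    outliers_indexes = set()
    threshold = 3
    # np.asarray accepts both DataFrames and NumPy arrays, so the
    # transformer can also be applied to its own (NumPy) output.
    X = np.asarray(X)

    for col_index in self.cols_indexes:
      # Mean and std are recomputed from whatever data is passed in.
      mean = np.mean(X[:, col_index])
      std  = np.std(X[:, col_index])

      # Flag every row whose absolute z-score exceeds the threshold.
      for line_index, item in enumerate(X[:, col_index]):
        z_score = (item - mean) / std
        if np.abs(z_score) > threshold:
          outliers_indexes.add(line_index)
    print("Removing: {} outliers".format(len(outliers_indexes)))
    return np.delete(X, list(outliers_indexes), 0)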


outliers_remover    = OutliersRemover(np.arange(24))
X_train_transformed = outliers_remover.fit_transform(X_train)

It appears to remove the outliers correctly, but if I run the code below to check whether all outliers were removed, it removes another set of outliers. If I run the same code 10 times, it removes a different set of outliers each time until it reaches 0.

for _ in range(10):
    X_train_transformed = outliers_remover.fit_transform(X_train_transformed)

Removing: 1389 outliers
Removing: 319 outliers
Removing: 528 outliers
...
Removing: 0 outliers

I would like to know whether this is normal behavior for the dataset or what I am doing wrong.

CodePudding user response:

In every iteration, you remove outliers from X_train_transformed and assign the result back to X_train_transformed. Because the mean and standard deviation are recomputed from the already-trimmed data on each pass, each pass can flag a new set of values (see below).

As for whether this is normal behavior for the dataset: yes. Any numerical dataset has a mean and a standard deviation, and will most likely contain values for which |value - mean| / std is greater than 3. If you remove those values and compute a new mean and std, the std shrinks, so values that were previously inside the threshold can now fall outside it.
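The effect described above can be reproduced on synthetic data. The following is a minimal sketch, not the asker's pipeline: the helper `trim_once`, the heavy-tailed Student-t sample, and the seed are all assumptions chosen for illustration. It applies the same |z-score| > 3 rule repeatedly and records how many points each pass removes and how the standard deviation changes.

```python
import numpy as np


def trim_once(values, threshold=3.0):
    """Drop values whose |z-score| exceeds `threshold`.

    The mean and std are computed from `values` itself, mirroring
    what the transformer in the question does on every call.
    """
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) <= threshold]


# Hypothetical heavy-tailed sample (Student-t with 3 degrees of freedom).
rng = np.random.default_rng(0)
data = rng.standard_t(df=3, size=10_000)

removed_per_round = []
stds = [data.std()]
current = data
for _ in range(10):
    trimmed = trim_once(current)
    removed_per_round.append(len(current) - len(trimmed))
    stds.append(trimmed.std())
    current = trimmed

print("removed per round:", removed_per_round)
print("std after each round:", [round(float(s), 3) for s in stds])
```

Each pass that removes something also lowers the standard deviation, which tightens the cutoff for the next pass. That is why later passes keep finding "new" outliers until the distribution has been trimmed enough that nothing exceeds the threshold.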

I would recommend removing outliers only once. You can adjust the threshold to control how many values are removed. Also, consider reading up on how normal distributions, their means, and their standard deviations work.
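One way to make "remove once" concrete is to compute the cutoffs a single time and reuse them, and to compare how many rows different thresholds would drop. This is only a sketch under assumptions: `zscore_bounds` and `keep_mask` are hypothetical helper names, and the data is synthetic rather than the dataset from the question.

```python
import numpy as np


def zscore_bounds(X, threshold=3.0):
    """Per-column lower/upper cutoffs at mean +/- threshold * std."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return mean - threshold * std, mean + threshold * std


def keep_mask(X, lower, upper):
    """True for rows where every column lies within [lower, upper]."""
    return np.all((X >= lower) & (X <= upper), axis=1)


# Hypothetical heavy-tailed data with 4 numeric columns.
rng = np.random.default_rng(42)
X = rng.standard_t(df=3, size=(5_000, 4))

# Compute the cutoffs once, from the original data.
lower, upper = zscore_bounds(X, threshold=3.0)
X_clean = X[keep_mask(X, lower, upper)]

# Re-checking with the SAME cutoffs flags nothing further.
second_pass_removed = int((~keep_mask(X_clean, lower, upper)).sum())
print("rows removed:", len(X) - len(X_clean))
print("removed on second pass with fixed cutoffs:", second_pass_removed)

# Compare how many rows each candidate threshold would remove.
removed_by_threshold = {
    t: int((~keep_mask(X, *zscore_bounds(X, t))).sum())
    for t in (2.0, 3.0, 4.0)
}
print("removed by threshold:", removed_by_threshold)
```

With fixed cutoffs, a second check cannot flag anything new, because the statistics no longer move with the data. In a scikit-learn transformer the same idea would mean computing the mean and std in `fit` and only applying them in `transform`, which is a design choice rather than something the original answer prescribes.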
