How to take sample of data from very unbalanced DataFrame so as to not lose too many '1'?


I have a Pandas DataFrame like the one below, with an ID column and a target variable (for a machine learning model).

  • My DataFrame is very large, so I need to take a sample of it.
  • It is also heavily unbalanced; the class distribution looks like this:
    • 99.60% - 0
    • 0.40% - 1

  ID  TARGET
  111 1
  222 1
  333 0
  444 1
  ... ...

How can I sample the data without losing too many ones (target = 1), which are rare to begin with? In the next step I will of course add the remaining variables and perform oversampling; nevertheless, I first need to take a sample of the data.

How can I do that in Python?

CodePudding user response:

Perhaps this is what you need. The stratify parameter makes sure you sample your data in a stratified fashion, so the class proportions of y are preserved in the samples.

from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.rand(30000, 2)
y = np.random.randint(2, size=30000)

# stratify=y keeps the class proportions of y in both resulting samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=100, test_size=100, stratify=y, shuffle=True
)
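
For the DataFrame from the question you can apply the same idea to the frame itself. A minimal sketch, assuming the frame is named df with a TARGET column as in the example (df_sample is a hypothetical name for the result):

from sklearn.model_selection import train_test_split

# Keep a stratified 1% of df; the second return value is the remaining 99%
df_sample, _ = train_test_split(df, train_size=0.01, stratify=df["TARGET"], random_state=42)

# Sanity check: the 0/1 proportions should match the full frame
print(df_sample["TARGET"].value_counts(normalize=True))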

CodePudding user response:

I think the solution is to combine oversampling and undersampling.

Random Oversampling: Randomly duplicate examples in the minority class.

Random Undersampling: Randomly delete examples in the majority class.

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# First duplicate minority examples until they are 10% of the majority class
over = RandomOverSampler(sampling_strategy=0.1)
X, y = over.fit_resample(X, y)

# Then delete majority examples until the minority is 50% of the majority
under = RandomUnderSampler(sampling_strategy=0.5)
X, y = under.fit_resample(X, y)
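
A quick way to see what the two ratios above do to the class counts, on synthetic data shaped like the question's 99.6% / 0.4% split (the array names and the imbalance here are illustrative, not from the answer):

from collections import Counter
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)
X = rng.random((30000, 2))
y = (rng.random(30000) < 0.004).astype(int)  # ~0.4% ones
print(Counter(y))

X_res, y_res = RandomOverSampler(sampling_strategy=0.1).fit_resample(X, y)
print(Counter(y_res))  # minority duplicated until it is 10% of the majority

X_res, y_res = RandomUnderSampler(sampling_strategy=0.5).fit_resample(X_res, y_res)
print(Counter(y_res))  # majority trimmed until the minority is 50% of it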

CodePudding user response:

Assume you want a sample of size 1000.

Try the following line:

# Draw ~1000 rows at random (replace=True samples with replacement)
df.sample(frac=1000/len(df), replace=True, random_state=1)
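
Note that this plain random sample keeps the 0.4% share of ones only roughly, not exactly. If you want the proportions preserved with pandas alone, a stratified variant is possible with groupby; a minimal sketch, assuming pandas >= 1.1 and the df / TARGET names from the question:

# Draw the same fraction from each TARGET group so the 0/1 ratio survives
frac = 1000 / len(df)
df_sample = df.groupby("TARGET").sample(frac=frac, random_state=1)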