I have a Pandas DataFrame like the one below, with an ID and a target variable (for a machine learning model).
- My DataFrame is really large and heavily unbalanced.
- Because it is so large, I need to take a sample of it.
- The class balance looks like this:
99.60% - 0
0.40% - 1
ID   TARGET
111  1
222  1
333  0
444  1
...  ...
How can I sample the data without losing too many of the ones (TARGET = 1), which are very rare anyway? In the next step I will of course add the remaining variables and perform oversampling; nevertheless, at the beginning I need to take a sample of the data.
How can I do that in Python?
CodePudding user response:
Perhaps this is what you need: the stratify parameter makes sure you sample your data in a stratified fashion, so the 0/1 proportions of y are preserved in the sample.
from sklearn.model_selection import train_test_split
import numpy as np

# Dummy data: 30000 rows, 2 features, binary target
X = np.random.rand(30000, 2)
y = np.random.randint(2, size=30000)
# stratify=y keeps the 0/1 ratio of y in both 100-row samples
X_sample, X_rest, y_sample, y_rest = train_test_split(
    X, y, train_size=100, test_size=100, stratify=y, shuffle=True)
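Applied to the DataFrame from the question, a hedged sketch (the column name TARGET and the sample size of 1000 are my assumptions) would be:
# Keep a 1000-row sample of df with the same 0/1 ratio;
# the TARGET column name is assumed from the question
df_sample, df_rest = train_test_split(
    df, train_size=1000, stratify=df["TARGET"], shuffle=True, random_state=1)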
CodePudding user response:
I think the solution is to combine oversampling and undersampling.
Random Oversampling: Randomly duplicate examples in the minority class.
Random Undersampling: Randomly delete examples in the majority class.
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# First oversample: duplicate minority rows until they are 10% of the majority
over = RandomOverSampler(sampling_strategy=0.1)
X, y = over.fit_resample(X, y)
# Then undersample: delete majority rows until the minority is 50% of them
under = RandomUnderSampler(sampling_strategy=0.5)
X, y = under.fit_resample(X, y)
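A minimal end-to-end sketch on synthetic data with roughly the question's 0.4% positive rate (the synthetic setup is my own, assuming the imbalanced-learn package is installed), to check the class counts before and after:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic stand-in for the real DataFrame: ~0.4% ones
X, y = make_classification(n_samples=30000, weights=[0.996], random_state=1)
print(Counter(y))  # roughly {0: 29880, 1: 120}

X, y = RandomOverSampler(sampling_strategy=0.1).fit_resample(X, y)
X, y = RandomUnderSampler(sampling_strategy=0.5).fit_resample(X, y)
print(Counter(y))  # minority is now half the size of the majority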
CodePudding user response:
Assume you want a sample of size 1000. Try the following line:
df.sample(n=1000, random_state=1)  # simple random sample, no replacement
Note that a plain random sample does not protect the rare ones: with 0.40% positives you would expect only about 4 rows with TARGET = 1 in a 1000-row sample.
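If you want the sample itself to keep the class ratio with pandas alone, a hedged sketch (the TARGET column name comes from the question; the rest is my assumption) is to sample each class separately:
# Draw the same fraction from each TARGET group so the
# 99.60/0.40 split survives the downsampling
frac = 1000 / len(df)
df_sample = (
    df.groupby("TARGET", group_keys=False)
      .apply(lambda g: g.sample(frac=frac, random_state=1))
)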