I have a Pandas DataFrame like the one below, with an ID and a target variable (for a machine learning model).
- My DataFrame is really large and heavily unbalanced.
- Because it is so large, I need to take a sample of it.
- The class balance looks like this:
99.60% - 0
0.40% - 1
ID   TARGET
111  1
222  1
333  0
444  1
...  ...
How can I sample the data without losing too many of the ones (TARGET = 1), which are very rare anyway? In the next step I will of course add the remaining variables and perform oversampling; nevertheless, at the beginning I need to take a sample of the data.
How can I do that in Python?
CodePudding user response:
Perhaps this is what you need: the stratify parameter makes sure you sample your data in a stratified fashion, so the 0/1 proportions of y are preserved in the sample.
from sklearn.model_selection import train_test_split
import numpy as np

# Dummy data: 30000 rows, 2 features, binary target
X = np.random.rand(30000, 2)
y = np.random.randint(2, size=30000)
# stratify=y keeps the 0/1 ratio of y in both 100-row samples
X_sample, X_rest, y_sample, y_rest = train_test_split(
    X, y, train_size=100, test_size=100, stratify=y, shuffle=True)
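Applied to the DataFrame from the question, a hedged sketch (the column name TARGET and the sample size of 1000 are my assumptions) would be:
# Keep a 1000-row sample of df with the same 0/1 ratio;
# the TARGET column name is assumed from the question
df_sample, df_rest = train_test_split(
    df, train_size=1000, stratify=df["TARGET"], shuffle=True, random_state=1)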
CodePudding user response:
I think the solution is to combine oversampling and undersampling.
Random Oversampling: Randomly duplicate examples in the minority class.
Random Undersampling: Randomly delete examples in the majority class.
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# First oversample: duplicate minority rows until they are 10% of the majority
over = RandomOverSampler(sampling_strategy=0.1)
X, y = over.fit_resample(X, y)
# Then undersample: delete majority rows until the minority is 50% of them
under = RandomUnderSampler(sampling_strategy=0.5)
X, y = under.fit_resample(X, y)
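A minimal end-to-end sketch on synthetic data with roughly the question's 0.4% positive rate (the synthetic setup is my own, assuming the imbalanced-learn package is installed), to check the class counts before and after:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic stand-in for the real DataFrame: ~0.4% ones
X, y = make_classification(n_samples=30000, weights=[0.996], random_state=1)
print(Counter(y))  # roughly {0: 29880, 1: 120}

X, y = RandomOverSampler(sampling_strategy=0.1).fit_resample(X, y)
X, y = RandomUnderSampler(sampling_strategy=0.5).fit_resample(X, y)
print(Counter(y))  # minority is now half the size of the majority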
CodePudding user response:
Assume you want a sample of size 1000. Try the following line:
df.sample(n=1000, random_state=1)  # simple random sample, no replacement
Note that a plain random sample does not protect the rare ones: with 0.40% positives you would expect only about 4 rows with TARGET = 1 in a 1000-row sample.
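If you want the sample itself to keep the class ratio with pandas alone, a hedged sketch (the TARGET column name comes from the question; the rest is my assumption) is to sample each class separately:
# Draw the same fraction from each TARGET group so the
# 99.60/0.40 split survives the downsampling
frac = 1000 / len(df)
df_sample = (
    df.groupby("TARGET", group_keys=False)
      .apply(lambda g: g.sample(frac=frac, random_state=1))
)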