Dataset upsampling using pandas and sklearn - Python


I have a dataset with one class being very imbalanced (190 records vs 14810) based on the 'relevance' column. So I tried to upsample it, which worked; but the issue is that I have other classes in another column (1000 records per class), and when I simply upsample based on the 'relevance' column, those classes become imbalanced. Is there a way to upsample 'relevance' while keeping the ratio of the classes in the other column?

import pandas as pd
from sklearn.utils import resample

# Creating a dataset with the minority class
df_minority = df[df['relevance'] == 1]

# Creating a dataset with the other class
df_rest = df[df['relevance'] != 1]

# Upsample the minority class
df_1_upsampled = resample(df_minority, random_state=SEED, n_samples=14810, replace=True)

# Concatenate the upsampled dataframe with the rest
df_upsampled = pd.concat([df_1_upsampled, df_rest])

Sample dataset:

  relevance   class   2   3   4   5  
          1       A  40  24  11  50
          1       A  60  20  19  60
          0       C  15  57  15  60
          0       B  12  50  15  43 
          0       B  90   8  32  80
          0       C  74   8  21  34

So the goal is to make the counts of the 'relevance' classes equal while keeping the 1:1:1 ratio of the 'class' categories.

CodePudding user response:

Here is a way to do it per class. Note that I'm not sure whether this will bias any model trained afterwards; I don't have enough experience here. First, let's create dummy data that is closer to your real data.

import numpy as np
import pandas as pd

# dummy data
np.random.seed(0)
df = pd.DataFrame({
    'relevance': np.random.choice(a=[0]*14810 + [1]*190, size=15000, replace=False),
    'class': list('ABCDEFGHIKLMNOP')*1000,
    2: np.random.randint(0, 100, 15000), 3: np.random.randint(0, 100, 15000),
    4: np.random.randint(0, 100, 15000), 5: np.random.randint(0, 100, 15000),
})
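
The snippets below reuse the df_minority and df_rest splits from the question's code; to run the answer end-to-end on this dummy data, recreate them the same way:

# splits as defined in the question
df_minority = df[df['relevance'] == 1]   # the 190 relevance=1 rows
df_rest = df[df['relevance'] != 1]       # everything else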

Just a quick check on class vs. relevance; you will need this info anyway. Every class has 1000 samples, and each class has a different number of relevance=1 rows:

ct = pd.crosstab(df['class'], df['relevance'])
print(ct.head())
# relevance    0   1
# class             
# A          983  17
# B          982  18
# C          990  10
# D          993   7
# E          993   7
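
As a quick extra check (not in the original answer), you can confirm that every class has 1000 rows in total and that the relevance=1 rows add up to the 190 we generated:

print(ct.sum(axis=1).head())  # 1000 rows per class
print(ct[1].sum())            # 190 relevance=1 rows in total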

Now you can calculate the number of upsampled rows needed per class. Note that this can be defined in several ways; in particular, the 1000 can be changed to any number.

nb_upsample = (1000*ct[0].mean()/ct[0]).astype(int)
print(nb_upsample.head())
# class
# A    1004
# B    1005
# C     997
# D     994
# E     994
# Name: 0, dtype: int32
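
If you sum these per-class targets (just a side note, not in the original answer), you get the total number of relevance=1 rows you will end up with; with the factor of 1000 it lands slightly above the 14810 relevance=0 rows, as the final check below shows:

print(nb_upsample.sum())  # ~15000, the future relevance=1 count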

Now you can upsample per class, drawing nb_upsample[c] rows with replacement from each class c of df_minority:

df_1_upsampled = (
    df_minority.groupby(['class'])
      .apply(lambda x: resample(x, random_state=1, replace=True,
                                n_samples=nb_upsample[x.name]))
      .reset_index(drop=True)
)
print(df_1_upsampled['class'].value_counts().head())
# B    1005
# A    1004
# L    1004
# M    1003
# H    1001
# Name: class, dtype: int64

Finally, concatenate with df_rest and check the class and relevance counts:

df_upsampled = pd.concat([df_1_upsampled,df_rest])
print(df_upsampled['class'].value_counts().head()) #same ratio
# A    1987
# B    1987
# C    1987
# D    1987
# E    1987
# Name: class, dtype: int64
print(df_upsampled['relevance'].value_counts()) # almost same relevance number
# 1    14995 #this number is affected by the 1000 in nb_upsample
# 0    14810
# Name: relevance, dtype: int64
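
Since the goal was also to keep the class balance within each relevance level, a per-class check (not in the original answer, but a natural follow-up) can be done with a crosstab; the relevance=0 column is untouched and the relevance=1 column is now close to 1000 for every class:

print(pd.crosstab(df_upsampled['class'], df_upsampled['relevance']).head())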

You can see there are more relevance=1 rows now. What you can do is change the 1000 in the line defining nb_upsample to any number you want. You could also use nb_upsample = (ct[0].mean()**2/ct[0]).astype(int), which would balance the two relevance categories a bit more.
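
As a sketch of that alternative (the rest of the code stays the same, only the target formula changes):

# alternative targets: scale by the mean of the relevance=0 counts instead of a fixed 1000,
# so the total of the upsampled minority ends up closer to the 14810 relevance=0 rows
nb_upsample = (ct[0].mean()**2 / ct[0]).astype(int)
print(nb_upsample.sum())  # closer to 14810 than the ~15000 above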
