I have a dataset where one class is heavily imbalanced (190 records vs. 14810) in the 'relevance' column. Upsampling it works, but another column holds a second set of categories (1000 records per class), and when I simply upsample on 'relevance', those categories become imbalanced. Is there a way to upsample 'relevance' while keeping the ratio of classes in the other column?
from sklearn.utils import resample
import pandas as pd

# Create a dataset with the minority class
df_minority = df[df['relevance'] == 1]
# Create a dataset with the other class
df_rest = df[df['relevance'] != 1]
# Upsample the minority class
df_1_upsampled = resample(df_minority, random_state=SEED, n_samples=14810, replace=True)
# Concatenate the upsampled dataframe with the rest
df_upsampled = pd.concat([df_1_upsampled, df_rest])
Sample dataset:
relevance  class   2   3   4   5
1          A      40  24  11  50
1          A      60  20  19  60
0          C      15  57  15  60
0          B      12  50  15  43
0          B      90   8  32  80
0          C      74   8  21  34
So, the goal is to make the number of 'relevance' classes equal, keeping the 1:1:1 ratio of the 'class' category.
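For illustration, a minimal toy sketch (made-up data, not my real dataset) of the side effect: the upsampled rows inherit whatever class mix the minority happens to have, so the overall 'class' ratio is distorted after concatenation.

```python
import pandas as pd
from sklearn.utils import resample

# toy data: among the relevance=1 rows, class 'A' is overrepresented
toy = pd.DataFrame({
    'relevance': [1]*5 + [1] + [0]*12,
    'class':     ['A']*5 + ['B'] + ['A', 'B', 'C']*4,
})
minority = toy[toy['relevance'] == 1]
# naive upsampling of the minority, ignoring 'class'
up = resample(minority, replace=True, n_samples=12, random_state=0)
# the 12 upsampled rows come only from classes 'A' and 'B',
# so concatenating them back distorts the 1:1:1 'class' ratio
print(up['class'].value_counts())
```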
CodePudding user response:
Here is a way to do it per class. Note that I'm not sure whether this will bias a model trained afterwards; I don't have enough experience there. First, let's create dummy data that is closer to your real data.
# dummy data
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    'relevance': np.random.choice(a=[0]*14810 + [1]*190, size=15000, replace=False),
    'class': list('ABCDEFGHIKLMNOP')*1000,
    2: np.random.randint(0, 100, 15000), 3: np.random.randint(0, 100, 15000),
    4: np.random.randint(0, 100, 15000), 5: np.random.randint(0, 100, 15000),
})
A quick check on class vs. relevance; you will need this information anyway. Every class has 1000 samples, and each class has a different number of relevance=1 rows.
ct = pd.crosstab(df['class'], df['relevance'])
print(ct.head())
# relevance 0 1
# class
# A 983 17
# B 982 18
# C 990 10
# D 993 7
# E 993 7
Now you can calculate the number of upsampled rows needed per class. Note that this can be defined in several ways; in particular, the 1000 can be replaced by any number.
nb_upsample = (1000*ct[0].mean()/ct[0]).astype(int)
print(nb_upsample.head())
# class
# A 1004
# B 1005
# C 997
# D 994
# E 994
# Name: 0, dtype: int32
Now you can upsample per class:
df_1_upsampled = (
df_minority.groupby(['class'])
.apply(lambda x: resample(x, random_state=1, replace=True,
n_samples=nb_upsample[x.name]))
.reset_index(drop=True)
)
print(df_1_upsampled['class'].value_counts().head())
# B 1005
# A 1004
# L 1004
# M 1003
# H 1001
# Name: class, dtype: int64
Finally, concatenate and check the class and relevance ratios.
df_upsampled = pd.concat([df_1_upsampled,df_rest])
print(df_upsampled['class'].value_counts().head()) #same ratio
# A 1987
# B 1987
# C 1987
# D 1987
# E 1987
# Name: class, dtype: int64
print(df_upsampled['relevance'].value_counts()) # almost same relevance number
# 1 14995 #this number is affected by the 1000 in nb_upsample
# 0 14810
# Name: relevance, dtype: int64
You can see there are more relevance=1 rows now. You can change the 1000 in the line defining nb_upsample to any number you want. You could also use nb_upsample = (ct[0].mean()**2/ct[0]).astype(int), which would balance the two relevance categories a bit more.
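For completeness, a runnable sketch of that alternative scaling on the same dummy data (numeric columns dropped for brevity). The mean**2/count formula makes the per-class targets sum to roughly the total number of relevance=0 rows, so the two relevance categories end up close to balanced.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# same dummy construction as above, without the numeric columns
np.random.seed(0)
df = pd.DataFrame({
    'relevance': np.random.choice(a=[0]*14810 + [1]*190, size=15000, replace=False),
    'class': list('ABCDEFGHIKLMNOP')*1000,
})
ct = pd.crosstab(df['class'], df['relevance'])

# alternative scaling: targets sum to ~ the relevance=0 total (14810)
nb_upsample = (ct[0].mean()**2 / ct[0]).astype(int)

df_minority = df[df['relevance'] == 1]
df_rest = df[df['relevance'] != 1]
df_1_upsampled = (
    df_minority.groupby('class', group_keys=False)
    .apply(lambda x: resample(x, random_state=1, replace=True,
                              n_samples=nb_upsample[x.name]))
    .reset_index(drop=True)
)
df_upsampled = pd.concat([df_1_upsampled, df_rest])
print(df_upsampled['relevance'].value_counts())
```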