I know that we need balanced data in y to get a better model. However, I'm wondering whether we need balanced data in the independent variables as well.
In the following dataframe, X3 is a category-type independent variable.
X1 X2 X3 y
22 67 1 0
33 87 1 0
55 66 1 0
77 12 1 0
28 68 1 1
12 64 2 0
19 17 2 1
10 62 2 1
88 19 2 1
99 20 2 1
While the data in y is balanced overall (1:1 distribution), y is imbalanced within each category of X3 (4:1 when X3 = 1, and 1:4 when X3 = 2).
Do I need to have equal distribution in X3 as well?
CodePudding user response:
It does not matter; what really matters is the label.
During modeling, your model (here, a decision tree) searches for information in X (your features). The question to ask of each feature is whether it brings information about the label: if not, drop it; if it does, keep it.
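One way to see whether X3 "brings information" is to compute the information gain a tree would get from splitting on it. This is only a rough sketch using the ten rows from the question; the helper names `entropy` and `information_gain` are made up for illustration, and a real decision tree library may use a different criterion (Gini impurity, for example).

```python
from collections import Counter
from math import log2

# Rows copied from the question's dataframe: (X1, X2, X3, y)
rows = [
    (22, 67, 1, 0), (33, 87, 1, 0), (55, 66, 1, 0), (77, 12, 1, 0), (28, 68, 1, 1),
    (12, 64, 2, 0), (19, 17, 2, 1), (10, 62, 2, 1), (88, 19, 2, 1), (99, 20, 2, 1),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, feature_index):
    """Entropy reduction in y when splitting on each distinct feature value."""
    labels = [r[-1] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[feature_index], []).append(r[-1])
    # Entropy of each group, weighted by the share of rows in that group
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

gain_x3 = information_gain(rows, 2)  # index 2 is the X3 column
print(round(gain_x3, 3))
```

A gain above zero means the split on X3 tells the model something about y, so here the uneven mix of labels inside each X3 category is exactly what makes the feature useful.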
Imbalanced data refers to datasets where the target class has an uneven distribution of observations, so we don't care how the features are distributed.
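If you want to check this on your own data, something like the sketch below counts the classes in the target and, separately, the labels inside each X3 category. The values are the ones from the question; `imbalance_ratio` and `y_by_x3` are just names picked for this example.

```python
from collections import Counter

# y and X3 columns from the question's dataframe, in row order
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
x3 = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

# Class balance of the target: this is what "imbalanced data" refers to
y_counts = Counter(y)
imbalance_ratio = max(y_counts.values()) / min(y_counts.values())

# Label counts inside each X3 category, keyed by (X3 value, y value)
y_by_x3 = Counter(zip(x3, y))

print(dict(y_counts))
print(imbalance_ratio)
print(dict(y_by_x3))
```

The ratio on the target is the number to watch. The per-category counts are worth a look, but an uneven split there is not something to rebalance.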