Oversampling method using R

Time:07-11

I'm studying oversampling methods in R. Let's say I want to oversample the data frame df:

df <- data.frame(y=rep(as.factor(c('Yes', 'No')), times=c(90, 10)),
                 x1=rnorm(100),
                 x2=rnorm(100))

Clearly, df has 10 No's and 90 Yes's, so y is imbalanced. I tried to use the ubBalance function to balance y, but it seems I cannot use it because I am on R version 4. Is there an easy way to do oversampling in R version 4?

CodePudding user response:

You could use Random Walk Oversampling with the rwo function from the imbalance package:

Generates synthetic minority examples for a dataset trying to preserve the variance and mean of the minority class. Works on every type of dataset.
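
To give a feel for what "preserving the variance and mean" can look like, here is a minimal base R sketch of the random-walk idea as I understand it: each synthetic value is an existing minority value shifted by Gaussian noise, scaled by the column's standard deviation divided by sqrt(n). This is an illustration under that assumption, not the package's actual code. rwo_sketch is a hypothetical helper and it only handles numeric columns.

```r
# Hypothetical helper: illustrates the random-walk idea for numeric columns only.
# This is NOT the implementation used by imbalance::rwo.
rwo_sketch <- function(minority, n_new) {
  stopifnot(all(vapply(minority, is.numeric, logical(1))))
  n <- nrow(minority)
  # Pick existing minority rows to start each "walk" from
  idx <- sample.int(n, n_new, replace = TRUE)
  out <- minority[idx, , drop = FALSE]
  for (col in names(out)) {
    # Step size shrinks as the number of minority rows grows
    step <- sd(minority[[col]]) / sqrt(n)
    out[[col]] <- out[[col]] - step * rnorm(n_new)
  }
  rownames(out) <- NULL
  out
}

set.seed(1)  # only so the sketch is repeatable
minority  <- data.frame(x1 = rnorm(10), x2 = rnorm(10))
synthetic <- rwo_sketch(minority, n_new = 50)
dim(synthetic)
```

Because the noise has mean zero and a small scale, the synthetic rows stay close to the original minority distribution instead of being exact copies.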

Here is a reproducible example using rwo:

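df <- data.frame(y = rep(as.factor(c('Yes', 'No')), times = c(90, 10)),
                 x1 = rnorm(100),
                 x2 = rnorm(100))

library(imbalance)
# rwo() looks for the class column by name; its default classAttr is "Class"
colnames(df) <- c("Class", "x1", "x2")
# Generate 50 synthetic rows for the minority class ("No")
new_df <- rwo(df, numInstances = 50)
new_df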
#>    Class          x1          x2
#> 1     No -1.16439984  0.21856395
#> 2     No  1.20744623  0.28858048
#> 3     No  1.56528275 -0.07579441
#> 4     No -1.03733411  0.01835535
#> 5     No -0.70526984 -2.01477788
#> 6     No -0.80978490  0.64829995
#> 7     No  0.32493643 -0.05699719
#> 8     No -0.98764951 -1.72838623
#> 9     No -0.42004551  0.79171386
#> 10    No -2.02128473  0.41171867
#> 11    No -0.84667118 -1.31055008
#> 12    No -0.41447116  0.73619119
#> 13    No -0.59519331 -2.12420980
#> 14    No -1.87381529  0.36029347
#> 15    No -1.71772198 -0.67236749
#> 16    No -1.91984498  0.30281031
#> 17    No -0.30854811  1.07314736
#> 18    No -2.09342702 -0.33375116
#> 19    No -0.57984243  0.94788328
#> 20    No -1.04299574  0.97960623
#> 21    No -0.48914322  1.09651605
#> 22    No  1.95909036  0.62301445
#> 23    No  0.32071004 -2.08889830
#> 24    No -0.98998047  0.45250458
#> 25    No  0.78258023 -0.57429362
#> 26    No  0.04426842 -1.48160646
#> 27    No -1.61386524 -0.07911380
#> 28    No -0.54491597  0.24783255
#> 29    No -1.55084192  0.44819029
#> 30    No  0.40391743 -2.00554911
#> 31    No -0.57996600 -1.70075786
#> 32    No  0.34502429 -0.11452995
#> 33    No -1.42240697 -0.15749236
#> 34    No  0.56406328 -1.96536380
#> 35    No -0.99870646  0.16643333
#> 36    No  0.29262027 -1.86874500
#> 37    No  1.44551833  0.35333586
#> 38    No  1.69167557  0.16451481
#> 39    No -0.63712453 -2.37375325
#> 40    No -1.13339974  0.25853248
#> 41    No  1.60384482  0.21507984
#> 42    No -0.76946285  0.27068821
#> 43    No  0.58484861 -2.48727381
#> 44    No -1.33939478 -0.11824381
#> 45    No -1.01812834 -1.85177192
#> 46    No  0.57773883 -0.29486029
#> 47    No -1.11804972 -1.39796677
#> 48    No -1.79134432 -0.07027661
#> 49    No -0.56362892 -1.66805640
#> 50    No -1.61152940  0.06337827
plotComparison(df, rbind(df, new_df), attrs = names(new_df)[1:3])
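
Note that numInstances = 50 leaves the combined data at 60 No versus 90 Yes. If you want the classes fully balanced, one option is to derive numInstances from the class counts. A minimal sketch, assuming the same column layout as in the example; the rwo call is skipped if the imbalance package is not installed:

```r
set.seed(42)  # only so the example is repeatable
df <- data.frame(Class = factor(rep(c("Yes", "No"), times = c(90, 10))),
                 x1 = rnorm(100),
                 x2 = rnorm(100))

counts <- table(df$Class)
# Number of extra minority rows needed to match the majority class: 90 - 10
needed <- max(counts) - min(counts)

if (requireNamespace("imbalance", quietly = TRUE)) {
  new_df   <- imbalance::rwo(df, numInstances = needed)
  balanced <- rbind(df, new_df)
  # Check the class counts after adding the synthetic rows
  print(table(balanced$Class))
}
```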

Created on 2022-07-10 by the reprex package (v2.0.1)
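
If installing an extra package is not an option, plain random oversampling (duplicating minority rows by sampling with replacement) needs only base R and works in R 4. It is a simpler technique than rwo, since it repeats existing rows rather than generating synthetic ones. A minimal sketch; oversample_random is a hypothetical helper name:

```r
# Hypothetical helper: duplicate rows of the smaller classes until every
# class has as many rows as the largest one.
oversample_random <- function(data, class_col) {
  groups <- split(seq_len(nrow(data)), data[[class_col]])
  groups <- groups[lengths(groups) > 0]   # ignore unused factor levels
  target <- max(lengths(groups))
  extra <- unlist(lapply(groups, function(rows) {
    n_missing <- target - length(rows)
    # Index into rows so that a single-row class is handled correctly
    rows[sample.int(length(rows), n_missing, replace = TRUE)]
  }), use.names = FALSE)
  out <- rbind(data, data[extra, , drop = FALSE])
  rownames(out) <- NULL
  out
}

set.seed(123)  # only so the example is repeatable
df <- data.frame(y = factor(rep(c("Yes", "No"), times = c(90, 10))),
                 x1 = rnorm(100),
                 x2 = rnorm(100))
balanced <- oversample_random(df, "y")
table(balanced$y)
```

Exact duplicates can make a model overfit the repeated rows, so it is usually safer to oversample only the training split and leave the test split untouched.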
