I'm studying oversampling methods in R. Let's say I want to oversample the data frame df:
df <- data.frame(y = rep(as.factor(c('Yes', 'No')), times = c(90, 10)),
                 x1 = rnorm(100),
                 x2 = rnorm(100))
Obviously, df has 10 No's and 90 Yes's, so y is imbalanced. I tried to use the ubBalance function to balance y, but it seems that I cannot use it because I am on R version 4.
Is there an easy way to do oversampling in R version 4?
CodePudding user response:
You could use Random Walk Oversampling via the rwo function from the imbalance package:
Generates synthetic minority examples for a dataset trying to preserve the variance and mean of the minority class. Works on every type of dataset.
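To make the description above a bit more concrete, here is a rough base-R sketch of the random-walk idea for numeric columns only. It is an illustration under my own assumptions, not the package's implementation: rwo_sketch and its arguments are hypothetical names, and the step size (column standard deviation divided by the square root of the number of minority rows) is my reading of the method.

```r
# Hypothetical helper, NOT part of the imbalance package.
# Assumption: every column of `minority` is numeric.
rwo_sketch <- function(minority, n_new) {
  n   <- nrow(minority)
  sds <- sapply(minority, sd)                  # per-column standard deviation

  # Pick existing minority rows (with replacement) as starting points
  idx  <- sample(n, n_new, replace = TRUE)
  base <- as.matrix(minority[idx, , drop = FALSE])

  # Take a small random step from each starting point:
  # standard normal noise scaled column-wise by sd / sqrt(n)
  noise <- matrix(rnorm(n_new * ncol(minority)), nrow = n_new)
  steps <- sweep(noise, 2, sds / sqrt(n), `*`)

  out <- as.data.frame(base - steps)
  rownames(out) <- NULL
  out
}

set.seed(1)
minority  <- data.frame(x1 = rnorm(10), x2 = rnorm(10))
synthetic <- rwo_sketch(minority, 50)

# Compare the column means of the original and the synthetic rows
print(rbind(original = colMeans(minority), synthetic = colMeans(synthetic)))
```

Because each synthetic row stays close to an existing minority row, the column means and spreads of the synthetic rows should stay near those of the original minority class.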
Here is a reproducible example:
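df <- data.frame(y = rep(as.factor(c('Yes', 'No')), times = c(90, 10)),
                 x1 = rnorm(100),
                 x2 = rnorm(100))
library(imbalance)
# rwo() looks for a class column named "Class" by default
colnames(df) <- c("Class", "x1", "x2")
# Generate 50 synthetic instances of the minority class ("No")
new_df <- rwo(df, numInstances = 50)
new_df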
#> Class x1 x2
#> 1 No -1.16439984 0.21856395
#> 2 No 1.20744623 0.28858048
#> 3 No 1.56528275 -0.07579441
#> 4 No -1.03733411 0.01835535
#> 5 No -0.70526984 -2.01477788
#> 6 No -0.80978490 0.64829995
#> 7 No 0.32493643 -0.05699719
#> 8 No -0.98764951 -1.72838623
#> 9 No -0.42004551 0.79171386
#> 10 No -2.02128473 0.41171867
#> 11 No -0.84667118 -1.31055008
#> 12 No -0.41447116 0.73619119
#> 13 No -0.59519331 -2.12420980
#> 14 No -1.87381529 0.36029347
#> 15 No -1.71772198 -0.67236749
#> 16 No -1.91984498 0.30281031
#> 17 No -0.30854811 1.07314736
#> 18 No -2.09342702 -0.33375116
#> 19 No -0.57984243 0.94788328
#> 20 No -1.04299574 0.97960623
#> 21 No -0.48914322 1.09651605
#> 22 No 1.95909036 0.62301445
#> 23 No 0.32071004 -2.08889830
#> 24 No -0.98998047 0.45250458
#> 25 No 0.78258023 -0.57429362
#> 26 No 0.04426842 -1.48160646
#> 27 No -1.61386524 -0.07911380
#> 28 No -0.54491597 0.24783255
#> 29 No -1.55084192 0.44819029
#> 30 No 0.40391743 -2.00554911
#> 31 No -0.57996600 -1.70075786
#> 32 No 0.34502429 -0.11452995
#> 33 No -1.42240697 -0.15749236
#> 34 No 0.56406328 -1.96536380
#> 35 No -0.99870646 0.16643333
#> 36 No 0.29262027 -1.86874500
#> 37 No 1.44551833 0.35333586
#> 38 No 1.69167557 0.16451481
#> 39 No -0.63712453 -2.37375325
#> 40 No -1.13339974 0.25853248
#> 41 No 1.60384482 0.21507984
#> 42 No -0.76946285 0.27068821
#> 43 No 0.58484861 -2.48727381
#> 44 No -1.33939478 -0.11824381
#> 45 No -1.01812834 -1.85177192
#> 46 No 0.57773883 -0.29486029
#> 47 No -1.11804972 -1.39796677
#> 48 No -1.79134432 -0.07027661
#> 49 No -0.56362892 -1.66805640
#> 50 No -1.61152940 0.06337827
plotComparison(df, rbind(df, new_df), attrs = names(new_df)[1:3])
Created on 2022-07-10 by the reprex package (v2.0.1)
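Note that numInstances = 50 leaves the data at 90 Yes and 60 No once the synthetic rows are appended. If you want the two classes to end up the same size, one option is to derive numInstances from the class counts. The sketch below assumes the same toy data as above; the variable names counts, n_needed and balanced are just illustrative, and the rwo call is skipped if the imbalance package is not installed.

```r
# Same toy data as above, with the class column already named "Class"
df <- data.frame(Class = rep(as.factor(c("Yes", "No")), times = c(90, 10)),
                 x1 = rnorm(100),
                 x2 = rnorm(100))

# Number of synthetic minority rows needed to match the majority class
counts   <- table(df$Class)
n_needed <- max(counts) - min(counts)

# Only call rwo() if the imbalance package is available
if (requireNamespace("imbalance", quietly = TRUE)) {
  new_df   <- imbalance::rwo(df, numInstances = n_needed)
  balanced <- rbind(df, new_df)
  print(table(balanced$Class))
}
```

Here n_needed works out to 80, the difference between the 90 majority rows and the 10 minority rows.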