How to generate random data based on some criteria in R

I want to generate 300 random values based on the following criteria:

Class   value
0   1-8
1   9-11
2   12-14
3   15-16
4   17-20

Logic: when Class = 0, I want a random value between 1 and 8; when Class = 1, a random value between 9 and 11; and so on.

This gives me the following hypothetical table as an example:

   

 Class  Value
    0   7
    0   4
    1   10
    1   9
    1   11
    .   .
    .   .

I want to be able to generate both equal and unequal mixtures of the classes.

CodePudding user response：

You could do:

df <- data.frame(Class = sample(0:4, 300, TRUE))

df$Value <- sapply(list(1:8, 9:11, 12:14, 15:16, 17:20)[df$Class + 1],
                   sample, size = 1)

This gives you a data frame with 300 rows and appropriate numbers for each class:

head(df)
#>   Class Value
#> 1     0     3
#> 2     1    10
#> 3     4    19
#> 4     2    12
#> 5     4    19
#> 6     1    10

Created on 2022-12-30 with reprex v2.0.2
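
The question also asks for equal and unequal mixtures of the classes, which the snippet above leaves to chance. A minimal sketch of one way to control this in base R follows. The seed, the probability vector, and the helper name draw_values are assumptions made for illustration, not part of the original answer.

```r
set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible

ranges <- list(1:8, 9:11, 12:14, 15:16, 17:20)  # value ranges for classes 0..4
n <- 300

# Unequal mixture: classes are drawn with hypothetical probabilities
class_unequal <- sample(0:4, n, replace = TRUE,
                        prob = c(0.05, 0.1, 0.2, 0.4, 0.25))

# Exactly equal mixture: n / 5 rows per class, in shuffled order
class_equal <- sample(rep(0:4, each = n / 5))

# Hypothetical helper: draw one value from the range that belongs to each class.
# Indexing with sample.int() also works for a range holding a single value.
draw_values <- function(cls) {
  vapply(ranges[cls + 1],
         function(r) r[sample.int(length(r), 1)],
         integer(1))
}

df_unequal <- data.frame(Class = class_unequal, Value = draw_values(class_unequal))
df_equal   <- data.frame(Class = class_equal,   Value = draw_values(class_equal))

table(df_equal$Class)    # 60 rows in every class
table(df_unequal$Class)  # counts roughly follow the probabilities
```

The helper indexes into each range instead of calling sample(r, 1) directly because sample() treats a single number x >= 1 as the range 1:x, which would matter if a class ever had only one allowed value.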

CodePudding user response：

Here is a more flexible version that lets you use different probabilities in the sampling and keeps hard-coded values to a minimum:

# load data.table
library(data.table)

# this is the original data
a = structure(list(Class = 0:4, value = c("1-8", "9-11", "12-14", 
"15-16", "17-20")), row.names = c(NA, -5L), class = c("data.table", 
"data.frame"))

# this is to replace "-" by ":", we will use that in a second
a[, value := gsub("\\-", ":", value)]

# this is a vector of EQUAL probabilities
probs = rep(1/a[, uniqueN(Class)], a[, uniqueN(Class)])

# This is a vector of UNEQUAL Probabilities. If wanted, it should be 
# uncommented and adjusted manually
# probs = c(0.05, 0.1, 0.2, 0.4, 0.25)

# This is the number of Class samples wanted
numberOfSamples = 300

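# This is the workhorse. Rows are sampled by position (.N is the number of
# rows), because Class starts at 0 and cannot be used as a row index
a[sample(.N, numberOfSamples, TRUE, prob = probs), ][, 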
           smpl := apply(.SD, 
                         1, 
                         function(x) sample(eval(parse(text = x)), 1)), 
           .SDcols = "value"][, 
                  .(Class, smpl)]

What is good about this code?

If you change your classes or the value ranges, the only thing you need to edit is the original data frame (a, as I called it).
If you want to use uneven probabilities for your sampling, you can set them and the code still runs.
If you want to take a smaller or larger sample, you don't have to edit your code, you only change the value of a variable.
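
The answer evaluates each range string with eval(parse(text = x)). As a hedged alternative, the sketch below splits the "lo-hi" strings into numeric bounds and draws integers with runif(), so no text is executed as code. It uses base R only; the column names lo and hi, the seed, and the result name res are assumptions made for illustration.

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible

# Same lookup table as in the answer, as a plain data frame
a <- data.frame(Class = 0:4,
                value = c("1-8", "9-11", "12-14", "15-16", "17-20"))

# Split "lo-hi" into two integer columns instead of evaluating text as code
bounds <- do.call(rbind, strsplit(a$value, "-", fixed = TRUE))
a$lo <- as.integer(bounds[, 1])
a$hi <- as.integer(bounds[, 2])

probs <- c(0.05, 0.1, 0.2, 0.4, 0.25)  # unequal probabilities, one per class
numberOfSamples <- 300

# Sample row positions of the lookup table, weighted by probs
rows <- sample(nrow(a), numberOfSamples, replace = TRUE, prob = probs)

# runif() is vectorised over min and max; flooring a draw from [lo, hi + 1)
# gives an integer between lo and hi inclusive
res <- data.frame(
  Class = a$Class[rows],
  smpl  = as.integer(floor(runif(numberOfSamples, a$lo[rows], a$hi[rows] + 1)))
)

head(res)
prop.table(table(res$Class))  # should be close to probs
```

As in the answer, changing the classes or ranges only requires editing the lookup table, as long as probs still has one entry per row.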