How to generate random data based on some criteria in R

Time:12-31

I want to generate 300 random values based on the following criteria:

Class   value
0   1-8
1   9-11
2   12-14
3   15-16
4   17-20

Logic: when Class = 0, I want a random value between 1 and 8; when Class = 1, a random value between 9 and 11; and so on.

This gives me the following hypothetical table as an example:

   

 Class  Value
    0   7
    0   4
    1   10
    1   9
    1   11
    .   .
    .   .

I want to be able to generate both equal and unequal mixtures of the classes.

CodePudding user response:

You could do:

df <- data.frame(Class = sample(0:4, 300, TRUE))

# Class starts at 0, so add 1 to index the list of ranges
df$Value <- sapply(list(1:8, 9:11, 12:14, 15:16, 17:20)[df$Class + 1],
                   sample, size = 1)

This gives you a data frame with 300 rows and appropriate numbers for each class:

head(df)
#>   Class Value
#> 1     0     3
#> 2     1    10
#> 3     4    19
#> 4     2    12
#> 5     4    19
#> 6     1    10

Created on 2022-12-30 with reprex v2.0.2
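
The call above draws the classes with equal probability. As a minimal sketch of an unequal mixture, `sample()` also accepts a `prob` argument. The weights below are made up for illustration, and `ranges`, `weights` and `draw_value` are names introduced here, not part of the answer above.

```r
set.seed(1)  # only to make the sketch reproducible

# value ranges per class, in the order of classes 0..4
ranges <- list(1:8, 9:11, 12:14, 15:16, 17:20)

# hypothetical class weights (an assumption, adjust as needed)
weights <- c(0.05, 0.10, 0.20, 0.40, 0.25)

df <- data.frame(Class = sample(0:4, 300, replace = TRUE, prob = weights))

# pick one element by position; this stays safe even if a range has length 1,
# where sample(x, 1) would instead draw from 1:x
draw_value <- function(r) r[sample.int(length(r), 1)]
df$Value <- vapply(ranges[df$Class + 1], draw_value, integer(1))

table(df$Class)  # class counts follow the weights only approximately
```

If exact class counts are wanted rather than probabilities, one option is to build the `Class` column with `rep(0:4, times = counts)` and shuffle the rows afterwards.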

CodePudding user response:

Here is a version with some additional flexibility: different probabilities can be used in the sampling, and as few values as possible are hard-coded:

# load data.table
library(data.table)

# this is the original data
a = structure(list(Class = 0:4, value = c("1-8", "9-11", "12-14", 
"15-16", "17-20")), row.names = c(NA, -5L), class = c("data.table", 
"data.frame"))

# this is to replace "-" by ":", we will use that in a second
a[, value := gsub("\\-", ":", value)]

# this is a vector of EQUAL probabilities
probs = rep(1/a[, uniqueN(Class)], a[, uniqueN(Class)])

# This is a vector of UNEQUAL Probabilities. If wanted, it should be 
# uncommented and adjusted manually
# probs = c(0.05, 0.1, 0.2, 0.4, 0.25)

# This is the number of Class samples wanted
numberOfSamples = 300

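# This is the workhorse. Rows are sampled by position (.N is the row count),
# because Class starts at 0 and so cannot be used directly as a row index
a[sample(.N, numberOfSamples, TRUE, prob = probs), ][, 
           smpl := apply(.SD, 
                         1, 
                         function(x) sample(eval(parse(text = x)), 1)), 
           .SDcols = "value"][, 
                  .(Class, smpl)]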

What is good about this code?

  • If you change your classes, or the value ranges, the only change you need to be concerned about is the original data frame (a, as I called it)
  • If you want to use uneven probabilities for your sampling, you can set them and the code still runs.
  • If you want to take a smaller or larger sample, you don't have to edit your code, you only change the value of a variable.
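
The same three properties can be kept without `eval(parse())` if the bounds are stored as numbers. The following is a rough base R sketch under that assumption; `spec`, `lo`, `hi`, `idx` and `out` are hypothetical names, and the table is rewritten by hand from the ranges in the question.

```r
set.seed(1)  # only to make the sketch reproducible

# one row per class, with numeric lower and upper bounds (inclusive)
spec <- data.frame(Class = 0:4,
                   lo = c(1, 9, 12, 15, 17),
                   hi = c(8, 11, 14, 16, 20))

# equal probabilities; replace with any vector of length nrow(spec)
probs <- rep(1 / nrow(spec), nrow(spec))
numberOfSamples <- 300

# sample row positions, not Class labels, so a class labelled 0 is no problem
idx <- sample(nrow(spec), numberOfSamples, replace = TRUE, prob = probs)

# vectorised uniform integer draw from lo..hi for every sampled row
width <- spec$hi[idx] - spec$lo[idx] + 1
out <- data.frame(Class = spec$Class[idx],
                  smpl = spec$lo[idx] + floor(runif(numberOfSamples) * width))

head(out)
```

Changing the classes, ranges, probabilities or sample size still only means editing `spec`, `probs` or `numberOfSamples`.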