Generating Random Numbers Based on Some Condition

Time:03-22

I am working with the R programming language.

I would like to generate random numbers: a1, a2, a3, b1, b2, b3

I would like there to be a condition such that:

  • a3 > a2 > a1
  • b3 > b2 > b1

I do not know how to do this directly, so I tried to generate a large data frame of numbers and only keep rows that match this condition:

a1 = rnorm(100000,10,10)
a2 = rnorm(100000,10,10)
a3 = rnorm(100000,10,10)
b1 = rnorm(100000,10,10)
b2 = rnorm(100000,10,10)
b3 = rnorm(100000,10,10)

my_data = data.frame(a1, a2, a3, b1, b2,b3)

This data frame looks like this:

head(my_data)
          a1         a2        a3         b1        b2         b3
1  5.6713342 -4.5930442  6.063861 28.9258586 -1.073999 23.7398862
2 17.5791993  5.1482061  6.683438  9.2969640  6.438304 10.2569026
3 13.9389949  8.9943351  1.089840 12.9340164 22.099974 -0.6791567
4 16.0257008 10.4139726 18.469092 10.9470812 20.105047  0.4710750
5 -0.1370202  0.9112077  4.349729 11.9442915 22.318155  8.7671923
6 18.8508432 -3.6210024  3.022941  0.6319464 14.406452 25.2002712

I then tried to make an "indicator" variable that indicates whether a row should be deleted or kept based on whether or not it matches the conditions:

my_data$indicator_a2_a1 = ifelse(my_data$a2 > my_data$a1, "TRUE", "FALSE")
my_data$indicator_a3_a2 = ifelse(my_data$a3 > my_data$a2, "TRUE", "FALSE")
my_data$indicator_a3_a1 = ifelse(my_data$a3 > my_data$a1, "TRUE", "FALSE")

my_data$indicator_b2_b1 = ifelse(my_data$b2 > my_data$b1, "TRUE", "FALSE")
my_data$indicator_b3_b2 = ifelse(my_data$b3 > my_data$b2, "TRUE", "FALSE")
my_data$indicator_b3_b1 = ifelse(my_data$b3 > my_data$b1, "TRUE", "FALSE")

With these indicators, the data now looks like this:

          a1         a2        a3         b1        b2         b3 indicator_a2_a1 indicator_a3_a2 indicator_a3_a1 indicator_b2_b1 indicator_b3_b2 indicator_b3_b1
1  5.6713342 -4.5930442  6.063861 28.9258586 -1.073999 23.7398862           FALSE            TRUE            TRUE           FALSE            TRUE           FALSE
2 17.5791993  5.1482061  6.683438  9.2969640  6.438304 10.2569026           FALSE            TRUE           FALSE           FALSE            TRUE            TRUE
3 13.9389949  8.9943351  1.089840 12.9340164 22.099974 -0.6791567           FALSE           FALSE           FALSE            TRUE           FALSE           FALSE
4 16.0257008 10.4139726 18.469092 10.9470812 20.105047  0.4710750           FALSE            TRUE            TRUE            TRUE           FALSE           FALSE
5 -0.1370202  0.9112077  4.349729 11.9442915 22.318155  8.7671923            TRUE            TRUE            TRUE            TRUE           FALSE           FALSE
6 18.8508432 -3.6210024  3.022941  0.6319464 14.406452 25.2002712           FALSE            TRUE           FALSE            TRUE            TRUE            TRUE

Finally, I isolated rows in which all indicators were TRUE:

final_file <- my_data[which(my_data$indicator_a2_a1 == "TRUE" & my_data$indicator_a3_a2 == "TRUE" & my_data$indicator_a3_a1 == "TRUE" & my_data$indicator_b2_b1 == "TRUE" & my_data$indicator_b3_b2 == "TRUE" &  my_data$indicator_b3_b1 == "TRUE"), ]

 dim(final_file)
[1] 2754   12

This accomplished the task, but I was wondering if there is a more "efficient" way to do it. For example, I randomly generated 100000 rows, but only 2754 of them (about 2.8%) met the conditions I wanted. The other problem is that I had to manually create 6 indicator variables to make sure all conditions were respected; had there been more conditions, I would have had to create many more indicator variables by hand.
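
The low acceptance rate is what one would expect: for three independent draws from the same continuous distribution, each of the 3! = 6 orderings is equally likely, so one triple is in increasing order with probability 1/6 and both triples together with probability 1/36 (about 2.8%). As a rough sketch, the six indicator columns can also be replaced by a single logical vector, since a3 > a1 already follows from a3 > a2 and a2 > a1. The names `keep` and `filtered` and the seed are illustrative and not part of the original code.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 100000
my_data <- data.frame(
  a1 = rnorm(n, 10, 10), a2 = rnorm(n, 10, 10), a3 = rnorm(n, 10, 10),
  b1 = rnorm(n, 10, 10), b2 = rnorm(n, 10, 10), b3 = rnorm(n, 10, 10)
)

# One logical vector covers all conditions;
# a3 > a1 and b3 > b1 are implied by the comparisons below
keep <- with(my_data, a1 < a2 & a2 < a3 & b1 < b2 & b2 < b3)
filtered <- my_data[keep, ]

mean(keep)  # observed acceptance rate, expected to be close to 1/36
```

Adding further conditions then only means extending the expression inside `with()`, rather than creating more columns.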

My Question: Is there a way to randomly generate data according to some conditions such that ALL rows produced would meet these conditions? Could this be done with a WHILE LOOP?

CodePudding user response:

Does simply generating a vector of random numbers for a and for b, and then sorting each with the sort() function, work for your use case? The following code satisfies your specified conditions:

a = rnorm(3,10,10)
b = rnorm(3,10,10)

a.ordered = sort(a)
b.ordered = sort(b)

df = data.frame(numbers = c(a.ordered,b.ordered),
                row.names = c("a1","a2","a3","b1","b2","b3"))

df
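
The code above produces a single set of six numbers. If many rows are needed, as in the question, the same sorting idea can be applied row by row. The following is a minimal sketch; `sort_rows`, `sorted_df`, and the value of `n` are hypothetical names and values chosen for illustration. Every generated row is kept, and sorted independent draws follow the same joint distribution as independent draws that happen to come out in increasing order, so the result should be comparable to the filtering approach in the question.

```r
n <- 5  # number of rows wanted (illustrative)

# Sort each row of a matrix; apply() returns the sorted rows as columns,
# so the result is transposed back with t()
sort_rows <- function(m) t(apply(m, 1, sort))

a <- sort_rows(matrix(rnorm(n * 3, 10, 10), ncol = 3))
b <- sort_rows(matrix(rnorm(n * 3, 10, 10), ncol = 3))

sorted_df <- data.frame(a, b)
names(sorted_df) <- c("a1", "a2", "a3", "b1", "b2", "b3")
sorted_df
```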

CodePudding user response:

A "direct" method could be creating your variables sequentially using tibble:

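library(tibble)

# Build each group from the top down: draw the largest value first,
# then subtract a non-negative amount to obtain the next smaller one.
fun <- function(n) {
    tibble(a3 = rnorm(n),
           a2 = a3 - abs(rnorm(n)),
           a1 = a2 - abs(rnorm(n)),
           b3 = rnorm(n),
           b2 = b3 - abs(rnorm(n)),
           b1 = b2 - abs(rnorm(n)))
}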

fun(10)

       a3     a2      a1       b3      b2     b1
    <dbl>  <dbl>   <dbl>    <dbl>   <dbl>  <dbl>
 1 -0.211 -0.901 -2.09   -0.988   -1.61   -2.40 
 2 -0.543 -2.04  -2.18   -0.0840  -1.06   -2.41 
 3 -0.190 -1.22  -1.41   -0.00393 -1.46   -1.73 
 4  2.11   1.36   1.20   -1.06    -2.21   -3.39 
 5  0.653 -0.156 -0.313   1.41     0.301  -0.539
 6 -1.16  -1.46  -2.71    0.387   -1.40   -4.00 
 7  1.56   0.865  0.676   1.18     0.863  -0.296
 8  1.01   0.544  0.0511  0.318    0.0864 -1.76 
 9  0.636  0.165 -1.83    0.929    0.905   0.210
10  0.633 -0.269 -1.01    0.466   -0.0685 -0.445
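
The question also asks whether this could be done with a WHILE loop. A minimal sketch of that idea, keeping the original rnorm(..., 10, 10) draws and repeating until enough valid rows have been collected, might look like the following. The names `draw_ordered`, `n_wanted`, and `batch` are hypothetical. Note that this still discards roughly 35 of every 36 candidate rows, so it is mainly useful when the conditions cannot be satisfied by construction.

```r
draw_ordered <- function(n_wanted, batch = 10000) {
  out <- NULL
  # Keep drawing batches until enough rows satisfy all conditions
  while (is.null(out) || nrow(out) < n_wanted) {
    m <- matrix(rnorm(batch * 6, 10, 10), ncol = 6,
                dimnames = list(NULL, c("a1", "a2", "a3", "b1", "b2", "b3")))
    ok <- m[, "a1"] < m[, "a2"] & m[, "a2"] < m[, "a3"] &
          m[, "b1"] < m[, "b2"] & m[, "b2"] < m[, "b3"]
    out <- rbind(out, m[ok, , drop = FALSE])
  }
  # Trim any surplus rows from the last batch
  as.data.frame(out[seq_len(n_wanted), , drop = FALSE])
}

res <- draw_ordered(1000)
dim(res)
```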