I am working with the R programming language.
I am trying to simulate random discrete data that contains "correlations" between the variables. For example, this is what I have tried so far (I generated random continuous data with correlations, then converted all values below a certain threshold to 0 and the rest to 1):
library(MASS)
n <- 11
A <- matrix(runif(n^2)*2-1, ncol=n)
s <- t(A) %*% A
my_data <- mvrnorm(100, mu = rnorm(11, 10, 1), Sigma = s)
my_data <- data.frame(my_data)
colnames(my_data) <- c(paste0("p", 1:10), "result")
my_data[my_data < 9] <- 0
my_data[my_data >= 9] <- 1
p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 result
1 1 1 1 0 1 1 1 0 0 0 0
2 0 1 1 1 1 0 1 1 1 1 1
3 1 1 1 0 1 0 1 1 1 1 1
4 1 1 1 0 1 1 1 1 1 1 1
5 0 1 1 1 1 0 1 1 0 0 0
6 1 0 1 0 1 1 1 0 1 1 1
I am not sure if I have done this correctly: I have simulated random discrete data, but I am not sure whether I have preserved the correlation structure within the data. For instance, I would have liked there to be correlation patterns such as:
- When p1 = p5 = p9 = 1 -> "result" is more likely to be 1 (i.e. take all rows where p1 = p5 = p9 = 1 and measure the percentage of 1's in the result column)
- When p3 = p4 = 0 and p9 = 1 -> "result" is more likely to be 0
- etc.
Is there some other way to do this?
Thanks!
CodePudding user response:
I think you may be able to get a better answer on a site focused on statistics, but one way to finish the conversion to 0/1 data is to threshold each column:
# final conversion to binomial
my_data <- as.data.frame(apply(my_data, 2, function(x){as.numeric(x>0.5)}))
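For context, here is a sketch of how that conversion could slot into the simulation from the question; the pnorm(scale(...)) step is my assumption for putting each column on a 0-1 scale so that the 0.5 cutoff makes sense:

library(MASS)

n <- 11
A <- matrix(runif(n^2) * 2 - 1, ncol = n)
s <- t(A) %*% A                                  # random positive semi-definite covariance
raw <- mvrnorm(100, mu = rep(0, n), Sigma = s)   # correlated continuous draws

# standardise each column, map it to (0, 1), then apply the 0.5 cutoff
unit <- pnorm(scale(raw))
my_data <- as.data.frame(apply(unit, 2, function(x) as.numeric(x > 0.5)))
colnames(my_data) <- c(paste0("p", 1:10), "result")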
CodePudding user response:
If you are happy with p1 through p10 and just want to use your stated rules to generate the result column, then you can do a kind of reverse logistic regression. First, set up your rules to give you numerical scores: here we get +1 if p1 = p5 = p9 = 1, and -1 if p3 = 0, p4 = 0 and p9 = 1:
log_odds <- with(my_data, p1 * p5 * p9)
log_odds <- log_odds - with(my_data, (1 - p3) * (1 - p4) * p9)
Now we convert these to probabilities of getting a 1 in our result column:
odds <- exp(log_odds)
probs <- odds / (1 + odds)
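(Equivalently, probs <- plogis(log_odds) computes the same logistic transformation in one step.)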
Finally, we use probs to generate a binomial sample:
my_data$result <- rbinom(nrow(my_data), size = 1, prob = probs)
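(If you want the counts in the tables below to be reproducible, call set.seed() before drawing the sample.)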
We can see that overall our sample has about a 50% chance of having a 1 or 0:
table(my_data$result)
#> 0 1
#> 47 53
But the odds of a 1 are much increased when p1 = p5 = p9 = 1:
table(my_data$result[with(my_data, p1 == 1 & p5 == 1 & p9 == 1)])
#> 0 1
#> 3 18
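A similar check (not shown in the original output) should reveal the opposite shift for the second rule; the exact counts will depend on the random draw, and the subset may be small with only 100 rows:

# rows where p3 = p4 = 0 and p9 = 1 should contain noticeably fewer 1's
table(my_data$result[with(my_data, p3 == 0 & p4 == 0 & p9 == 1)])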
It is possible to control the background probability and the strength of the correlations by adjusting the weightings used to build log_odds.
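For example, here is a sketch with an explicit intercept and rule weights; the specific values are illustrative assumptions, not part of the answer above:

intercept <- -0.5   # background log-odds: plogis(-0.5) gives roughly a 38% chance of a 1 when no rule fires
w1 <- 2             # how strongly p1 = p5 = p9 = 1 pushes the result towards 1
w2 <- 1.5           # how strongly p3 = p4 = 0, p9 = 1 pushes the result towards 0

log_odds <- with(my_data, intercept + w1 * p1 * p5 * p9 - w2 * (1 - p3) * (1 - p4) * p9)
my_data$result <- rbinom(nrow(my_data), size = 1, prob = plogis(log_odds))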