Creating data with pre-determined correlations in R

Time:09-19

I am looking to simulate a data set with pre-determined correlations between the variables. The code below is where I am so far, but I want to be able to control the parameters of each variable individually.

In short, how do I change the SD, mean and min/max, intervals, skew and kurtosis for each variable individually?

library(tidyverse)
library(faux)

cmat <- c(1,   .195,  .346,  .674,  .561,  
         .195,  1,    .479,  .721,  .631,  
         .346, .479,   1,    .154,  .121, 
         .674, .721,  .154,   1,    .241, 
         .561, .631,  .121,  .241,   1)

nps_sales <- round(rnorm_multi(100, 5, 3, .5, cmat, 
                   varnames = c("NPS",
                                "change in NPS",
                                "sales (t0)",
                                "sales (t1)",
                                "sales (t2)")), 0) %>%
    tibble()

CodePudding user response:

You have specified rnorm_multi(n = 100, vars = 5, mu = 3, sd = .5, r = cmat, ...). rnorm_multi will also accept vectors of the appropriate length (one value per variable) for mu and sd, e.g. mu = c(3,3,3,2,2) and sd = c(1,0.5,0.5,1,2), which will set the means and standard deviations of each variable accordingly.
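As a minimal sketch of the per-variable mu and sd vectors described above: the three-variable correlation matrix, the variable names and the chosen means and SDs are illustrative assumptions, not the values from the question. empirical = TRUE is used so that the sample statistics match the requested ones exactly, which makes the result easy to check.

```r
library(faux)

# Illustrative 3-variable correlation matrix (an assumption for this sketch,
# not the matrix from the question)
cmat <- c(1,  .3, .5,
          .3, 1,  .4,
          .5, .4, 1)

set.seed(42)
dat <- rnorm_multi(
  n = 200, vars = 3,
  mu = c(3, 5, 10),      # one mean per variable
  sd = c(0.5, 1, 2),     # one SD per variable
  r = cmat,              # correlation matrix, given as a vector of length vars * vars
  varnames = c("a", "b", "c"),
  empirical = TRUE       # sample means, SDs and correlations match exactly
)

colMeans(dat)     # per-variable means
sapply(dat, sd)   # per-variable standard deviations
cor(dat)          # correlation matrix of the simulated data
```

With empirical = FALSE (the default) the same call draws from a population with these parameters, so the sample values will only be approximately equal to them.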

Adjusting the other characteristics (min/max, skew, kurtosis, etc.) will be much more challenging, and may require a question on CrossValidated. The reason everyone uses the multivariate normal is that it's easy to specify means, SDs, and correlations, but it gives you no easy control over the other aspects of the distributions. You can transform the results to achieve some level of skew/kurtosis, but this may not give you as much flexibility and control as you want (see e.g. here).
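One common way to transform the results, sketched here under stated assumptions, is to push correlated standard normals through pnorm() and then through the quantile function of whatever marginal distribution you want. The gamma and beta margins, their parameters, the [1, 10] range and the 0.5 correlation below are illustrative choices, and MASS::mvrnorm is used only to keep the sketch dependent on a package that ships with R. Note the limitation mentioned above: rank correlations survive the transformation, but the Pearson correlation generally drifts away from the value you asked for.

```r
library(MASS)

set.seed(1)
Sigma <- matrix(c(1,  .5,
                  .5, 1), nrow = 2)   # target correlation on the normal scale

# Step 1: correlated standard normals
z <- mvrnorm(n = 5000, mu = c(0, 0), Sigma = Sigma)

# Step 2: map each column to (0, 1); the dependence structure is kept
u <- pnorm(z)

# Step 3: apply the quantile function of the desired margins
x1 <- qgamma(u[, 1], shape = 2, rate = 1)          # right-skewed, lower bound 0
x2 <- 1 + 9 * qbeta(u[, 2], shape1 = 2, shape2 = 5) # bounded in [1, 10]

cor(x1, x2)                       # Pearson: near, but usually not equal to, 0.5
cor(x1, x2, method = "spearman")  # identical to the Spearman correlation of z
range(x2)                         # stays inside [1, 10]
```

If the Pearson correlation of the transformed variables has to hit a specific value, the input correlation usually needs to be adjusted by trial and error, which is part of why this is harder than the normal case.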
