How to simulate a strong correlation of data with R

Time:07-08

Sometimes I try to simulate data by using the rnorm function, which I have done below:

mom.iq <- rnorm(n=1000,
                mean=120,
                sd=15)
kid.score <- rnorm(n=1000,
                   mean=45,
                   sd=20)
df <- data.frame(mom.iq,
                 kid.score)

But when I plot something like this, I usually end up with data that's essentially uncorrelated:

library(ggpubr)

ggscatter(df,
          x="mom.iq",
          y="kid.score") +
  geom_smooth(method = "lm")

However, I would like to simulate something with a stronger correlation if possible. Is there an easy way to do this within R? I'm aware that I could just produce my own values manually, but that's not very practical for creating large samples.

CodePudding user response:

You are generating two independent variables, so it is expected that they are not correlated. What you can do instead is this:

# In order to make the values reproducible
  set.seed(12345)

# Generate independent variable
  x <- rnorm(n=1000, mean=120, sd=15)
# Generate the dependent variable
  y <- 3*x + 6 + rnorm(n=1000, mean = 0, sd = 5)

I used 3 and 6, but you can define them as you want (a and b) in order to get a linear dependence defined as y = a*x + b.

The rnorm(n=1000, mean = 0, sd = 5) term is added to introduce some variability and avoid a perfect correlation between x and y. To get more strongly correlated data, reduce the standard deviation (sd); to get a weaker correlation, increase it.
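As a rough check of this, the sketch below measures the sample correlation with cor() for a few noise levels. The helper name simulate_cor and the particular sd values are assumptions made for illustration, not part of the answer above. For y = a*x + b + noise with independent noise, the population correlation is a*sd_x / sqrt(a^2*sd_x^2 + sd_noise^2), so with a = 3, sd_x = 15 and sd_noise = 5 it is about 0.994.

```r
# Sketch: how the noise sd changes the correlation between x and y.
# simulate_cor() is a hypothetical helper introduced for illustration.
set.seed(12345)

simulate_cor <- function(noise_sd, n = 1000, a = 3, b = 6) {
  x <- rnorm(n = n, mean = 120, sd = 15)
  y <- a * x + b + rnorm(n = n, mean = 0, sd = noise_sd)
  cor(x, y)  # Pearson correlation of the simulated sample
}

# Illustrative noise levels, from small to large
noise_sds <- c(5, 20, 45, 100)
sample_cors <- sapply(noise_sds, simulate_cor)

# Population correlation implied by the linear model (a = 3, sd_x = 15)
theory_cors <- 3 * 15 / sqrt((3 * 15)^2 + noise_sds^2)

print(data.frame(noise_sd   = noise_sds,
                 sample_cor = round(sample_cors, 3),
                 theory_cor = round(theory_cors, 3)))
```

The sample values should sit close to the theoretical ones and fall as the noise sd grows.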

CodePudding user response:

You can create your second variable by taking the first variable into account and adding some error with rnorm, in order to avoid making the relationship completely deterministic.

library(ggplot2)

dat <- data.frame(father_age = rnorm(1000, 35, 5)) |> 
  dplyr::mutate(child_score = -father_age * 0.5 + rnorm(1000, 0, 4))

dat |> 
  ggplot(aes(father_age, child_score)) +
  geom_point() +
  geom_smooth(method = "lm")
#> `geom_smooth()` using formula 'y ~ x'
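If a particular correlation is wanted, rather than adjusting the error sd by trial and error, one option is to solve for it. This is only a sketch, assuming a linear relationship with independent normal error. The helper noise_sd_for() and the target of -0.8 are hypothetical choices for illustration. Rearranging cor = a*sd_x / sqrt(a^2*sd_x^2 + sd_e^2) gives sd_e = |a| * sd_x * sqrt(1 - rho^2) / |rho|.

```r
# Sketch: choose the error sd that yields a target correlation.
# noise_sd_for() is a hypothetical helper, not from any package.
noise_sd_for <- function(rho, slope, x_sd) {
  stopifnot(rho != 0, abs(rho) < 1)
  # Only the magnitude of rho is used; the sign of the resulting
  # correlation follows the sign of the slope.
  abs(slope) * x_sd * sqrt(1 - rho^2) / abs(rho)
}

set.seed(2022)
n          <- 1000
slope      <- -0.5
target_rho <- -0.8  # assumed target; its sign should match the slope

father_age  <- rnorm(n, mean = 35, sd = 5)
err_sd      <- noise_sd_for(target_rho, slope, x_sd = 5)
child_score <- slope * father_age + rnorm(n, mean = 0, sd = err_sd)

# The sample correlation should land near the target
print(cor(father_age, child_score))
```

The sample correlation will vary a little from run to run around the target, and the spread shrinks as n grows.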

Created on 2022-07-07 by the reprex package
