Sometimes I try to simulate data using the rnorm function, as I have done below:
mom.iq <- rnorm(n = 1000, mean = 120, sd = 15)
kid.score <- rnorm(n = 1000, mean = 45, sd = 20)
df <- data.frame(mom.iq, kid.score)
But when I plot something like this, the data usually turn out to be essentially uncorrelated:
library(ggpubr)
ggscatter(df, x = "mom.iq", y = "kid.score") +
  geom_smooth(method = "lm")
However, I would like to simulate something with a stronger correlation if possible. Is there an easy way to do this within R? I'm aware that I could produce my own values manually, but that's not practical for creating large samples.
CodePudding user response:
You are generating two independent variables, so it is expected that they are not correlated. What you can do instead is this:
# In order to make the values reproducible
set.seed(12345)
# Generate independent variable
x <- rnorm(n=1000, mean=120, sd=15)
# Generate the dependent variable as a linear function of x plus noise
y <- 3*x + 6 + rnorm(n=1000, mean = 0, sd = 5)
I used 3 and 6, but you can set these coefficients (a and b) to whatever you want in order to get a linear dependence defined as y = a*x + b.
The rnorm(n=1000, mean = 0, sd = 5) term is added to introduce some variability and avoid a perfect correlation between x and y. If you want more strongly correlated data, reduce the standard deviation (sd); to get a lower correlation, increase it.
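To make the effect of sd concrete: for y = a*x + b + e, with noise e independent of x, the population correlation is a*sd_x / sqrt(a^2*sd_x^2 + sd_e^2). With the values used here (a = 3, sd_x = 15, sd_e = 5) that comes to about 0.99. The sketch below is only an illustration and not part of the code above; the helper name noise_sd_for and the target of 0.6 are assumptions chosen for the example. It solves the formula for the noise sd that yields a chosen correlation.

```r
# Hypothetical helper (an assumption for this example): noise sd needed so
# that y = a*x + b + e has population correlation of magnitude `rho` with x.
# Derived from rho = a*sd_x / sqrt(a^2*sd_x^2 + sd_e^2), for 0 < |rho| < 1.
noise_sd_for <- function(rho, a, sd_x) {
  abs(a) * sd_x * sqrt(1 - rho^2) / abs(rho)
}

set.seed(12345)
a <- 3
b <- 6
sd_x <- 15
target_rho <- 0.6                            # assumed target, pick your own

x <- rnorm(n = 1000, mean = 120, sd = sd_x)
sd_e <- noise_sd_for(target_rho, a, sd_x)    # 60 for these values
y <- a * x + b + rnorm(n = 1000, mean = 0, sd = sd_e)

cor(x, y)  # close to 0.6, up to sampling noise
```

The sign of the resulting correlation follows the sign of a, so use a negative slope if you want a negative correlation.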
CodePudding user response:
You can create your second variable by taking the first variable into account and adding some error with rnorm, in order to avoid making the relationship completely deterministic.
library(ggplot2)
dat <- data.frame(father_age = rnorm(1000, 35, 5)) |>
  dplyr::mutate(child_score = -father_age * 0.5 + rnorm(1000, 0, 4))
dat |>
  ggplot(aes(father_age, child_score)) +
  geom_point() +
  geom_smooth(method = "lm")
#> `geom_smooth()` using formula 'y ~ x'
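If you would rather fix the correlation and both variables' means and standard deviations up front, the same "signal plus noise" idea can be applied to standardized variables. This is a sketch of a common construction, not something taken from the answers above; the target r = 0.7 is an assumed value, while the means and standard deviations are those from the question.

```r
set.seed(12345)
n <- 1000
r <- 0.7                                # assumed target correlation

z1 <- rnorm(n)                          # standard normal, drives mom.iq
z2 <- rnorm(n)                          # independent standard normal noise
z_kid <- r * z1 + sqrt(1 - r^2) * z2    # variance 1, correlation r with z1

df <- data.frame(
  mom.iq    = 120 + 15 * z1,            # mean 120, sd 15
  kid.score = 45 + 20 * z_kid           # mean 45, sd 20
)

cor(df$mom.iq, df$kid.score)            # close to 0.7, up to sampling noise
```

Shifting and rescaling by positive constants does not change the correlation, so the target carries over to the rescaled columns. For more than two variables, MASS::mvrnorm() accepts a mean vector and a covariance matrix and may be more convenient.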