I want to generate some data in order to show partial correlation to control for a confounder.
Specifically, I want to generate data about two uncorrelated random variables (let's say speech and memory) and use a third variable to influence them both (age).
I'd expect to observe a strong correlation between speech and memory, due to the confounder age, and no correlation between the same two variables if I control for age (that is, calculate a partial correlation on age).
That said, I can't generate the strong correlation with my code.
age <- rep(1:10, 10)
speech <- age * abs(rnorm(100))
memory <- age * abs(rnorm(100))
cor(speech, memory) # correlation, it should be high but it's not
residuals_speech <- lm(speech ~ age)$residuals
residuals_memory <- lm(memory ~ age)$residuals
cor(residuals_speech, residuals_memory) # partial correlation controlling for age, it should be around zero
CodePudding user response:
The random noise needs to be additive, not multiplicative:
set.seed(1)
age <- rep(1:10, 10)
speech <- age rnorm(100)
memory <- age rnorm(100)
Now we have a high correlation between speech and memory:
cor(speech, memory)
#> [1] 0.9067772
But the correlation of the residuals once age is taken into account is close to zero, as expected.
residuals_speech <- lm(speech ~ age)$residuals
residuals_memory <- lm(memory ~ age)$residuals
cor(residuals_speech, residuals_memory)
#> [1] 0.0001755437
To get speech and memory to be on different scales with different amounts of noise (as real data might be), you can multiply age
and the random noise by scalar values. This will retain the high correlation caused by the confounder.
set.seed(1)
age <- rep(1:10, 10)
speech <- 5 * age 3 * rnorm(100)
memory <- 2.1 * age - 2 * rnorm(100)
cor(speech, memory)
#> [1] 0.9361466
residuals_speech <- lm(speech ~ age)$residuals
residuals_memory <- lm(memory ~ age)$residuals
cor(residuals_speech, residuals_memory)
#> [1] -0.0001755437
Created on 2022-11-21 with reprex v2.0.2