When does R evaluate the formula argument passed into lm()? For example:
tmp <- rnorm(1000)
y <- rnorm(1000) + tmp
df = data.frame(x=tmp)
lm(y~x, data=df)$coefficients
returns
(Intercept)           x
 0.01713098  0.98073687
This suggests that y~x is evaluated after control is passed to lm(), because there's no x variable in the calling context. However, it also suggests it is evaluated before control is passed to lm(), because there's no y column in df.
In most languages the arguments are fully evaluated before being passed down the stack. I'm probably missing something subtle here. Any help would be appreciated.
CodePudding user response:
The formula operator ~ is really just a function that captures unevaluated symbols. It's a lot like quote(), where you can run quote(f(x) + 1) and get back the unevaluated R expression for the syntax you pass in. The main difference is that ~ allows you to have two sets of unevaluated expressions (a left-hand side and a right-hand side), and that the formula keeps track of the environment where it was created. So these are the same:
a ~ b
`~`(a, b)
So you can pass any valid R syntax into a formula; it doesn't have to be a variable name. The values are stored as unevaluated symbols or calls. In other words, the call to ~ itself is evaluated, but its arguments are not.
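To make that concrete, here is a small sketch that builds a formula from names that are not defined anywhere (undefined_response and undefined_predictor are made up for illustration) and then inspects its pieces. Nothing errors, because ~ never evaluates its operands.

```r
# A formula may mention names that do not exist yet,
# because `~` does not evaluate its operands.
f <- undefined_response ~ log(undefined_predictor) + 1

class(f)        # "formula"
f[[1]]          # the symbol `~`
f[[2]]          # left-hand side: the symbol undefined_response
f[[3]]          # right-hand side: the call log(undefined_predictor) + 1
class(f[[2]])   # "name"
class(f[[3]])   # "call"

# The formula also remembers the environment it was created in
# (the global environment when this is run at top level).
environment(f)
all.vars(f)     # the variable names the formula refers to
```

The stored environment is what lets a modeling function later find variables that are not in the data argument.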
Other functions can use these formulas however they like. The lm() function, for example, passes the formula to model.frame(). Then model.frame() attempts to evaluate the variables found in the formula in the data.frame that was passed to the function and, if they are not present there, in the environment where the formula was created. Different functions may choose to evaluate formulas in different ways, so it's really not possible to say with certainty when the evaluation will happen.
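Applied to the setup from the question, this lookup order can be checked directly with model.frame(). This is only a sketch: the seed and the sample size of 10 are arbitrary choices, and form and mf are illustrative names.

```r
set.seed(1)                   # arbitrary seed, only for reproducibility
tmp <- rnorm(10)
y   <- rnorm(10) + tmp        # y exists only in the calling environment
df  <- data.frame(x = tmp)    # x exists only as a column of df

form <- y ~ x
mf <- model.frame(form, data = df)

names(mf)                     # "y" "x"
all.equal(mf$x, df$x)         # TRUE: x was found in df
all.equal(mf$y, y)            # TRUE: y was found via environment(form)
```

So both observations in the question are consistent: neither variable is evaluated at the call site, and each one is resolved later, first in df and then in the formula's environment.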
It usually happens something like this
myformula <- a~b
b <- 4
dd <- data.frame(a=1:3)
eval(myformula[[2]], dd, environment(myformula)) +   # a, found in dd
  eval(myformula[[3]], dd, environment(myformula))   # b, found in the formula's environment
# [1] 5 6 7
The symbols are looked up in the environment/list passed to the evaluator.
The environment part is also relevant, because you can have situations like this:
b <- 10
foo <- function(myformula = a~b) {
  b <- 4
  dd <- data.frame(a=1:3)
  eval(myformula[[2]], dd, environment(myformula)) +
    eval(myformula[[3]], dd, environment(myformula))
}
foo()
# [1] 5 6 7
foo(a~b)
# [1] 11 12 13
If you call foo without a parameter, the default formula is created inside the function, so R will look for the value of b there. But if you call foo with a formula, that formula is created in the calling environment, so R will look for the value of b outside the function. This is how most modeling functions behave. Again, this is just a convention, and non-base R packages may choose to do something else.
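The same effect shows up with lm() itself. In the sketch below, fit and d are hypothetical names, and the data are chosen so that each fit is exact and the slope reveals which y was used.

```r
d <- data.frame(x = 1:5)
y <- 10 * (1:5)                  # outer y: slope 10 against x

fit <- function(formula = y ~ x) {
  y <- 2 * (1:5)                 # local y: slope 2 against x
  lm(formula, data = d)          # x comes from d, y from the formula's environment
}

coef(fit())[["x"]]        # about 2: default formula created inside fit(), sees the local y
coef(fit(y ~ x))[["x"]]   # about 10: formula created by the caller, sees the outer y
```

The two calls pass formulas that print identically, so the difference in the result comes entirely from the environment each formula carries.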