I´m trying to create a function to (visually) compare the distribution of a variable, with that of the same variable after a Box-Cox transformation. The variable is a single column pulled out of my entire data frame.
library(EnvStats)
bc_compare_1 <- function(var){
bc_var <- boxcox(lm(var ~ 1))
lambda <- bc_var$x[which.max(bc_var$y)]
var_T <- (var^lambda - 1)/lambda
g <- ggarrange(
ggdensity(var, fill = "grey", alpha = 0.3)
geom_histogram(colour = 1, fill = "white",
position = "identity", alpha = 0)
ggtitle("original")
theme(plot.title = element_text(size = 11)),
ggdensity(var_T, fill = "grey", alpha = 0.3)
geom_histogram(colour = 1, fill = "white",
position = "identity", alpha = 0)
ggtitle("transformed")
theme(plot.title = element_text(size = 11)))
g <- annotate_figure(g, top = text_grob(substring(deparse(substitute(var)),3), size = 11))
l <- list(g, paste("lambda = ", lambda))
return(l)
}
This unfortunately doesn´t work:
Error in model.frame.default(formula = var ~ 1, drop.unused.levels = TRUE) :
object is not a matrix
I tried several things, but nothing works, and it seems that the problem is somehow with boxcox() not being able to deal with a linear model which was created within the function, cause I get the same error even in the simple example:
library(EnvStats)
testt <- function(var){
boxcox(lm(var ~ 1))
}
edit: trying to include the data argument in the lm() function also didn´t seem to work:
testt <- function(data, var){
data %>%
pull(var) -> dvar
lmvar <- lm(data = data, formula = dvar ~ 1)
boxcox(lmvar)
}
-> also no good:
Error in model.frame.default(formula = (data %>% pull(var)) ~ 1, data = data, :
'data' must be a data.frame, environment, or list
(the data is a dataframe)
Any ideas?
Thanks a lot in advance!
Guy
CodePudding user response:
Here is a solution.
The following function uses reformulate
to put the formula's pieces together and corrects the lm
call, apparently, boxcox
needs the model.matrix x
and the response y
and these values default to FALSE
, see the documentation for lm
, section Arguments.
The column name var
can be quoted or not.
suppressPackageStartupMessages(
library(EnvStats)
)
testt <- function(data, var){
fmla <- reformulate("1", var)
lmvar <- lm(formula = fmla, data = data, x = TRUE, y = TRUE)
boxcox(lmvar)
}
testt(iris, "Sepal.Length")
#> $lambda
#> [1] -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
#>
#> $objective
#> [1] 0.9839965 0.9881195 0.9910533 0.9927322 0.9931009 0.9921171 0.9897540
#> [8] 0.9860027 0.9808736
#>
#> $objective.name
#> [1] "PPCC"
#>
#> $optimize
#> [1] FALSE
#>
#> $optimize.bounds
#> lower upper
#> NA NA
#>
#> $eps
#> [1] 2.220446e-16
#>
#> $lm.obj
#>
#> Call:
#> lm(formula = fmla, data = data, x = TRUE, y = TRUE)
#>
#> Coefficients:
#> (Intercept)
#> 5.843
#>
#>
#> $sample.size
#> [1] 150
#>
#> $data.name
#> [1] "lmvar"
#>
#> attr(,"class")
#> [1] "boxcoxLm"
# same output, omitted
testt(iris, Sepal.Length)
Created on 2022-08-26 by the reprex package (v2.0.1)
Edit
An argument optimize
could be useful whenever the optimal lambda is needed.
testt <- function(data, var, optimize = FALSE){
var <- as.character(substitute(var))
fmla <- reformulate("1", var)
lmvar <- lm(formula = fmla, data = data, x = TRUE, y = TRUE)
boxcox(lmvar, optimize = optimize)
}
testt(iris, Sepal.Length)$lambda
#[1] -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
testt(iris, Sepal.Length, optimize = TRUE)$lambda
#[1] -0.1117881