I´m trying to create a function to (visually) compare the distribution of a variable, with that of the same variable after a Box-Cox transformation. The variable is a single column pulled out of my entire data frame.

library(EnvStats)

bc_compare_1 <- function(var){
  bc_var <- boxcox(lm(var ~ 1))
  lambda <- bc_var$x[which.max(bc_var$y)]
  var_T <- (var^lambda - 1)/lambda
  g <- ggarrange(
    ggdensity(var, fill = "grey", alpha = 0.3)   
      geom_histogram(colour = 1, fill = "white", 
                     position = "identity", alpha = 0)   
      ggtitle("original")   
      theme(plot.title = element_text(size = 11)),
    ggdensity(var_T, fill = "grey", alpha = 0.3)   
      geom_histogram(colour = 1, fill = "white", 
                     position = "identity", alpha = 0)    
      ggtitle("transformed")   
      theme(plot.title = element_text(size = 11))) 
  g <- annotate_figure(g, top = text_grob(substring(deparse(substitute(var)),3), size = 11))
  l <- list(g, paste("lambda = ", lambda))
  return(l)
}

This unfortunately doesn´t work:

Error in model.frame.default(formula = var ~ 1, drop.unused.levels = TRUE) : 
object is not a matrix

I tried several things, but nothing works, and it seems that the problem is somehow with boxcox() not being able to deal with a linear model which was created within the function, cause I get the same error even in the simple example:

library(EnvStats)

testt <- function(var){
  boxcox(lm(var ~ 1))
}

edit: trying to include the data argument in the lm() function also didn´t seem to work:

testt <- function(data, var){
  data %>%
    pull(var) -> dvar
  lmvar <- lm(data = data, formula = dvar ~ 1)
  boxcox(lmvar)
}

-> also no good:

Error in model.frame.default(formula = (data %>% pull(var)) ~ 1, data = data, : 
'data' must be a data.frame, environment, or list

(the data is a dataframe)

Any ideas?

Thanks a lot in advance!

Guy

CodePudding user response：

Here is a solution.
The following function uses reformulate to put the formula's pieces together and corrects the lm call, apparently, boxcox needs the model.matrix x and the response y and these values default to FALSE, see the documentation for lm, section Arguments.

The column name var can be quoted or not.

suppressPackageStartupMessages(
  library(EnvStats)
)

testt <- function(data, var){
  fmla <- reformulate("1", var)
  lmvar <- lm(formula = fmla, data = data, x = TRUE, y = TRUE)
  boxcox(lmvar)
}

testt(iris, "Sepal.Length")
#> $lambda
#> [1] -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0
#> 
#> $objective
#> [1] 0.9839965 0.9881195 0.9910533 0.9927322 0.9931009 0.9921171 0.9897540
#> [8] 0.9860027 0.9808736
#> 
#> $objective.name
#> [1] "PPCC"
#> 
#> $optimize
#> [1] FALSE
#> 
#> $optimize.bounds
#> lower upper 
#>    NA    NA 
#> 
#> $eps
#> [1] 2.220446e-16
#> 
#> $lm.obj
#> 
#> Call:
#> lm(formula = fmla, data = data, x = TRUE, y = TRUE)
#> 
#> Coefficients:
#> (Intercept)  
#>       5.843  
#> 
#> 
#> $sample.size
#> [1] 150
#> 
#> $data.name
#> [1] "lmvar"
#> 
#> attr(,"class")
#> [1] "boxcoxLm"

# same output, omitted
testt(iris, Sepal.Length)

^{Created on 2022-08-26 by the reprex package (v2.0.1)}

Edit

An argument optimize could be useful whenever the optimal lambda is needed.

testt <- function(data, var, optimize = FALSE){
  var <- as.character(substitute(var))
  fmla <- reformulate("1", var)
  lmvar <- lm(formula = fmla, data = data, x = TRUE, y = TRUE)
  boxcox(lmvar, optimize = optimize)
}

testt(iris, Sepal.Length)$lambda
#[1] -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0

testt(iris, Sepal.Length, optimize = TRUE)$lambda
#[1] -0.1117881