How to avoid writing a large number of column names when fitting a model in R

Time:07-06

I want to apply the bs() function to the nonlinear variables in my dataset when fitting a logistic regression model.

df <- data.frame(a = c(0,1), b = c(0,1), d = c(0,1), e = c(0,1),
                  f= c("m","f"), output = c(0,1))
 
library(splines) 
model <- glm(output ~ bs(a, df=2) + bs(b, df=2) + bs(d, df=2) + bs(e, df=2) +
                      factor(f),
                      data = df, 
                      family = "binomial") 

In my actual dataset, far more columns need bs() applied than in this example. Is there a way to do this without writing out all the terms?

CodePudding user response:

We can use some string manipulation with sprintf, together with reformulate:

predictors <- c("a", "b", "d", "e")
bspl.terms <- sprintf("bs(%s, df = 2)", predictors)
other.terms <- "factor(f)"
form <- reformulate(c(bspl.terms, other.terms), response = "output")
#output ~ bs(a, df = 2) + bs(b, df = 2) + bs(d, df = 2) + bs(e, 
#    df = 2) + factor(f)
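
The formula object can then be passed to glm() directly. Here is a minimal sketch, assuming a simulated data frame called dat (hypothetical, not the asker's data) with enough rows to fit splines. It uses df = 3 rather than 2 because, with the default cubic degree, bs() warns that df is too small and uses 3 anyway.

```r
library(splines)

set.seed(1)
n <- 200
# Simulated data, purely illustrative (an assumption, not the asker's dataset)
dat <- data.frame(a = rnorm(n), b = rnorm(n), d = rnorm(n), e = rnorm(n),
                  f = sample(c("m", "f"), n, replace = TRUE))
dat$output <- rbinom(n, 1, plogis(0.5 * dat$a - 0.3 * dat$b))

# Build the spline terms as strings, then assemble the formula
predictors <- c("a", "b", "d", "e")
bspl.terms <- sprintf("bs(%s, df = 3)", predictors)
form <- reformulate(c(bspl.terms, "factor(f)"), response = "output")

# Fit the logistic regression using the generated formula
model <- glm(form, data = dat, family = binomial)
```

Calling summary(model) afterwards shows one coefficient per spline basis column, so it is easy to check that every intended variable made it into the model.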

If you want a different df and degree for each spline, that is also straightforward (note that df cannot be smaller than degree).

predictors <- c("a", "b", "d", "e")
dof <- c(3, 4, 3, 6)
degree <- c(2, 2, 2, 3)
bspl.terms <- sprintf("bs(%s, df = %d, degree = %d)", predictors, dof, degree)
other.terms <- "factor(f)"
form <- reformulate(c(bspl.terms, other.terms), response = "output")
#output ~ bs(a, df = 3, degree = 2) + bs(b, df = 4, degree = 2) + 
#    bs(d, df = 3, degree = 2) + bs(e, df = 6, degree = 3) + factor(f)

Prof. Ben Bolker: I was going to suggest something a little bit fancier, something like predictors <- setdiff(names(df)[sapply(df, is.numeric)], "output").

Yes, this is safer. It is also a way to automate the selection if the OP wants to include every numeric variable other than "output" as a predictor.
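
Putting that suggestion together with reformulate, a rough sketch might look like the following. The data frame dat is simulated for illustration, and treating every non-numeric column as a factor term is an assumption that may not suit every dataset.

```r
set.seed(1)
n <- 200
# Simulated data, purely illustrative (an assumption, not the asker's dataset)
dat <- data.frame(a = rnorm(n), b = rnorm(n), d = rnorm(n), e = rnorm(n),
                  f = sample(c("m", "f"), n, replace = TRUE))
dat$output <- rbinom(n, 1, 0.5)

# Numeric columns other than the response get spline terms
num.vars <- setdiff(names(dat)[sapply(dat, is.numeric)], "output")
# Assumption: every remaining column is categorical and enters as a factor
fac.vars <- setdiff(names(dat), c(num.vars, "output"))

rhs.terms <- c(sprintf("bs(%s, df = 3)", num.vars),
               sprintf("factor(%s)", fac.vars))
form <- reformulate(rhs.terms, response = "output")
```

Printing num.vars before fitting is a cheap check that no ID column or other unintended numeric variable has slipped into the predictor list.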
