Is there a way to loop through column names (not numbers) in R for linear models?

I have a data sheet with 40 data columns (40 different nutrients), with additional columns for plot numbers and factors. I would like to automatically loop through each column name and produce a linear model and summary for each. The data columns begin at column 10.

for(i in 10:ncol(df)) {       # for-loop over columns
  mod2 <- aov(i ~ block + tillage*residue + Error(subblock), data=df)
  summary(mod2)
}

This currently produces the following error:

Error in model.frame.default(formula = i ~ subblock, data = df, drop.unused.levels = TRUE) : variable lengths differ (found for 'subblock')

The variable lengths are consistent, so I imagine I am looping incorrectly.

The data look similar to the sample below (with more categorical columns at the start), with the nutrient columns beginning at column 10.

block	tillage	residue	subblock	nutrient 1	nutrient 2	etc.
b1	NT	NR	s1	0.5	0.6

CodePudding user response：

In general it is helpful to post a sample of your data using dput(). In the absence of that, I am going to use the built-in dataset mtcars to show how you can do this with formula():

head(mtcars)

#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
# Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
# Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# Select columns
desired_columns <- names(mtcars)[names(mtcars) != "mpg"]

for (column in desired_columns) {
    this_formula <- formula(paste("mpg ~", column))
    print(summary(lm(this_formula, data = mtcars)))
}

This will output the summary of lm(mpg ~ var) for each var in the data. The key is the paste() call, which builds the expression as a string; formula() then converts that string into a formula object. Hopefully you can see how this can be applied to your data.
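As a possible variation on the loop above, base R's reformulate() can build each formula from character strings without paste(). This is only a sketch on mtcars; the names predictors, formulas and fits are illustrative and do not come from the answer itself.

```r
# Predictors: every column of mtcars except the response "mpg".
predictors <- setdiff(names(mtcars), "mpg")

# Build one formula per predictor, e.g. mpg ~ cyl, mpg ~ disp, ...
formulas <- lapply(predictors, function(p) reformulate(p, response = "mpg"))
names(formulas) <- predictors

# Fit each model and keep the fits in a list named by predictor.
fits <- lapply(formulas, function(f) lm(f, data = mtcars))

# Look up a single model by column name rather than by position.
print(summary(fits[["wt"]]))
```

Storing the fits in a named list means each model can be retrieved later by its column name instead of being printed once and discarded.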

CodePudding user response：

Here is a simple base R solution:

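model <- list()
model_summary <- list()
for (i in 10:ncol(df)) {       # for-loop over columns
  col <- colnames(df)[i]       # name of the current response column
  # build the formula from the column name instead of the loop index
  formula <- as.formula(paste0(col, " ~ block + tillage*residue + Error(subblock)"))
  model[[i - 9]] <- aov(formula, data = df)   # i - 9 so the list starts at 1
  model_summary[[i - 9]] <- summary(model[[i - 9]])
}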

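Just create a new formula at each iteration using the name of the i-th column. In the original loop, the i in the formula is the loop counter (a single number) rather than a column of df, which is why R reports that the variable lengths differ.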
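To try this approach without the real data sheet, here is a minimal sketch on a made-up data set. Everything about the data is an assumption: the 4 x 2 x 2 layout, the definition of subblock as one block-by-tillage combination, and the column names nutrient_1 and nutrient_2 are hypothetical stand-ins for the asker's columns.

```r
# Hypothetical layout: 4 blocks x 2 tillage levels x 2 residue levels.
set.seed(1)
df <- expand.grid(block   = paste0("b", 1:4),
                  tillage = c("NT", "CT"),
                  residue = c("NR", "R"))

# Assumption: each block-by-tillage combination forms one subblock.
df$subblock <- interaction(df$block, df$tillage, drop = TRUE)

# Two made-up nutrient columns filled with random numbers.
df$nutrient_1 <- rnorm(nrow(df))
df$nutrient_2 <- rnorm(nrow(df))

# Pick the response columns by name instead of by position.
nutrients <- grep("^nutrient_", names(df), value = TRUE)

# One aov fit per nutrient, stored in a list named by column.
models <- lapply(nutrients, function(col) {
  f <- as.formula(paste(col, "~ block + tillage * residue + Error(subblock)"))
  aov(f, data = df)
})
names(models) <- nutrients

print(summary(models[["nutrient_1"]]))
```

Selecting columns by a name pattern avoids hard-coding the offset (column 10, index i - 9), so the code keeps working if columns are added before the nutrients.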

CodePudding user response：

You do not need a loop. You can just pass a matrix to the LHS of the formula:

dep <- names(iris)[names(iris) != "Species"]
f <- as.formula(sprintf("cbind(%s) ~ Species", paste(dep, collapse = ",")))

summary(lm(f, data = iris))
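With a matrix response, summary() returns one summary per response column, so individual statistics can be pulled out by name. A minimal sketch reusing the iris example; the object names fit, s and r2 are illustrative.

```r
dep <- names(iris)[names(iris) != "Species"]
f <- as.formula(sprintf("cbind(%s) ~ Species", paste(dep, collapse = ",")))

# A single call fits all four responses at once (class "mlm").
fit <- lm(f, data = iris)

# summary() yields a list with one element per response column.
s <- summary(fit)
print(names(s))

# Extract R-squared for each response from its own summary.
r2 <- sapply(s, function(x) x$r.squared)
print(r2)
```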

CodePudding user response：

Purrr solution:

Without a minimal working example (MWE) it is difficult to help you. My approach would be to split your dataset into two: one holding the dependent variables and one holding the independent variables. Then put each dependent variable into its own list element and bind the independent variables to it. Then you can "loop" through the list and apply whatever regression you like.

library(dplyr)  # select(), as_tibble(), %>%
library(purrr)  # map()

df <- mtcars

df_independent <- df %>%
  as_tibble() %>%
  # select independent variables
  select(9:10)

df_dependent <- df %>%
  as_tibble() %>%
  # select all dependent variables and store each column in a list
  select(1:8) %>%
  as.list() %>%
  map(as_tibble) %>%
  map(~ cbind(.x, df_independent))


df_dependent %>%
 # df_independent %>% colnames() %>% paste0(".x$",., collapse =" ")
  map(~ lm(.x$value ~ .x$am + .x$gear)) %>%
  map(summary)

CodePudding user response：

If you want the statistics in a table (which might come in handy) you can use the purrr and broom packages. Here's an example using the dataset mtcars:

Code

library(tidyr)
library(purrr)
library(broom)

formula <- lapply(colnames(mtcars)[3:ncol(mtcars)], function(x) as.formula(paste0(x, " ~ cyl")))

names(formula) <- format(formula)

table <- formula %>% map(~aov(.x, mtcars)) %>% map_dfr(tidy, .id="model")

Output

> head(table)
# A tibble: 6 x 7
  model      term         df     sumsq     meansq statistic   p.value
  <chr>      <chr>     <dbl>     <dbl>      <dbl>     <dbl>     <dbl>
1 disp ~ cyl cyl           1 387454.   387454.        131.   1.80e-12
2 disp ~ cyl Residuals    30  88731.     2958.         NA   NA       
3 hp ~ cyl   cyl           1 100984.   100984.         67.7  3.48e- 9
4 hp ~ cyl   Residuals    30  44743.     1491.         NA   NA       
5 drat ~ cyl cyl           1      4.34      4.34       28.8  8.24e- 6
6 drat ~ cyl Residuals    30      4.52      0.151      NA   NA

Try

formula <- lapply(colnames(df)[10:ncol(df)], function(x) as.formula(paste0(x, " ~ block + tillage * residue + Error(subblock)")))

names(formula) <- format(formula)

table <- formula %>% map(~aov(.x, df)) %>% map_dfr(tidy, .id="model")