Linear regression for each category of a variable-CodePudding

Let's say I'm working with the iris dataset in R:

data(iris)

summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   : 4.300   Min.   : 2.000   Min.   : 1.000   Min.   : 0.100  
 1st Qu.: 5.100   1st Qu.: 2.800   1st Qu.: 1.600   1st Qu.: 0.300  
 Median : 5.800   Median : 3.000   Median : 4.350   Median : 1.300  
 Mean   : 5.843   Mean   : 3.057   Mean   : 3.758   Mean   : 1.199  
 3rd Qu.: 6.400   3rd Qu.: 3.300   3rd Qu.: 5.100   3rd Qu.: 1.800  
 Max.   : 7.900   Max.   : 4.400   Max.   : 6.900   Max.   : 2.500  
       Species  
 setosa    : 50  
 versicolor: 50  
 virginica : 50

I want to perform a linear regression where Petal.Length is the dependent variable, and Sepal.Length is the independent variable. How can I, in R, perform this regression for each Species category at once, getting values of P, R² and F for each test?

CodePudding user response：

Use by.

by(iris, iris$Species, \(x) summary(lm(Petal.Length ~ Sepal.Length, x)))
# iris$Species: setosa
# 
# Call:
#   lm(formula = Petal.Length ~ Sepal.Length, data = x)
# 
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -0.40856 -0.08027 -0.00856  0.11708  0.46512 
# 
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)  
# (Intercept)   0.80305    0.34388   2.335   0.0238 *
#   Sepal.Length  0.13163    0.06853   1.921   0.0607 .
# ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.1691 on 48 degrees of freedom
# Multiple R-squared:  0.07138, Adjusted R-squared:  0.05204 
# F-statistic:  3.69 on 1 and 48 DF,  p-value: 0.0607
# 
# --------------------------------------------------------- 
#   iris$Species: versicolor
# 
# Call:
#   lm(formula = Petal.Length ~ Sepal.Length, data = x)
# 
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -0.68611 -0.22827 -0.04123  0.19458  0.79607 
# 
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   0.18512    0.51421   0.360     0.72    
# Sepal.Length  0.68647    0.08631   7.954 2.59e-10 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.3118 on 48 degrees of freedom
# Multiple R-squared:  0.5686,  Adjusted R-squared:  0.5596 
# F-statistic: 63.26 on 1 and 48 DF,  p-value: 2.586e-10
# 
# --------------------------------------------------------- 
#   iris$Species: virginica
# 
# Call:
#   lm(formula = Petal.Length ~ Sepal.Length, data = x)
# 
# Residuals:
#   Min       1Q   Median       3Q      Max 
# -0.68603 -0.21104  0.06399  0.18901  0.66402 
# 
# Coefficients:
#   Estimate Std. Error t value Pr(>|t|)    
# (Intercept)   0.61047    0.41711   1.464     0.15    
# Sepal.Length  0.75008    0.06303  11.901  6.3e-16 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.2805 on 48 degrees of freedom
# Multiple R-squared:  0.7469,  Adjusted R-squared:  0.7416 
# F-statistic: 141.6 on 1 and 48 DF,  p-value: 6.298e-16

Edit

To elaborate on my comment, we can extract the desired values very easily doing

by(iris, iris$Species, \(x) lm(Petal.Length ~ Sepal.Length, x)) |>
  lapply(\(x) {
    with(summary(x), c(r2=r.squared, f=fstatistic, 
                       p=do.call(pf, c(as.list(unname(fstatistic)), lower.tail=FALSE))))
  }) |> do.call(what=rbind)
#                    r2    f.value f.numdf f.dendf            p
# setosa     0.07138289   3.689765       1      48 6.069778e-02
# versicolor 0.56858983  63.263024       1      48 2.586190e-10
# virginica  0.74688439 141.636664       1      48 6.297786e-16

CodePudding user response：

If you would like to pull out those values, we can use

library (dplyr) 

df <- iris

list_res <- df %>%
  base::split (., df$Species, drop = FALSE) %>%
  lapply (., function (x) {
   fit <- lm(Petal.Length ~ Sepal.Length, data = x) %>%
      summary ()
   r <- fit$r.squared
   coeffs <- fit$coefficients %>% 
     as_tibble ()
   f <- fit$fstatistic[[1]] 
   list_res <- list (r, coeffs, f)
   names (list_res) <- c("R-Squared", "Coefficients", "F-Value")
   return (list_res)
  })

That returns a list of three objects for each regression model including the desired values. I've left the coefficients table here as it is, since it's always good to know to which independent variable your p-values belong. If you want those p-values pulled out separately, we can use coeffs <- fit$coefficients [,4] %>% as.list (), for instance.