Let's say I'm working with the iris
dataset in R:
data(iris)
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. : 4.300 Min. : 2.000 Min. : 1.000 Min. : 0.100
1st Qu.: 5.100 1st Qu.: 2.800 1st Qu.: 1.600 1st Qu.: 0.300
Median : 5.800 Median : 3.000 Median : 4.350 Median : 1.300
Mean : 5.843 Mean : 3.057 Mean : 3.758 Mean : 1.199
3rd Qu.: 6.400 3rd Qu.: 3.300 3rd Qu.: 5.100 3rd Qu.: 1.800
Max. : 7.900 Max. : 4.400 Max. : 6.900 Max. : 2.500
Species
setosa : 50
versicolor: 50
virginica : 50
I want to perform a linear regression where Petal.Length
is the dependent variable, and Sepal.Length
is the independent variable. How can I, in R, perform this regression for each Species
category at once, getting values of P, R² and F for each test?
CodePudding user response:
Use by
.
by(iris, iris$Species, \(x) summary(lm(Petal.Length ~ Sepal.Length, x)))
# iris$Species: setosa
#
# Call:
# lm(formula = Petal.Length ~ Sepal.Length, data = x)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.40856 -0.08027 -0.00856 0.11708 0.46512
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.80305 0.34388 2.335 0.0238 *
# Sepal.Length 0.13163 0.06853 1.921 0.0607 .
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.1691 on 48 degrees of freedom
# Multiple R-squared: 0.07138, Adjusted R-squared: 0.05204
# F-statistic: 3.69 on 1 and 48 DF, p-value: 0.0607
#
# ---------------------------------------------------------
# iris$Species: versicolor
#
# Call:
# lm(formula = Petal.Length ~ Sepal.Length, data = x)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.68611 -0.22827 -0.04123 0.19458 0.79607
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.18512 0.51421 0.360 0.72
# Sepal.Length 0.68647 0.08631 7.954 2.59e-10 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.3118 on 48 degrees of freedom
# Multiple R-squared: 0.5686, Adjusted R-squared: 0.5596
# F-statistic: 63.26 on 1 and 48 DF, p-value: 2.586e-10
#
# ---------------------------------------------------------
# iris$Species: virginica
#
# Call:
# lm(formula = Petal.Length ~ Sepal.Length, data = x)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.68603 -0.21104 0.06399 0.18901 0.66402
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.61047 0.41711 1.464 0.15
# Sepal.Length 0.75008 0.06303 11.901 6.3e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.2805 on 48 degrees of freedom
# Multiple R-squared: 0.7469, Adjusted R-squared: 0.7416
# F-statistic: 141.6 on 1 and 48 DF, p-value: 6.298e-16
Edit
To elaborate on my comment, we can extract the desired values very easily doing
by(iris, iris$Species, \(x) lm(Petal.Length ~ Sepal.Length, x)) |>
lapply(\(x) {
with(summary(x), c(r2=r.squared, f=fstatistic,
p=do.call(pf, c(as.list(unname(fstatistic)), lower.tail=FALSE))))
}) |> do.call(what=rbind)
# r2 f.value f.numdf f.dendf p
# setosa 0.07138289 3.689765 1 48 6.069778e-02
# versicolor 0.56858983 63.263024 1 48 2.586190e-10
# virginica 0.74688439 141.636664 1 48 6.297786e-16
CodePudding user response:
If you would like to pull out those values, we can use
library (dplyr)
df <- iris
list_res <- df %>%
base::split (., df$Species, drop = FALSE) %>%
lapply (., function (x) {
fit <- lm(Petal.Length ~ Sepal.Length, data = x) %>%
summary ()
r <- fit$r.squared
coeffs <- fit$coefficients %>%
as_tibble ()
f <- fit$fstatistic[[1]]
list_res <- list (r, coeffs, f)
names (list_res) <- c("R-Squared", "Coefficients", "F-Value")
return (list_res)
})
That returns a list of three objects for each regression model including the desired values. I've left the coefficients table here as it is, since it's always good to know to which independent variable your p-values belong. If you want those p-values pulled out separately, we can use coeffs <- fit$coefficients [,4] %>% as.list ()
, for instance.