Trying to graph different linear regression models with ggplot and equation labels-CodePudding

Here is the dput for the dataset I'm working with:

structure(list(X = c(18, 19, 20, 17, 8, 15, 14, 16, 18, 14, 16, 
13, 16, 17, 10, 18, 19, 25, 18, 13, 18, 16, 11, 17, 15, 18, 19, 
16, 20, 17, 8, 18, 15, 14, 18, 14, 16, 13, 16), Y = c(15, 13, 
14, 22, 2, 11, 15, 11, 20, 17, 20, 17, 20, 14, 21, 10, 13, 16, 
12, 11, 13, 10, 4, 16, 18, 15, 10, 13, 14, 17, 2, 11, 15, 11, 
20, 17, 14, 7, 16), Z = c(32, 42, 37, 34, 32, 39, 44, 49, 36, 
31, 36, 37, 37, 45, 46, 48, 36, 42, 36, 25, 36, 39, 26, 32, 33, 
38, 33, 44, 46, 34, 32, 39, 44, 49, 36, 31, 36, 37, 37)), class = "data.frame", row.names = c(NA, 
-39L))

So before when I ran linear regressions in R, I would use a simple code like this:

library(ggplot2)
library(ggpmisc)    

ggplot(data = hw_data,
           aes(x=X,
               y=Y)) 
      geom_point() 
      geom_smooth(method = lm) 
      stat_poly_eq(formula = simple_model,
                   aes(label = paste(..eq.label.., ..rr.label..,
                                     sep = "~~~~~~")))

This would give me a graph with this kind of equation label:

However, I'm trying to run a multiple linear regression and a hierarchal regression, but when I try to add in a third variable, I'm not entirely sure how to 1) get the third variable (Z) to graph as a regression line and 2) get an equation that fits the model on the graph. What I'm looking for is something like this:

The two models I need to graph are (Y ~ X Z) and (Y ~ X * Z). The best I've come up with so far is this:

# One predictor model:
hw_regression_simple <- lm(Y ~ X,
                    data = hw_data)
# Two predictor model:
hw_regression_two_factors <- lm(Y ~ X   Z,
                                hw_data)
# Interaction model:
hw_regression_interaction <- lm(Y ~ X * Z,
                                hw_data)
# Comparison of models:
summary(hw_regression_simple)
summary(hw_regression_two_factors)
summary(hw_regression_interaction)

model <- Y ~ X * Z

ggplot(data = hw_data,
       aes(x=X,
           y=Y,
           color=Z)) 
  geom_point() 
  geom_smooth(method = lm) 
  labs(title = "X, Y, and Z Interactions") 
  stat_poly_eq()

Which gives me this graph with an R-squared and some coloration of the scatterplot, but otherwise doesn't give as much info as I'd like. How can I fix this to look more like the models I'm looking for?

CodePudding user response：

If you are regressing Y on both X and Z, and these are both numerical variables (as they are in your example) then a simple linear regression represents a 2D plane in 3D space, not a line in 2D space. Adding an interaction term means that your regression represents a curved surface in a 3D space. This can be difficult to represent in a simple plot, though there are some ways to do it : the colored lines in the smoking / cycling example you show are slices through the regression plane at various (aribtrary) values of the Z variable, which is a reasonable way to display this type of model.

Although ggplot has some great shortcuts for plotting simple models, I find people often tie themselves in knots because they try to do all their modelling inside ggplot. The best thing to do when you have a more complex model to plot is work out what exactly you want to plot using the right tools for the job, then plot it with ggplot.

For example, if you make a prediction data frame for your interaction model:

model2 <- lm(Y ~ X * Z, data = hw_data)

predictions <- expand.grid(X = seq(min(hw_data$X), max(hw_data$X), length.out = 5),
                           Z = seq(min(hw_data$Z), max(hw_data$Z), length.out = 5))

predictions$Y <- predict(model2, newdata = predictions)

Then you can plot your interaction model very simply:

ggplot(hw_data, aes(X, Y))   
  geom_point()  
  geom_line(data = predictions, aes(color = factor(Z)))  
  labs(color = "Z")

You can easily work out the formula from the coefficients table and stick it together with paste:

labs <- trimws(format(coef(model2), digits = 2))
form <- paste("Y =", labs[1], " ", labs[2], "* x  ", 
              labs[3], "* Z   (", labs[4], " * X * Z)")
form
#> [1] "Y = -69.07   5.58 * x   2.00 * Z   ( -0.13  * X * Z)"

This can be added as an annotation to your plot using geom_text or annotation

Update

A complete solution if you wanted to have only 3 levels for Z, effectively "high", "medium" and "low", you could do something like:

library(ggplot2)

model2 <- lm(Y ~ X * Z, data = hw_data)

predictions <- expand.grid(X = quantile(hw_data$X, c(0, 0.5, 1)),
                           Z = quantile(hw_data$Z, c(0.1, 0.5, 0.9)))

predictions$Y <- predict(model2, newdata = predictions)

labs <- trimws(format(coef(model2), digits = 2))
form <- paste("Y =", labs[1], " ", labs[2], "* x  ", 
              labs[3], "* Z   (", labs[4], " * X * Z)")

form <- paste(form, "      R\u00B2 =",  
              format(summary(model2)$r.squared, digits = 2))

ggplot(hw_data, aes(X, Y))   
  geom_point()  
  geom_line(data = predictions, aes(color = factor(Z)))  
  geom_text(x = 15, y = 25, label = form, check_overlap = TRUE,
            fontface = "italic")  
  labs(color = "Z")