Home > Software design >  How to create 1 forest plot for multi- and univariable (factor variables) Cox regression results
How to create 1 forest plot for multi- and univariable (factor variables) Cox regression results

Time:08-05

I have browsed multiple similar questions here but always get stuck by different problems, probably because I am not good at R. I need to create a forest plot with several variables made of various levels of factors that includes its uni- and multivariable HR estimates. The HR lines of each factor level should be on top of each other and grouped by the main variables.

I have tried the code below, but I would like to have the main variable names on the left of the plot, next or above each factor (Not on the right as from the code). Also, the factor levels need to be ordered specifically, for example: General childhood condition levels: Poor, Fair, Good, Very good, Excellent. Any help would be very much appreciated!

Here is a snippet of my dataset & ggplot script:

df <- data.frame(
  label = c("Low", "Middle", "High", "Good", "Fair", "Poor", "Very  good", "Excellent", "Low", "Middle", "High", "Good", "Fair", "Poor", "Very good", "Excellent"), 
  var_label = c("Education", "Education", "Education", "General health", "General health", "General health", "General health", "General health", "Education", "Education", "Education", "General health", "General health", "General health", "General health", "General health"),
  reference = c("TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE"), 
  estimate = c(1.00, 1.34, 1.67, 1.00, 0.82, 1.11, 0.85, 1.48, 1.00, 2.67, 3.49, 0.83, 1.00, 0.95, 1.34),
  std.error = c(NA, 0.16, 0.20, NA, 0.16, 0.24, 0.18, 0.24, NA, 0.14, 0.18, NA, 0.16, 0.23, 0.18, 0.24),
  conf.low = c(NA, 0.95, 1.10, NA, 0.60, 0.69, 0.60, 0.92, NA, 2.01, 2.45, NA, 0.61, 0.63, 0.66, 0.83),
  conf.high = c(NA, 1.87, 2.52, NA, 1.11, 1.80, 1.21, 2.39, NA, 3.55, 4.98, NA, 1.13, 1.58, 1.35, 2.16),
  p.value = c(NA, 0.09, 0.02, NA, 0.20, 0.67, 0.38, 0.11, NA, 0.00, 0.00, NA, 0.24, 1.00, 0.76, 0.23),
  model = c("m", "m", "m", "m", "m", "m", "m", "m", "u", "u", "u", "u", "u", "u", "u", "u")
)
dotCOLS = c("#a6d8f0","#f9b282")   
barCOLS = c("#008fd5","#de6b35")
forest <- ggplot(df, aes(x = estimate, xmax= conf.high, xmin = conf.low, y = label, color = model, fill = model))  
  geom_point(size = 3, shape = 18, position = position_dodge(width = 0.5))  
  geom_linerange(position = position_dodge(width = 0.5), size = 1)  
  geom_vline(xintercept = 1, size = 1)  
  geom_hline(yintercept = 0, size = 1)  
  facet_grid(var_label~., scales = "free_y", space = "free_y")  
  scale_alpha_identity()  
  scale_fill_manual(values=barCOLS) 
  scale_color_manual(values=dotCOLS) 
  scale_y_discrete(name="Variable groups")  
  scale_x_continuous(name="Hazard ratio", limits = c(0, 5))  
  geom_text(aes(label = var_label), x = Inf, y = Inf, hjust = 1, vjust = 4.5, check_overlap = TRUE, color = "darkgreen")  
  theme(panel.background = element_blank(), 
        panel.spacing = unit(0, "pt"),
        axis.line.x.bottom = element_line(size = 1), 
        strip.background = element_blank(), strip.text = element_blank())

And here is a picture of the result: forestplot with multi- and univariable HR estimates

CodePudding user response:

As you want the labels on the left use x=-Inf in geom_text. Also, instead of shifting via vjust you could IMHO simply use vjust=1 to align at the top. Additionally use coord_cartesian(clip = "off") to prevent the labels from being clipped off and make some room for the labels by adding some margin to the y axis text. Finally, if you want your labels in a specific order than convert to a factor with the levels set in your desired order.

Note: I fixed your data which had only 15 values for the estimate variable while all other columns had a length of 16. Additionally I removed the geom_hline which IMHO is not necessary (at least for your example data).

dotCOLS <- c("#a6d8f0", "#f9b282")
barCOLS <- c("#008fd5", "#de6b35")

library(ggplot2)

df$label <- factor(df$label, levels = c("Low", "Middle", "High", "Good", "Fair", "Poor", "Very  good", 
                                        "Excellent", "Very good"))

ggplot(df, aes(x = estimate, xmax = conf.high, xmin = conf.low, y = label, color = model, fill = model))  
  geom_point(size = 3, shape = 18, position = position_dodge(width = 0.5))  
  geom_linerange(position = position_dodge(width = 0.5), size = 1)  
  geom_vline(xintercept = 1, size = 1)  
  #geom_hline(yintercept = 0, size = 1)  
  facet_grid(var_label ~ ., scales = "free_y", space = "free_y")  
  scale_alpha_identity()  
  scale_fill_manual(values = barCOLS)  
  scale_color_manual(values = dotCOLS)  
  scale_y_discrete(name = "Variable groups")  
  scale_x_continuous(name = "Hazard ratio", limits = c(0, 5))  
  geom_text(aes(label = var_label), x = -Inf, y = Inf, hjust = 1, vjust = 1, check_overlap = TRUE, color = "darkgreen")  
  coord_cartesian(clip = "off")  
  theme(
    panel.background = element_blank(),
    panel.spacing = unit(0, "pt"),
    axis.line.x.bottom = element_line(size = 1),
    axis.text.y.left =  element_text(margin = margin(l = 20, unit = "pt")),
    strip.background = element_blank(), 
    strip.text = element_blank()
  )
#> Warning: Removed 2 rows containing missing values (geom_segment).
#> Removed 2 rows containing missing values (geom_segment).

DATA

df <- data.frame(
  label = c("Low", "Middle", "High", "Good", "Fair", "Poor", "Very  good", "Excellent", "Low", "Middle", "High", "Good", "Fair", "Poor", "Very good", "Excellent"),
  var_label = c("Education", "Education", "Education", "General health", "General health", "General health", "General health", "General health", "Education", "Education", "Education", "General health", "General health", "General health", "General health", "General health"),
  reference = c("TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE", "FALSE", "FALSE"),
  estimate = c(1.00, 1.34, 1.67, 1.00, 0.82, 1.11, 0.85, 1.48, 1.00, 2.67, 3.49, 0.83, 1.00, 0.95, 1.34, 1),
  std.error = c(NA, 0.16, 0.20, NA, 0.16, 0.24, 0.18, 0.24, NA, 0.14, 0.18, NA, 0.16, 0.23, 0.18, 0.24),
  conf.low = c(NA, 0.95, 1.10, NA, 0.60, 0.69, 0.60, 0.92, NA, 2.01, 2.45, NA, 0.61, 0.63, 0.66, 0.83),
  conf.high = c(NA, 1.87, 2.52, NA, 1.11, 1.80, 1.21, 2.39, NA, 3.55, 4.98, NA, 1.13, 1.58, 1.35, 2.16),
  p.value = c(NA, 0.09, 0.02, NA, 0.20, 0.67, 0.38, 0.11, NA, 0.00, 0.00, NA, 0.24, 1.00, 0.76, 0.23),
  model = c("m", "m", "m", "m", "m", "m", "m", "m", "u", "u", "u", "u", "u", "u", "u", "u")
)
  • Related