Plot normal distribution when only mean and standard deviation exist in ggplot2


First of all, this is my data:

x <- structure(list(FIGURE = c("4", "4", "4", "4", "XXXIIIc", "XXXIIIc", 
"3", "3", "88", "88", "19", "19", "19", "19", "19", "18a", "18a", 
"18a", "18b", "18c", "18c", "18c", "18c", "18c", "55", "Ensemble IX", 
"2", "25", "25", "26", "26", "29", "54", "54", "130", "130"), 
    media_fecha = c(16382.5, 14759, 13776, 15941, 18189.5, 17546.5, 
    17422, 18084, 16898.5, 17987, 16422.5, 15729.5, 17615.5, 
    16016, 15864.5, 14643, 14132, 17146.5, 16619.5, 16424, 16317, 
    16328.5, 16317, 15679, 14834.5, 15467, 13197.5, 12691.5, 
    12683.5, 12384, 14545, 16053.5, 16669.5, 15234.5, 17036.5, 
    16608, 17428, 17439.5, 26785.5, 16813, 16651.5, 15432, 14655
    ), sd_fecha = c(266.25, 235.5, 131.5, 327.5, 210.25, 267.75, 
    267, 262, 239.25, 304, 283.75, 189.75, 266.75, 174, 294.75, 
    278.5, 331, 266.75, 204.75, 215.5, 196, 285.25, 196, 288.5, 
    250.25, 277, 63.75, 101.75, 100.75, 168, 337.5, 136.25, 212.75, 
    183.25, 199.25, 206.5, 193, 264.75, 312.25, 235.5, 299.25, 
    238.5, 312.5)), row.names = c(NA, -43L), class = c("tbl_df", 
"tbl", "data.frame"))

This dataset is composed of 3 columns:

  • FIGURE: categorical variable
  • media_fecha: mean value for that element/row of the dataset.
  • sd_fecha: standard deviation for that element/row of the dataset.

The category XLIV in FIGURE is composed of three rows or elements.

I am interested in plotting this by using ggplot2.

  • For example, draw the three normal distributions of XLIV from their corresponding means and standard deviations, and highlight the area where they overlap.

How could I do it?


I would like to color the area where all the density functions overlap, taking into account each curve's 2 sigma limits, similar to the example for XXXIIIc (image not shown).

CodePudding user response:

You could group your data row-wise, summarize each row with 500 x and y values running from 4 sd below the lowest mean to 4 sd above the highest mean, then plot with plain old geom_line.
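As a rough illustration of that first step, here is a minimal base-R sketch that skips dplyr altogether. It assumes the two "XXXIIIc" rows from the question's data and uses 500 grid points, as the function further down does; the names means, sds, grid, dens and curves are made up for this example.

```r
# Means and standard deviations of the two "XXXIIIc" rows in the question's data
means <- c(18189.5, 17546.5)
sds   <- c(210.25, 267.75)

# Shared x grid: 4 sd below the lowest curve to 4 sd above the highest one
grid <- seq(min(means - 4 * sds), max(means + 4 * sds), length.out = 500)

# One column of densities per row of the data, all evaluated on the same grid
dens <- sapply(seq_along(means), function(i) dnorm(grid, means[i], sds[i]))

# Long format (one row per point per curve), the shape geom_line() expects
curves <- data.frame(
  group = rep(seq_along(means), each = length(grid)),
  x     = rep(grid, times = length(means)),
  y     = as.vector(dens)
)
head(curves)
```

Evaluating every curve on the same grid is what makes it possible to compare the curves point by point later on.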

If you want to highlight the areas where all the curves overlap (within 2 sd of each mean), you can find the x values that lie within that range for all curves, then mark the highest y values where this is true for filling.
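The range test can be sketched on its own, again assuming the two "XXXIIIc" rows and illustrative variable names. The region where every curve is within 2 sd of its mean is the intersection of the individual intervals, so it can be found either directly from the limits or point by point on a grid.

```r
means <- c(18189.5, 17546.5)
sds   <- c(210.25, 267.75)

# Intersection of the intervals (mean - 2 sd, mean + 2 sd)
lower <- max(means - 2 * sds)   # largest lower limit
upper <- min(means + 2 * sds)   # smallest upper limit
has_overlap <- lower < upper    # FALSE would mean there is nothing to fill

# The same test point by point, which is how a grid-based approach sees it
grid <- seq(min(means - 4 * sds), max(means + 4 * sds), length.out = 500)
in_all <- sapply(grid, function(x) {
  all(x > means - 2 * sds & x < means + 2 * sds)
})

# x values that would be shaded
range(grid[in_all])
```

For these two rows the limits work out to 17769 and 18082, so a common region exists. When has_overlap is FALSE, no grid point passes the test and no area is drawn.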

We can wrap all this in a function for ease of use and consistency:


library(dplyr)
library(ggplot2)

plot_norms <- function(data, fig, ...) {
  data %>%
    filter(FIGURE == fig) %>%
    # common x range: 4 sd beyond the most extreme curves
    mutate(min_range = min(media_fecha - 4 * sd_fecha),
           max_range = max(media_fecha + 4 * sd_fecha)) %>%
    mutate(group = row_number()) %>%
    rowwise() %>%
    # 500 points per curve (dplyr >= 1.1.0 prefers reframe() for multi-row results)
    summarise(x = seq(min_range, max_range, length.out = 500),
              y = dnorm(x, media_fecha, sd_fecha),
              is_2d = x > media_fecha - 2 * sd_fecha &
                      x < media_fecha + 2 * sd_fecha,
              FIGURE = FIGURE, group = group) %>%
    group_by(x) %>%
    # fill where x lies within 2 sd of every mean, up to the highest curve at that x
    mutate(fill_me = all(is_2d) & rank(y) == max(rank(y))) %>%
    ggplot(aes(x, y)) +
    geom_area(data = . %>% filter(fill_me), ..., position = 'identity') +
    geom_line(aes(group = group)) +
    ggtitle(paste("Normal distributions, figure", fig)) +
    theme_minimal(base_size = 16)
}

So now you can do:

plot_norms(x, "19", fill = 'lightblue', alpha = 0.5)


plot_norms(x, "XXXIIIc", fill = 'red', alpha = 0.2)


Created on 2022-11-12 with reprex v2.0.2
