Plot normal distributions when only the mean and standard deviation are available in ggplot2

Time:11-16

First of all, this is my data:

x <- structure(list(FIGURE = c("4", "4", "4", "4", "XXXIIIc", "XXXIIIc", 
"XXXIIIc", "XXXVI", "XXXVI", "XXXVI", "XLIV", "XLIV", "XLIV", 
"3", "3", "88", "88", "19", "19", "19", "19", "19", "18a", "18a", 
"18a", "18b", "18c", "18c", "18c", "18c", "18c", "55", "Ensemble IX", 
"2", "25", "25", "26", "26", "29", "54", "54", "130", "130"), 
    media_fecha = c(16382.5, 14759, 13776, 15941, 18189.5, 17546.5, 
    17422, 18084, 16898.5, 17987, 16422.5, 15729.5, 17615.5, 
    16016, 15864.5, 14643, 14132, 17146.5, 16619.5, 16424, 16317, 
    16328.5, 16317, 15679, 14834.5, 15467, 13197.5, 12691.5, 
    12683.5, 12384, 14545, 16053.5, 16669.5, 15234.5, 17036.5, 
    16608, 17428, 17439.5, 26785.5, 16813, 16651.5, 15432, 14655
    ), sd_fecha = c(266.25, 235.5, 131.5, 327.5, 210.25, 267.75, 
    267, 262, 239.25, 304, 283.75, 189.75, 266.75, 174, 294.75, 
    278.5, 331, 266.75, 204.75, 215.5, 196, 285.25, 196, 288.5, 
    250.25, 277, 63.75, 101.75, 100.75, 168, 337.5, 136.25, 212.75, 
    183.25, 199.25, 206.5, 193, 264.75, 312.25, 235.5, 299.25, 
    238.5, 312.5)), row.names = c(NA, -43L), class = c("tbl_df", 
"tbl", "data.frame"))

This dataset is composed of 3 columns:

  • FIGURE: categorical variable
  • media_fecha: mean value for that element/row of the dataset.
  • sd_fecha: standard deviation for that element/row of the dataset.

The category XLIV in FIGURE is composed of three rows or elements.

I would like to plot this with ggplot2.

  • For example, draw the three normal distributions of XLIV from their corresponding means and standard deviations, and highlight the area where they overlap.

How could I do it?

UPDATE

I would like to color the area where all the density functions overlap, restricted to each curve's 2-sigma limits (mean ± 2 sd). Something similar to this example of XXXIIIc:

[image: example plot for figure XXXIIIc]

Answer:

You could group your data row-wise, summarize each row into 500 x and y values running from, say, 4 sd below to 4 sd above the means, then plot with plain old geom_line.
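As a rough illustration of that first step, here is a minimal base-R sketch that builds the same kind of long table of x and y values. The helper name norm_grid and its arguments n and k are hypothetical, introduced only for this example; the means and standard deviations are the three XLIV rows from the question.

```r
# Hypothetical helper (not part of the answer's function): build a long
# data frame of density values for several normal curves on a shared x grid.
norm_grid <- function(means, sds, n = 500, k = 4) {
  stopifnot(length(means) == length(sds), all(sds > 0))
  # Shared x range: k sd below the lowest limit to k sd above the highest.
  xs <- seq(min(means - k * sds), max(means + k * sds), length.out = n)
  # One block of n rows per curve, stacked on top of each other.
  do.call(rbind, lapply(seq_along(means), function(i) {
    data.frame(group = i, x = xs, y = dnorm(xs, means[i], sds[i]))
  }))
}

# The three XLIV rows from the question's data.
grid <- norm_grid(means = c(16422.5, 15729.5, 17615.5),
                  sds   = c(283.75, 189.75, 266.75))
```

The resulting grid has one row per curve and x value, which is the shape geom_line(aes(group = group)) expects.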

If you want to highlight the region where all the curves overlap (within 2 sd of each mean), find the x values that fall inside the 2 sd range of every curve, then, at each of those x values, mark the highest y value so that the area beneath it can be filled.
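The shared 2 sd range can also be computed directly, which is a quick way to check whether a figure has any region to fill at all. This is only a sketch under the assumption that "within 2 sd of each mean" means the intersection of the intervals mean ± 2 sd; the helper name overlap_2sd is hypothetical.

```r
# Hypothetical helper: the interval in which every curve is within k sd of
# its own mean, or NULL when the intervals do not all intersect.
overlap_2sd <- function(means, sds, k = 2) {
  lower <- max(means - k * sds)  # rightmost of the lower limits
  upper <- min(means + k * sds)  # leftmost of the upper limits
  if (lower >= upper) return(NULL)
  c(lower = lower, upper = upper)
}

# XXXIIIc rows from the question: the three intervals share 17769 to 17956.
xxxiiic <- overlap_2sd(c(18189.5, 17546.5, 17422), c(210.25, 267.75, 267))

# XLIV rows from the question: the intervals do not all intersect.
xliv <- overlap_2sd(c(16422.5, 15729.5, 17615.5), c(283.75, 189.75, 266.75))
```

For XLIV the result is NULL, so with this definition there is no x value within 2 sd of all three means and nothing would be highlighted for that figure.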

We can wrap all this in a function for ease of use and consistency:

library(tidyverse)

plot_norms <- function(data, fig, ...) {
  data %>%                                   # use the argument, not the global `x`
    filter(FIGURE == fig) %>%
    # Shared x range: 4 sd below the lowest limit to 4 sd above the highest
    mutate(min_range = min(media_fecha - 4 * sd_fecha),
           max_range = max(media_fecha + 4 * sd_fecha)) %>%
    mutate(group = row_number()) %>%
    rowwise() %>%
    # 500 points per curve; on dplyr >= 1.1.0, reframe() is the recommended
    # replacement for a summarise() that returns several rows per group
    summarise(x = seq(min_range, max_range, length.out = 500),
              y = dnorm(x, media_fecha, sd_fecha),
              is_2d = x > media_fecha - 2 * sd_fecha &
                      x < media_fecha + 2 * sd_fecha,
              FIGURE = FIGURE, group = group) %>%
    group_by(x) %>%
    # At each x within 2 sd of every mean, flag the highest curve for filling
    mutate(fill_me = all(is_2d) & rank(y) == max(rank(y))) %>%
    ggplot(aes(x, y)) +
    geom_area(data = . %>% filter(fill_me), ..., position = 'identity') +
    geom_line(aes(group = group)) +
    ggtitle(paste("Normal distributions, figure", fig)) +
    theme_minimal(base_size = 16)
}

So now you can do:

plot_norms(x, "19", fill = 'lightblue', alpha = 0.5)

[image: plot output for figure 19]

plot_norms(x, "XXXIIIc", fill = 'red', alpha = 0.2)

[image: plot output for figure XXXIIIc]

Created on 2022-11-12 with reprex v2.0.2
