Plotting mean and standard deviation for every numeric value in data frame in R

I want to plot every numeric column with the mean as a bar and the standard deviation as a line through the bar. How can I do this for the iris dataset?

I'm trying to transform my dataset to make it easy to plot in ggplot2.

What I've tried

iris %>%
  dplyr::select_if(is.numeric) %>%
  dplyr::summarise(avg_sepal_length = mean(Sepal.Length),
                  avg_sepal_width = mean(Sepal.Width),
                  avg_petal_length = mean(Petal.Length),
                  avg_petal_width = mean(Petal.Width),
                  sd_sepal_length = sd(Sepal.Length),
                  sd_sepal_width = sd(Sepal.Width),
                  sd_petal_length = sd(Petal.Length),
                  sd_petal_width = sd(Petal.Width))

I want to pivot this into two value columns, so the data frame looks like this:

stat            mean            sd
sepal_length    5.843333        0.8280661
sepal_width     3.057333        0.4358663
petal_length    3.758           1.765298
petal_width     1.199333        0.7622377

Then I want to plot the mean as a bar, with a line running from the lower bound (mean - sd) to the upper bound (mean + sd), in ggplot2.

CodePudding user response：

To get the result you want, first simplify the summary with dplyr::across. Then convert to long format with pivot_longer; using the special ".value" name in names_to puts the means and the standard deviations in their own columns. Finally, build the plot by combining, for example, geom_col and geom_pointrange:

library(dplyr)
library(tidyr)
library(ggplot2)

iris_sum <- iris %>%
  summarise(across(where(is.numeric), .fns = list(avg = mean, sd = sd), .names = "{.fn}_{.col}")) |> 
  pivot_longer(everything(), names_to = c(".value", "name"), names_sep = "_") |> 
  mutate(name = gsub("\\.", '_', tolower(name)))

iris_sum
#> # A tibble: 4 × 3
#>   name           avg    sd
#>   <chr>        <dbl> <dbl>
#> 1 sepal_length  5.84 0.828
#> 2 sepal_width   3.06 0.436
#> 3 petal_length  3.76 1.77 
#> 4 petal_width   1.20 0.762

ggplot(iris_sum, aes(name, avg)) +
  geom_col() +
  geom_pointrange(aes(ymin = avg - sd, ymax = avg + sd))
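
Since the question asks for the bounds drawn as a line through the bar, a variation on the plot above could use geom_errorbar instead of geom_pointrange. This is only a sketch: the summary table is rebuilt here with base R so the snippet runs on its own, using the same column names as iris_sum above, and the width value and the names num_cols and p are arbitrary choices of mine.

```r
library(ggplot2)

# Rebuild the summary table with base R (same columns as iris_sum above)
num_cols <- iris[sapply(iris, is.numeric)]
iris_sum <- data.frame(
  name = gsub("\\.", "_", tolower(names(num_cols))),
  avg  = unname(sapply(num_cols, mean)),
  sd   = unname(sapply(num_cols, sd))
)

# Bar for the mean, error bar from (mean - sd) to (mean + sd)
p <- ggplot(iris_sum, aes(name, avg)) +
  geom_col() +
  geom_errorbar(aes(ymin = avg - sd, ymax = avg + sd), width = 0.2)
p
```

geom_errorbar draws a vertical line with small caps at both ends, whereas geom_pointrange draws the line plus a point at the mean; which one reads better is a matter of taste.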

CodePudding user response：

Your target format is not the best format for ggplot2, which generally prefers the data even longer:


library(tidyr); library(dplyr)

iris %>%
  summarise(
    across(
      where(is.double),
      list(mean = mean, sd = sd)
    )
  ) |>
  pivot_longer(
    everything(),
    names_sep = "_",
    names_to = c("feature", "stat")
  )


# A tibble: 8 x 3
#   feature      stat  value
#   <chr>        <chr> <dbl>
# 1 Sepal.Length mean  5.84
# 2 Sepal.Length sd    0.828
# 3 Sepal.Width  mean  3.06
# 4 Sepal.Width  sd    0.436
# 5 Petal.Length mean  3.76
# 6 Petal.Length sd    1.77
# 7 Petal.Width  mean  1.20
# 8 Petal.Width  sd    0.762
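
A layer that draws a bar together with its range needs the mean and the sd in the same row, so one possible way to plot from this longer table is to spread stat back out with pivot_wider first. This is a sketch under that assumption, not necessarily what the answer had in mind; the names iris_long and iris_wide are mine.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Long table as above: one row per feature/stat pair
iris_long <- iris %>%
  summarise(across(where(is.double), list(mean = mean, sd = sd))) %>%
  pivot_longer(everything(), names_sep = "_", names_to = c("feature", "stat"))

# Put mean and sd side by side again, one row per feature
iris_wide <- iris_long %>%
  pivot_wider(names_from = stat, values_from = value)

# Inside aes(), mean and sd refer to the columns of iris_wide
ggplot(iris_wide, aes(feature, mean)) +
  geom_col() +
  geom_pointrange(aes(ymin = mean - sd, ymax = mean + sd))
```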

As you are familiar with the iris dataset, it is worth checking out the

CodePudding user response：

Note that you don't actually need to pre-process the df to calculate the summary values, you can use ggplot2's stat_summary directly:

library(ggplot2)

ggplot(stack(iris), aes(x = ind, y = values)) +
  stat_summary(geom = "bar", fun = mean) +
  stat_summary(
    fun = mean,
    fun.min = function(x) mean(x) - sd(x),
    fun.max = function(x) mean(x) + sd(x))

Here I've used base R's simple stack function to make a long version of the iris dataset; you can use whatever libraries you prefer (especially if you want to include other manipulations).
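
To make the shape of the stacked data concrete, here is a small sketch of what stack produces. As far as I know, stack only keeps vector columns, so the Species factor is ignored and calling it on the full iris data frame emits a warning about non-vector columns; selecting the numeric columns first, as below, should avoid that. The name long_iris is mine.

```r
# Keep only the numeric columns, then stack them into two columns:
#   values = the measurements, ind = a factor naming the source column
long_iris <- stack(iris[sapply(iris, is.numeric)])

dim(long_iris)          # 4 columns x 150 rows stacked into 600 rows, 2 columns
levels(long_iris$ind)   # original column names, in their original order

# The per-column means that stat_summary computes can be checked by hand
tapply(long_iris$values, long_iris$ind, mean)
```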