I want to plot every numeric column with the mean as a bar and the standard deviation as a line through the bar. How can I do this for the iris dataset?
I'm trying to transform my dataset to make it easy to plot in ggplot2.
What I've tried
iris %>%
  dplyr::select_if(is.numeric) %>%
  dplyr::summarise(avg_sepal_length = mean(Sepal.Length),
                   avg_sepal_width = mean(Sepal.Width),
                   avg_petal_length = mean(Petal.Length),
                   avg_petal_width = mean(Petal.Width),
                   sd_sepal_length = sd(Sepal.Length),
                   sd_sepal_width = sd(Sepal.Width),
                   sd_petal_length = sd(Petal.Length),
                   sd_petal_width = sd(Petal.Width))
I want to pivot into two columns so the dataframe will look like so:
stat          mean      sd
sepal_length  5.843333  0.8280661
sepal_width   3.057333  0.4358663
petal_length  3.758     1.765298
petal_width   1.199333  0.7622377
And then plot the mean as a bar, and the sd as a line running from its lower bound to its upper bound, in ggplot2.
CodePudding user response:
To achieve your desired result you could first simplify your code using dplyr::across. Afterwards you could convert to long format via pivot_longer, where using the special ".value" name puts the means and the sds in their own columns. Finally you could make your plot as a combination of e.g. geom_col and geom_pointrange:
library(dplyr)
library(tidyr)
library(ggplot2)
iris_sum <- iris %>%
  summarise(across(where(is.numeric), .fns = list(avg = mean, sd = sd), .names = "{.fn}_{.col}")) |>
  pivot_longer(everything(), names_to = c(".value", "name"), names_sep = "_") |>
  mutate(name = gsub("\\.", "_", tolower(name)))
iris_sum
#> # A tibble: 4 × 3
#> name avg sd
#> <chr> <dbl> <dbl>
#> 1 sepal_length 5.84 0.828
#> 2 sepal_width 3.06 0.436
#> 3 petal_length 3.76 1.77
#> 4 petal_width 1.20 0.762
ggplot(iris_sum, aes(name, avg)) +
  geom_col() +
  geom_pointrange(aes(ymin = avg - sd, ymax = avg + sd))
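If a plain line through the bar is preferred over a point with a range, geom_errorbar is one possible alternative. The following is only a sketch, not part of the answer above: it rebuilds the same summary table with base R so that it runs on its own, and the width of 0.2 is an arbitrary choice.

```r
library(ggplot2)

# Keep only the numeric columns of iris (drops Species).
num <- iris[vapply(iris, is.numeric, logical(1))]

# One row per column: lower-case name, mean and standard deviation.
iris_sum <- data.frame(
  name = gsub("\\.", "_", tolower(names(num))),
  avg  = vapply(num, mean, numeric(1)),
  sd   = vapply(num, sd, numeric(1)),
  row.names = NULL
)

# Bars for the means, error bars spanning mean - sd to mean + sd.
p <- ggplot(iris_sum, aes(name, avg)) +
  geom_col() +
  geom_errorbar(aes(ymin = avg - sd, ymax = avg + sd), width = 0.2)
p
```

Setting width = 0 should reduce the error bar to a bare vertical line with no caps.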
CodePudding user response:
Your output format is not the best format for ggplot2, which prefers it even longer:
library(tidyr); library(dplyr)
iris %>%
  summarise(
    across(
      where(is.double),
      list(mean = mean, sd = sd)
    )
  ) |>
  pivot_longer(
    everything(),
    names_sep = "_",
    names_to = c("feature", "stat")
  )
# A tibble: 8 x 3
# feature stat value
# <chr> <chr> <dbl>
# 1 Sepal.Length mean 5.84
# 2 Sepal.Length sd 0.828
# 3 Sepal.Width mean 3.06
# 4 Sepal.Width sd 0.436
# 5 Petal.Length mean 3.76
# 6 Petal.Length sd 1.77
# 7 Petal.Width mean 1.20
# 8 Petal.Width sd 0.762
As you are familiar with the iris dataset, it is worth checking out the
CodePudding user response:
Note that you don't actually need to pre-process the data frame to calculate the summary values; you can use ggplot2's stat_summary directly:
library(ggplot2)
ggplot(stack(iris), aes(x = ind, y = values)) +
  stat_summary(geom = "bar", fun = mean) +
  stat_summary(
    fun = mean,
    fun.min = function(x) mean(x) - sd(x),
    fun.max = function(x) mean(x) + sd(x))
Here I've used base R's simple stack function to make a long version of the iris dataset; you can use whatever libraries you prefer (especially if you want to include other manipulations).
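As a possible variation on the same idea (a sketch, not the code from the answer above), stat_summary also accepts a single fun.data function that returns y, ymin and ymax together, which keeps the mean and sd logic in one place. The helper name mean_sd_range is made up for this example, and the errorbar geom and its width are arbitrary choices.

```r
library(ggplot2)

# Hypothetical helper (not part of ggplot2): returns the mean and mean +/- 1 sd
# in the one-row data frame layout that stat_summary(fun.data = ...) expects.
mean_sd_range <- function(x) {
  m <- mean(x)
  s <- sd(x)
  data.frame(y = m, ymin = m - s, ymax = m + s)
}

# stack() keeps the four numeric columns and warns that Species is ignored.
p <- ggplot(stack(iris), aes(x = ind, y = values)) +
  stat_summary(geom = "bar", fun = mean) +
  stat_summary(fun.data = mean_sd_range, geom = "errorbar", width = 0.2)
p
```

The helper can be checked on its own, for example mean_sd_range(c(1, 2, 3)) gives y = 2, ymin = 1 and ymax = 3.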