The data frame consists of two factor variables: cls with 3 levels and subset with 2 levels. I want to compare how much of each class (cls) there is in both groups of subset. I want to show percentages on the y-axis. They should be computed within each subset group, not across the whole dataset.
library(tidyverse)
data = data.frame(
  x = rnorm(1000),
  cls = factor(c(rep("A", 200), rep("B", 300), rep("C", 500))),
  subset = factor(c(rep("train", 900), rep("test", 100)))
)
This was my attempt to show percentages, but it failed because they are computed within the whole dataset instead of within each subset group:
ggplot(data, aes(x = cls, fill = cls)) +
  geom_bar(aes(y = ..count.. / sum(..count..))) +
  facet_wrap(~subset)
How can I fix it?
Edit related to the accepted answer:
plot_train_vs_test = function(data, var, subset_colname){
  plot_data = data %>%
    count(var, eval(subset_colname)) %>%
    group_by(eval(subset_colname)) %>%
    mutate(perc = n/sum(n))
  ggplot(plot_data, aes(x = var, y = perc, fill = var)) +
    geom_col() +
    scale_y_continuous(labels = scales::label_percent()) +
    facet_wrap(~eval(subset_colname))
}
plot_train_vs_test(data, "cls", "subset")
This results in errors.
Answer:
One option and easy fix would be to compute the percentages outside of ggplot and plot the summarized data:
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)
set.seed(123)
data <- data.frame(
  x = rnorm(1000),
  cls = factor(c(rep("A", 200), rep("B", 300), rep("C", 500))),
  subset = factor(c(rep("train", 900), rep("test", 100)))
)
# Count each cls/subset combination, then convert to shares within each subset
data_sum <- data %>%
  count(cls, subset) %>%
  group_by(subset) %>%
  mutate(pct = n / sum(n))
ggplot(data_sum, aes(x = cls, y = pct, fill = cls)) +
  geom_col() +
  scale_y_continuous(labels = scales::label_percent()) +
  facet_wrap(~subset)
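As an alternative to pre-summarizing, the percentages can also be computed inside ggplot. The following is only a sketch and assumes ggplot2 3.3.0 or later (for after_stat()). It relies on the PANEL column that ggplot2 adds to the computed layer data to mark which facet each bar belongs to, so the counts are divided by the per-facet total rather than the overall total:

```r
library(ggplot2)

set.seed(123)
data <- data.frame(
  x = rnorm(1000),
  cls = factor(c(rep("A", 200), rep("B", 300), rep("C", 500))),
  subset = factor(c(rep("train", 900), rep("test", 100)))
)

# ave(count, PANEL, FUN = sum) returns, for every bar, the total count of
# the facet it sits in, so the ratio is a within-facet proportion.
p <- ggplot(data, aes(x = cls, fill = cls)) +
  geom_bar(aes(y = after_stat(count / ave(count, PANEL, FUN = sum)))) +
  scale_y_continuous(labels = scales::label_percent()) +
  facet_wrap(~subset)
p
```

This keeps everything in one ggplot call, but it depends on an internal column name, so the pre-summarized version above is arguably easier to read and check.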
EDIT: One approach to putting the code in a function may look like this:
plot_train_vs_test <- function(.data, x, facet) {
  # {{ }} forwards the unquoted column names to dplyr and ggplot2
  .data_sum <- .data %>%
    count({{ x }}, {{ facet }}) %>%
    group_by({{ facet }}) %>%
    mutate(pct = n / sum(n))
  ggplot(.data_sum, aes(x = {{ x }}, y = pct, fill = {{ x }})) +
    geom_col() +
    scale_y_continuous(labels = scales::label_percent()) +
    facet_wrap(vars({{ facet }}))
}
plot_train_vs_test(data, cls, subset)
For more details, especially on the {{ operator, see Programming with dplyr, Programming with ggplot2 and Best practices for programming with ggplot2.
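If the column names have to be passed as strings, as in the call from the question, one possible variant is sketched below. The function name plot_train_vs_test_chr and its argument names are made up for illustration. It assumes dplyr 1.0.0 or later for across() and uses the .data pronoun to look up a column by its string name inside aes():

```r
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)

set.seed(123)
data <- data.frame(
  x = rnorm(1000),
  cls = factor(c(rep("A", 200), rep("B", 300), rep("C", 500))),
  subset = factor(c(rep("train", 900), rep("test", 100)))
)

# Hypothetical helper: `x` and `facet` are column names given as strings.
plot_train_vs_test_chr <- function(df, x, facet) {
  df_sum <- df %>%
    # all_of() selects the columns named by the character vectors
    count(across(all_of(c(x, facet)))) %>%
    group_by(across(all_of(facet))) %>%
    mutate(pct = n / sum(n)) %>%
    ungroup()

  ggplot(df_sum, aes(x = .data[[x]], y = pct, fill = .data[[x]])) +
    geom_col() +
    scale_y_continuous(labels = scales::label_percent()) +
    labs(x = x, fill = x) +
    # facet_wrap() also accepts a character vector of column names
    facet_wrap(facet)
}

p_chr <- plot_train_vs_test_chr(data, "cls", "subset")
p_chr
```

The original attempt fails because eval("cls") simply returns the string "cls" rather than the column, so count() and aes() never see the actual variable.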