Cumulative percentage in a barplot with multiple value per group (ggplot2)

Time:08-13

I would like to add a cumulative percent of each group to the following barplot:

df %>% mutate(Taxa = fct_relevel(Taxa, 
            "Cnidaria ", "Echinodermata", "Mollusca", 
            "Bryozoa", "Porifera", "Annelida", 
            "Foraminifera", "Platyhelminthes", "Brachiopoda")) %>% ggplot(aes(x=Taxa,y=Observations,fill = Method)) +
  geom_bar(stat = "identity", width = 0.8 ,size = 1, position=position_dodge(), alpha = 0.9) +
  geom_text(aes(label=Observations), vjust=1.6, color="white", size=4, position=position_dodge(0.9)) +
  theme_bw() +
  theme(legend.background = element_rect(fill="grey95", size=0.5,
                                         linetype="solid", colour ="black"),
        axis.text.x = element_text(angle = 45, hjust=1, size = 10, face="bold"), 
        axis.text.y = element_text(size = 10, face="bold"),
        axis.title = element_text(size = 12, face="bold"),
        panel.grid.minor = element_blank())


This is the dataset:

dput(df)
structure(list(Taxa = c("Cnidaria ", "Cnidaria ", "Echinodermata", 
"Echinodermata", "Mollusca", "Mollusca", "Bryozoa", "Bryozoa", 
"Porifera", "Porifera", "Annelida", "Annelida", "Foraminifera", 
"Foraminifera", "Platyhelminthes", "Platyhelminthes", "Brachiopoda", 
"Brachiopoda"), sum = c(26L, 26L, 19L, 19L, 17L, 17L, 15L, 15L, 
14L, 14L, 11L, 11L, 9L, 9L, 7L, 7L, 2L, 2L), Method = structure(c(2L, 
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 
1L), .Label = c("BGR", "GSR"), class = "factor"), Observations = c(16L, 
10L, 10L, 9L, 11L, 6L, 12L, 3L, 11L, 3L, 8L, 3L, 4L, 5L, 6L, 
1L, 2L, 0L)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-18L))

However, there are two values per group (one per method), and the sum column encompasses both methods. Adding the line does not work because I need one value per taxon. I tried to create a second data frame with only one value per group using filter() (see the code below), but that does not work either.

df2 <- df %>% filter(Method !='BGR')
p + geom_line((aes(data = df2, x= Taxa,y=sum/sum(sum)))) +
  scale_y_continuous(name = "Observations", 
    sec.axis = sec_axis(~./5, name = "Cumulative frequency", 
      labels = function(b) { paste0(round((b) * 100, 0), "%")})) 

I think I should use the filter function inside the geom_line() call, but I do not know how to proceed...

Answer:

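You can use dplyr::group_by() to calculate cumsum() within each method, convert that to a cumulative fraction, and then plot it. The key to the second axis is choosing the appropriate scaling factor, which in this case is just the maximum y value on the left axis: the fractions are multiplied by it so they can be drawn against the left axis, and sec_axis() divides by it again so the right axis reads as a percentage.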
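To make the scaling step concrete, here is a minimal base R sketch of the arithmetic for one method only. The vector `obs` is copied by hand from the GSR rows of the dataset (an assumption of this sketch, not part of the original answer), and `scalar` is taken as the maximum of that vector, which happens to equal the overall maximum of 16 used below.

```r
# GSR observations, in the plotting order of the taxa (copied from the dataset)
obs <- c(16, 10, 11, 12, 11, 8, 4, 6, 2)

# Scaling factor: the tallest bar on the left axis
scalar <- max(obs)

# Cumulative fraction, running from obs[1] / sum(obs) up to 1
cum_fract <- cumsum(obs) / sum(obs)

# Values actually drawn against the left axis
scaled <- cum_fract * scalar

# What sec_axis(~ . / scalar) displays on the right axis
right_axis <- scaled / scalar

print(round(cum_fract, 3))
print(round(scaled, 2))
```

Because the last cumulative fraction is 1, the line ends exactly at the height of the tallest bar, which is why the maximum y value is a convenient choice of scaling factor.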

library(tidyverse)

d <- structure(list(
  Taxa = c("Cnidaria ", "Cnidaria ", "Echinodermata", "Echinodermata",
           "Mollusca", "Mollusca", "Bryozoa", "Bryozoa", "Porifera", "Porifera",
           "Annelida", "Annelida", "Foraminifera", "Foraminifera",
           "Platyhelminthes", "Platyhelminthes", "Brachiopoda", "Brachiopoda"),
  sum = c(26L, 26L, 19L, 19L, 17L, 17L, 15L, 15L, 14L,
          14L, 11L, 11L, 9L, 9L, 7L, 7L, 2L, 2L),
  Method = structure(c(2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
                       1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L),
                     .Label = c("BGR", "GSR"), class = "factor"),
  Observations = c(16L, 10L, 10L, 9L, 11L, 6L, 12L, 3L, 11L,
                   3L, 8L, 3L, 4L, 5L, 6L, 1L, 2L, 0L)),
  class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -18L))

scalar <- max(d$Observations)

d %>% 
  mutate(Taxa = fct_relevel(Taxa, 
                               "Cnidaria ", "Echinodermata", "Mollusca", 
                               "Bryozoa", "Porifera", "Annelida", 
                               "Foraminifera", "Platyhelminthes", "Brachiopoda")) %>% 
  group_by(Method) %>% 
  # cumulative count and fraction within each method (rows are already in plotting order)
  mutate(cum_obs = cumsum(Observations),
         cum_obs_fract = cum_obs/sum(Observations)) %>% 
  ungroup() %>% 
  # rescale the 0-1 fraction to the range of the left axis
  mutate(scaled_cum_obs_fract = cum_obs_fract * scalar) %>% 
  ggplot(aes(x=Taxa,y=Observations,fill = Method)) +
  geom_bar(stat = "identity", width = 0.8 ,size = 1, position=position_dodge(), alpha = 0.9) +
  geom_text(aes(label=Observations), vjust=1.6, color="white", size=4, position=position_dodge(0.9)) +
  geom_line(aes(y = scaled_cum_obs_fract, color = Method, group = Method)) +
  scale_y_continuous(name = "Observations",
                     sec.axis = sec_axis(~./scalar, name = "Cumulative frequency",
                                         labels = function(b) { paste0(round((b) * 100, 0), "%")})) +
  theme_bw() +
  theme(legend.background = element_rect(fill="grey95", size=0.5,
                                         linetype="solid", colour ="black"),
        axis.text.x = element_text(angle = 45, hjust=1, size = 10, face="bold"), 
        axis.text.y = element_text(size = 10, face="bold"),
        axis.title = element_text(size = 12, face="bold"),
        panel.grid.minor = element_blank())

Created on 2022-08-12 by the reprex package (v2.0.1)
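If a single cumulative line over both methods is wanted instead (the question's `sum` column), one possible variation is to give geom_line() its own one-row-per-taxon data frame through its `data` argument. The sketch below is an assumption-laden illustration rather than part of the original answer: the names `taxa`, `line_df`, `total` and `p` are made up for the example, the data frame is rebuilt compactly instead of pasted from dput(), and `fill = Method` is moved into the bar layer so the line layer does not look for a `Method` column that `line_df` lacks.

```r
library(dplyr)
library(ggplot2)

# Compact rebuild of the dataset above (same values, same row order)
taxa <- c("Cnidaria ", "Echinodermata", "Mollusca", "Bryozoa", "Porifera",
          "Annelida", "Foraminifera", "Platyhelminthes", "Brachiopoda")
d <- tibble(
  Taxa = factor(rep(taxa, each = 2), levels = taxa),
  sum = rep(c(26L, 19L, 17L, 15L, 14L, 11L, 9L, 7L, 2L), each = 2),
  Method = factor(rep(c("GSR", "BGR"), times = 9), levels = c("BGR", "GSR")),
  Observations = c(16L, 10L, 10L, 9L, 11L, 6L, 12L, 3L, 11L,
                   3L, 8L, 3L, 4L, 5L, 6L, 1L, 2L, 0L)
)

scalar <- max(d$Observations)

# One row per taxon; `sum` already holds the total over both methods
line_df <- d %>%
  distinct(Taxa, sum) %>%
  rename(total = sum) %>%
  mutate(cum_fract = cumsum(total) / sum(total))

p <- ggplot(d, aes(x = Taxa, y = Observations)) +
  # fill is set here, not globally, so geom_line() does not inherit it
  geom_col(aes(fill = Method), width = 0.8,
           position = position_dodge(), alpha = 0.9) +
  # separate data for the line; group = 1 joins points across a discrete x axis
  geom_line(data = line_df, aes(y = cum_fract * scalar, group = 1)) +
  scale_y_continuous(name = "Observations",
                     sec.axis = sec_axis(~ . / scalar, name = "Cumulative frequency",
                                         labels = function(b) paste0(round(b * 100), "%"))) +
  theme_bw()

p
```

This is also why the attempt in the question fails: `data` is an argument of geom_line() itself, not an aesthetic, so it has to sit outside aes().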
