make a single barplot from two dataframes of unequal length

Time:06-29

I have two (in reality - more) dataframes:
(edit - the current answer does not answer my question)

sex <- data.frame(sex = c("M", "F")
                   ,n = c(25, 30))

age <- data.frame(age = c("20-40","40-60","60-80")
                    ,n = c(18, 30, 25))

I would like to produce a single stacked bar chart that will show information from both of these dataframes.

The final plot should look something like this: [example image]

I imagine I would first merge these dataframes of unequal lengths, filling the missing rows with NAs. I am asking about the plot rather than about merging because I am hoping there is a ggplot solution that does not require merging at all.

EDIT
Following the first answer, I would like to add one other desired trait for the plot, not explicitly stated until now: the bars should share the same colour palette, as in the example plot above.

EDIT2
The exact colours are not important to me and do not carry any meaning. However, my real data will have more than two bars, and I do not want to produce a figure with 40 different colours. As you can see in my example plot, I do not display a legend; instead, each category is written as text on top of its bar segment. This keeps the plot clear despite having no legend and despite the shared colours.

CodePudding user response:

library(tidyverse)

sex_data <- data.frame(
  sex = c("M", "F"),
  n = c(25, 30)
)

age_data <- data.frame(
  age = c("20-40", "40-60", "60-80"),
  n = c(18, 30, 25)
)

ggplot() +
  geom_bar(data = sex_data, mapping = aes(x = "sex", fill = sex, y = n/sum(n)), stat = "identity") +
  geom_bar(data = age_data, mapping = aes(x = "age", fill = age, y = n/sum(n)), stat = "identity") +
  coord_flip() +
  labs(x = "", y = "Proportion", fill = "")

Created on 2022-06-28 by the reprex package (v2.0.0)
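With more than two data frames, adding one geom_bar() layer per frame gets repetitive. A minimal sketch of an alternative, assuming every data frame has one category column named after the variable plus a column n: stack the frames into a single long data frame and plot that once. The helper to_long() and the column names variable, level and prop are hypothetical names introduced only for this illustration.

```r
library(ggplot2)

sex_data <- data.frame(sex = c("M", "F"), n = c(25, 30))
age_data <- data.frame(age = c("20-40", "40-60", "60-80"), n = c(18, 30, 25))

# Hypothetical helper: give every data frame the same three columns
# (variable, level, n) so that the frames can be stacked with rbind().
to_long <- function(df, variable) {
  data.frame(
    variable = variable,
    level = as.character(df[[variable]]),
    n = df$n
  )
}

# Named list: the names must match the category column of each frame
frames <- list(sex = sex_data, age = age_data)
long <- do.call(rbind, Map(to_long, frames, names(frames)))

# Proportion within each variable, so that every bar sums to 1
long$prop <- long$n / ave(long$n, long$variable, FUN = sum)

p <- ggplot(long, aes(x = variable, y = prop, fill = level)) +
  geom_col() +
  # write each category on its own bar segment
  geom_text(aes(label = level), position = position_stack(vjust = 0.5)) +
  coord_flip() +
  labs(x = "", y = "Proportion", fill = "")

print(p)
```

No NA padding is needed here, because the frames are stacked by rows rather than merged side by side. Adding another variable only requires another entry in frames.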

CodePudding user response:

This is mainly about preparing your data for the plot. This assumes a regular construction of your data frames. If you don't have that, then you need to make sure that they are structured accordingly.

Disclaimer: I cannot endorse this type of visualisation, the below is just about demonstrating one way to get your desired result.

Why do I think this isn't a good idea? Using the same colors within one plot suggests a connection or relation between variables that may not exist. For example, in your plot, we would be inclined to believe that all subjects aged 20-40 are female and all those aged 40-60 are male, because they have the same color. This can create misleading messages and can be dangerous, especially during data exploration (which is what you seem to have in mind; otherwise there would be no need to produce so many graphs).

library(tidyverse)
sex <- data.frame(sex = c("M", "F")
                  ,n = c(25, 30))

age <- data.frame(age = c("20-40","40-60","60-80")
                  ,n = c(18, 30, 25))

## add a "meaningful" mapping variable to color,
## assumptions: 
## - your data frames are named like the desired category, 
## - they are arranged in the order you would like to have the columns stacked
## - they have two columns
## - one column named after the desired category 
## - the other with the values

## first sort your sex column according to your desired output.
sex <- sex[nrow(sex):1,]
my_categ <- c("sex", "age")
map(my_categ, ~{
  get(as.name(.x)) %>%
    ## convert to factor and use levels for fill
    mutate(order = as.integer(fct_inorder(.data[[.x]])))
}) %>%
  ## bind to one data frame
  bind_rows() %>%
  ## make both sex and age one variable
  pivot_longer(my_categ) %>%
  ## remove NAs
  drop_na(value) %>%
  ggplot() +
  ## use order as fill
  geom_col(aes(n, name, fill = order)) +
  ## add the labels
  geom_text(aes(n, name, label = value), 
            position = position_stack(vjust = .5)) +
  theme(legend.position = "none")
#> Note: Using an external vector in selections is ambiguous.
#> ℹ Use `all_of(my_categ)` instead of `my_categ` to silence this message.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> This message is displayed once per session.

Created on 2022-06-28 by the reprex package (v2.0.1)
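The variations further down reuse this data transformation. A minimal sketch of storing its result in a variable so that it can be inspected and reused, with two assumptions that differ from the code above: the data frames are looked up in a named list (frames, a hypothetical name) instead of with get(), and all_of() is used in pivot_longer(), as the message in the output suggests. plot_data is also a hypothetical name.

```r
library(tidyverse)

## sex is already in the desired stacking order here (F first, then M)
sex <- data.frame(sex = c("F", "M"), n = c(30, 25))
age <- data.frame(age = c("20-40", "40-60", "60-80"), n = c(18, 30, 25))

my_categ <- c("sex", "age")
## assumption: a named list replaces the get() lookup by object name
frames <- list(sex = sex, age = age)

plot_data <- map(my_categ, ~{
  frames[[.x]] %>%
    ## position of each category within its own data frame
    mutate(order = as.integer(fct_inorder(.data[[.x]])))
}) %>%
  bind_rows() %>%
  ## one row per category, with columns n, order, name and value
  pivot_longer(all_of(my_categ)) %>%
  drop_na(value)

print(plot_data)

p <- ggplot(plot_data) +
  geom_col(aes(n, name, fill = order)) +
  geom_text(aes(n, name, label = value),
            position = position_stack(vjust = .5)) +
  theme(legend.position = "none")
```

Printing plot_data shows the long format that the plot consumes: name holds the variable ("sex" or "age"), value holds the category label, and order holds the position that is mapped to fill.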

Maybe you meant "they are not stacked to a 100%". This is easily achievable by using position = "fill".

## same data transformation as above piped into the plot... %>%
  ggplot() +
  geom_col(aes(n, name, fill = order), position = "fill") +
  geom_text(aes(n, name, label = value), 
            position = position_fill(vjust = .5)) +
  theme(legend.position = "none")

If you want to use color, a less misleading way would be to use one color per category and have a gradient. This can either be achieved by using different mono-hue palettes or simply by adding an alpha (as I am using here).

## same data transformation as above piped into the plot... %>%
  ggplot() +
  ## use name as fill and order as alpha
  geom_col(aes(n, name, fill = name, alpha = order), position = "fill") +
  ## add the labels
  geom_text(aes(n, name, label = value), 
            position = position_fill(vjust = .5)) +
  theme(legend.position = "none")
