Boxplot and violin plot misaligned in ggplot2 for only one level of the x-axis

I am using the starwars dataset from tidyverse. I am trying to generate overlapping violin and box plots in ggplot2 but they seem not to be aligning.

# Remove Jabba the Hutt, which is a massive outlier
starwars$mass[which(starwars$mass>1000)] <- NA

starwarsunnested <- starwars %>% unnest(films)

pos <- position_dodge(0.9)

starwarsunnested %>% drop_na() %>% ggplot(aes(x = gender, y = mass, fill = films)) +
  geom_violin(position = pos) +
  geom_boxplot(width = .2,
               fatten = NULL,
               position = pos) +
  stat_summary(fun = "mean",
               geom = "point",
               position = pos) +
  stat_summary(fun.data = "mean_se",
               geom = "errorbar",
               width = .1,
               position = pos)

This generates this figure. As you can see, "masculine" appears as expected, with aligned box plots and violin plots, while "feminine" does not. For example, "Attack of the Clones" has one dot in it and the rest of its boxplot is shifted. What is going on here? I suspect it's because there are no feminine characters belonging to the "Return of the Jedi" level of the films factor, but then why does that lead to misalignment? Additionally, the misalignment seems inconsistent in terms of how spaced out the box and violin plots are for feminine...

CodePudding user response：

The issue is that all combinations of gender and films with fewer than two observations get dropped by geom_violin and geom_boxplot, but not by stat_summary. Interestingly, however, while the dropped groups are still taken into account for the dodging in the case of geom_boxplot, this is not the case for geom_violin, i.e. the violins are dodged as if there are only four groups (aka films) for feminine, which causes the misalignment. To me this is an inconsistency and perhaps a bug.
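
To illustrate why dodging over a different number of groups shifts the positions, here is a minimal sketch. It assumes that dodging spreads n elements evenly across the dodge width, centred on the category's x position. dodge_offsets() is a hypothetical helper written for this illustration, not a ggplot2 function, and the group counts 5 and 4 are purely illustrative.

```r
# Hypothetical helper: horizontal offsets of n dodged elements, assuming they
# are spread evenly across `width` and centred on the category position.
dodge_offsets <- function(n, width = 0.9) {
  width * ((seq_len(n) - 0.5) / n - 0.5)
}

# Offsets when five groups are counted for the dodging ...
offsets_all <- dodge_offsets(5)
# ... versus offsets when only four groups are counted.
offsets_kept <- dodge_offsets(4)

print(round(offsets_all, 3))
print(round(offsets_kept, 3))
```

Under this assumption the two sets of offsets do not coincide, so layers that count a different number of groups within the same x category cannot line up.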

One option would be to get rid of the groups with only one observation. A second option or workaround would be to manually dodge the violins.

library(dplyr, warn = FALSE)
library(tidyr)
library(ggplot2)

starwarsunnested <- starwars %>% 
  unnest(films) %>% 
  drop_na() %>%
  add_count(gender, films) %>%
  filter(n > 1)

pos <- position_dodge(0.9)

ggplot(starwarsunnested, aes(x = gender, y = mass, fill = films)) +
  geom_violin(position = pos) +
  geom_boxplot(width = .2,
               fatten = NULL,
               position = pos) +
  stat_summary(fun = "mean",
               geom = "point",
               position = pos) +
  stat_summary(fun.data = "mean_se",
               geom = "errorbar",
               width = .1,
               position = pos)
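
To check which gender/film combinations are affected, one could count the observations per group before filtering. This is only a sketch: group_sizes and n_obs are names chosen for this illustration, and the outlier step mirrors the one from the question.

```r
library(dplyr)
library(tidyr)

# Count the observations in every gender/film combination after removing
# the mass outlier and all incomplete rows, smallest groups first.
group_sizes <- starwars %>%
  mutate(mass = replace(mass, which(mass > 1000), NA)) %>%
  unnest(films) %>%
  drop_na() %>%
  count(gender, films, name = "n_obs") %>%
  arrange(n_obs)

print(group_sizes)
```

Combinations with n_obs equal to 1 are the ones that the filter(n > 1) step above removes.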

A second option or workaround, which keeps all observations, is to manually dodge the violins. Basically, this involves converting the categorical variables to numerics. To make my life a bit easier when computing the positions for the violins, I rescale the "numeric" films to the range from -1 to 1.

One thing I only figured out by trial and error (and I am still wondering what the reason is) is how to take the number of genders into account when computing the width by which we have to shift the position of the violins.

starwarsunnested <- starwars %>% 
  unnest(films) %>% 
  drop_na()

starwarsunnested$gender_num <- as.numeric(factor(starwarsunnested$gender))
starwarsunnested$films_num <- as.numeric(factor(starwarsunnested$films))
starwarsunnested$films_num <- scales::rescale(starwarsunnested$films_num, to = c(-1, 1))

n_films <- length(unique(starwarsunnested$films))
n_gender <- length(unique(starwarsunnested$gender))

width <- .9
pos <- position_dodge(0.9)

dw_violin <- (n_gender + 1) * width / n_films

ggplot(starwarsunnested, aes(x = gender, y = mass, fill = films)) +
  scale_x_discrete() +
  geom_violin(aes(x = gender_num + dw_violin * films_num, group = interaction(gender, films)), position = "identity") +
  geom_boxplot(width = .2,
               fatten = NULL,
               position = pos) +
  stat_summary(fun = "mean",
               geom = "point",
               position = pos) +
  stat_summary(fun.data = "mean_se",
               geom = "errorbar",
               width = .1,
               position = pos)

CodePudding user response：

The violin plots look weird because some films have only one data point for one of the genders:

library(tidyverse)

starwarsunnested <- starwars %>%
  unnest(films)

pos <- position_dodge(0.9)

starwars %>%
  unnest(films) %>%
  drop_na(gender, mass, hair_color) %>%
  # keep only gender/film combinations with more than one observation
  dplyr::group_by(gender, films) %>%
  dplyr::mutate(count = n()) %>%
  dplyr::filter(count > 1) %>%
  dplyr::ungroup() %>%
  # set the aesthetics once so that the stat_summary layers inherit them too
  ggplot(aes(gender, mass, fill = films)) +
  geom_violin(alpha = .6, position = pos) +
  geom_boxplot(alpha = .6, position = pos) +
  stat_summary(fun = "mean", geom = "point", position = pos) +
  stat_summary(fun.data = "mean_se", geom = "errorbar", width = .1, position = pos)