Alluvial plot with 2 different sources but a converging/shared variable [R]

Time:12-17

I have experience making alluvial plots with the ggalluvial package. However, I have run into an issue when trying to create an alluvial plot with two different sources that converge onto one variable.

Here is example data:

library(dplyr)
library(ggplot2)
library(ggalluvial)

data <- data.frame(
  unique_alluvium_entires = seq(1:10),
  label_1 = c("A", "B", "C", "D", "E", rep(NA, 5)),
  label_2 = c(rep(NA, 5), "F", "G", "H", "I", "J"),
  shared_label = c("a", "b", "c", "c", "c", "c", "c", "a", "a", "b")
)

Here is the code I use to make the plot:

#prep the data
data <- data %>%
  group_by(shared_label) %>%
  mutate(freq = n())

data <- reshape2::melt(data, id.vars = c("unique_alluvium_entires", "freq"))
data$variable <- factor(data$variable, levels = c("label_1", "shared_label", "label_2"))

#ggplot
ggplot(data,
       aes(x = variable, stratum = value, alluvium = unique_alluvium_entires,
           y = freq, fill = value, label = value)) +
  scale_x_discrete(expand = c(.1, .1)) +
  geom_flow() +
  geom_stratum(color = "grey", width = 1/4, na.rm = TRUE) +
  geom_text(stat = "stratum", size = 4) +
  theme_void() +
  theme(
    axis.text.x = element_text(size = 12, face = "bold")
  )

Resulting plot (apparently I cannot embed images yet)

As you can see, I can remove the NA values, but the shared_label does not properly "stack". Each unique row should stack on top of each other in the shared_label column. This would also fix the sizing issue so that they are equal size along the y axis.

Any ideas how to fix this? I have tried ggsankey, but the same issue arises and I cannot remove the NA values. Any tips are greatly appreciated!

CodePudding user response:

This plot is the expected result of the "flow" statistical transformation, which is the default for the "flow" graphical object. (That is, geom_flow() = geom_flow(stat = "flow").) It looks like what you want is to specify the "alluvium" statistical transformation instead. Below I've kept all of your code and only copied and edited the ggplot() call.

#ggplot
ggplot(data,
       aes(x = variable, stratum = value, alluvium = unique_alluvium_entires,
           y = freq, fill = value, label = value)) +
  scale_x_discrete(expand = c(.1, .1)) +
  geom_flow(stat = "alluvium") +   # <-- specify alternate stat
  geom_stratum(color = "grey", width = 1/4, na.rm = TRUE) +
  geom_text(stat = "stratum", size = 4) +
  theme_void() +
  theme(
    axis.text.x = element_text(size = 12, face = "bold")
  )
#> Warning: Removed 2 rows containing missing values (geom_text).

Created on 2021-12-10 by the reprex package (v2.0.1)
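
As a quick sanity check on the input (this is an illustrative sketch, not part of the original answer), the base R snippet below counts the missing values on each axis and confirms that every alluvium has a value on exactly two of the three axes. It assumes the data frame from the question as it is before the group_by() and melt() steps; the names axes, na_per_axis and axes_per_alluvium are made up for this illustration.

```r
# Rebuild the example data from the question (base R only, no extra packages).
data <- data.frame(
  unique_alluvium_entires = 1:10,
  label_1 = c("A", "B", "C", "D", "E", rep(NA, 5)),
  label_2 = c(rep(NA, 5), "F", "G", "H", "I", "J"),
  shared_label = c("a", "b", "c", "c", "c", "c", "c", "a", "a", "b")
)

# The three columns that become the x-axis positions of the plot.
axes <- c("label_1", "shared_label", "label_2")

# Missing values per axis: label_1 and label_2 each have 5 NAs,
# shared_label has none.
na_per_axis <- colSums(is.na(data[, axes]))
print(na_per_axis)

# Number of axes on which each alluvium (row) has a value.
# Every row is present on shared_label and on exactly one of the two sources.
axes_per_alluvium <- rowSums(!is.na(data[, axes]))
print(axes_per_alluvium)
```

Because each row is missing on one of the outer axes, the NA entries seen in the melted data are expected, and so is a warning about removed missing values when the labels are drawn.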
