Formatting a dataframe for a ggalluvial plot between 2 categorical variables?

Time:10-23

I have a dataframe with three categorical variables:

  • classification1 : "1", "2", "3", "4"
  • classification2 : "Low", "Medium", "High"
  • survival: "1" (alive), "2" (dead)

The dataframe is composed of multiple rows, one for each individual. The first 20 rows:

classification1 <- c(4, 3, 1, 2, 3, 1, 2, 2, 2, 2, 1, 1, 4, 2, 2, 1, 2, 1, 3, 2)
classification2 <- c("Medium", "Medium", "Low", "High", "High", "Low", "Medium", "Medium", "High", "Low", "Low", "Low", "High", "High", "Medium", "Low", "Medium", "Low", "Medium", "Medium")
survival <- c(2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 2, 1)
df <- data.frame(classification1, classification2, survival)

I want to use ggalluvial with ggplot2 to build an alluvial plot like this, but I can't figure out how!

[image: alluvial plot]

The following code returns an error ( Error in FUN(X[[i]], ...) : objet 'Freq' introuvable , i.e. object 'Freq' not found) because I can't figure out what "Freq" is:

ggplot(data = df, aes(axis1 = classification1, axis2 = classification2, y = Freq)) +
  scale_x_discrete(limits = c("classification1", "classification2"), expand = c(.2, .05)) +
  geom_alluvium(aes(fill = survival)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)))

Any clues on how to format my dataframe to fit ggalluvial?

Answer:

You could aggregate your dataset using e.g. dplyr::count to add a column with the frequencies:

library(ggalluvial)
library(dplyr)

df <- df %>% 
  count(classification1, classification2, survival, name = "Freq")

# One alluvium per combination, with its width given by Freq
ggplot(data = df, aes(axis1 = classification1, axis2 = classification2, y = Freq)) +
  scale_x_discrete(limits = c("classification1", "classification2"), expand = c(.2, .05)) +
  # survival is numeric, so convert it to a factor for a discrete fill
  geom_alluvium(aes(fill = factor(survival))) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)))
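
If you would rather not load dplyr, the same aggregation can probably be done in base R. This is only a sketch built on the 20 example rows from the question; the name freq_df is made up for illustration. as.data.frame() applied to a table() names the count column "Freq" by default, which is likely where the Freq in the example code you copied comes from.

```r
# Rebuild the 20 example rows from the question
classification1 <- c(4, 3, 1, 2, 3, 1, 2, 2, 2, 2, 1, 1, 4, 2, 2, 1, 2, 1, 3, 2)
classification2 <- c("Medium", "Medium", "Low", "High", "High", "Low", "Medium",
                     "Medium", "High", "Low", "Low", "Low", "High", "High",
                     "Medium", "Low", "Medium", "Low", "Medium", "Medium")
survival <- c(2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 2, 1)
df <- data.frame(classification1, classification2, survival)

# table() cross-tabulates the three columns; as.data.frame() flattens the
# result to one row per combination, with the counts in a column named "Freq"
freq_df <- as.data.frame(table(df))

# table() also keeps combinations that never occur (Freq == 0); drop them
# (freq_df is a hypothetical name, not something ggalluvial requires)
freq_df <- freq_df[freq_df$Freq > 0, ]
```

freq_df can then be passed to the ggplot() call above in place of the dplyr result. The three classification columns come back as factors here, so fill = survival should already give a discrete scale.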
