Making a Sankey Diagram in R

I'm trying to create a Sankey diagram in R with either the {plotly} or {networkD3} package. Both ask for the same type of data: source, target, value. I'm not sure what source, target, and value are supposed to be, or how to aggregate my data into this format. I have the following:

data.frame(
  UniqID = rep(c(1:10), times=4), 
  Year = rep(c("2005", "2010", "2015", "2020"), times=10),
  Response_Variable = round(runif(n = 40, min = 0, max = 2), digits = 0)
)

The response variable is a categorical variable of 0, 1, or 2. I would like to show the flow of the classes of this variable from one year to the next. The final product should look something like this:

In my case, "Wave" would be Year and "Outcome" would be the classes (0, 1, 2) of the response variable.

CodePudding user response:

Your data doesn't contain enough information to make a chart exactly like that, because it isn't clear which individuals moved from one category to another between years. Maybe you were trying to capture that with the UniqID column, but as constructed it doesn't give you one row per individual per year:

library(dplyr)

df <- data.frame(
  UniqID = rep(c(1:10), times = 4),
  Year = rep(c("2005", "2010", "2015", "2020"), times = 10),
  Response_Variable = round(runif(n = 40, min = 0, max = 2), digits = 0)
)

# UniqID 1 appears twice in 2005 and twice in 2015, never in 2010 or 2020
df %>% arrange(UniqID, Year) %>% filter(UniqID == 1)
#>   UniqID Year Response_Variable
#> 1      1 2005                 2
#> 2      1 2005                 1
#> 3      1 2015                 1
#> 4      1 2015                 0

Setting that aside, the format you're asking about is a list of "links", each one defining a movement from one node (the "source") to another node (the "target"). In your case, each year-category combination is a node, and you need a list of the links between those nodes, plus a "value" for each link. Here, the number of occurrences of the source node makes the most sense as the value. You could reshape your data to that format like this...

df %>% 
  group_by(Year, Response_Variable) %>% 
  summarise(value = n(), .groups = "drop") %>% 
  mutate(source = paste(Year, Response_Variable, sep = "_")) %>% 
  group_by(Response_Variable) %>% 
  mutate(target = lead(source, order_by = Year)) %>% 
  filter(!is.na(target))
#> # A tibble: 9 × 5
#> # Groups:   Response_Variable [3]
#>   Year  Response_Variable value source target
#>   <chr>             <dbl> <int> <chr>  <chr> 
#> 1 2005                  0     4 2005_0 2010_0
#> 2 2005                  1     3 2005_1 2010_1
#> 3 2005                  2     3 2005_2 2010_2
#> 4 2010                  0     2 2010_0 2015_0
#> 5 2010                  1     6 2010_1 2015_1
#> 6 2010                  2     2 2010_2 2015_2
#> 7 2015                  0     3 2015_0 2020_0
#> 8 2015                  1     3 2015_1 2020_1
#> 9 2015                  2     4 2015_2 2020_2
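
The links above connect each category only to itself in the next year, because the counts can't say who moved where. As a rough sketch of what the chart in the question would need, suppose the data were rebuilt so that every UniqID has exactly one row per year. The name df_fixed, the use of each = 4, and the seed are my assumptions, not part of the original data:

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Assumed corrected data: each = 4 gives every UniqID one row per year
df_fixed <- data.frame(
  UniqID = rep(1:10, each = 4),
  Year = rep(c("2005", "2010", "2015", "2020"), times = 10),
  Response_Variable = round(runif(n = 40, min = 0, max = 2), digits = 0)
)

transitions <- df_fixed %>%
  arrange(UniqID, Year) %>%
  mutate(source = paste(Year, Response_Variable, sep = "_")) %>%
  group_by(UniqID) %>%
  # the same individual's year-category in the following year
  mutate(target = lead(source)) %>%
  ungroup() %>%
  # the last year has nothing to flow into
  filter(!is.na(target)) %>%
  # number of individuals making each source -> target move
  count(source, target, name = "value")

transitions
```

With 10 individuals and 3 year-to-year steps, the value column sums to 30, and a link such as 2005_0 to 2010_2 can now exist if someone actually changed category.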

To get to the more specific format that {networkD3} requires, you need one data.frame for links and one that lists each node. The links data.frame needs to refer to each node in the nodes data.frame by its 0-based index. You can set that up like this...

library(dplyr)
library(networkD3)

df <- 
  data.frame(
    UniqID=rep(c(1:10), times=4), 
    Year=rep(c("2005", "2010", "2015", "2020"), times=10), 
    Response_Variable=round(runif(n=40, min = 0, max = 2), digits=0)
  )

links <-
  df %>% 
  group_by(Year, Response_Variable) %>% 
  summarise(value = n(), .groups = "drop") %>% 
  mutate(source = paste(Year, Response_Variable, sep = "_")) %>% 
  group_by(Response_Variable) %>% 
  mutate(target = lead(source, order_by = Year)) %>% 
  filter(!is.na(target)) %>% 
  ungroup() %>% 
  select(source, target, value)

# One row per distinct node name
nodes <- data.frame(node_id = unique(c(links$source, links$target)))

# Replace node names with their 0-based row positions in nodes
links$source <- match(links$source, nodes$node_id) - 1
links$target <- match(links$target, nodes$node_id) - 1

sankeyNetwork(
  Links = links,
  Nodes = nodes,
  Source = "source", 
  Target = "target", 
  Value = "value", 
  NodeID = "node_id"
)
#> Links is a tbl_df. Converting to a plain data frame.
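
Since the question also mentions {plotly}, here is a minimal sketch of how the same kind of 0-based links and nodes could be passed to a plotly sankey trace. The toy nodes and links below are hand-written and hypothetical, only there to keep the example self-contained:

```r
library(plotly)

# Hypothetical toy data, hand-written for illustration only
nodes <- data.frame(
  node_id = c("2005_0", "2005_1", "2010_0", "2010_1")
)
links <- data.frame(
  source = c(0, 0, 1, 1),  # 0-based row positions in nodes
  target = c(2, 3, 2, 3),
  value = c(3, 1, 2, 4)
)

fig <- plot_ly(
  type = "sankey",
  orientation = "h",
  node = list(label = nodes$node_id),
  link = list(
    source = links$source,
    target = links$target,
    value = links$value
  )
)

# Typing fig at the console displays the diagram
```

The main difference from {networkD3} is that plotly takes the node labels and the link vectors as plain lists rather than as two data.frames plus column names.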