Home > database >  Order a dataframe based on a column that points to the next record
Order a dataframe based on a column that points to the next record

Time:09-08

I have a data frame with a column that points to the next record, sample dataframe below.

OG_Data <- tibble(
  Record = c("aaaa", "NNNN", "rrrr", "tttt", "pppp", "ssss", "bbbb"),
  NextRecord = c("pppp", "tttt", "bbbb", "N/A" , "NNNN", "rrrr", "N/A") 
  )

Original Data:

# Record NextRecord 
# aaaa   pppp
# NNNN   tttt       
# rrrr   bbbb       
# tttt   N/A        
# pppp   NNNN       
# ssss   rrrr       
# bbbb   N/A        

I am looking to order this data frame based on a predefined sequence determined by column B (NextRecord) that points to the next record's column A (Record) to get the sequence order and line group.

Desired Output:

# Record  NextRecord  Sequence  Line 
# aaaa    pppp        1         1    
# pppp    NNNN        2         1    
# NNNN    tttt        3         1    
# tttt    N/A         4         1    
# ssss    rrrr        1         2    
# rrrr    bbbb        2         2    
# bbbb    N/A         3         2    

I was thinking of something like this:

OutputData <- OG_Data[1,] %>% add_row(OG_Data, filter(OG_Data, OG_Data$Record == NextRecord))

But that doesn't work and is not scalable. Also, I am not sure where to start to find the beginning of the line groups.

CodePudding user response:

I bet there are simpler ways, but at least it's fun to approach it as a graph problem.

library(igraph)
g = graph_from_data_frame(d)
g2 = sapply(V(g)[degree(g, mode = 'in') == 0], function(v) all_simple_paths(g, v, "N/A"))
d2 = d[unlist(lapply(g2, function(v) head(as.vector(v), -1))),]
d2$g = rep(seq_along(g2), lengths(g2) - 1)

  Record NextRecord g
1   aaaa       pppp 1
5   pppp       NNNN 1
2   NNNN       tttt 1
4   tttt        N/A 1
6   ssss       rrrr 2
3   rrrr       bbbb 2
7   bbbb        N/A 2

Then Numbering rows within groups in a data frame

CodePudding user response:

With cumsum and lag:

library(dplyr)
OG_Data %>%
  mutate(NextRecord = na_if(NextRecord, "N/A"),
         Line = cumsum(lag(is.na(NextRecord), default = T))) %>% 
  group_by(Line) %>% 
  mutate(Sequence = row_number())

output

  Record NextRecord  Line Sequence
  <chr>  <chr>      <int>    <int>
1 aaaa   pppp           1        1
2 NNNN   tttt           1        2
3 rrrr   bbbb           1        3
4 tttt   NA             1        4
5 pppp   NNNN           2        1
6 ssss   rrrr           2        2
7 bbbb   NA             2        3
  • Related