I have a data frame with a column that points to the next record, sample dataframe below.
OG_Data <- tibble(
Record = c("aaaa", "NNNN", "rrrr", "tttt", "pppp", "ssss", "bbbb"),
NextRecord = c("pppp", "tttt", "bbbb", "N/A" , "NNNN", "rrrr", "N/A")
)
Original Data:
# Record NextRecord
# aaaa pppp
# NNNN tttt
# rrrr bbbb
# tttt N/A
# pppp NNNN
# ssss rrrr
# bbbb N/A
I am looking to order this data frame based on a predefined sequence determined by column B (NextRecord) that points to the next record's column A (Record) to get the sequence order and line group.
Desired Output:
# Record NextRecord Sequence Line
# aaaa pppp 1 1
# pppp NNNN 2 1
# NNNN tttt 3 1
# tttt N/A 4 1
# ssss rrrr 1 2
# rrrr bbbb 2 2
# bbbb N/A 3 2
I was thinking of something like this:
OutputData <- OG_Data[1,] %>% add_row(OG_Data, filter(OG_Data, OG_Data$Record == NextRecord))
But that doesn't work and is not scalable. Also, I am not sure where to start to find the beginning of the line groups.
CodePudding user response:
I bet there are simpler ways, but at least it's fun to approach it as a graph problem.
library(igraph)
g = graph_from_data_frame(d)
g2 = sapply(V(g)[degree(g, mode = 'in') == 0], function(v) all_simple_paths(g, v, "N/A"))
d2 = d[unlist(lapply(g2, function(v) head(as.vector(v), -1))),]
d2$g = rep(seq_along(g2), lengths(g2) - 1)
Record NextRecord g
1 aaaa pppp 1
5 pppp NNNN 1
2 NNNN tttt 1
4 tttt N/A 1
6 ssss rrrr 2
3 rrrr bbbb 2
7 bbbb N/A 2
Then Numbering rows within groups in a data frame
CodePudding user response:
With cumsum
and lag
:
library(dplyr)
OG_Data %>%
mutate(NextRecord = na_if(NextRecord, "N/A"),
Line = cumsum(lag(is.na(NextRecord), default = T))) %>%
group_by(Line) %>%
mutate(Sequence = row_number())
output
Record NextRecord Line Sequence
<chr> <chr> <int> <int>
1 aaaa pppp 1 1
2 NNNN tttt 1 2
3 rrrr bbbb 1 3
4 tttt NA 1 4
5 pppp NNNN 2 1
6 ssss rrrr 2 2
7 bbbb NA 2 3