Path along linked values taking the lowest value each time-CodePudding

I have a data.table with two columns "From" and "To" as follows:

data.table(From = c(1,1,1,1,2,2,2,2,3,3,3,4,4,5),
           To = c(3,4,5,6,3,4,5,6,4,5,6,5,6,6))

The data.table will always be sorted as shown in the example above, with "From" and "To" values increasing from smallest to largest.

I need to find a 'path' starting from the first 'From' (which will always be '1'), through to the last 'To' value, subject to always choosing the lowest 'To' value. In the above example, I would have 1 --> 3, then 3 --> 4, then 4 --> 5, then finally 5 --> 6.

I then want to return in a vector 1, 3, 4, 5 and 6, representing the linked values.

The only way that I can think of doing it is using a while or for loop and looping through each group of 'From' values and iteratively choosing the smallest. That seems inefficient though and will probably be very slow on my actual data set which is over 100,000 rows long.

Are there any data.table-like solutions? I also thought that maybe igraph would have a method for this, but I must admit that I currently have pretty much zero knowledge of this function.

Any help would be greatly appreciated.

Thanks, Phil

CodePudding user response：

If you want to use igraph, I guess "the shortest path" might be a case fits your objective if you artificially assign weights on edges such that only the lower To has a lower weight.

Probably you can try the code like below

setorder(dt, From, To)[, wt := fifelse(rleid(To)==1,1,Inf), From] %>%
  graph_from_data_frame() %>%
  set_edge_attr(name = "weight", value = dt[, wt]) %>%
  shortest_paths(min(dt[, From]), max(dt[, To])) %>%
  pluck(1) %>%
  unlist(use.names = FALSE)

or a shorter version

dt[, .SD[which.min(To)], From] %>%
  graph_from_data_frame() %>%
  shortest_paths(min(dt[, From]), max(dt[, To])) %>%
  pluck(1) %>%
  unlist(use.names = FALSE)

which gives

[1] 1 3 4 5 6

CodePudding user response：

Assuming an ordered From and To list this may work.

It first groups by From, compresses by To, then excludes non-matching From-To values using shift.

If jumps are missing (e.g. To 3 but From 3 missing) it prints NULL

dt[, .(frst = first(To)), From][
  , if(all((frst %in% From)[1:(.N - 1)])){
      c(1, unique(frst[From == shift(frst, type = "lag", fill = T)]))}]
[1] 1 3 4 5 6