Selecting number from string based on criteria

I have the following data set:

PATH = c("5-8-10-8-17-20",
         "56-85-89-89-0-15-88-10",
         "58-85-89-65-49-51")
INDX = c(18, 89, 50)

df <- data.frame(PATH, INDX)

PATH	INDX
5-8-10-8-17-20	18
56-85-89-89-0-15-88-10	89
58-85-89-65-49-51	50

The PATH column holds strings that each represent a series of numbers. For each row, I want to pick the largest number in PATH that is less than or equal to INDX: either a number equal to INDX or, if there is none, the largest number in PATH that is still below INDX.

My desired output would look like this:

PATH	INDX	PICK
5-8-10-8-17-20	18	17
56-85-89-89-0-15-88-10	89	89
58-85-89-65-49-51	50	49

Some of my thinking so far:

I know that with a function such as strsplit I could split each string on "-", sort the numbers, subtract INDX from each, and then select the difference closest to zero among those that are negative or zero. However, the original data set is quite large, and I wonder whether there is a faster or more efficient way to perform this task.
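
For reference, the approach described above might look roughly like the sketch below. The helper name pick_by_difference is made up for illustration, and returning NA when no number qualifies is an assumption rather than something the question specifies. Sorting turns out not to be needed in this version, since max() finds the closest difference directly.

```r
PATH <- c("5-8-10-8-17-20",
          "56-85-89-89-0-15-88-10",
          "58-85-89-65-49-51")
INDX <- c(18, 89, 50)

# Hypothetical helper: split one PATH string, subtract INDX, and keep the
# difference closest to zero among those that are negative or zero.
pick_by_difference <- function(path, indx) {
  x <- as.numeric(strsplit(path, "-", fixed = TRUE)[[1]])
  d <- x - indx
  ok <- d <= 0
  if (!any(ok)) return(NA_real_)  # assumption: NA when nothing is <= INDX
  indx + max(d[ok])               # convert the difference back to the number
}

PICK <- mapply(pick_by_difference, PATH, INDX, USE.NAMES = FALSE)
```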

CodePudding user response：

Another option:

mapply(
  \(x, y) max(x[x <= y]),
  strsplit(PATH, "-") |> lapply(as.integer),
  INDX
)

# [1] 17 89 49
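
One thing to watch for: if a row has no number that is <= INDX, max() is called on an empty vector, which returns -Inf with a warning. A minimal sketch of a guard is below. The helper name pick_le and the second row of data are hypothetical, and using NA for "no match" is an assumption.

```r
# Hypothetical helper: largest element of x that is <= y, or NA if none.
pick_le <- function(x, y) {
  ok <- x[x <= y]
  if (length(ok) == 0L) NA_integer_ else max(ok)
}

# Illustrative data: the second row has no number <= its INDX.
PATH <- c("5-8-10-8-17-20", "30-40")
INDX <- c(18, 10)

res <- mapply(
  pick_le,
  lapply(strsplit(PATH, "-", fixed = TRUE), as.integer),
  INDX
)
res
# [1] 17 NA
```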

CodePudding user response：

Using purrr::map2_dbl():

library(purrr)

PICK <- map2_dbl(
  strsplit(PATH, "-"),
  INDX,
  ~ max(
    as.numeric(.x)[as.numeric(.x) <= .y]
  )
)

# 17 89 49

CodePudding user response：

There is nothing wrong with your approach; the code below should be reasonably efficient.

# lapply() always returns a list; sapply() would return a matrix if all rows had the same length
numpath <- lapply(strsplit(PATH, "-"), as.numeric)
maxindexes <- lapply(seq_along(numpath), function(x) which(numpath[[x]] <= INDX[x]))
result <- sapply(seq_along(numpath), function(x) max(numpath[[x]][maxindexes[[x]]]))
result
# [1] 17 89 49
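
If the per-row function calls turn out to be a bottleneck on a large data set, one possible alternative is to flatten all the numbers once and take a grouped maximum. This is only a sketch: the variable names are illustrative, NA for rows with no qualifying number is an assumption, and whether it is actually faster on your data is worth benchmarking.

```r
PATH <- c("5-8-10-8-17-20",
          "56-85-89-89-0-15-88-10",
          "58-85-89-65-49-51")
INDX <- c(18, 89, 50)

parts <- strsplit(PATH, "-", fixed = TRUE)

# Flatten every number into one vector, remembering which row it came from.
vals <- as.numeric(unlist(parts))
grp  <- rep(seq_along(parts), lengths(parts))

# Keep only the numbers that do not exceed their own row's INDX.
keep <- vals <= INDX[grp]

# Grouped maximum; rows with no qualifying number stay NA (assumption).
best <- tapply(vals[keep], grp[keep], max)
PICK <- rep(NA_real_, length(PATH))
PICK[as.integer(names(best))] <- as.vector(best)
PICK
# [1] 17 89 49
```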

CodePudding user response：

Using dplyr

library(dplyr)

df |>
  rowwise() |>
  mutate(across(PATH, ~ {
    a <- as.numeric(unlist(strsplit(.x, split = "-")))
    max(a[a <= INDX])
  }, .names = "PICK"))

  PATH                    INDX  PICK
  <chr>                  <dbl> <dbl>
1 5-8-10-8-17-20            18    17
2 56-85-89-89-0-15-88-10    89    89
3 58-85-89-65-49-51         50    49

CodePudding user response：

You can create a custom function like below:

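my_func <- function(vec1, vec2) {
  # Sort the numbers so that those <= vec2 form a leading run
  x <- sort(as.numeric(unlist(strsplit(vec1, split = "-"))))
  # The count of values <= vec2 is the position of the largest such value
  x[max(cumsum(x <= vec2))]
}

df$PICK <- sapply(seq_len(nrow(df)), function(i) my_func(df$PATH[i], df$INDX[i]))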

This yields the following output:

#                     PATH INDX PICK
# 1         5-8-10-8-17-20   18   17
# 2 56-85-89-89-0-15-88-10   89   89
# 3      58-85-89-65-49-51   50   49
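
As a variation on the same sorted-vector idea, base R's findInterval() can locate the position directly once the numbers are sorted, without building the cumulative sum. The sketch below is untested on large data. The name pick_sorted is hypothetical, and returning NA when no number qualifies is an assumption.

```r
PATH <- c("5-8-10-8-17-20",
          "56-85-89-89-0-15-88-10",
          "58-85-89-65-49-51")
INDX <- c(18, 89, 50)

# Hypothetical helper: sort the numbers, then let findInterval() return the
# index i of the last element with x[i] <= indx (0 if there is none).
pick_sorted <- function(path, indx) {
  x <- sort(as.numeric(strsplit(path, "-", fixed = TRUE)[[1]]))
  i <- findInterval(indx, x)
  if (i == 0L) NA_real_ else x[i]  # assumption: NA when nothing is <= indx
}

PICK <- mapply(pick_sorted, PATH, INDX, USE.NAMES = FALSE)
PICK
# [1] 17 89 49
```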