R - Efficiently insert multiple strings

Time:10-30

I would like to insert a list of sub strings (word_list) into a string (text) at specific positions (idx_list)

text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
idx_list = c(5,16,30,50)
word_list = c("AAA", "BBB", "CCC", "DDD")

I know there are several functions (gsub, stri_sub, etc.) that I could use in a loop. However, this gets quite slow on large corpora. Is there a more efficient solution? Maybe a vectorized one?

CodePudding user response:

I think it's important to start from the last position (the highest value in idx_list), since otherwise every remaining index would need to be shifted after each insertion. (This is certainly not hard, but going backwards seems easier.)
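For comparison, here is a minimal sketch of the forward alternative mentioned in the parenthetical: insert left to right and shift each later position by the combined length of everything already inserted. The helper name `insert_forward` is hypothetical (it is not part of the answer), and the sketch assumes `index` is sorted in increasing order.

```r
text <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
idx_list <- c(5, 16, 30, 50)
word_list <- c("AAA", "BBB", "CCC", "DDD")

# Hypothetical helper: insert left to right, shifting later positions.
# Assumes `index` is sorted in increasing order.
insert_forward <- function(string, inject, index) {
  # Position k moves right by the total length of insertions 1..(k - 1).
  shift <- cumsum(c(0, nchar(inject)[-length(inject)]))
  for (i in seq_along(index)) {
    pos <- index[i] + shift[i]
    string <- paste0(substring(string, 1, pos - 1), inject[i], substring(string, pos))
  }
  string
}

forward <- insert_forward(text, word_list, idx_list)
```

The extra `shift` bookkeeping is exactly what working from the highest index downwards avoids: an insertion never moves the characters to its left, so the earlier positions stay valid.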

#      0         1         2         3         4         5         6         7         8         9         a         b         c
#      0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
idx_list = c(5,16,30,50)
word_list = c("AAA", "BBB", "CCC", "DDD")

The work:

ord <- order(-idx_list)
Reduce(function(S, R) {
  paste0(substring(S, 1, R[[1]]-1), R[[2]], substring(S, R[[1]], nchar(S)))
}, Map(list, idx_list[ord], word_list[ord]), text)
# [1] "LoreAAAm ipsum dolBBBor sit amet, cCCConsectetur adipiscinDDDg elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."

Walk-through:

  • ord is just the decreasing order, so for instance

    word_list[ord]
    # [1] "DDD" "CCC" "BBB" "AAA"
    
  • because we're going to use Reduce (explanation in a second), we need each pair idx_list[i] and word_list[i] to arrive as one argument, not as two separate ones; for this, we combine them using Map(list, ...), which "zips" them together into a single list whose elements each contain the character position and the string to insert:

    str( Map(list, idx_list[ord], word_list[ord]) )
    # List of 4
    #  $ :List of 2
    #   ..$ : num 50
    #   ..$ : chr "DDD"
    #  $ :List of 2
    #   ..$ : num 30
    #   ..$ : chr "CCC"
    #  $ :List of 2
    #   ..$ : num 16
    #   ..$ : chr "BBB"
    #  $ :List of 2
    #   ..$ : num 5
    #   ..$ : chr "AAA"
    

    (This can be used with an arbitrary number of arguments.)

  • Because we need to insert a string, then insert another string into the result of the first, the base function Reduce will work well here. The first arg is a function that accepts two arguments: the results from the previous call, and the next element from the Map'd argument.
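One way to watch this folding happen is to pass `accumulate = TRUE` to `Reduce`, which keeps every intermediate result instead of only the final one. The sketch below reuses the answer's code unchanged apart from that argument; the name `steps` is only for illustration.

```r
text <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
idx_list <- c(5, 16, 30, 50)
word_list <- c("AAA", "BBB", "CCC", "DDD")

ord <- order(-idx_list)
# accumulate = TRUE returns the initial value followed by the result of each call.
steps <- Reduce(function(S, R) {
  paste0(substring(S, 1, R[[1]] - 1), R[[2]], substring(S, R[[1]], nchar(S)))
}, Map(list, idx_list[ord], word_list[ord]), text, accumulate = TRUE)

# steps[[1]] is the untouched text; each later element has one more insertion,
# applied from the highest position down to the lowest.
for (s in steps) cat(substring(s, 1, 60), "\n")
```

The last element of `steps` is the same string that the plain `Reduce` call returns.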

CodePudding user response:

An adaptation of Zach Foster's answer, condensed into a single function

reposition_str <- function(string, inject, index) {
  inject <- inject[order(index)]
  index <- sort(index)
  # expand string
  split <- substr(rep(string, length(index) + 1),
    start = c(1, index),
    stop = c(index - 1, nchar(string))
  )
  ord1 <- 2 * (1:length(split)) - 1
  ord2 <- 2 * (1:length(inject))
  paste(c(split, inject)[order(c(ord1, ord2))], collapse = "")
}

text <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
idx_list <- c(5, 16, 30, 50)
word_list <- c("AAA", "BBB", "CCC", "DDD")

reposition_str(text, word_list, idx_list)
#> [1] "LoreAAAm ipsum dolBBBor sit amet, cCCConsectetur adipiscinDDDg elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
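The function comes without a walk-through, so here is a small sketch of its two steps run by hand: cut the string at every index, then interleave the pieces with the insertions. It uses `substring`, which recycles its first argument, in place of the `substr(rep(...))` call; that swap and the names `pieces`, `slots` and `interleaved` are illustrative choices, not part of the answer.

```r
text <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
index <- c(5, 16, 30, 50)  # assumed already sorted, as after sort(index)
inject <- c("AAA", "BBB", "CCC", "DDD")

# Step 1: cut the string at each index, giving length(index) + 1 pieces.
pieces <- substring(text, c(1, index), c(index - 1, nchar(text)))

# Step 2: pieces take the odd slots (1, 3, 5, ...) and insertions the even
# slots (2, 4, ...); ordering by slot number interleaves the two vectors.
slots <- c(2 * seq_along(pieces) - 1, 2 * seq_along(inject))
interleaved <- c(pieces, inject)[order(slots)]

result <- paste(interleaved, collapse = "")
```

Because every step is a single vectorized call, no intermediate copy of the whole string is rebuilt per insertion, which is where this approach differs from the `Reduce` version.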

And a benchmark for speed comparisons:

bench::mark(reposition = {reposition_str(text, word_list, idx_list)})
# A tibble: 1 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result    memory               time               gc               
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>    <list>               <list>             <list>           
1 reposition   69.5us   74.5us    12303.    79.7KB     8.32  5918     4      481ms <chr [1]> <Rprofmem [218 x 3]> <bench_tm [5,922]> <tibble [5,922 x~