Efficient way to split a huge string in R

I have a huge string (> 500 MB); it is actually an entire book collection in a single string. I have some metadata in a separate data frame, e.g. page numbers, (different) authors and titles. I am trying to detect the title strings in the huge string and split it by title. I assume titles are unique.

The data looks like this:

mystring <- "Lorem ipsum dolor sit amet, sollicitudin duis maecenas habitasse ultrices aenean tempus"

# a dataframe of meta data, e.g. page numbers and titles
mydf <- data.frame(page = c(1, 2),
                   title = c( "Lorem", "maecenas"))
mydf

  page    title
1    1    Lorem
2    2 maecenas

mygoal <- mydf  # text that comes after the title
mygoal$text <- c("ipsum dolor sit amet, sollicitudin duis", "habitasse ultrices aenean tempus")
mygoal 

  page    title                                    text
1    1    Lorem ipsum dolor sit amet, sollicitudin duis
2    2 maecenas        habitasse ultrices aenean tempus

How can I split the string, in the most efficient way, so that everything between the first and second title becomes the first text element, everything between the second and third title becomes the second text element, and so on?

CodePudding user response：

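We could use strsplit, splitting on a regex alternation of all titles. Note that the titles are interpreted as regular expressions here, so a title containing metacharacters such as . or ( would need escaping first.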

mygoal$text <- trimws(strsplit(mystring,
      paste(mydf$title, collapse = "|"))[[1]][-1])

-output

> mygoal
  page    title                                    text
1    1    Lorem ipsum dolor sit amet, sollicitudin duis
2    2 maecenas        habitasse ultrices aenean tempus
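
For a string of this size it may be worth avoiding the regex engine altogether. Below is a minimal base R sketch that looks up each title as a literal string and cuts the text by position. It assumes that every title occurs in the string and that the titles appear in the same order as the rows of mydf; the helper names starts, text_from and text_to are only illustrative.

```r
mystring <- "Lorem ipsum dolor sit amet, sollicitudin duis maecenas habitasse ultrices aenean tempus"
mydf <- data.frame(page = c(1, 2),
                   title = c("Lorem", "maecenas"),
                   stringsAsFactors = FALSE)

# Position of the first occurrence of each title, matched literally (fixed = TRUE),
# so regex metacharacters in a title need no escaping.
starts <- vapply(mydf$title,
                 function(ttl) as.integer(regexpr(ttl, mystring, fixed = TRUE)),
                 integer(1), USE.NAMES = FALSE)

# Assumption: every title is present (regexpr returns -1 when there is no match).
stopifnot(all(starts > 0))

# Each text runs from just after its title to just before the next title;
# the last text runs to the end of the string.
text_from <- starts + nchar(mydf$title)
text_to   <- c(starts[-1] - 1L, nchar(mystring))

mygoal <- mydf
mygoal$text <- trimws(substring(mystring, text_from, text_to))
mygoal
```

On the toy data this gives the same two text elements as the strsplit version. If the titles can appear in a different order than in mydf, the rows would have to be sorted by starts before computing text_to.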

CodePudding user response：

In case you wanted to do the operation in a piped tidyverse way, you could try using stringr::str_extract with some regex:

library(dplyr)
library(stringr)
library(glue)

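mydf |>
  # pair each title with the one that follows it; "$" (end of string) closes the last one
  mutate(next_title = lead(title, default = "$")) |>
  # lookbehind for the title, lookahead (?=...) so the next title is not part of the match
  mutate(text = str_extract(mystring, glue::glue("(?<={title}\\s?)(.*)(?={next_title})"))) |>
  select(-next_title)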

Yielding:

  page    title                                      text
1    1    Lorem  ipsum dolor sit amet, sollicitudin duis 
2    2 maecenas          habitasse ultrices aenean tempus

If performance is a concern, a similar approach with data.table would be:

library(data.table)
library(stringr)
library(glue)

mydt <- setDT(mydf)

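# add the following title by reference, extract the text between the two titles,
# then print a copy without the helper column
mydt[, next_title := shift(title, fill = "$", type = "lead")][
  , text := str_extract(..mystring, glue_data(.SD, "(?<={title}\\s?)(.*)(?={next_title})"))][
  , !("next_title")]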

Resulting in:

   page    title                                      text
1:    1    Lorem  ipsum dolor sit amet, sollicitudin duis 
2:    2 maecenas          habitasse ultrices aenean tempus