I have a huge string (> 500 MB); it is actually an entire book collection in a single string. I have some meta information in a separate data frame, e.g. page numbers, (different) authors, and titles. I am trying to detect the title strings in the huge string and split it by title. I assume titles are unique.
The data looks like this:
mystring <- "Lorem ipsum dolor sit amet, sollicitudin duis maecenas habitasse ultrices aenean tempus"
# a dataframe of meta data, e.g. page numbers and titles
mydf <- data.frame(page = c(1, 2),
                   title = c("Lorem", "maecenas"))
mydf
  page    title
1    1    Lorem
2    2 maecenas
mygoal <- mydf # add the text that comes after each title
mygoal$text <- c("ipsum dolor sit amet, sollicitudin duis", "habitasse ultrices aenean tempus")
mygoal
  page    title                                    text
1    1    Lorem ipsum dolor sit amet, sollicitudin duis
2    2 maecenas        habitasse ultrices aenean tempus
How can I split the string, in the most efficient way, so that everything between the first and second titles becomes the first text element, everything between the second and third titles becomes the second text element, and so on?
CodePudding user response:
We could use strsplit, splitting on an alternation of all titles and dropping the empty first piece:
mygoal$text <- trimws(strsplit(mystring,
                               paste(mydf$title, collapse = "|"))[[1]][-1])
Output:
> mygoal
  page    title                                    text
1    1    Lorem ipsum dolor sit amet, sollicitudin duis
2    2 maecenas        habitasse ultrices aenean tempus
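One caveat with the alternation pattern: `paste(..., collapse = "|")` treats every title as a regular expression, so a title containing characters such as `?`, `+` or `(` would be misinterpreted. Below is a minimal sketch of escaping the titles first. `escape_regex` is a hypothetical helper (not part of base R or of the answer above), and the sample titles and text are made up for illustration.

```r
# Hypothetical helper: put a backslash in front of every regex metacharacter
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Made-up titles that contain metacharacters
titles <- c("What is C++?", "Q&A (part 1)")
text   <- "What is C++? A language. Q&A (part 1) Some answers."

# Build the alternation from the escaped titles, then split as before
pattern <- paste(escape_regex(titles), collapse = "|")
parts   <- trimws(strsplit(text, pattern, perl = TRUE)[[1]][-1])
parts
```

If none of the titles contain metacharacters, the escaping step leaves the pattern unchanged, so it should be safe to apply unconditionally.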
CodePudding user response:
In case you wanted to do the operation in a piped tidyverse way, you could try using stringr::str_extract with some regex:
library(dplyr)
library(stringr)
library(glue)
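mydf |>
  # the next row's title marks where the current text ends; "$" = end of string
  mutate(next_title = lead(title, default = "$")) |>
  # lookbehind/lookahead keep both titles out of the match; str_trim drops the surrounding spaces
  mutate(text = str_trim(str_extract(mystring, glue("(?<={title}\\s?)(.*)(?={next_title})")))) |>
  select(-next_title)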
Yielding:
  page    title                                    text
1    1    Lorem ipsum dolor sit amet, sollicitudin duis
2    2 maecenas        habitasse ultrices aenean tempus
If performance is a concern, a similar approach with data.table would be:
library(data.table)
library(stringr)
library(glue)
mydt <- setDT(mydf)
mydt[, next_title := shift(title, fill = "$", type = "lead")][
  , text := str_extract(..mystring, glue_data(.SD, "(?<={title}\\s?)(.*)(?={next_title})"))][
  , !("next_title")]
Resulting in:
   page    title                                    text
1:    1    Lorem ipsum dolor sit amet, sollicitudin duis
2:    2 maecenas        habitasse ultrices aenean tempus
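For a string of several hundred megabytes, it may also be worth avoiding the regex engine altogether. The following is a minimal base R sketch, not benchmarked, that locates each title as a literal string and cuts the text by position. It assumes that the rows of mydf are in the same order as the titles appear in the string, and that the first occurrence of each title is the actual heading; the names starts, text_from and text_to are illustrative only.

```r
# Toy data from the question
mystring <- "Lorem ipsum dolor sit amet, sollicitudin duis maecenas habitasse ultrices aenean tempus"
mydf <- data.frame(page = c(1, 2),
                   title = c("Lorem", "maecenas"),
                   stringsAsFactors = FALSE)

# Position of the first literal occurrence of each title
# (fixed = TRUE: titles are not interpreted as regular expressions)
starts <- vapply(mydf$title,
                 function(ttl) as.integer(regexpr(ttl, mystring, fixed = TRUE)),
                 integer(1), USE.NAMES = FALSE)
stopifnot(all(starts > 0))  # regexpr returns -1 when a title is not found

# Each text runs from just after its title to just before the next title;
# the last one runs to the end of the string (assumes titles are in order)
text_from <- starts + nchar(mydf$title)
text_to   <- c(starts[-1] - 1L, nchar(mystring))

mygoal <- mydf
mygoal$text <- trimws(substring(mystring, text_from, text_to))
mygoal
```

Because the titles are matched literally, no escaping of special characters is needed here. If the rows are not guaranteed to be in order of appearance, sorting by starts before computing text_to would be one way to handle it.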