I have a vector of strings. Each element of the vector corresponds to a line of OCR'd text. The last word of a line may be cut by a dash and continued on the next line, i.e. in the next element of the vector.
text <- c("Lorem ipsum dolor sit am-",
"et consectetur adipis-",
"cing Quisque euismod, ex vel -aliquam- vestibulum",
"Nulla lacinia volutpat ipsum sed condimentum")
What I want: to rejoin the cut words while keeping the paragraph layout of the text.
Lorem ipsum dolor sit amet
consectetur adipiscing
Quisque euismod, ex vel -aliquam- vestibulum
Nulla lacinia volutpat ipsum sed condimentum
What I don't want:
Lorem ipsum dolor sit amet consectetur adipiscing Quisque euismod, ex vel -aliquam- vestibulum Nulla lacinia volutpat ipsum sed condimentum
What I did: I converted the vector into a data frame because I think the dplyr functions lead() and lag() might be useful here.
textdf <- as.data.frame((text))
library(dplyr)
textdf <- textdf %>%
rename( text = '(text)')
What I think should be done: if a string ends with a dash, take the first word of the next row, remove the dash, and move that word to the end of the current row.
library(stringr)
textdf <- textdf %>%
mutate(text = str_replace(text, "-$", lag("^. \\s")))
Answer:
Here is one way:
library(dplyr)
data.frame(text) %>%
  # Flag the rows whose last word is cut, i.e. that end with "-"
  mutate(cut_word = grepl('-$', text),
         # Remove the trailing "-"
         text = sub('-$', '', text),
         # If the row is cut, append the first word of the next row to it
         text = ifelse(cut_word, paste0(text, stringr::word(lead(text), 1)), text),
         # Drop the first word of a row when the previous row was cut
         text = ifelse(lag(cut_word, default = FALSE), sub('.*?\\s', '', text), text)) %>%
  select(-cut_word)
# text
#1 Lorem ipsum dolor sit amet
#2 consectetur adipiscing
#3 Quisque euismod, ex vel -aliquam- vestibulum
#4 Nulla lacinia volutpat ipsum sed condimentum
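If the data frame is only there to get lead() and lag(), the same steps can be sketched in base R directly on the character vector. This is a minimal sketch, not a tested general solution: join_hyphenated is a hypothetical helper name, and it assumes a word is split across at most two lines and that a dash at the very end of the last line is left alone because nothing follows it.

```r
# Hypothetical helper (the name is an assumption): rejoin words that were
# cut by a trailing dash and continued on the next element of the vector.
join_hyphenated <- function(lines) {
  n <- length(lines)
  if (n == 0) return(lines)

  # TRUE for lines whose last word is cut; the last line has no continuation
  cut <- grepl("-$", lines)
  cut[n] <- FALSE

  # Drop the trailing dash on cut lines only
  stripped <- ifelse(cut, sub("-$", "", lines), lines)

  # First word of the following line ("" after the last line)
  next_first <- c(sub("\\s.*$", "", stripped[-1]), "")

  # Glue the continuation onto each cut line
  joined <- ifelse(cut, paste0(stripped, next_first), stripped)

  # Remove the first word from lines that follow a cut line
  prev_cut <- c(FALSE, cut[-n])
  ifelse(prev_cut, sub("^\\S+\\s*", "", joined), joined)
}

text <- c("Lorem ipsum dolor sit am-",
          "et consectetur adipis-",
          "cing Quisque euismod, ex vel -aliquam- vestibulum",
          "Nulla lacinia volutpat ipsum sed condimentum")

result <- join_hyphenated(text)
print(result)
```

Because the pattern is anchored with "$", only a dash at the very end of a line is treated as a cut, so the dashes around -aliquam- in the middle of the third line are untouched.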