OCR: rejoin words cut at the end of lines and keep the paragraphs

Time:10-09

I have a vector of strings. Each element of the vector corresponds to one line of an OCR-processed text. The last word of a line may or may not be cut by a dash, in which case it continues at the start of the next line, that is, the next element of the vector.

text <- c("Lorem ipsum dolor sit am-",
     "et consectetur adipis-",
     "cing Quisque euismod, ex vel -aliquam- vestibulum",
     "Nulla lacinia volutpat ipsum sed condimentum")

What I want: to rejoin the cut words while keeping the line-by-line layout of the text.

Lorem ipsum dolor sit amet 
consectetur adipiscing
Quisque euismod, ex vel -aliquam- vestibulum
Nulla lacinia volutpat ipsum sed condimentum

What I don't want:

Lorem ipsum dolor sit amet consectetur adipiscing Quisque euismod, ex vel -aliquam- vestibulum Nulla lacinia volutpat ipsum sed condimentum 

What I did: I converted the vector into a data frame, one row per line, because I think the lead and lag functions in the dplyr package might be useful here.

textdf <- as.data.frame((text))


library(dplyr)
textdf  <- textdf %>%
rename( text = '(text)')

What I think should be done: if a string ends with a dash, take the first word of the next row, remove the dash, and move that word to the end of the current row.

library(stringr)
 textdf  <- textdf %>%
 mutate(text = str_replace(text, "-$", lag("^. \\s")))

Answer:

Here is one way:

library(dplyr)

data.frame(text) %>%
         # TRUE when the line ends with "-", i.e. its last word is cut
  mutate(cut_word = grepl('-$', text), 
         # Remove the trailing "-"
         text = sub('-$', '', text), 
         # If the line was cut, append the first word of the next line to it
         text = ifelse(cut_word, paste0(text, stringr::word(lead(text), 1)), text), 
         # If the previous line was cut, drop the first word (it was moved up)
         text = ifelse(lag(cut_word, default = FALSE), sub('.*?\\s', '', text), text)) %>%
  select(-cut_word)

#                                          text
#1                   Lorem ipsum dolor sit amet
#2                       consectetur adipiscing
#3 Quisque euismod, ex vel -aliquam- vestibulum
#4 Nulla lacinia volutpat ipsum sed condimentum
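
For comparison, the same idea can be sketched in base R with a plain loop over the vector, without a data frame. This is only a sketch under some assumptions: the helper name unbreak_lines is made up for illustration, a cut word is assumed to be marked only by a "-" as the very last character of a line, and a dash on the final line is left untouched because there is no following line to take a word from. If the next line consists of a single word, that line ends up as an empty string.

```r
# Hypothetical helper (not part of the answer above): rejoin words split by a
# trailing "-" while keeping one vector element per original line.
unbreak_lines <- function(lines) {
  n <- length(lines)
  for (i in seq_len(max(n - 1, 0))) {
    if (grepl("-$", lines[i])) {
      # Ignore any leading whitespace on the next line
      next_line <- trimws(lines[i + 1], which = "left")
      # First whitespace-delimited token of the next line
      first_word <- sub("\\s.*$", "", next_line)
      # Drop the dash and append the rest of the cut word
      lines[i] <- paste0(sub("-$", "", lines[i]), first_word)
      # Remove that token, and the whitespace after it, from the next line
      lines[i + 1] <- sub("^\\S+\\s*", "", next_line)
    }
  }
  lines
}

text <- c("Lorem ipsum dolor sit am-",
          "et consectetur adipis-",
          "cing Quisque euismod, ex vel -aliquam- vestibulum",
          "Nulla lacinia volutpat ipsum sed condimentum")

writeLines(unbreak_lines(text))
```

On the sample vector this should print the same four lines as the data frame version. Because the loop updates the vector in place, a line that both receives a moved word and is itself cut (the second line here) is handled in a single pass.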