I have text extracted from a PDF using pdftools::pdf_text. The PDF contains bullet-point items, for instance:
- project abstract
- project narrative
After extraction, the text looks like this:
project abstract project narrative
Now I want to pull these items out of the blob of text. I have tried something like this:
grep("\\s[a-zA-Z] \\s[a-zA-Z] ", text)
but it doesn't find them. What is the right regular expression to pull out the list items? Or what is the right way to extract them?
CodePudding user response:
You can use the str_split function from stringr to split the text at the unusual Unicode character that stands in for each bullet, then keep the text that follows each one:
# install.packages("stringr")
library(stringr)
txt <- "\uf0b7 project abstract \uf0b7 project narrative"
trimws(unlist(str_split(txt, "\uf0b7"))[-1])
# [1] "project abstract" "project narrative"
The Unicode character in your example is \uf0b7 (U+F0B7), a private-use code point, so it may not display as a visible bullet in every font or editor.
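If you are not sure which character a given PDF uses for its bullets, a rough sketch along the following lines may help. It assumes the bullets come through as private-use characters (Unicode category Co), as in the example above; other PDFs may use an ordinary bullet such as U+2022 instead, in which case the pattern would need adjusting. The sample string and the variable names are illustrative only.

```r
library(stringr)

# Assumed sample text; the bullets are written as escapes so the
# example is reproducible regardless of how the page renders them.
txt <- "\uf0b7 project abstract \uf0b7 project narrative"

# Find the distinct private-use characters and report their code points.
chars <- unique(unlist(str_extract_all(txt, "\\p{Co}")))
code_points <- sprintf("U+%04X", vapply(chars, utf8ToInt, integer(1), USE.NAMES = FALSE))
print(code_points)

# Split on any private-use character, trim whitespace, drop empty pieces.
items <- str_trim(unlist(str_split(txt, "\\p{Co}")))
items <- items[items != ""]
print(items)
```

Splitting on the whole category rather than on one hard-coded character means the same code should still work if another document uses a different private-use bullet, at the cost of also splitting on any other private-use characters that happen to be in the text.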