extract list items from text in R

Time:12-22

I have text that was extracted from a PDF using pdftools::pdf_text. The PDF contains bullet-point items, for instance:

 - project abstract
 - project narrative

After extraction, the text looks like this:

   project abstract       project narrative

Now I want to pull these items out of the blob of text. I have tried something like this:

grep("\\s[a-zA-Z] \\s[a-zA-Z] ", text)

But it doesn't find anything. What is the right regular expression to pull out the list items? Or what is the right way to extract them?

Answer:

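You can use the str_split function from stringr. In the extracted text each bullet is a private-use Unicode character, which may display as a blank or a box, so split the text on that character and keep what follows each one: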

# install.packages("stringr")
library(stringr)

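# "\uf0b7" is the bullet character from the extracted text, written as an escape so it survives copy and paste
txt <- "\uf0b7 project abstract \uf0b7 project narrative"

# split at each bullet, drop the empty piece before the first bullet, trim whitespace
trimws(unlist(str_split(txt, "\uf0b7"))[-1])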
# [1] "project abstract"  "project narrative"

The bullet character in your example is U+F0B7, written as \uf0b7 in an R string.
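
If you are not sure which character your own PDF uses for bullets, one way to find out is to list the non-ASCII characters in the extracted text together with their code points. The following is only a minimal base R sketch: the sample string is a stand-in for real pdf_text output, and it assumes the first non-ASCII character found is the bullet, which may not hold for text that also contains accented letters or other symbols.

```r
# Stand-in for text returned by pdftools::pdf_text (assumption: bullets are U+F0B7)
txt <- "\uf0b7 project abstract \uf0b7 project narrative"

# Split into single characters and look up the code point of each distinct one
chars <- unique(strsplit(txt, "")[[1]])
code_points <- vapply(chars, utf8ToInt, integer(1), USE.NAMES = FALSE)

# Keep only the non-ASCII code points and show them in U+XXXX form
non_ascii <- code_points[code_points > 127]
bullets <- sprintf("U+%04X", non_ascii)
print(bullets)

# Assumption: the first non-ASCII character is the bullet; split on it literally
bullet_char <- intToUtf8(non_ascii[1])
items <- trimws(strsplit(txt, bullet_char, fixed = TRUE)[[1]])
items <- items[nzchar(items)]  # drop the empty piece before the first bullet
print(items)
```

This version uses strsplit with fixed = TRUE instead of str_split, so it needs no extra packages. Once you know the code point, you can put it in the str_split call above.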
