I have a large folder of PDF documents. I am trying to extract the text from each document iteratively, so that the only input is the folder path. It seems one can approach this with an imap/map call and a for loop. Below is an attempt that maps a function over a vector holding all the file names in the folder.
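library(pdftools)
library(purrr)
# full.names = TRUE keeps the folder in each path so pdf_text() can find the files
files <- list.files(path = "foldername", pattern = "\\.pdf$", full.names = TRUE)
# map() returns a list with one character vector (one string per page) per file
text_list <- files %>% map(function(x) pdf_text(x))
# flatten the list into a single character vector
text_vector <- unlist(text_list)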
I welcome alternative methods to the same end of extracting the text across all files in a folder.
CodePudding user response:
You could use sapply followed by rbind to join the results together.
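library(pdftools)
pdfs <- list.files('foldername', pattern = '\\.pdf$', full.names = TRUE)
# simplify = FALSE always returns a list: one element per file, one string per page
text <- sapply(pdfs, pdf_text, simplify = FALSE)
# one row per file, one column per page; if page counts differ, shorter documents are recycled to fill the row
all_text <- do.call(rbind, text)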
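If the PDFs have different page counts, rbind has to recycle the shorter documents to fill a rectangular matrix. A minimal sketch of one alternative, assuming a long table with one row per page is acceptable. The helper name pdf_folder_to_df and its read_fun argument are hypothetical and exist only for illustration; read_fun defaults to pdftools::pdf_text.

```r
# Hypothetical helper: one row per page for every PDF in a folder.
# read_fun defaults to pdftools::pdf_text, which returns one string per page.
pdf_folder_to_df <- function(folder, read_fun = pdftools::pdf_text) {
  pdfs <- list.files(folder, pattern = "\\.pdf$", full.names = TRUE,
                     ignore.case = TRUE)
  # lapply always returns a list, whatever the page counts are
  pages <- lapply(pdfs, read_fun)
  n_pages <- lengths(pages)
  data.frame(
    file = rep(basename(pdfs), times = n_pages),  # file name repeated per page
    page = sequence(n_pages),                     # page index within each file
    text = as.character(unlist(pages)),
    stringsAsFactors = FALSE
  )
}
```

Calling pdf_folder_to_df('foldername') should then give a data frame that can be filtered or grouped by file without any recycling.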
CodePudding user response:
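Here's a more concise way of collecting your PDF text in a single vector using map_chr. Because pdf_text returns one string per page and map_chr needs exactly one string per file, the pages are pasted together first:
library(pdftools)
library(purrr)
files <- list.files(path = "foldername", pattern = "\\.pdf$", full.names = TRUE)
text_vector <- map_chr(files, ~ paste(pdf_text(.x), collapse = "\n"))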
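With a large folder, a single corrupt or password-protected file would stop the whole run with an error. A rough sketch of one way around that, assuming it is acceptable to record NA for files that cannot be read. The name safe_pdf_text and its read_fun argument are hypothetical; read_fun defaults to pdftools::pdf_text.

```r
# Hypothetical wrapper: returns the whole document as one string,
# or NA (with a warning) if the reader throws an error for that file.
safe_pdf_text <- function(path, read_fun = pdftools::pdf_text) {
  tryCatch(
    # collapse the per-page strings into a single string per file
    paste(read_fun(path), collapse = "\n"),
    error = function(e) {
      warning("Could not read ", path, ": ", conditionMessage(e), call. = FALSE)
      NA_character_
    }
  )
}
```

It can be passed straight to map_chr, as in map_chr(files, safe_pdf_text), and the NA entries then show which files need a closer look.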