How to iteratively apply a function (pdf_text()) across pdf files in a folder in R?

Time:02-11

I have a large folder of PDF documents. I am trying to extract the text from each document iteratively, so that the only input is the folder path. It seems one can approach this with imap/map or with a for loop. Below is an attempt that maps a function over a vector of all the files in the folder.

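library(pdftools)
library(purrr)

# full.names = TRUE so pdf_text() receives paths it can open
files <- list.files(path = "foldername", pattern = "\\.pdf$", full.names = TRUE)

# map() returns a list with one character vector (one string per page) per file
text_list <- files %>% map(function(x) {
    pdf_text(x)
})

# Flatten into a single character vector of pages
text_vector <- unlist(text_list)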

I welcome alternative methods to the same end of extracting the text across all files in a folder.

CodePudding user response:

You could use sapply followed by rbind to join the results together.

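library(pdftools)
# Match only names ending in .pdf; full paths let pdf_text() find the files
pdfs <- list.files('foldername', pattern = '\\.pdf$', full.names = TRUE)
# simplify = FALSE always gives a list: one character vector (one string per page) per file
text <- sapply(pdfs, pdf_text, simplify = FALSE)
# One row per file; files with fewer pages than the longest have their pages recycled to fill the row
all_text <- do.call(rbind, text)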
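
If the files have different page counts, a long layout with one row per page avoids the recycling that rbind applies to character vectors of unequal length. Below is a minimal base R sketch, assuming pdftools is installed. The function name pdf_pages_long and its extract argument are made up for illustration; extract defaults to pdftools::pdf_text and exists only so the reader function can be swapped out.

```r
# Hypothetical helper: one row per page, for every PDF in `dir`.
# `extract` is an assumed hook; by default it is pdftools::pdf_text,
# which returns a character vector with one string per page.
pdf_pages_long <- function(dir, extract = pdftools::pdf_text) {
  paths <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE)
  per_file <- lapply(paths, function(p) {
    pages <- extract(p)
    data.frame(
      file = rep(basename(p), length(pages)),
      page = seq_along(pages),
      text = pages,
      stringsAsFactors = FALSE
    )
  })
  # rbind on data frames stacks rows, so uneven page counts are fine
  do.call(rbind, per_file)
}
```

Calling pdf_pages_long('foldername') would then return a data frame with the columns file, page, and text.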

CodePudding user response:

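Here's a more concise way of collecting your PDF text into a single vector using map_chr. Because map_chr needs exactly one string per file, the pages of each file are pasted together:

library(pdftools)
library(purrr)

files <- list.files(path = "foldername", pattern = "\\.pdf$", full.names = TRUE)
# pdf_text() returns one string per page, so collapse the pages into one string per file
text_vector <- map_chr(files, ~ paste(pdf_text(.x), collapse = "\n"))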
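
Since the question also mentions a for loop, here is a base R sketch of that route. It preallocates the result rather than growing it with append(), and wraps each read in tryCatch so that one unreadable file does not stop a long run. The names read_pdf_folder and extract are hypothetical; extract defaults to pdftools::pdf_text, which is assumed to be installed.

```r
# Hypothetical helper: one string per file, built with a plain for loop.
# `extract` is an assumed hook that defaults to pdftools::pdf_text.
read_pdf_folder <- function(dir, extract = pdftools::pdf_text) {
  paths <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE)
  # Preallocate instead of growing the vector with append()
  out <- character(length(paths))
  names(out) <- basename(paths)
  for (i in seq_along(paths)) {
    out[i] <- tryCatch(
      # Join the pages of one file into a single string
      paste(extract(paths[i]), collapse = "\n"),
      # If extract signals an error, record NA for that file and keep going
      error = function(e) NA_character_
    )
  }
  out
}
```

The result is a named character vector, so read_pdf_folder('foldername') can be indexed by file name, and is.na() on it flags the files that could not be read.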