R - Merge two elements of a list in an iterative pdf task

Time:02-20

I need help with a PDF mining task in R.

I want to mine 1061 multi-page PDF files, whose names are stored in pdf_filenames, and merge the content of the first two pages of each file into a single element.

So far, I have managed to read the content of all the PDF files using the map function from the purrr package and the pdf_text function from the pdftools package.

> pdfs = pdf_filenames %>% 
        map(pdf_text)

This returns a list in which each element is a character vector holding the pages of one PDF file. The structure of the pdfs list is:

> str(pdfs)
List of 1061
 $ : chr [1:3] "Content page 1_pdf1" "Content page 2_pdf1" "Content page 3_pdf1"
 $ : chr [1:4] "Content page 1_pdf2" "Content page 2_pdf2" "Content page 3_pdf2" "Content page 4_pdf2"
 $ : chr [1:2] "Content page 1_pdf3" "Content page 2_pdf3"
 .
 .
 .

My desired output is:

List of 1061
 $ : chr [1:2] "Content page 1_pdf1 Content page 2_pdf1" "Content page 3_pdf1"
 $ : chr [1:3] "Content page 1_pdf2 Content page 2_pdf2" "Content page 3_pdf2" "Content page 4_pdf2"
 $ : chr [1:1] "Content page 1_pdf3 Content page 2_pdf3"
 .
 .
 .

I tried adding a second map call:

> pdfs = pdf_filenames %>% 
        map(pdf_text) %>%
        map(c(1,2))

but that returned a list in which every element is NULL:

> pdfs
[[1]]
NULL

[[2]]
NULL

[[3]]
NULL
.
.
.

Appreciate your help very much! Thanks!

CodePudding user response:

We can use a lambda expression (~) to apply pdf_text to each file individually, then combine the first two pages with paste/str_c and keep the remaining pages unchanged (based on the expected output):

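library(dplyr)     # provides the %>% pipe
library(pdftools)  # pdf_text()
library(purrr)     # map()
library(stringr)   # str_c()
pdf_filenames %>% 
        map( ~ {
           # one character string per page of the current file
           x1 <- pdf_text(.x)
           # merge pages 1 and 2 into one string, then append any remaining pages
           c(str_c(head(x1, 2), collapse = " "), tail(x1, -2) )
        })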
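
To check the merging step without any PDF files at hand, here is a minimal base-R sketch run on mock data shaped like the str(pdfs) output in the question. The helper name merge_first_two and its sep argument are hypothetical, introduced only for illustration; paste and lapply stand in for str_c and map.

```r
# Hypothetical helper: collapse the first two pages of one document into a
# single string and keep any remaining pages unchanged.
merge_first_two <- function(pages, sep = " ") {
  c(paste(head(pages, 2), collapse = sep), tail(pages, -2))
}

# Mock data standing in for the result of map(pdf_filenames, pdf_text)
pdfs <- list(
  c("Content page 1_pdf1", "Content page 2_pdf1", "Content page 3_pdf1"),
  c("Content page 1_pdf2", "Content page 2_pdf2", "Content page 3_pdf2",
    "Content page 4_pdf2"),
  c("Content page 1_pdf3", "Content page 2_pdf3")
)

# Apply the helper to every document; each result is one page shorter
merged <- lapply(pdfs, merge_first_two)
str(merged)
```

As for the original attempt: when map is given a numeric vector such as c(1, 2) instead of a function, purrr treats it as a nested extractor (take element 1, then element 2 of that). A single page string has no second element, which would explain the NULL results.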