Split a PDF file into multiple files of 2 pages each in R

Time:05-19

I have a PDF document with 300 pages. I need to split this file into 150 files of 2 pages each. For example, the 1st document would contain pages 1 & 2 of the original file, the 2nd document pages 3 & 4, and so on.

Maybe I can use the "pdftools" package, but I don't know how.

CodePudding user response:

Neither pdftools nor qpdf (on which pdftools depends) offers a single call that splits a PDF into anything other than one file per page. You will likely need to rely on an external program; I'm confident you can get pdftk to do that by calling it once for each 2-page output.

I have a 36-page PDF here named quux.pdf in the current working directory.

str(pdftools::pdf_info("quux.pdf"))
# List of 11
#  $ version    : chr "1.5"
#  $ pages      : int 36
#  $ encrypted  : logi FALSE
#  $ linearized : logi FALSE
#  $ keys       :List of 8
#   ..$ Producer       : chr "pdfTeX-1.40.24"
#   ..$ Author         : chr ""
#   ..$ Title          : chr ""
#   ..$ Subject        : chr ""
#   ..$ Creator        : chr "LaTeX via pandoc"
#   ..$ Keywords       : chr ""
#   ..$ Trapped        : chr ""
#   ..$ PTEX.Fullbanner: chr "This is pdfTeX, Version 3.141592653-2.6-1.40.24 (TeX Live 2022) kpathsea version 6.3.4"
#  $ created    : POSIXct[1:1], format: "2022-05-17 22:54:40"
#  $ modified   : POSIXct[1:1], format: "2022-05-17 22:54:40"
#  $ metadata   : chr ""
#  $ locked     : logi FALSE
#  $ attachments: logi FALSE
#  $ layout     : chr "no_layout"

I also have pdftk installed and available on the PATH:

Sys.which("pdftk")
#                                        pdftk 
# "C:\\PROGRA~2\\PDFtk Server\\bin\\pdftk.exe" 

With this, I can run an external script to create 2-page PDFs:

list.files(pattern = "pdf$")
# [1] "quux.pdf"

pages <- seq(pdftools::pdf_info("quux.pdf")$pages)
pages <- split(pages, (pages - 1) %/% 2)
pages[1:3]
# $`0`
# [1] 1 2
# $`1`
# [1] 3 4
# $`2`
# [1] 5 6

for (pg in pages) {
  system(sprintf("pdftk quux.pdf cat %s-%s output out_%02i-%02i.pdf",
         min(pg), max(pg), min(pg), max(pg)))
}

list.files(pattern = "pdf$")
#  [1] "out_01-02.pdf" "out_03-04.pdf" "out_05-06.pdf" "out_07-08.pdf"
#  [5] "out_09-10.pdf" "out_11-12.pdf" "out_13-14.pdf" "out_15-16.pdf"
#  [9] "out_17-18.pdf" "out_19-20.pdf" "out_21-22.pdf" "out_23-24.pdf"
# [13] "out_25-26.pdf" "out_27-28.pdf" "out_29-30.pdf" "out_31-32.pdf"
# [17] "out_33-34.pdf" "out_35-36.pdf" "quux.pdf"     

str(pdftools::pdf_info("out_01-02.pdf"))
# List of 11
#  $ version    : chr "1.5"
#  $ pages      : int 2
#  $ encrypted  : logi FALSE
#  $ linearized : logi FALSE
#  $ keys       :List of 2
#   ..$ Creator : chr "pdftk 2.02 - www.pdftk.com"
#   ..$ Producer: chr "itext-paulo-155 (itextpdf.sf.net-lowagie.com)"
#  $ created    : POSIXct[1:1], format: "2022-05-18 09:37:56"
#  $ modified   : POSIXct[1:1], format: "2022-05-18 09:37:56"
#  $ metadata   : chr ""
#  $ locked     : logi FALSE
#  $ attachments: logi FALSE
#  $ layout     : chr "no_layout"
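The loop above pastes the whole command into one string, which can break if a file name contains spaces and does not notice when pdftk fails. A minimal sketch of one way to harden it, assuming pdftk is on the PATH: system2() takes the arguments as a vector, shQuote() quotes the file names, and the exit status is checked. The names build_pdftk_args and split_with_pdftk are hypothetical helpers made up for this illustration, not part of any package.

```r
# build_pdftk_args() is a hypothetical helper (not from any package).
# It returns the argument vector for:
#   pdftk <infile> cat <first>-<last> output <outfile>
build_pdftk_args <- function(infile, first, last, outfile) {
  c(shQuote(infile), "cat", sprintf("%d-%d", first, last),
    "output", shQuote(outfile))
}

# split_with_pdftk() is also hypothetical; it assumes pdftk is on the PATH.
split_with_pdftk <- function(infile, n_pages, size = 2L, prefix = "out_") {
  starts <- seq(1L, n_pages, by = size)
  ends   <- pmin(starts + size - 1L, n_pages)  # last chunk may be shorter
  width  <- nchar(n_pages)                     # zero-pad to a common width
  for (i in seq_along(starts)) {
    outfile <- paste0(prefix,
                      formatC(starts[i], width = width, flag = "0"), "-",
                      formatC(ends[i],   width = width, flag = "0"), ".pdf")
    status <- system2("pdftk",
                      build_pdftk_args(infile, starts[i], ends[i], outfile))
    if (status != 0) stop("pdftk failed for pages ", starts[i], "-", ends[i])
  }
  invisible(length(starts))
}

# Example call, using the page count reported by pdftools:
# split_with_pdftk("quux.pdf", pdftools::pdf_info("quux.pdf")$pages)
```

Stopping on the first non-zero exit status is a design choice; collecting the failures and reporting them at the end would work just as well.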

CodePudding user response:

1) pdftools. Change the inputs below. The script then gets the number of pages, num, computes the vectors st and en of start and end page numbers, and calls pdf_subset once per st/en pair. Note that pdf_length and pdf_subset come from the qpdf R package; pdftools imports and re-exports them, so loading pdftools is enough.

library(pdftools)

# inputs
setwd("~/../Downloads")  # where input pdf and output pdfs located
infile <- "a.pdf"  # input pdf
prefix <- "out_"  # output pdf's will begin with this prefix

num <- pdf_length(infile)
st <- seq(1, num, 2)
en <- pmin(st + 1, num)

for (i in seq_along(st)) {
  outfile <- sprintf("%s%0*d.pdf", prefix, nchar(num), i)
  pdf_subset(infile, pages = st[i]:en[i], output = outfile)
}
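To see what the pmin() call is for, here is a small base R sketch of the same start/end computation with an odd page count. The value num <- 7 is an assumption chosen for illustration; no PDF is read.

```r
num <- 7                    # assumed page count, for illustration only
st  <- seq(1, num, 2)       # start pages: 1 3 5 7
en  <- pmin(st + 1, num)    # end pages, clamped so the last one is not 8

# One row per output file: name, first page, last page
plan <- data.frame(
  file  = sprintf("out_%0*d.pdf", nchar(num), seq_along(st)),
  first = st,
  last  = en
)
print(plan)
```

With an odd page count the last file holds a single page, because its end page is clamped to num.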

2) animation/pdftk. Another option is to install the pdftk program, change the inputs at the top of the script below, and run it. The script gets the number of pages in the input, num, from pdftk's dump_data output, computes the start and end page numbers, st and en, and then invokes pdftk once per st/en pair to extract those pages into a separate file.

library(animation)

# inputs
setwd("~/../Downloads")  # where input pdf is located
PDFTK <- "~/../bin/pdftk.exe"  # path to pdftk
infile <- "a.pdf"  # input pdf
prefix <- "out_"  # output pdf's will begin with this prefix

ani.options(pdftk = Sys.glob(PDFTK))

tmp <- tempfile()
dump_data <- pdftk(infile, "dump_data", tmp)
g <- grep("NumberOfPages", readLines(tmp), value = TRUE)
num <- as.numeric(sub(".* ", "", g))

st <- seq(1, num, 2)
en <- pmin(st + 1, num)

for (i in seq_along(st)) {
  outfile <- sprintf("%s%0*d.pdf", prefix, nchar(num), i)
  pdftk(infile, sprintf("cat %d-%d", st[i], en[i]), outfile)
}
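The page-count step in the script above can be tried without pdftk installed. The sketch below runs the same grep/sub parsing on a few hand-written lines in the style of pdftk's dump_data report; the lines and their values are made up for illustration.

```r
# Made-up lines in the style of `pdftk <file> dump_data` output
dump <- c(
  "InfoBegin",
  "InfoKey: Creator",
  "InfoValue: LaTeX via pandoc",
  "NumberOfPages: 36",
  "PageMediaBegin"
)

g   <- grep("^NumberOfPages", dump, value = TRUE)  # keep the page-count line
num <- as.integer(sub(".* ", "", g))               # drop everything up to the last space

# Fail early if the line was missing or not numeric
stopifnot(length(num) == 1, !is.na(num))
num
```

Anchoring the pattern with ^ and checking the result are small additions to the original parsing, so that a missing or malformed line stops the script instead of producing an empty page range.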