This is more theoretical as I was not able to construct a reproducible example. But after many hours I need your help.
I have a folder with 3 PDF files.
This code does what I want: it prints the tables from all 3 files to the screen, one after another.
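library(pdftools)
library(here)
library(magrittr)  # provides %>%
# "\\.pdf$" matches only names ending in .pdf (in ".pdf" the dot matches any character)
pdf_files <- list.files(here("pdf_xxx"), pattern = "\\.pdf$")
for (i in seq_along(pdf_files)) {
  PDF <- pdf_text(file.path(here("pdf_xxx"), pdf_files[i])) %>%
    readr::read_lines()
  print(PDF)
}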
My question is: how can I store this output in a data frame?
I tried this code:
# create empty dataframe results
results <- data.frame(text = character(length(pdf_files)), stringsAsFactors = FALSE)
for (i in 1:length(pdf_files)) {
PDF <- pdf_text(paste(here("pdf_xxx"), pdf_files[i], sep="/")) %>%
readr::read_lines()
results$text[i] <- PDF[i]
}
results
Here I only get 3 rows. Is this because length(pdf_files) is 3?
How can I store the printed output, which looks like the listing below, in a data frame?
Update: print(PDF) in the for loop (here with 2 PDF files) gives the following, and this is what I want to save to an object:
for (i in 1:length(pdf_files)) {
PDF <- pdf_text(paste(here("pdf_dienstplan"), pdf_files[i], sep="/")) %>%
readr::read_lines()
print(PDF)
}
[1] " blabla
[2] " blabla
[3] ""
[4] " Datum
[5] "Sa 01.10.2022
[6] ""
[7] "So 02.10.2022
[8] ""
[9] "Mo 03.10.2022
[10] ""
[11] "Di 04.10.2022
[12] ""
[13] "Mi 05.10.2022
[14] ""
[15] "Do 06.10.2022
[16] ""
[17] "Fr 07.10.2022
[18] ""
[19] "Sa 08.10.2022
[20] ""
[21] "So 09.10.2022
[22] ""
[23] "Mo 10.10.2022
[24] ""
[25] "Di 11.10.2022
[26] ""
[27] "Mi 12.10.2022
[28] ""
[29] "Do 13.10.2022
[30] ""
[31] "Fr 14.10.2022
[32] ""
[33] "Sa 15.10.2022
[34] ""
[35] "So 16.10.2022
[36] ""
[37] "Mo 17.10.2022
[38] ""
[39] "Di 18.10.2022
[40] ""
[41] "Mi 19.10.2022
[42] ""
[43] "Do 20.10.2022
[44] ""
[45] "Fr 21.10.2022
[46] ""
[47] "Sa 22.10.2022
[48] ""
[49] "So 23.10.2022
[50] ""
[51] "Mo 24.10.2022
[52] ""
[53] "Di 25.10.2022
[54] ""
[55] "Mi 26.10.2022
[56] ""
[57] "Do 27.10.2022
[58] ""
[59] "Fr 28.10.2022
[60] ""
[61] "Sa 29.10.2022
[62] ""
[63] "So 30.10.2022
[64] ""
[65] "Mo 31.10.2022
[1] " blabla
[2] " blabla
[3] ""
[4] " Datum
[5] "Do 01.09.2022
[6] ""
[7] "Fr 02.09.2022
[8] ""
[9] "Sa 03.09.2022
[10] ""
[11] "So 04.09.2022
[12] ""
[13] "Mo 05.09.2022
[14] ""
[15] "Di 06.09.2022
[16] ""
[17] "Mi 07.09.2022
[18] ""
[19] "Do 08.09.2022
[20] ""
[21] "Fr 09.09.2022
[22] ""
[23] "Sa 10.09.2022
[24] ""
[25] "So 11.09.2022
[26] ""
[27] "Mo 12.09.2022
[28] ""
[29] "Di 13.09.2022
[30] ""
[31] "Mi 14.09.2022
[32] ""
[33] "Do 15.09.2022
[34] ""
[35] "Fr 16.09.2022
[36] ""
[37] "Sa 17.09.2022
[38] ""
[39] "So 18.09.2022
[40] ""
[41] "Mo 19.09.2022
[42] ""
[43] "Di 20.09.2022
[44] ""
[45] "Mi 21.09.2022
[46] ""
[47] "Do 22.09.2022
[48] ""
[49] "Fr 23.09.2022
[50] ""
[51] "Sa 24.09.2022
[52] ""
[53] "So 25.09.2022
[54] ""
[55] "Mo 26.09.2022
[56] ""
[57] "Di 27.09.2022
[58] ""
[59] "Mi 28.09.2022
[60] ""
[61] "Do 29.09.2022
[62] ""
[63] "Fr 30.09.2022
CodePudding user response:
That is because you're indexing PDF[i], so from each file you only keep the i-th line. And yes, results has only 3 rows because there are only 3 PDF files.
The below should work.
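res_list <- list()
for (i in seq_along(pdf_files)) {
  PDF <- pdf_text(file.path(here("pdf_xxx"), pdf_files[i])) %>%
    readr::read_lines()
  res_list[[i]] <- PDF  # keep every line of file i, not just line i
}
# as.data.frame(res_list) fails when the files have different numbers of lines,
# so stack all lines into a single text column instead
result <- data.frame(text = unlist(res_list), stringsAsFactors = FALSE)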
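Since the files can have different numbers of lines (65 and 63 in the output shown in the question), a long format with one row per line and a column naming the source file may be easier to work with. Here is a minimal sketch in base R; the file names and the short character vectors are made-up stand-ins for what the loop would collect, not real PDF content.

```r
# Hypothetical stand-ins for the lines read from two PDFs of different length
res_list <- list(
  "october.pdf"   = c(" Datum", "Sa 01.10.2022", "", "So 02.10.2022"),
  "september.pdf" = c(" Datum", "Do 01.09.2022", "")
)

# Build one small data frame per file; the file name is recycled for each line
per_file <- Map(
  function(lines, name) {
    data.frame(file = name, text = lines, stringsAsFactors = FALSE)
  },
  res_list, names(res_list)
)

# Stack the per-file data frames row-wise and reset the row names
results <- do.call(rbind, per_file)
rownames(results) <- NULL

# Optionally drop the empty separator lines
results_nonempty <- results[results$text != "", ]
```

With real data, res_list would be filled inside the loop and named with names(res_list) <- pdf_files before stacking.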
CodePudding user response:
First of all, thanks to all who helped me. I have changed my strategy:
With pdf_combine() from the qpdf package, I combine all PDFs in my folder (here 3) into one combined PDF. Then I use pdf_text() from the pdftools package together with read_lines().
Here is the code:
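library(qpdf)
library(pdftools)
library(tidyverse)
library(here)  # needed for here()
# "\\.pdf$" matches only names ending in .pdf
pdf_files <- list.files(here("pdf_xxx"), pattern = "\\.pdf$", full.names = TRUE)
# relative path; assumes the working directory is the project root that here() resolves
my_path <- "pdf_xxx/combined_xxx.pdf"
pdf_combine(input = pdf_files,
            output = my_path)
PDF <- pdf_text(my_path) %>%
  readr::read_lines()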
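To get from the character vector PDF to a data frame, one option is to keep only the lines that start with a weekday abbreviation and a date, then split those into columns. This is only a sketch: the PDF vector below is a small made-up sample in the shape of the printed output, and it assumes every data line starts with a two-letter weekday, a space, and a dd.mm.yyyy date. Anything after the date is ignored here.

```r
# Made-up sample in the shape of the printed output (not real PDF content)
PDF <- c(" blabla", "", " Datum", "Sa 01.10.2022", "", "So 02.10.2022")

# Keep lines that begin with a two-letter weekday followed by dd.mm.yyyy
is_day <- grepl("^[[:alpha:]]{2} [0-9]{2}\\.[0-9]{2}\\.[0-9]{4}", PDF)
day_lines <- PDF[is_day]

# Characters 1-2 hold the weekday, characters 4-13 hold the date
results <- data.frame(
  weekday = substr(day_lines, 1, 2),
  date    = as.Date(substr(day_lines, 4, 13), format = "%d.%m.%Y"),
  stringsAsFactors = FALSE
)
```

Header lines such as " Datum" and the empty separator lines are dropped by the filter, so results has one row per day.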