I am unexperienced in handling files in R! So please be gentile.
I have a pdf that looks like this:
I would like to extract the data in the red rectangle only from this text and save it to a dataframe (I have thousands of this kind of pdf).
So far I managed to read in the data and get this ->
My code:
library(tidyverse)
library(pdftools)
library(here)
PDF_x <- pdf_text(here("pdf_project/example_for_pdf.pdf")) %>%
str_split("\n")
Which gives:
[[1]]
[1] " BlaBla heaeder"
[2] " Mr. Bombastic XXXXXXXXXXXXX"
[3] " Text1"
[4] " Text2"
[5] " Text3,"
[6] " Text4"
[7] " Text5"
[8] " Text6"
[9] " Text7"
[10] " Text8"
[11] " Blabla, 12.01.2021"
[12] " bobo /"
[13] " blabla: 111111111"
[14] " Micheal Jackson, justo duo dolores et ea rebu"
[15] " accusam: justo duo dolores et ea rebu"
[16] " dolores: Bla Bla Bla"
[17] " BLABLA_1"
[18] " X-Date: 17.07.2021"
[19] " 1. Master1 Tim"
[20] " 1. Master2 Jack"
[21] " 1. Master3 Monika"
[22] " 1. Master4 Jill"
[23] " Header1"
[24] " Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore"
[25] " magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd"
[26] " gubergren, no sea takimata"
[27] " Header2"
[28] " Lorem ipsum dolor sit amet, consetetur sadipscing elitr."
[29] " Header3"
[30] " Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore"
[31] " magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum."
[32] " Header4"
[33] " Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna"
[34] " aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea"
[35] " takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy"
[36] " eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo"
[37] " dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."
[38] "ipsum dolor sit a sed diam nonumy eirmod tempor invidunt ut labore et dolore magna Master of Disaster Tim"
[39] "ipsum dolor sit a invidunt ut labore et dolore magna Chief master"
[40] " ipsum dolor sit a invidunt ut labore et dolore magnainvidunt ut labore et dolore magna 2s"
[41] ""
[[2]]
[1] " blablablablablablab" " invidunt ut labore et dolore magna"
[3] "invidunt ut labore et dolore magna..at" ""
I really appreciate any guiding help!
CodePudding user response:
As str_split/strsplit
returns a list
, extract the first list
element ([[1]]
), find the position index where the line starts with (^
) 'X-Date:' after removing the leading/lagging spaces (trimws
) as well the position of 'Header4' (and subtract 1 to get the previous line position), get the sequence (:
) to subset the vector elements
v1 <- trimws(PDF_x[[1]])
v1[grep("^X-Date:", v1):(grep("Header4", v1)-1)]