Home > database >  How to extract text from specific part of pdf
How to extract text from specific part of pdf

Time:06-20

I am unexperienced in handling files in R! So please be gentile.

I have a pdf that looks like this:

enter image description here

I would like to extract the data in the red rectangle only from this text and save it to a dataframe (I have thousands of this kind of pdf).

So far I managed to read in the data and get this ->

My code:

library(tidyverse)
library(pdftools)
library(here)

PDF_x <- pdf_text(here("pdf_project/example_for_pdf.pdf")) %>% 
  str_split("\n")

Which gives:

[[1]]
 [1] "                                              BlaBla heaeder"                                                                                               
 [2] "                                           Mr. Bombastic XXXXXXXXXXXXX"                                                                                     
 [3] "                                                                                                                 Text1"                                     
 [4] "                                                                                                                 Text2"                                     
 [5] "                                                                                                                 Text3,"                                    
 [6] "                                                                                                                 Text4"                                     
 [7] "                                                                                                                 Text5"                                     
 [8] "                                                                                                                 Text6"                                     
 [9] "                                                                                                                 Text7"                                     
[10] "                                                                                                                                                 Text8"     
[11] "                                                                                                                                       Blabla, 12.01.2021"  
[12] "                                                                                                                                                     bobo /"
[13] "                                                                                                                                        blabla: 111111111"  
[14] "       Micheal Jackson, justo duo dolores et ea rebu"                                                                                                       
[15] "       accusam:           justo duo dolores et ea rebu"                                                                                                     
[16] "       dolores:           Bla Bla Bla"                                                                                                                      
[17] "                                                                              BLABLA_1"                                                                     
[18] "     X-Date: 17.07.2021"                                                                                                                                    
[19] "      1. Master1                        Tim"                                                                                                                
[20] "      1. Master2                        Jack"                                                                                                               
[21] "      1. Master3                        Monika"                                                                                                             
[22] "      1. Master4                        Jill"                                                                                                               
[23] "     Header1"                                                                                                                                               
[24] "     Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore"                                   
[25] "      magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd"                                     
[26] "      gubergren, no sea takimata"                                                                                                                           
[27] "     Header2"                                                                                                                                               
[28] "      Lorem ipsum dolor sit amet, consetetur sadipscing elitr."                                                                                             
[29] "     Header3"                                                                                                                                               
[30] "     Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore"                                   
[31] "      magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum."                                                     
[32] "     Header4"                                                                                                                                               
[33] "      Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna"                            
[34] "      aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea"                         
[35] "      takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy"                            
[36] "      eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo"                               
[37] "      dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."                                              
[38] "ipsum dolor sit a                            sed diam nonumy eirmod tempor invidunt ut labore et dolore magna         Master of Disaster Tim"               
[39] "ipsum dolor sit a                                                             invidunt ut labore et dolore magna                Chief master"               
[40] "            ipsum dolor sit a            invidunt ut labore et dolore magnainvidunt ut labore et dolore magna 2s"                                           
[41] ""                                                                                                                                                           

[[2]]
[1] "                  blablablablablablab"  "   invidunt ut labore et dolore magna" 
[3] "invidunt ut labore et dolore magna..at" ""  

I really appreciate any guiding help!

CodePudding user response:

As str_split/strsplit returns a list, extract the first list element ([[1]]), find the position index where the line starts with (^) 'X-Date:' after removing the leading/lagging spaces (trimws) as well the position of 'Header4' (and subtract 1 to get the previous line position), get the sequence (:) to subset the vector elements

v1 <- trimws(PDF_x[[1]])
v1[grep("^X-Date:", v1):(grep("Header4", v1)-1)]
  • Related