I'm trying to use pdftools package to extract data table from a pdf. My source file is here: https://hypo.org/app/uploads/sites/2/2021/11/HYPOSTAT-2021_vdef.pdf. Say, I want to extract data from Table 20 on page 170 (Change in Nominal house price)
I use the following code:
install.packages("pdftools")
library(pdftools)
report <- pdftools::pdf_data("https://hypo.org/app/uploads/sites/2/2021/11/HYPOSTAT-2021_vdef.pdf")
tab20 <- as.data.frame(report[170])
To get the proper table I had to manually indicate that I want to extract 170th element of the list (as the table is on page 170). If next year, a new page with table is added to the report, I will have to modify the code to extract 171th element. Is there a way to do it in a more automated manner?
Basically, what I need to do is to find the element of the list that contains string "Change in Nominal house price". Any suggestion how to do it?
CodePudding user response:
You can find a string with a corresponding pattern. By using multiple filters you can gather this singular table.
table <- report[grepl('Change', report) & grepl('Nominal', report) &
grepl('house', report)]
I guess a more subtil regex could work. Also this only work because no other table have the same title, but it could be better to check if it only return a value like below :
place <- grepl('Change', report) &
grepl('Nominal', report) &
grepl('house', report)
if(sum(place) != 1){
stop("There is not only one pattern that match. Adjust pattern.")
} else {
table <- report[place]
}
CodePudding user response:
Another solution, based on purrr::map_lgl
:
library(tidyverse)
library(pdftools)
report <- pdftools::pdf_data("https://hypo.org/app/uploads/sites/2/2021/11/HYPOSTAT-2021_vdef.pdf")
map_lgl(
report,
~ str_detect(
str_c(.x$text, collapse = " "),
"Change in Nominal house price")) %>% report[.]
#> [[1]]
#> # A tibble: 606 × 6
#> width height x y space text
#> <int> <int> <int> <int> <lgl> <chr>
#> 1 59 14 39 38 TRUE STATISTICAL
#> 2 35 14 102 38 FALSE TABLES
#> 3 25 26 33 81 TRUE 20.
#> 4 60 26 65 81 TRUE Change
#> 5 15 26 129 81 TRUE in
#> 6 67 26 149 81 TRUE Nominal
#> 7 47 26 221 81 TRUE house
#> 8 41 26 272 81 FALSE price
#> 9 30 14 65 103 TRUE Annual
#> 10 7 14 98 103 TRUE %
#> # … with 596 more rows