I am conducting a structural equation model on several PDFs (>1000) in R.
However, some PDFs are readable and others are scanned, i.e. I need to run them through an OCR function.
Therefore, I need a way to automatically identify which PDFs contain text and which don't. Specifically, I want a way to determine whether a given PDF should be run through OCR.
Does anyone know of any functions or packages in R that might help with this? I can find a couple of solutions for Python, but haven't been able to find any in R.
CodePudding user response:
You could use an approach like this (as @danlooo already suggested but I wanted to spell it out):
files <- list.files("/home/johannes/pdfs/",
                    pattern = "\\.pdf$",
                    full.names = TRUE)

pdfs_l <- lapply(files, function(f) {
  out <- pdftools::pdf_text(f)
  # pdf_text() returns one string per page, so sum the characters across pages.
  # I set the test to an arbitrary number of characters; it works for me but you
  # may want to fine-tune it a bit.
  contains_text <- sum(nchar(out)) > 15
  if (!contains_text) {
    out <- pdftools::pdf_ocr_text(f)
  }
  data.frame(text = out, ocr = !contains_text)
})
pdfs_l |>
  dplyr::bind_rows() |>
  dplyr::mutate(text = trimws(text)) |>
  tibble::as_tibble()
#> # A tibble: 22 × 2
#> text ocr
#> <chr> <lgl>
#> 1 "TEAM MEMBERS:\n … FALSE
#> 2 "WS 21/22 … FALSE
#> 3 "WS 21/22 … FALSE
#> 4 "TEAM MEMBERS:\n … FALSE
#> 5 "TEAM MEMBERS:\n … FALSE
#> 6 "Key Concepts in Political Communication\n @Agenda Setting, Priming… FALSE
#> 7 "Key Concepts in Political Communication\n @Agenda Setting, Priming… FALSE
#> 8 "ELECTIONS AND CAMPAIGNS\n … FALSE
#> 9 "" TRUE
#> 10 "" TRUE
#> # … with 12 more rows
Created on 2022-02-10 by the reprex package (v2.0.1)
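If you only want the yes/no classification up front (which files to queue for OCR) without extracting or OCRing anything yet, a minimal sketch along these lines should work. Note that `needs_ocr()` and the `min_chars` threshold are my own names and assumptions, not part of pdftools:

```r
# Sketch: flag PDFs whose embedded text layer is (nearly) empty,
# i.e. likely scans. The character threshold is an assumption to tune.
needs_ocr <- function(path, min_chars = 15) {
  pages <- pdftools::pdf_text(path)  # one string per page
  sum(nchar(pages)) <= min_chars     # TRUE -> probably needs OCR
}

files <- list.files("/home/johannes/pdfs/",
                    pattern = "\\.pdf$",
                    full.names = TRUE)
to_ocr <- files[vapply(files, needs_ocr, logical(1))]
```

This keeps the expensive `pdf_ocr_text()` step out of the loop entirely, so you can inspect `to_ocr` first and batch the OCR separately.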