Converting PDF to text with pdftools in R returning empty string

Time:04-03

In the following example, the result is empty for every page in the PDF.

library(pdftools)

rm(list = ls())
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

url = "https://reporting.standardbank.com/wp-content/uploads/2022/02/SBS72-Pricing-Supplement.pdf"
destfile = file.path(getwd(), basename(url))
download.file(url, destfile, mode = "wb")

# Read the file that was just downloaded; list.files() could match several PDFs
pdf_text(destfile)

I am not sure whether the problem lies with the PDF itself, for example the way it was scanned and saved, which might prevent text extraction. Is there a workaround for PDF files like this, or a better package/library that I should consider?

Answer:

I would guess that the issue is that it's a scanned document, so you will probably need an OCR tool to extract the text and information from it. One option would be the tesseract package:

library(tesseract)

url = "https://reporting.standardbank.com/wp-content/uploads/2022/02/SBS72-Pricing-Supplement.pdf"
eng <- tesseract("eng")
text <- tesseract::ocr(url, engine = eng)
#> Converting page 1 to file16a069b77ed2SBS72-Pricing-Supplement_1.png... done!
#> Converting page 2 to file16a069b77ed2SBS72-Pricing-Supplement_2.png... done!
#> Converting page 3 to file16a069b77ed2SBS72-Pricing-Supplement_3.png... done!
#> Converting page 4 to file16a069b77ed2SBS72-Pricing-Supplement_4.png... done!
#> Converting page 5 to file16a069b77ed2SBS72-Pricing-Supplement_5.png... done!
#> Converting page 6 to file16a069b77ed2SBS72-Pricing-Supplement_6.png... done!
#> Converting page 7 to file16a069b77ed2SBS72-Pricing-Supplement_7.png... done!
#> Converting page 8 to file16a069b77ed2SBS72-Pricing-Supplement_8.png... done!

text[[1]]
#> [1] "APPLICABLE PRICING SUPPLEMENT DATED 28 JANUARY 2022\nThe Standard Bank of South Africa Limited\n(dncorporated with limited liability under Registration Number 1962/000738/06\nin the Republic of South Africa)\nIssue of ZAR404,000,000 Senior Unsecured Floating Rate Notes due 02 February 2029\nUnder its ZAR110,000,000,000 Domestic Medium Term Note Programme\nThis document constitutes the Applicable Pricing Supplement relating to the issue of Notes described herein.\nTerms used herein shall be deemed to be defined as such for the purposes of the terms and conditions (the\n“Terms and Conditions\") set forth in the Programme Memorandum dated 24 December 2020 (the \"Programme\nMemorandum\"), as updated and amended from time to time. This Pricing Supplement must be read in\nconjunction with such Programme Memorandum. To the extent that there is any conflict or inconsistency between\nthe contents of this Pricing Supplement and the Programme Memorandum, the provisions of this Pricing\nSupplement shall prevail.\nDESCRIPTION OF THE NOTES\nl. Issuer The Standard Bank of South Africa\nLimited\n2. Debt Officer Amo Daehnke, Group Chief\nFinancial and Value Management\nOfficer of Standard Bank Group\nLimited\n3. Status of the Notes Senior Unsecured\n4. (a) Series Number 72\n(b) Tranche Number ]\n5. Aggregate Nominal Amount ZAR404,000,000\n6. Redemption/Payment Basis N/A\n7. Type of Notes Floating Rate Notes\n8. Interest Payment Basis Floating Rate\n9. Form of Notes Registered Notes\n10. Automatic/Optional Conversion from one Interest/Payment N/A\nBasis to another\nll. Issue Date 2 February 2022\n12. Business Centre Johannesburg\n13. Additional Business Centre N/A\n14. Specified Denomination ZAR]1,000,000\n15. Calculation Amount ZAR1,000,000\n16. Issue Price 100%\n17. Interest Commencement Date 02 February 2022\n18. Maturity Date 02 February 2029\n19. Maturity Period N/A\n1\n"
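
If some PDFs in a batch do carry a text layer and others are scans, it may be worth trying pdf_text() first and running OCR only on the pages that come back empty. The following is a minimal sketch of that idea, not a tested recipe for this particular file. The helper names is_blank_page() and extract_text_with_fallback() are hypothetical, and the sketch assumes the pdftools and tesseract packages are both installed, since pdftools::pdf_ocr_text() relies on tesseract.

```r
# Sketch: keep the embedded text where it exists, OCR only the empty pages.
# is_blank_page() and extract_text_with_fallback() are hypothetical helpers.

is_blank_page <- function(page_text) {
  # TRUE for pages whose extracted text is empty or whitespace only
  !grepl("[^[:space:]]", page_text)
}

extract_text_with_fallback <- function(pdf_path, language = "eng", dpi = 300) {
  # One string per page; scanned pages typically come back as ""
  text <- pdftools::pdf_text(pdf_path)
  blank <- which(is_blank_page(text))
  if (length(blank) > 0) {
    # pdf_ocr_text() renders the selected pages to images and runs tesseract
    text[blank] <- pdftools::pdf_ocr_text(pdf_path, pages = blank,
                                          language = language, dpi = dpi)
  }
  text
}

# Usage (assumes `destfile` points at the downloaded PDF):
# pages <- extract_text_with_fallback(destfile)
# cat(pages[1])  # print the first page with real line breaks
```

A lower dpi makes OCR faster but can hurt recognition quality, so the value of 300 here is only a starting point. OCR output will still contain misreads like the ones visible in the result above.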