r pdf_text() split into lines and words

Time:03-20

I can't upload the file to Stack Overflow, but I have a PDF containing a table that spans 3 pages. After loading library(pdftools) and calling pdf_text(), I get a character vector of length 3, where each element is one long string holding all the text from one page.

library(pdftools)
df <- pdf_text("file.pdf")  # the file path must be a quoted string

The data I need is on the 2nd page. I get the output:

df[2]
All Households                                                 19,015    10,030      8,985    3,635     585     3,055   19.1    5.8    34.0\n\nHousing above standards                                        12,365     8,225      4,145       0        0        0     0.0    0.0     0.0\n\nBelow one or more housing standards                             6,650     1,805      4,845    3,640     585     3,055   54.7   32.4    63.1\n\nBelow affordability standard12                                  4,885     1,230      3,660    3,125     535     2,590   64.0   43.5    70.8\n\nBelow adequacy standard13                                       1,360      555        810      425       75      350    31.2   13.5    43.2\n\n\n\n\n

I want to isolate the row "Below one or more housing standards" and the 8th column which contains the value "54.7".

I believe the next steps are to split the long string into lines by the line break character "\n", identify the applicable line, split the line into words, and select the 8th word.
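The steps described above can be sketched in base R as follows. This is only a sketch under stated assumptions: `page` is a stand-in for `df[2]` built from two rows of the output shown earlier, and the columns are assumed to be separated by runs of two or more spaces. Splitting on single spaces would break the six-word row label apart, so "54.7" would no longer be the 8th piece; splitting on runs of spaces keeps the label as one field.

```r
# Stand-in for df[2]: two rows copied from the output above.
page <- paste0(
  "Housing above standards                                        12,365     8,225      4,145       0        0        0     0.0    0.0     0.0\n\n",
  "Below one or more housing standards                             6,650     1,805      4,845    3,640     585     3,055   54.7   32.4    63.1\n\n"
)

# 1. Split the page into lines. strsplit() returns a list with one
#    element per input string, so [[1]] pulls out the vector of lines.
page_lines <- strsplit(page, "\n", fixed = TRUE)[[1]]

# 2. Keep the line that contains the row label.
target <- page_lines[grepl("Below one or more housing standards",
                           page_lines, fixed = TRUE)]

# 3. Split on runs of two or more spaces so the label stays one field.
fields <- strsplit(trimws(target), " {2,}")[[1]]

# 4. Take the 8th field and convert it to a number.
value <- as.numeric(fields[8])
```

If the real page ever separates two columns with a single space, the `" {2,}"` pattern would merge them, so it is worth checking `length(fields)` against the expected number of columns.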

I've tried splitting into lines using:

library(stringr)
lines <- df[2] %>% str_split("\n")

It returns a "List of 1" and I'm not sure how to work with it. Any suggestions on the syntax?
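For context on the "List of 1": `str_split()` returns a list with one element per input string, and since `df[2]` is a single string, the list has a single element. A minimal sketch of unpacking it, using a hypothetical two-line string in place of `df[2]`:

```r
library(stringr)

# Hypothetical stand-in for df[2].
page <- "first line\nsecond line"

# str_split() returns a list: one element per input string.
split_result <- str_split(page, "\n")

# [[1]] extracts the character vector of lines for the first (only) input.
page_lines <- split_result[[1]]

# Individual lines can now be indexed like any character vector.
second <- page_lines[2]
```

After the `[[1]]` step, `page_lines` is an ordinary character vector, so it can be filtered with `grepl()` or `str_detect()` to find the row of interest.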

It's a bit convoluted to get to the original file.
