I scraped a table from a pdf and everything went into one element of the data frame. I managed to separate everything into separate columns, but r got confused with the column names. The first column is "State" and should include all the state names, but after separation, is blank. The second column is "State Drug Formulary," and after separation, incorrectly includes the state names in it. It’s also missing a lot of other information. Any possible fixes?

For simplicity, I renamed the column "x".

library(tabulizer)
library(pdftools)
library(rJava)
library(tidyverse)
url4 = "https://oppe.pharmacy.washington.edu/PracticumSite/forms/2019_Survey_of_Pharmacy_Law.pdf?-session=Students_Session:42F94F5D0a61a20754trv33D875D&fbclid=IwAR0qeK2tYmyI7T_8ict1Hnew9JxPkpt0bvajI3KL3IFDWg6JHNSSFWGlKY4"

out <- pdf_text(url4)
df=as.data.frame(out[[93]],header=F)
df = df %>%
  rename(x = `out[[93]]`) %>% 
    mutate(x=strsplit(x, "\n")) %>%
    unnest(x)
df=df[-c(1:2),]
df2=df %>% separate(x, c("State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"))

What the table should look like. Pg 82 of the original document if you visit the source.

I also tried this, which kept the col names, but removed the data

df3 = df %>% separate(x, sep = " ", into = c("State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"))

CodePudding user response：

Page 82 includes other contents like 21. Drug Product Selection Laws or etc.

You'd better remove them like,

dummy <- strsplit(df$`out[[93]]`, '\\n\n')

This process will split that page into four part and table what you are looking for is second object of that list.

df2 <- df %>%
  rename(x = `out[[93]]`) %>%
  mutate(x = stringr::str_split(x, '\\n\n',simplify = T)[2]) %>%
  mutate(x = strsplit(x, '\\n')) %>%
  unnest() %>%
  .[-c(1:3), ]

Now df2 is the table contents. So, splitting this with more than two whitespace,

df2 %>% separate(x, c("a","State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"), sep = "\\s{2,}") %>%
  select(-a)

will give the result. 'a' is dummy that result from separate have blank value at the front. Here's some part of the result.

  State   `State Drug Fo…` `Two-line Rx F…` `Permissive or…` `How to Preven…`
   <chr>   <chr>            <chr>            <chr>            <chr>           
 1 Alabama None             Yes              P, BBB           A               
 2 Alaska  None             No               P                B               
 3 Arizona None             No               P                I               
 4 Arkans… None             No               P                B               
 5 Califo… None             No               P                EE              
 6 Colora… None             No               P                J               
 7 Connec… None             No               P                E, F            
 8 Delawa… None             No               P                E               
 9 Distri… Positive         No               P                B               
10 Florida Negative L       No               M                B

Do it in one line from `df`

df %>%
  rename(x = `out[[93]]`) %>%
  mutate(x = stringr::str_split(x, '\\n\n',simplify = T)[2]) %>%
  mutate(x = strsplit(x, '\\n')) %>%
  unnest() %>%
  .[-c(1:3),] %>%
  separate(x, c("a","State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"), sep = "\\s{2,}") %>%
  select(-a)

You may try this

as.data.frame(pdf_text(url4)[[93]],header=F) %>%
  rename(x = `out[[93]]`) %>%
  mutate(x = stringr::str_split(x, '\\n\n',simplify = T)[2]) %>%
  mutate(x = strsplit(x, '\\n')) %>%
  unnest() %>%
  .[-c(1:3),] %>%
  separate(x, c("a","State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"), sep = "\\s{2,}") %>%
  select(-a)

Do it in one line from df

You may try this

Do it in one line from `df`