I scraped a table from a pdf and everything went into one element of the data frame. I managed to separate everything into separate columns, but r got confused with the column names. The first column is "State" and should include all the state names, but after separation, is blank. The second column is "State Drug Formulary," and after separation, incorrectly includes the state names in it. It’s also missing a lot of other information. Any possible fixes?
For simplicity, I renamed the column "x".
library(tabulizer)
library(pdftools)
library(rJava)
library(tidyverse)
url4 = "https://oppe.pharmacy.washington.edu/PracticumSite/forms/2019_Survey_of_Pharmacy_Law.pdf?-session=Students_Session:42F94F5D0a61a20754trv33D875D&fbclid=IwAR0qeK2tYmyI7T_8ict1Hnew9JxPkpt0bvajI3KL3IFDWg6JHNSSFWGlKY4"
out <- pdf_text(url4)
df=as.data.frame(out[[93]],header=F)
df = df %>%
rename(x = `out[[93]]`) %>%
mutate(x=strsplit(x, "\n")) %>%
unnest(x)
df=df[-c(1:2),]
df2=df %>% separate(x, c("State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"))
What the table should look like. Pg 82 of the original document if you visit the source.
I also tried this, which kept the col names, but removed the data
df3 = df %>% separate(x, sep = " ", into = c("State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"))
CodePudding user response:
Page 82 includes other contents like 21. Drug Product Selection Laws
or etc.
You'd better remove them like,
dummy <- strsplit(df$`out[[93]]`, '\\n\n')
This process will split that page into four part and table what you are looking for is second object of that list.
df2 <- df %>%
rename(x = `out[[93]]`) %>%
mutate(x = stringr::str_split(x, '\\n\n',simplify = T)[2]) %>%
mutate(x = strsplit(x, '\\n')) %>%
unnest() %>%
.[-c(1:3), ]
Now df2
is the table contents. So, splitting this with more than two whitespace,
df2 %>% separate(x, c("a","State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"), sep = "\\s{2,}") %>%
select(-a)
will give the result. 'a' is dummy that result from separate
have blank value at the front. Here's some part of the result.
State `State Drug Fo…` `Two-line Rx F…` `Permissive or…` `How to Preven…`
<chr> <chr> <chr> <chr> <chr>
1 Alabama None Yes P, BBB A
2 Alaska None No P B
3 Arizona None No P I
4 Arkans… None No P B
5 Califo… None No P EE
6 Colora… None No P J
7 Connec… None No P E, F
8 Delawa… None No P E
9 Distri… Positive No P B
10 Florida Negative L No M B
Do it in one line from df
df %>%
rename(x = `out[[93]]`) %>%
mutate(x = stringr::str_split(x, '\\n\n',simplify = T)[2]) %>%
mutate(x = strsplit(x, '\\n')) %>%
unnest() %>%
.[-c(1:3),] %>%
separate(x, c("a","State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"), sep = "\\s{2,}") %>%
select(-a)
You may try this
as.data.frame(pdf_text(url4)[[93]],header=F) %>%
rename(x = `out[[93]]`) %>%
mutate(x = stringr::str_split(x, '\\n\n',simplify = T)[2]) %>%
mutate(x = strsplit(x, '\\n')) %>%
unnest() %>%
.[-c(1:3),] %>%
separate(x, c("a","State", "State Drug Formulary","Two-line Rx Format","Permissive or Mandatory*","How to Prevent Substitution","Cost Savings Pass-on","Patient Consent**"), sep = "\\s{2,}") %>%
select(-a)