I scraped a table from a web page that wasn't actually structured as a table. I managed to separate the characters into multiple rows, but for future reference I'd like to know a more efficient way to do this for larger data sets.
Update: I was also able to get everything into one column, but the entire approach is wildly inefficient. Any suggestions for improvement?
library(rvest)
library(tidyverse)
library(dplyr)   # redundant - dplyr is already attached by tidyverse
url = "https://www.ncsl.org/research/health/state-laws-and-legislation-related-to-biologic-medications-and-substitution-of-biosimilars.aspx"
webpage = read_html(url)
mandatory_2014 = webpage %>%
  html_element(css = "#dnn_ctr84472_HtmlModule_lblContent > div > table:nth-child(15)") %>%
  html_table()
mandatory_2014 = data.frame(mandatory_2014)
df = mandatory_2014 %>%
  mutate(X1 = strsplit(X1, "\n\n\t\t\t")) %>%
  unnest(X1) %>%
  mutate(X2 = strsplit(X2, "\n\n\t\t\t")) %>%
  unnest(X2) %>%   # was unnest(X3), which left X2 behind as a list column
  mutate(X3 = strsplit(X3, "\n\n\t\t\t")) %>%
  unnest(X3)
df = df[-c(2)]   # drop the second column
df = stack(df)   # stack the remaining columns into one
df = df[-c(2)]   # drop the "ind" column that stack() adds
df = data.frame(df[!duplicated(df), ])
df = rename(df, States = df..duplicated.df....)   # fix the auto-generated name
CodePudding user response:
This may be done more easily in base R: unlist the columns to a vector, then replace one or more occurrences (+) of the \n/\t runs with a single ",", remove the characters starting from the "(", then use either strsplit or scan to split the string into individual elements (with "," as the delimiter), apply trimws to remove any remaining leading/trailing spaces, and convert the result to a data.frame column.
# unlist to a vector, turn \n/\t runs into commas, drop the trailing "(...)",
# then split on commas and trim whitespace
out <- data.frame(States = trimws(scan(text = sub("\\s+\\(.*", "",
    gsub("(\\n+\\t+)", ",", unlist(mandatory_2014))), what = "", sep = ",")))
-output
> out
States
1 Florida
2 Kansas
3 Kentucky
4 Massachusetts
5 Minnesota
6 Mississippi
7 Nevada
8 New Jersey
9 New York
10 Pennsylvania
11 Puerto Rico
12 Rhode Island
13 Washington
14 West Virginia
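For the tidyverse route the question started down, the repeated mutate(strsplit())/unnest() pairs can also be collapsed with pivot_longer() plus tidyr's separate_rows(). This is only a sketch, under the assumption that html_table() returns the same three character columns (X1:X3) as above:

library(tidyverse)

df <- mandatory_2014 %>%
  pivot_longer(everything(), values_to = "States") %>%        # stack all columns into one
  separate_rows(States, sep = "\\n+\\t+") %>%                 # split each cell on newline/tab runs
  mutate(States = trimws(sub("\\s+\\(.*", "", States))) %>%   # drop trailing "(...)" and trim
  distinct(States)                                            # de-duplicate across columns

Because separate_rows() expands one column row by row, it avoids the cross-joins produced by chaining unnest() over several list columns, which is what made the original approach slow on larger tables.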