How can I parse or extract the table from this PHP page into a data frame? I only need the table.
I tried this:
library(httr)  # GET(), user_agent()
library(XML)   # htmlParse(), xpathSApply(), readHTMLTable()
theurl <- "http://www.medindex.am/glossary/semantic_types/B2.2-disease-syndrome-pathologic-function.php"
doc <- htmlParse(GET(theurl, user_agent("Mozilla")))
results <- xpathSApply(doc, "//*/table[@id='table_results_r_1']")
results <- readHTMLTable(results[[1]])
rm(doc)
This attempt does not work. The page is:
http://www.medindex.am/glossary/semantic_types/B2.2-disease-syndrome-pathologic-function.php
CodePudding user response:
library(RSelenium)
library(rvest)
library(xml2)
# set up driver, client and server
driver <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
server <- driver$server
browser <- driver$client
# go to the URL in the browser
browser$navigate("http://www.medindex.am/glossary/semantic_types/B2.2-disease-syndrome-pathologic-function.php")
# get all tables from the page source
doc <- xml2::read_html(browser$getPageSource()[[1]])
all.table <- rvest::html_table(doc)
# close everything down properly
browser$close()
server$stop()
# needed on Windows, else port 4545 stays occupied by the Java process
system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
# the second table in the list is the one we want
all.table[[2]]
# A tibble: 22,397 x 4
#    CUI      Term                              Dictionary SemanticType
#    <chr>    <chr>                             <chr>      <chr>
#  1 C0003865 Arthritis, Adjuvant               NDFRT      Experimental Model of Disease
#  2 C0004426 avian sarcoma                     CSP        Experimental Model of Disease
#  3 C0004565 B16 Malignant Melanoma            NCI        Experimental Model of Disease
#  4 C0007098 Carcinoma 256, Walker             NDFRT      Experimental Model of Disease
#  5 C0007125 Carcinoma, Ehrlich Tumor          NDFRT      Experimental Model of Disease
#  6 C0007128 Carcinoma, Krebs 2                NDFRT      Experimental Model of Disease
#  7 C0009075 Cloudman S91 Malignant Melanoma   NCI        Experimental Model of Disease
#  8 C0011853 Diabetes Mellitus, Experimental   NDFRT      Experimental Model of Disease
#  9 C0014072 autoimmune encephalomyelitis      CSP        Experimental Model of Disease
# 10 C0018598 Harding-Passey Malignant Melanoma NCI        Experimental Model of Disease
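Picking the table by position (`[[2]]`) relies on the page layout staying the same. As a minimal sketch, assuming the list returned by `html_table()` looks like the hypothetical stand-in below, the table can instead be chosen by its column names. The `wanted` and `is_target` names are illustrative only, and the first list element is an invented placeholder for whatever other table the page contains.

```r
# Hypothetical stand-in for the list returned by rvest::html_table(doc)
all.table <- list(
  data.frame(X1 = "Home", X2 = "Glossary"),
  data.frame(
    CUI          = c("C0003865", "C0004426"),
    Term         = c("Arthritis, Adjuvant", "avian sarcoma"),
    Dictionary   = c("NDFRT", "CSP"),
    SemanticType = "Experimental Model of Disease"
  )
)

# Column names taken from the output shown above
wanted <- c("CUI", "Term", "Dictionary", "SemanticType")

# TRUE for every table that contains all of the wanted columns
is_target <- vapply(
  all.table,
  function(tbl) all(wanted %in% names(tbl)),
  logical(1)
)

# Keep the first matching table
result <- all.table[[which(is_target)[1]]]
```

If the site later adds or removes a table before the glossary, this lookup still finds the right one, whereas a hard-coded index would not.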
CodePudding user response:
The table is not rendered dynamically or loaded through additional XHR requests, so you don't need the overhead of a browser. Simply use httr with a user-agent header (the server checks for it), then target the right table with the CSS selector shown below:
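library(httr)
library(rvest)     # provides html_element() and html_table()
library(magrittr)
# request the page with a user-agent header and parse the HTML
r <- httr::GET(
  url = "http://www.medindex.am/glossary/semantic_types/B2.2-disease-syndrome-pathologic-function.php",
  httr::add_headers(.headers = c("user-agent" = "Mozilla/5.0"))
) %>%
  content()
# select the first table that is a descendant of a <p> element
r %>%
  html_element("p table") %>%
  html_table()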
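To see how a CSS selector narrows the result to a single table without touching the network, here is a small offline sketch. The HTML string and the `#content table` selector are invented for illustration and are not the real page's markup; only the two data rows are borrowed from the output shown earlier.

```r
library(rvest)

# Hypothetical page: a layout table followed by the data table in a container
page <- read_html('
<html><body>
  <table><tr><td>Home</td><td>Glossary</td></tr></table>
  <div id="content">
    <table>
      <tr><th>CUI</th><th>Term</th></tr>
      <tr><td>C0003865</td><td>Arthritis, Adjuvant</td></tr>
      <tr><td>C0004426</td><td>avian sarcoma</td></tr>
    </table>
  </div>
</body></html>')

# html_table() on the whole document returns one tibble per <table>
all_tables <- html_table(page)

# html_element() returns only the first node matching the selector
target <- html_table(html_element(page, "#content table"))

# Convert the tibble to a plain data.frame if that is preferred
df <- as.data.frame(target)
```

Here `all_tables` holds both tables, while `df` contains only the one inside the container, so no positional index is needed.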