How to extract table from php site to R data frame

Time:11-09

How can I parse or extract the table from this PHP page into an R data frame? I only need the table itself.

I tried this

theurl <- "http://www.medindex.am/glossary/semantic_types/B2.2-disease-syndrome-pathologic-function.php"
doc <- htmlParse(GET(theurl, user_agent("Mozilla")))
results <- xpathSApply(doc, "//*/table[@id='table_results_r_1']")
results <- readHTMLTable(results[[1]])
rm(doc)

This attempt fails. The page is:

http://www.medindex.am/glossary/semantic_types/B2.2-disease-syndrome-pathologic-function.php

CodePudding user response:

library(RSelenium)
library(rvest)
library(xml2)

#setup driver, client and server
driver <- rsDriver( browser = "firefox", port = 4545L, verbose = FALSE ) 
server <- driver$server
browser <- driver$client

#goto url in browser
browser$navigate("http://www.medindex.am/glossary/semantic_types/B2.2-disease-syndrome-pathologic-function.php")

#get all tables
doc <- xml2::read_html(browser$getPageSource()[[1]])
all.table <- rvest::html_table(doc)

#close everything down properly
browser$close()
server$stop()
# Windows only: kill the Java process, otherwise port 4545 stays occupied
system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)

all.table[[2]]
# A tibble: 22,397 x 4
# CUI      Term                              Dictionary SemanticType                 
# <chr>    <chr>                             <chr>      <chr>                        
# 1 C0003865 Arthritis, Adjuvant               NDFRT      Experimental Model of Disease
# 2 C0004426 avian sarcoma                     CSP        Experimental Model of Disease
# 3 C0004565 B16 Malignant Melanoma            NCI        Experimental Model of Disease
# 4 C0007098 Carcinoma 256, Walker             NDFRT      Experimental Model of Disease
# 5 C0007125 Carcinoma, Ehrlich Tumor          NDFRT      Experimental Model of Disease
# 6 C0007128 Carcinoma, Krebs 2                NDFRT      Experimental Model of Disease
# 7 C0009075 Cloudman S91 Malignant Melanoma   NCI        Experimental Model of Disease
# 8 C0011853 Diabetes Mellitus, Experimental   NDFRT      Experimental Model of Disease
# 9 C0014072 autoimmune encephalomyelitis      CSP        Experimental Model of Disease
# 10 C0018598 Harding-Passey Malignant Melanoma NCI        Experimental Model of Disease

CodePudding user response:

The table is not rendered dynamically or loaded through additional XHR requests, so you don't need the overhead of a browser. Use httr with a user-agent header (the server checks for one), then target the right table with a CSS selector:

library(httr)
library(magrittr)
library(rvest) # provides html_element() and html_table()

r <- httr::GET(
  url = "http://www.medindex.am/glossary/semantic_types/B2.2-disease-syndrome-pathologic-function.php",
  httr::add_headers(.headers = c("user-agent" = "Mozilla/5.0"))
) %>%
  content()

r %>%
  html_element("p + table") %>% # the table that directly follows a <p> element
  html_table()
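
A CSS selector can be checked offline before sending requests to the server. The following is a minimal sketch, assuming rvest 1.0 or later is installed. The HTML fragment and its values are made up for illustration and are not taken from the real page. It uses the adjacent-sibling selector "p + table" to pick the table that directly follows a paragraph, then converts the result to a plain data frame:

```r
library(rvest)

# Hypothetical page fragment (made up for illustration): a table,
# then a paragraph immediately followed by the table we want.
html <- minimal_html('
  <table><tr><th>x</th></tr><tr><td>nav</td></tr></table>
  <p>Results</p>
  <table>
    <tr><th>CUI</th><th>Term</th></tr>
    <tr><td>C0000001</td><td>example term</td></tr>
  </table>
')

# "p + table" matches a <table> whose immediately preceding sibling is a <p>
tbl <- html %>%
  html_element("p + table") %>%
  html_table()

# html_table() returns a tibble; convert it if a base data frame is needed
df <- as.data.frame(tbl)
```

If the selector matches nothing, html_element() returns a missing node rather than raising an error, so it is worth inspecting the result before calling html_table() on real pages.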