How can I crawl/scrape (using R) the non-table EPA CompTox Dashboard?

Time:12-08

The EPA CompTox Chemical Dashboard received an update, and my old code is no longer able to scrape the Boiling Point for chemicals. Is anyone able to help me scrape the Experimental Average Boiling Point? I need to write R code that can loop through several chemicals.

Example webpages:
Acetone: https://comptox.epa.gov/dashboard/chemical/properties/DTXSID8021482
Methane: https://comptox.epa.gov/dashboard/chemical/properties/DTXSID8025545

I have tried read_html() and xmlParse() without success. The Experimental Average Boiling Point (ExpAvBP) value does not show up in the XML.

I have tried using ContentScraper() from the Rcrawler package, but it only returns NA whatever I try. Furthermore, this would only work for the first webpage listed, as the cell id changes with each chemical.

ContentScraper(Url="https://comptox.epa.gov/dashboard/chemical/properties/DTXSID8021482", XpathPatterns = "//*[@id='cell-225']")

I have tried using readLines(), but the information is all crammed into the last script tag, and I am unsure how to isolate just the ExpAvBP value. And it looks like the value is stored elsewhere? For example, below is what I believe is the boiling point information within the last script tag.

Acetone:

{unit:c_,name:"Boiling Point",predicted:{rawData:[{value:c$,minValue:e,maxValue:e,source:am,description:an,modelName:"TEST_BP",modelId:T,hasOpera:d,globalApplicability:e,hasQmrfPdf:d,details:{value:B,link:"https:\u002F\u002Fs3.amazonaws.com\u002Fepa-comptox\u002Ftest-reports\u002FDTXCID101482-TEST_BP.html",showLink:a},qmrf:{value:e,link:e,showLink:d}},{value:44.8,minValue:e,maxValue:e,source:ci,description:cj,modelName:"EPISUITE_BP",modelId:dV,hasOpera:d,globalApplicability:e,hasQmrfPdf:d,details:{value:M,link:e,showLink:d},qmrf:{value:e,link:e,showLink:d}},{value:46.458,minValue:e,maxValue:e,source:ad,description:V,modelName:"ACD_BP",modelId:135,hasOpera:d,globalApplicability:e,hasQmrfPdf:d,details:{value:M,link:e,showLink:d},qmrf:{value:e,link:e,showLink:d}},{value:da,minValue:e,maxValue:e,source:aL,description:bo,modelName:"OPERA_BP",modelId:dS,hasOpera:a,globalApplicability:q,hasQmrfPdf:a,details:{value:B,link:"http:\u002F\u002Fcomptox-dev.epa.gov\u002Fdashboard\u002Fdsstoxdb\u002Fcalculation_details?model_id=27&search=21482",showLink:a},qmrf:{value:B,link:"http:\u002F\u002Fcomptox-dev.epa.gov\u002Fdashboard\u002Fdsstoxdb\u002Fdownload_qmrf_pdf?model=27",showLink:a}}],count:bu,mean:47.06289999999999,min:c$,max:da,range:[c$,da],median:45.629},experimental:{rawData:[{value:db,minValue:e,maxValue:e,source:aN,description:aO,experimentalDetails:[]},{value:ak,minValue:ak,maxValue:ak,source:ck,description:cl,experimentalDetails:[]},{value:ak,minValue:ak,maxValue:ak,source:ck,description:cl,experimentalDetails:[]},{value:ak,minValue:ak,maxValue:ak,source:"Food and Agriculture Organization of the United Nations",description:"The Joint FAO\u002FWHO Expert Committee on Food Additives (JECFA) is an international expert scientific committee that is administered jointly by the Food and Agriculture Organization of the United Nations (FAO) and the World Health Organization (WHO). 
Website: \u003Ca href="http:\u002F\u002Fwww.fao.org\u002Fhome\u002F" target="_blank"\u003Ehttp:\u002F\u002Fwww.fao.org\u002Fhome\u002F\u003C\u002Fa\u003E",experimentalDetails:[]},{value:56.05,minValue:e,maxValue:e,source:"Abooali et al. Int. J. Refrig. 2014, 40, 282–293",description:"Abooali, D.; Sobati, M. A. Novel method for prediction of normal boiling point and enthalpy of vaporization at normal boiling point of pure refrigerants: A QSPR approach. (\u003Ca href="http:\u002F\u002Fdx.doi.org\u002F10.1016\u002Fj.ijrefrig.2013.12.007" target="_blank"\u003EInt. J. Refrig. 2014, 40, 282–293\u003C\u002Fa\u003E)\r\n",experimentalDetails:[]},{value:bO,minValue:bO,maxValue:bO,source:hI,description:hJ,experimentalDetails:[]}],count:dK,mean:55.98518333333333,min:db,max:bO,range:[db,bO],median:ak},arrKey:"BOILING_POINT"}

Methane:

{unit:cO,name:"Boiling Point",predicted:{rawData:[{value:at,minValue:f,maxValue:f,source:bB,description:bb,modelName:"ACD_BP",modelId:135,hasOpera:d,globalApplicability:f,hasQmrfPdf:d,details:{value:ag,link:f,showLink:d},qmrf:{value:f,link:f,showLink:d}},{value:hl,minValue:f,maxValue:f,source:aF,description:ba,modelName:"OPERA_BP",modelId:dv,hasOpera:a,globalApplicability:s,hasQmrfPdf:a,details:{value:O,link:"http:\u002F\u002Fcomptox-dev.epa.gov\u002Fdashboard\u002Fdsstoxdb\u002Fcalculation_details?model_id=27&search=25545",showLink:a},qmrf:{value:O,link:"http:\u002F\u002Fcomptox-dev.epa.gov\u002Fdashboard\u002Fdsstoxdb\u002Fdownload_qmrf_pdf?model=27",showLink:a}},{value:cP,minValue:f,maxValue:f,source:bZ,description:b_,modelName:"EPISUITE_BP",modelId:dy,hasOpera:d,globalApplicability:f,hasQmrfPdf:d,details:{value:ag,link:f,showLink:d},qmrf:{value:f,link:f,showLink:d}}],count:bH,mean:-129.25300000000001,min:at,max:cP,range:[at,cP],median:hl},experimental:{rawData:[{value:at,minValue:at,maxValue:at,source:hm,description:hn,experimentalDetails:[]},{value:cQ,minValue:f,maxValue:f,source:bC,description:bD,experimentalDetails:[]}],count:H,mean:ho,min:at,max:cQ,range:[at,cQ],median:ho},arrKey:"BOILING_POINT"}

Any help or insight would be greatly appreciated!

CodePudding user response:

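The data is not in an HTML table, and the question notes that read_html() alone does not return it. One option is to load the page in a real browser with RSelenium, extract the text of the properties panel, and then pull out the boiling temperature by matching the pattern BoilingPoint.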

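library(rvest)
library(dplyr)
library(stringr)   # needed for str_remove_all() and str_replace_all() below
library(RSelenium)

url <- 'https://comptox.epa.gov/dashboard/chemical/properties/DTXSID8025545'

# Start a Selenium-driven Firefox session so the page's JavaScript runs
driver <- rsDriver(browser = "firefox")
remDr <- driver[["client"]]
remDr$navigate(url)

# Parse the rendered page source and pull the text of the properties node
df <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="__layout"]/div/div[5]/div[2]/main/div/div[3]/div[2]/div/div[2]/div[2]/div[3]') %>%
  html_text()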
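
Since the question asks for a loop over several chemicals, the navigate-and-extract step could be wrapped in a function and applied to a vector of DTXSIDs. This is only a sketch: property_url() and scrape_properties() are hypothetical helper names, the URL pattern is taken from the two example pages in the question, and the fixed wait is an assumed, crude way to let the page render.

```r
library(rvest)

# Hypothetical helper: build the properties URL for one DTXSID,
# following the pattern of the example pages in the question.
property_url <- function(dtxsid) {
  paste0("https://comptox.epa.gov/dashboard/chemical/properties/", dtxsid)
}

# Hypothetical helper: visit each page with an already-open RSelenium client
# and return the text of the first node matched by `xpath` (NA if none).
scrape_properties <- function(remDr, dtxsids, xpath, wait = 5) {
  vapply(dtxsids, function(id) {
    remDr$navigate(property_url(id))
    Sys.sleep(wait)  # assumed fixed pause so the JavaScript can render
    txt <- remDr$getPageSource()[[1]] %>%
      read_html() %>%
      html_nodes(xpath = xpath) %>%
      html_text()
    if (length(txt) == 0) NA_character_ else txt[1]
  }, character(1))
}
```

With the session opened above, a call such as scrape_properties(remDr, c("DTXSID8021482", "DTXSID8025545"), xpath) would return one text string per chemical, named by DTXSID, with xpath set to the same expression passed to html_nodes() above. An NA signals that the XPath matched nothing on that page, which is worth checking because a positional XPath like the one above may not line up the same way for every chemical.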

Now extract the boiling temperature. The pattern is adapted from https://stackoverflow.com/a/35936065/12135618

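# Strip newlines and spaces so the label reads "BoilingPoint"
df1 <- df %>% str_remove_all('\n') %>% str_replace_all(' ', '')
# Capture the first run of digits that follows the label
as.numeric(sub(".*?BoilingPoint.*?(\\d+).*", "\\1", df1))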
[1] 163

This pattern captures only the first run of digits after the label, so any minus sign and decimal part are dropped. You may have to fine-tune the pattern to keep them.
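
One possible refinement is to allow an optional minus sign and an optional decimal part in the capture group. The sketch below assumes the text has already been stripped of spaces and newlines as above; extract_first_number() is a hypothetical helper name, and sample_text is a made-up string, not actual output from the dashboard.

```r
# Made-up example of flattened page text; the real string will differ.
sample_text <- "Property:BoilingPointExperimentalAverage:-161.5C"

# Hypothetical helper: return the first (optionally signed, optionally
# decimal) number that appears after `label`, or NA if there is none.
extract_first_number <- function(text, label = "BoilingPoint") {
  pattern <- paste0(".*?", label, ".*?(-?\\d+(?:\\.\\d+)?).*")
  if (!grepl(pattern, text, perl = TRUE)) {
    return(NA_real_)
  }
  as.numeric(sub(pattern, "\\1", text, perl = TRUE))
}

extract_first_number(sample_text)
```

Returning NA when nothing matches avoids the warning that as.numeric() would raise if sub() handed back the whole unmatched string. Note that the function still takes the first number after the label, so it only yields the experimental average if that value comes first in the extracted text.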
