Web scrape titles in R

I am trying to write a function, get_CIDname().

Each chemical compound has a designated CID, Compound ID, from PubChem's chemical database.

For example, acetic acid is 176 and water is 962.

I have a data frame with a column of these CIDs and some other character columns. I would like to mutate a new column that gives, for each CID, the title shown on its PubChem page.

For example, every instance of 962 in this identifier column would be replaced with 'Water' and every instance of 176 with 'Acetic Acid', i.e. the main name shown on the page https://pubchem.ncbi.nlm.nih.gov/compound/CID

example dataset:

df <- data.frame("Compound" = c(176,29096,6341,8914,5366204,98464,11572,9231,535144,15669393,1738127,1738124), "Value" = rnorm(12, mean = 500000, sd = 600000))

desired output:

df <- data.frame("Compound" = c(176,29096,6341,8914,5366204,98464,11572,9231,535144,15669393,1738127,1738124), "Value" = rnorm(12, mean = 500000, sd = 600000),
Match = c("Acetic Acid", "Dihydromyrcenol", etc....))

Currently, I have:

library(rvest)

get_CIDname <- function(CID){
  # download and parse the compound page for this CID
  read_html(paste0("https://pubchem.ncbi.nlm.nih.gov/compound/", CID))
}

but I do not know how to parse the HTML of the PubChem website. What comes next? What is this type of syntax/programming called?

Answer:

Instead of parsing the HTML, we can use PubChem's REST-style web services to download a JSON record for each CID and read the compound title from it (the URL below is the PUG View endpoint).

#libraries
library(jsonlite)
library(data.table)

#data
df <- data.frame("Compound" = c(10413, 176,29096,6341,8914,5366204,98464,11572,9231,535144,15669393,1738127,1738124), "Value" = rnorm(13, mean = 500000, sd = 600000))


#set to data.table
df <- as.data.table(df)

#set up progressbar
pb <- txtProgressBar(min = 0, max = nrow(df), style = 3)

#loop through df rows
for(i in seq_len(nrow(df))){
  #update progressbar
  setTxtProgressBar(pb, i)

  #build the request URL for this row's CID
  cid <- df$Compound[i]
  url <- paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/", cid, "/JSON/?response_type=save&response_basename=compound_CID_", cid)

  #download and parse the compound record
  data <- fromJSON(readLines(url, warn = FALSE))

  #extract title
  compound_title <- data$Record$RecordTitle

  #add to df (the name column is created on the first pass)
  df[i, name := compound_title]
}
#close progressbar
close(pb)
head(df)

   Compound    Value                   name
1:    10413 898404.7 4-Hydroxybutanoic acid
2:      176 174150.1            Acetic Acid
3:    29096 516514.0        Dihydromyrcenol
4:     6341 499010.7             Ethylamine
5:     8914 783220.9             Nonan-1-ol
6:  5366204 217092.8  (Z)-1-Methoxy-2-buten
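
Since the question asked for a get_CIDname() function, the loop body can also be wrapped up as one. This is only a minimal sketch, not a tested client: `pubchem_view_url()` is a helper name made up here, the URL is the same PUG View endpoint with the optional download parameters left off (which I would expect to return the same record), and returning NA on a failed lookup is my own choice.

```r
# Hypothetical helper: build the PUG View URL for one CID.
# format(scientific = FALSE) is used because paste0() can render round
# numbers such as 100000 as "1e+05".
pubchem_view_url <- function(cid) {
  paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/",
         format(cid, scientific = FALSE, trim = TRUE), "/JSON")
}

# Return the record title for one CID, or NA if the download or parse fails
# or the record has no title (assumption: NA is an acceptable fallback).
get_CIDname <- function(CID) {
  tryCatch({
    record <- jsonlite::fromJSON(readLines(pubchem_view_url(CID), warn = FALSE))
    title <- record$Record$RecordTitle
    if (is.null(title)) NA_character_ else title
  }, error = function(e) NA_character_)
}

# Usage: one request per row; vapply() checks each result is a single string.
# df$name <- vapply(df$Compound, get_CIDname, character(1))
```

Keeping the lookup in a function means it can be used with vapply() or inside dplyr::mutate() on a plain data frame, without converting to data.table first.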

If you have duplicates of Compound in your dataset, it might be faster to loop through the unique compounds only, i.e. for(cid in unique(df$Compound)), and adjust the code accordingly.
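
One way to do that adjustment, as a rough sketch: look up each distinct CID once and map the names back onto every row with match(). The names `name_cids`, `fetch` and `toy_lookup` are made up for illustration; `fetch` stands for whatever single-CID lookup you use (for instance the body of the loop above wrapped in a function).

```r
# Look up each distinct CID once, then map the names back onto every row.
# `fetch` is any function that takes one CID and returns one string.
name_cids <- function(cids, fetch) {
  unique_cids <- unique(cids)
  titles <- vapply(unique_cids, fetch, character(1), USE.NAMES = FALSE)
  titles[match(cids, unique_cids)]
}

# Offline stand-in for a real lookup, using the two names from the question.
toy_lookup <- function(cid) {
  c("176" = "Acetic Acid", "962" = "Water")[[as.character(cid)]]
}

# Three rows, but toy_lookup() is only called twice.
name_cids(c(176, 962, 176), toy_lookup)
```

With a real lookup in place of `toy_lookup`, the result can be assigned straight to a new column, e.g. df$name <- name_cids(df$Compound, my_lookup).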

Edit: The PUG REST documentation notes that the service is not designed for very large volumes (millions) of requests, and asks that scripts or applications make no more than 5 requests per second to avoid overloading the PubChem servers. See https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest for details. Something to keep in mind.
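
A simple way to stay under that limit is to pause between requests. The sketch below is one possible approach, not something PubChem prescribes: `throttled_map` is a made-up name, and it just spaces out the start of each call by at least 1 / per_second seconds.

```r
# Call f on each element of x, waiting so that consecutive calls start
# at least 1 / per_second seconds apart.
throttled_map <- function(x, f, per_second = 5) {
  min_gap <- 1 / per_second
  out <- vector("list", length(x))
  last_start <- -Inf
  for (i in seq_along(x)) {
    # time still to wait since the previous call started (none on first pass)
    wait <- min_gap - (as.numeric(Sys.time()) - last_start)
    if (wait > 0) Sys.sleep(wait)
    last_start <- as.numeric(Sys.time())
    out[[i]] <- f(x[[i]])
  }
  out
}

# Usage with a hypothetical single-CID lookup function my_lookup():
# titles <- unlist(throttled_map(unique(df$Compound), my_lookup))
```

For a dozen compounds this adds a couple of seconds at most; it only starts to matter once the data frame has thousands of distinct CIDs.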