Scrape journal article title from staff web-page

Time:11-04

I would like to scrape the title and authors of journal articles from all staff-members' official web-pages. e.g.

https://eps.leeds.ac.uk/civil-engineering/staff/581/samuel-adu-amankwah

The specific part I'm trying to access is the list of journal articles shown on that page.

I'm following this guide: https://www.datacamp.com/community/tutorials/r-web-scraping-rvest but it refers to HTML tags which this site doesn't have. Can anyone point me in the right direction, please?

Answer:

The page loads these citations dynamically using an XHR call that returns a JSON object, which is why they are missing from the static HTML. In this case, we can replicate the query and parse the JSON ourselves to get the publication list:

library(httr)      # GET() and content()
library(rvest)     # loaded here for the %>% pipe
library(jsonlite)  # fromJSON()

url <- paste0("https://eps.leeds.ac.uk/site/custom_scripts/symplectic_ajax.php?",
       "uniqueid=00970757",
       "&tries=0", 
       "&hash=f6a214dc99686895d6bf3de25507356f", 
       "&citationStyle=1")

# Fetch the JSON, then format each journal article as "authors ; title ; journal"
GET(url) %>% 
  content("text") %>%
  fromJSON() %>%
  `[[`("publications") %>%
  `[[`("journal_article") %>%
  lapply(function(x) paste(x$authors, x$title, x$journal, sep = " ; ")) %>%
  unlist() %>%
  as.character()
#> [1] "Adu-Amankwah S, Zajac M, Skocek J, Nemecek J, Haha MB, Black L ; Combined influence of carbonation and leaching on freeze-thaw resistance of limestone ternary cement concrete ; Construction and Building Materials"                        
#> [2] "Wang H, Hou P, Li Q, Adu-Amankwah S, Chen H, Xie N, Zhao P, Huang Y, Wang S, Cheng X ; Synergistic effects of supplementary cementitious materials in limestone and calcined clay-replaced slag cement ; Construction and Building Materials"
#> [3] "Shamaki M, Adu-Amankwah S, Black L ; Reuse of UK alum water treatment sludge in cement-based materials ; Construction and Building Materials"                                                                                                
#> [4] "Adu-Amankwah S, Bernal Lopez S, Black L ; Influence of component fineness on hydration and strength development in ternary slag-limestone cements ; RILEM Technical Letters"                                                                 
#> [5] "Adu-Amankwah S, Zajac M, Skocek J, Ben Haha M, Black L ; Relationship between cement composition and the freeze-thaw resistance of concretes ; Advances in Cement Research"                                                                  
#> [6] "Zajac M, Skocek J, Adu-Amankwah S, Black L, Ben Haha M ; Impact of microstructure on the performance of composite cements: Why higher total porosity can result in higher strength ; Cement and Concrete Composites"                         
#> [7] "Adu-Amankwah S, Black L, Skocek J, Ben Haha M, Zajac M ; Effect of sulfate additions on hydration and performance of ternary slag-limestone composite cements ; Construction and Building Materials"                                         
#> [8] "Adu-Amankwah S, Zajac M, Stabler C, Lothenbach B, Black L ; Influence of limestone on the hydration of ternary slag cement ; Cement and Concrete Research"                                                                                   
#> [9] "Adu-Amankwah S, Khatib JM, Searle DE, Black L ; Effect of synthesis parameters on the performance of alkali-activated non-conformant EN 450 pulverised fuel ash ; Construction and Building Materials"
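
If you would rather keep authors, title and journal in separate columns than paste them into one string, something like the following may work. It is only a sketch run on a small hand-written sample: the JSON shape (publications, then journal_article, then one entry per article) is inferred from the code above, and the entry keys `entry1`/`entry2` and the helper `field()` are made up for illustration.

```r
library(jsonlite)

# Hypothetical sample that mimics the shape the code above relies on.
# The keys "entry1" and "entry2" are invented; the real keys may differ.
sample_json <- '{
  "publications": {
    "journal_article": {
      "entry1": {
        "authors": "Shamaki M, Adu-Amankwah S, Black L",
        "title": "Reuse of UK alum water treatment sludge in cement-based materials",
        "journal": "Construction and Building Materials"
      },
      "entry2": {
        "authors": "Adu-Amankwah S, Zajac M, Stabler C, Lothenbach B, Black L",
        "title": "Influence of limestone on the hydration of ternary slag cement",
        "journal": "Cement and Concrete Research"
      }
    }
  }
}'

# Return a field as character, or NA if the entry does not have it
field <- function(entry, name) {
  value <- entry[[name]]
  if (is.null(value)) NA_character_ else as.character(value)
}

# simplifyVector = FALSE keeps every article as a plain list
articles <- fromJSON(sample_json, simplifyVector = FALSE)$publications$journal_article

# One row per article, one column per field
pubs <- do.call(rbind, lapply(articles, function(x) {
  data.frame(authors = field(x, "authors"),
             title   = field(x, "title"),
             journal = field(x, "journal"),
             stringsAsFactors = FALSE)
}))
rownames(pubs) <- NULL

pubs$title
```

With the real response you would replace `sample_json` with the text returned by `content("text")`.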

Update

It is possible to get the JSON URL from the HTML of the faculty member's homepage with a bit of text parsing:

# Build the JSON request URL from a staff member's homepage URL
get_json_request <- function(url)
{
   # Return the text between the first `start` and the next `end`
   carveout <- function(string, start, end)
   {
      string %>% strsplit(start) %>% `[[`(1) %>% `[`(2) %>%
                 strsplit(end)   %>% `[[`(1) %>% `[`(1)
   }
   
   # The query parameters are embedded in the page's JavaScript
   params <- GET(url) %>% 
      content("text") %>% 
      carveout("var dataGetQuery = ", ";")
   
   id <- carveout(params, "uniqueid: '", "'")
   tries <- carveout(params, "tries: ", ",")
   hash <- carveout(params, "hash: '", "'")
   citationStyle <- carveout(params, "citationStyle: ", "\n")

   paste0("https://eps.leeds.ac.uk/site/custom_scripts/symplectic_ajax.php?",
          "uniqueid=", id,
          "&tries=", tries, 
          "&hash=", hash,
          "&citationStyle=", citationStyle)
}

Which allows:

url <- "https://eps.leeds.ac.uk/civil-engineering/staff/581/samuel-adu-amankwah"

get_json_request(url)
#> [1] "https://eps.leeds.ac.uk/site/custom_scripts/symplectic_ajax.php?uniqueid=00970757&tries=0&hash=f7266eb42b24715cfdf2851f24b229c6&citationStyle=1"
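
To see what the text parsing does without touching the network, here is a small sketch that runs the same carve-out idea on a made-up script fragment. The layout of `page_text` is an assumption based on the delimiters used above, not a copy of the real page, and this base-R `carveout()` uses `fixed = TRUE` so the delimiters are matched literally.

```r
# Return the text between the first `start` and the next `end`
carveout <- function(string, start, end) {
  after_start <- strsplit(string, start, fixed = TRUE)[[1]][2]
  strsplit(after_start, end, fixed = TRUE)[[1]][1]
}

# Hypothetical fragment of the homepage source (layout assumed)
page_text <- paste(
  "<script>",
  "var dataGetQuery = {",
  "    uniqueid: '00970757',",
  "    tries: 0,",
  "    hash: 'f6a214dc99686895d6bf3de25507356f',",
  "    citationStyle: 1",
  "};",
  "</script>",
  sep = "\n"
)

# First isolate the object literal, then pull each value out of it
params <- carveout(page_text, "var dataGetQuery = ", ";")

id            <- carveout(params, "uniqueid: '", "'")
tries         <- carveout(params, "tries: ", ",")
hash          <- carveout(params, "hash: '", "'")
citationStyle <- carveout(params, "citationStyle: ", "\n")

c(id = id, tries = tries, hash = hash, citationStyle = citationStyle)
```

This kind of parsing is brittle: if the site changes the spacing or order of the fields, the delimiters need updating.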

And, if you want a single function that you can lapply over a vector of homepage URLs to get the final publication list:

# Homepage URL in, character vector of "authors ; title ; journal" out
publications_from_homepage <- function(url)
{
   get_json_request(url) %>%
      GET() %>% 
      content("text") %>%
      fromJSON() %>%
      `[[`("publications") %>%
      `[[`("journal_article") %>%
      lapply(function(x) paste(x$authors, x$title, x$journal, sep = " ; ")) %>%
      unlist() %>%
      as.character()
}

So you have:

publications_from_homepage(url)
#> [1] "Adu-Amankwah S, Zajac M, Skocek J, Nemecek J, Haha MB, Black L ; Combined influence of carbonation and leaching on freeze-thaw resistance of limestone ternary cement concrete ; Construction and Building Materials"                        
#> [2] "Wang H, Hou P, Li Q, Adu-Amankwah S, Chen H, Xie N, Zhao P, Huang Y, Wang S, Cheng X ; Synergistic effects of supplementary cementitious materials in limestone and calcined clay-replaced slag cement ; Construction and Building Materials"
#> [3] "Shamaki M, Adu-Amankwah S, Black L ; Reuse of UK alum water treatment sludge in cement-based materials ; Construction and Building Materials"                                                                                                
#> [4] "Adu-Amankwah S, Bernal Lopez S, Black L ; Influence of component fineness on hydration and strength development in ternary slag-limestone cements ; RILEM Technical Letters"                                                                 
#> [5] "Adu-Amankwah S, Zajac M, Skocek J, Ben Haha M, Black L ; Relationship between cement composition and the freeze-thaw resistance of concretes ; Advances in Cement Research"                                                                  
#> [6] "Zajac M, Skocek J, Adu-Amankwah S, Black L, Ben Haha M ; Impact of microstructure on the performance of composite cements: Why higher total porosity can result in higher strength ; Cement and Concrete Composites"                         
#> [7] "Adu-Amankwah S, Black L, Skocek J, Ben Haha M, Zajac M ; Effect of sulfate additions on hydration and performance of ternary slag-limestone composite cements ; Construction and Building Materials"                                         
#> [8] "Adu-Amankwah S, Zajac M, Stabler C, Lothenbach B, Black L ; Influence of limestone on the hydration of ternary slag cement ; Cement and Concrete Research"                                                                                   
#> [9] "Adu-Amankwah S, Khatib JM, Searle DE, Black L ; Effect of synthesis parameters on the performance of alkali-activated non-conformant EN 450 pulverised fuel ash ; Construction and Building Materials"
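
When running this over all staff pages, one failing page would otherwise stop the whole lapply. A possible wrapper is sketched below. The names `scrape_many()` and `fake_scraper()` are made up for illustration, and the stub stands in for `publications_from_homepage()` so the example runs offline.

```r
# Apply `scraper` to each URL, pausing between requests and
# returning an empty result (with a warning) when a page fails
scrape_many <- function(urls, scraper, delay = 1) {
  results <- lapply(urls, function(u) {
    Sys.sleep(delay)  # be polite to the server
    tryCatch(
      scraper(u),
      error = function(e) {
        warning("Failed for ", u, ": ", conditionMessage(e))
        character(0)
      }
    )
  })
  names(results) <- urls
  results
}

# Hypothetical stand-in for publications_from_homepage()
fake_scraper <- function(u) {
  if (grepl("broken", u)) stop("page not found")
  paste("Publication from", u)
}

urls <- c("https://example.org/staff/1", "https://example.org/staff/broken")
res <- suppressWarnings(scrape_many(urls, fake_scraper, delay = 0))
```

In real use you would call `scrape_many(homepage_urls, publications_from_homepage)`, where `homepage_urls` is your own vector of staff page URLs.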

Created on 2021-11-04 by the reprex package