I would like to scrape the titles and authors of journal articles from all staff members' official web pages, e.g.
https://eps.leeds.ac.uk/civil-engineering/staff/581/samuel-adu-amankwah
The specific part in question that I'm trying to access is this:
I'm following this guide: https://www.datacamp.com/community/tutorials/r-web-scraping-rvest
but it refers to HTML tags which this site doesn't have. Can anyone point me in the right direction please?
CodePudding user response:
The page loads these citations dynamically using an XHR call that returns a JSON object, which is why the tags from the tutorial are not in the page's static HTML. In this case, we can replicate the query and parse the JSON ourselves to get the publication list:
library(httr)
library(rvest)
library(jsonlite)
# Query string of the XHR request; uniqueid and hash belong to this staff page
url <- paste0("https://eps.leeds.ac.uk/site/custom_scripts/symplectic_ajax.php?",
              "uniqueid=00970757",
              "&tries=0",
              "&hash=f6a214dc99686895d6bf3de25507356f",
              "&citationStyle=1")
# Fetch the JSON, drill down to the journal articles, and format one string each
GET(url) %>%
  content("text") %>%
  fromJSON() %>%
  `[[`("publications") %>%
  `[[`("journal_article") %>%
  lapply(function(x) paste(x$authors, x$title, x$journal, sep = " ; ")) %>%
  unlist() %>%
  as.character()
#> [1] "Adu-Amankwah S, Zajac M, Skocek J, Nemecek J, Haha MB, Black L ; Combined influence of carbonation and leaching on freeze-thaw resistance of limestone ternary cement concrete ; Construction and Building Materials"
#> [2] "Wang H, Hou P, Li Q, Adu-Amankwah S, Chen H, Xie N, Zhao P, Huang Y, Wang S, Cheng X ; Synergistic effects of supplementary cementitious materials in limestone and calcined clay-replaced slag cement ; Construction and Building Materials"
#> [3] "Shamaki M, Adu-Amankwah S, Black L ; Reuse of UK alum water treatment sludge in cement-based materials ; Construction and Building Materials"
#> [4] "Adu-Amankwah S, Bernal Lopez S, Black L ; Influence of component fineness on hydration and strength development in ternary slag-limestone cements ; RILEM Technical Letters"
#> [5] "Adu-Amankwah S, Zajac M, Skocek J, Ben Haha M, Black L ; Relationship between cement composition and the freeze-thaw resistance of concretes ; Advances in Cement Research"
#> [6] "Zajac M, Skocek J, Adu-Amankwah S, Black L, Ben Haha M ; Impact of microstructure on the performance of composite cements: Why higher total porosity can result in higher strength ; Cement and Concrete Composites"
#> [7] "Adu-Amankwah S, Black L, Skocek J, Ben Haha M, Zajac M ; Effect of sulfate additions on hydration and performance of ternary slag-limestone composite cements ; Construction and Building Materials"
#> [8] "Adu-Amankwah S, Zajac M, Stabler C, Lothenbach B, Black L ; Influence of limestone on the hydration of ternary slag cement ; Cement and Concrete Research"
#> [9] "Adu-Amankwah S, Khatib JM, Searle DE, Black L ; Effect of synthesis parameters on the performance of alkali-activated non-conformant EN 450 pulverised fuel ash ; Construction and Building Materials"
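Since the question asks only for titles and authors, it may be handier to collect those two fields into a data frame than into pasted strings. The following is a minimal offline sketch, not the site's documented format: `sample_json` is a hypothetical stand-in whose shape (an object of articles keyed by made-up ids `a1` and `a2`, each with `authors`, `title` and `journal`) is only inferred from the code above, and it assumes every article carries both fields.

```r
library(jsonlite)

# Hypothetical sample mimicking the assumed shape of the real response
sample_json <- '{
  "publications": {
    "journal_article": {
      "a1": {"authors": "Shamaki M, Adu-Amankwah S, Black L",
             "title": "Reuse of UK alum water treatment sludge in cement-based materials",
             "journal": "Construction and Building Materials"},
      "a2": {"authors": "Adu-Amankwah S, Zajac M, Stabler C, Lothenbach B, Black L",
             "title": "Influence of limestone on the hydration of ternary slag cement",
             "journal": "Cement and Concrete Research"}
    }
  }
}'

articles <- fromJSON(sample_json)$publications$journal_article

# One row per article, keeping only the two requested fields
pubs <- do.call(rbind, lapply(articles, function(x) {
  data.frame(authors = x$authors, title = x$title, stringsAsFactors = FALSE)
}))
rownames(pubs) <- NULL
print(pubs)
```

With the real response, `articles` would come from the `GET(url) %>% content("text") %>% fromJSON()` chain instead of `sample_json`.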
Update
It is possible to get the JSON URL from the HTML of the faculty member's homepage with a bit of text parsing:
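get_json_request <- function(url)
{
  # Return the text between the first `start` and the next `end`,
  # matching both delimiters literally rather than as regular expressions
  carveout <- function(string, start, end)
  {
    string %>% strsplit(start, fixed = TRUE) %>% `[[`(1) %>% `[`(2) %>%
      strsplit(end, fixed = TRUE) %>% `[[`(1) %>% `[`(1)
  }
  # Pull the dataGetQuery object out of the page's inline JavaScript
  params <- GET(url) %>%
    content("text") %>%
    carveout("var dataGetQuery = ", ";")
  id <- carveout(params, "uniqueid: '", "'")
  tries <- carveout(params, "tries: ", ",")
  hash <- carveout(params, "hash: '", "'")
  citationStyle <- carveout(params, "citationStyle: ", "\n")
  # Rebuild the URL of the XHR request from the extracted parameters
  paste0("https://eps.leeds.ac.uk/site/custom_scripts/symplectic_ajax.php?",
         "uniqueid=", id,
         "&tries=", tries,
         "&hash=", hash,
         "&citationStyle=", citationStyle)
}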
Which allows:
url <- "https://eps.leeds.ac.uk/civil-engineering/staff/581/samuel-adu-amankwah"
get_json_request(url)
#> [1] "https://eps.leeds.ac.uk/site/custom_scripts/symplectic_ajax.php?uniqueid=00970757&tries=0&hash=f7266eb42b24715cfdf2851f24b229c6&citationStyle=1"
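To see what the text parsing does without hitting the site, here is a small offline sketch of the same carve-out idea in base R. The `page_snippet` below is a hypothetical reconstruction of the inline script, with its layout guessed from the delimiters used in the function above; the real page source may be formatted differently.

```r
# Text between the first `start` and the next `end`, matched literally
carveout <- function(string, start, end)
{
  after_start <- strsplit(string, start, fixed = TRUE)[[1]][2]
  strsplit(after_start, end, fixed = TRUE)[[1]][1]
}

# Hypothetical page fragment; layout is an assumption, not copied from the site
page_snippet <- paste(
  "<script>",
  "var dataGetQuery = {",
  "    uniqueid: '00970757',",
  "    tries: 0,",
  "    hash: 'f7266eb42b24715cfdf2851f24b229c6',",
  "    citationStyle: 1",
  "};",
  "</script>",
  sep = "\n"
)

params <- carveout(page_snippet, "var dataGetQuery = ", ";")
id <- carveout(params, "uniqueid: '", "'")
tries <- carveout(params, "tries: ", ",")
hash <- carveout(params, "hash: '", "'")
citation_style <- carveout(params, "citationStyle: ", "\n")

print(c(id = id, tries = tries, hash = hash, citationStyle = citation_style))
```

Each call splits on the opening delimiter, keeps what follows, then splits again on the closing delimiter, so it returns `NA` if a delimiter is missing. That makes a changed page layout easy to detect.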
And if you want to be able to just lapply over a vector of homepage URLs to get the final publication list:
publications_from_homepage <- function(url)
{
  # Homepage URL -> JSON URL -> parsed JSON -> one string per journal article
  get_json_request(url) %>%
    GET() %>%
    content("text") %>%
    fromJSON() %>%
    `[[`("publications") %>%
    `[[`("journal_article") %>%
    lapply(function(x) paste(x$authors, x$title, x$journal, sep = " ; ")) %>%
    unlist() %>%
    as.character()
}
So you have:
publications_from_homepage(url)
#> [1] "Adu-Amankwah S, Zajac M, Skocek J, Nemecek J, Haha MB, Black L ; Combined influence of carbonation and leaching on freeze-thaw resistance of limestone ternary cement concrete ; Construction and Building Materials"
#> [2] "Wang H, Hou P, Li Q, Adu-Amankwah S, Chen H, Xie N, Zhao P, Huang Y, Wang S, Cheng X ; Synergistic effects of supplementary cementitious materials in limestone and calcined clay-replaced slag cement ; Construction and Building Materials"
#> [3] "Shamaki M, Adu-Amankwah S, Black L ; Reuse of UK alum water treatment sludge in cement-based materials ; Construction and Building Materials"
#> [4] "Adu-Amankwah S, Bernal Lopez S, Black L ; Influence of component fineness on hydration and strength development in ternary slag-limestone cements ; RILEM Technical Letters"
#> [5] "Adu-Amankwah S, Zajac M, Skocek J, Ben Haha M, Black L ; Relationship between cement composition and the freeze-thaw resistance of concretes ; Advances in Cement Research"
#> [6] "Zajac M, Skocek J, Adu-Amankwah S, Black L, Ben Haha M ; Impact of microstructure on the performance of composite cements: Why higher total porosity can result in higher strength ; Cement and Concrete Composites"
#> [7] "Adu-Amankwah S, Black L, Skocek J, Ben Haha M, Zajac M ; Effect of sulfate additions on hydration and performance of ternary slag-limestone composite cements ; Construction and Building Materials"
#> [8] "Adu-Amankwah S, Zajac M, Stabler C, Lothenbach B, Black L ; Influence of limestone on the hydration of ternary slag cement ; Cement and Concrete Research"
#> [9] "Adu-Amankwah S, Khatib JM, Searle DE, Black L ; Effect of synthesis parameters on the performance of alkali-activated non-conformant EN 450 pulverised fuel ash ; Construction and Building Materials"
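For the "all staff members" part, one possible way to run this over many homepages is sketched below. `scrape_all`, `fake_scraper` and the `example.org` URL are hypothetical names introduced here for illustration. The one-second default delay and the choice to skip failing pages are assumptions about what is sensible, not anything the site requires.

```r
# Apply `scraper` to each homepage URL, pausing between requests and
# returning character(0) for any page that fails instead of aborting the run
scrape_all <- function(urls, scraper, delay = 1)
{
  results <- lapply(urls, function(u) {
    Sys.sleep(delay)
    tryCatch(scraper(u), error = function(e) {
      message("Skipping ", u, ": ", conditionMessage(e))
      character(0)
    })
  })
  names(results) <- urls
  results
}

# Offline stand-in for publications_from_homepage(), for illustration only
fake_scraper <- function(u)
{
  if (grepl("missing", u)) stop("no publication data found")
  paste("publication from", basename(u))
}

demo_urls <- c(
  "https://eps.leeds.ac.uk/civil-engineering/staff/581/samuel-adu-amankwah",
  "https://example.org/staff/missing"  # hypothetical URL that fails
)
demo <- scrape_all(demo_urls, fake_scraper, delay = 0)
str(demo)
```

In real use you would pass `publications_from_homepage` as the `scraper` argument and keep a non-zero `delay` to avoid hammering the server.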