I want to capture the links to references from an article on this page: https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es
I have tried this:
library(rvest)
library(dplyr)
link <- "https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es"
page <- read_html(link)
links <- page %>%
  html_nodes("a") %>%
  html_text()
But these are not the links I want. There are 68 references, so I want the 68 links attached to those references.
CodePudding user response:
I have been looking at the site and found that the [ links ] labels run some JavaScript on their onclick event that sends you through an intermediate page, so it is not easy to scrape from them directly. I found this solution, which matches 65 of the 68 links written as plain text in the "#article-back" section. It seems three links are not well formatted and are therefore not matched (e.g. "h ttp://"). I hope this is helpful.
Edit: Regexp taken from this answer
library(rvest)
library(dplyr)
link <- "https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es"
page <- read_html(link)
text <- page %>%
  html_node("#article-back") %>%
  html_text()

# Match URLs written as plain text in the references section
matches <- gregexpr(
  "\\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]",
  text)
links <- regmatches(text, matches)
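As a quick self-contained check of the URL regex (the sample text below is made up for illustration, standing in for the "#article-back" content):

```r
sample <- "See https://doi.org/10.1000/182 and ftp://ftp.example.org/file.txt for details."
pattern <- "\\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]"
unlist(regmatches(sample, gregexpr(pattern, sample)))
# -> "https://doi.org/10.1000/182"    "ftp://ftp.example.org/file.txt"
```

Note the final character class deliberately excludes trailing punctuation such as "." and ",", so URLs at the end of a sentence are captured cleanly.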
Edit 2: To scrape the links from the JavaScript in the onclick attributes:
library(rvest)
library(dplyr)
link <- "https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es"
page <- read_html(link)
text <- page %>%
  html_node("#article-back") %>%
  html_nodes("a") %>%
  html_attr("onclick")

# Pull the path out of the onclick JavaScript and prepend the site root
links <- gsub(".*(/[^']+).*", "https://www.scielo.org.mx\\1", text[!is.na(text)])

# Extract the pid query parameter from each link
links_pid <- gsub(".*pid=([^&]+)&.*", "\\1", links)
links_pid
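If you want to get back to full article pages from those pids, a small helper can rebuild the URLs. This is a sketch that assumes the `scielo.php?script=sci_arttext&pid=` query format seen in the question's own URL; `pid_to_url` is a hypothetical name:

```r
# Hypothetical helper: rebuild a SciELO article URL from a pid.
# The query format is assumed from the URL in the question.
pid_to_url <- function(pid) {
  paste0("https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=", pid)
}

pid_to_url("S2448-76782022000100004")
# -> "https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004"
```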