webscraping: capture links of references with R


I want to capture the links to references from an article on this page: https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es

I have tried this:

library(rvest)
library(dplyr)

link <- "https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es"
page <- read_html(link)

links <- page %>%
  html_nodes("a") %>%
  html_text()

But these are not the links that I want.

There are 68 references, so I want the 68 links attached to those references.

CodePudding user response:

I have been looking at the site and found that the [ links ] labels run some JavaScript on the onclick event that sends you to an intermediate page, so it is not easy to scrape them directly. I found this solution that matches 65 of the 68 links written as text in the "#article-back" section. It seems three links are not well formatted (i.e. "h ttp://") and thus are not matched. I hope this is helpful.

Edit: Regexp taken from this answer

library(rvest)
library(dplyr)
 
link <- "https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es"
page <- read_html(link)

text <- page %>% html_node("#article-back") %>% 
    html_text()

matches <- gregexpr(
  "\\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]",
  text)

links <- regmatches(text, matches)
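To sanity-check the pattern without hitting the site, you can run it on a made-up reference string (the text below is an assumption for illustration, not taken from the article):

```r
# Made-up reference text (assumption); the real input is the "#article-back" text
sample_text <- "Doe, J. (2020). Some title. Recuperado de https://example.com/doc.pdf, 10 de mayo."

matches <- gregexpr(
  "\\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]",
  sample_text)

# regmatches() returns a list with one character vector per input string
regmatches(sample_text, matches)[[1]]
# "https://example.com/doc.pdf"
```

Note that the trailing comma is dropped because the pattern's final character class excludes punctuation like `,` and `;`.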

Edit 2: To scrape the links from the JavaScript in the onclick attribute:

library(rvest)
library(dplyr)
 
link <- "https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es"
page <- read_html(link)

text <- page %>%
  html_node("#article-back") %>%
  html_nodes("a") %>%
  html_attr("onclick")

links <- gsub(".*(/[^']+).*", "https://www.scielo.org.mx\\1", text[!is.na(text)])

links_pid <- gsub(".*pid=([^&]+)&.*", "\\1", links)
links_pid
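As a self-contained check, here is how those two gsub() steps behave on a hypothetical onclick value (an assumption for illustration; the real strings come from html_attr("onclick") and may be shaped differently):

```r
# Hypothetical onclick value (assumption), shaped like a JS redirect to a site-relative path
onclick <- "window.open('/scielo.php?script=sci_arttext&pid=S0001-00002020000100001&lng=es', '_blank')"

# Prefix the site root onto the path captured between "/" and the closing quote
link <- gsub(".*(/[^']+).*", "https://www.scielo.org.mx\\1", onclick)

# Pull out just the pid parameter
pid <- gsub(".*pid=([^&]+)&.*", "\\1", link)
pid
# "S0001-00002020000100001"
```

The pid values can then be fed back into the scielo.php URL pattern to build direct article links.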
