Web scraping SciELO for the references of an article with rvest

Time:10-22

I want to extract the references from an article on this page:

https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es

I have tried this:

library(rvest)
library(dplyr)
product_names = simple %>% 
  html_nodes(xpath= '//*[contains(concat( " ", @class, " " ), concat( " ", "references", " " ))]') %>%
  html_text()

but it did not work.

How can I extract the references?

Answer:

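Here is one way.
The main complication is the presence of multi-byte characters at the end of each string, which trimws() does not strip by default. The code therefore transliterates the text to ASCII with iconv() before trimming.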
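
To illustrate the complication in isolation, here is a minimal sketch in base R. The string is made up, and the trailing character is assumed to be a non-breaking space (U+00A0); the characters on the real page may differ. It shows that trimws() with its default settings leaves such a character in place, and two ways to remove it without transliterating the rest of the text.

```r
# Made-up reference string; the trailing U+00A0 (non-breaking space) is an assumption
x <- "Alaie, S. A. (2020). Knowledge and learning.\u00a0"

# Default trimws() only strips spaces, tabs, carriage returns and newlines,
# so the string keeps its length
nchar(trimws(x)) == nchar(x)

# Option 1: widen the whitespace class to all horizontal/vertical white space
clean1 <- trimws(x, whitespace = "[\\h\\v]")

# Option 2: replace the character explicitly, then trim as usual
clean2 <- trimws(gsub("\u00a0", " ", x, fixed = TRUE))

clean1 == clean2
```

Both options leave accented letters untouched, whereas transliteration with iconv() replaces them and its results can vary by platform. The full pipeline for the SciELO page, which takes the iconv() route, follows.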

suppressPackageStartupMessages({
  library(rvest)
  library(dplyr)
})

link <- "https://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S2448-76782022000100004&lang=es"
page <- read_html(link)

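refs <- page %>%
  html_elements(xpath = '//*[@id="article-back"]') %>%  # element with id "article-back"
  html_elements("p") %>%                                 # paragraphs inside it
  html_text() %>%
  gsub("[\n\t]", "", .) %>%                              # drop newlines and tabs
  gsub("\\[|\\]", "", .) %>%                             # drop square brackets
  gsub("Links", "", .) %>%                               # drop the "Links" label
  iconv(from = "UTF-8", to = "ASCII//TRANSLIT") %>%      # transliterate non-ASCII characters
  trimws()

# Keep only the reference entries; these positions are specific to this page
refs <- refs[3:70]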

head(refs)
#> [1] "Alaie, S. A. (2020). Knowledge and learning in the horticultural innovation system: A case of Kashmir valley of India. International Journal of Innovation Studies, 4(1), 116-133. https://doi.org/10.1016/j.ijis.2020.06.002."                                                                  
#> [2] "Andersson, U., Dasi, A., Mudambi, R., & Pedersen, T. (2016)Technology, innovation and knowledge: The importance of ideas and internationalconnectivity. Journal of World Business,51(1), 153-162.https://doi.org/10.1016/j.jwb.2015.08.017."                                                     
#> [3] "Arroyo, F. J., Sanchez, J., & Sole, M. L. (2017). La calidad e innovacion como factores de diferenciacion para el comercio electronico de ropa interior de una marca latinoamericana en Espana. Contabilidad y Negocios, 12(23), 52-61. h ttps://doi.org/10.18800/contabilidad.201701.004."      
#> [4] "Bach, H., Makitie, T., Hansen, T., & Steen, M. (2021). Blending new and old in sustainability transitions: Technological alignment between fossil fuels and biofuels in Norwegian coastal shipping. Energy Research & Social Science, 74(1), 101957. https://doi.org/10.1016/j.erss.2021.101957."
#> [5] "Bodas, I. M., Marques, R. A.., & Silva, E. M. (2013). University-industry collaboration and innovation in emergent and mature industries in new industrialized countries. Research Policy, 42(2), 443-453. https://doi.org/10.1016/j.respol.2012.06.006."                                        
#> [6] "Bourke, J., & Roper, S. (2017). Innovation, quality management and learning: Short-term and longer-term e?ects. Research Policy, 46(1), 1505-1518. https://doi.org/10.1016/j.respol.2017.07.005."

Created on 2022-10-21 with reprex v2.0.2
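
The hard-coded positions 3:70 only fit this one article. As a hedged alternative, the sketch below filters the paragraphs by pattern instead of by position. It runs on a small made-up HTML fragment built with minimal_html(), so the markup, the sample entries and the "year in parentheses" heuristic are all assumptions that would need checking against the real page. It also uses html_text2(), which by default converts non-breaking spaces to ordinary spaces, so no transliteration step is needed here.

```r
library(rvest)

# Made-up stand-in for the article page; the real markup may differ
page <- minimal_html('
  <div id="article-back">
    <p>Referencias</p>
    <p>Alaie, S. A. (2020). Knowledge and learning. [ <a href="#">Links</a> ]</p>
    <p>Bourke, J., &amp; Roper, S. (2017). Innovation and quality. [ <a href="#">Links</a> ]</p>
    <p>Notas</p>
  </div>')

txt <- page %>%
  html_elements("#article-back p") %>%  # CSS equivalent of the XPath id lookup
  html_text2()                          # collapses whitespace much like a browser

# Remove the bracketed "Links" label and any leftover edge whitespace
txt <- trimws(gsub("\\[[[:space:]]*Links[[:space:]]*\\]", "", txt))

# Heuristic (an assumption): a reference contains a year in parentheses, e.g. "(2020)"
refs <- txt[grepl("\\([0-9]{4}\\)", txt)]
refs
```

The filter drops headings such as "Referencias" without relying on their position, but it would also drop any reference that lacks a parenthesised year, so the result is worth comparing against the page.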
