Extracting web link titles in order to name lists

I have the following code:

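library(RCurl)
library(XML)

url <- getURL("https://www.fomento.gob.es/be2/?nivel=2&orden=34000000")

parsed <- htmlParse(url)

# Anchors that carry an href, and their href values (same XPath, same order)
hrefLinks <- xpathSApply(parsed, path = "//a[@href]")
links <- xpathSApply(parsed, path = "//a[@href]", xmlGetAttr, "href")

# Keep anchors whose href ends in .xls, ignoring case (note the escaped dot)
xlsLinks <- hrefLinks[grepl("\\.xls$", links, ignore.case = TRUE)]

xlsLinks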

Which gives me something like:

[[31]]
<a target="_blank" href="sedal/34020150.XLS" title="Enlace a un archivo con extensión .xls. Este enlace abre una ventana nueva">3.3   Valor medio de las transacciones inmobiliarias de vivienda libre</a> 

[[32]]
<a target="_blank" href="sedal/34020160.XLS" title="Enlace a un archivo con extensión .xls. Este enlace abre una ventana nueva">3.3.1  Valor medio de las transacciones inmobiliarias de vivienda libre nueva</a> 

[[33]]
<a target="_blank" href="sedal/34020170.XLS" title="Enlace a un archivo con extensión .xls. Este enlace abre una ventana nueva">3.3.2  Valor medio de las transacciones inmobiliarias de vivienda libre de segunda mano</a> 

[[34]]
<a target="_blank" href="sedal/34020180.XLS" title="Enlace a un archivo con extensión .xls. Este enlace abre una ventana nueva">3.4   Valor medio de las transacciones inmobiliarias de vivienda libre de extranjeros residentes en España</a>

I want to extract the "title" (the visible link text) of each link in the list. For the four examples above, I would like to extract:

3.3 Valor medio de las transacciones inmobiliarias de vivienda libre
3.3.1 Valor medio de las transacciones inmobiliarias de vivienda libre nueva
3.3.2 Valor medio de las transacciones inmobiliarias de vivienda libre de segunda mano
3.4 Valor medio de las transacciones inmobiliarias de vivienda libre de extranjeros residentes en España

The full list contains 34 such titles that I would like to extract. They all correspond to the URLs with the .xls extension. The idea is to use these values to name the lists of URLs.

CodePudding user response：

Treat each link as a string, split it on the < and > characters, and take the 3rd item, which is the link text:

x <- list(
  '<a target="_blank" href="sedal/34020150.XLS" title="Enlace a un archivo con extensión .xls. Este enlace abre una ventana nueva">3.3   Valor medio de las transacciones inmobiliarias de vivienda libre</a>',
  '<a target="_blank" href="sedal/34020160.XLS" title="Enlace a un archivo con extensión .xls. Este enlace abre una ventana nueva">3.3.1  Valor medio de las transacciones inmobiliarias de vivienda libre nueva</a>',
  '<a target="_blank" href="sedal/34020170.XLS" title="Enlace a un archivo con extensión .xls. Este enlace abre una ventana nueva">3.3.2  Valor medio de las transacciones inmobiliarias de vivienda libre de segunda mano</a>',
  '<a target="_blank" href="sedal/34020180.XLS" title="Enlace a un archivo con extensión .xls. Este enlace abre una ventana nueva">3.4   Valor medio de las transacciones inmobiliarias de vivienda libre de extranjeros residentes en España</a>')

sapply(x, function(i) strsplit(i, "[<>]")[[ 1 ]][ 3 ])
# [1] "3.3   Valor medio de las transacciones inmobiliarias de vivienda libre"                                    
# [2] "3.3.1  Valor medio de las transacciones inmobiliarias de vivienda libre nueva"                             
# [3] "3.3.2  Valor medio de las transacciones inmobiliarias de vivienda libre de segunda mano"                   
# [4] "3.4   Valor medio de las transacciones inmobiliarias de vivienda libre de extranjeros residentes en España"
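The split above works on character strings, while xlsLinks in the question holds parsed nodes. As a possible alternative, the link text can be read straight from the nodes with xmlValue() from the XML package. This is only a minimal sketch: the inline HTML is a hypothetical stand-in for the real page (two links copied from the question plus one unrelated link), and the names html, hrefs, titles, is_xls and xls_titles are illustrative.

```r
library(XML)

# Hypothetical stand-in for the downloaded page
html <- '<html><body>
<a target="_blank" href="sedal/34020150.XLS">3.3   Valor medio de las transacciones inmobiliarias de vivienda libre</a>
<a target="_blank" href="sedal/34020160.XLS">3.3.1  Valor medio de las transacciones inmobiliarias de vivienda libre nueva</a>
<a href="other.html">Not a spreadsheet</a>
</body></html>'

parsed <- htmlParse(html, asText = TRUE)

# Same XPath for both calls, so hrefs and titles line up element by element
hrefs  <- xpathSApply(parsed, "//a[@href]", xmlGetAttr, "href")
titles <- xpathSApply(parsed, "//a[@href]", xmlValue)

# Keep only the links whose href ends in .xls, ignoring case
is_xls <- grepl("\\.xls$", hrefs, ignore.case = TRUE)
xls_titles <- titles[is_xls]
xls_titles
```

Reading the text from the node avoids depending on the exact position of < and > in the serialized tag.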

CodePudding user response：

You can use rvest with an attribute = value CSS selector and the $= (ends with) operator to target the desired elements, adding the id of a parent element (#app_camaleon) as an anchor for safety. Because the i case-insensitivity flag isn't supported in the rvest implementation, you need to list both variants, separated by a comma (OR), to cover upper- and lowercase extensions in the href attribute. You can then call html_text() on the returned nodes to get the titles.

library(magrittr)
library(rvest)
library(purrr)

url <- "https://www.fomento.gob.es/be2/?nivel=2&orden=34000000"
df <-
  map_dfr(
    read_html(url) %>%
      html_elements('#app_camaleon [href$=".xls"], #app_camaleon [href$=".XLS"]'),
    ~ data.frame(
      title = .x %>% html_text(),
      link = .x %>% html_attr("href") %>% url_absolute(url)
    )
  )
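
Since the goal in the question is to name the list of URLs by title, a data frame with title and link columns can be turned into a named list in base R. The following is a minimal sketch, not part of the original answer: the df below is a hypothetical two-row stand-in using values copied from the question (with relative hrefs), and clean_titles and xls_by_title are illustrative names. It also collapses the runs of spaces after the section numbers so the names match the desired output.

```r
# Hypothetical stand-in for a data frame of titles and links
df <- data.frame(
  title = c(
    "3.3   Valor medio de las transacciones inmobiliarias de vivienda libre",
    "3.3.1  Valor medio de las transacciones inmobiliarias de vivienda libre nueva"
  ),
  link = c("sedal/34020150.XLS", "sedal/34020160.XLS"),
  stringsAsFactors = FALSE
)

# Collapse runs of whitespace to a single space and trim both ends
clean_titles <- trimws(gsub("\\s+", " ", df$title))

# Named list: one link per title
xls_by_title <- setNames(as.list(df$link), clean_titles)
names(xls_by_title)
```

If the real page uses non-breaking spaces rather than ordinary ones, the whitespace pattern may need adjusting.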