I am trying to collect a number of links from a website.
For example I have the following and my idea was to collect the link where it says leer más
which is where I get the xpath
from.
url = "https://www.fotocasa.es/es/alquiler/viviendas/madrid-capital/todas-las-zonas/l/181"
x <- GET(url, add_headers('user-agent' = desktop_agents[sample(1:10, 1)]))
x %>%
read_html() %>%
html_nodes(xpath = '//*[@id="App"]/div[2]/div[1]/main/div/div[3]/section/article[1]/div/a/p/span[2]')
This gives me the following but not the link:
{xml_nodeset (1)}
[1] <span >Leer más</span>
Additionally, I thought about collecting all links:
x %>%
read_html() %>%
html_nodes("a") %>%
html_attr("href")
This gives me a lot of links but not the links to the individual webpages I want.
I would like to have a list of links such as:
https://www.fotocasa.es/es/alquiler/vivienda/madrid-capital/aire-acondicionado-calefaccion-terraza-trastero-ascensor-amueblado-internet/162262978/d
https://www.fotocasa.es/es/alquiler/vivienda/madrid-capital/aire-acondicionado-calefaccion-trastero-ascensor-amueblado/159750574/d
https://www.fotocasa.es/es/alquiler/vivienda/madrid-capital/aire-acondicionado-calefaccion-jardin-zona-comunitaria-ascensor-patio-amueblado-parking-television-internet-piscina/162259162/d
CodePudding user response:
Those links are stored inside a JavaScript object within a script
tag. You can regex out the string defining that object, do some unescapes to enable jsonlite to parse, then apply a custom function to extract just the urls of interest to the json object
library(rvest)
library(jsonlite)
library(magrittr)
library(stringr)
library(purrr)
link <- 'https://www.fotocasa.es/es/alquiler/viviendas/madrid-capital/todas-las-zonas/l/181'
p <- read_html(url) %>% html_text()
s <- str_match(p, 'window\\.__INITIAL_PROPS__ = JSON\\.parse\\("(.*)".*?;')[,2]
data <- jsonlite::parse_json(gsub('\\\\\\"', '\\\"', gsub('\\\\"', '"', s)))
links <- purrr::map(data$initialSearch$result$realEstates, ~ .x$detail$`es-ES` %>% url_absolute(link))